Overview
A production data linkage tool for Big Data that performs record similarity matching across datasets. It is not hardcoded for a specific schema or domain; it is fully configuration-driven. Built at Infosys and hosted as a job on GCP, the tool increased the productivity of business claims processing 12x.
How It Works
The architecture operates in two stages:
Stage 1 — Ranking: Candidate pairs are surfaced using ANN (Approximate Nearest Neighbor) vector search combined with TF-IDF scoring. This narrows the search space efficiently without requiring brute-force comparison of every record pair.
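Stage 1 can be sketched in pure Python. This is a toy illustration, not the tool's actual implementation: the function names are hypothetical, and a brute-force cosine search stands in for the ANN index (in production an ANN library would replace the exhaustive loop).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF over whitespace tokens, shared vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
                     for t, c in tf.items()})
    return vecs

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(left, right):
    """For each left record, return (left_idx, best_right_idx, score)."""
    vecs = tfidf_vectors(left + right)  # fit IDF across both sides
    lvecs, rvecs = vecs[:len(left)], vecs[len(left):]
    pairs = []
    for i, lv in enumerate(lvecs):
        best = max(range(len(rvecs)), key=lambda j: cosine(lv, rvecs[j]))
        pairs.append((i, best, cosine(lv, rvecs[best])))
    return pairs
```

For example, `rank_candidates(["acme corp 123 main st"], ["acme corporation 123 main street", "initech llc 42 oak rd"])` pairs the left record with the first right record, since they share the tokens "acme", "123", and "main". Only the surviving pairs move on to the more expensive Stage 2 scoring.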
Stage 2 — Evaluation: Shortlisted pairs are scored using Jaro-Winkler distance, fuzzy matching algorithms, and weighted ratio metrics to determine match quality.
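A minimal sketch of the Stage 2 scoring, assuming a hand-rolled Jaro-Winkler and stdlib `difflib` as a stand-in for the fuzzy/weighted-ratio library the tool actually uses; the weights here are illustrative, not the production values.

```python
import difflib

def jaro(s1, s2):
    """Jaro similarity: matching characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score for strings sharing a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def weighted_score(a, b, w_jw=0.6, w_fuzzy=0.4):
    """Blend Jaro-Winkler with a fuzzy ratio (difflib standing in here)."""
    fuzzy = difflib.SequenceMatcher(None, a, b).ratio()
    return w_jw * jaro_winkler(a, b) + w_fuzzy * fuzzy
```

The classic example `jaro_winkler("MARTHA", "MARHTA")` evaluates to roughly 0.961: the single T/H transposition barely dents the score, and the shared "MAR" prefix boosts it further, which is why Jaro-Winkler suits short name fields.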
What Makes It Distinctive
Schema details, matching column pairs, and rule conditions are all defined externally through configuration — not embedded in code. This means the same tool can be pointed at entirely different datasets, with different structures and matching logic, without code changes.
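To make the idea concrete, here is a hypothetical configuration shape expressed as a Python dict; the real key names, file paths, and rule vocabulary the tool uses are not documented here. The point is that everything the matcher needs lives in data, not code.

```python
# Hypothetical config: schemas, column pairs, and rules defined externally.
config = {
    "left_source":  {"path": "claims_a.parquet", "id_column": "claim_id"},
    "right_source": {"path": "claims_b.parquet", "id_column": "ref_id"},
    "match_columns": [
        {"left": "claimant_name", "right": "name",      "weight": 0.6},
        {"left": "address",       "right": "addr_line", "weight": 0.4},
    ],
    "rules": [
        {"type": "min_score", "threshold": 0.85},
    ],
}

def validate(cfg):
    """Minimal sanity checks a config-driven tool would run before a job."""
    weights = [c["weight"] for c in cfg["match_columns"]]
    assert abs(sum(weights) - 1.0) < 1e-9, "column weights must sum to 1"
    assert all({"left", "right"} <= set(c) for c in cfg["match_columns"])
    return True
```

Pointing the tool at a new dataset then means swapping this file, not editing code.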
It's a genuinely reusable piece of infrastructure. The same matching logic was later repurposed as the core differentiator in my DRISHTI defence challenge proposal, where it powers design change impact analysis in shipbuilding.
Technical Details
- ANN vector search for scalable candidate generation
- TF-IDF scoring for relevance ranking
- Jaro-Winkler distance for string similarity evaluation
- Fuzzy matching algorithms for flexible record comparison
- Polars and Pandas for high-performance data processing
- External configuration for schema, column pairs, and rule conditions
- Hosted as a GCP job for production workloads
- 12x productivity increase in business claims processing
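The Polars/Pandas bullet above covers the data-preparation side of the pipeline. As one small pandas sketch (column names hypothetical), records are typically normalized before any similarity scoring so that scores reflect content rather than formatting:

```python
import pandas as pd

# Hypothetical pre-matching cleanup: lowercase, strip punctuation and
# surrounding whitespace before the string-similarity stages run.
df = pd.DataFrame({"name": ["  ACME Corp. ", "Globex, Ltd"]})
df["name_norm"] = (
    df["name"]
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)
    .str.strip()
)
```

After this step, `"  ACME Corp. "` and `"acme corp"` compare as equal strings, which keeps the Jaro-Winkler and fuzzy scores from being dragged down by case and punctuation noise.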