Overview
A production data linkage tool for Big Data that performs record similarity matching across datasets. It is not hardcoded for a specific schema or domain; it is fully configuration-driven. Built at Infosys and hosted as a job on GCP, the tool increased the productivity of business claims processing 12x.
How It Works
The architecture operates in two stages:
Stage 1 — Ranking: Candidate pairs are surfaced using ANN (Approximate Nearest Neighbor) vector search combined with TF-IDF scoring. This narrows the search space efficiently without requiring brute-force comparison of every record pair.
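Stage 1 can be sketched in pure Python. This is a toy illustration, not the tool's actual implementation: the function names are hypothetical, and a brute-force cosine search stands in for the ANN index (in production an ANN library would replace the exhaustive loop).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF over whitespace tokens, shared vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
                     for t, c in tf.items()})
    return vecs

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(left, right):
    """For each left record, return (left_idx, best_right_idx, score)."""
    vecs = tfidf_vectors(left + right)  # fit IDF across both sides
    lvecs, rvecs = vecs[:len(left)], vecs[len(left):]
    pairs = []
    for i, lv in enumerate(lvecs):
        best = max(range(len(rvecs)), key=lambda j: cosine(lv, rvecs[j]))
        pairs.append((i, best, cosine(lv, rvecs[best])))
    return pairs
```

For example, `rank_candidates(["acme corp 123 main st"], ["acme corporation 123 main street", "initech llc 42 oak rd"])` pairs the left record with the first right record, since they share the tokens "acme", "123", and "main". Only the surviving pairs move on to the more expensive Stage 2 scoring.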
Stage 2 — Evaluation: Shortlisted pairs are scored using Jaro-Winkler distance, fuzzy matching algorithms, and weighted ratio metrics to determine match quality.
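A minimal sketch of the Stage 2 scoring, assuming a hand-rolled Jaro-Winkler and stdlib `difflib` as a stand-in for the fuzzy/weighted-ratio library the tool actually uses; the weights here are illustrative, not the production values.

```python
import difflib

def jaro(s1, s2):
    """Jaro similarity: matching characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score for strings sharing a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def weighted_score(a, b, w_jw=0.6, w_fuzzy=0.4):
    """Blend Jaro-Winkler with a fuzzy ratio (difflib standing in here)."""
    fuzzy = difflib.SequenceMatcher(None, a, b).ratio()
    return w_jw * jaro_winkler(a, b) + w_fuzzy * fuzzy
```

The classic example `jaro_winkler("MARTHA", "MARHTA")` evaluates to roughly 0.961: the single T/H transposition barely dents the score, and the shared "MAR" prefix boosts it further, which is why Jaro-Winkler suits short name fields.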
What Makes It Distinctive
Schema details, matching column pairs, and rule conditions are all defined externally through configuration — not embedded in code. This means the same tool can be pointed at entirely different datasets, with different structures and matching logic, without code changes.
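To make the idea concrete, here is a hypothetical configuration shape expressed as a Python dict; the real key names, file paths, and rule vocabulary the tool uses are not documented here. The point is that everything the matcher needs lives in data, not code.

```python
# Hypothetical config: schemas, column pairs, and rules defined externally.
config = {
    "left_source":  {"path": "claims_a.parquet", "id_column": "claim_id"},
    "right_source": {"path": "claims_b.parquet", "id_column": "ref_id"},
    "match_columns": [
        {"left": "claimant_name", "right": "name",      "weight": 0.6},
        {"left": "address",       "right": "addr_line", "weight": 0.4},
    ],
    "rules": [
        {"type": "min_score", "threshold": 0.85},
    ],
}

def validate(cfg):
    """Minimal sanity checks a config-driven tool would run before a job."""
    weights = [c["weight"] for c in cfg["match_columns"]]
    assert abs(sum(weights) - 1.0) < 1e-9, "column weights must sum to 1"
    assert all({"left", "right"} <= set(c) for c in cfg["match_columns"])
    return True
```

Pointing the tool at a new dataset then means swapping this file, not editing code.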
It's a genuinely reusable piece of infrastructure. The same matching logic was later repurposed as the core differentiator in my DRISHTI defence challenge proposal, where it powers design change impact analysis in shipbuilding.
Technical Details
- ANN vector search for scalable candidate generation
- TF-IDF scoring for relevance ranking
- Jaro-Winkler distance for string similarity evaluation
- Fuzzy matching algorithms for flexible record comparison
- Polars and Pandas for high-performance data processing
- External configuration for schema, column pairs, and rule conditions
- Hosted as a GCP job for production workloads
- 12x productivity increase in business claims processing
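The Polars/Pandas bullet above covers the data-preparation side of the pipeline. As one small pandas sketch (column names hypothetical), records are typically normalized before any similarity scoring so that scores reflect content rather than formatting:

```python
import pandas as pd

# Hypothetical pre-matching cleanup: lowercase, strip punctuation and
# surrounding whitespace before the string-similarity stages run.
df = pd.DataFrame({"name": ["  ACME Corp. ", "Globex, Ltd"]})
df["name_norm"] = (
    df["name"]
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)
    .str.strip()
)
```

After this step, `"  ACME Corp. "` and `"acme corp"` compare as equal strings, which keeps the Jaro-Winkler and fuzzy scores from being dragged down by case and punctuation noise.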