
Case study

Unifying 400,000 records across sources with no shared key

Combo — French HR SaaS for restaurants and hospitality (~130 employees, 10,000+ customers)

The problem

No shared identifier. Restaurant names almost never match legal entity names.

Combo had 124,000 restaurant listings from Google Maps and 277,000 records from French business registries, with no shared identifier between them. A Google listing for "Chez Marcel" might match "SAS Groupe Restauration Marcel et Fils" in the registry, or it might not. Until the two were linked, neither dataset was useful for prospecting.

  • Trade names and legal names diverge. The name on the storefront is rarely the name in the registry, and there is no common ID to fall back on.
  • Ownership structures hide behind holding companies and brands. One owner, five legal entities, three trade names.
  • Naive matching means comparing every listing with every registry record: 124,000 × 277,000 ≈ 34 billion pairs. Not happening.

The solution

A focused implementation with clear guardrails

Blocking by postal code plus a 500 m radius cut the 34 billion candidate pairs down to 410,000, a 99.999% reduction.
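The blocking step can be sketched in a few lines of Python. This is an illustration, not Combo's pipeline (which runs on BigQuery): the record fields (id, postal_code, lat, lon) and the function names are hypothetical, and it assumes a pair must share a postal code and also fall within the radius. The pair counts at the end are the ones quoted above.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def candidate_pairs(listings, entities, radius_m=500.0):
    """Yield (listing_id, entity_id, distance_m) for records that share a
    postal code and lie within radius_m of each other (assumed rule)."""
    by_postal = defaultdict(list)
    for entity in entities:
        by_postal[entity["postal_code"]].append(entity)
    for listing in listings:
        # Only entities in the same postal-code block are ever compared.
        for entity in by_postal.get(listing["postal_code"], []):
            d = haversine_m(listing["lat"], listing["lon"], entity["lat"], entity["lon"])
            if d <= radius_m:
                yield listing["id"], entity["id"], d


# Pair counts quoted in the case study.
naive = 124_000 * 277_000        # 34,348,000,000 pairs without blocking
reduction = 1 - 410_000 / naive  # fraction of comparisons avoided
```

Grouping registry records by postal code first means the distance function only runs inside each block, which is where most of the reduction comes from.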

Scoring: 70% geographic distance, 20% address similarity, 10% name similarity (trigram Jaccard). Candidates are ranked in both directions, listing to registry and registry to listing, to avoid one-to-many collisions.
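A minimal sketch of that scoring, under stated assumptions. The 70/20/10 weights and the trigram Jaccard name similarity come from the text. The linear distance decay over the 500 m radius, the use of trigram Jaccard for addresses, the reading of "bidirectional best match" as a mutual-best-match filter, and all function and field names are assumptions made for illustration.

```python
def trigrams(text):
    """Character trigrams of a lowercased, space-padded string."""
    cleaned = text.lower().strip()
    if not cleaned:
        return set()
    padded = f"  {cleaned} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Trigram Jaccard similarity in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def score(distance_m, listing, entity, radius_m=500.0):
    """Weighted score: 70% geography, 20% address, 10% name."""
    geo = max(0.0, 1.0 - distance_m / radius_m)  # assumed linear decay
    addr = jaccard(listing["address"], entity["address"])  # assumed metric
    name = jaccard(listing["name"], entity["name"])
    return 0.7 * geo + 0.2 * addr + 0.1 * name


def mutual_best_matches(scored_pairs):
    """Keep only pairs where each side is the other's top-scoring candidate.

    scored_pairs: iterable of (listing_id, entity_id, score).
    Returns {listing_id: (entity_id, score)}.
    """
    best_for_listing, best_for_entity = {}, {}
    for lid, eid, s in scored_pairs:
        if lid not in best_for_listing or s > best_for_listing[lid][1]:
            best_for_listing[lid] = (eid, s)
        if eid not in best_for_entity or s > best_for_entity[eid][1]:
            best_for_entity[eid] = (lid, s)
    return {
        lid: (eid, s)
        for lid, (eid, s) in best_for_listing.items()
        if best_for_entity[eid][0] == lid
    }
```

In this single-pass version, a listing that loses its top candidate to a better-scoring rival is left unmatched and is not reassigned to its second choice. That keeps the output strictly one-to-one.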

Vector search (text-embedding-005, BigQuery IVF index) for semantic queries across the unified data. Daily refresh with hash-based change detection.
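Hash-based change detection can be sketched as follows. The choice of SHA-256, the JSON serialization, and the field and function names are assumptions; the case study only says that hashes are used to detect changes between daily runs.

```python
import hashlib
import json


def record_hash(record, fields):
    """Stable SHA-256 digest over the fields that matter for matching."""
    payload = json.dumps(
        {f: record.get(f) for f in fields},
        sort_keys=True,       # key order must not affect the hash
        ensure_ascii=False,   # keep accented French characters as-is
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_ids(records, previous_hashes, fields):
    """Return (ids that are new or modified, hash map to store for the next run)."""
    current = {r["id"]: record_hash(r, fields) for r in records}
    changed = [rid for rid, h in current.items() if previous_hashes.get(rid) != h]
    return changed, current
```

On each daily run, only the returned ids would need to be re-blocked, re-scored, and re-embedded; everything else keeps yesterday's result.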

Why it worked

  • Geography is the strongest signal when names diverge. 70% weight on distance.
  • Conservative thresholds: average accepted score 0.83, average match distance 50 m. Ambiguous matches flagged, not forced.
  • Registry provider choice mattered. One had 99% coordinate coverage, another 40%. Picking the high-coverage provider is what made geographic matching viable.
  • Incremental updates with change detection. No full rebuilds, daily refresh.
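The "flagged, not forced" rule could look something like the sketch below. The cut-off values (accept_at, min_margin) and the three labels are hypothetical; the case study reports only the averages of accepted matches, not the thresholds themselves.

```python
def triage(candidate_scores, accept_at=0.75, min_margin=0.05):
    """Classify one listing's candidate scores as 'accept', 'review', or 'no_match'.

    accept_at and min_margin are illustrative values, not Combo's settings.
    """
    ranked = sorted(candidate_scores, reverse=True)
    if not ranked or ranked[0] < accept_at:
        return "no_match"
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    if ranked[0] - runner_up < min_margin:
        # Two candidates score almost the same: flag for review instead of guessing.
        return "review"
    return "accept"
```

A clear winner is accepted, a near tie goes to review, and a weak best score is treated as no match at all.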

The results

Measurable outcomes (without hype)

  • Two datasets, one queryable source. Commercial presence + legal entity structure in one place.
  • 99.999% reduction in match computation. Runs on standard BigQuery.
  • ~85% precision on semantic queries. Powers the sales agent from the other case study.
  • Daily incremental updates. Only changed records get processed.