Created by Robert Berry / @no0p_
Record linking solves one specific data quality problem.
Two or more records,
referencing the same entity,
without a key.
“Record linkage refers to the task of finding records in a data set that refer to the same entity across different data sources.
Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier”
But how can we discern which are matches without a key?
Label the records in table1 as set A, the records in table2 as set B.
Define a set of possible matches as A X B, such that each pair (a∈A,b∈B) is a possible match or non-match.
The set of true pairs is labeled set T where (a=b), and the set of false pairs is labeled set F where (a≠b).
Solve the problem by distinguishing whether a given pair belongs to set T or set F.
T and F are a partition of A X B.
Which is resolved by a function returning the probability that any old tuple from A and any old tuple from B are a true pair.
$P(truepair) = f(a, b)$
$P(truepair) = f(a, b) = f(x_1, x_2, x_3, \ldots x_n)$
Where $x_1, x_2, \ldots x_n$ are numeric metrics of the pairing
$d_j = \frac{1}{3}(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m})$
Where:
$m$ is the number of matching characters.
$t$ is half the number of transpositions.
Levenshtein Distance of Name | $\mathbb{Z}$ |
Trigram Distance of Name | $[0,1]$ |
Geo Spatial Distance of Address | $\mathbb{R}$ |
Name Shares Middle Initial Character | $0/1$ |
What is the optimal threshold?
$P(true pair) = g(t) = \frac{1}{1 + e^{-t}}$
$t = \beta_0 + \beta_1x_1 + \ldots \beta_nx_n$
Barack Obama vs. Barry Obama
$x_1$ = Jaro Winkler Distance = 0.89
$x_2$ = Trigram Distance = 0.2
$P(true pair) = f(x_1, x_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2)}}$
Logistic regression model solves for $\beta$ coefficients to complete equation.
Can derive features from joined tables
Same Company/Organization/Relation | $0/1$ |
Date/Time of Overlapping Activity | $\mathbb{R}$ |
Geo Spatial Distance of Record or Relation | $\mathbb{R}$ |
Same Phone Number | $0/1$ |
pgxnclient install madlib
or from source on github:
Configure; make
madpack -p postgres -c user@0.0.0.0/db install
Create training tables, apply SQL function, use model tables
Manually insert records into table.
Or leverage easily linkable records, e.g. 20% of company_a and company_b records have an IRS TID you can link on, but still have representative typo and alternative spelling examples.
Provides coefficients for $t = \beta_0 + \beta_1x_1 + \ldots \beta_nx_n$
Returns a probability tuple pair is a true pair.
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
$Precision = \frac{TP}{TP + FP}$
$Recall = \frac{TP}{TP + FN}$
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = 0.995$
$Precision = \frac{TP}{TP + FP} = 0.985$
$Recall = \frac{TP}{TP + FN} = 0.992$
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = 0.99$
$Precision = \frac{TP}{TP + FP} = 0.914$
$Recall = \frac{TP}{TP + FN} = 0.662$
Model only had name string as a parameter. Poor recall results.
- Does not require application changes
- Preserves Original Information
- 'Probabilistic Foreign Key' Sounds fun
- Reasonably Painless Access Pattern
- Duplication and Linking Case
- Manual Review
- M*N ~ (N^2) pair comparisons to find best match
- Compare only against best candidates
- Indexable value that can exclude candidates
- trigram distance
- Geographic restriction
- Kmeans cluster id
- LSH bucket
- Probably OK to have fairly loose comparison set
Identify Problem as multiple records for same entity without a key
This can be formulated as a binary classification of paired records as members of a set of true pairs or false pairs in A X B
Solve binary classification in Postgresql with Madlib logistic regression module
String metric features & features from relations
Create probabilistic foreign keys
Manually review close calls
- We should strive for correct data with referential integrity
- Be comfortable in a world of imperfect databases