A similarity map is a list of instructions that define how to compare two data sets. Each instruction defines a pair of associated fields in the two data sets and one or more similarity functions that should be used to compare the fields.
Details
The similarity map is used to create a similarity encoder. Similarity encoders are used (1) to encode pairs of data sets into a single tensor that can be used as input to a neural network, and (2) to retrieve the information needed to construct refutation claims when reasoning with neural-symbolic matching models. The similarity map itself performs no encoding operations. This design allows a similarity map to be defined once and then used to encode multiple pairs of data sets or different subsets of the same data.
Instructions
Instructions are named lists of vectors. Each instruction (list element) defines:
the pair of fields, one from each data set, that should be compared (the element key), and
the collection of similarity functions that should be used to compare the associated fields (the element value).
The instruction keys should follow the association string format.
Association Strings
An association string is a string that contains two field names separated by
a tilde (~) character. The first field name is the name of the field in the
left data set and the second field name is the name of the field in the
right data set. In the general case, where the field names are different
in the two data sets, the association string should be formatted as
left_column~right_column. This instructs the model to associate the
left_column of the Left input data with the right_column of the Right
input data.
If a field has the same column name in both input data sets, the
association string can simply be the common field name. For example,
if both the Left and Right input data sets have a column named title,
then the association string title instructs the model to associate the
title columns of the Left and Right input data.
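The two association string forms described above can be illustrated with a small sketch. It is written in plain Python and does not use the package's own parser; the function name parse_association and its error handling are hypothetical, chosen only to make the format concrete.

```python
def parse_association(association: str) -> tuple[str, str]:
    """Split an association string into (left_field, right_field).

    Hypothetical helper for illustration; not part of the package.
    """
    parts = association.split("~")
    if len(parts) == 1:
        # Shorthand form: the same column name is used in both data sets.
        left = right = parts[0]
    elif len(parts) == 2:
        # General form: left_column~right_column.
        left, right = parts
    else:
        raise ValueError(f"expected at most one '~' in {association!r}")
    if not left or not right:
        raise ValueError(f"empty field name in {association!r}")
    return left, right


print(parse_association("left_column~right_column"))  # ('left_column', 'right_column')
print(parse_association("title"))                     # ('title', 'title')
```

Rejecting strings with more than one tilde is a design choice of this sketch; the source does not state how such input is treated.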
Similarity Functions
The vector of similarity functions defines the operations of the similarity
encoder for each association string. Each association can have multiple
similarity operations, in which case the similarity encoder applies all
similarity operations to the associated columns. The caller defines the
similarity operations by listing one or more of the predefined similarities.
The predefined string similarities and ratios are calculated using the
implementations provided by
RapidFuzz.
A list of available similarities can be obtained by calling
available_similarities.
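To make the relationship between a similarity map and the encoding step concrete, the following is a minimal sketch in plain Python. It assumes a dictionary standing in for the named list, and it compares a single pair of records instead of whole data sets. The names SIMILARITIES, split_association, and encode_pair are hypothetical, and the two similarity functions ("exact" and "ratio", the latter built on the standard library's difflib) are stand-ins, not the package's predefined similarities or RapidFuzz.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in similarities; the package's predefined similarities
# are not reproduced here.
SIMILARITIES = {
    "exact": lambda a, b: 1.0 if a == b else 0.0,
    "ratio": lambda a, b: SequenceMatcher(None, a, b).ratio(),
}


def split_association(association):
    """Return (left_field, right_field) for 'left~right' or a common name."""
    left, _, right = association.partition("~")
    return left, (right or left)


def encode_pair(similarity_map, left_record, right_record):
    """Apply every similarity of every association to one pair of records."""
    features = []
    for association, names in similarity_map.items():
        left_field, right_field = split_association(association)
        a = str(left_record[left_field])
        b = str(right_record[right_field])
        for name in names:
            features.append(SIMILARITIES[name](a, b))
    return features


# The map only describes the comparison; no encoding happens here.
similarity_map = {
    "title": ["exact", "ratio"],   # same column name on both sides
    "author~writer": ["ratio"],    # different column names
}

left = {"title": "Dune", "author": "Frank Herbert"}
right = {"title": "Dune", "writer": "F. Herbert"}

features = encode_pair(similarity_map, left, right)
print(len(features))  # 3: two values for title, one for author~writer
```

Because the map holds only field associations and similarity names, the same similarity_map object can be passed to encode_pair for any other pair of records with matching columns.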
