
A similarity map is a list of instructions that define how to compare two data sets. Each instruction defines a pair of associated fields in the two data sets and one or more similarity functions that should be used to compare the fields.

Usage

SimilarityMap(instructions)

Arguments

instructions

A named list of instructions that define how to compare the fields of two data sets

Value

A similarity map object

Details

The similarity map is used to create a similarity encoder. Similarity encoders are used (1) to encode pairs of data sets into a single tensor that can be used as input to a neural network, and (2) to retrieve information for constructing refutation claims when reasoning with neural-symbolic matching models. The similarity map itself performs no encoding operations. This design allows defining the similarity map once and using it to encode multiple pairs of data sets or different subsets of the same data.

Instructions

Instructions are named lists of vectors. Each instruction (list element) defines

  • a pair of fields, one from each data set, that should be compared (element key) and

  • a collection of similarity functions that should be used to compare the associated fields (element value).

The instruction keys should follow the association string format.

Association Strings

An association string is a string that contains two field names separated by a tilde (~) character. The first field name is the name of the field in the left data set and the second field name is the name of the field in the right data set. In the general case, where the field names are different in the two data sets, the association string should be formatted as left_column~right_column. This instructs the model to associate the left_column of the Left input data with the right_column of the Right input data.

If a field has the same column name in both data sets, the association string can be the common field name alone. For example, if both the Left and Right input data sets have a column named title, then the association string title instructs the model to associate the title column of the Left input data with the title column of the Right input data.
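The two association string formats can be illustrated with a small base R sketch. The helper parse_association below is hypothetical and is not part of the package; it only shows, under the format described above, how an association string maps to a left and a right column name.

```r
# Hypothetical helper (not part of the package): split an association
# string into the left and right column names it refers to.
parse_association <- function(association) {
  parts <- strsplit(association, "~", fixed = TRUE)[[1]]
  if (length(parts) == 1) {
    # Short form: the same column name is used in both data sets.
    parts <- c(parts, parts)
  }
  # Reject strings with more than one tilde or an empty field name.
  stopifnot(length(parts) == 2, all(nzchar(parts)))
  c(left = parts[[1]], right = parts[[2]])
}

short_form <- parse_association("title")
general_form <- parse_association("left_column~right_column")
```

Under this reading, title and title~title describe the same association.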

Similarity Functions

The vector of similarity functions defines the operations of the similarity encoder for each association string. Each association can have multiple similarity operations, in which case the similarity encoder applies all of them to the associated columns. The caller defines the similarity operations by naming one or more of the predefined similarities. The predefined string similarities and ratios are calculated using the RapidFuzz implementation. A list of available similarities can be obtained by calling available_similarities.
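Because every association can carry several similarity functions, an instruction list expands into one comparison per (association, similarity) pair. The base R sketch below makes that expansion explicit. The helper expand_instructions is hypothetical and is not part of the package; it does not compute any similarities and only lists the comparisons an instruction list implies.

```r
# Hypothetical helper (not part of the package): list every
# (association, similarity) pair implied by an instruction list.
expand_instructions <- function(instructions) {
  stopifnot(is.list(instructions), !is.null(names(instructions)))
  data.frame(
    # Repeat each association once per similarity function it names.
    association = rep(names(instructions), lengths(instructions)),
    similarity = unlist(instructions, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

instructions <- list(
  "title" = c("damerau_levenshtein", "jaro"),
  "year~year" = c("gaussian")
)
pairs <- expand_instructions(instructions)
print(pairs)
```

In this example the two instructions yield three comparisons: two for title and one for year.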

Examples

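instructions <- list(
  # Short form: the column is named title in both data sets
  "title" = c("damerau_levenshtein", "jaro"),
  # Explicit left~right form; equivalent to the short form "author"
  "author~author" = c("lcsseq", "hamming"),
  "year~year" = c("gaussian", "euclidean")
)
# Build the similarity map; no encoding is performed at this point
smap <- SimilarityMap(instructions)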