Setup

    Data Owner A    ←-- parameters --→    Data Owner B
         ↓                                     ↓
    Bloom Filters                         Bloom Filters
         ↓                                     ↓
    train Encoder                         train Encoder
         ↓                                     ↓
encode Bloom Filters                  encode Bloom Filters
         |                                     |
         -----------→  Linkage Unit  ←----------
Linkage Mapper Data Generation

                  Data Owner A    ←-- b_decode(D) ---    Data Owner B
                        |                                     ↑
  a_encode(b_decode(D)) |                                     | random Data D
                        -----------→  Linkage Unit  -----------
                                           ↓
                            pairs (d, a_encode(b_decode(d)))
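
The data flow in the diagram above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the repository's code: fixed random linear maps stand in for the trained decoder and encoder networks, normalization is assumed to mean per-dimension standardization, and the statistics used to rescale the samples for B's decoder (`b_mean`, `b_std`) are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4         # encoding dimension (illustrative value)
bf_len = 32   # Bloom filter length (illustrative value)

# Hypothetical stand-ins for the trained networks: fixed random linear maps.
W_b_dec = rng.normal(size=(n, bf_len))   # B's decoder weights
W_a_enc = rng.normal(size=(bf_len, n))   # A's encoder weights

def b_decode(z):
    """Stand-in for B's decoder: encoding -> values in (0, 1) per Bloom filter bit."""
    return 1.0 / (1.0 + np.exp(-(z @ W_b_dec)))

def a_encode(x):
    """Stand-in for A's encoder: Bloom filter space -> encoding."""
    return x @ W_a_enc

def normalize(z):
    """Assumed normalization: zero mean and unit variance per dimension."""
    return (z - z.mean(axis=0)) / z.std(axis=0)

# Assumed statistics of B's raw encoder outputs, used to rescale the samples.
b_mean, b_std = np.full(n, 0.5), np.full(n, 2.0)

# Linkage unit: sample random data D from an n-dimensional standard normal.
D = rng.standard_normal(size=(1000, n))

# Data owner B: fit D to the encoder's output distribution, then decode.
decoded = b_decode(D * b_std + b_mean)

# Data owner A: encode the decoder outputs and normalize them.
targets = normalize(a_encode(decoded))

# Linkage unit: training pairs (d, a_encode(b_decode(d))) for the mapper.
pairs = list(zip(D, targets))
```

Note that in this flow only random data and its transformations cross between the parties.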
Running run_all_configs.py -cdir <config_directory> in /src/ does the following for each configuration file in /src/<config_directory>/:
- Build two autoencoders with the same structure (specified in the configuration file), one for each of the data owners A and B, and fit each on its owner's set of Bloom filters.
- Encode the two datasets using the fitted encoders and normalize the encoded data.
- Generate training data for a mapper between the two encodings. This is done as follows:
  - The linkage unit samples a random dataset from an n-dimensional standard normal distribution (n being the dimension of the encodings) and sends it to data owner B.
  - B transforms the datapoints to fit the output distribution of B's encoder and feeds them into B's decoder network.
  - The decoder outputs are sent to A and fed into A's encoder network; the encoder outputs are normalized and sent to the linkage unit.
- The linkage unit trains a mapper model on the resulting pairs of datapoints.
- The datasets are linked by applying the mapper to B's encoded datapoints and searching for the nearest neighbor of each output in A's encoded data. If the distance is below a threshold (specified in the configuration file), the two datapoints are considered a match.
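
The last two steps (training the mapper and linking by thresholded nearest-neighbor search) can be illustrated with a small, self-contained NumPy sketch. The assumptions here are not taken from the repository: B's encoding space is simulated as a rotated copy of A's, a linear least-squares fit stands in for the trained mapper model, the distance is assumed to be Euclidean, and the names `link`, `rotation`, and the threshold value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3  # encoding dimension (illustrative value)

# Toy assumption: B's encoding space is a rotated copy of A's, so the ideal
# mapper is the orthogonal matrix `rotation` (z_a = z_b @ rotation).
rotation = np.linalg.qr(rng.normal(size=(n, n)))[0]

# Training pairs as the linkage unit would receive them; here the target
# a_encode(b_decode(d)) is simulated as d @ rotation.
D = rng.standard_normal(size=(500, n))
targets = D @ rotation

# Mapper: a linear least-squares fit stands in for the trained mapper model.
M, *_ = np.linalg.lstsq(D, targets, rcond=None)

def link(enc_a, enc_b, mapper, threshold):
    """Return (index_in_B, index_in_A) pairs whose mapped distance is below threshold."""
    mapped = mapper(enc_b)
    # Euclidean distance from every mapped B point to every A point.
    dists = np.linalg.norm(mapped[:, None, :] - enc_a[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    return [(b, int(a)) for b, a in enumerate(nearest) if dists[b, a] < threshold]

# Five records at A; B holds records 2, 0 and 4 plus one unrelated record.
enc_a = rng.standard_normal(size=(5, n))
enc_b = np.vstack([enc_a[[2, 0, 4]] @ rotation.T, np.full((1, n), 10.0)])

matches = link(enc_a, enc_b, lambda z: z @ M, threshold=0.1)
```

In this toy setup the three shared records are matched to their counterparts in A, while the unrelated record has no neighbor within the threshold and is left unlinked. The full pairwise distance matrix is fine for small examples; larger datasets would likely call for an indexed nearest-neighbor search.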
The linkage results, as well as all models, generated datasets, and training progress data, for a configuration file /src/<config_directory>/<configname>.json are stored in /src/<config_directory>/<configname>/.
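
The naming convention for result directories can be expressed as a short helper. This is a sketch of the convention described above, not the actual logic of run_all_configs.py; the function name `output_dirs` is hypothetical.

```python
from pathlib import Path

def output_dirs(config_directory):
    """Map each <configname>.json in the directory to its result directory <configname>/."""
    config_directory = Path(config_directory)
    return {
        cfg: cfg.with_suffix("")  # drop ".json" to get the result directory path
        for cfg in sorted(config_directory.glob("*.json"))
    }
```

Calling `output_dirs("<config_directory>")` from /src/ would then list, for every configuration file, where its results are expected to be found.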