Welcome to MOLGENIS/connect

Background

MOLGENIS/connect is a semi-automatic data integration system built in MOLGENIS that assists researchers in finding, matching and pooling data from different biobanks. During data integration, MOLGENIS/connect not only suggests relevant data elements from biobanks for the research variables of interest, but also generates data transformation algorithms for integrating them. In addition, users can interact with the system to improve the suggested mappings and algorithms.

Instructions on how to deploy it in Tomcat 7 can be found here. The source code is available on GitHub.


Demo

Click here for a demo or watch a live demo on YouTube

The demo uses data from the Healthy Obese Project. The target schema consists of approximately 90 core data elements (e.g. Body Mass Index, History of Hypertension) that represent the research question. Three biobanks (LifeLines, Prevend and Mitchelstown) are selected in this demo as the source datasets for which we have the data. The task is to harmonize the 90 core data elements in each of the three biobanks separately and then pool the harmonized results into one dataset.

Demo data

Although the demo version does not have full functionality, it allows you to view all the mappings and algorithms in the demo mapping project, which have been generated in advance. In addition, you can try out the semantic search and the algorithm generator functions in the demo. To get access to MOLGENIS/connect, please contact the administrator for login credentials. Try out the examples below; clicking one of the example links takes you directly to the results.


Click 'Example 1' to see an overview of the demo mapping project.



Click 'Example 2' to see the mapping + algorithm auto-generated for the target data element Body Mass Index kg/m2 (BMI) in the biobank lifelines.

How the algorithm for BMI in lifelines is generated:

  • the semantic search is applied to find all the relevant data elements in lifelines for BMI:
      1. BMI: Body Mass Index
      2. HEIGHT: Height in cm
      3. WEIGHT: Weight in kg
      4. .....
  • the algorithm generator checks whether any pre-defined template is associated with BMI; the template shown below is detected in the database.
 
$('weight').div($('height').pow(2)).value()

  • the algorithm generator instantiates the BMI template with the data elements found in lifelines. Units of the mapped data elements are adjusted accordingly, e.g. HEIGHT is recorded in cm and is therefore divided by 100 to obtain metres.
 
$('WEIGHT').div($('HEIGHT').div(100.0).pow(2)).value()
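The generated expression can be tried outside the demo with a small stand-in for the `$` accessor. This is only a sketch: `makeAccessor` is a hypothetical helper written for this example, it supports just `div`, `pow` and `value`, and the sample row is made up. It is not the MOLGENIS implementation.

```javascript
// Hypothetical stand-in for the `$` accessor used in the algorithms above.
// It only supports div(), pow() and value(); it is not the MOLGENIS implementation.
function makeAccessor(row) {
  const wrap = (v) => ({
    // The divisor may be a plain number or another wrapped value.
    div: (d) => wrap(v / (typeof d === 'number' ? d : d.value())),
    pow: (e) => wrap(Math.pow(v, e)),
    value: () => v,
  });
  return (name) => wrap(row[name]);
}

// Made-up source row: WEIGHT in kg, HEIGHT in cm.
const $ = makeAccessor({ WEIGHT: 70, HEIGHT: 175 });

// Same expression as the generated algorithm: HEIGHT is converted from cm to m first.
const bmi = $('WEIGHT').div($('HEIGHT').div(100.0).pow(2)).value();
console.log(bmi.toFixed(2)); // 22.86
```

Without the division by 100 the same row would give 70 / 175², a value far below any plausible BMI, which is why the unit adjustment is part of the generated algorithm.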


Click 'Example 3' to see the mapping + algorithm auto-generated for the target data element History of Hypertension in the biobank lifelines. In the example, you will find that: 

  • the first suggested data element is 'Do you ever have high blood pressure' because 'high blood pressure' is a synonym of 'Hypertension'. 
  • the target categories and source categories are automatically matched based on lexical similarity.
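Matching categories by lexical similarity can be sketched as follows. The text above does not say which similarity measure MOLGENIS/connect uses, so this example assumes a Dice coefficient over character bigrams. The function names and the category codes and labels are made up for illustration.

```javascript
// Character bigrams of a label, ignoring case, spaces and punctuation.
function bigrams(text) {
  const s = text.toLowerCase().replace(/[^a-z0-9]/g, '');
  const grams = new Set();
  for (let i = 0; i < s.length - 1; i++) grams.add(s.slice(i, i + 2));
  return grams;
}

// Dice coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|).
function dice(a, b) {
  const ga = bigrams(a);
  const gb = bigrams(b);
  if (ga.size === 0 || gb.size === 0) return 0;
  let shared = 0;
  for (const g of ga) if (gb.has(g)) shared++;
  return (2 * shared) / (ga.size + gb.size);
}

// For every source category, pick the target category with the most similar label.
// A source category with no similar target is mapped to null.
function matchCategories(source, target) {
  const mapping = {};
  for (const [sCode, sLabel] of Object.entries(source)) {
    let best = null;
    let bestScore = 0;
    for (const [tCode, tLabel] of Object.entries(target)) {
      const score = dice(sLabel, tLabel);
      if (score > bestScore) {
        best = tCode;
        bestScore = score;
      }
    }
    mapping[sCode] = best;
  }
  return mapping;
}

// Made-up categories (code: label) for a yes/no question.
const sourceCategories = { 1: 'Yes', 2: 'No', 3: 'I do not know' };
const targetCategories = { 0: 'No', 1: 'Yes', 9: 'Unknown' };

const categoryMapping = matchCategories(sourceCategories, targetCategories);
console.log(categoryMapping); // { '1': '1', '2': '0', '3': '9' }
```

Suggestions produced this way are only a starting point, which is why the system lets users correct them.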


Click 'Example 4' to see the mapping + algorithm auto-generated for the target data element Current Consumption Frequency of Potatoes in the biobank lifelines. In this example, target and source categories are automatically matched using a strategy in which

  • both the target and source categories are first converted to quantifiable amounts.
  • each source amount is then matched to the closest target amount.

You will see that only the first source data element is used in the algorithm. However, the second element is also related to Potatoes. Therefore:

  • you need to select the checkbox of the second data element.
  • you will see that the algorithm is updated based on the new selection of data elements. 
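The category matching strategy of Example 4 (convert labels to amounts, then pick the closest target amount) can be sketched as follows. The label parser, the times-per-week scale and all category labels are assumptions made for this example. The parser used by MOLGENIS/connect is not described here and may work differently.

```javascript
// Times-per-week multipliers for the time units this sketch understands.
const PER_WEEK = { day: 7, week: 1, month: 12 / 52 };

// Convert a frequency label to an approximate number of times per week.
// Ranges such as "2-4" are replaced by their midpoint.
function toTimesPerWeek(label) {
  const text = label.toLowerCase();
  if (text.includes('never')) return 0;
  const m = text.match(/(\d+)(?:\s*-\s*(\d+))?\s*(?:times?)?\s*(?:per|a)\s*(day|week|month)/);
  if (!m) return null; // label not understood by this sketch
  const low = Number(m[1]);
  const high = m[2] ? Number(m[2]) : low;
  return ((low + high) / 2) * PER_WEEK[m[3]];
}

// Map every source category to the target category with the closest amount.
function matchClosest(source, target) {
  const mapping = {};
  for (const [sCode, sLabel] of Object.entries(source)) {
    const amount = toTimesPerWeek(sLabel);
    let best = null;
    let bestDiff = Infinity;
    for (const [tCode, tLabel] of Object.entries(target)) {
      const tAmount = toTimesPerWeek(tLabel);
      if (amount === null || tAmount === null) continue;
      const diff = Math.abs(amount - tAmount);
      if (diff < bestDiff) {
        best = tCode;
        bestDiff = diff;
      }
    }
    mapping[sCode] = best;
  }
  return mapping;
}

// Made-up categories (code: label).
const sourceFrequencies = {
  1: 'never',
  2: '1-3 times per month',
  3: '1 time per week',
  4: '2-4 times per week',
  5: '5-6 times per week',
  6: '1 time per day',
};
const targetFrequencies = {
  0: 'never',
  1: '1 time a week',
  2: '3 times a week',
  3: '7 times a week',
};

const frequencyMapping = matchClosest(sourceFrequencies, targetFrequencies);
console.log(frequencyMapping);
// { '1': '0', '2': '0', '3': '1', '4': '2', '5': '3', '6': '3' }
```

Converting both sides to the same scale first means that labels worded differently, such as '1 time per day' and '7 times a week', can still be matched.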

Technical design

First, we developed a semantic search that uses ontology-based query expansion to find relevant data elements in biobanks, irrespective of variations in the terminology used. Second, we created an algorithm generator that automatically generates data transformation algorithms to convert these data elements to the target schema, including unit conversion, category mapping, and more complex recurring conversion patterns, e.g. the calculation of BMI.
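The idea behind ontology-based query expansion can be illustrated with a toy example. The synonym table below stands in for a real ontology, and the data element names and the simple substring matching are assumptions made for this sketch. The actual semantic search is more elaborate than this.

```javascript
// Made-up miniature "ontology": each term maps to its synonyms.
const SYNONYMS = {
  hypertension: ['high blood pressure'],
  'body mass index': ['bmi'],
};

// Expand a query into the ontology terms it mentions plus their synonyms.
// If no known term is found, the query itself is used unchanged.
function expandQuery(query) {
  const q = query.toLowerCase();
  const phrases = new Set();
  for (const [term, synonyms] of Object.entries(SYNONYMS)) {
    if (q.includes(term)) {
      phrases.add(term);
      synonyms.forEach((s) => phrases.add(s));
    }
  }
  if (phrases.size === 0) phrases.add(q);
  return [...phrases];
}

// Return the data elements whose label contains any of the expanded phrases.
function search(query, dataElements) {
  const phrases = expandQuery(query);
  return dataElements.filter((el) => {
    const label = el.label.toLowerCase();
    return phrases.some((p) => label.includes(p));
  });
}

// Made-up source data elements; the name 'HBP' is hypothetical.
const dataElements = [
  { name: 'HBP', label: 'Do you ever have high blood pressure' },
  { name: 'HEIGHT', label: 'Height in cm' },
  { name: 'WEIGHT', label: 'Weight in kg' },
];

const hits = search('History of Hypertension', dataElements);
console.log(hits.map((el) => el.name)); // [ 'HBP' ]
```

Without the expansion step the query would find nothing here, because no label contains the word 'Hypertension' itself.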

 
