Introduction
The CTMM TraIT project recently added the Cell Line Use Case (CLUC) to tranSMART. The CLUC is a collection of data on colorectal and prostate cell lines from an exceptionally broad set of platforms, as shown in the table below.
This diverse set is used to:
Standardize data formats and data processing pipelines from the four domains
Test the integration pipelines within the TraIT translational toolset
By incorporating the same platforms as used for ongoing research projects, this cell line set gives a representative test set comparable to real patient data, without the legal burden of handling personal data. The TraIT Cell Line Use Case transmart-ready files are available under the CC0 license for download here.
Please use the following citation when making use of this dataset: Bierkens, Mariska & Bijlard, Jochem "The TraIT cell line use case." Manuscript in preparation. More information can also be found on the Bio-IT World Poster "Multi-omics data analysis in tranSMART using the Cell Line Use Case dataset".
General remarks
Note that the folder structure is very important in the upload process; make sure to structure your data in the correct way (figure 1). For more detailed information about the data type you wish to load, please refer to the section dedicated to that specific data type.
It is important to set up the batchdb.properties file to provide transmart-batch with the location of, and login information for, the database. A detailed explanation of the properties file can be found here.
This tutorial assumes that the data is loaded into a local database with default settings, meaning that the database is located on the same machine that holds the data folders and the ETL pipeline scripts.
Important note: as transmart-batch currently does not have a pipeline for VCF data, this data type will have to be loaded with Kettle.
Setting up transmart-batch and general documentation
For the complete documentation on transmart-batch please look here.
To use transmart-batch with tranSMART 16.1 or 16.2 you can use the V1.0 release. To use the latest version, clone the git repository and build transmart-batch:
git clone https://github.com/thehyve/transmart-batch.git
cd transmart-batch
./gradlew capsule
After building you should see transmart-batch/build/lib/transmart-batch-1.1-SNAPSHOT-capsule.jar
Batchdb.properties file
The properties file contains information such as the location of the database and the username and password used to upload the data. The file consists of four lines specifying the driver for the database being used (either PostgreSQL or Oracle), the location of the database, the user, and the password.
PostgreSQL:
batch.jdbc.driver=org.postgresql.Driver
batch.jdbc.url=jdbc:postgresql://localhost:5432/transmart
batch.jdbc.user=tm_cz
batch.jdbc.password=tm_cz
Oracle:
batch.jdbc.driver=oracle.jdbc.driver.OracleDriver
batch.jdbc.url=jdbc:oracle:thin:@localhost:1521:ORCL
batch.jdbc.user=tm_cz
batch.jdbc.password=tm_cz
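The four lines above can be written out in one step; a minimal sketch that creates the PostgreSQL variant of the file, using the same example values as above, is:

```shell
# Write a batchdb.properties file for a default local PostgreSQL setup.
# The host, port, database name, and tm_cz credentials mirror the example
# values above; adjust them for your own installation.
cat > batchdb.properties <<'EOF'
batch.jdbc.driver=org.postgresql.Driver
batch.jdbc.url=jdbc:postgresql://localhost:5432/transmart
batch.jdbc.user=tm_cz
batch.jdbc.password=tm_cz
EOF
```

transmart-batch reads this file to know which database to connect to and as which user, so it must be in place before any load is started.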
Data structure and loading the data
In order to load the data properly the scripts need to know where the data is located; to achieve this, the data structure is more or less fixed. For the data (available here) the only thing you have to do is extract the files and you are ready to load. The following figure gives an overview of the data types and the way the folder structure is built up. More details about particular data types can be found in their respective sections.
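To illustrate the idea of the fixed structure, the sketch below creates a study folder with one subfolder per data type, each holding its own params file. The study and folder names here are purely illustrative (the CLUC download already ships with the correct structure), so treat this as a shape, not as the exact layout:

```shell
# Hypothetical study layout: one folder per data type, each with the
# .params file that tells transmart-batch how to load that data type.
STUDY=CLUC_example
mkdir -p "$STUDY/clinical" "$STUDY/expression"
touch "$STUDY/study.params" \
      "$STUDY/clinical/clinical.params" \
      "$STUDY/expression/expression.params"
# List the params files the loader would pick up:
find "$STUDY" -name '*.params' | sort
```

Because the pipeline locates data relative to these folders, renaming or moving them will break the load.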
Getting the data to the server
If you want to upload the data to a server you first need to get the data onto the server. The easiest way to do this is by opening a terminal window and connecting to the server:
ssh username@serverAddress
When the connection is made, open a new terminal window (do not close the window in which you connected to the server) and navigate to the folder containing the study you want to copy. From that folder, run the following command:
scp -r study_name username@serverAddress:~

Here ~ (your home folder) is the default destination; replace it with the folder on the server where the data should be placed.
Loading the data
To load the data transmart-batch needs three things:
batchdb.properties file
data to be loaded
params file; this can be the data type params file or the annotation platform params file
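With these three in place, a load is started by pointing transmart-batch at a params file. The sketch below is an assumed invocation, not taken from this document: the jar name matches the build step above, the -p flag follows typical transmart-batch usage, and the study path is illustrative.

```shell
# Sketch: load a study via its clinical params file.
# Assumes batchdb.properties is in the current working directory;
# the study path is a placeholder.
java -jar transmart-batch-1.1-SNAPSHOT-capsule.jar \
    -p /path/to/study/clinical.params
```

Each data type of a study is loaded by a separate run against that data type's params file.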
Setting up transmart-data
To upload the Cell Line Use Case data you need the transmart-data folder, which can be obtained here:
git clone https://github.com/transmart/transmart-data
transmart-data contains a collection of scripts and the basic folder structure you need to upload the data. After cloning transmart-data from GitHub you need to update the ETL pipeline by running the following commands from the transmart-data folder:
make -C env update_etl_git
make -C env data-integration
After the transmart-ETL update is done there should be two folders in transmart-data/env called tranSMART-ETL and data-integration.
The next step is to configure the vars file in transmart-data. There is a sample file called vars.sample; make a copy of this and name it vars.
The vars file contains settings for both Oracle and PostgreSQL databases; since we are using PostgreSQL, only the PostgreSQL-related parameters need to be set correctly.
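As an illustration, a minimal PostgreSQL section of such a file might look like the sketch below. The variable names (PGHOST, PGPORT, and so on) are assumptions based on the standard PostgreSQL client environment variables; check vars.sample for the exact names and the full list used by transmart-data.

```shell
# Illustrative PostgreSQL settings for the vars file; values match the
# local-database assumption made earlier in this tutorial.
cat > vars <<'EOF'
PGHOST=localhost
PGPORT=5432
PGDATABASE=transmart
PGUSER=postgres
PGPASSWORD=postgres
EOF
# Load the settings into the current shell:
. ./vars
```

Any parameter left at a wrong value here will cause the loading scripts to connect to the wrong database or fail to connect at all.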
When you are done setting up the vars file, run the following command in the transmart-data folder:
source vars