The CTMM TraIT project recently added the Cell Line Use Case (CLUC) to tranSMART. The CLUC is a collection of data on colorectal and prostate cell lines from an exceptionally broad set of platforms, as shown in the table below.
This diverse set is used to:
Standardize data formats and data processing pipelines from the four TraIT domains
Test the integration pipelines within the TraIT translational toolset
By incorporating the same platforms as used for ongoing research projects, this cell line set gives a representative test set comparable to real patient data, without the legal burden of handling personal data. The TraIT Cell Line Use Case transmart-ready files are available under the CC0 license for download here.
Please use the following citation when making use of this dataset: Bierkens, Mariska & Bijlard, Jochem "The TraIT cell line use case." Manuscript in preparation. More information can also be found on the Bio-IT World Poster "Multi-omics data analysis in tranSMART using the Cell Line Use Case dataset".
Table of contents
Note that the folder structure is very important in the upload process, so make sure to structure your data in the correct way (figure 1). For more detailed information about the data type you wish to load, please refer to the section dedicated to that specific data type.
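As an illustration, a study folder prepared for transmart-batch might look like the sketch below. The study name CLUC and the file names are hypothetical; which params files and data folders you need depends on the data types you load:

```
CLUC/
├── study.params
├── clinical.params
├── clinical/
│   ├── CLUC_clinical_data.txt
│   └── CLUC_column_mapping.txt
└── expression.params
```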
It is important to set up the batchdb.properties file to provide transmart-batch with the location of the database and the login information needed to load the data. A detailed explanation of the properties file can be found here.
For this tutorial the assumption is made that the data is loaded into a local database with default settings, meaning that the database is located on the same machine that holds the data folders and the ETL pipeline scripts.
Important note: as transmart-batch currently does not have a pipeline for VCF data, this data type will have to be loaded with Kettle.
Setting up transmart-batch and general documentation
For the complete documentation on transmart-batch please look here.
After building you should see transmart-batch/build/lib/transmart-batch-1.1-SNAPSHOT-capsule.jar
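If you have not built transmart-batch yet, a typical way to obtain the capsule jar is sketched below. This assumes a checkout from The Hyve's GitHub repository and the Gradle wrapper that ships with it; check the transmart-batch documentation for the exact build instructions for your version:

```shell
# Clone the transmart-batch repository (URL is an assumption; use your own fork if applicable)
git clone https://github.com/thehyve/transmart-batch.git
cd transmart-batch

# Build the self-contained capsule jar; the result should appear under build/lib/
./gradlew capsule
```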
The properties file contains information such as the location of the database and the username and password that are used to upload the data to the database. The properties file is built up of four lines indicating which database is being used (either PostgreSQL or Oracle), the location of the database, and the user.
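For a local PostgreSQL database with default settings, the properties file could look like the following sketch. The batch.jdbc.* property names follow the transmart-batch documentation, but verify them against your transmart-batch version, and adjust the host, port, database name and credentials to your own situation:

```shell
# Write a minimal batchdb.properties for a local PostgreSQL database.
cat > batchdb.properties <<'EOF'
batch.jdbc.driver=org.postgresql.Driver
batch.jdbc.url=jdbc:postgresql://localhost:5432/transmart
batch.jdbc.user=tm_cz
batch.jdbc.password=tm_cz
EOF
```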
Data structure and loading the data
In order to load the data properly the scripts need to know where the data is located; to achieve this the folder structure is more or less fixed. For the data (available here) the only thing you have to do is extract the files and you are ready to load. The following figure gives an overview of the data types and the way the folder structure is built up. More details about particular data types can be found in their respective sections.
Getting the data to the server
If you want to upload the data to a server you first need to get the data onto the server. The easiest way to do this is by opening a terminal window and connecting to the server:
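The connection command could look like the following sketch. The username and hostname are placeholders; the -L option forwards local port 5432 to the database port on the server, which is useful later when the vars file points at a forwarded port:

```shell
# Connect to the server and forward local port 5432 to the server's PostgreSQL port.
ssh -L 5432:localhost:5432 username@server.example.com
```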
When the connection is made, open a new terminal window (do not close the window in which you are connected to the server) and navigate to the study you want to copy. From the folder the study is located in, run the following command:
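The copy command could be an scp call along these lines; the study folder name, username, hostname and target path are all placeholders to adapt:

```shell
# Recursively copy the study folder to the server.
scp -r CLUC username@server.example.com:/home/username/studies/
```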
Loading the data
To load the data, transmart-batch needs three files.
the params file of the data to be loaded; this can be the data type params file or the annotation platform params file
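Assuming the capsule jar and a batchdb.properties file as set up above, a load is then started along these lines. The paths and the study name CLUC are placeholders, and you should check the help output of your transmart-batch version for the exact flags:

```shell
# Load clinical data for a (hypothetical) study named CLUC.
# -c points at the database properties file, -p at the params file to load.
java -jar transmart-batch-1.1-SNAPSHOT-capsule.jar \
    -c /path/to/batchdb.properties \
    -p /path/to/CLUC/clinical.params
```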
Setting up transmart-data
To upload the Cell-line use case data you need the transmart-data folder, which can be found here:
transmart-data contains a collection of scripts and the basic folder structure you need to upload the data. After downloading the transmart-data folder from GitHub you need to update the ETL pipeline by running the following command from the transmart-data folder:
After the transmart-ETL update is done there should now be two folders in /transmart-data/env called tranSMART-ETL and data-integration.
The next step is to configure the vars file in transmart-data. There is a sample file called vars.sample; make a copy of this and name it vars.
The vars file contains information for both Oracle and PostgreSQL databases; as we are using PostgreSQL, only the following parameters must be set correctly:

PGHOST: The host on which the database runs. By default set to localhost.
PGPORT: The port on which the database can be reached. By default this is set to port 5432 for a localhost. When using a server, set this to the local port to which the SSH connection forwards the database port.
PGDATABASE: The name of the database to which the data will be uploaded. The default name of the database is transmart.
PGUSER: Your username to access the database. Default set to tm_cz. NOTE: this is not the same as a login used to access the data via the web client.
PGPASSWORD: The password to access the database. Leave empty if there is no password. Default is tm_cz.
PGSQL_BIN: The path to the directory where the PostgreSQL binaries are installed. When installed locally with a package manager the location will probably be /usr/local/bin/. Be sure to end the pathname with a /.
KETTLE_JOBS_PSQL: The path to where the Kettle scripts are located. If you updated the ETL pipeline as described above, the location will be <path_to>/transmart-data/env/tranSMART-ETL/Postgres/GPL-1.0/Kettle/Kettle-ETL/. Be sure to replace <path_to> with the actual location of the folder.
KETTLE_JOBS: Same as KETTLE_JOBS_PSQL: KETTLE_JOBS=$KETTLE_JOBS_PSQL
KITCHEN: Points to the kitchen script, located in the transmart-data folder under env/data-integration/kitchen.sh.
R_JOBS_PSQL: Points to the R scripts used when uploading clinical data, which can be found in the transmart-data folder.
PATH: For the load.sh scripts to find the loading scripts, their location needs to be added to the PATH.
Lastly, the parameters that were set need to be exported to the environment:
export PGHOST PGPORT PGDATABASE PGUSER PGPASSWORD PGSQL_BIN \
KETTLE_JOBS_PSQL KETTLE_JOBS R_JOBS_PSQL KITCHEN KETTLE_HOME PATH
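Put together, the database part of a vars file for a default local setup could look like the partial sketch below. All values, including the tm_cz credentials and the <path_to> placeholder, must be adapted to your own installation, and the remaining parameters (R_JOBS_PSQL, KETTLE_HOME, PATH) still need to be set as described above:

```shell
# Partial sketch of the PostgreSQL section of the vars file; adapt all values.
PGHOST=localhost
PGPORT=5432
PGDATABASE=transmart
PGUSER=tm_cz
PGPASSWORD=tm_cz
PGSQL_BIN=/usr/local/bin/
KETTLE_JOBS_PSQL="<path_to>/transmart-data/env/tranSMART-ETL/Postgres/GPL-1.0/Kettle/Kettle-ETL/"
KETTLE_JOBS=$KETTLE_JOBS_PSQL
KITCHEN="<path_to>/transmart-data/env/data-integration/kitchen.sh"

export PGHOST PGPORT PGDATABASE PGUSER PGPASSWORD PGSQL_BIN \
       KETTLE_JOBS_PSQL KETTLE_JOBS KITCHEN
```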
When you are done setting up the vars file, run the following from the transmart-data folder: