TranSMART-batch version for this dataset:
The TraIT CLUC data can be downloaded here: http://beehub.nl/TraIT-Datateam/Public%20CLUC/trait-cluc-v1.0.zip
The CTMM TraIT project recently added the Cell Line Use Case (CLUC) to tranSMART. The CLUC is a collection of data on colorectal and prostate cell lines from an exceptionally broad set of platforms, as shown in the table below.
This diverse set is used to:
- Standardize data formats and data processing pipelines from the four domains
- Test the integration pipelines within the TraIT translational toolset
By incorporating the same platforms as used for ongoing research projects, this cell line set gives a representative test set comparable to real patient data, without the legal burden of handling personal data. The TraIT Cell Line Use Case is available under the CC0 license at http://beehub.nl/TraIT-Datateam/Public%20CLUC/Cell-line.tar.gz.
Please use the following citation when making use of this dataset: Bierkens, Mariska & Bijlard, Jochem "The TraIT cell line use case." Manuscript in preparation. More information can also be found on the Bio-IT World Poster "Multi-omics data analysis in tranSMART using the Cell Line Use Case dataset".
- TranSMART-batch version for this dataset:
- General remarks
- Setting up transmart-data
- Data structure and loading the data
- Clinical data
- Gene expression data
- Copy Number Variation ('aCGH') data
- Small Genomic Variants ('VCF')
- Advanced loading
- Common errors during loading
- Uploading using unified shell-scripts
Throughout the document you will find <path_to> tags; these are placeholders for the path to where the transmart-data folder is located. For example, if on your computer the transmart-data folder is located in /Users/name/transmart-data, that path should be inserted in place of the tag.
Note that the folder structure is very important in the upload process, so make sure to structure your data in the correct way (figure 1). As a general rule, commands that start with "make -C" need to be called from the transmart-data folder; all other commands need to be run from the specific folders where the data is located. For more detailed information about the data type you wish to load, please refer to the section dedicated to that specific data type.
For this tutorial the assumption is made that the data is loaded into a local database with default settings, meaning that the database is located on the same machine that holds the data folders and the ETL pipeline scripts. For remote loading or loading to databases without default settings, please check the Advanced Loading section.
Setting up transmart-data
To upload the Cell-line use case data you need the transmart-data folder, which can be found here:
transmart-data contains a collection of scripts and the basic folder structure you need to upload the data. After downloading the transmart-data folder from github you need to update the ETL-pipeline by running the following command from the transmart-data folder:
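transmart-data lives on GitHub; a hedged sketch of both steps follows. The repository URL and the make targets are assumptions, inferred from the two folder names the update is said to produce:

```shell
# clone transmart-data (repository location is an assumption)
git clone https://github.com/transmart/transmart-data.git
cd transmart-data
# update the ETL pipeline; target names assumed from the resulting
# env/tranSMART-ETL and env/data-integration folders
make -C env/ tranSMART-ETL data-integration
```

Check your transmart-data version's documentation for the exact target names before running this.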
After the transmart-ETL update is done there should now be two folders in /transmart-data/env called tranSMART-ETL and data-integration.
The next step is to configure the vars file in transmart-data. There is a sample file called vars.sample; make a copy of this and name it vars. The vars file contains information for both Oracle and Postgres databases; as we are using Postgres, only the following parameters must be set correctly:
- PGHOST: By default set to localhost.
- PGPORT: The port on which the database can be reached. By default this is set to 5432 for a localhost. When using a server you need to forward this to the port on which the SSH connection is established.
- PGDATABASE: The name of the database to which the data will be uploaded; the default name is transmart.
- PGUSER: Your username to access the database, set to tm_cz by default. NOTE: this is not the same as a login used to access the data via the web client.
- PGPASSWORD: The password to access the database. Leave empty if there is no password. The default is tm_cz.
- PGSQL_BIN: The path to the directory where Postgres is installed. When installed locally with a package manager the location will probably be /usr/local/bin/. Be sure to end the pathname with a /.
- KETTLE_JOBS_PSQL: The path to where the Kettle scripts are located. If you updated the ETL pipeline as described above, the location will be <path_to>/transmart-data/env/tranSMART-ETL/Postgres/GPL-1.0/Kettle/Kettle-ETL/. Be sure to replace <path_to> with the actual location of the folder.
- KETTLE_JOBS: Same as KETTLE_JOBS_PSQL: KETTLE_JOBS=$KETTLE_JOBS_PSQL.
- KITCHEN: This is in the transmart-data folder under env/data-integration/kitchen.sh.
- KETTLE_HOME: This is in the transmart-data folder under samples/postgres/kettle-home (the folder where kettle.properties is generated).
- R_JOBS_PSQL: Points to the R script used when uploading clinical data. Can be found in the transmart-data folder under:
- PATH: For the load.sh scripts to find the loading scripts, their location needs to be added to the PATH.
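Putting the settings above together, a minimal Postgres section of the vars file could look like this. The paths are examples for the /Users/name machine used earlier; adjust them for your own setup:

```shell
# Postgres connection settings (defaults discussed above)
PGHOST=localhost
PGPORT=5432
PGDATABASE=transmart
PGUSER=tm_cz
PGPASSWORD=tm_cz
PGSQL_BIN=/usr/local/bin/
# ETL locations -- replace /Users/name with your own <path_to>
KETTLE_JOBS_PSQL=/Users/name/transmart-data/env/tranSMART-ETL/Postgres/GPL-1.0/Kettle/Kettle-ETL/
KETTLE_JOBS=$KETTLE_JOBS_PSQL
KITCHEN=/Users/name/transmart-data/env/data-integration/kitchen.sh
```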
Lastly the parameters set need to be exported to the environment.
export PGHOST PGPORT PGDATABASE PGUSER PGPASSWORD PGSQL_BIN \
KETTLE_JOBS_PSQL KETTLE_JOBS R_JOBS_PSQL KITCHEN KETTLE_HOME PATH
When you are done setting up the vars file, run source vars from the transmart-data folder to load the settings into your shell.
For Mac OSX:
The loading scripts use the -e option of readlink. This option is not supported by the readlink that ships with Mac OSX; to bypass this problem you need to install greadlink (which stands for GNU readlink). The easiest way to do this is with Homebrew, by running brew install coreutils.
If you do not have Homebrew, install it from the Homebrew website first.
After installing the coreutils, go to ~/transmart-data/samples/postgres and open process_params.inc. On line 20 you will find params=$(readlink -e "$1").
Change this to
params=$(greadlink -e "$1"), save the change and you should be OK.
Data structure and loading the data
In order to load the data properly the scripts need to know where the data is located; to achieve this, the data structure is more or less fixed. For the data in Cell-line.tar.gz the only thing you have to do is extract the files, and we advise putting them in ../transmart-data/samples/studies. The following figure gives an overview of the data types and the way the folder structure is built up. More details about particular datatypes can be found in their respective sections. One requirement for the data upload to succeed is to have a
logs folder in all the folders that contain data to be uploaded; so, for example, in the aCGH/gene/180k folder there will be a folder called
logs. All the loading processes will be stored in log files in the respective datatype's subdirectory.
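As a concrete illustration of the requirement above, the mandatory logs folder for the aCGH 180k gene-level data can be created like this (assuming the study is extracted under samples/studies as advised):

```shell
# create the mandatory logs folder for one of the data directories;
# every folder that contains data to be uploaded needs one
mkdir -p samples/studies/Cell-line/aCGH/gene/180k/logs
```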
Getting the data to the server
If you want to upload the data to a server, you first need to get the data on the server. The easiest way to do this is by opening a terminal window and connecting to the server:
When the connection is made, open a new terminal window (do not close the window in which you are connected to the server) and navigate to the study you want to copy. From the folder the study is located in, run the following command:
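A hedged example of both steps; the user name, host name and remote path are placeholders you need to replace with your own:

```shell
# terminal 1: connect to the server (placeholder host)
ssh user@server.example.org

# terminal 2, from the folder that contains the study:
# copy the study folder recursively to the server
scp -r Cell-line user@server.example.org:~/transmart-data/samples/studies/
```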
Loading the data
In the Cell-line folder there are 2 scripts, check_env.sh and
load.sh. Before loading anything, check if the environment is set by running the check_env.sh script.
The chmod command sets the correct permissions for the load scripts to run.
If the output of the script tells you Environment to upload data into tranSMART looks OK, you can start uploading the data; if it gives an error, look at the error and check what is missing before you start uploading (did you source the vars file?).
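This is not the actual check_env.sh, but a sketch of the kind of test it performs: every required variable from the vars file must be non-empty. The example values at the top are assumptions standing in for a sourced vars file:

```shell
# example values -- in practice these come from sourcing your vars file
export PGHOST=localhost PGPORT=5432 PGDATABASE=transmart PGUSER=tm_cz
export KETTLE_JOBS_PSQL=/tmp/kettle KITCHEN=/tmp/kitchen.sh

# check that every required variable is non-empty
ok=1
for v in PGHOST PGPORT PGDATABASE PGUSER KETTLE_JOBS_PSQL KITCHEN; do
  if [ -z "${!v}" ]; then
    echo "variable $v is not set"
    ok=0
  fi
done
if [ "$ok" -eq 1 ]; then
  echo "Environment to upload data into tranSMART looks OK"
fi
```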
NOTE: The environment check also checks if the kettle.properties file is present. It is important to know that the kettle.properties file is generated when the clinical data is loaded the first time.
When everything is set, you can load the entire study by executing the load.sh script in the top-level study folder.
It is possible to load each data type individually; this might be worth considering, as some data loads may take a long time to complete. Look at the dedicated sections on the data types below to learn more.
To load just the clinical data, run load.sh from the clinical folder.
In the clinical data folder you need the following files:
A params file, which indicates where the column and word mapping files are located.
Two data files: one contains the characteristics data, the other the non-high-throughput data. The names of the files can be arbitrarily chosen as long as they are specified in the column mapping file. The files should be tab-separated.
The column mapping file contains 6 columns: filename, category code, column number, data label, data label source and control vocab cd. The filename is the file name of a tab-separated datafile; the category code is used to indicate the part of the tree shown in tranSMART (subject is a reserved term to indicate the patients). The column number indicates which column should be used from the datafile, and the data label indicates the leaf node name (SUBJ_ID is a reserved term to indicate the subjects). The last 2 columns can be empty. Each category code - data label pair should have a unique name, and each unique name should have one column assigned from a data file.
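For illustration, a few hypothetical rows of a column mapping file; the columns are tab-separated in the real file, and the file name and category codes below are made up:

```
Filename           Category Code           Column Number  Data Label  Data Label Source  Control Vocab Cd
cluc_clinical.txt  Subjects                1              SUBJ_ID
cluc_clinical.txt  Subjects+Demographics   5              Age
```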
The word mapping file can be used to transform values in the data file according to, for example, a codebook. The file should contain 4 columns: the file name of the data file in which the value to be replaced is located, the column number of the concept in the data file, the value to be replaced, and the new value.
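A hypothetical word mapping file following the 4-column layout just described (tab-separated in the real file; file name and values are made up):

```
cluc_clinical.txt  4  M  Male
cluc_clinical.txt  4  F  Female
```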
In the case of the CLUC data there are 2 data files: one contains the characteristics like age, gender, disease and sample ID; the second contains some non-high-throughput molecular profiling data describing gains or losses of selected genes. The column mapping file maps the columns in the data files to the correct tree structure shown in tranSMART; it tells, for example, that column 5 in the data file is the age column and should be stored under a variable called Age.
When the Postgres database is setup with the default values the tutorial works fine, but when you try to load data to a database with a different name or try to load to a database on a different port this requires some more insight to how the data loading works. As shown above the data loading process needs the following files and settings to be in place:
- vars file - to set the correct environment variables
- kettle.properties - this file is generated based on the vars file the first time clinical data is loaded.
- File structure - Study name is derived from this, and for the tutorial the file structure is set. Adding new data types or uploading your own data is not bound by the structure.
- transmart-ETL & data-integration folders
As mentioned, in order for the data upload to work you need a correct vars file, and the kettle.properties file is generated for you the first time you load clinical data. While this is true for the very first time you load data into a database using the freshly set up transmart-data, loading data to additional databases with slightly different settings for database name or port will fail if you do not update the kettle.properties file properly.
Updating the vars file to handle the new database and sourcing it does not change anything in the kettle.properties file, which is the file KETTLE uses to retrieve the database name and port number. So, for example, if your first data load is to a database called transmart which runs on port 5432, and you now want to load to an additional local database called transmart_test that also runs on port 5432 but you only adjust the vars file, all the data will still be loaded into the first transmart database.
The easiest way to adjust kettle.properties is to delete it and let the clinical upload to the second database generate a new one, but this means that switching databases to upload additional data requires you to reload the clinical data on every switch. To avoid this, you can simply change the database name and port number in kettle.properties by hand.
Uploading to a remote host:
Important to note here is that you can upload data in two different ways: to a local installation or to a remote installation. Loading to a local installation is what was shown above; loading to a remote installation requires you to connect to the remote host using SSH, and depending on what you have loaded already it might require some additional setup. Loading remotely requires you to set up an SSH tunnel to the server where the database is located by running:
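Based on the port numbers mentioned next (local port 25432 forwarded to the database port 5432), the tunnel command looks like this; the user name and host are placeholders:

```shell
# forward local port 25432 to port 5432 on the database server
# (user and host are placeholders)
ssh -L 25432:localhost:5432 user@database-server
```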
This means that in your local vars file and kettle.properties you need to fill in 25432. This port is forwarded to the database port, 5432.
After setting up this ssh, you can open a new terminal window (remember to source the vars file) and load the data on the remote server.
Up to here the loading of the CLUC data has served as an example for your own data uploads. The loading process should have given you an idea of the folder structure that is used to upload data, the requirements, and the order in which the data can be loaded. Loading new data or your own data can be done by closely following the example upload files provided for the CLUC. The figure shows a generic setup for the data types shown above, with clinical being the only one that has no annotation files. On the left, in red, data_type1 is shown, where each data set (data_set1 & data_set2) of that data type has a different platform annotation. On the right, in blue, data_type2 shows a different structure with only one annotation folder, which contains the annotation for both data sets. In the data upload we have seen so far, aCGH gives a good example of the difference between these setups. In the case where you have multiple annotations this could be due to region annotation, with the different data sets using different regions; in the second case, where you have one annotation file, this could be due to gene-level annotation.
When uploading your own study be sure to pay attention to the file name and the naming in
study.params. Be sure to upload the clinical data before uploading any high dimensional data.
To upload the study you will need to write your own load.sh scripts: bash scripts placed in each folder that contains data that needs to be loaded. The scripts use pushd and popd to adjust the current working directory, which means all of the load scripts are formatted in one of the following ways (example files taken from the CLUC):
The scripts show how the expression data is loaded for the Affymetrix array, including platform annotation. Executing the top load.sh descends through the three scripts to first execute load_annotation.sh before going back up and loading the actual data. More on pushd, popd and stack building can be found here.
set -e makes the script stop as soon as a command fails
set -x prints each command as it is executed, so more information is shown during loading
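Putting the pieces together, a load.sh of the nested form described above might look like this; the folder and script names are illustrative, modeled on the CLUC expression example:

```shell
#!/bin/bash
set -e  # stop as soon as any command fails
set -x  # print each command while it runs

# descend into the annotation folder and load the platform annotation first
pushd annotation
./load_annotation.sh
popd

# back in the data-type folder: load the actual expression data
pushd mRNA_data
./load.sh
popd
```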
Common errors during loading
This probably has something to do with an incorrectly sourced vars file, or with an old kettle.properties file pointing to a different database or having the incorrect port. Check the environment variables by typing env, or get an individual variable with echo $variable_name; this only checks for an incorrect vars file. Then go to ../transmart-data/samples/postgres/kettle-home and check the kettle.properties file for the correct settings, the correct settings being the correct database name and port number.
This is a problem in the database which is fixed by running the following command in psql
This error tells you that the database you are trying to reach is not located where you thought it would be. This could be a problem with the PGHOST or PGPORT variable. To solve this, you need to figure out where the database is located and which port accepts the connection you are trying to establish. As a start you could reconnect to the server and run echo $PGHOST to see if the variable is already defined; if it is, it probably points to the database you are looking for.
Uploading using unified shell-scripts
For the upload described here to work, a working upload environment needs to be present. We installed tranSMART as described on the transmart-foundation wiki. The described "vars" file defines the following environment.
Your "vars" file may differ a little, but the variables mentioned here should be defined, otherwise the upload will fail.
We provide a little shell-script 'check_env.sh', situated in the top-directory of this study, which checks a few things to see if the environment is OK.
The directory structure of the Cell-line study follows a naming convention (italic names) as shown in the next figure. This naming convention should be seen as a proposal, because it is currently based on the datatypes found in tranSMART from a developer's perspective. This is probably not what we want.
In each directory you will find a script "load.sh". This script loads the data in that directory (and the data found in its sub-directories) into tranSMART. So, if you execute the script "load.sh" in the top-directory, all data for this study will be uploaded into tranSMART. This is not recommended until you are sure this works for you. Let's start with doing it one by one.
Another important concept to notice is the "params" files, which you will find in each directory containing data to be uploaded. These files are mandatory and define (possible) variables that influence how the data is uploaded. Also notice the file "study.params" in the top-directory, which contains variables at the study level (available for all datasets).
Go to the directory "clinical" and execute "load.sh".
This is what should happen:
- An R-script is executed, which transforms the data-files into a file which can be uploaded into the landing-zone of tranSMART
- The transformed data is uploaded into the landing-zone (283 lines)
- The stored procedure "i2b2_load_clinical_data" is called (return code should be "1")
Go to the directory "acgh" and execute "load.sh".
This should upload 4 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.
note: for the "gene" datasets to be successfully uploaded, you first have to upload the "annotation" data (annotation data is shared by both datasets)
Go to the directory "expression" and execute "load.sh".
This should upload 2 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.
Go to the directory "Myrna" and execute "load.sh".
It is assumed that the mirna-dictionary is already available in tranSMART. This should have been done at installation time.
Go to the directory "proteomics" and execute "load.sh".
It is assumed that the uniprot-dictionary is already available in tranSMART. This should have been done at installation time.
Go to the directory "rnaseq" and execute "load.sh".
This should upload 7 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.
note: be sure you upload the "annotation" dataset first.
Go to the directory "vcf" and execute "load.sh".
This should upload 8 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.
Note: uploading a vcf dataset twice will result in undefined behaviour, because the "old" dataset is not removed.