
We recommend using transmart-batch going forward instead; see the tutorial: How to load the Cell Line Use Case dataset with transmart-batch.

TranSMART-batch version for this dataset:

https://github.com/tranSMART-Foundation/transmart-batch/blob/master/docs/how_to_load_trait_cluc.md

The TraIT CLUC data can be downloaded here: http://beehub.nl/TraIT-Datateam/Public%20CLUC/trait-cluc-v1.0.zip

Introduction

The CTMM TraIT project recently added the Cell Line Use Case (CLUC) to tranSMART. The CLUC is a collection of data on colorectal and prostate cell lines from an exceptionally broad set of platforms, described in the sections below.

This diverse set is used to:

  • Standardize data formats and data processing pipelines from the four TraIT domains
  • Test the integration pipelines within the TraIT translational toolset

By incorporating the same platforms as used for ongoing research projects, this cell line set gives a representative test set comparable to real patient data, without the legal burden of handling personal data. The TraIT Cell Line Use Case is available under the CC0 license at http://beehub.nl/TraIT-Datateam/Public%20CLUC/Cell-line.tar.gz.

Please use the following citation when making use of this dataset: Bierkens, Mariska & Bijlard, Jochem "The TraIT cell line use case." Manuscript in preparation. More information can also be found on the Bio-IT World Poster "Multi-omics data analysis in tranSMART using the Cell Line Use Case dataset".

 

 

General remarks

Throughout the document you will find <path_to> tags; these are placeholders for the path to where the transmart-data folder is located. For example, if on your computer the transmart-data folder is located in /Users/name/transmart-data, that path should be inserted in place of the tag (so <path_to>/samples --> /Users/name/transmart-data/samples).

Note that the folder structure is very important in the upload process, so make sure to structure your data in the correct way (figure 1). As a general rule, commands that start with "make -C" need to be called from the transmart-data folder; all other commands need to be run from the specific folders where the data is located. For more detailed information about the data type you wish to load, please refer to the section dedicated to that specific data type.

For the tutorial the assumption is made that the data is loaded into a local database with default settings, meaning that the database is located on the same machine that has the data folders and the ETL pipeline scripts. For remote loading or loading to databases without default settings please check the Advanced Loading section.

 


Setting up transmart-data

To upload the Cell-line use case data you need the transmart-data folder, which can be found at https://github.com/tranSMART-Foundation/transmart-data.


transmart-data contains a collection of scripts and the basic folder structure you need to upload the data. After downloading the transmart-data folder from GitHub you need to update the ETL pipeline by running the following command from the transmart-data folder:
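The exact Makefile target differs between transmart-data versions; the target name below is an assumption, so check env/Makefile in your checkout if it is not present:

make -C env/ update_etl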

 

After the transmart-ETL update is done there should now be two folders in /transmart-data/env called tranSMART-ETL and data-integration.

The next step is to configure the vars file in transmart-data. There is a sample file called vars.sample; make a copy of this and name it vars (cp vars.sample vars).
The vars file contains information for both Oracle and Postgres databases; as we are using Postgres, only the following parameters must be set correctly:
 

More information on the variables is given below; a full example vars file follows the descriptions.
 PGHOST

By default set to localhost

 PGPORT

This is the port on which the database can be reached. By default this is set to 5432 for a localhost. When using a remote server, set this to the local port to which your SSH connection forwards the database port.

 PGDATABASE

Database name to which the data will be uploaded, the default name of the database is transmart

 PGUSER

Your username to access the database. Default set to tm_cz
NOTE: this is not the same as a login used to access the data via the web client.

 PGPASSWORD

Password to access the database. Leave empty if there is no password. Default is tm_cz

 PGSQL_BIN

Path to the directory where the Postgres binaries are installed. When installed locally with a package manager the location will probably be /usr/local/bin/. Be sure to end the pathname with a /.

 KETTLE_JOBS_PSQL

Path to where the Kettle scripts are located. If you updated the ETL pipeline as described above, the location will be <path_to>/transmart-data/env/tranSMART-ETL/Postgres/GPL-1.0/Kettle/Kettle-ETL/. Be sure to replace <path_to> with the actual location of the folder.

 KETTLE_JOBS

 Same as KETTLE_JOBS_PSQL. KETTLE_JOBS=$KETTLE_JOBS_PSQL

 KETTLE_HOME

 This is in the transmart-data folder under samples/postgres/kettle-home. KETTLE_HOME=<path_to>/transmart-data/samples/postgres/kettle-home

 KITCHEN

 This is in the transmart-data folder under env/data-integration/kitchen.sh. KITCHEN=<path_to>/transmart-data/env/data-integration/kitchen.sh

 R_JOBS_PSQL

Points to R script used when uploading clinical data. Can be found in the transmart-data folder under: <path_to>/transmart-data/env/tranSMART-ETL/Postgres/GPL-1.0/R 

 PATH (adding loading script location)

 For the load.sh scripts to find the loading scripts the location needs to be added to the path. The loading scripts are in <path_to>/transmart-data/samples/postgres.

PATH=<path_to>/transmart-data/samples/postgres:$PATH

 export

Lastly, the parameters set above need to be exported to the environment:

export PGHOST PGPORT PGDATABASE PGUSER PGPASSWORD PGSQL_BIN \
    KETTLE_JOBS_PSQL KETTLE_JOBS R_JOBS_PSQL KITCHEN KETTLE_HOME PATH
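Putting the variables together, a minimal vars file for a default local Postgres installation could look like the sketch below. The paths use the example location /Users/name/transmart-data and must be adapted to your own <path_to>; the values themselves are the defaults described above.

# vars - example for a default local Postgres setup (copied from vars.sample)
PGHOST=localhost
PGPORT=5432
PGDATABASE=transmart
PGUSER=tm_cz
PGPASSWORD=tm_cz
PGSQL_BIN=/usr/local/bin/
KETTLE_JOBS_PSQL=/Users/name/transmart-data/env/tranSMART-ETL/Postgres/GPL-1.0/Kettle/Kettle-ETL/
KETTLE_JOBS=$KETTLE_JOBS_PSQL
KETTLE_HOME=/Users/name/transmart-data/samples/postgres/kettle-home
KITCHEN=/Users/name/transmart-data/env/data-integration/kitchen.sh
R_JOBS_PSQL=/Users/name/transmart-data/env/tranSMART-ETL/Postgres/GPL-1.0/R
PATH=/Users/name/transmart-data/samples/postgres:$PATH
export PGHOST PGPORT PGDATABASE PGUSER PGPASSWORD PGSQL_BIN \
    KETTLE_JOBS_PSQL KETTLE_JOBS R_JOBS_PSQL KITCHEN KETTLE_HOME PATH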

 

When you are done setting up the vars file, in the transmart-data folder run:
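Presumably this means sourcing the vars file, so that the exported variables become available in your shell (the environment check later on also asks whether you sourced it):

source vars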

 

For Mac OSX:

The loading scripts use the -e option of readlink. This option is not supported by the readlink that ships with Mac OSX; to work around this you need to install greadlink (GNU readlink). The easiest way to do this is with Homebrew, using the following command:
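This installs the GNU coreutils, which include greadlink:

brew install coreutils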

If you do not have Homebrew, install it first from https://brew.sh.

After installing the coreutils, go to ~/transmart-data/samples/postgres and open process_params.inc. On line 20 you will find the call to readlink, params=$(readlink -e "$1").

Change this to params=$(greadlink -e "$1"), save the change and you should be OK.

 

 


Data structure and loading the data

In order to load the data properly the scripts need to know where the data is located; to achieve this the folder structure is more or less fixed. For the data in Cell-line.tar.gz the only thing you have to do is extract the files, and we advise putting them in ../transmart-data/samples/studies. The following figure gives an overview of the data types and the way the folder structure is built up. More details about particular data types can be found in their respective sections. One requirement for the data upload to succeed is to have a logs folder in every folder that contains data to be uploaded; for example, the aCGH/gene/180k folder will contain a folder called logs. All the loading output is stored in log files in the respective data type's logs subdirectory.

Getting the data to the server

If you want to upload the data to a server you first need to get the data onto the server. The easiest way to do this is to open a terminal window and connect to the server:
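For example, with a hypothetical account user on the placeholder host example.org (the same host name used in the Advanced Loading section):

ssh user@example.org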

When the connection is made, open a new terminal window (do not close the window in which you are connected to the server) and navigate to the study you want to copy. From the folder the study is located in, run the following command:
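A recursive copy with scp along these lines achieves this; user, host and the target path are placeholders to adapt to your own setup:

scp -r Cell-line user@example.org:<path_to>/transmart-data/samples/studies/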

 

Loading the data

In the Cell-line folder there are two scripts, check_env.sh and load.sh. Before loading anything, check whether the environment is set up correctly by running:
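A sketch that matches this description (make the scripts executable, then run the environment check from the Cell-line folder; the original commands may differ slightly):

chmod +x check_env.sh load.sh
./check_env.sh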

The chmod command sets the correct permissions for the load scripts to run. 

If the output of the script tells you "Environment to upload data into tranSMART looks OK" you can start uploading the data; if it gives an error, look at the error and check what is missing before you start uploading (did you source the vars file?).
NOTE: The environment check also checks if the kettle.properties file is present. It is important to know that the kettle.properties file is generated when the clinical data is loaded the first time. 

 

When everything is set you can run the following command to load the entire study.
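From the top-level Cell-line folder the study-wide load script is simply:

./load.sh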

 Before you do run this command and load the entire study, read the following:

It is possible to load each data type individually; this might be worth considering, as some data loads may take a long time to complete. Look at the dedicated parts on the data types below to learn more.

Clinical data

To load just the clinical data, run the load.sh script from the clinical folder.

 

In the clinical data folder you need the following files:

 

 clinical.params

Indicates where the column mapping and word mapping files are located.

 Cell-line data file

There are two datafiles. One contains the characteristics data, the other has the non-high throughput data. The names of the files can be arbitrarily chosen as long as they are specified in the column mapping file. The files should be tab-separated files.

 column mapping file

The column mapping file contains 6 columns: filename, category code, column number, data label, data label source and control vocab cd. The filename is the name of a tab-separated data file; the category code indicates the part of the tree shown in tranSMART (subject is a reserved term to indicate the patients). The column number indicates which column should be used from the data file, and the data label indicates the leaf node name (SUBJ_ID is a reserved term to indicate the subjects). The last 2 columns can be empty. Each category code - data label pair should have a unique name, and each unique name should have exactly one column assigned from a data file.

 word mapping file

The word mapping file can be used to transform values in the data file, for example according to a codebook. The file should contain 4 columns: the name of the data file in which the value to be replaced is located, the column number of the concept in the data file, the value to be replaced, and the new value.

 

In the case of the CLUC data there are 2 data files: one contains the characteristics like age, gender, disease and sample ID; the second contains some non-high-throughput molecular profiling data describing gains or losses of selected genes. The column mapping file maps the columns in the data files to the correct tree structure shown in tranSMART; it states, for example, that column 5 in the data file is the age column and should be stored under a variable called Age. A sketch of these files is given below.
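As an illustration, a clinical.params and a fragment of a column mapping file could look like the sketch below. The file names, category codes and column numbers are invented for this example (check the files shipped with the CLUC for the real contents), and the parameter names follow the convention used in the transmart-data sample studies; the last two mapping columns (data label source and control vocab cd) are left empty here.

# clinical.params
COLUMN_MAP_FILE=Cell-line_columns.txt
WORD_MAP_FILE=Cell-line_wordmap.txt

# column mapping file: filename <tab> category code <tab> column number <tab> data label
Cell-line_clinical.txt	Subjects	1	SUBJ_ID
Cell-line_clinical.txt	Subjects	5	Age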




 


Gene expression data

Before data can be loaded into tranSMART, the platform used to generate the data must be loaded. The load.sh scripts are set up to first load annotation/platform data before uploading the actual measured data. The annotation/platform files are located in annotation folders (see image to the right)

Microarray data

mRNA array

Agilent 44K mRNA microarray

Place the data in your local transmart-data folder, in samples/studies/Cell-line/expression.
Go to the directory samples/studies/Cell-line/expression/Agilent 44K mRNA microarray and run the load.sh script there.

 

Affymetrix exon array

Place the data in your local transmart-data folder, in samples/studies/Cell-line/expression.
Go to the directory samples/studies/Cell-line/expression/Affymetrix exon array and run the load.sh script there.

 

miRNA

Next to platform data, miRNA also needs a dictionary before the data can be used in advanced analysis. The dictionary maps the small miRNAs to genes, which allows tranSMART to use the miRNA data in the advanced analysis workflow. With a clean tranSMART installation only the gene dictionary is loaded, so it could be that the miRNA dictionary is not loaded for your instance of tranSMART. To load the dictionary, run the following command from the transmart-data folder:
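The target name below is an assumption (it is the miRNA counterpart of the proteomics dictionary command mentioned later on this page); check the data/postgres Makefile of your transmart-data checkout for the exact target:

make -C data/postgres load_mirna_dictionary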

If the dictionary is already present, running this command will simply report that the dictionary is already loaded.

Agilent miRNA microarray

The annotation/platform information is loaded together with the data. The image on the right shows the file structure for this to work.

To load the miRNA data, run the load.sh script from the miRNA folder.

In the miRNA folder there are 4 files that are needed for the upload to be successful:

 

 mirna.params

The params file specifies, among other things, the names of the data file and the subject sample mapping file; see the example file included with the CLUC data.

 mirna_data.txt

The name of the data file is specified in the params file. The file should be tab-separated, with the first column "id_ref" referring to the annotation and the second to nth columns representing samples. All values should be in quotes, for example:

"ref_id"
"1"
"2"
.........

 

sample 1
"-0.84"
"0"
...sample n
"-0.225"
"0"
 mirna_sample_mapping.txt

An empty file; it only needs to be present for the upload to work.

 mirna_subject_sample_mapping.txt

Name of the subject sample map is specified in the params file. The tab separated file has 10 columns, from left to right "trial_name", "site_id", "subject_id", "sample_cd", "platform", "tissue_type", "attr1", "attr2", "cat_cd", "src_cd". The "platform" column should contain the platform id under which the miRNA annotation was uploaded. The "cat_cd" column contains the path to the data as shown in the folder structure in tranSMART. All fields should be enclosed in quotes.


RNAseq data

For the RNAseq data the folder structure is different from the other data types: the platform annotation files are directly in the rnaseq folder and not nested with the data. This means the annotations should be loaded before the rest of the data can be loaded. If you load both the GA II and HiSeq2000 datasets at the same time, the load script takes care of this; running the load.sh script from ..../Cell-line/rnaseq will load all the RNAseq data for the CLUC dataset.

 

Illumina GA II RNAseq

The RNAseq data contains (sequence) read counts for transcripts, i.e. it is a measure of the relative abundance of transcripts.
Go to the directory .../Cell-line/rnaseq/Illumina GA II RNAseq and run the load.sh script there.

The folder should contain the following three files:

 Cell-line_samples.txt

The file name is not fixed; it is specified in the params file.

The file has 5 tab-separated columns: Platform/GPL_ID, GENE_ID, SAMPLE_ID, readcount and normalized_readcount. An illustrative line is sketched after the file descriptions below.

 Cell-line_subjectmapping.txt

The file name is not fixed; it is specified in the params file.

Tab-separated file containing 10 columns: STUDY_ID, SITE_ID, SUBJECT_ID, SAMPLE_ID, PLATFORM, TISSUETYPE, ATTR1, ATTR2, CATEGORY_CD and SOURCE_CD. For more information, check the example file included with the study.

 rnaseq.params

Name is predetermined, contains the names of the data file and mapping file and optional settings for the source_cd.
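For illustration only, a data line with the five columns described above could look like the following; the platform id, gene, sample and counts are invented values, and the platform id must match the id under which the RNAseq annotation was loaded:

GPL_RNASEQ_EXAMPLE	TP53	VCaP	1024	6.3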

Illumina HiSeq2000 RNAseq

Similar to GA II RNAseq but with more subfolders; each subfolder contains one sample. To upload them all, run the load.sh script from .../Cell-line/rnaseq/Illumina HiSeq2000 RNAseq. Each sample subfolder has the same three files as described above.

 

 


 

Copy Number Variation ('aCGH') data 

For the array CGH data there are 2 different arrays available, 180k and 224k Agilent microarrays. The data has been processed at gene and region level, generating a total of 4 datasets to load. The figure on the right shows the folder structure of the data. The platform annotation at gene level is the same for both the 180k and 224k arrays, which is reflected in the folder structure. The platform annotation needs to be loaded before the actual data, so loading the gene level data by itself requires extra attention.

All of the data can be loaded with one command by running the load.sh script from the acgh folder; if you want to load only part of the data, navigate to the proper subfolder before running it.

All of the folders containing data should have the following three files:

 Cell-line_samples.txt

The file name is not fixed; it is specified in the params file.

The file structure: the first column contains either a gene id or a region id, and each following group of 7 columns describes one sample. From left to right these columns are: sample.chip, sample.segm, sample.flag, sample.loss, sample.norm, sample.gain, sample.amp. The columns are separated by tabs. For more information, check the example files included with the study.

 Cell-line_subjectmapping.txt

The file name is not fixed; it is specified in the params file.

Tab-separated file containing 10 columns: STUDY_ID, SITE_ID, SUBJECT_ID, SAMPLE_ID, PLATFORM, TISSUETYPE, ATTR1, ATTR2, CATEGORY_CD and SOURCE_CD. For more information, check the example file included with the study.

 acgh.params

Name is predetermined, contains the names of the data file and mapping file and optional settings for the source_cd.


 

Small Genomic Variants ('VCF')

There are a total of 8 datasets available, obtained from 3 different platforms. To load all of them at once, simply go into the vcf directory and run the load.sh script.

As these datasets are quite large compared to the other data types loading may take some time. 
Note: uploading a vcf dataset twice will result in undefined behaviour, because the "old" dataset is not removed. 

Complete Genomics DNAseq and Illumina GA II RNAseq both have one VCF file to load, while Illumina HiSeq2000 RNAseq contains the remaining 6. Each folder with a VCF file should have the following three files:

 Cell-line.vcf

The actual VCF file. The VCF files for the Cell-line use case are annotated with HGNC gene symbols and Ensembl Gene Identifiers. For more information about the VCF file format, see the VCF specification.

 Subject_sample_mapping

 The subject sample mapping file maps the actual sample names to the sample IDs given in the VCF file. For example VCaP is given the ID GS000008107-ASM in the VCF file.

 vcf.params

Specifies the VCF file to upload, the subject sample map to use, the genome build used to process the samples, and the concept path shown in the tranSMART tree. See the example file included with the study for a more detailed explanation.

 

 

 


 

Proteomics

Next to platform data, proteomics also needs a dictionary before the data can be used in advanced analysis. The dictionary maps the proteins to genes, which allows tranSMART to use the proteins in the advanced analysis workflow. With a clean tranSMART installation only the gene dictionary is loaded, so it could be that the protein dictionary is not loaded for your instance of tranSMART. To load the dictionary, run the following command from the transmart-data folder:
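This is the same command referred to for the uniprot dictionary in the unified shell-scripts section below:

make -C data/postgres/ load_proteomics_dictionary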

 

 

LC-MS/MS

Protein quantities

In the proteomics folder you will find an annotation folder, the data files, mapping files and a parameter file. To load the proteomics data and its annotation, run the load.sh script from the proteomics folder.

In the proteomics folder there are 4 files that are needed for the upload to be successful:

 proteomics.params

Name is predetermined, contains the names of the data file and mapping file and optional settings for the source_cd.

 proteomics_data.txt

Tab separated file containing the data.

The first column in the file should correspond to the platform probe IDs; the rest of the columns can be anything from raw measures to fasta headers. Just make sure the headers have clear names, as these will be used in the subject sample mapping.

 proteomics_columns_mapping.txt

Indicates which columns should be used from proteomics_data.txt

Note that the column counting starts at 0, so the actual columns taken from the data file are 28 to 43 (see the example file included with the study).

 proteomics_subject_sample_mapping

Maps samples to subjects and indicates the location in the tree where the data should appear. The file should include a header line with the following 10 columns: trial_name, site_id, subject_id, sample_cd, platform, tissue_type, attr1, attr2, cat_cd and src_cd.

See the example file included with the study.

 


 

Advanced loading 

When the Postgres database is set up with the default values the tutorial works fine, but loading data into a database with a different name, or into a database on a different port, requires some more insight into how the data loading works. As shown above, the data loading process needs the following files and settings to be in place:

  • correct vars file to set the correct environment variables
  • kettle.properties - this file is generated based on the vars file the first time clinical data is loaded.
  • File structure - Study name is derived from this, and for the tutorial the file structure is set. Adding new data types or uploading your own data is not bound by the structure.
  • transmart-ETL & data-integration folders

 

 How to load data to multiple databases (including remote hosts)

As mentioned, in order for the data upload to work you need a correct vars file, and the kettle.properties file is generated for you the first time you load clinical data. While this is true for the very first time you load data into a database using a freshly set up transmart-data, loading data into additional databases with slightly different settings for database name or port will fail if you do not update the kettle.properties file properly.

Updating the vars file to handle the new database and sourcing it does not change anything in the kettle.properties file, the file that is used by KETTLE to retrieve database name and port number. So for example your first data load is to a database called transmart which runs on port 5432, if you now want to load to an additional local database called transmart_test that also runs on port 5432 and you only adjust the vars file all the data will still be loaded into the first transmart database.

The easiest way to adjust kettle.properties is to delete it and let the clinical upload to the second database generate a new one, but this means that switching databases to upload additional data requires you to reload the clinical data on every database switch. To overcome this you can also manually change the database name and port number in kettle.properties to the required settings.
 

Uploading to a remote host:
Important to note here is that you can upload data in two different ways: to a local installation or to a remote installation. Loading to a local installation is what was shown above; loading to a remote installation requires you to connect to the remote host using SSH and, depending on what you have loaded already, might require some additional setup. Loading remotely requires you to set up an SSH tunnel to the server where the database is located by running:

ssh user@example.org -L 25432:example.org:5432

This means that in your local vars file and kettle.properties you need to fill in 25432. This port is forwarded to the database port, 5432.

After setting up this ssh, you can open a new terminal window (remember to source the vars file) and load the data on the remote server.

 Adding new data/your own data using load.sh

Up to here, the loading of the CLUC data served as an example for your own data uploads. The loading process should have given you an idea of the folder structure that is used to upload data, the requirements, and the order in which the data can be loaded. Loading new data or your own data can be done by closely following the example upload files provided for the CLUC. The figure shows a generic setup for the data types shown above, with clinical being the only one that has no annotation files. On the left, in red, data_type1 is shown where each dataset (data_set1 & data_set2) of that data type has a different platform annotation. On the right, in blue, data_type2 shows a different structure with only one annotation folder, which contains the annotation for both datasets. In the data uploads we have seen so far, aCGH is a good example of the difference between these setups: where you have multiple annotations this could be region-level annotation, with the different datasets using different regions; where you have one annotation folder this could be gene-level annotation.

When uploading your own study be sure to pay attention to the file name and the naming in study.params. Be sure to upload the clinical data before uploading any high dimensional data.

 

To upload the study you will need to generate your own load.sh scripts; these are bash scripts placed in each folder that contains data that needs to be loaded. The scripts use pushd and popd to adjust the current working directory, which means all of the load scripts are formatted in one of the following ways (example files taken from the CLUC):

scripts in tree nodes (no data to load)
scripts in tree nodes (data to load)
Scripts in leaf folders

The scripts show how the expression data is loaded for the Affymetrix array, including the platform annotation. Executing the top load.sh walks down through the scripts to first execute load_annotation.sh before going back up and loading the actual data. More on pushd, popd and the directory stack can be found in the bash documentation. A sketch of such a script is given below.
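A minimal sketch of a tree-node load.sh, assuming subfolders named annotation and data_set1 that each contain their own load script (the folder names are placeholders, not the exact CLUC layout):

#!/bin/bash
set -e   # abort immediately if any command fails
set -x   # print each command before it is executed

# load the shared platform annotation first
pushd annotation
./load_annotation.sh
popd

# then load the actual dataset
pushd data_set1
./load.sh
popd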

 

set -e makes the script exit immediately if a command fails.

set -x makes the script print every command before it is executed, so more information is shown during loading.

Common errors during loading

 No error message but the data is not in the database

This probably has something to do with an incorrectly sourced vars file, or with an old kettle.properties file pointing to a different database or containing the incorrect port. Check the environment variables by typing env, or get an individual variable with echo $variable_name; this only checks for an incorrect vars file. Then go to ../transmart-data/samples/postgres/kettle-home and check the kettle.properties file for the correct settings, i.e. the correct database name and port number.

 Error: operator does not exist: character varying >integer

This is a problem in the database which is fixed by running the following command in psql

 error: are you sure.... /tmp/.s.localhost:5432

 This error tells you that the database you are trying to reach is not located where you thought it would be. This could be a problem with the PGHOST or PGPORT variable. To solve this problem you need to figure out what the location of the database is and which port accepts the connection you are trying to establish. As a start you could try to reconnect to the server and start with echo $PGHOST and see if the variable is already defined. If it is this probably points to the database you are looking for.

 

 

Uploading using unified shell-scripts


For the upload described here to work we need a working upload environment to be present. We installed tranSMART as described on the tranSMART Foundation wiki. The vars file described there defines the environment assumed below.

You "vars" file may differ a little, but the variables mentioned here should be defined, else the upload will fail.
We provide a little shell-script 'check_env.sh', situated in the top-directory of this study, which checks a few things to see if the environment is OK. 

The directory structure of the Cell-line study follows a naming convention (italic names) as shown in the next figure. This naming convention must be seen as a proposal, because it is currently based on the data types found in tranSMART seen from a developer's perspective; this is probably not what we want.

 

[Figure: directory structure overview of the Cell-line study]

In each directory you will find a script "load.sh". This script loads the data in that directory (and the data found in the sub-directories) into tranSMART. So, if you execute the script "load.sh" in the top-directory, all data for this study will be uploaded into tranSMART. This is not recommended until you are sure this works for you; let's start with doing it one by one.
Another important concept to notice are the "params" files, which you find in each directory containing data to be uploaded. These files are mandatory and define (possible) variables that influence how the data is uploaded. Also notice the file "study.params" in the top-directory, which contains variables at study level (available for all datasets).

clinical

Goto directory "clinical" and execute "load.sh".

This is what should happen:

  • An R-script is executed, which transforms the data-files into a file which can be uploaded into the landing-zone of tranSMART 
  • The transformed data is uploaded into the landing-zone (283 lines)
  • The stored procedure "i2b2_load_clinical_data" is called (return code should be "1")

acgh

Go to directory "acgh" and execute "load.sh".

This should upload 4 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.
note: for the "gene" datasets to be successfully uploaded, you first have to upload the "annotation" data (annotation data is shared by both datasets) 

expression

Goto directory "expression" and execute "load.sh"

This should upload 2 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.

mirna

Goto directory "Myrna" and execute "load.sh"

It is assumed that the mirna-dictionary is already available in tranSMART. This should have been done during installation time (with the miRNA equivalent of make -C data/postgres/ load_proteomics_dictionary; check the data/postgres Makefile for the exact target).

Proteomics

Goto directory "proteomics" and execute "load.sh"

It is assumed that the uniprot-dictionary is already available in tranSMART. This should have been done during installation time (make -C data/postgres/ load_proteomics_dictionary)

rnaseq

Goto directory "rnaseq" and execute "load.sh"

This should upload 7 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.
note: be sure you upload the "annotation" dataset first.

vcf

Goto directory "vcf" and execute "load.sh" 

This should upload 8 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.
Note: uploading a vcf dataset twice will result in undefined behaviour, because the "old" dataset is not removed. 

 
