
Introduction

The CTMM TraIT project recently added the Cell Line Use Case (CLUC) to tranSMART. The CLUC is a collection of data on colorectal and prostate cell lines from an exceptionally broad set of platforms, as shown in the table below.

This diverse set is used to:

  • Standardize data formats and data processing pipelines from the four domains
  • Test the integration pipelines within the TraIT translational toolset

By incorporating the same platforms as used for ongoing research projects, this cell line set gives a representative test set comparable to real patient data, without the legal burden of handling personal data. The TraIT Cell Line Use Case is available under the CC0 license at ## TO DO: Insert dataset location ##.

Please use the following citation when making use of this dataset: Bierkens, Mariska & Bijlard, Jochem "The TraIT cell line use case." Manuscript in preparation. More information can also be found on the Bio-IT World Poster "Multi-omics data analysis in tranSMART using the Cell Line Use Case dataset".

 


General remarks

Note that the folder structure is very important in the upload process; make sure to structure your data in the correct way (figure 1). For more detailed information about the data type you wish to load, please refer to the section dedicated to that specific data type.

It is important to set up the batchdb.properties file to provide transmart-batch with the location and login information needed to load the data. A detailed explanation of the properties file can be found here.

This tutorial assumes that the data is loaded into a local database with default settings, meaning that the database is located on the same machine that holds the data folders and the ETL pipeline scripts.
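Before starting an upload you may want to check that this local database is actually reachable. A minimal sketch of such a check with psql, assuming the default tm_cz user and transmart database used elsewhere in this tutorial:

psql -h localhost -p 5432 -U tm_cz -d transmart -c 'SELECT 1;'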

Important note: As transmart-batch currently does not have a pipeline for VCF data, this data type will have to be loaded with Kettle.


Setting up transmart-batch and general documentation

For the complete documentation on transmart-batch please look here.

To use transmart-batch with 16.1 or 16.2 you can use the V1.0 release. To use the latest version please clone the git repository and build transmart-batch:

git clone https://github.com/thehyve/transmart-batch.git
cd transmart-batch
./gradlew capsule

After building you should see transmart-batch/build/lib/transmart-batch-1.1-SNAPSHOT-capsule.jar 

Batchdb.properties file

The properties file contains information such as the location of the database and the username and password that are used to upload the data to the database. The file is built up of four lines indicating which database is being used (either PostgreSQL or Oracle), the location of the database, and the user.

Example properties file (postgres)
    batch.jdbc.driver=org.postgresql.Driver
    batch.jdbc.url=jdbc:postgresql://localhost:5432/transmart
    batch.jdbc.user=tm_cz
    batch.jdbc.password=tm_cz
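For an Oracle database the same four lines are used. A minimal sketch could look like the following; the host, port and SID in the URL are placeholders that depend on your installation:

Example properties file (oracle)
    batch.jdbc.driver=oracle.jdbc.driver.OracleDriver
    batch.jdbc.url=jdbc:oracle:thin:@localhost:1521:ORCL
    batch.jdbc.user=tm_cz
    batch.jdbc.password=tm_cz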

 

 


Data structure and loading the data

In order to load the data properly, the scripts need to know where the data is located; to achieve this, the data structure is more or less fixed. With the data ##TO DO: add data link ## the only thing you have to do is extract the files and you are ready to load. The following figure gives an overview of the data types and the way the folder structure is built up. More details about particular data types can be found in their respective sections.

Cell line use case folder structure


Getting the data to the server

If you want to upload the data to a server, you first need to get the data onto that server. The easiest way to do this is by opening a terminal window and connecting to the server:

 ssh username@serverAddress

When the connection is made, open a new terminal window (do not close the window where you connected to the server) and navigate to the folder that contains the study you want to copy. From that folder, run the following command:

 scp -r study_name username@serverAddress:~
The path after the colon is the folder on the server where the data will be placed; ~ (your home folder) is the default.

 


Loading the data

To load the data transmart-batch needs three files.

  1. batchdb.properties file
  2. study.params
  3. the params file for the data to be loaded; this can be the params file for the data type or for the annotation platform
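The study.params file sits in the top-level study folder and applies to all data types of the study. A minimal sketch of its contents, using a hypothetical study identifier and the usual study-level parameters (STUDY_ID, SECURITY_REQUIRED and TOP_NODE), could look like this; as noted in the Advanced loading section below, the study name can also be derived from the folder structure:

STUDY_ID=CLUC
SECURITY_REQUIRED=N
TOP_NODE=\Public Studies\CLUC\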

### From here on: ###

  • Different data types
  • VCF files last
    • data loading, pointers to the tmp file
    • Point out that the study.params is different
    • Once loaded, provide the SQL statement to release the data
  • Tags, including the VCF SQL statement that is needed.

 

Clinical data

To load just the clinical data, run:

<path_to>/transmart-batch/build/lib/transmart-batch-1.1-SNAPSHOT-capsule.jar -c <path_to>/batchdb.properties -p <path_to>/clinical.params

If you are reloading data, add the -n flag; this forces transmart-batch to restart an already completed job.
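For example, reloading the clinical data with the command above would then look like this (same paths, only the -n flag added):

<path_to>/transmart-batch/build/lib/transmart-batch-1.1-SNAPSHOT-capsule.jar -c <path_to>/batchdb.properties -p <path_to>/clinical.params -n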

 

In the clinical data folder you need the following files:

 

The clinical.params file indicates where the column mapping and word mapping files are located.

# Mandatory
COLUMN_MAP_FILE=Cell-line_columns.txt
#Optional
WORD_MAP_FILE=Cell-line_wordmap.txt

There are three datafiles. The first contains the characteristics data, the second has the non-high throughput molecular profiling data (NHTMP) and the last was added to support EGA IDs. The names of the files can be arbitrarily chosen as long as they are specified in the column mapping file. The files should be tab-separated files.

The column mapping file contains 7 columns: filename, category code, column number, data label, data label source, control vocab cd and concept type. The filename is the name of a tab-separated data file. The category code is used to indicate the part of the tree shown in tranSMART ("subject" is a reserved term to indicate the patients). The column number indicates which column should be used from the data file, and the data label indicates the leaf node name (SUBJ_ID is a reserved term to indicate the subjects). The data label source and control vocab cd columns can be empty. The last column, concept type, is an optional column used in transmart-batch to indicate either NUMERICAL or CATEGORICAL data values. Each category code - data label pair should have a unique name, and each unique name should have exactly one column assigned from a data file.
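As an illustration, a few lines of a column mapping file in this layout could look as follows. The file name matches the word mapping example below, but the category codes and header row are only a sketch, not the actual CLUC mapping:

Filename	Category Code	Column Number	Data Label	Data Label Source	Control Vocab Cd	Concept Type
Cell-line_data.txt	Subjects	1	SUBJ_ID			
Cell-line_data.txt	Characteristics	5	Age			NUMERICAL
Cell-line_data.txt	Characteristics	3	Gender			CATEGORICAL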

The word mapping file can be used to transform values in the data file, for example according to a codebook. The file should contain 4 columns: the name of the data file in which the value to be replaced is located, the column number of the concept in the data file, the value to be replaced, and the new value.

FILENAME	COLUMN_NUMBER	FROM	TO
Cell-line_data.txt	3	1	Male
Cell-line_data.txt	4	Yes	1

 

In the case of the CLUC data there are 3 data files: the first contains the characteristics data, the second contains some non-high throughput molecular profiling (NHTMP) data describing gains or losses of selected genes, and the last was added to support EGA IDs. The column mapping file maps the columns in the data folder to the correct tree structure shown in tranSMART; it tells, for example, that column 5 in the data file is the age column and should be stored under a variable called Age.

 


 

Gene expression data

Before data can be loaded into tranSMART, the platform used to generate the data must be loaded. The annotation/platform files are located in annotation folders (see image to the right) and have their own params files to load the annotation data.

Microarray data

mRNA array

Annotation data:

<path_to>/transmart-batch/build/lib/transmart-batch-1.1-SNAPSHOT-capsule.jar -c <path_to>/batchdb.properties -p <path_to>/mrna_annotation.params

 

Measured data

<path_to>/transmart-batch/build/lib/transmart-batch-1.1-SNAPSHOT-capsule.jar -c <path_to>/batchdb.properties -p <path_to>/expression.params

 

miRNA

Requires transmart-data and kettle

## setup kettle var, point to end of page
Next to platform data, miRNA also needs a dictionary before the data can be used in advanced analysis. The dictionary maps the small miRNA fragments to genes, which allows tranSMART to use the miRNA data in the advanced analysis workflow. With a clean tranSMART installation only the gene dictionary is loaded, so the miRNA dictionary may not be loaded for your instance of tranSMART. To load the dictionary, run the following command from the transmart-data/ folder:

 

make -C data/postgres/ load_mirna_dictionary

If the dictionary is already present, running this command will simply report that the dictionary is already loaded.

Agilent miRNA microarray

The annotation/platform information is loaded together with the data. The image on the right shows the file structure for this to work.

miRNA microarray annotation and data

<path_to>/transmart-batch/build/lib/transmart-batch-1.1-SNAPSHOT-capsule.jar -c <path_to>/batchdb.properties -p <path_to>/mirna_annotation.params
 
<path_to>/transmart-batch/build/lib/transmart-batch-1.1-SNAPSHOT-capsule.jar -c <path_to>/batchdb.properties -p <path_to>/mirna.params

In the miRNA folder there are 4 files that are needed for the upload to be successful:

 

Example file:

# Mandatory
DATA_FILE_PREFIX=mirna_data
MAP_FILENAME=mirna_subject_sample_mapping.txt
SAMPLE_MAP_FILENAME=mirna_sample_mapping.txt
MIRNA_TYPE=MIRNA_QPCR
INC_LOAD=N
DATA_TYPE=R
# Optional

The name of the data file is specified in the params file. The file should be a tab-separated file with "id_ref" in the first column, which refers to the annotation, and the second to nth columns representing samples. All the values should be in quotes:

"id_ref"    "sample 1"    ...    "sample n"
"1"         "-0.84"       ...    "-0.225"
"2"         "0"           ...    "0"
.........

An empty file; it needs to be present for the upload to work.

The name of the subject sample map is specified in the params file. The tab-separated file has 10 columns, from left to right: "trial_name", "site_id", "subject_id", "sample_cd", "platform", "tissue_type", "attr1", "attr2", "cat_cd", "src_cd". The "platform" column should contain the platform id under which the miRNA annotation was uploaded. The "cat_cd" column contains the path to the data as shown in the folder structure in tranSMART. All fields should be enclosed in quotes.
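A hypothetical line of such a subject sample mapping file could look like the example below; all values, including the platform id and the cat_cd path, are made up for illustration:

"trial_name"	"site_id"	"subject_id"	"sample_cd"	"platform"	"tissue_type"	"attr1"	"attr2"	"cat_cd"	"src_cd"
"CLUC"	""	"VCaP"	"VCaP_mirna"	"MIRNA_ANNOT"	"Prostate"	""	""	"Molecular profiling+miRNA"	"STD"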

 

 

 

 

 

 


 

RNAseq data

For the RNAseq data the folder structure is different from the other data types: the platform annotation files are directly in the rnaseq folder and not nested with the data. This means the annotations should be loaded before the rest of the data. If you load both the GA II and HiSeq2000 datasets at the same time, the load script takes care of this. Running the following command from ..../Cell-line/rnaseq will load all the RNAseq data for the CLUC dataset.

 bash load.sh

 

Illumina GA II RNAseq

The RNAseq data contains (sequence) read counts for transcripts, i.e. it is a measurement of the relative abundance of transcripts.
Go to the directory .../Cell-line/rnaseq/Illumina GA II RNAseq and run the load command.

The folder should contain the following three files:

There is no strict file name; the name is specified in the params file.

The file has 5 tab-separated columns: Platform/GPL_ID, GENE_ID, SAMPLE_ID, readcount, normalized_readcount.
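For illustration, a couple of lines in this format could look like the following; the platform id, gene ids and counts are made up, and whether the gene id is a symbol or a numeric identifier depends on the annotation platform you loaded:

<platform_id>	TP53	VCaP	523	4.87
<platform_id>	PTEN	VCaP	102	1.02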

There is no strict file name; the name is specified in the params file.

Tab separated file containing 10 columns, STUDY_ID, SITE_ID, SUBJECT_ID, SAMPLE_ID, PLATFORM, TISSUETYPE, ATTR1, ATTR2, CATEGORY_CD and SOURCE_CD. For more information please follow this link or check the example file.

The name is predetermined; the file contains the names of the data file and mapping file and optional settings for the source_cd.

# Mandatory
RNASEQ_DATA_FILE=Cell-line-data.txt
SUBJECT_SAMPLE_MAPPING=Cell-line-subject_sample_mapping.txt
# Optional
SOURCE_CD=RNASEQGAIIMRNA

Illumina HiSeq2000 RNAseq

Similar to GA II RNAseq but with more subfolders; each subfolder contains 1 sample. To upload them all, run the load command from .../Cell-line/rnaseq/Illumina HiSeq2000 RNAseq. Each sample subfolder has the same three files as displayed above.

 

 


 

Copy Number Variation ('aCGH') data 

For the array CGH, data is available from 2 different arrays: 180k and 224k Agilent microarrays. The data has been processed on gene and region level, generating a total of 4 data sets to load. The figure on the right shows the folder structure of the data. The platform annotation on gene level is the same for both the 180k and 224k arrays, which is reflected in the folder structure. The platform annotation needs to be loaded before the actual data, so loading the gene level data by itself requires extra attention.

All of the data can be loaded with one command; if you want to load only part of the data, navigate to the proper folder before executing the command.

 bash load.sh

All of the folders containing data should have the following three files:

There is no strict file name; the name is specified in the params file.

The file is built up as follows: the first column contains either the gene id or region id, and each set of 7 columns after this first column describes one sample. From left to right these columns are: sample.chip, sample.segm, sample.flag, sample.loss, sample.norm, sample.gain, sample.amp. The columns are separated by tabs. For more information please follow this link.
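To make this layout concrete, the header and first data line of such a file could look like this; the region id, sample name and values are purely illustrative:

region_id	CellLine1.chip	CellLine1.segm	CellLine1.flag	CellLine1.loss	CellLine1.norm	CellLine1.gain	CellLine1.amp
chr1:1-2000000	0.35	0.40	1	0.01	0.19	0.70	0.10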

There is no strict file name; the name is specified in the params file.

Tab separated file containing 10 columns, STUDY_ID, SITE_ID, SUBJECT_ID, SAMPLE_ID, PLATFORM, TISSUETYPE, ATTR1, ATTR2, CATEGORY_CD and SOURCE_CD. For more information please follow this link or check the example file.

The name is predetermined; the file contains the names of the data file and mapping file and optional settings for the source_cd.

# Mandatory
DATA_FILE_PREFIX=Cell-line_samples.txt
MAP_FILENAME=Cell-line_subjectmapping.txt
# Optional
SOURCE_CD=STD2


 

Small Genomic Variants ('VCF')

There are a total of 8 datasets available, obtained from 3 different platforms. To load all of them at once, simply go into the vcf directory and run

bash load.sh

As these datasets are quite large compared to the other data types, loading may take some time.
Note: uploading a vcf dataset twice will result in undefined behaviour, because the "old" dataset is not removed. 

Complete Genomics DNAseq and Illumina GAII RNAseq each have one vcf file to load, while Illumina HiSeq2000 RNAseq contains the remaining 6. Each folder with a vcf file should have the following three files:

The actual VCF file. The VCF files for the Cell-line use case are annotated with HGNC gene symbols and Ensembl Gene Identifiers. For more information about the VCF file format please follow this link.

 The subject sample mapping file maps the actual sample names to the sample IDs given in the VCF file. For example VCaP is given the ID GS000008107-ASM in the VCF file.

Specifies the VCF file to upload, the subject sample map to use, genome build used to process the samples and builds the concept path shown in the tranSMART tree. Click here to see an example file with more detailed explanation.

 

 

 


 

Proteomics

Next to platform data, Proteomics also needs a dictionary before the data can be used in advanced analysis. The dictionary maps the proteins to genes, which allows tranSMART to use the proteins in the advanced analysis workflow. With a clean tranSMART installation only the gene dictionary is loaded, so the protein dictionary may not be loaded for your instance of tranSMART. To load the dictionary, run the following command from the transmart-data/ folder:

 

make -C data/postgres/ load_proteomics_dictionary

 

LC-MS/MS

Protein quantities

In the proteomics folder you will find an annotation folder, the data files, mapping files and a parameter file. To load the proteomics data and its annotation run:

bash load.sh

In the proteomics folder there are 4 files that are needed for the upload to be successful:

The name is predetermined; the file contains the names of the data file and mapping files and optional settings for the source_cd.

# Mandatory
MAP_FILENAME=proteomics_subject_sample_mapping.txt
COLUMN_MAPPING_FILE=proteomics_columns_mapping.txt
DATA_FILE_PREFIX=proteomics_data.txt
INC_LOAD=N
# Optional
DATA_TYPE=R
#LOG_BASE=2
#SOURCE_CD=STD

Tab-separated file containing the data.

The first column in the file should correspond to the platform probe IDs; the rest of the columns can be anything from raw measures to fasta headers. Just make sure the headers have clear names, as these will be used in the subject sample mapping.
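A few illustrative lines of such a data file, with hypothetical probe ids, one extra description column and two sample columns:

probe_id	description	sample_A	sample_B
PROBE_0001	example protein 1	1523.4	987.2
PROBE_0002	example protein 2	310.7	422.9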

Indicates which columns should be used from proteomics_data.txt.

Note that the column counting starts at 0, so in the example below the actual columns taken from the data file are columns 28 to 43 (the header names in the first line appear to be arbitrary placeholders):

qewr    kjg     scd
proteomics_data.txt     27      42

Maps samples to subjects and indicates the location in the tree where the data should go. The file should include a header line with the following 10 columns: trial_name, site_id, subject_id, sample_cd, platform, tissue_type, attr1, attr2, cat_cd and src_cd.

See example file here.

 


 

Advanced loading 

When the Postgres database is set up with the default values the tutorial works fine, but when you try to load data into a database with a different name, or into a database on a different port, this requires some more insight into how the data loading works. As shown above, the data loading process needs the following files and settings to be in place:

  • correct vars file to set the correct environment variables
  • kettle.properties - this file is generated based on the vars file the first time clinical data is loaded.
  • File structure - Study name is derived from this, and for the tutorial the file structure is set. Adding new data types or uploading your own data is not bound by the structure.
  • transmart-ETL & data-integration folders

 

As mentioned, in order for the data upload to work you need a correct vars file, and the kettle.properties file is generated for you the first time you load clinical data. While this is true for the very first time you load data into a database using the freshly set up transmart-data, loading data into additional databases with slightly different settings for database name or port will result in failed loads if you do not update the kettle.properties file properly.

Updating the vars file to handle the new database and sourcing it does not change anything in the kettle.properties file, which is the file used by KETTLE to retrieve the database name and port number. For example, suppose your first data load is to a database called transmart which runs on port 5432. If you now want to load to an additional local database called transmart_test that also runs on port 5432 and you only adjust the vars file, all the data will still be loaded into the first transmart database.

The easiest way to adjust kettle.properties is to delete it and let the clinical upload to the second database generate a new one, but this means that switching databases to upload additional data requires you to reload the clinical data on every database switch. To overcome this problem you can manually change the database name and port number to the required settings.
 

Uploading to a remote host:
Important to note here is that you can upload data in two different ways: to a local installation or to a remote installation. Loading to a local installation is what was shown above; loading to a remote installation requires you to connect to the remote host using ssh and, depending on what you have already loaded, might require some additional setup. Loading remotely requires you to set up an SSH tunnel to the server where the database is located by running:

ssh username@server_address -L localport:server_address:database_port

ssh user@example.org -L 25432:example.org:5432

This means that in your local vars file and kettle.properties you need to fill in 25432; this port is forwarded to the database port, 5432.
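With the example tunnel above in place, the database-related lines of the vars file point at the local end of the tunnel, for instance (kettle.properties needs the same port; the remaining variables stay as in the vars file shown further down):

PGHOST=localhost
PGPORT=25432
PGDATABASE=transmart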

After setting up this SSH tunnel, you can open a new terminal window (remember to source the vars file) and load the data to the remote server.

Up to here, the loading of the CLUC data served as an example for your own data uploads. The loading process should have given you an idea of the folder structure that is used to upload data, the requirements, and the order in which the data can be loaded. Loading new data or your own data can be done by closely looking at the example upload files provided for the CLUC.

The figure shows a generic setup for the data types shown above, with clinical being the only one that has no annotation files. On the left, in red, data_type1 is shown, where each data set (data_set1 & data_set2) of that data type has a different platform annotation. On the right, in blue, data_type2 shows a different structure with only one annotation folder that contains the annotation for both data sets. The data upload we have seen so far, How to load the Cell Line Use Case dataset with transmart-batch, shows a good example of the difference between these setups. Multiple annotations could, for example, occur with region annotation where the different data sets use different regions; a single annotation file could occur with gene level annotation.

When uploading your own study be sure to pay attention to the file name and the naming in study.params. Be sure to upload the clinical data before uploading any high dimensional data.

 

To upload the study you will need to generate your own load.sh scripts. These are bash scripts located in each folder that contains data to be loaded. The scripts use pushd and popd to adjust the current working directory, which means all of the load scripts are formatted in one of the following ways (example files taken from the CLUC):

scripts in tree nodes (no data to load)

#!/bin/bash
# script taken from the expression folder
set -x

pushd "Affymetrix exon array"      ; ./load.sh; popd
pushd "Agilent 44K mRNA microarray"; ./load.sh; popd

scripts in tree nodes (data to load)

#!/bin/bash
# script taken from the Affymetrix exon array folder
set -e

pushd annotation      ; ./load.sh; popd
./load_expression.sh

Scripts in leaf folders

#!/bin/bash
# script taken from Affymetrix exon array/annotation
set -e

./load_annotation.sh

The scripts show how the expression data is loaded for the Affymetrix array, including platform annotation. Executing the top load.sh goes down the 3 scripts to first execute load_annotation.sh before going back up and loading the actual data. More on pushd, popd and directory stacks can be found here.

 

set -e makes the script exit immediately when a command fails, so a failed upload step does not go unnoticed

set -x prints each command before it is executed, so more information is shown during loading

Common errors during loading

This probably has something to do with an incorrectly sourced vars file or an old kettle.properties file pointing to a different database or having the incorrect port. Check the environment variables by typing env, or get the individual variables with echo $variable_name; this only checks for an incorrect vars file. Go to ../transmart-data/samples/postgres/kettle-home and check the kettle.properties file for the correct settings, meaning the correct database name and port number.

This is a problem in the database which is fixed by running the following command in psql

CREATE CAST(VARCHAR AS NUMERIC) WITH INOUT AS IMPLICIT;

This error tells you that the database you are trying to reach is not located where you thought it would be. This could be a problem with the PGHOST or PGPORT variable. To solve this problem you need to figure out what the location of the database is and which port accepts the connection you are trying to establish. As a start you could reconnect to the server and run echo $PGHOST to see if the variable is already defined. If it is, it probably points to the database you are looking for.

 

 

Uploading using unified shell-scripts

For the upload described here to work, a working upload environment needs to be present. We installed tranSMART as described on the transmart-foundation wiki. The "vars" file described there defines the following environment:

PGHOST=localhost
PGPORT=5432
PGDATABASE=transmart
PGUSER=tm_cz
PGPASSWORD=?????
PGSQL_BIN="/usr/bin/"
KETTLE_JOBS_PSQL=/opt/transmart-data/env/tranSMART-ETL/Postgres/GPL-1.0/Kettle/Kettle-ETL/
KETTLE_JOBS=$KETTLE_JOBS_PSQL
R_JOBS_PSQL=/opt/transmart-data/env/tranSMART-ETL/Postgres/GPL-1.0/R
KITCHEN=/opt/transmart-data/env/data-integration/kitchen.sh
KETTLE_HOME=/opt/transmart-data/samples/postgres/kettle-home
PATH=/opt/transmart-data/samples/postgres:/opt/R/bin:$PATH
export PGHOST PGPORT PGDATABASE PGUSER PGPASSWORD PGSQL_BIN \
        KETTLE_JOBS_PSQL KETTLE_JOBS R_JOBS_PSQL KITCHEN KETTLE_HOME PATH

Your "vars" file may differ a little, but the variables mentioned here should be defined, or the upload will fail.
We provide a small shell script, check_env.sh, situated in the top directory of this study, which checks a few things to see if the environment is OK.

The directory structure of the Cell-line study follows a naming convention (italic names) as shown in the next figure. This naming convention must be seen as a proposal, because the convention is currently based on the data types found in tranSMART seen from a developer's perspective. This is probably not what we want.

 

In each directory you will find a script "load.sh". This script loads the data in that directory (and the data found in its sub-directories) into tranSMART. So, if you execute the script "load.sh" in the top directory, all data for this study will be uploaded into tranSMART. This is not recommended until you are sure this works for you; let's start with doing it one by one.
Another important concept to notice is the "params" files, which you find in each directory containing data to be uploaded. These files are mandatory and define (possible) variables that influence how the data is uploaded. Also notice the file "study.params" in the top directory, which contains variables at study level (available for all datasets).

clinical

Go to the directory "clinical" and execute "load.sh".

This is what should happen:

  • An R-script is executed, which transforms the data-files into a file which can be uploaded into the landing-zone of tranSMART 
  • The transformed data is uploaded into the landing-zone (283 lines)
  • The stored procedure "i2b2_load_clinical_data" is called (return code should be "1")

acgh

Go to the directory "acgh" and execute "load.sh".

This should upload 4 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.
note: for the "gene" datasets to be successfully uploaded, you first have to upload the "annotation" data (annotation data is shared by both datasets) 

expression

Go to the directory "expression" and execute "load.sh"

This should upload 2 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.

mirna

Go to the directory "mirna" and execute "load.sh"

It is assumed that the mirna dictionary is already available in tranSMART. This should have been done during installation time (make -C data/postgres/ load_mirna_dictionary).

Proteomics

Go to the directory "proteomics" and execute "load.sh"

It is assumed that the uniprot-dictionary is already available in tranSMART. This should have been done during installation time (make -C data/postgres/ load_proteomics_dictionary)

rnaseq

Go to the directory "rnaseq" and execute "load.sh"

This should upload 7 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.
note: be sure you upload the "annotation" dataset first.

vcf

Go to the directory "vcf" and execute "load.sh"

This should upload 8 datasets (see diagram above). Alternatively you can do them one by one by executing the "load.sh" scripts in the subdirectories.
Note: uploading a vcf dataset twice will result in undefined behaviour, because the "old" dataset is not removed. 

 
