
tranSMART can load VCF data files using scripts in transmart-data. The VCF format is described in https://github.com/samtools/hts-specs; the specifications for versions 4.1 and 4.2 are:

http://samtools.github.io/hts-specs/VCFv4.1.pdf

http://samtools.github.io/hts-specs/VCFv4.2.pdf

The official Java API for reading VCF files is htsjdk: https://github.com/samtools/htsjdk

See also the info about Small genomic variants (VCF format) in general.

How to load VCF files?

tl;dr

  • Set the parameters in the samples/common/<study_id>/vcf.params file.
  • Run make -C samples/{postgres,oracle} load_vcf_<study_id>

Intro

There are two ways to load a VCF file into tranSMART: an easy way and a more manual way. The easy way works well for predefined datasets and can also be used for private datasets. The more manual way gives more control over the loading process, which makes it well suited for testing and for loading your own VCF files.

Please note: the VCF loading scripts as well as the latest VCF database schema currently reside in the london_hackathon branch of transmart-data.

Prerequisites

Both ways of loading data have some prerequisites:

  • the normal transmart-data requirements apply: https://github.com/thehyve/transmart-data
  • Perl must be installed on your system and accessible on the path. This is the case in most Linux environments
  • the transmart-data vars file must have been sourced in the shell you are using, for example by running: . ./vars

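The prerequisites above can be checked from the shell before starting a load. The following is a minimal sketch, not part of transmart-data: the function names are made up for illustration, and PGDATABASE is only assumed to be one of the variables exported by your vars file. Substitute a variable your own vars file actually sets (an Oracle setup will use different ones).

```shell
# Return 0 if the named command is available on PATH, 1 otherwise.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Print one line per problem found and return the number of problems.
# PGDATABASE is used only as an ASSUMED marker that ./vars was sourced;
# replace it with a variable that your own vars file exports.
check_vcf_prereqs() {
  problems=0
  for cmd in perl make; do
    if ! have_cmd "$cmd"; then
      echo "missing command: $cmd"
      problems=$((problems + 1))
    fi
  done
  if [ -z "${PGDATABASE:-}" ]; then
    echo "PGDATABASE is not set; has the vars file been sourced?"
    problems=$((problems + 1))
  fi
  return "$problems"
}

check_vcf_prereqs || echo "fix the problems above before loading"
```

Running this in the transmart-data root directory, after . ./vars, should print nothing if everything is in place.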
Parameters

For both ways, a set of parameters is needed. An example parameters file can be found in samples/common/_scripts/vcf/vcf.params.

Please note: if you are loading a predefined dataset, this file has most probably already been generated. If so, it will be stored as samples/common/<study_id>/vcf.params. See for example the Cell-line study.

The following parameters are needed:

# The full path to the VCF file you want to load
VCF_FILE=/tmp/my-vcf-file.vcf

# The full path to the subject-sample-mapping file
# This file is a tab-separated file, with two columns, containing
# subject_id as known in clinical data in the first column;
# sample_id as known in VCF file in the second column
SUBJECT_SAMPLE_MAPPING_FILE=/tmp/my-subject-sample-mapping.txt

# A (temp) directory where the output files from parsing the file
# will be stored. Must be writable by the current user
VCF_TEMP_DIR=/tmp/vcf


# Short textual description of the source of the data
DATASOURCE=unknown

# Initials of the user that is loading this dataset, for future reference
ETL_USER=TD


# Unique identifier for the current dataset. Only a single VCF file can be
# loaded into a dataset; if you have more, merge them into one VCF file first.
DATASET_ID=SomaticMutations123

# Study identifier as it is used in the clinical data. It is used to look up the subjects
STUDY_ID=GSE8581

# Concept path to store the VCF data in the clinical data tree.
# The concept path must be specified completely, with the parts of the
# path separated by a backslash (\). The path must NOT start with a
# backslash.
# For example: Public Studies\GSE8581\Genomic Variants
# N.B. Use quotes around the parameter if it contains spaces and
# escape the backslashes
CONCEPT_PATH="Public Studies\\GSE8581\\Somatic VCF"

 

# Identifier for the genome build that is used as a reference
GENOME_BUILD=hg36

# Identifier for the platform to use. A platform for VCF currently
# only describes the genome build. If unsure, use 'VCF_<genome_build>'
GPL_ID=VCF_$GENOME_BUILD
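Since the subject-sample-mapping file must be tab-separated with exactly two columns, it can help to check it before starting a load. The following is a minimal sketch and not part of transmart-data; the function name and the subject and sample IDs are hypothetical.

```shell
# Validate a subject-sample mapping file: every non-blank line must have
# exactly two non-empty, tab-separated columns.
# Prints the offending line numbers and returns non-zero if any are found.
check_mapping_file() {
  awk -F '\t' '
    NF == 0 { next }    # skip blank lines
    NF != 2 || $1 == "" || $2 == "" {
      printf "line %d: expected subject_id<TAB>sample_id\n", NR
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Example: write a two-sample mapping file (hypothetical IDs) and check it.
mapping_file=$(mktemp)
printf 'SUBJ001\tSAMPLE_A\nSUBJ002\tSAMPLE_B\n' > "$mapping_file"
check_mapping_file "$mapping_file" && echo "mapping file looks well-formed"
```

The check only looks at the layout of the file. It does not verify that the subject IDs exist in the clinical data or that the sample IDs occur in the VCF header.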

The easy way

  • Make sure the prerequisites are met.
  • Make sure that the samples/common/<study_id>/vcf.params file exists and contains the correct parameters. In this case, the STUDY_ID parameter is not needed, as it can be derived from the path name.
  • Make sure that the clinical data for the study has already been loaded, as the genomic variant data will be associated with the clinical patients.
  • Run the following command (in the transmart-data root directory):

        make -C samples/{postgres,oracle} load_vcf_<study_id>

The more manual way

  • Make sure the prerequisites are met.
  • Copy the example params file to a location of your own, fill in the right parameters, and then source it in your shell: . ./vcf.params
  • Make sure that the clinical data for the study has already been loaded, as the genomic variant data will be associated with the clinical patients.
  • Run the following command (in the transmart-data root directory):

          make -C samples/{oracle,postgres} load_vcf

You can also use the following make targets to have more fine-grained control over the loading process:

 

  • parse_vcf: parses the VCF file and writes intermediate txt files to the VCF_TEMP_DIR.
  • load_parsed_vcf_data: loads the VCF data itself. This target requires the VCF file to have been parsed already; the resulting txt files must be present in the VCF_TEMP_DIR specified in the params file.
  • load_parsed_vcf_mapping: loads the VCF mapping, but not the data. This target requires the VCF file to have been parsed already; the resulting sql files must be present in the VCF_TEMP_DIR specified in the params file.
    This target only makes sense if the data is loaded as well.
  • load_vcf: parses the VCF file and loads the data and mapping into the database.
    This target is equivalent to running the targets parse_vcf, load_parsed_vcf_data and load_parsed_vcf_mapping.
  • load_vcf_data: parses the VCF file and only loads the VCF data itself. This results in VCF data loaded in the deapp schema, but not mapped to patients, so the data won't show up in the dataset explorer tree.
    This target is equivalent to running the targets parse_vcf and load_parsed_vcf_data.
  • load_vcf_mapping: parses the VCF file and only loads the VCF mapping. This is only useful if the VCF data itself is already loaded; otherwise a mapping to subjects is made and a node in the tree is created, but the data itself doesn't exist.
    This target is equivalent to running the targets parse_vcf and load_parsed_vcf_mapping.
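As described above, load_vcf corresponds to three fine-grained targets run in sequence. The sketch below prints those make invocations for a given database, and only executes them when RUN=1 is set. The wrapper function and the RUN variable are made up for illustration and are not part of transmart-data; the target names are the ones listed above.

```shell
# Print (or, with RUN=1, execute) the make invocations that correspond to
# the combined load_vcf target, one fine-grained target at a time.
# $1: database flavour, "postgres" or "oracle".
vcf_load_steps() {
  db=$1
  for target in parse_vcf load_parsed_vcf_data load_parsed_vcf_mapping; do
    if [ "${RUN:-0}" = "1" ]; then
      make -C "samples/$db" "$target" || return 1
    else
      echo "make -C samples/$db $target"
    fi
  done
}

vcf_load_steps postgres
```

Running the steps separately makes it possible to inspect the files in VCF_TEMP_DIR after parse_vcf, before anything is written to the database.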

Structure of the scripts

The VCF loading scripts are split into two parts:

  • a common part to parse the VCF file and the subject-sample-mapping file. The output of this step is a set of tab-separated files and a set of SQL files to be loaded into the database. These common scripts are located in transmart-data/samples/common/_scripts/vcf
  • a database-specific part to load the data into the database. Currently, a PostgreSQL version and an Oracle version exist. These scripts are located in transmart-data/samples/<db>/_scripts/vcf
    • the PostgreSQL script replaces some Oracle-specific SQL (the syntax for using sequences, and "from dual"). Afterwards, it uses psql to load the data
    • the Oracle script uses the transmart-data LoadTsvFiles.groovy script to load the data into the database.
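To illustrate the kind of rewriting the PostgreSQL script has to do, here is a minimal sketch using sed. It is not the actual transmart-data script: the function name and the sequence name are hypothetical, and it only covers the two idioms mentioned above (sequence access and "from dual").

```shell
# Rewrite two Oracle idioms into PostgreSQL equivalents, stdin to stdout:
#   schema.seq.nextval  becomes  nextval('schema.seq')
#   "from dual"         is removed
# Illustrative only: handles all-lowercase or all-uppercase keywords,
# not mixed case, and does not attempt to parse SQL properly.
oracle_to_postgres_sql() {
  sed -E \
    -e "s/([A-Za-z_][A-Za-z0-9_.]*)\.(nextval|NEXTVAL)/nextval('\1')/g" \
    -e 's/[[:space:]]+(from|FROM)[[:space:]]+(dual|DUAL)//g'
}

# Hypothetical sequence name, for illustration only.
echo "select my_schema.my_seq.nextval from dual;" | oracle_to_postgres_sql
# prints: select nextval('my_schema.my_seq');
```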