
The ability to load High Dimensional Data (HDD), i.e. high-throughput molecular data, was one of the drivers for tranSMART development. tranSMART incorporates the i2b2 schema, which handles low dimensional data (LDD). The DEAPP schema was added for HDD; it contains several tables created for different high dimensional data types.

In simple terms, an LDD observation is subject-variable-value; an HDD observation is subject-sample-variable-value. To load HDD data, three files must be created: a Data file, a Platform file, and a Subject-Sample Mapping file.
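The difference between the two observation shapes can be illustrated with a minimal sketch (the field names here are examples for illustration, not the actual tranSMART column names):

```python
# An LDD observation: one value per subject per variable.
ldd_row = {"subject": "SUBJ_01", "variable": "Age", "value": 54}

# An HDD observation adds a sample level: one subject can contribute
# several samples, each measured across many probes/variables.
hdd_row = {
    "subject": "SUBJ_01",
    "sample": "S01_TUMOR",
    "variable": "PROBE_1007_s_at",
    "value": 7.83,
}

# The only structural difference is the extra "sample" key.
assert set(hdd_row) - set(ldd_row) == {"sample"}
```

The Subject-Sample Mapping file is what ties the extra sample level back to the subjects already known to the study.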

The Platform concept is adopted from the Gene Expression Omnibus (GEO), a data repository for microarray, next-generation sequencing, and other forms of high-throughput functional genomics data. According to GEO, “A Platform record is composed of a summary description of the array or sequencer and, for array-based Platforms, a data table defining the array template.” For tranSMART HDD data, a Platform is simply a file that matches each detecting or detected moiety (probe, antibody, peptide, ligand, transcript, etc.) to the relevant standard Gene ID/Symbol, UniProt ID, Metabolite ID, or miRBase ID. In release 17.1, gene transcript ID will be added as an additional standard ID option. This will allow better annotation of non-coding RNAs, anti-sense transcripts, pseudogenes, variants, and other messages detected by arrays or sequencing. When a Platform file is loaded, it links the data to the data-type-specific dictionary that should already have been loaded when tranSMART was installed.

Data must be normalized as required by the data type, experimental method, instrument, etc.

Either raw data ("R") or log2-transformed data ("L") can be loaded. tranSMART analysis workflows use either log-transformed data or a calculated z-score; therefore, when data is loaded as "R", log2 values are calculated during loading. Depending on the loader being used, the calculation is performed either by stored procedures or by custom, loader-specific procedures. There are some minor differences in raw data processing between data types: "0" and negative values may be replaced with "0.001", or "0.001" may be added to all values, etc. For some data types, replacing "0" with "0.001" is not the best approach and can lead to high variance at the lower end (genes that were not measured, perhaps for technical reasons, could be identified as highly differentially expressed).

Therefore, it is highly recommended to load HDD data as log2-transformed. Data-specific and experiment-specific approaches should be used to deal with "0" and negative values.

 

High Dimensional (HDD) Gene Expression Data in tranSMART is historically defined as data generated by measuring gene expression levels using microarray technology. Data generated by other technologies can also be loaded as "Expression" as long as the entity being detected can be mapped to a gene and each probe/reagent used for detection is unique (one entity per probe; multiple probes for the same entity are possible).

There are three tables in tranSMART for NGS data: two for RNA sequencing data and one for miRNA sequencing data. One of the RNASeq data tables is intended for loading raw read count observations, which can be used with the Group Test for RNASeq Advanced Workflow; the R package behind this workflow (edgeR) includes a normalization step. All other analysis workflows require pre-normalized RNASeq data (RPKM, FPKM, TPM, etc.), which is loaded similarly to Expression data. Even though there is a dedicated table for miRNA sequencing data, it can also be loaded as RNASeq; this is mostly a matter of preference for standard IDs. For RNASeq data, the “probes”, which in this case are transcript IDs, are mapped to standard Gene Symbols. For miRNA sequencing data, miRBase symbols are used as standard IDs.
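For context on the pre-normalized units mentioned above, TPM is computed from raw read counts and transcript lengths by the standard formula below. This is the generic definition of TPM, not a tranSMART loader routine; such normalization is done upstream, before loading:

```python
def tpm(counts, lengths_kb):
    """Transcripts Per Million from raw counts and lengths in kilobases.

    rate_i = count_i / length_kb_i
    TPM_i  = rate_i / sum(rates) * 1e6, so TPM always sums to one million.
    """
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Two transcripts of equal length: TPM is proportional to raw counts.
print(tpm([100, 300], [1.5, 1.5]))
```

Raw counts go into the dedicated read-count table for edgeR's Group Test; values like these TPMs would be loaded into the other RNASeq table, typically log2-transformed as "L".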

The tranSMART HDD miRNAQPCR data loading option might be very useful for a specific project focused on collecting this type of data. When miRNAQPCR data is loaded, the stored procedure transforms it into negative values and labels the result log2. This makes perfect sense as long as you expect it and do not apply any additional transformations to the data. Therefore, miRNAQPCR data should be loaded as dCt values: a dCt value represents the negative log2 of a transcript's relative abundance, and the negative of a negative gives the actual log2 values that can be used in Advanced Workflows. However, there is no RNAQPCR table for loading RNA qPCR data, which is more commonly generated in research. Having a specific procedure just for miRNAQPCR and not for RNAQPCR might be confusing.
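The sign convention above is easy to get wrong, so here is a minimal sketch of the arithmetic (this illustrates the dCt convention itself, not the stored procedure's code):

```python
import math

# dCt = Ct(target) - Ct(reference). Abundance roughly doubles for each
# -1 in Ct, so dCt = -log2(relative abundance), and negating dCt
# recovers log2(relative abundance) - which is what the stored
# procedure's sign flip produces when dCt values are loaded.
def dct_to_log2_abundance(dct):
    return -dct

# A transcript 8x more abundant than the reference has dCt = -3:
log2_abundance = dct_to_log2_abundance(-3.0)
assert log2_abundance == 3.0
assert math.isclose(2 ** log2_abundance, 8.0)
```

If you load anything other than dCt (for example, already-negated dCt), the procedure's sign flip will silently invert your fold changes.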

High dimensional qPCR data such as TLDA arrays can be loaded into the RNASeq table as "L" (log-transformed) when properly normalized. A reasonable approach would be to process qPCR data in the same spirit as RNASeq so that results can be compared between the two methods. For example: remove transcripts with Ct higher than an agreed-upon cutoff in more than XX% of samples (similar to RNASeq data normalized for a typical analysis, where transcripts with 0 counts in more than XX% of samples are also usually removed), then calculate dCt and convert it to negative dCt. The negative dCt value represents the log2 relative abundance of a gene message.
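The filtering recipe above can be sketched as follows. The cutoff (35 cycles) and missing fraction (50%) are placeholder values standing in for the "agreed-upon cutoff" and "XX%" in the text, and the function names are illustrative:

```python
def filter_and_negate_dct(ct_matrix, ref_ct, ct_cutoff=35.0, max_missing_frac=0.5):
    """Drop mostly-undetected transcripts, then compute negative dCt.

    ct_matrix: {transcript: [Ct per sample]}
    ref_ct:    [reference-gene Ct per sample]
    Returns {transcript: [negative dCt per sample]}, i.e. log2
    relative abundance, ready to load as "L".
    """
    result = {}
    for transcript, cts in ct_matrix.items():
        # Fraction of samples where the transcript was effectively undetected.
        missing = sum(1 for ct in cts if ct > ct_cutoff) / len(cts)
        if missing > max_missing_frac:
            continue  # analogous to dropping mostly-zero-count RNASeq transcripts
        # negative dCt = -(Ct_target - Ct_ref) = log2 relative abundance
        result[transcript] = [-(ct - r) for ct, r in zip(cts, ref_ct)]
    return result

print(filter_and_negate_dct(
    {"GENE_A": [30.0, 31.0], "GENE_B": [36.0, 37.0]},
    ref_ct=[28.0, 28.0],
))
```

GENE_B exceeds the cutoff in both samples and is dropped, mirroring the removal of zero-count transcripts in a typical RNASeq analysis.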

Note: there are qPCR methods that quantify the absolute amount of gene transcripts per sample in femtograms. These methods are not usually high-throughput, and their output can be loaded as subject-level LDD data. For HDD loading purposes, this data would be raw ("R").

There are two formats for loading protein quantification data as High Dimensional: Mass Spec Proteomics and RBM.

The RBM format is very specialized and was created to load output from the RBM instrument. Unless you are loading RBM data, this "Data type" is not very useful, and even RBM data can be loaded into the "Mass Spec Proteomics" table.

The format developed for Mass Spec Proteomics, however, is quite generic and can easily be adapted to any other quantitative proteomics data format, as long as the method/assay detects and quantifies specific proteins and protein isoforms that have UniProt IDs. Quantification of protein variants, mutations, or modifications, where different quantified entities map to the same UniProt ID, can also be loaded as High Dimensional data with some creative "probe IDs". Advanced workflows use either "probe" or "probe-protein name" as legends for graphs and tables; therefore, if you add a prefix describing the protein modification/variant being detected to the "Probe", you can load data for isoforms and modified proteins.
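A hypothetical probe-ID scheme along these lines might look like the sketch below. The prefix format and names here are invented for illustration; they are not a tranSMART standard:

```python
# Two measured entities (phospho-Ser473 AKT1 and total AKT1) share one
# UniProt ID; the probe prefix keeps them distinguishable in legends.
def make_probe_id(modification, gene_symbol):
    return f"{modification}_{gene_symbol}"

rows = [
    ("pS473", "AKT1", "P31749"),  # phosphorylated isoform
    ("total", "AKT1", "P31749"),  # total protein, same UniProt ID
]
platform = [
    {"PROBE_ID": make_probe_id(mod, gene), "UNIPROT_ID": uniprot}
    for mod, gene, uniprot in rows
]
print([row["PROBE_ID"] for row in platform])  # two distinct probes, one UniProt ID
```

Because workflows label results by probe (or probe-protein name), the prefix survives into graphs and tables, so the two forms of the protein stay distinguishable in analysis output.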

The Metabolomics data file includes several fields that are either redundant (also included in the platform file) or simply ignored. Data file columns: PATHWAY_SORTORDER, BIOCHEMICAL, SUPER_PATHWAY, SUB_PATHWAY, COMP_ID, PLATFORM, RI, MASS, CAS, PUBCHEM, KEGG, HMDB, Sample ID.

The platform file defines the platform and maps metabolites (Biochemical) to their HMDB IDs, Pathways, and Super Pathways. The Biochemical name is the primary identifier; other information is optional. Only HMDB IDs are accepted at present; missing or any other IDs will be replaced by “PRIVATE” in the analysis results output. HMDB IDs and HMDB Common Names are loaded as a dictionary. If Pathway data is loaded, it will be available in the HDD drop-down filter in the Advanced Workflows, and all metabolites mapped to the selected Pathway will be used in the analysis.
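The “PRIVATE” substitution behaves roughly as sketched below. This is an illustration of the rule as described above, not tranSMART's actual code, and the ID pattern check is an assumption:

```python
import re

def display_id(hmdb_id):
    """Return the HMDB ID for analysis output, or "PRIVATE" if the ID
    is missing or is not an HMDB ID (e.g. a KEGG or ChEBI ID)."""
    if hmdb_id and re.fullmatch(r"HMDB\d+", hmdb_id):
        return hmdb_id
    return "PRIVATE"

print(display_id("HMDB0000122"))  # → HMDB0000122
print(display_id("CHEBI:17234"))  # non-HMDB ID → PRIVATE
print(display_id(""))             # missing ID  → PRIVATE
```

In practice this means any metabolite you want identified by name in results output must carry a valid HMDB ID in the platform file.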

 

Some metabolites have very long names that exceed the Metabolomics table's field length limit. We recommend using Shorthand Notation for Metabolomics.

 



