Introduction

The tranSMART 19.1 release is the first release of the new developments under the Dell Technologies project.

Version numbers are updated throughout to 19.1.

Biomart domain

Column widths for organism and species in biomart tables are set to the number of characters in a VARCHAR2 or character varying column. In earlier versions they were double the number of characters.

Experiment objects are extended to store data for newly enabled fields in the Browse tab.

Folder management

The description column is expanded to 4000 characters. This is used as the description for programs, studies, assays, analyses, folders and files.

Date values are supported to document the creation and update dates for Browse tab items. Date item subtypes can be 'DATE' (date only), 'TIME' (date and time) and 'CURRENTTIME' (use the current time as the value).

Date values can be automatically updated without being displayed by adding the prefix 'HIDDEN' to the type (e.g. 'HIDDENCURRENTTIME').
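As an illustration of how these type names combine, the following minimal Python sketch splits a type string into its subtype and a hidden flag. The function and class names are hypothetical and are not taken from the tranSMART code; only the three subtype names and the 'HIDDEN' prefix come from the description above.

```python
from typing import NamedTuple

# Subtype names as listed in these release notes.
SUBTYPES = ("DATE", "TIME", "CURRENTTIME")
HIDDEN_PREFIX = "HIDDEN"


class DateItemType(NamedTuple):
    """Hypothetical container for a parsed date item type."""
    subtype: str   # one of SUBTYPES
    hidden: bool   # True when the value is updated but not displayed


def parse_date_item_type(type_name: str) -> DateItemType:
    """Split e.g. 'HIDDENCURRENTTIME' into ('CURRENTTIME', True)."""
    hidden = type_name.startswith(HIDDEN_PREFIX)
    subtype = type_name[len(HIDDEN_PREFIX):] if hidden else type_name
    if subtype not in SUBTYPES:
        raise ValueError(f"unknown date item type: {type_name!r}")
    return DateItemType(subtype, hidden)
```

For example, parse_date_item_type('DATE') yields a visible date-only item, while parse_date_item_type('HIDDENCURRENTTIME') yields a hidden item that takes the current time.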

Galaxy export

The PostgreSQL driver is updated from 4.2.2.jre7 to 4.3.3.

The Oracle driver is updated from ojdbc7:12.1.0.1 to ojdbc8:21.5.0.0.

IpaApi

The org.apache.httpcomponents.httpclient library is updated from 4.4.1 to 4.5.13.

The commons-io library is updated from 1.3.2 to 2.7.

MyDAS

The junit library is updated from 4.7 and 4.11 to 4.13.2.

The mysql-connector-java library is updated from 5.1.12 to 8.0.20.

The commons-collections library is updated from 3.2.1 to 3.2.2.

The org.apache.httpcomponents.httpclient library is updated from 4.4.1 to 4.5.13.

Rmodules

The org.rosuda.Rserve library is updated from 1.7.3 to org.rosuda.REngine.Rserve 1.8.1.

The com.google.guava library is updated from 19.0 to 30.1.1-jre.

The height of the heatmap image is increased from 800 to 1200 pixels, and the text height is increased by 50%.

SampleTypes and Timepoints are compiled into lists in the High Dimension popup window under Advanced Workflows.

A new error message is generated if the disk space for analysis jobs is full, or if the job directory cannot be created.

SmartR

The rserve library is updated from 1.7.3 to 1.8.1.

The com.google.guava library is updated from 19.0 to 30.1.1-jre. The JobTasksService is updated to use the new guava version.

For testing, jasmine-ajax is updated from 3.2.0 to 3.3.1 and jasmine-core from 2.6.0 to 3.6.0; karma is updated from 0.13.19 to 6.3.17; karma-jasmine is updated from 0.3.6 to 2.0.1; karma-phantomjs-launcher is updated from 0.2.3 to 1.0.4.

Transmart Core Db

The com.google.guava library is updated from 19.0 to 30.1.1-jre.

The junit library is updated from 4.11 to 4.13.2.

All organism names are fixed at 100 characters.

Peptide names are fixed at 200 characters.

Where a gene symbol (locus name) is found twice, the numeric gene id is also used to distinguish the genes.
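The rule for duplicated gene symbols can be sketched in a few lines of Python. This is only an illustration of the idea, not the database code: the function name, the tuple-based keys and the example symbols and ids are all made up for this sketch.

```python
from collections import Counter


def gene_keys(genes):
    """Return one distinguishing key per (symbol, gene_id) pair.

    A symbol that occurs once is its own key. A symbol that occurs
    more than once is paired with its numeric gene id, so the two
    genes can be told apart.
    """
    counts = Counter(symbol for symbol, _ in genes)
    return [
        (symbol, gene_id) if counts[symbol] > 1 else (symbol,)
        for symbol, gene_id in genes
    ]


# Made-up symbols and ids, for illustration only.
example = [("ABC1", 11), ("XYZ2", 22), ("ABC1", 33)]
keys = gene_keys(example)
```

Here the two 'ABC1' entries receive keys that include their gene ids, while 'XYZ2' is identified by its symbol alone.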

Transmart Core Db Tests

Tests are updated to use the new tranSMART-specific query tables.

Transmart Custom

The custom preloaded image files now load with a version number.

TransmartApp web interface

In the navigation tree, testing for an Editable node is performed earlier to display i2b2 demo data with the correct icon.

The '>>' icon is now displayed to reveal the navigation tree after it has been hidden with the '<<' icon.

The PostgreSQL driver is updated from 4.2.2.jre7 to 4.3.3.

The Oracle driver is updated from ojdbc7:12.1.0.1 to ojdbc8:21.5.0.0.

The com.google.guava library is updated from 19.0 to 30.1.1-jre.

The org.rosuda.Rserve library is updated from 1.7.3 to org.rosuda.REngine.Rserve 1.8.1.

The org.apache.poi libraries are updated from 3.1-FINAL to 5.0.0.

The internal jquery-ui javascript library is updated from 1.10.4 to 1.11.1.

The GenePattern and gp-modules libraries are updated from an old unnamed version to 3.9.0.

A new Config.groovy variable com.recomdata.transmartSummary defines a customizable welcome message for the Login and Analysis pages.

A new Config.groovy variable org.transmart.i2b2.view.enable allows the display of i2b2 data as study 'I2B2' by an admin user.

Query management code is updated to use the new tranSMART-specific query tables, allowing i2b2 exclusive use of its own query tables.

Concept node counts are retrieved from the new table i2b2metadata.tm_concept_counts which replaces i2b2demodata.concept_counts for compatibility with i2b2 metadata.

A patient sex of 'x' or 'unknown' is displayed as an empty string rather than a null value.

Code to handle certain early (closed source) tranSMART studies as special cases has been removed.

Multiline strings are replaced by simple strings for generated SQL statements.

In editing a Browse tab study, the PubMed URL is automatically generated from a PubMedId, and the DOI URL is generated for a DOI code.
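A minimal Python sketch of this kind of URL generation is shown below. It assumes the public PubMed and DOI resolver URL patterns; the exact patterns used by transmartApp are not stated in these notes and may differ, and the function names are hypothetical.

```python
from urllib.parse import quote

# Assumed base URLs: the public PubMed site and the DOI resolver.
PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/"
DOI_BASE = "https://doi.org/"


def pubmed_url(pubmed_id: str) -> str:
    """Build a PubMed URL from a numeric PubMedId."""
    pubmed_id = pubmed_id.strip()
    if not pubmed_id.isdigit():
        raise ValueError(f"not a numeric PubMedId: {pubmed_id!r}")
    return f"{PUBMED_BASE}{pubmed_id}/"


def doi_url(doi: str) -> str:
    """Build a resolver URL from a DOI code such as '10.1000/xyz123'."""
    # Percent-encode unusual characters but keep the '/' separator.
    return DOI_BASE + quote(doi.strip(), safe="/")
```

Generating the URL from the identifier means that only the PubMedId or DOI code needs to be entered when the study is edited.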

Icons for sorting columns in ascending or descending order are corrected.

The script to build the plugins and the transmart.war file is updated.

On the Browse tab, metadata for a study has been extended. A study now includes data for:

Item                    Type       Description
OverallDesign           Free text  Overall design text from GEO
StudyTarget             Text       Study target description
StudyETLid              Text       Identifies data source, e.g. a GEO accession or an internal identifier
NumberOfSamples         Text       Number of samples. Like NumberOfSubjects, this is a text field that usually contains an integer
StudyStartDate          Date       Date in yyyy-mm-dd format for the start of the study
StudyCompletionDate     Date       Date in yyyy-mm-dd format for the completion of the study
StudyPersonName         Text       Name of the primary contact
StudyPersonRoles        Text       Roles of the primary contact
StudyPersonContact      Text       Contact details of the primary contact
StudyPersonInstitution  Text       Institution of the primary contact
StudyPersonAddress      Text       Address of the primary contact
StudyEntryTime          DateTime   Defaults to current time when the browse entry is created
StudyLastUpdate         DateTime   Defaults to current time when an update is saved

A Program can define a contact name as text. This is disabled by default. Previous versions had separate first, middle and last names for the program contact.

Transmart data

Database schema

The tranSMART database schema is now fully compatible with the latest release of i2b2 (1.7.13). Future releases of both platforms will appear together. All i2b2 tables, procedures and functions are included in a clean tranSMART 19.1 installation. The i2b2 demodata can optionally be loaded.

The i2b2 webclient and server can be installed using the tranSMART database, loaded with i2b2 data.

The additional schemas and tables used by tranSMART are retained and updated.

A new table stores the StudyId and top node for each tranSMART study. In previous releases this was a view rather than a table. As both values are defined as parameters for any new study it is much simpler to store them. Searching for the top node was time consuming, especially on very large databases.

The query management tables for tranSMART have been renamed by changing the QT_ prefix to a QTM_ prefix. This allows both tranSMART and i2b2 to run on the same database safely. We hope to recombine the tables, and share queries and results, in a future release.
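The renaming rule can be expressed as a small Python helper. The i2b2 table names in the example are those of a standard i2b2 query schema; the resulting tranSMART names simply follow the prefix rule described above and have not been checked against the 19.1 schema, and the function name is hypothetical.

```python
I2B2_PREFIX = "QT_"
TRANSMART_PREFIX = "QTM_"


def transmart_query_table(i2b2_table: str) -> str:
    """Map an i2b2 query table name to its tranSMART counterpart.

    Only the prefix changes: QT_<name> becomes QTM_<name>.
    Names are normalised to upper case.
    """
    name = i2b2_table.upper()
    if not name.startswith(I2B2_PREFIX):
        raise ValueError(f"not a QT_ table: {i2b2_table!r}")
    return TRANSMART_PREFIX + name[len(I2B2_PREFIX):]


# Standard i2b2 query tables, used here only as example input.
renamed = {
    table: transmart_query_table(table)
    for table in ("QT_QUERY_MASTER", "QT_QUERY_INSTANCE",
                  "QT_QUERY_RESULT_INSTANCE")
}
```

Because the two sets of names never collide, i2b2 and tranSMART can each write to their own query tables in the same database.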

Some tranSMART tables giving an overall summary across all studies have been moved from i2b2demodata to i2b2metadata. This is a more appropriate schema, following the i2b2 examples.

The database installation scripts have been tested on postgres versions up to 14.4. We expect they will work on any version from 9.5 upward.

The database includes new schemas (i2b2hive, i2b2imdata, i2b2pm, i2b2workdata) to support a full i2b2 database. All i2b2-specific tables have their initial contents defined as for a standard i2b2 installation.

Schema amapp

Tables defining metadata terms for Browse tab annotation.

New terms for Study and Program are defined, supporting all the originally developed Study terms.

These metadata terms are supported by the new browse and program ETL targets.

Schema biomart

Maximum length for an organism name is extended to 200 characters in the annotation table.

Primary keys are set to IDs, and previous primary keys are changed to unique keys, to clean up tables from early releases.

Schema biomart_stage


Schema biomart_user


Schema deapp


Schema fmapp


Schema galaxy



Schema i2b2demodata

Updated to include all tables in a standard i2b2 1.7.13 installation.

The default password for this user/role is changed to the i2b2 default 'demouser'.

Schema i2b2hive

This standard i2b2 schema is new to tranSMART in 19.1.

Includes all tables in a standard i2b2 1.7.13 installation.

Preloaded with the same metadata as a new i2b2 installation.

Schema i2b2imdata

This standard i2b2 schema is new to tranSMART in 19.1.

Includes all tables in a standard i2b2 installation.

Schema i2b2metadata

Updated to include all tables in a standard i2b2 1.7.13 installation.

The default password for this user/role is changed to the i2b2 default 'demouser'.

Schema i2b2pm

This standard i2b2 schema is new to tranSMART in 19.1.

Includes all tables in a standard i2b2 1.7.13 installation.

Preloaded with the same metadata as a new i2b2 installation.

Schema i2b2workdata

This standard i2b2 schema is new to tranSMART in 19.1.

Includes all tables in a standard i2b2 1.7.13 installation.

Schema searchapp

Search keyword data is updated.

Schema tm_cz


Schema tm_lz


Schema tm_wz


Schema ts_batch


Data dictionaries

The pre-loaded metadata from Entrez (genes for human and mouse) and Medline (disease MeSH terms) has been updated.

Additional model organism species have been defined to support the COVID-19 annotation project. The most popular cell lines have been added as species, for example 'HeLa' or 'Vero', as a cell line is not the same as a whole organism.

Configuration

New placeholders are provided for org.transmart.i2b2.view.enable to enable viewing i2b2 data, and for com.recomdata.transmartIntro and com.recomdata.transmartWelcome to display alternative HTML text on the Browse and Login pages. transmartIntro replaces the default Browse tab introductory text. transmartWelcome is an additional paragraph that appears below.

Database comparison script

The ora-pg-compare.pl script is extended to compare PostgreSQL and Oracle database definitions for tranSMART and i2b2.

This script is used to keep database definitions for both platforms (tranSMART and i2b2) in sync.

Scripts

Installation

Install scripts

The install scripts have been rewritten to make them more general for any Unix operating system. A single script Scripts/Install/InstallTransmart.sh now automatically detects the operating system and version, and loads appropriate packages to support tranSMART, Rserve, solr, Kettle, and other components.

Each operating system uses different names for various packages, or supports different versions. We maintain a list for each system that we support, and can add more as required.

Supported versions include Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, ... (more to be added)

Docker instances

Five docker images are in preparation to be used as alternatives to a full install. These are intended to get new users up and running quickly. For a production system a full installation is strongly recommended.

A PostgreSQL database running in Docker can easily lead to loss of the database content. A regular backup strategy is strongly advised (database dumps, backups of the Docker files, and retained source files for data loads) to recover from any loss of data.

Name              Description
transmart-app     Server running transmartApp and gwava under tomcat8, with the current online transmartmanual
transmart-db      Postgres 14 database running on port 5432 with an optional configuration script to fine-tune the postgresql server
transmart-load    Data loading (ETL) scripts to fetch from the transmartfoundation library and other sources and install studies using Pentaho Data Integration (Kettle) 9.3
transmart-rserve  Fully installed Rserve with all required additional packages from R and BioConductor running on port 6311 as a non-privileged user
transmart-solr    Solr indexing for the Browse tab, Sample Explorer and other searches running on port 8983 as a non-privileged user
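As one possible starting point for the database dump advised above, the following Python sketch builds a pg_dump command for the database on port 5432. The database name 'transmart', the user 'postgres', the host and the output location are assumptions for illustration and should be replaced with the values of the actual installation; the function name is hypothetical.

```python
from datetime import date
from pathlib import Path


def pg_dump_command(database="transmart", host="localhost", port=5432,
                    user="postgres", out_dir="."):
    """Build (but do not run) a pg_dump command for a dated backup file.

    All default values are assumptions, not settings taken from the
    tranSMART docker images.
    """
    out_file = Path(out_dir) / f"{database}-{date.today().isoformat()}.dump"
    return [
        "pg_dump",
        f"--host={host}",
        f"--port={port}",
        f"--username={user}",
        "--format=custom",       # compressed archive, restorable with pg_restore
        f"--file={out_file}",
        database,
    ]


command = pg_dump_command()
```

The resulting list can be passed to subprocess.run(command, check=True) from a scheduled job. A dump on its own does not replace the other parts of the backup strategy, such as keeping the source files for data loads.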

Libraries

The various libraries included in the transmartApp web interface and the database utilities have been updated to their latest versions. When issues arise which have security implications we are automatically notified through GitHub and can apply and distribute updates, usually a simple updated transmart.war file.

Particular attention was paid to issues with the log4j2 library (earlier tranSMART releases used the obsolete log4j1 library), and to updating the drivers for postgresql and oracle.

Data loading

The third-party tools for loading earlier releases are no longer maintained by their developers. Major improvements have been made to the standard (Kettle) ETL procedures. Further improvements are planned in coming releases.

Kettle release

TranSMART uses the latest Pentaho Data Integration (Kettle) Community Edition 9.3.

The volume of log messages has been reduced to make it easier to check for success or for any error conditions.

Log file processing is further simplified by a new script that automatically searches for files to load for a study, and parses the log files to report any errors in the input data or elsewhere.

Study loading in a single step

A new target 'all' searches for all data types for a study. Browse tab data is loaded first, then clinical data, then any reference annotation, and finally expression, rnaseq and any other data types.
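The loading order described above can be sketched in Python as a stable sort. The data type names used here ('browse', 'clinical', 'annotation' and so on) are labels chosen for this sketch and are not necessarily the names of the actual make targets.

```python
# Data types with a fixed position in the loading sequence.
LOAD_ORDER = ["browse", "clinical", "annotation"]


def order_targets(found):
    """Order the data types found for a study for loading.

    Browse tab data comes first, then clinical data, then reference
    annotation. All other data types (expression, rnaseq, ...) follow,
    keeping the order in which they were found.
    """
    rank = {name: position for position, name in enumerate(LOAD_ORDER)}
    # sorted() is stable, so unranked types keep their relative order.
    return sorted(found, key=lambda name: rank.get(name, len(LOAD_ORDER)))


ordered = order_targets(
    ["rnaseq", "clinical", "expression", "browse", "annotation"]
)
```

With this input, the result starts with 'browse', 'clinical' and 'annotation', followed by 'rnaseq' and 'expression' in their original order.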

The search currently relies on finding the study targets in the public datasets feed. This ensures that there is an external (or local) source for the study files so that they can be purged and re-fetched if data is to be reloaded. We hope to support simply adding local *.tar.xz files under samples/studies in a future release. For 19.1 you can add your study files to a local server (http or ftp), create a local datasets_index file, add its URL to the end of the sample/studies/public_feeds file, and then run 'make update_datasets'.

Clinical data loading

Loading clinical data can be many times faster in some cases. A number of bottlenecks have been identified and resolved. In older releases, studies with large numbers of concepts (nodes in the tree) and large numbers of patients could take a very long time to load, and would also slow the loading of any further studies. Both the initial slow loading and the impact on later studies have been resolved.

We recommend running on a recent PostgreSQL release to take advantage of improvements in query optimization. We have seen slow performance on very old versions, though they are still supported by the tranSMART code.

RNAseq data loading

There are major performance improvements in this release for rnaseq data. The SQL code for postgresql has been reviewed and updated, especially for very large input datasets.

Browse Tab Loading

In previous releases the Browse Tab was intended to be populated manually by an administrator user.

In this release, new targets are added to populate the browse data for a study, and to load the Program data that it should be stored under.

The Study metadata has been extended to include all fields currently defined. Many were ignored in previous releases.

The current set of transmart foundation and other library studies will be extended to provide Browse tab content automatically generated from GEO with a little manual intervention (e.g. classifying the disease).

In 19.1 the Program title has to be used as the fixed identifier when loading study Browse targets. We aim to add a ProgramID in a future release to allow the Program Title to be edited - useful to change the order in which the programs appear.

Concept counts

In the transmartApp web interface each concept in the navigation tree has an associated subject count. When loading very large data sets the counting of concepts was very inefficient in previous releases.

A complete rewrite of the procedures resulted in a 100-fold improvement in execution time.

A review of the calls to these procedures has reduced the occasions where nodes are recounted.
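To make clear what is being counted, the following Python sketch computes a subject count for every node in a small tree in a single pass over the observations. It is not the rewritten database procedure, which is not shown in these notes; the function name is hypothetical, and the backslash-delimited concept paths follow the usual i2b2 style.

```python
from collections import defaultdict


def subject_counts(observations):
    """Count distinct subjects for each concept node and its ancestors.

    observations: iterable of (concept_path, patient_id) pairs, where
    a concept path looks like '\\Study\\Demographics\\Age\\'.
    """
    patients = defaultdict(set)
    for path, patient_id in observations:
        parts = [part for part in path.split("\\") if part]
        # A subject observed at a leaf also counts for every ancestor.
        for depth in range(1, len(parts) + 1):
            node = "\\" + "\\".join(parts[:depth]) + "\\"
            patients[node].add(patient_id)
    return {node: len(ids) for node, ids in patients.items()}


counts = subject_counts([
    ("\\Study\\Age\\", 1),
    ("\\Study\\Age\\", 2),
    ("\\Study\\Sex\\", 2),
])
```

In this example the top node has two subjects, because subject 2 appears under both child nodes but is counted only once.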

I2b2 data

Data for tranSMART is loaded as discrete studies, each under its own top node and recorded in the new trial node table (replacing a calculated view).

Any other data is labeled as 'I2B2' and treated as a single 'I2B2' study accessible only by admin users.

The i2b2 demo dataset can be loaded after the tranSMART database has been created with:

        make -j 1 -C i2b2demodata/postgres load

(For an Oracle database, replace 'postgres' with 'oracle'.)
