Introduction

The tranSMART 19.1 release is the first new release for the new developments under the Dell Technologies project.

Version numbers are updated throughout to 19.1.

TranSMART is now tested with Postgres releases up to 15.3, and with Oracle releases up to 19.3.

Older versions from Postgres 9.5+ and Oracle 12.0+ should continue to work.

This release supports R versions up to 4.3.1.

Building

A new script transmartApp-build.sh builds transmart.war and gwava.war from source code.

Install scripts are available for Ubuntu 18.04, 20.04 and 22.04. We will add further operating systems to the Scripts artifact as they are tested.

Docker

A new repository transmart-docker builds 5 docker images:

Image             Description
transmart-app     Web client
transmart-db      Postgres database
transmart-load    Kettle ETL procedures to load database
transmart-solr    SolR server for browse, rwg and sample data
transmart-rserve  Rserve server

We would like to extend the transmart-db database to provide docker images pre-loaded with specific sets of data.

Search/filter

The pull-down menu can be customized through new configuration parameters to hide categories that should only be searched specifically.

A search for ALL can exclude specific categories through new configuration parameters.

Search terms with spaces are now supported.

Grid View

High-dimensional values (e.g. gene expression or RNAseq) broke the Grid View display if no values were found for a query subset. An empty set of values is now returned, and the Grid View panel displays correctly.

Biomart domain

Column widths for organism and species in biomart tables are set to the number of characters in a VARCHAR2 or character varying column. In earlier versions they were double the number of characters.

Experiment objects are extended to support storing data for newly enabled fields in the Browse tab.

Folder management

The description column is expanded to 4000 characters. This is used as the description for programs, studies, assays, analyses, folders and files.

Date values are supported to document the creation and update dates for Browse tab items. Date item subtypes can be 'DATE' (date only), 'TIME' (date and time) and 'CURRENTTIME' (use the current time as the value).

Date values can be automatically updated without being displayed by adding the prefix 'HIDDEN' to the type (e.g. 'HIDDENCURRENTTIME').
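
As an illustration of how these subtype strings might be interpreted, here is a minimal Python sketch. Only the subtype names and the 'HIDDEN' prefix come from the notes above; the function names `parse_date_subtype` and `resolve_value` are hypothetical, and the actual tranSMART implementation may differ.

```python
from datetime import datetime

# Subtype names taken from the release notes; everything else is illustrative.
KNOWN_SUBTYPES = {"DATE", "TIME", "CURRENTTIME"}


def parse_date_subtype(subtype):
    """Split a subtype such as 'HIDDENCURRENTTIME' into (hidden, kind)."""
    hidden = subtype.startswith("HIDDEN")
    kind = subtype[len("HIDDEN"):] if hidden else subtype
    if kind not in KNOWN_SUBTYPES:
        raise ValueError(f"unknown date subtype: {subtype!r}")
    return hidden, kind


def resolve_value(kind, entered, now):
    """Return the datetime to store for a date item (hypothetical helper).

    'CURRENTTIME' ignores the entered value and uses `now`;
    'DATE' keeps only the date part; 'TIME' keeps date and time.
    """
    if kind == "CURRENTTIME":
        return now
    if kind == "DATE":
        return entered.replace(hour=0, minute=0, second=0, microsecond=0)
    return entered
```

For example, `parse_date_subtype("HIDDENCURRENTTIME")` yields a hidden item of kind 'CURRENTTIME', which would be refreshed on save without being shown.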

Galaxy export

The postgreSQL driver is updated from 42.2.2.jre7 to 42.6.0.

The Oracle driver is updated from ojdbc7:12.1.0.1 to ojdbc8:23.2.0.0.

IpaApi

The org.apache.httpcomponents.httpclient library is updated from 4.4.1 to 4.5.13.

The org.apache.logging.log4j.log4j-core library is updated from 2.17.1 to 2.20.0.

The commons-io library is updated from 1.3.2 to 2.7.

MyDAS

The junit library is updated from 4.7 and 4.11 to 4.13.2.

The mysql-connector-java library is updated from 5.1.12 to 8.0.20.

The commons-collections library is updated from 3.2.1 to 3.2.2.

The org.apache.httpcomponents.httpclient library is updated from 4.4.1 to 4.5.13.

The org.apache.logging.log4j.log4j-core library is updated from 2.17.1 to 2.20.0.

The org.mortbay.jetty library is updated from 6.1.0 to 6.1.23.

The mysql.mysql-connector-java library is updated from 5.1.12 to 8.0.28.

Rmodules

Images for Heatmaps are increased in size. Font sizes are corrected to display (where possible) all subjects/samples and probes/genes.

PDF generation is updated using alternative libraries to fix size, content and visibility issues.

New options allow the user to specify a pixel per cell value for heatmaps. In most cases the new calculated default will give clear results.

Heatmaps and other workflows have a correction for a case of data with duplicate records to use the mean value. Previous releases used R's default correction 'length' which broke the results. In general, data should not have such duplicate values.
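
The difference between the two corrections can be shown with a small Python sketch (the workflows themselves are written in R; this is only an illustration). The record layout and the function name `aggregate_duplicates` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate_duplicates(records, how="mean"):
    """Collapse (subject, probe, value) records sharing a (subject, probe) key."""
    groups = defaultdict(list)
    for subject, probe, value in records:
        groups[(subject, probe)].append(value)
    if how == "mean":
        # Behaviour described for this release: duplicates are averaged.
        return {key: mean(values) for key, values in groups.items()}
    if how == "length":
        # Effect of the older behaviour: the number of records replaces the data.
        return {key: len(values) for key, values in groups.items()}
    raise ValueError(f"unsupported aggregation: {how}")


records = [
    ("S1", "probeA", 2.0),
    ("S1", "probeA", 4.0),  # duplicate record for the same subject and probe
    ("S2", "probeA", 5.0),
]
```

With 'mean', subject S1 keeps a meaningful expression value; with 'length', every cell would hold a record count rather than a measurement, which is why the earlier results were broken.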

SampleTypes and Timepoints are compiled into lists in the High Dimension popup window under Advanced Workflows.

A new error message is generated if the analysis job disk space is full, or the job directory cannot be created.

The org.rosuda.Rserve library is updated from 1.7.3 to org.rosuda.REngine.Rserve 1.8.1.

The com.google.guava library is updated from 19.0 to 32.0.1-jre.

Search domain

No change.

SmartR

Code changes to support changed formats in the output of the most recent releases of R.

The rserve library is updated from 1.7.3 to 1.8.1.

The com.google.guava library is updated from 19.0 to 32.0.1-jre. The JobTasksService is updated to use the new guava version.

For testing, jasmine-ajax is updated from 3.2.0 to 3.3.1 and jasmine-core from 2.6.0 to 3.6.0; karma is updated from 0.13.19 to 6.3.17; karma-jasmine is updated from 0.3.6 to 2.0.1; karma-phantomjs-launcher is updated from 0.2.3 to 1.0.4.

Spring Security Auth0

No change.

Transmart Core API

The com.google.guava library is updated from 19.0 to 30.1.1-jre.

The junit library is updated from 4.11 to 4.13.2.

Transmart Core Db

The com.google.guava library is updated from 19.0 to 30.1.1-jre.

The junit library is updated from 4.11 to 4.13.2.

All organism names are fixed at 100 characters.

Peptide names are fixed at 200 characters.

Where a gene symbol (locus name) is found twice, the numeric gene id is also used to distinguish the genes.

Transmart Core Db Tests

Tests are updated to use the new tranSMART-specific query tables.

Transmart Custom

The custom preloaded image files now load with a version number.

Transmart Fractalis

No change.

Transmart Gwas Plugin

No change.

Transmart Java

The oracle jdbc driver is updated to ojdbc8 23.2.0.0.

Transmart Legacy DB

No change.

Transmart Metacore Plugin

No change.

Transmart MyDAS

No change.

Transmart REST API

The postgreSQL jdbc driver is updated from 42.2.2.jre7 to 42.6.0.

The oracle jdbc driver is updated from ojdbc7 12.1.0.1 to ojdbc8 23.2.0.0.

Transmart Shared

No change.

Transmart XNAT Importer

No change.

Transmart XNAT Viewer

No change.

TransmartApp web interface

In the navigation tree, testing for an Editable node is performed earlier to display i2b2 demo data with the correct icon.

Now displays the '>>' icon to reveal the navigation tree after hiding it with the '<<' icon.

The postgreSQL driver is updated from 42.2.2.jre7 to 42.6.0.

The Oracle driver is updated from ojdbc7:12.1.0.1 to ojdbc8:23.2.0.0.

The com.google.guava library is updated from 19.0 to 30.1.1-jre.

The org.rosuda.Rserve library is updated from 1.7.3 to org.rosuda.REngine.Rserve 1.8.1.

The org.apache.poi libraries are updated from 3.1-FINAL to 5.0.0.

The internal jquery-ui javascript library is updated from 1.10.4 to 1.11.1.

The GenePattern and gp-modules libraries are updated from an old unnamed version to 3.9.0.

A new Config.groovy variable com.recomdata.transmartSummary defines a customizable welcome message for the Login and Analysis pages.

A new Config.groovy variable org.transmart.i2b2.view.enable allows the display of i2b2 data as study 'I2B2' by an admin user.

Query management code is updated to use the new tranSMART-specific query tables, allowing i2b2 exclusive use of its own query tables.

Study IDs are maintained in a new table, replacing the generated materialized view of earlier releases.

Concept node counts are retrieved from the new table i2b2metadata.tm_concept_counts which replaces i2b2demodata.concept_counts for compatibility with i2b2 metadata.

A patient sex of 'x' or 'unknown' is displayed as an empty string rather than a null value.

Code to handle certain early (closed source) tranSMART studies as special cases has been removed.

Multi-line strings are replaced by simple strings for generated SQL statements.

In editing a Browse tab study, the PubMed URL is automatically generated from a PubMedId, and the DOI URL is generated for a DOI code.
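
A minimal sketch of this kind of URL generation, in Python for illustration only. The URL patterns below are the public PubMed and doi.org resolver forms; whether tranSMART generates exactly these strings is an assumption, and the function names are hypothetical.

```python
def pubmed_url(pubmed_id):
    """Build a PubMed link from a numeric PubMed ID."""
    pmid = str(pubmed_id).strip()
    if not pmid.isdigit():
        raise ValueError(f"not a numeric PubMed ID: {pubmed_id!r}")
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


def doi_url(doi):
    """Build a resolver link from a DOI code such as '10.1000/xyz123'."""
    code = doi.strip()
    # Accept an optional 'doi:' prefix, which users often paste in.
    if code.lower().startswith("doi:"):
        code = code[4:].strip()
    if not code.startswith("10."):
        raise ValueError(f"not a DOI code: {doi!r}")
    return f"https://doi.org/{code}"
```

Generating the URL from the identifier means that only the ID or DOI has to be entered and kept up to date.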

Icons for sorting columns in ascending or descending order are corrected.

The script to build the plugins and the transmart.war file is updated.

On the Browse tab, metadata for a study has been extended. A study now includes data for:

Item                    Type       Description
OverallDesign           Free text  Overall design text from GEO
StudyTarget             Text       Study target description
StudyETLid              Text       Identifies data source, e.g. a GEO accession or an internal identifier
NumberOfSamples         Text       Number of samples. Like NumberOfSubjects, this is a text field that usually contains an integer
StudyStartDate          Date       Date in yyyy-mm-dd format for the start of the study
StudyCompletionDate     Date       Date in yyyy-mm-dd format for the completion of the study
StudyPersonName         Text       Name of the primary contact
StudyPersonRoles        Text       Roles of the primary contact
StudyPersonContact      Text       Contact details of the primary contact
StudyPersonInstitution  Text       Institution of the primary contact
StudyPersonAddress      Text       Address of the primary contact
StudyEntryTime          DateTime   Defaults to current time when the browse entry is created
StudyLastUpdate         DateTime   Defaults to current time when an update is saved

Dates are entered as year-month-day to avoid ambiguities and conflicts between US and international date formats (e.g. 9/10/11).
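
The ambiguity can be demonstrated with a short Python sketch using the standard `datetime` module. The helper name `parse_study_date` is hypothetical; the point is only that a year-month-day string has a single reading, while '9/10/11' has at least two.

```python
from datetime import date, datetime


def parse_study_date(text):
    """Parse a date entered as yyyy-mm-dd; raises ValueError otherwise."""
    return datetime.strptime(text, "%Y-%m-%d").date()


ambiguous = "9/10/11"
# US convention: month/day/year
us_reading = datetime.strptime(ambiguous, "%m/%d/%y").date()
# Common international convention: day/month/year
intl_reading = datetime.strptime(ambiguous, "%d/%m/%y").date()
```

Here `us_reading` is 10 September 2011 and `intl_reading` is 9 October 2011, whereas `parse_study_date("2011-09-10")` can only mean 10 September 2011.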

A Program can define a contact name as text. This is disabled by default. Previous versions had separate first, middle and last names for the program contact.

Transmart data

Database schema

The tranSMART database schema is now fully compatible with the latest release of i2b2 (1.7.13). Future releases of both platforms will appear together. All i2b2 tables, procedures and functions are included in a clean tranSMART 19.1 installation. The i2b2 demodata can optionally be loaded.

The i2b2 webclient and server can be installed using the tranSMART database, loaded with i2b2 data.

The additional schemas and tables used by tranSMART are retained and updated.

A new table stores the StudyId and top node for each tranSMART study. In previous releases this was a view rather than a table. As both values are defined as parameters for any new study it is much simpler to store them. Searching for the top node was time consuming, especially on very large databases.

The query management tables for tranSMART have been renamed by changing the QT_ prefix to a QTM_ prefix. This allows both tranSMART and i2b2 to run on the same database safely. We hope to recombine the tables, and share queries and results, in a future release.

Some tranSMART tables giving an overall summary across all studies have been moved from i2b2demodata to i2b2metadata. This is a more appropriate schema, following the i2b2 examples.

The database installation scripts have been tested on postgreSQL versions up to 15.3. We expect they will work on any version from 9.5 upward.

The database installation scripts have been tested on Oracle versions 19.3 and 12.1 with partition support.

The database includes new schemas (i2b2hive, i2b2imdata, i2b2pm, i2b2workdata) to support a full i2b2 database. All i2b2-specific tables have their initial contents defined as for a standard i2b2 installation.

Oracle index names have been updated to match i2b2 (1.7.13).

The generated 'vars' file now includes a variable TRANSMARTDATA set to the location of the vars file which should be in the transmart-data top level directory.

Schema amapp

Tables defining metadata terms for Browse tab annotation.

New terms for Study and Program are defined, supporting all the originally developed Study terms.

These metadata terms are supported by the new browse and program ETL targets.

Schema biomart

Maximum length for an organism name is extended to 200 characters in the annotation table.

Primary keys are set to IDs, and previous primary keys are changed to unique keys, to clean up tables from early releases.

Schema biomart_stage


Schema biomart_user


Schema deapp


Schema fmapp


Schema galaxy



Schema i2b2demodata

Updated to include all tables in a standard i2b2 1.7.13 installation.

The default password for this user/role is changed to the i2b2 default 'demouser'.

Schema i2b2hive

This standard i2b2 schema is new to tranSMART in 19.1.

Includes all tables in a standard i2b2 1.7.13 installation.

Preloaded with the same metadata as a new i2b2 installation.

Schema i2b2imdata

This standard i2b2 schema is new to tranSMART in 19.1.

Includes all tables in a standard i2b2 installation.

Schema i2b2metadata

Updated to include all tables in a standard i2b2 1.7.13 installation.

The default password for this user/role is changed to the i2b2 default 'demouser'.

Schema i2b2pm

This standard i2b2 schema is new to tranSMART in 19.1.

Includes all tables in a standard i2b2 1.7.13 installation.

Preloaded with the same metadata as a new i2b2 installation.

Schema i2b2workdata

This standard i2b2 schema is new to tranSMART in 19.1.

Includes all tables in a standard i2b2 1.7.13 installation.

Schema searchapp

Search keyword data is updated.

Schema tm_cz

Tables with a category_cd column are consistent sizes for each data type.

Schema tm_lz

Table lz_src_clinical_data was populated with a copy of the initial data state but was never used. This table is now ignored in clinical ETL.

Schema tm_wz

Tables with a category_cd column are consistent sizes for each data type.

Schema ts_batch

No change in this release.

Data dictionaries

The pre-loaded metadata from Entrez (genes for human and mouse) and Medline (disease MeSH terms) has been updated.

Additional model organism species have been defined to support the COVID-19 annotation project. The most popular cell lines have been added as species, for example 'HeLa' or 'Vero', as a cell line is not the same as a whole organism.

Configuration

New placeholders are provided for org.transmart.i2b2.view.enable to enable viewing i2b2 data, and for com.recomdata.transmartIntro and com.recomdata.transmartWelcome to display alternative HTML text on the Browse and Login pages. transmartIntro replaces the default Browse tab introductory text. transmartWelcome is an additional paragraph that appears below.

New configuration options com.recomdata.category.all.* can be set false to exclude a category (e.g. gene) from the default ALL category search.

New configuration options com.recomdata.category.hide.* can be set true to drop a category from the pull-down search menu.
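
The effect of these two groups of options can be sketched as follows. This is Python for illustration only (the real settings live in the tranSMART configuration); the lowercase category suffixes, the default values and the function names are assumptions.

```python
ALL_PREFIX = "com.recomdata.category.all."
HIDE_PREFIX = "com.recomdata.category.hide."


def searchable_in_all(categories, config):
    """Categories included in an ALL search (assumed to default to included)."""
    return [c for c in categories if config.get(ALL_PREFIX + c, True)]


def menu_categories(categories, config):
    """Categories shown in the pull-down menu (assumed to default to shown)."""
    return [c for c in categories if not config.get(HIDE_PREFIX + c, False)]


# Hypothetical settings: exclude genes from ALL, hide pathways from the menu.
config = {
    "com.recomdata.category.all.gene": False,
    "com.recomdata.category.hide.pathway": True,
}
categories = ["gene", "disease", "pathway"]
```

Note that the two settings are independent in this sketch: a category excluded from ALL can still be selected from the menu and searched specifically.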

All known configuration parameters are described on the Admin panel Config page. Any other parameters appear at the bottom of that page.

Database comparison script

The ora-pg-compare.pl script is extended to compare PostgreSQL and Oracle database definitions for tranSMART and i2b2.

This script is used to keep database definitions for both platforms (tranSMART and i2b2) in sync.

Utility scripts

New script savetable.sh saves a .csv file (tab-delimited) from any table. Its parameters are the schema and the table.

New script loadtable.sh loads data for any table from a .csv (tab-delimited) file.

New script update_sequences.sh checks all table definitions for columns with default values that depend on a sequence. Default values and triggers are checked. With no parameters, it simply reports sequence values that are not consistent with the largest column value. The -update switch updates sequences so that they generate the next value for any dependent column. Special processing is used for concept codes with a 'TM' prefix. The script is intended for use by developers to test changes to initial data loads, but can be used by any site where data may have been copied into tables with values that can exceed the current sequence. In such cases, if the sequence is not increased, it will eventually hit an existing value and an update will fail.
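
The check performed by the script can be sketched in Python as below. This is not the script itself: the data structures and function names are hypothetical, sequences are assumed to increment by 1, and the special handling of 'TM' concept codes is not modeled.

```python
def stale_sequences(sequences, column_values):
    """Report sequences whose current value is below the largest dependent value.

    sequences: {sequence_name: current_value}
    column_values: {sequence_name: [values found in dependent columns]}
    Returns {sequence_name: (current_value, largest_column_value)}.
    """
    report = {}
    for name, current in sequences.items():
        largest = max(column_values.get(name, []), default=None)
        if largest is not None and current < largest:
            report[name] = (current, largest)
    return report


def updated_sequences(sequences, column_values):
    """Return a copy with stale sequences moved up to the largest column value,
    so that the next generated value is one above any existing value."""
    fixed = dict(sequences)
    for name, (_, largest) in stale_sequences(sequences, column_values).items():
        fixed[name] = largest
    return fixed


# Hypothetical state after rows were copied in with explicit IDs.
sequences = {"seq_a": 10, "seq_b": 500}
column_values = {"seq_a": [7, 42], "seq_b": [120]}
```

In this example only `seq_a` is reported, because a row with ID 42 already exists while the sequence is still at 10; left alone, the sequence would eventually generate 42 and the insert would fail.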

R and Rserve

The version of R is updated to 4.3.1.

The version of Rserve called to run advanced workflows is updated to 1.8-12.

Setting up the systemd rserve service is updated to define outputs to the system logfile and to support running as a tomcat 9 user.

Other packages with specific versions in the install scripts are updated: QDNAseq is updated to 1.36.0.

SolR

In earlier releases, Solr was downloaded from a legacy server hosted by one of the developers. Solr is now downloaded from the official distribution site.

DB Doc

SchemaSpy is updated to 6.2.4 and tested on Postgres and Oracle. This version uses java11. An environment variable can be set if another java version is the current default.

Transmart Test

The geb testing library is updated from 2.1 to release 4.1.

The selenium version is updated from 3.11.0 to 3.141.59.

The groovy version is updated from 2.4.0 to 2.5.14.

The spock core version is updated from 1.1-groovy-2.4 to 1.3-groovy-2.5.

The com.google.guava library is updated from 24.1-jre to 30.1.1-jre.

The junit version is updated from 4.12 to 4.13.1.

The surefire version is downgraded from 2.21.0 to 2.19 following geb examples.

The org.apache.commons.commons-lang3 library is downgraded from 3.7 to 3.5 following geb examples.

The chrome driver is added.

Scripts

Installation

Install scripts

The install scripts have been rewritten to make them more general for any Unix operating system. A single script Scripts/Install/InstallTransmart.sh now automatically detects the operating system and version, and loads appropriate packages to support tranSMART, Rserve, solr, Kettle, and other components.

Each operating system uses different names for various packages, or supports different versions. We maintain a list for each system that we support, and can add more as required.

Supported versions include Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, ... (more to be added)

Docker instances

Five docker images are in preparation to be used as alternatives to a full install. These are intended to get new users up and running quickly. For a production system a full installation is strongly recommended.

A PostgreSQL database using docker can easily lead to loss of the database content. A regular backup strategy is strongly advised (database dump, and backup of docker files, retaining source files for data loads) to recover from loss of data.

Name              Description
transmart-app     Server running transmartApp and gwava under tomcat8, with the current online transmartmanual
transmart-db      PostgreSQL 14 database running on port 5432 with an optional configuration script to fine-tune the PostgreSQL server
transmart-load    Data loading (ETL) scripts to fetch from the transmartfoundation library and other sources and install studies using Pentaho Data Integration (Kettle) 9.3
transmart-rserve  Fully installed Rserve with all required additional packages from R and BioConductor running on port 6311 as a non-privileged user
transmart-solr    SolR indexing for the Browse tab, Sample Explorer and other searches running on port 8983 as a non-privileged user

Libraries

The various libraries included in the transmartApp web interface and the database utilities have been updated to their latest versions. When issues arise which have security implications we are automatically notified through GitHub and can apply and distribute updates, usually a simple updated transmart.war file.

Particular attention was paid to issues with the log4j2 library (earlier tranSMART releases used the obsolete log4j1 library), and to updating the drivers for PostgreSQL and Oracle.

Data loading

The third-party tools for loading earlier releases are no longer maintained by their developers. Major improvements have been made to the standard (Kettle) ETL procedures. Further improvements are planned in coming releases.

Available datasets

'make update_datasets' now reports the number of datasets and the number of studies for each listed source. By default the library.transmartfoundation.org server is searched. Other servers and local directories/servers can be added to the d

Debugging

Two new values can be defined in the database table tm_cz.etl_settings.

Paramname 'DEBUG' with value 'yes' will repeat any messages written to the tm_cz.cz_audit_log and cz_error_log tables as a NOTICE in the postgresql log file (and on the terminal if run interactively). This has the great advantage that output from a failed run is still saved in the log file, whereas the updates to the audit and error tables are lost.

Paramname 'CLEANTABLES' with value 'no' will leave temporary working tables in place so that they can be reviewed after loading. They will be emptied at the start of a fresh ETL procedure.

Kettle release

TranSMART ETL procedures use Pentaho Data Integration (Kettle) Community Edition 9.3. The providers have moved the Kettle download site, breaking previous tranSMART releases.

We have tested the more recent Kettle 9.4 but found there are major changes that break the scripts used by tranSMART ETL procedures. We will continue to check the status of this and any future releases.

The volume of log messages has been reduced to make it easier to check for success or for any error conditions.

Log file processing is further simplified by a new script that automatically searches for files to load for a study, and parses the log files to report any errors in the input data or elsewhere.

Kettle scripts have been reviewed and reformatted to make support easier.

Study loading in a single step

A new target 'all' searches for all data types for a study.

As loading proceeds, the full log files are stored and a summary is reported to the terminal.

Browse tab data is loaded first, then clinical data, then any reference annotation, and finally expression, rnaseq and any other data types.
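
The ordering described above can be sketched in Python as a simple priority sort. The target names used here are assumptions made for the example, not necessarily the names of the actual make targets.

```python
# Load order from the notes above; target names are illustrative.
LOAD_ORDER = ["browse", "clinical", "ref_annotation", "expression", "rnaseq"]


def order_targets(found):
    """Sort the data types found for a study into loading order.

    Types not listed in LOAD_ORDER are loaded last, in the order found.
    """
    rank = {name: position for position, name in enumerate(LOAD_ORDER)}
    return sorted(found, key=lambda target: rank.get(target, len(LOAD_ORDER)))
```

Because Python's sort is stable, any data types outside the fixed list keep their discovery order at the end of the sequence.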

The search currently relies on finding the study targets in the public datasets feed. This ensures that there is an external (or local) source for the study files so that they can be purged and re-fetched if data is to be reloaded. We hope to support simply adding local *.tar.xz files under samples/studies in a future release. For 19.1 you can add your study files to a local server (http or ftp), create a local datasets_index file, and add its URL to the end of the sample/studies/public_feeds file, then run 'make update_datasets'.

Clinical data loading

Loading clinical data can be many times faster in some cases. A number of bottlenecks have been identified and resolved. In older releases, studies with large numbers of concepts (nodes in the tree) and large numbers of patients could take a very long time to load, and would also slow the loading of any further studies. Both the initial slow loading and the impact on later studies have been resolved.

Further speedup was achieved by rewriting the code to count the number of subjects for each node. In some large cases this was 100-fold faster.

Intermediate processing steps were made faster by using more efficient transformations, and by changes to the way working tables are indexed.

One example clinical dataset that required 30 days in previous releases can now be loaded in around 30 minutes.

We recommend running on a recent PostgreSQL release to take advantage of improvements in query optimization. We have seen slow performance on very old versions, though they are still supported by the tranSMART code.

Loading subject data also loads the PATIENT_DIMENSION and PATIENT_MAPPING tables, and the ENCOUNTER_MAPPING tables. These are also populated for other data types if a subject is created.

Annotation platform loading

Previous releases required multiple ref_annotation parameter files for a study with multiple microarrays, or multiple data types.

Ref_annotation parameters now support a PLATFORMS parameter which can have a list of platforms. If PLATFORMS is defined then any PLATFORM_DATA_TYPE is ignored and taken instead from the annotation parameter file for each platform.
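
The precedence rule can be sketched in Python as follows. Several details are assumptions made for the example: the list separator (commas and/or whitespace), the single-platform key name PLATFORM, the in-memory dictionaries standing in for parameter files, and the platform names and data type values.

```python
import re


def resolve_platform_types(study_params, annotation_params):
    """Return a list of (platform, data_type) pairs for one study.

    study_params: parameters from the study's ref_annotation file.
    annotation_params: hypothetical mapping of platform name to the
    parameters read from that platform's annotation parameter file.
    """
    listed = study_params.get("PLATFORMS", "").strip()
    if listed:
        # Assumption: platforms are separated by commas and/or whitespace.
        platforms = [p for p in re.split(r"[,\s]+", listed) if p]
        # Any PLATFORM_DATA_TYPE in study_params is deliberately ignored here.
        return [(p, annotation_params[p]["PLATFORM_DATA_TYPE"]) for p in platforms]
    # Single-platform case: hypothetical PLATFORM key, type from the study file.
    return [(study_params["PLATFORM"], study_params["PLATFORM_DATA_TYPE"])]


# Example values only.
annotation_params = {
    "GPL570": {"PLATFORM_DATA_TYPE": "Gene Expression"},
    "GPL11154": {"PLATFORM_DATA_TYPE": "RNASEQ"},
}
```

With PLATFORMS set to a list of two platforms, each platform picks up its own data type, so one parameter file can cover a study that mixes microarray and sequencing platforms.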

Gene Expression (Microarray) data loading

On postgres, with datatype 'T' the partition table is created if not already present.

Although supported by past releases, on Oracle partitioning was turned off because older Oracle versions did not all support partitioned tables.

For Oracle 19.3 it was re-enabled for mrna microarray data when the tables are created.

RNAseq data loading

There are major performance improvements in this release for rnaseq data. The SQL code for PostgreSQL has been reviewed and updated, especially for very large input datasets.

In previous releases, genes/loci with zero counts were ignored when log values were calculated, and the zeros were loaded as NULL values. Zero counts are now set to a very small value. A very low value indicates a known lack of transcripts and generates a valid low log intensity value. These markers now appear normally on heatmaps rather than white (missing).
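
The change can be illustrated with a short Python sketch. The replacement value and the log base are not stated in these notes, so `SMALL_VALUE = 0.001` and base 2 are assumptions, and the function names are hypothetical.

```python
import math

# Assumption: the actual small value used by the ETL is not given in the notes.
SMALL_VALUE = 0.001


def old_log_intensity(count):
    """Earlier behaviour: a zero count has no log value (stored as NULL)."""
    return math.log2(count) if count > 0 else None


def log_intensity(count, small_value=SMALL_VALUE):
    """Behaviour described for 19.1: zero is replaced by a very small value."""
    if count < 0:
        raise ValueError("read counts cannot be negative")
    return math.log2(count if count > 0 else small_value)
```

A zero count now produces a strongly negative but valid log value, so the marker is drawn at the low end of the heatmap color scale instead of being left blank.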

Kettle scripts are fixed to support multiple input data files for rnaseq.

RBM data loading

The RBM loading scripts on postgres used a different sequence to the table definition when creating IDs. This is now corrected. Unless some other ETL system was also in use this would not have caused a problem in previous releases.

Browse tab loading

In previous releases the Browse Tab was intended to be populated manually by an administrator user.

In this release, new targets are added to populate the browse data for a study, and to load the Program data that it should be stored under.

The Study metadata has been extended to include all fields currently defined. Many were ignored in previous releases.

The current set of transmart foundation and other library studies will be extended to provide browse tab content automatically generated from GEO with a little manual intervention (e.g. classifying the disease).

In 19.1 the Program title has to be used as the fixed identifier when loading study Browse targets. We aim to add a ProgramID parameter in a future release to allow the Program Title to be edited - useful to change the order in which the programs appear.

Concept counts

In the transmartApp web interface each concept in the navigation tree has an associated subject count. When loading very large data sets the counting of concepts was very inefficient in previous releases. Performance was also poor on databases with a large set of existing data.

A complete rewrite of the procedures resulted in a 100-fold improvement in execution time.
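
To show what is being counted, here is a minimal Python sketch of per-node subject counts. It is not the database procedure used by tranSMART; the function names and the sample paths are hypothetical. Each observation contributes its patient to the leaf concept and to every ancestor node, and a patient is counted once per node.

```python
from collections import defaultdict


def ancestors_and_self(path):
    """Return every node path from the top node down to `path` itself."""
    parts = [part for part in path.split("\\") if part]
    return ["\\" + "\\".join(parts[:depth]) + "\\" for depth in range(1, len(parts) + 1)]


def count_subjects(observations):
    """Count distinct patients per concept node in a single pass.

    observations: iterable of (patient_id, concept_path) pairs.
    """
    subjects = defaultdict(set)
    for patient_id, concept_path in observations:
        for node in ancestors_and_self(concept_path):
            subjects[node].add(patient_id)
    return {node: len(patients) for node, patients in subjects.items()}


observations = [
    (1, "\\Study\\Demo\\Age\\"),
    (2, "\\Study\\Demo\\Sex\\"),
    (1, "\\Study\\Demo\\Sex\\"),
]
counts = count_subjects(observations)
```

In this example the top node counts two subjects, even though three observations were loaded, because patient 1 appears under two different concepts.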

A review of the calls to these procedures has reduced the occasions where nodes are recounted.

I2b2 data

Data for tranSMART is loaded as discrete studies, each under their own top node and recorded in the new trial node table (replacing a calculated view).

Any other data is labeled as 'I2B2' and treated as a single 'I2B2' study accessible only by admin users.

The i2b2 demo dataset can be loaded after the tranSMART database has been created with:

        make -j 1 -C i2b2demodata/postgres load

(For an Oracle database, replace 'postgres' with 'oracle'.)

The i2b2 demo data has been updated from the original i2b2 1.7.13 set, for example to resolve an issue with patients' calculated ages no longer matching the age ranges which had been set several years earlier.

transmart-etl

The postgreSQL jdbc driver has been updated from 42.2.2.jre7 to 42.6.0.

The Oracle driver has been updated to ojdbc8 release 23.2.0.0.

The junit library is updated from 4.11 to 4.13.2.

The org.apache.logging.log4j.log4j-core library is updated from 2.17.1 to 2.20.0.

transmart-ICE

The postgreSQL jdbc driver has been updated to 42.6.0.

The Oracle driver has been updated to ojdbc8 release 23.2.0.0.

transmart-batch

The postgreSQL jdbc driver has been updated to 42.6.0.

The Oracle driver has been updated from release 14 to ojdbc8 release 23.2.0.0.
