Release 19.0 (May 2020) is a major update to the previous release 16.3.
This version is renumbered to reflect the year of release, and to indicate the major effort in rewriting and reorganizing code.
This is also the first official tranSMART release to be compatible with the parallel i2b2-transmart project. The intention is for i2b2-transmart to use this release, perhaps with a small number of additions to integrate with their latest changes in other codebases.
Installation instructions for tranSMART 19.0 are in preparation at Install the tranSMART 19.0 release
A test version of the full release is available at http://postgres-test.transmartfoundation.org/transmart
Details of the beta test server data and of the features to be tested are in the beta test public instance wiki page
TranSMART 19.0 code is reorganized in a single repository https://github.com/tranSMART-Foundation/transmart
The top level directories are merged copies of the tranSMART 16.3 repositories with minor changes.
The combined repository makes release branches simple to manage. A single branch in the main repository can be used to generate all the artifacts for a release distribution.
The directories mirror the original organization of the source code for the tranSMART 17 project to simplify direct comparisons.
The test code previously under transmart-core-db/transmart-core-db-tests has been relocated to its own top-level directory.
This makes building and testing simpler, and was also the organization chosen by the transmart 17 server-only project.
The two RNAseq datatypes are better separated in the core code.
The release 16.3 tranSMART-ETL repository has been renamed to all lower case. Unused legacy directories have been purged from the repository, greatly reducing the size of the zip file generated for each release.
The transmart-extensions plugin in release 16.3 has been split into its three components:
This also reflects the rearrangement in the tranSMART 17 project.
The old release 16.x 'blend4j-plugin' is renamed galaxy-export-plugin.
Throughout the code the name 'blend4j' has been replaced to make the functionality clear.
SmartR was developed by the eTRIKS project and released in tranSMART 16.2 with a set of interactive analysis workflows that supersede many of the functions of the "Advanced Workflows" tab.
We are testing new SmartR workflows developed by other partners in the eTRIKS project to provide the remainder of the "Advanced Workflows" functionality.
The Advanced Workflows remain active in this release. We anticipate that users will require them in order to reproduce previous analysis, and they can be used to compare results and encourage migration to SmartR.
Fractalis was developed for i2b2-transmart by the same author as SmartR (Sascha Herzinger at the University of Luxembourg) and superseded several of the SmartR workflows.
We are working on the full integration of Fractalis into tranSMART 19.0. This is a work in progress, involving new ETL interfaces between tranSMART and Fractalis.
The revised transmart manual is added as help pages under the default URL /transmartmanual. Links from the web interface bring up the appropriate section in a new tab.
Additional configuration parameters were added by the old transmartPro project to link to external help pages. These remain available for sites that have extended tranSMART (for example by adding their own local advanced workflows) so that they can be linked to local help pages.
Help links have been added to pages where they were missing in previous releases, including the Comparison, Summary, and GridView tabs under Analyze.
The Gene Signature tab provides a way to maintain lists of genes, SNPs or probeIDs to define high-dimensional analyses (heatmaps etc.). These have been updated and tested. A new stored procedure is added to load platforms to match gene lists (the platform must be in the right table, with the species defined, for the gene signature to pick up the required genes).
Gene Signatures have rich metadata to document their origins, but these depend heavily on the available concepts and metadata in the installed database.
Scripts are in preparation to add further metadata concepts.
If there is interest, we could automate the addition of missing platforms when they are first added to load new studies.
Gene signatures can be made public to make them visible and usable by any user.
A public set of gene signatures can be a helpful addition to a tranSMART installation where a set of markers is of special local interest.
Gene Lists are a simple version of the Gene Signatures.
Validation tests have been improved to mark any gene, SNP or probe that is not found in the database. It remains up to the user to check the markup before saving as the interface only saves these lists on successful validation.
The safest way is to load from a file of gene names or IDs as this is easier to edit and reload.
A set of extra pages is available to users logged in as administrators via the Admin tab.
Lists the metadata for the tranSMART web application.
Tests the Solr server is up and reports the number of items under each category.
The page was added in 16.3 and is extended in 19.0. All configuration values are retrieved, categorized and reported.
Known parameters are documented with a description and a report of any assumed default value.
All other known values are added to the appropriate tables.
Any additional values are appended to an 'Unknown' table at the bottom of the page. These may be temporary variables from the Config.groovy script or possible errors in parameter names to be corrected.
Some customization options have been added to the database in 19.0.
These include storing image files and icons so they can be updated without restarting.
Further customization options are planned for future releases. Please contact the developers if you have suggestions for features you would like to see.
Multiple versions of jQuery have been replaced by a single version across all plugins.
Browser console warnings have been fixed for Firefox.
Timing issues in the Analyze tab have been addressed. One that remains is the failure to display the current Query on the Advanced Workflow tab. Visiting another tab and returning to the Advanced Workflows ensures the tab is fully initialized and the query is then visible.
Location and timing of the definition for drag-and-drop in Analyze tab sub-pages has been updated.
The Rserve service is updated to provide better control for installations installing R in transmart-data.
The template can also be used for local R installations.
Whichever is used, the Admin "Check support connections" page will test that all the required R packages are available.
The R installation requires the latest Rserve from Rforge.Net as the current version in R has a fatal datatype error with some workflows. This has been an issue in R for at least two years.
The Rserve service template writes to a logfile so that debugging output from R and error messages can be traced more easily when an analysis has issues.
A template is provided to install a service to launch solr.
Apache solr is used to index and search the Browse tab metadata, the navigation tree and the SampleExplorer.
The solr server writes to a logfile to help tracing issues.
The solr server also provides help through its administration interface.
The database schema has been updated to resolve, as far as possible, differences with the i2b2 1.7.12 schemas.
Where columns are date or time values in i2b2 and string values in transmart, they have been corrected to the appropriate date or time values. Initial testing found no conflicts in ETL procedures.
Column widths have been defined to be the same size across multiple tables (e.g. subject_id). Local installations are free to increase these sizes if they require longer strings but should beware of potential impact on database performance.
Required columns are updated in transmart so that any columns required by i2b2 in a common table are also required in transmart. One date needed to be defined in ETL procedures using an obvious default value. No other impacts have been noted in initial testing.
Triggers are required for some tables in i2b2. Although postgres can be configured automatically to increment unique identifiers for new rows in a table, the i2b2 code may include a call to increment a named sequence to generate a unique id value when a row is inserted. This necessitates defining a sequence with this name and using the same sequence in a trigger function to maintain database integrity.
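The sequence-plus-trigger pattern described above can be sketched as follows. This is a minimal illustration and not the actual tranSMART or i2b2 DDL: the schema `demo`, the table `example`, its `id` column, and the sequence, function and trigger names are all hypothetical. The shell function only prints the SQL, so it can be reviewed before being passed to psql.

```shell
#!/usr/bin/env bash
# Sketch only: every object name below (demo.example, demo.example_seq,
# demo.tf_example_id, trg_example_id) is hypothetical.
sequence_trigger_sql() {
    # The quoted 'SQL' delimiter stops the shell from expanding $$ in the body.
    cat <<'SQL'
CREATE SEQUENCE demo.example_seq;

CREATE FUNCTION demo.tf_example_id() RETURNS trigger AS $$
BEGIN
    -- Draw from the named sequence only when no id was supplied, so that
    -- application code calling nextval() itself shares the same sequence.
    IF NEW.id IS NULL THEN
        NEW.id := nextval('demo.example_seq');
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_example_id
    BEFORE INSERT ON demo.example
    FOR EACH ROW EXECUTE PROCEDURE demo.tf_example_id();
SQL
}
```

The output could then be piped to psql, for example `sequence_trigger_sql | psql`, once the names have been replaced with real ones.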
Integers in postgres are defined as type 'int' unless extremely large values are expected. Values that can exceed 1 billion are defined as type 'bigint'.
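As a small illustration of this rule, the sketch below picks a column type from the largest value expected. The 1 billion threshold is taken from the text above; it leaves headroom below the hard limit of the Postgres 'int' type, which is 2147483647. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: choose a Postgres integer type from the largest expected value.
# Threshold of 1 billion as stated in the release notes; 'int' itself
# overflows above 2147483647 (2^31 - 1).
pg_integer_type() {
    local max_expected="$1"
    if [ "$max_expected" -gt 1000000000 ]; then
        echo "bigint"
    else
        echo "int"
    fi
}
```

For example, `pg_integer_type 250000` selects `int`, while `pg_integer_type 5000000000` selects `bigint`.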
Four additional i2b2 schemas have been included:
Though not used (currently) by tranSMART the aim is to create a database that can be populated and used as an i2b2 platform.
I2b2 does not define tablespaces by default. A set of tablespace names were agreed with the i2b2-transmart developers for implementation in the tranSMART schema.
All i2b2 tables are in tablespace I2B2 and all i2b2 indices are in tablespace I2B2_INDEX.
These reflect the tablespaces TRANSMART and INDX used for the transmart-specific tables and indexes.
Earlier tranSMART releases defined 3 additional tablespaces: biomart, deapp and search_app. These were no longer used, though they were still created with a new database.
They have been removed from tranSMART 19.0. For this release they are ignored by the tranSMART code.
New tables have been added to the schemas used by tranSMART 16.3.
Support for transmart_batch on Oracle is included by default. In tranSMART 16.3 this had to be installed before transmart_batch could be used with an Oracle database.
We are working on including support for tMDataLoader on Oracle and Postgres by default. This involves running the respective installation scripts and incorporating the changes into the Postgres and Oracle schema definitions, including the tMDataLoader specific versions of ETL functions.
The usernames and passwords for the database roles are defined for Kettle when the database is created. By default the username and password are the same as the schema. While this is simple for developers, production instances should define unique passwords for each role.
By adding these as environment variables when launching the database creation target the Kettle properties files will be populated. We recommend retaining the role names as these are used in many places in the code. The Kettle properties files and the transmart-data/vars file should be secured from read access by other users.
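One way to remove read access for other users is sketched below. The helper name is hypothetical, and which files need protecting depends on the installation; the paths in the usage note are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: restrict a credentials file to its owner (read/write, mode 600).
# Which files to pass is site-specific.
secure_file() {
    local path="$1"
    if [ ! -f "$path" ]; then
        echo "skipping missing file: $path" >&2
        return 1
    fi
    chmod 600 "$path"
}
```

Typical calls might be `secure_file transmart-data/vars` and `secure_file "$HOME/.kettle/kettle.properties"`. The second path is Kettle's usual default location and may differ on a given system.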
A limited set of usernames is defined for this release, as for previous releases. These have a default password, usually 'transmart2016' except for username 'admin' with default password 'admin'. These are simple for developers.
Production instances should change the passwords through the transmartApp web interface using the 'Admin' tab to manage the available user accounts.
Notwithstanding this, there is also a configuration option (turned off by default) to enable a guest login that by default will allow a visitor to a transmart server to login as username 'guest'. This is useful for public servers to provide access without setting up a user-specific login.
If using this option, changing to another username (to become an 'admin' user) requires logout through the 'Utilities' menu, or explicitly using the login page URL.
A new database (for Postgres or Oracle) is defined as a target in transmart-data.
A common set of data is loaded for both databases. This includes a set of standard ontologies (or data dictionaries). In this release these are:
The data from these sources is included in the searchapp.search_keywords table with synonyms in the searchapp.search_keyword_terms table.
Further data dictionaries can be loaded, and the data for the above dictionaries can be updated, using the loader utility under transmart-etl.
|Pathways||KEGG||Last public release||biomart.|
This release is built using Grails 2.5.4.
Grails can be installed using:
sdk install grails 2.5.4
Earlier releases up to 16.3 were built using Grails 2.3.11 and Grails 2.3.7
Each directory has a build script.
Grails components use the ./grailsw script with the targets package-plugin (or packagePlugin) and maven-install (or mavenInstall)
Maven builds use the ./gradlew script with targets clean, build and publishToMavenLocal
In two components where these scripts are not yet available, the build uses Maven directly.
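The choice between the two wrapper scripts can be sketched as a small helper. This is an illustration, not part of the release: the function name is hypothetical, and it assumes that the presence of an executable `grailsw` or `gradlew` in a directory is enough to decide how that directory is built. Some components may need other targets.

```shell
#!/usr/bin/env bash
# Sketch: report the build command for one component directory.
# Assumption: an executable ./grailsw marks a Grails plugin and an
# executable ./gradlew marks a Gradle build; anything else is built by hand.
build_command() {
    local dir="$1"
    if [ -x "$dir/grailsw" ]; then
        echo "./grailsw package-plugin && ./grailsw maven-install"
    elif [ -x "$dir/gradlew" ]; then
        echo "./gradlew clean build publishToMavenLocal"
    else
        echo "manual"
    fi
}
```

The function prints the command rather than running it, so the result can be checked for each directory before starting a full build.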
For two external packages mydas and IpaApi, cd to the directory and build with:
The recommended build order to ensure dependencies are satisfied by previously built components is:
|transmart-shared||transmart-shared||-||New in 19.0. Provides generic utilities to return security information about the current user to remove the need to pass the username around in service calls.|
|transmart-fractalis||transmart-fractalis||-||New in 19.0. Adds the Fractalis interactive workflow tab, integration is a work in progress.|
The original DAS code was imported into tranSMART because there is no guarantee the original distribution will remain available.
A small section of code is intended to be autogenerated, but this is easy to maintain by hand if any of the dependencies change.
|dalliance||dalliance||-||The dalliance genome browser.|
Moved to a new directory for 19.0.
Tests for the transmart-core-db code, also used by SmartR and transmartApp.
Moved to a new directory for 19.0.
Moved to a new directory for 19.0. Domain definitions for the large number of tables in the biomart schema.
Moved to a new directory for 19.0.
Domain definitions for tables in the searchapp schema.
This covers keyword searching and user, role and access management as both are defined in the searchapp schema.
New in 19.0. Provides services to customize aspects of the user interface using new application tables.
Documentation is needed for these new capabilities.
Provides all the Advanced Workflows.
There is also a dependency on data loaded into the searchapp schema to define the inputs, outputs, parameters and scripts for each workflow and to define their names and the order in which they are presented. This was intended in the pre-open source versions of tranSMART to allow users/administrators to edit these settings, but this makes little sense to control fixed scripts distributed in the transmart.war file. There is no known instance of a site developing their own Advanced Workflows through these mechanisms.
New in 19.0. A new Auth0 controller and services.
Documentation is needed for these new capabilities.
|IpaApi||IpaApi||-||This code provides one third-party SmartR workflow to interface to Ingenuity Pathway Analysis. SmartR includes hooks to load the IpaApi workflow|
|Several potential new workflows from the eTRIKS project are candidates for inclusion.|
Directory renamed from blend4j-plugin for 19.0.
A version of data export that transfers to an instance of Galaxy for further analysis.
The user needs credentials to use the galaxy instance, defined in the server Config.groovy file.
|This is the full transmart server, providing all the functions of the User Interface plus the access methods for the RESTful API to generate and serve authentication tokens and to serve results when these tokens are presented.|
Grails 2.5.4 uses Java 8.
Development is using the openjdk java 8. We will confirm later the suitability of oracle java 8 which is only available from third-party sources for the Ubuntu 18 test systems.
There is no longer a need for a legacy Java 7 install to work on tranSMART development.
The pivot utility in the Kettle ETLs has been recompiled with Java 8.
This involves a major reorganization of the source files and changes to hardcoded file paths.
This upgrade is a major step towards preparing for an upgrade to Grails 3 or the newly released Grails 4 in a future tranSMART release.
A major code review was conducted by Burt Beckwith at Harvard as part of the inclusion of tranSMART 16.1 and 16.2 code in the i2b2-transmart project.
The planned changes were described in the i2b2-transmart roadmap and are summarized below.
A new directory spring-security-auth0 provides Auth0 services.
Coding standards have been applied to groovy code:
All source files that used Log4j and calls to log.info:
Replace configuration parameter references with @Value definitions using org.springframework.beans.factory.annotation.Value
Add @Autowired references with org.springframework.beans.factory.annotation.Autowired
Throughout the code of release 16.3 types were undefined. Adding explicit types wherever possible provides validation of the types actually passed and improves the usefulness of error messages when code breaks.
A single method can replace a set of identical code segments making testing and maintenance far more robust.
Domain classes have been reviewed and matched to the updated database schemas.
The new SecurityService in transmart-shared provides calls to return information on users and roles for implementing access policies.
A new directory transmart-shared provides utility functions. These include generic checks for the capabilities of the authenticated user, allowing the username to be removed from many method calls where it was being passed down.
Release 19.0 is built using grails 2.5.4 which depends only on java 8.
Test updates in transmart-core-db-tests
One test currently fails. It is testing something that is supposed to fail, but should be trapping the error condition and reporting a test success.
Closures are defined as methods with a set of parameters, replacing the closures and parameter fetching in earlier releases.
Much of the code remains unchanged within the method aside from parameter handling and other coding standard changes (see above).
Parameters are explicitly defined in each method
This gives cleaner code where it is obvious what parameters are used and what parameters are available to control the result.
Groovy Servlet Pages (.gsp files) cleaned up to avoid interpreted code critical to functionality
Standard indenting of HTML within GSP pages.
Especially in Oracle code, tranSMART 16.3 defined synonyms to allow reference to tables without the schema.
Explicit schema references are cleaner.
They also make it possible to derive the permissions needed for functions/procedures to operate across schemas.
Many unused methods were retained because it is difficult to be certain they will never be invoked.
The level of testing undergone by transmart 19.0 makes this an ideal time to remove these methods and check they were indeed redundant.
However, some removed methods in the i2b2-transmart code were unused on Oracle but were required when running on Postgres and have been reinstated. Examples include handling large objects as strings.
Simplifying the access logging code
This requires further testing to make sure the required tranSMART functionality is supported.
These jar files are removed from the code repository. They are downloaded through code dependencies in BuildConfig.groovy or pom.xml.
These jar files remain in the repository. They may be removed later.
Many classes and methods now have @CompileStatic. Testing found very few instances where the annotation had to be removed.
StringBuilder is used to create a string and to append to it using the '<<' syntax.
See the organization of the new single transmart repository
Filters are now in grails-app/filters/org/transmart...
All SQL statements in the code need careful testing to check they work for both Oracle and Postgres
Testing is needed to ensure that code functions as in earlier releases.
Oracle is fully supported using the same release (12.1 or 12.2) as previous versions of tranSMART.
Testing relies on an Oracle Docker instance.
This release has been tested on Postgres up to 9.6, and on Postgres 10, Postgres 11 and Postgres 12. No version-specific issues have been identified.
To date no attempt has been made to take advantage of the new partitioning features in more recent Postgres versions. We continue to monitor these and will consider supporting them in a future release. It is likely that legacy support for the current Postgres schema will be continued in tranSMART.
TranSMART does not support SQL Server.
Upgrading the schemas to include SQL Server versions of the tables and stored procedures is relatively straightforward.
Upgrading the source code to include support for a third database would require significant work, but would also test and clean many sections of code with obvious benefits to the quality and robustness of tranSMART.
Changes have been made to support installation on Ubuntu 18.04, Ubuntu 16.04 and Ubuntu 14.04.
Automated install scripts have versions for each Ubuntu version with only limited divergence. For example, Ubuntu 18.04 uses tomcat 8.
Ubuntu installation targets in transmart-data are updated as appropriate (for example, a different PHP version is available in Ubuntu 16).
Scripts for Ubuntu 18.04 are updated with system libraries installed to cover dependencies for installation of R packages.
TranSMART 19 is being tested on Ubuntu 20 (released in Spring 2020).
TranSMART 19 is being tested on Fedora 32 (released end-April 2020). Code has been built and tested on Fedora 31.
Releases up to 16.3 were tested only with Kettle 4.2.
tranSMART now supports Kettle versions up to pdi-ce-184.108.40.206. We will test Kettle 9.0 as part of the final release process.
Only minor updates to Kettle scripts were needed to satisfy an additional validation. These had prevented upgrading Kettle in earlier tranSMART releases.
New targets are in preparation.
Changes introduced into tranSMART 16 supported loading clinical and all high dimensional data in one step through a series of scripts and a new parameter file for the TraIT Cell-Line Use Case project.
These scripts can be added to a structured set of directories to support the loading of all data for a study in a single step.
Should there be a failure at any point, going to the appropriate directory and running the load script there will resume just that part of the load after the issue is resolved.
A set of utility scripts has been made into a load target to create a new Program under the Browse tab.
The disease and therapeutic area fields are validated on loading.
A set of utility scripts to load study metadata for the Browse tab can now be invoked as load_browse targets.
The input files can be created for studies in the existing curated data library.
The input data includes the text from GEO (reduced to 2000 characters), disease and therapeutic area, number of patients, citation details, study type and objectives, etc.
The program must be loaded before the study
A potential load target can add Assay metadata into the Browse tab using scripts in preparation.
The platform information should be validated against the database ontologies before loading.
A set of utility scripts is in preparation to load sample data into the Sample Explorer tab.
The input files can be made available for studies in the existing curated data library.
In GEO, samples have limited information, but this at least includes the sample ID and organism.
Cleanup of SQL source code
Loading raw high-dimensional data in earlier versions could take a very long time. A SQL statement testing whether log intensity could be calculated for each raw intensity was creating very large loads on the server.
Preprocessing the raw intensities to identify usable values allowed this step to be simplified. Log intensity values are now calculated on a simple pass through the data using very low resources.
A missing condition in a SQL statement caused RNAseq gene expression to load extremely slowly, and to consume very large memory and tmp space resources. No other datatypes were affected.
Previous releases loaded high-dimensional data (Microarray mRNA expression, RNAseq counts, etc.) with columns labelled as TISSUE_TYPE, SAMPLE_TYPE and TIMEPOINT.
The stored procedures all reversed the meaning of the first two columns internally. This is corrected in release 19.0.
As recent releases removed the ability to select on these columns when launching analyses, the issue has not been noticed.
A future release may restore the picking of sample and tissue types, and of timepoints when launching heatmaps and other workflows.
The libraries of curated studies at library.transmartfoundation.org will be reviewed to ensure these terms are in the correct column.
Consistent usage will be applied to tissue types, timepoints, and the sample treatments across these 200+ studies.
The libraries of curated studies at library.transmartfoundation.org have a variety of representations for common terms in the clinical data tree.
These studies will be reviewed to conform to a common set of terms to make it easier to work with multiple studies in tranSMART. Terms in common use (e.g. 'Medical History') should appear in the same place for each study.
Previous releases have been hard to debug when loading data using Kettle. A number of issues are addressed in tranSMART 19:
Kettle logging level can be set with an environment variable KETTLE_LOG_LEVEL with possible values
The value is passed to Kettle as the -level parameter
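As a rough sketch of how the variable maps onto the Kettle option, the helper below builds the `-level=` argument. The function name is hypothetical, and the fallback to `Basic` when the variable is unset is an assumption of this sketch rather than documented tranSMART behaviour.

```shell
#!/usr/bin/env bash
# Sketch: turn KETTLE_LOG_LEVEL into Kettle's -level option.
# Assumption: fall back to 'Basic' when the variable is unset or empty.
kettle_level_arg() {
    echo "-level=${KETTLE_LOG_LEVEL:-Basic}"
}
```

A job could then be launched with something like `kitchen.sh -file=/path/to/job.kjb "$(kettle_level_arg)"`, where the job path is a placeholder.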
When ETL procedures run for a very long time (see notes above for high-dimensional data, but also an issue for some very large clinical data loads) it is difficult in earlier tranSMART releases to identify the step causing problems.
Although tranSMART ETL procedures log each step to the audit tables, this logging is part of the ETL transaction. If the transaction should fail or if the ETL job is canceled the logging data will also be lost.
The audit log utilities in tranSMART 19 can also write to the database log file. This is an immediate write and output can be followed while the ETL job is running. An added benefit is that when run from the command line the log output is also printed to the console. This provides an immediate report of the audit messages so an inspection of the code can indicate which step is currently running.
To set up this additional logging, create a row in the new tm_cz.etl_settings table:
psql -c "insert into tm_cz.etl_settings (paramname,paramvalue) values ('debug','yes')"
A second parameter is tested to skip the deletion of temporary tables so that their content can be inspected after an ETL job has run. The tables will be cleared at the start of a new ETL job for the same datatype.
psql -c "insert into tm_cz.etl_settings (paramname,paramvalue) values ('cleantables','no')"
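The two settings above can be wrapped in small helpers that print the SQL to run. This is a sketch: the function names are hypothetical, and it assumes that deleting a row returns the corresponding behaviour to its default, which should be checked against the stored procedures before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: print the SQL that sets or clears a tm_cz.etl_settings parameter.
# Values are placed in the statement verbatim, so pass only trusted input.
etl_setting_on() {
    printf "insert into tm_cz.etl_settings (paramname,paramvalue) values ('%s','%s');\n" "$1" "$2"
}

# Assumption: removing the row restores the default behaviour.
etl_setting_off() {
    printf "delete from tm_cz.etl_settings where paramname = '%s';\n" "$1"
}
```

For example, `etl_setting_on debug yes | psql` would enable the extra logging, and `etl_setting_off debug | psql` would remove the row again after debugging.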
Many messages reported “loading” with a row count for the previous step in earlier tranSMART versions.
All such messages should report the end of the step with its row count and description.
Several stored procedures reported errors, and returned an error status, but this was ignored by the calling procedure.
Release 19.0 checks the return status for all calls that have a return value.
The job_id and log_base values were reported with a large number of leading zeroes.
The datatypes used as outputs by stored procedures and the Kettle scripts have been changed and specific formats introduced to report only the integer value.
Kettle jobs have been pretty-printed to make the XML easier to read.
The return values from stored procedures have been standardized as zero for success and any other value for a failure. The tests in Kettle have been modified where the meaning of zero and one has been changed.
Stored procedures called during ETL by Kettle and other ETL systems have been reviewed and updated.
As noted above, return values of non-zero now all indicate an error condition.
Many instances were found of procedures (functions in postgres) called by other procedures with no tests for their return values. In all cases these are now tested and will cause immediate action (usually a return with an error) by the calling procedure.
An example was an error in the calculation of log intensities and zscore values for some high-dimensional datatypes failing silently.
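The same "zero is success, anything else is failure" convention can be applied by any script that calls an ETL step. The sketch below is a shell analogue of the check, not the stored procedure code itself; the function name and message format are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: apply the zero-for-success convention to a status value
# returned by an ETL step, and stop the caller on any other value.
check_status() {
    local step="$1" status="$2"
    if [ "$status" -ne 0 ]; then
        echo "ERROR: $step returned $status" >&2
        return 1
    fi
    echo "OK: $step"
}
```

A calling script would capture the value returned by a step and use `check_status "step name" "$status" || exit 1`, so that a failure is never silently ignored.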
All platform annotation files for all datatypes should include only one platform.
RNAseq expression data now uses a named platform. Loading the platform annotation populates gene names and gene IDs as for Expression platforms.
Where the probe ID is an Ensembl gene ID a single platform for each species can cover all supposed GEO platforms. For RNAseq data GEO records the sequencing technology and the organism as the platform.
We plan to add incremental updating of these platforms on a per-study basis to catch any additional probe IDs a study may use where there is no Ensembl Gene ID defined, and also to catch the addition of new IDs by Ensembl after the original platform load.
Tests for missing gene ID and gene name information are more efficient in this release. In earlier versions loading RNAseq platform data as an anonymized incremental update could take considerable time.
tranSMART has two sets of RNAseq ETL procedures. One is for gene-based expression counts, the other is for expression mapped to chromosome position. They were implemented around the same time for the tranSMART 16.1 release.
The names are defined internally in several places in the source code. They have been made clearer in tranSMART 19. The internal name RNASEQ_COG (developed by Cognizant for Sanofi) is used by RNAseq expression counts.
The postgresql script tm_cz.i2b2_move_study missed many of the changes needed to rename or move a study. The updated script requires two inputs: the original path of the top node for the study and the new path. Any new nodes are automatically created. The function takes an additional jobId parameter which is NULL when run from the command line.
psql -U tm_cz -W -c "select* from tm_cz.i2b2_move_study('\Public Studies\Asthma_Barczak GSE34466\', '\Public Studies\Asthma\Barczak GSE34466', NULL)"
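Because study paths contain backslashes and may contain quotes, it can help to build the statement with a helper instead of typing it inline. The sketch below is one way to do that; both function names are hypothetical. It relies on the Postgres default `standard_conforming_strings = on`, under which backslashes inside a quoted string are taken literally.

```shell
#!/usr/bin/env bash
# Sketch: build the i2b2_move_study call with safely quoted paths.

# Wrap a value in single quotes, doubling any embedded single quote.
sql_quote() {
    local q="'"
    printf "'%s'" "${1//$q/$q$q}"
}

# Print the statement; the job id is passed as NULL, as for command-line use.
move_study_sql() {
    printf 'select * from tm_cz.i2b2_move_study(%s, %s, NULL);\n' \
        "$(sql_quote "$1")" "$(sql_quote "$2")"
}
```

The printed statement can be piped to psql, for example `move_study_sql "$old_path" "$new_path" | psql -U tm_cz -W`, where the two variables hold the original and new top-node paths.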