To perform the ETL processes, you can use either Kitchen or Spoon. Kitchen is the command-line application that reads the Kettle scripts defining the necessary transformations and executes them. Spoon provides a graphical interface for viewing a script and setting its parameters. Under the hood, Spoon calls Kitchen with the appropriate parameters. The folder transmart-ETL/Kettle/Postgres contains example startup scripts for use in a bash shell. The sections below first describe the required input files and then the general procedure for loading a script and setting its parameters.
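
As an illustration of what such a startup script looks like, here is a minimal sketch; the Kitchen location, job file, and parameter name are placeholders rather than the actual transmart-ETL scripts, and the options used are explained in section 4.2.

    #!/bin/bash
    # Minimal startup-script sketch: a thin wrapper around a Kitchen invocation.
    # The paths and names below are placeholders for your own installation and job.
    KITCHEN=/opt/pentaho/data-integration/kitchen.sh

    $KITCHEN -file=/path/to/example_job.kjb \
             -param:EXAMPLE_PARAM=example_value \
             -log=example_job.log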

4.1 Input Files

In order to perform ETL, you will need to generate one or more input files. All input files should be tab-delimited text files. The sections below describe each of these files and the meaning of all columns. Input file columns that are marked as optional are not required to contain data; however, the columns themselves (i.e. the tab characters) should be there (a quick way to check this is sketched after Figure 5).

 

Note: You are only allowed to omit entire columns from the file if they are the last columns.

For example, the column mapping file consists of four required columns and four optional ones. Even if you will not use columns five, seven, and eight, you need to have a file with six columns in total. This is illustrated in Figure 5.

The column headers are only considered in the data files: in clinical data files they can be used as data labels, and in gene expression data files they identify the sample. For all other input files, i.e. the ones containing configuration rather than data, only the column position is considered. The sections below clearly state for each file whether a header row is required.

Filename              Category          ColNb  DataLabel  (unused)  CtrlVocabCode
GSEXXXX clinical.txt                    1      SUBJ_ID
GSEXXXX clinical.txt  Demographics+Age  2                           L-123
GSEXXXX clinical.txt                    3      OMIT

(a)

Filename-->Category-->ColNb-->DataLabel-->-->ControlVocabCode
GSEXXXX clinical.txt-->-->1-->SUBJ_ID-->-->
GSEXXXX clinical.txt-->Demographics+Age-->2-->-->-->L-123
GSEXXXX clinical.txt-->-->3-->OMIT-->-->

(b)

Figure 5: Example of a column mapping file with three unused columns. The final two columns can be omitted from the file; column five, however, should be included. (a): Structure of the example. (b): Tab-separated file. The right arrows indicate tab characters.
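
Since a missing tab character is an easy mistake to make, it can be worth verifying the column count of an input file before loading it. A quick sanity check in a bash shell might look as follows (a sketch; the file name is just a placeholder). Because the field separator is set explicitly to a tab, empty trailing columns are counted as long as their tab characters are present, so a well-formed file should print a single number.

    # Print the distinct column counts found in a tab-delimited file.
    awk -F'\t' '{ print NF }' clinical_data.txt | sort -u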

4.2 Loading A Script

When invoking Kitchen, you should pass at least the -file=filename option, where filename refers to the Kettle script that you wish to execute. In most cases you will also need to pass several -param options, which set parameters that are used in the script. The syntax for this option is -param:PARAM_NAME=PARAM_VALUE. You can instruct Kitchen to write its output to a log file with the -log=logfile option. The default logging level is Basic. If you are having trouble uploading data, you can increase the logging level to e.g. Debug by passing the -level=Debug option. You can check the full list of available logging levels at

http://wiki.pentaho.com/display/EAI/Kitchen+User+Documentation#KitchenUserDocumentation-Setthelogginglevel
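
Putting these options together, a full invocation might look as follows. This is a sketch only: the job file, parameter names, and values are placeholders, not the actual transmart-ETL scripts or their parameters.

    # Run a Kettle job with two script parameters, writing Debug-level output to a log file.
    ./kitchen.sh -file=/path/to/example_job.kjb \
                 -param:EXAMPLE_PARAM_1=value1 \
                 -param:EXAMPLE_PARAM_2=value2 \
                 -log=example_job.log \
                 -level=Debug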

Spoon works in nearly the same way, but provides a graphical interface for setting the script parameters. In the main window the script itself is graphically represented, so it can be viewed and edited. Start up Spoon by executing the spoon.bat (on Windows) or spoon.sh (on Linux) script. Choose Open under the File menu and navigate to the script you wish to execute. Click on the Run this job icon in the toolbar or choose Run from the Action menu (see Figure 6). A dialog similar to the one in Figure 7 should appear. The section marked in green represents the parameters to be passed to the script and should be filled out manually. These parameters depend on the job and are discussed in the following sections. The section marked in red shows global variables. The values for global variables used by the current script are taken from your kettle.properties file and are shown in this section. Above the variables section there is a drop-down box marked in blue that enables you to change the logging level.
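
If a global variable does not show the value you expect, it can help to inspect the kettle.properties file directly. In a default Kettle setup it is assumed to live in the .kettle directory under your home directory (or under $KETTLE_HOME, if that variable is set):

    # Show the global variables Kettle picks up (default location assumed).
    cat ~/.kettle/kettle.properties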


Figure 6: Starting a job with Spoon

Figure 7: Configuring a job with Spoon
