IntLIM: Integration through Linear Modeling

Introduction

Interpretation of metabolomics data is very challenging. Yet it can be eased through integration of metabolomics with other ‘omics’ data. The IntLIM package, which includes a user-friendly RShiny web app, aims to integrate metabolomics data with transcriptomic data. Unlike other approaches, IntLIM is focused on understanding how specific gene-metabolite associations are affected by phenotypic features. To this end, we develop a linear modeling approach that describes how gene-metabolite associations are affected by phenotype. More information can be found on our publication “IntLIM: integration using linear models of metabolomics and gene expression data”The workflow involves the following steps: 1) input gene expression/metabolomics data files, 2) filter data sets by gene and metabolite abundances and imputed values, 3) run the linear model to extract FDR-adjusted interaction p-values, 4) filter results by p-values and Spearman correlation differences, and 5) plot/visualize specific gene-metabolite associations.

Installation

IntLIM is an R package and can be run on version >= 3.2.0.

Download (or upgrade) R here: https://cloud.r-project.org/

RStudio (an interface to R than can make R easier to use) can be download here (not required): https://www.rstudio.com/products/rstudio/download3/

Prior to installing IntLIM, it is necessary to have the the Bioconductor package MultiDataSet (Hernandez-Ferrer et al, 2017).

The following command then installs MultiDataSet.

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("MultiDataSet")

The function install_github() from the “devtools” package (Wickham and Chang, 2015) installs IntLIM directly from. To install IntLIM, enter the following:

install.packages(devtools)
install_github(“/mathelab/IntLIM”)

IntLIM can be loaded using the library function

library(IntLIM)

Inputting Gene Expression and Metabolite Abundances

The first step is importing in the gene expression and metabolite data. To this end we have provided a sample data file for users wishing to use IntLIM.

IntLIM requires a specific format for gene expression and metabolite data. For IntLIM, we require the metabolite data, gene expression data, metabolite meta data, gene meta data, and sample meta data. We also need a CSV meta-file that lists the location of the other files. These need to be in the same folder. The formats are described below. In addition, we provide a sample set of files.

Please be sure that all files noted in the CSV file, including the CSV file, are in the same folder. Do not include path names in the filenames.

Users need to input a CSV file with two required columns: ‘type’ and ‘filenames’.

The CSV file is expecected to have the following 2 columns and 6 rows:

type,filenames
metabData,myfilename
geneData,myfilename
metabMetaData,myfilename (optional)
geneMetaData,myfilename (optional)
sampleMetaData,myfilename"

The transcriptomic and metabolomic data and meta-data is stored in a series of comma-separated-values (.CSV) files. The 5 files consist of normalized/log2-transformed metabolite abundance data, metabolite feature meta data, log2-transformed gene expression data, gene expression feature data, and sample meta data. A meta-file lists the location of the other 5 fives. This file is input into IntLIM. These files should ideally be in the same folder. Of note, input data files should be in a specific format:

File type	Description
metabData	rows are metabolites, columns are samples
geneData	rows are genes, columns are samples
metabMetaData	rows are metabolites, features are columns
geneMetaData	rows are genes, features are columns
sampleMetaData	rows are samples, features are columns

For the metabData and geneData files, the first row contains the feature ids and the first column contains the sample ids.

For the sampleMetaData, the first column of the sampleMetaData file is assumed to be the sample id, and those sample ids should match the first row of metabData and geneData (e.g. it is required that all sample ids in the metabData and geneData are also in the sampleMetaDatafile).

Additionally, the metabMetaData, geneMetaData, and SampleMetaData need to contain an ‘id’ column that contains the name of the features (metabolite or gene) or sample (sample id, name, etc).

A small data set is embedded in a package. This consists of the National Cancer Institute-60 (NCI-60) cell line data with a reduced number of genes for faster calculation. To access it use the following commands. csvfile describes the location of the meta-file describing the location of the other 5 input files.

     dir <- system.file("extdata", package="IntLIM", mustWork=TRUE)
     csvfile <- file.path(dir, "NCItestinput.csv")
     csvfile

## [1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/IntLIM/extdata/NCItestinput.csv"

Through the ReadData() function, users input the above .csv files containing normalized and log2-transformed gene expression data, normalized and log2-transformed metabolite abundance data, (optional) gene expression meta data, (optional) metabolite meta data, and sample meta data. The gene expression data, meta data, and sample meta-data are input into an ExpressionSet object. The metabolite abundance data, meta data, and sample data is input into a new MetaboliteSet object, a new eSet object designed to contain metabolomics data (Hernandez-Ferrer, et al., 2017). Both objects are integrated into a MultiDataSet object- a multi-‘omics object allowing integration of eSet objects from different ‘omics data sets. metabid and geneid describe the column name in the metabolite meta adata and the gene expression data to select as the identifer for that particular feature.

inputData <- IntLIM::ReadData(inputFile = csvfile,metabid='id',geneid='id')

## [1] "CreateMultiDataSet created"

The ShowStats() function allows the user to summarize the metabolomic and gene expression data (how many total genes, metabolites, and samples in each data-set as well as samples in common). We have a data-set that consists of 20 cell lines (samples), 1448 genes, and 257 metabolites.

IntLIM::ShowStats(IntLimObject = inputData)

##   Num_Genes Num_Metabolites Num_Samples_withGeneExpression
## 1      1448             257                             20
##   Num_Samples_withMetabolomics Num_Samples_inCommon
## 1                           20                   20

Filtering and Observing Data

Optionally, the FilterData() function allows the user to filter out the features (genes or metabolites) based on their mean values. Users should input a percentile cutoff and any feature with a mean value below that cutoff will be removed. Furthermore we can filter out metabolites by percentage of missing or imputed values. For the analysis, we filter out the genes with the lowest 10% of gene expression and metabolites with more than 80% imputed values. This is done as below. We henceforth have 1303 genes and 212 metabolites for 20 cell line samples.

inputDatafilt <- IntLIM::FilterData(inputData,geneperc = 0.10, metabmiss = 0.80)

## [1] "No metabolite filtering by percentile is applied"

IntLIM::ShowStats(inputDatafilt)

##   Num_Genes Num_Metabolites Num_Samples_withGeneExpression
## 1      1303             212                             20
##   Num_Samples_withMetabolomics Num_Samples_inCommon
## 1                           20                   20

The PlotDistribution() function allows users to produce a boxplot of the distribution of gene expression and metabolite abundance data. This is done as below.

IntLIM::PlotDistributions(inputData = inputDatafilt)

Prior to running the model, the user can also perform a principal component analysis of the gene expression and metabolite data using the PlotPCA() function. The stype command allows the user to select a column from the sample meta data that color-codes the samples into two categories (two cancer types, tumor vs. non-tumor, etc). For the sample set, we select PBO (prostate/breast/ovarian cancers) vs. the leukemia group.

IntLIM::PlotPCA(inputData = inputDatafilt,stype = "PBO_vs_Leukemia")

Run IntLIM

The linear models are run by the RunIntLim() function. The stype command allows the user to select a column from the sample meta data for the two categories to be compared (two cancer types, tumor vs. non-tumor, etc). Currently, IntLIM only supports comparison of two categories (“PBO_vs_Leukemia” is selected). The resulting object from the analysis is an IntLimResults object containing slots for un-adjusted and False Discovery Rate (FDR)-adjusted p-values for the “(g:p)” interaction coefficient. A significant FDR-adjusted p-value implies that the slope of gene-metabolite association in one phenotype is different from the other. The RunIntLim() function is based heavily on the MultiLinearModel functions developed for the ClassComparison package part of oompa (http://oompa.r-forge.r-project.org).

myres <- IntLIM::RunIntLim(inputData = inputDatafilt,stype="PBO_vs_Leukemia")

## [1] "Running the analysis on"
## 
## Leukemia      PBO 
##        6       14 
## [1] "10 % complete"
## [1] "20 % complete"
## [1] "30 % complete"
## [1] "40 % complete"
## [1] "50 % complete"
## [1] "60 % complete"
## [1] "70 % complete"
## [1] "80 % complete"
## [1] "90 % complete"
## [1] "100 % complete"
##    user  system elapsed 
##   8.236   0.098   8.336

The DistPValues() function allows the user to observe a histogram of the p-values prior to adjustment

IntLIM::DistPvalues(IntLimResults = myres)

The pvalCorrVolcano() function allows users to observe a volcano plot comparing the Spearman correlation difference between groups to the –log10(FDR-adjusted p-value)

IntLIM::pvalCorrVolcano(inputResults = myres, inputData = inputDatafilt, diffcorr = 0.5, pvalcutoff = 0.05)

## [1] "259297 gene-metabolite pairs found given cutoffs"

Process the results and filter pairs of genes-metabolites based on adjusted p-values and differences in correlation coefficients between groups 1 and 2. Then plot heatmap of significant gene-metabolite pairs

Filter Results

The ProcessResults() function allows the user to filter the results by FDR p-values (default set at 0.05) and by the absolute value difference of the gene-metabolite Spearman correlation (default set at 0.50) between the two groups. The output is a list of gene-metabolite pairs and gene-metabolite Spearman correlations for each of the two groups

myres <- IntLIM::ProcessResults(inputResults = myres, inputData = inputDatafilt, pvalcutoff = 0.10, diffcorr = 0.5)

## [1] "2721 gene-metabolite pairs found given cutoffs"

The CorrHeatmap() produces a clustered heatmap of the gene-metabolite Spearman correlations for each of these groups.

IntLIM::CorrHeatmap(myres)

Visualize and Export Results

A PlotGMPair() function allows the user to plot a chosen gene-metabolite association for selected groups. An example is shown below for DLG4 vs. (p-Hydroxyphenyl)lactic acid.

Plot a pair of interest:

IntLIM::PlotGMPair(inputDatafilt,stype="PBO_vs_Leukemia","DLG4","(p-Hydroxyphenyl)lactic acid")

Of note, plots are generated using Highcharter (http://jkunst.com/highcharter/) and Plotly (Sievert, et al.) (https://plot.ly), which enables interactive visualization, allowing users to promptly assess the effect of changing parameters on analysis results and accelerating discovery of phenotype-specific gene-metabolite pairs. This will greatly allow the workflow to be accessible to non-computational biologists.

The OutputData() and OutputResults() function allows the users to output the data and results of the analyses into zipped CSV files.

Lastly, various writing functions are implemented. The OutputData() and OutputResults() function allows the users to output the data and results of the analyses into zipped CSV files.

IntLIM::OutputData(inputData=inputDatafilt,filename="~/FilteredData.zip")
OutputResults(inputResults=myres,filename="~/MyResults.csv")

ShinyApp User Interface

A Shiny App embedded in the package provides a user friendly interface for running IntLim (https://shiny.rstudio.com). This can be run by calling runIntLIMApp() below.

    runIntLIMApp()

References

Siddiqui JK, Baskin E, Liu M, Cantemir-Stone CZ, Zhang B, Bonneville R, McElroy JP, Coombes KR, Mathé EA. IntLIM: integration using linear models of metabolomics and gene expression data. BMC bioinformatics. 2018 Dec;19(1):81.

Hernandez-Ferrer, C., et al. MultiDataSet: an R package for encapsulating multiple data sets with application to omic data integration. BMC bioinformatics 2017;18(1):36.

Sievert, C., et al. plotly: Create Interactive Web Graphics via’plotly. js’. 2016. R package version 3.6. 0. In.

Wickham, H. and Chang, W. devtools: Tools to make developing R code easier. R package version 2015;1(0).