Introduction

ALTRE (ALTered Regulatory Elements) is an R software package that streamlines and simplifies the analysis of data generated from genome-wide chromatin accessibility assays such as DNase-seq (a.k.a. DHS-seq), FAIRE-seq, and ATAC-seq.

Regulatory elements (REs) are involved in regulating gene transcription – they control genes and pathways that can be investigated as putative therapeutic targets, or they may serve as targets themselves. Chromatin accessibility assays allow us to identify the location of REs genome-wide by identifying “open” chromatin (i.e. euchromatin). Identifying REs that differ between cell types, such as cancerous and noncancerous cell lines and tissues, holds promise for discovering new mechanisms involved in cellular development and disease progression.

Workflow

ALTRE takes as input a CSV file which provides the names and metadata information for the alignment files (BAM format) and peak files (BED format) of samples to be analyzed. The software then takes users through defining altered peaks using both binary data (presence/absence of peaks) and quantitative data (peak intensity). Moreover, the package includes functions that merge and annotate peaks (e.g. as TSS-proximal and TSS-distal), perform enrichment analysis for REs of interest, and provide visualization functions. The analysis can be conducted through the R command line or through an RShiny web interface for those who are not familiar with the R statistical language. This vignette further explains the purpose of the package and establishes the workflow.

Graphing analysis results

Results of this analysis will be displayed in a number of graphs throughout the workflow. These graphs are interactive – mouse-over data points to find further information, such as the exact coordinates (e.g. p-value) of a plotted sample, or click a sample or condition in the legend to hide it from the graph – this can allow the remaining samples to be compared more easily as the graph will zoom in and out to accommodate changes.

Example data

To demonstrate the functionality of ALTRE, we have provided an example dataset (BAM and BED files) of cancerous (A549) and normal (SAEC) lung cell types downloaded from the ENCODE project (see https://mathelab.github.io/ALTREsampledata/). Of note, this vignette uses only a subset of the data corresponding to Chromosome 21 (results may not be as meaningful).

1. Load Data

Load Metadata The metadata associated with data files to be analyzed in ALTRE is supplied as a CSV file. The software will automatically retrieve the file path of your input CSV so it is important that all analysis files are in the same folder as your CSV file.

After downlading all files in the example datasets in one folder, download the CSV at https://raw.githubusercontent.com/mathelab/AltreDataRepo/master/DNaseEncodeExample.csv and place it in the same folder. Also, be sure to set your working directory in the R environment to point to the directory where you have placed your CSV file.

library(ALTRE)
csvfile <- loadCSVFile("DNaseEncodeExample.csv")
csvfile
## # A tibble: 4 × 5
##                     bamfiles               peakfiles sample replicate
##                        <chr>                   <chr>  <chr>     <chr>
## 1 A549_ENCFF001CLE_chr21.bam A549_ENCFF001WCZ.bed.gz   A549         I
## 2 SAEC_ENCFF001EFI_chr21.bam SAEC_ENCFF001WSH.bed.gz   SAEC         I
## 3 A549_ENCFF001CLJ_chr21.bam A549_ENCFF001WDA.bed.gz   A549        II
## 4 SAEC_ENCFF001EFN_chr21.bam SAEC_ENCFF001WSI.bed.gz   SAEC        II
## # ... with 1 more variables: datapath <chr>

Load peak files Read in the peak files (BED format) with the loadBedFiles() function. Only the first three columns (chr, start, end) of the peak files are required an read in. Additional columns are allowed but ignored.

samplePeaks <- loadBedFiles(csvfile)
samplePeaks
## GRangesList object of length 4:
## $A549_I 
## GRanges object with 276677 ranges and 2 metadata columns:
##            seqnames               ranges strand |   sample replicate
##               <Rle>            <IRanges>  <Rle> | <factor>  <factor>
##        [1]     chr1     [ 10241,  10349]      * |     A549         I
##        [2]     chr1     [237719, 237872]      * |     A549         I
##        [3]     chr1     [564496, 564831]      * |     A549         I
##        [4]     chr1     [565253, 566084]      * |     A549         I
##        [5]     chr1     [566586, 567276]      * |     A549         I
##        ...      ...                  ...    ... .      ...       ...
##   [276673]     chrY [59024237, 59024354]      * |     A549         I
##   [276674]     chrY [59027624, 59027997]      * |     A549         I
##   [276675]     chrY [59028579, 59028819]      * |     A549         I
##   [276676]     chrY [59029626, 59029819]      * |     A549         I
##   [276677]     chrY [59031344, 59031507]      * |     A549         I
## 
## ...
## <3 more elements>
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths

2. Get Reproducible Peaks

In the second step, ALTRE identifies consensus peaks among sample replicates – peaks that are present in all (or, less stringently, the majority) of sample replicates. The function getConsensusPeaks() processes an input of two or more peak files per sample and returns a GRanges list. At least two biological replicates are required in order to identify consensus peaks.

consensusPeaks <- getConsensusPeaks(samplepeaks = samplePeaks,
                                    minreps = 2)

The function plotConsensusPeaks() creates a bar graph comparing the number of replicate peaks identified vs. total number of peaks.

plotConsensusPeaks(samplepeaks = consensusPeaks)

3. Annotate Peaks

The output of the function getConsensusPeaks() is then further processed by the combineAnnotatePeaks() function, which in turn accomplishes three main tasks:

  1. Combines consensus peaks for each samples type into one large master list of peaks for each sample type.

  2. Categorizes peaks as TSS-proximal and TSS-distal, based on the distance of the peaks from a transcription start site (TSS). In this vignette we will use the default promoter distance argument: peaks within 1,500 bp of a TSS are considered TSS-proximal; those not within 1,500 bp of a TSS are considered TSS-distal.

  3. Optionally, merges peaks within a certain distance of each other. Merging close-by peaks loosens the stringency of the region comparison. Multiple regions of active chromatin within ~1000 bp are likely to represent only one regulatory element. In this vignette we will use the defaults of the function, which merges regions within 1000 bp of each other and merges only within regulatory element type (i.e. only TSS-distal will only be merged with other TSS-distal, and likewise, TSS-proximal only with TSS-proximal). Both these functionalities can be changed via the “distancefromTSSprox”, “distancefromTSSdist” and “regionspecific” arguments.

The combineAnnotatePeaks() function requires a list of TSSs for the TSS-proximal/TSS-distal annotation, which can be read in with the getTSS() function:

TSSannot <- getTSS()

Then the function can be run with:

consensusPeaksAnnotated <- combineAnnotatePeaks(conspeaks = consensusPeaks,
                                           TSS = TSSannot,
                                           merge = TRUE,
                                           regionspecific = TRUE,
                                           distancefromTSSdist = 1500,
                                           distancefromTSSprox = 1000)

Note that the output of the previous function getConsensusPeaks() is fed as input (consensPeaks).

The output of combineAnnotatePeaks() produces a GRanges object containing the annotated, merged peaks – all peaks present in cancerous or normal lung.

Additionally, the function plotCombineAnnotatePeaks() creates a bar graph to showcase the changes in the number and size of TSS-distal and TSS-proximal elements before and after merging close-by peaks.

plotCombineAnnotatePeaks(consensusPeaksAnnotated)

The number of regulatory elements decreases and the size of the regions increases, as expected.

4. Get Read Counts

The next function getCounts() retrieves the number of reads in each regulatory element (RE) for each sample type – this determines the peak height/intensity and approximates the accessibility of the RE in question. Read counts are retrieved from the alignment files (BAM format) that are supplied in the CSV file containing the sample metadata. We have already read in the CSV in the first step of the workflow (load data).

Of note, one cell type must be designated as “reference”, to which the other cell types will be compared. In this example, the reference type is the normal lung cell line (SAEC).

To run the function:

consensusPeaksCounts <- getCounts(annotpeaks = consensusPeaksAnnotated,
                              sampleinfo = csvfile,
                              reference = 'SAEC',
                              chrom = 'chr21')
plotGetCounts(consensusPeaksCounts)

Note: If you are running ALTRE on a Unix or MacOS operating system you can use the function getCountsFast in the place of getCounts. Both functions yeild the same results but getCountsFast is significantly faster. Unfortunately, getCountsFast is not available for Windows.

5. Retrieve and Categorize Candidate Altered Regulatory Elements (REs)

The counts for each REs are processed by the function countanalysis(), which leverages the algorithm from the R package DESeq2 to compare peak intensities across cell types.

alteredPeaks <- countanalysis(counts = consensusPeaksCounts,
                             pval = 0.01,
                             lfcvalue = 1)

Altered REs are next categorized based on associated log2 fold changes and p-values, where larger log2 fold changes and smaller p-values mark REs with a change in signal magnitude between the two cell-types that is unlikely to be caused by chance alone. What magnitude (p-value/log2 fold change) is considered cell-type specific is left up to the user.

alteredPeaksCategorized <- categAltrePeaks(alteredPeaks,
                                    lfctypespecific = 1.5,
                                    lfcshared = 1.2,
                                    pvaltypespecific = 0.01,
                                    pvalshared = 0.05)
plotCountAnalysis(alteredPeaksCategorized)
## Plot shows all data points. Data around the origin has not been trimmed

REs with larger, positive log2 fold changes and smaller p-values have strong statistical evidence indicating that the DNA is more accessible in the lung cancer cell type compared to the normal lung tissue. Therefore, their activity is considered associated with the cancer phenotype. REs with larger, negative log2 fold changes and smaller p-values indicate the opposite – the DNA is less accessible in the cancer type and may therefore be associated with tumor suppressive properties. REs with minimal log2 fold change and high p-values are equally accessible in both cell types, and therefore, their activity is more likely to contribute to the house-keeping functions required in every cell.

The function plotDistCountAnalysis() enables users to view the raw count data in regions identified as type-specific or shared. The log2 transformation of reads per kilo-base per million for each RE is plotted. As expected, for REs identified as higher log-fold change than normal, the lung cancer cell line has much higher counts than he normal lung cell line. The opposite is true for the lower log-fold change regions. For RE with little log-fold change between cell types, the counts are similar. This visual depiction confirms the results of countanalysis and enables users to change parameters if needed.

plotDistCountAnalysis(alteredPeaksCategorized, consensusPeaksCounts)

6. Method Comparison

Once results have been obtained there are additional functions in ALTRE to compare the two major methods for identifying altered REs:

Method 1: based on peak intensity (ALTRE’s countanalysis() function)
Method 2: based on peak presence or absence as determined by the peak calling algorithm

To compare resulting altered REs from these two methods, use the function comparePeaksAltre as follows:

comparePeaksAnalysisResults <- comparePeaksAltre(alteredPeaksCategorized)

The function plotCompareMethodsAll() takes the results from comparePeaksAltre and translates the table into pie charts.

plotCompareMethodsAll(comparePeaksAnalysisResults)

7. Get Putative Enrichment Pathways

Finally, ALTRE performs pathway enrichment analysis, using rGREAT, for cell type-specific or shared REs using the functions runGREAT() and processPathways(). The GREAT algorithm incorporates proximal and distal binding sites and uses a binomial testing procedure to mitigate false positives. More information on the algorithm can be found here. Enriched” pathways are likely to be regulated by the REs linked to the gene cluster (whether type-specific or shared).

In the web application, Gene Ontology pathway annotations are used by default, and include 1) Molecular Function (MF); 2) Biological Process (BP); 3) Cellular Component (CC)). GREAT however allows other pathway categories to be analyzed, including “Pathway Data”, “Regulatory Motifs”,“Phenotype Data and Human Disease”, “Gene Expression”, and “Gene Families”. These pathway categories can be passed onto the runGREAT() function via the paramter pathway_category.

By default, the p-values will be adjusted by “bonferroni” and significant pathways will have adjusted p-values < 0.05 and an enrichment of at least 2.

Note that this vignette only uses Chromosome 21, therefore the pathway analysis results presented may not be biologically relevant.

callPaths <- runGREAT(peaks = alteredPeaksCategorized)
enrichmentTables <- processPathways(callPaths, pathway_category = "GO",
    enrichcutoff = 2, adjpvalcutoff = 0.05)

The function plotGREATenrich() creates a heatmap plot of the analysis for all three RE types (Reference-specific, Experiment-specific, and Shared), with the color-coding corresponding to the significance of the pathway – the lighter the blue, the lower the p-value.

plotGREATenrich(enrichmentTables, maintitle = "GREAT Enrichment Analysis",
    pathwaycateg = "GO_Molecular_Function")

8. Writing output to files To enable users to write results to files, the following 4 functions are implemented:

  1. writeBedFile(): writes a BED file in the working directory with the location of REs, color-coded by type-specificity. The output BED file can be directly loaded in a genome browser. There is no output to the function that can be saved as an object.
writeBedFile(alteredPeaksCategorized, "altrecategpeaks.bed")
  1. writeConsensusRE(): writes a CSV file in the working direcoty with the location of the consensus regulatory elements, along with their annotation (e.g. TSS-proximal/distal, whether the peaks are found in the experiment or reference samples). The file takes as input the output of the getConsensusPeaks() function and returns no output.
writeConsensusRE(consensusPeaksAnnotated, "altreconspeaks.csv")
  1. writeCompareRE(): writes a CSV file with the number of peaks called as experiment- or reference-specific and share using the quantitative/peak intensity or peak/binary approach. The function takes the output of comparePeaksAltre() as input and returns no output.
writeCompareRE(comparePeaksAnalysisResults, "comparealtrepeaks.csv")
  1. writeGREATpath(): writes CSV files showing the results of the functions runGREAT() and processPathways(), which it takes as input. A zipped folder containing 3xP CSV files will be created, where 3 is the type of REs (e.g. Shared, Reference-specific, and Experiment-specific), and P is the number of pathways (e.g. if GO pathway category is used, then P is 3 - Molecular Function, Cellulcar function, Biological Processes).
writeGREATpath(pathresults, "EnrichedPathways.zip")
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12.2 (unknown)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ALTRE_0.99.0
## 
## loaded via a namespace (and not attached):
##   [1] nlme_3.1-128                  bitops_1.0-6                 
##   [3] xts_0.9-7                     lubridate_1.6.0              
##   [5] EnsDb.Hsapiens.v75_0.99.12    RColorBrewer_1.1-2           
##   [7] httr_1.2.1                    rprojroot_1.1                
##   [9] GenomeInfoDb_1.8.7            tools_3.3.0                  
##  [11] backports_1.0.4               R6_2.2.0                     
##  [13] rpart_4.1-10                  Hmisc_4.0-1                  
##  [15] DBI_0.5-1                     lazyeval_0.2.0               
##  [17] BiocGenerics_0.18.0           colorspace_1.3-2             
##  [19] GetoptLong_0.1.5              nnet_7.3-12                  
##  [21] gridExtra_2.2.1               mnormt_1.5-5                 
##  [23] DESeq2_1.12.4                 base64_2.0                   
##  [25] Biobase_2.32.0                htmlTable_1.7                
##  [27] rtracklayer_1.32.2            scales_0.4.1                 
##  [29] psych_1.6.9                   readr_1.0.0                  
##  [31] genefilter_1.54.2             stringr_1.1.0                
##  [33] digest_0.6.10                 Rsamtools_1.24.0             
##  [35] foreign_0.8-67                rmarkdown_1.3                
##  [37] XVector_0.12.1                htmltools_0.3.5              
##  [39] highcharter_0.4.0             ensembldb_1.4.7              
##  [41] GlobalOptions_0.0.10          htmlwidgets_0.8              
##  [43] TTR_0.23-1                    quantmod_0.4-7               
##  [45] RSQLite_1.1-1                 BiocInstaller_1.22.3         
##  [47] shiny_0.14.2                  zoo_1.7-14                   
##  [49] jsonlite_1.1                  BiocParallel_1.6.6           
##  [51] acepack_1.4.1                 dplyr_0.5.0                  
##  [53] RCurl_1.95-4.8                magrittr_1.5                 
##  [55] rlist_0.4.6.1                 Formula_1.2-1                
##  [57] Matrix_1.2-7.1                Rcpp_0.12.8                  
##  [59] munsell_0.4.3                 S4Vectors_0.10.3             
##  [61] stringi_1.1.2                 yaml_2.1.14                  
##  [63] SummarizedExperiment_1.2.3    zlibbioc_1.18.0              
##  [65] plyr_1.8.4                    AnnotationHub_2.4.2          
##  [67] grid_3.3.0                    parallel_3.3.0               
##  [69] lattice_0.20-34               Biostrings_2.40.2            
##  [71] splines_3.3.0                 GenomicFeatures_1.24.5       
##  [73] annotate_1.50.1               locfit_1.5-9.1               
##  [75] knitr_1.15.1                  igraph_1.0.1                 
##  [77] GenomicRanges_1.24.3          rjson_0.2.15                 
##  [79] geneplotter_1.50.0            reshape2_1.4.2               
##  [81] biomaRt_2.28.0                stats4_3.3.0                 
##  [83] XML_3.98-1.5                  evaluate_0.10                
##  [85] latticeExtra_0.6-28           data.table_1.10.0            
##  [87] httpuv_1.3.3                  gtable_0.2.0                 
##  [89] openssl_0.9.5                 purrr_0.2.2                  
##  [91] tidyr_0.6.0                   assertthat_0.1               
##  [93] ggplot2_2.2.0                 mime_0.5                     
##  [95] xtable_1.8-2                  broom_0.4.1                  
##  [97] survival_2.40-1               viridisLite_0.1.3            
##  [99] tibble_1.2                    GenomicAlignments_1.8.4      
## [101] AnnotationDbi_1.34.4          memoise_1.0.0                
## [103] IRanges_2.6.1                 cluster_2.0.5                
## [105] interactiveDisplayBase_1.10.3 rGREAT_1.4.2