bcbioRNASeq

Abstract

RNA-seq analysis involves multiple steps from processing raw sequencing data to identifying, organizing, annotating, and reporting differentially expressed genes. bcbio is an open source, community-maintained framework providing automated and scalable RNA-seq methods for identifying gene abundance counts. We have developed bcbioRNASeq, a Bioconductor package that provides ready-to-render templates and wrapper functions to post-process bcbio output data. bcbioRNASeq automates the generation of high-level RNA-seq reports, including identification of differentially expressed genes, functional enrichment analysis and quality control analysis.

Introduction

For a high-level overview of our bcbio RNA-seq analysis pipeline, including detailed explanation of the bcbioRNASeq S4 class definition, first consult our workflow paper published in F1000 Research (Steinbaugh et al. 2018). This vignette is focused on more advanced usage and edge cases that a user may encounter when attempting to load a bcbio dataset and perform downstream quality control analysis.

Note: if you use bcbioRNASeq in published research, please include this citation:

citation("bcbioRNASeq")

## To cite bcbioRNASeq in publications use:
## 
##   Steinbaugh MJ, Pantano L, Kirchner RD, Barrera V, Chapman BA, Piper
##   ME, Mistry M, Khetani RS, Rutherford KD, Hoffman O, Hutchinson JN, Ho
##   Sui SJ. (2018). bcbioRNASeq: R package for bcbio RNA-seq analysis.
##   F1000Research, 6:1976. URL
##   https://f1000research.com/articles/6-1976/v2. DOI
##   10.12688/f1000research.12093.2.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {{bcbioRNASeq}: {R} package for bcbio {RNA-seq} analysis},
##     author = {Michael J. Steinbaugh and Lorena Pantano and Rory D. Kirchner and Victor Barrera and Brad A. Chapman and Mary E. Piper and Meeta Mistry and Radhika S. Khetani and Kayleigh D. Rutherford and Oliver Hofmann and John N. Hutchinson and Shannan J. Ho Sui},
##     journal = {F1000Research},
##     volume = {6},
##     number = {1976},
##     year = {2018},
##     url = {https://f1000research.com/articles/6-1976/v2},
##     doi = {10.12688/f1000research.12093.2},
##   }

## nolint start
suppressPackageStartupMessages({
    library(basejump)
    library(bcbioRNASeq)
})
## nolint end

Loading bcbio data

The bcbioRNASeq constructor function is the main interface connecting bcbio] output data to interactive use in R. It is highly customizable and supports a number of options for advanced use cases. Consult the documentation for additional details.

help(topic = "bcbioRNASeq", package = "bcbioRNASeq")

Upload directory

We have designed the constructor to work as simply as possible by default. The only required argument is uploadDir, the path to the bcbio final upload directory specified with upload: in the YAML configuration. Refer to the bcbio configuration documentation for detailed information on how to set up a bcbio run, which is outside the scope of this vignette.

For example, let’s load up the example bcbio dataset stored internally in the package.

uploadDir <- system.file("extdata", "bcbio", package = "bcbioRNASeq")

bcbio outputs RNA-seq data in a standardized directory structure, which is described in detail in our workflow paper.

sort(list.files(path = uploadDir, full.names = FALSE, recursive = TRUE))

##  [1] "2018-03-18_GSE65267-merged/bcbio-nextgen-commands.log"
##  [2] "2018-03-18_GSE65267-merged/bcbio-nextgen.log"         
##  [3] "2018-03-18_GSE65267-merged/combined.counts"           
##  [4] "2018-03-18_GSE65267-merged/programs.txt"              
##  [5] "2018-03-18_GSE65267-merged/project-summary.yaml"      
##  [6] "2018-03-18_GSE65267-merged/tx2gene.csv"               
##  [7] "control_rep1/salmon/quant.sf"                         
##  [8] "control_rep2/salmon/quant.sf"                         
##  [9] "control_rep3/salmon/quant.sf"                         
## [10] "fa_day7_rep1/salmon/quant.sf"                         
## [11] "fa_day7_rep2/salmon/quant.sf"                         
## [12] "fa_day7_rep3/salmon/quant.sf"                         
## [13] "sample_metadata.csv"

Counts level

By default, bcbioRNASeq imports counts at gene level, which are required for standard differential expression analysis (level = "genes"). For pseudo-aligned counts (e.g. Salmon, Kallisto, Sailfish) (Bray et al. 2016; Patro, Mount, and Kingsford 2014; Patro et al. 2017), tximport (Soneson, Love, and Robinson 2016) is used internally to aggregate transcript-level counts to gene-level counts, and generates length-scaled transcripts per million (TPM) values. For aligned counts processed with featureCounts (Liao, Smyth, and Shi 2014) (e.g. STAR, HISAT2) (Dobin et al. 2013; Dobin and Gingeras 2016; Kim, Langmead, and Salzberg 2015), these values are already returned at gene level, and therefore not handled by tximport. Once the gene-level counts are imported during the bcbioRNASeq call, the DESeq2 package (Love, Huber, and Anders 2014) is then used to generate an internal DESeqDataSet from which we derive normalized and variance-stabilized counts.

object <- bcbioRNASeq(uploadDir = uploadDir, level = "genes")
print(object)

## bcbioRNASeq 0.6.2 with 100 rows and 6 columns
## uploadDir: /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio
## dates(2): 2018-03-18 2024-03-28
## level: genes
## caller: salmon
## organism: Mus musculus
## interestingGroups: sampleName
## class: RangedSummarizedExperiment 
## dim: 100 6 
## metadata(27): allSamples bcbioCommandsLog ... wd yaml
## assays(6): counts aligned ... normalized vst
## rownames(100): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000062661 ENSMUSG00000074340
## rowData names(0):
## colnames(6): control_rep1 control_rep2 ... fa_day7_rep2 fa_day7_rep3
## colData names(26): averageInsertSize averageReadLength ... treatment
##   x5x3Bias

Alternatively, if you want to perform transcript-aware analysis, such as differential exon usage or splicing analysis, transcript-level counts can be obtained using level = "transcripts". Note that when counts are loaded at transcript level, TPMs are generated with tximport internally, but no additional normalizations or transformations normally calculated for gene-level counts with DESeq2 are generated.

object <- bcbioRNASeq(uploadDir = uploadDir, level = "transcripts", fast = TRUE)
print(object)

## bcbioRNASeq 0.6.2 with 100 rows and 6 columns
## uploadDir: /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio
## dates(2): 2018-03-18 2024-03-28
## level: transcripts
## caller: salmon
## organism: Mus musculus
## interestingGroups: sampleName
## fast: TRUE
## class: RangedSummarizedExperiment 
## dim: 100 6 
## metadata(26): allSamples bcbioCommandsLog ... wd yaml
## assays(3): counts avgTxLength tpm
## rownames(100): ENSMUST00000000001 ENSMUST00000000003 ...
##   ENSMUST00000000674 ENSMUST00000000687
## rowData names(0):
## colnames(6): control_rep1 control_rep2 ... fa_day7_rep2 fa_day7_rep3
## colData names(26): averageInsertSize averageReadLength ... treatment
##   x5x3Bias

Expression callers

Since bcbio is flexible and supports a number of expression callers, we have provided advanced options in the bcbioRNASeq constructor to support a variety of workflows using the caller argument. Salmon, Kallisto, and Sailfish counts are supported at either gene or transcript level. Internally, these are loaded using tximport. STAR and HISAT2 aligned counts processed with featureCounts are also supported, but only at gene level.

object <- bcbioRNASeq(uploadDir = uploadDir, caller = "star", fast = TRUE)
print(object)

## bcbioRNASeq 0.6.2 with 100 rows and 6 columns
## uploadDir: /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio
## dates(2): 2018-03-18 2024-03-28
## level: genes
## caller: star
## organism: Mus musculus
## interestingGroups: sampleName
## fast: TRUE
## class: RangedSummarizedExperiment 
## dim: 100 6 
## metadata(25): allSamples bcbioCommandsLog ... wd yaml
## assays(1): counts
## rownames(100): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000000563 ENSMUSG00000000567
## rowData names(0):
## colnames(6): control_rep1 control_rep2 ... fa_day7_rep2 fa_day7_rep3
## colData names(26): averageInsertSize averageReadLength ... treatment
##   x5x3Bias

Sample selection and metadata

If you’d like to load up only a subset of samples, this can be done easily using the samples argument. Note that the character vector declared here must match the description column specified in the sample metadata. Conversely, if you’re working with a large dataset and you simply want to drop a few samples, this can be accomplished with the censorSamples argument.

When working with a bcbio run that has incorrect or outdated metadata, the simplest way to fix this issue is to pass in new metadata from an external spreadsheet (CSV or Excel) using the sampleMetadataFile argument. Note that this can also be used to subset the bcbio dataset, similar to the samples argument (see above), based on the rows that are included in the spreadsheet.

sampleMetadataFile <- file.path(uploadDir, "sample_metadata.csv")
stopifnot(file.exists(sampleMetadataFile))
import(sampleMetadataFile)

##                   fileName  description  treatment day
## 1 control_rep1_R1.fastq.gz control_rep1    control   0
## 2 control_rep2_R1.fastq.gz control_rep2    control   0
## 3 control_rep3_R1.fastq.gz control_rep3    control   0
## 4 fa_day7_rep1_R1.fastq.gz fa_day7_rep1 folic acid   7
## 5 fa_day7_rep2_R1.fastq.gz fa_day7_rep2 folic acid   7
## 6 fa_day7_rep3_R1.fastq.gz fa_day7_rep3 folic acid   7

object <- bcbioRNASeq(
    uploadDir = uploadDir,
    sampleMetadataFile = sampleMetadataFile,
    fast = TRUE
)
sampleData(object)

## DataFrame with 6 rows and 5 columns
##                   day                 fileName   sampleName  treatment
##              <factor>                 <factor>     <factor>   <factor>
## control_rep1        0 control_rep1_R1.fastq.gz control_rep1 control   
## control_rep2        0 control_rep2_R1.fastq.gz control_rep2 control   
## control_rep3        0 control_rep3_R1.fastq.gz control_rep3 control   
## fa_day7_rep1        7 fa_day7_rep1_R1.fastq.gz fa_day7_rep1 folic acid
## fa_day7_rep2        7 fa_day7_rep2_R1.fastq.gz fa_day7_rep2 folic acid
## fa_day7_rep3        7 fa_day7_rep3_R1.fastq.gz fa_day7_rep3 folic acid
##              interestingGroups
##                       <factor>
## control_rep1      control_rep1
## control_rep2      control_rep2
## control_rep3      control_rep3
## fa_day7_rep1      fa_day7_rep1
## fa_day7_rep2      fa_day7_rep2
## fa_day7_rep3      fa_day7_rep3

Genome annotations

When analyzing a dataset against a well-annotated genome, we recommend importing the corresponding metadata using AnnotationHub and ensembldb. This functionality is natively supported in the bcbioRNASeq constructor with using the organism, ensemblRelease, and genomeBuild arguments. For example, with our internal bcbio dataset, we’re analyzing counts generated against the Ensembl Mus musculus GRCm38 genome build (release 87). These parameters can be defined in the object load call to ensure that the annotations match up exactly with the genome used.

object <- bcbioRNASeq(
    uploadDir = uploadDir,
    level = "genes",
    organism = "Mus musculus",
    genomeBuild = "GRCm38",
    ensemblRelease = 87L
)

This will return a GRanges object using the GenomicRanges package (Lawrence et al. 2013), which contains coordinates and rich metadata for each gene or transcript. These annotations are accessible with the rowRanges and rowData functions defined in the SummarizedExperiment package (Huber et al. 2015).

summary(rowRanges(object))

## [1] "GRanges object with 100 ranges and 8 metadata columns"

summary(rowData(object))

## [1] "DataFrame object of length 8 with 0 metadata columns"

Alternatively, transcript-level annotations can also be obtained automatically using this method.

When working with a dataset generated against a poorly-annotated or non-standard genome, we provide a fallback method for loading gene annotations from a general feature format (GFF) file with the gffFile argument. If possible, we recommend providing a general transfer format (GTF) file, which is identical to GFF version 2. GFFv3 is more complicated and non-standard, but Ensembl GFFv3 files are also supported.

If your dataset contains transgenes (e.g. EGFP, TDTOMATO), these features can be defined with the transgeneNames argument, which will automatically populate the rowRanges slot with placeholder metadata.

We recommend loading up data per genome in its own bcbioRNASeq object when possible, so that rich metadata can be imported easily. In the edge case where you need to look at multiple genomes simultaneously, set organism = NULL, and bcbioRNASeq will skip the gene annotation acquisition step.

Refer to the the GenomicRanges and SummarizedExperiment package documentation for more details on working with the genome annotations defined in the rowRanges slot of the object. Here are some useful examples:

seqnames(object)

## factor-Rle of length 100 with 64 runs
##   Lengths:  1  1  1  1  1  1  1  1  5  1  1 ...  1  1  1  2  1  1  1  1  2  1
##   Values : 3  X  16 11 6  13 4  9  11 17 5  ... 6  5  9  11 8  11 7  15 2  3 
## Levels(86): 3 X 16 7 ... CHR_MG3836_PATCH CHR_MG3496_PATCH CHR_MG3172_PATCH

ranges(object)

## IRanges object with 100 ranges and 0 metadata columns:
##                          start       end     width
##                      <integer> <integer> <integer>
##   ENSMUSG00000000001 108107280 108146146     38867
##   ENSMUSG00000000003  77837901  77853623     15723
##   ENSMUSG00000000028  18780447  18811987     31541
##   ENSMUSG00000000049 108343354 108414396     71043
##   ENSMUSG00000000058  17281185  17289115      7931
##                  ...       ...       ...       ...
##   ENSMUSG00000048583 142650766 142666816     16051
##   ENSMUSG00000055022  92051165  92341967    290803
##   ENSMUSG00000061689 156613705 156764363    150659
##   ENSMUSG00000062661  31245823  31295989     50167
##   ENSMUSG00000074340 105973711 105987423     13713

strand(object)

## factor-Rle of length 100 with 42 runs
##   Lengths: 3 7 1 3 2 2 1 2 3 1 2 6 2 3 2 3 ... 2 3 1 4 1 1 2 2 2 4 2 2 1 4 2 4
##   Values : - + - + - + - + - + - + - + - + ... - + - + - + - + - + - + - + - +
## Levels(3): + - *

Variance stabilization

During the bcbioRNASeq constructor call, log2 variance stabilizaton of gene-level counts can be calculated automatically, and is recommended. This is performed internally by the DESeq2 package, using the varianceStabilizingTransformation and/or rlog functions. These transformations will be slotted into assays.

summary(counts(object, normalized = "vst"))

##   control_rep1     control_rep2     control_rep3     fa_day7_rep1   
##  Min.   : 4.919   Min.   : 4.919   Min.   : 4.919   Min.   : 4.919  
##  1st Qu.: 4.919   1st Qu.: 4.919   1st Qu.: 4.919   1st Qu.: 4.919  
##  Median : 6.051   Median : 5.658   Median : 6.009   Median : 6.237  
##  Mean   : 6.713   Mean   : 6.706   Mean   : 6.715   Mean   : 6.762  
##  3rd Qu.: 7.990   3rd Qu.: 8.239   3rd Qu.: 7.815   3rd Qu.: 8.239  
##  Max.   :12.215   Max.   :12.972   Max.   :11.818   Max.   :11.764  
##   fa_day7_rep2     fa_day7_rep3   
##  Min.   : 4.919   Min.   : 4.919  
##  1st Qu.: 5.101   1st Qu.: 4.919  
##  Median : 6.216   Median : 6.185  
##  Mean   : 6.754   Mean   : 6.789  
##  3rd Qu.: 8.033   3rd Qu.: 7.930  
##  Max.   :12.086   Max.   :11.997

Use case dataset

To demonstrate the functionality and configuration of the package, we have taken an experiment from the Gene Expression Omnibus (GEO) public repository of expression data to use as an example use case. The RNA-seq data is from a study of acute kidney injury in a mouse model (GSE65267) (Craciun et al. 2016). The study aims to identify differentially expressed genes in progressive kidney fibrosis and contains samples from mouse kidneys at several time points (n = 3, per time point) after folic acid treatment. From this dataset, we are using a subset of the samples for our use case: before folic acid treatment, and 1, 3, 7 days after treatment.

For the vignette, we are loading a pre-computed version of the example bcbioRNASeq object used in our workflow paper.

object <- import(
    con = cacheUrl(
        pasteUrl(
            "github.com",
            "hbc",
            "bcbioRNASeq",
            "raw",
            "f1000v2",
            "data",
            "bcb.rda",
            protocol = "https"
        ),
        pkg = "bcbioRNASeq"
    )
)
object <- updateObject(object)
print(object)

## bcbioRNASeq 0.6.2 with 51652 rows and 12 columns
## uploadDir: /n/data1/cores/bcbio/bcbioRNASeq/F1000v2/GSE65267-merged/final
## dates(2): 2018-03-18 2018-06-01
## level: genes
## caller: salmon
## organism: Mus musculus
## ensemblRelease: 90
## interestingGroups: day
## class: RangedSummarizedExperiment 
## dim: 51652 12 
## metadata(28): level caller ... previousVersion packageVersion
## assays(6): counts tpm ... rlog vst
## rownames(51652): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000114967 ENSMUSG00000114968
## rowData names(7): geneId geneName ... ncbiGeneId broadClass
## colnames(12): control_rep1 control_rep2 ... fa_day7_rep2 fa_day7_rep3
## colData names(25): sampleName day ... x5x3Bias xGc

Sample metadata

For reference, let’s take a look at the sample metadata. By comparison, using colData will return all sample-level metadata, including our quality control metrics generated by bcbio. We recommend instead using sampleData with the clean = TRUE argument in reports, which only returns factor columns of interest.

sampleData(object)

## DataFrame with 12 rows and 7 columns
##                sampleName      day replicate   strain   tissue  treatment
##                  <factor> <factor>  <factor> <factor> <factor>   <factor>
## control_rep1 control_rep1        0         1   Balb/c   kidney control   
## control_rep2 control_rep2        0         2   Balb/c   kidney control   
## control_rep3 control_rep3        0         3   Balb/c   kidney control   
## fa_day1_rep1 fa_day1_rep1        1         1   Balb/c   kidney folic acid
## fa_day1_rep2 fa_day1_rep2        1         2   Balb/c   kidney folic acid
## ...                   ...      ...       ...      ...      ...        ...
## fa_day3_rep2 fa_day3_rep2        3         2   Balb/c   kidney folic acid
## fa_day3_rep3 fa_day3_rep3        3         3   Balb/c   kidney folic acid
## fa_day7_rep1 fa_day7_rep1        7         1   Balb/c   kidney folic acid
## fa_day7_rep2 fa_day7_rep2        7         2   Balb/c   kidney folic acid
## fa_day7_rep3 fa_day7_rep3        7         3   Balb/c   kidney folic acid
##              interestingGroups
##                       <factor>
## control_rep1                 0
## control_rep2                 0
## control_rep3                 0
## fa_day1_rep1                 1
## fa_day1_rep2                 1
## ...                        ...
## fa_day3_rep2                 3
## fa_day3_rep3                 3
## fa_day7_rep1                 7
## fa_day7_rep2                 7
## fa_day7_rep3                 7

Interesting groups

Groups of interest to be used for coloring and sample grouping in the quality control plots can be defined in the bcbioRNASeq object using the interestingGroups argument in the bcbioRNASeq constructor call. Assignment method support with the interestingGroups function is also provided, which can modify the groups of interest after the object has been created.

The interestingGroups definition defaults to sampleName, which is automatically generated from the bcbio description metadata for demultiplexed bulk RNA-seq samples. In this case, the samples will be grouped and colored uniquely.

Interesting groups must be defined using a character vector and refer to column names defined in the colData slot of the object. Note that the bcbioRNASeq package uses lower camel case formatting for column names (e.g. “sampleName”), so the interesting groups should be defined using camel case, not snake (e.g. “sample_name”) or dotted case (e.g. “sample.name”).

This approach was inspired by the DESeq2 package, which uses the argument intgroup in some functions, such as plotPca for labeling groups of interest. We took this idea and ran with it for a number of our quality control functions, which are described in detail below.

interestingGroups(object) <- c("treatment", "day")

Quality control

Our workflow paper describes the quality control process in detail. In addition, our Quality Control R Markdown template contains notes detailing each metric. Here were are going into more technical detail regarding how to customize the appearance of the plots. Note that all quality control plotting functions inherit the interestingGroups defined inside the bcbioRNASeq object. You can also change this dynamially for each function call using the interestingGroups argument.

Read counts

plotTotalReads(object)

plotMappedReads(object)

Mapping rates

Note that the overall mapping rate is relatively low per sample, while the exonic mapping rate is acceptable. High quality samples should have low intronic mapping rate, and high values are indicative of sample degradation and/or contamination.

plotMappingRate(object)

plotExonicMappingRate(object)

plotIntronicMappingRate(object)

rRNA mapping rate

Note that the samples here have a high rRNA mapping rate. This can be indicative of the polyA enrichment or ribo depletion protocol not having removed all ribosomal RNA (rRNA) transcripts. This will reduce the number of biologically meaningful reads in the experiment and is best avoided.

plotRrnaMappingRate(object)

5’->3’ bias

RNA-seq data can have specific biases at either the 5’ or 3’ end of sequenced fragments. It is common to see a small amount of bias, especially if polyA enrichment was performed, or if there is any sample degradation. If a large amount of bias is observed here, be sure to analyze the samples with a Bioanalyzer and check the RIN scores.

plot5Prime3PrimeBias(object)

Feature (gene) distributions

plotFeaturesDetected(object)

plotFeatureSaturation(object)

plotCountsPerFeature(object, geom = "boxplot")

plotCountsPerFeature(object, geom = "density")

Sample similarity

We can visualize sample similarity with principal component analysis (PCA) and hierarchical clustering of sample correlations. These functions support multiple normalization methods with the normalized argument.

plotPca(object)

plotCorrelationHeatmap(object)

Export

The counts accessor function returns raw counts by default. Multiple normalization methods are accessible with the normalized argument.

rawCounts <- counts(object, normalized = FALSE)
normalizedCounts <- counts(object, normalized = TRUE)
tpm <- counts(object, normalized = "tpm")
log2VstCounts <- counts(object, normalized = "vst")

We are providing bcbioRNASeq method support for the export function, which automatically exports matrices defined in assays and corresponding metadata (e.g. rowData, colData) to disk in a single call.

files <- export(object, con = tempdir())
print(files)

## $assays
## $assays$counts
## [1] "/private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp2KSeRu/assays/counts.csv"
## 
## $assays$tpm
## [1] "/private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp2KSeRu/assays/tpm.csv"
## 
## $assays$avgTxLength
## [1] "/private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp2KSeRu/assays/avgTxLength.csv"
## 
## $assays$normalized
## [1] "/private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp2KSeRu/assays/normalized.csv"
## 
## $assays$rlog
## [1] "/private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp2KSeRu/assays/rlog.csv"
## 
## $assays$vst
## [1] "/private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp2KSeRu/assays/vst.csv"
## 
## 
## $rowData
## [1] "/private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp2KSeRu/rowData.csv"
## 
## $colData
## [1] "/private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp2KSeRu/colData.csv"

Differential expression

bcbioRNASeq integrates with DESeq2 and edgeR for differential expression.

DESeq2

To prepare our dataset for analysis, we need to coerce the bcbioRNASeq object to a DESeqDataSet. We have defined an S4 coercion method that uses the as function in the package.

dds <- as(object, "DESeqDataSet")
print(dds)

## class: DESeqDataSet 
## dim: 51652 12 
## metadata(11): caller countsFromAbundance ... sessionInfo version
## assays(1): counts
## rownames(51652): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000114967 ENSMUSG00000114968
## rowData names(7): geneId geneName ... ncbiGeneId broadClass
## colnames(12): control_rep1 control_rep2 ... fa_day7_rep2 fa_day7_rep3
## colData names(25): sampleName day ... x5x3Bias xGc

Since both bcbioRNASeq and DESeqDataSet classes extend RangedSummarizedExperiment, internally we coerce the original bcbioRNASeq object to a RangedSummarizedExperiment and then use the DESeqDataSet constructor, which requires a SummarizedExperiment with integer counts.

The source code for our bcbioRNASeq to DESeqDataSet coercion is accessible with getMethod.

getMethod(
    f = "coerce",
    signature = signature(from = "bcbioRNASeq", to = "DESeqDataSet")
)

We also provide a parameterized starter R Markdown template for standard DESeq2 differential expression, supporting analysis and visualization of multiple contrasts inside a single report. In our workflow paper, we also describe an example LRT analysis.

Now you can perform differential expression analysis with DESeq2. The authors of that package have provided a number of detailed, well documented references online:

edgeR

To prepare our dataset for analysis, we need to coerce the bcbioRNASeq object to a DGEList. Similar to our DESeq2 approach, we have defined an S4 coercion method that hands off to edgeR.

dge <- as(object, "DGEList")
summary(dge)

##         Length Class      Mode   
## counts  619824 -none-     numeric
## samples      3 data.frame list

The source code for our bcbioRNASeq to DGEList coercion is accessible with getMethod.

getMethod(
    f = "coerce",
    signature = signature(from = "bcbioRNASeq", to = "DGEList")
)

Functional analysis

We provide a starter R Markdown template for gene set enrichment analysis (GSEA) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis (Kanehisa and Goto 2000; Subramanian et al. 2005), which leverages the functionality of the clusterProfiler package (Yu et al. 2012). This workflow is described in detail in our workflow paper, and is outside the scope of this advanced use vignette. For more information on functional analysis, consult the clusterProfiler vignette or Stephen Turner’s excellent DESeq2 to fgsea workflow.

R session information

utils::sessionInfo()

## R version 4.3.3 (2024-02-29)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ensembldb_2.26.0            AnnotationFilter_1.26.0    
##  [3] GenomicFeatures_1.54.4      AnnotationDbi_1.64.1       
##  [5] bcbioRNASeq_0.6.2           syntactic_0.7.1            
##  [7] pipette_0.15.2              AcidSingleCell_0.4.2       
##  [9] AcidPlyr_0.5.4              AcidPlots_0.7.3            
## [11] AcidMarkdown_0.3.0          AcidGenomes_0.8.0.9000     
## [13] AcidExperiment_0.5.4        AcidBase_0.7.3             
## [15] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
## [17] Biobase_2.62.0              GenomicRanges_1.54.1       
## [19] GenomeInfoDb_1.38.8         IRanges_2.36.0             
## [21] S4Vectors_0.40.2            BiocGenerics_0.48.1        
## [23] MatrixGenerics_1.14.0       matrixStats_1.2.0          
## [25] basejump_0.18.0             BiocStyle_2.30.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3            jsonlite_1.8.8               
##   [3] tximport_1.30.0               magrittr_2.0.3               
##   [5] farver_2.1.1                  rmarkdown_2.26               
##   [7] fs_1.6.3                      BiocIO_1.12.0                
##   [9] zlibbioc_1.48.2               ragg_1.3.0                   
##  [11] vctrs_0.6.5                   Rsamtools_2.18.0             
##  [13] memoise_2.0.1                 RCurl_1.98-1.14              
##  [15] htmltools_0.5.8               S4Arrays_1.2.1               
##  [17] progress_1.2.3                AnnotationHub_3.10.0         
##  [19] curl_5.2.1                    SparseArray_1.2.4            
##  [21] sass_0.4.9                    bslib_0.6.2                  
##  [23] bcbioBase_0.9.0               desc_1.4.3                   
##  [25] cachem_1.0.8                  GenomicAlignments_1.38.2     
##  [27] mime_0.12                     lifecycle_1.0.4              
##  [29] pkgconfig_2.0.3               Matrix_1.6-5                 
##  [31] R6_2.5.1                      fastmap_1.1.1                
##  [33] GenomeInfoDbData_1.2.11       shiny_1.8.1                  
##  [35] digest_0.6.35                 colorspace_2.1-0             
##  [37] patchwork_1.2.0               DESeq2_1.42.1                
##  [39] textshaping_0.3.7             RSQLite_2.3.5                
##  [41] labeling_0.4.3                filelock_1.0.3               
##  [43] fansi_1.0.6                   httr_1.4.7                   
##  [45] abind_1.4-5                   compiler_4.3.3               
##  [47] bit64_4.0.5                   withr_3.0.0                  
##  [49] BiocParallel_1.36.0           viridis_0.6.5                
##  [51] DBI_1.2.2                     highr_0.10                   
##  [53] biomaRt_2.58.2                rappdirs_0.3.3               
##  [55] DelayedArray_0.28.0           rjson_0.2.21                 
##  [57] tools_4.3.3                   interactiveDisplayBase_1.40.0
##  [59] httpuv_1.6.15                 glue_1.7.0                   
##  [61] restfulr_0.0.15               promises_1.2.1               
##  [63] grid_4.3.3                    generics_0.1.3               
##  [65] gtable_0.3.4                  tzdb_0.4.0                   
##  [67] hms_1.1.3                     xml2_1.3.6                   
##  [69] utf8_1.2.4                    XVector_0.42.0               
##  [71] BiocVersion_3.18.1            pillar_1.9.0                 
##  [73] stringr_1.5.1                 vroom_1.6.5                  
##  [75] limma_3.58.1                  later_1.3.2                  
##  [77] dplyr_1.1.4                   BiocFileCache_2.10.1         
##  [79] lattice_0.22-6                rtracklayer_1.62.0           
##  [81] bit_4.0.5                     tidyselect_1.2.1             
##  [83] locfit_1.5-9.9                Biostrings_2.70.3            
##  [85] knitr_1.45                    gridExtra_2.3                
##  [87] bookdown_0.38                 ProtGenerics_1.34.0          
##  [89] edgeR_4.0.16                  xfun_0.43                    
##  [91] statmod_1.5.0                 pheatmap_1.0.12              
##  [93] stringi_1.8.3                 lazyeval_0.2.2               
##  [95] yaml_2.3.8                    evaluate_0.23                
##  [97] codetools_0.2-19              tibble_3.2.1                 
##  [99] AcidGenerics_0.7.7            BiocManager_1.30.22          
## [101] cli_3.6.2                     xtable_1.8-4                 
## [103] systemfonts_1.0.6             munsell_0.5.0                
## [105] jquerylib_0.1.4               Rcpp_1.0.12                  
## [107] dbplyr_2.5.0                  png_0.1-8                    
## [109] XML_3.99-0.14                 parallel_4.3.3               
## [111] pkgdown_2.0.7                 ggplot2_3.5.0                
## [113] readr_2.1.5                   blob_1.2.4                   
## [115] goalie_0.7.8.9000             prettyunits_1.2.0            
## [117] bitops_1.0-7                  viridisLite_0.4.2            
## [119] AcidCLI_0.3.0                 scales_1.3.0                 
## [121] purrr_1.0.2                   crayon_1.5.2                 
## [123] rlang_1.1.3                   KEGGREST_1.42.0

References

The papers and software cited in our workflows are available as a shared library on Paperpile.

Bray, Nicolas L, Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. “Near-Optimal Probabilistic RNA-seq Quantification.” Nat. Biotechnol. 34 (5): 525–27. https://doi.org/10.1038/nbt.3519.

Craciun, Florin L, Vanesa Bijol, Amrendra K Ajay, Poornima Rao, Ramya K Kumar, John Hutchinson, Oliver Hofmann, et al. 2016. “RNA Sequencing Identifies Novel Translational Biomarkers of Kidney Fibrosis.” J. Am. Soc. Nephrol. 27 (6): 1702–13. https://doi.org/10.1681/ASN.2015020225.

Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: Ultrafast Universal RNA-seq Aligner.” Bioinformatics 29 (1): 15–21. https://doi.org/10.1093/bioinformatics/bts635.

Dobin, Alexander, and Thomas R Gingeras. 2016. “Optimizing RNA-Seq Mapping with STAR.” Methods Mol. Biol. 1415: 245–62. https://doi.org/10.1007/978-1-4939-3572-7\_13.

Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nat. Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.

Kanehisa, M, and S Goto. 2000. “KEGG: Kyoto Encyclopedia of Genes and Genomes.” Nucleic Acids Res. 28 (1): 27–30. https://www.ncbi.nlm.nih.gov/pubmed/10592173.

Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12 (4): 357–60. https://doi.org/10.1038/nmeth.3317.

Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8): e1003118. https://doi.org/10.1371/journal.pcbi.1003118.

Liao, Yang, Gordon K Smyth, and Wei Shi. 2014. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30. https://doi.org/10.1093/bioinformatics/btt656.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Genome Biol. 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nat. Methods 14 (4): 417–19. https://doi.org/10.1038/nmeth.4197.

Patro, Rob, Stephen M Mount, and Carl Kingsford. 2014. “Sailfish Enables Alignment-Free Isoform Quantification from RNA-seq Reads Using Lightweight Algorithms.” Nat. Biotechnol. 32 (5): 462–64. https://doi.org/10.1038/nbt.2862.

Soneson, Charlotte, Michael I Love, and Mark D Robinson. 2016. “Differential Analyses for RNA-seq: Transcript-Level Estimates Improve Gene-Level Inferences.” F1000Res. 4 (December). https://doi.org/10.12688/f1000research.7563.1.

Steinbaugh, Michael J, Lorena Pantano, Rory D Kirchner, Victor Barrera, Brad A Chapman, Mary E Piper, Meeta Mistry, et al. 2018. “bcbioRNASeq: R Package for Bcbio RNA-seq Analysis.” F1000Res. 6 (June). https://doi.org/10.12688/f1000research.12093.2.

Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proc. Natl. Acad. Sci. U. S. A. 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.

Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.

28 March 2024