bcbioRNASeq
is an S4 class that extends RangedSummarizedExperiment
, and
is designed to store a bcbio RNA-seq analysis.
Usage
bcbioRNASeq(
uploadDir,
level = c("genes", "transcripts"),
caller = c("salmon", "kallisto", "sailfish", "star", "hisat2"),
samples = NULL,
censorSamples = NULL,
sampleMetadataFile = NULL,
organism = NULL,
genomeBuild = NULL,
ensemblRelease = NULL,
gffFile = NULL,
transgeneNames = NULL,
countsFromAbundance = "lengthScaledTPM",
interestingGroups = "sampleName",
fast = FALSE
)
Arguments
- uploadDir
character(1)
. Final upload directory path.- level
character(1)
. Import counts at gene level ("genes
"; default) or transcript level ("transcripts
"; advanced use). Only tximport-compatible callers (e.g. salmon, kallisto, sailfish) can be loaded at transcript level. Aligned counts from featureCounts-compatible callers (e.g. STAR, HISAT2) can only be loaded at gene level.- caller
character(1)
. Expression caller:"salmon"
(default): Salmon alignment-free, quasi-mapped counts."kallisto"
: Kallisto alignment-free, pseudo-aligned counts."sailfish"
: Sailfish alignment-free, lightweight counts."star"
: STAR (Spliced Transcripts Alignment to a Reference) aligned counts."hisat2"
: HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts) graph-based aligned counts.
- samples
character
. Sample identifiers.- censorSamples
character
. Specify a subset of samples to censor.- sampleMetadataFile
character(1)
. Sample metadata file path. CSV or TSV is preferred, but Excel worksheets are also supported. Check the documentation for conventions and required columns.- organism
character(1)
. Full Latin organism name (e.g."Homo sapiens"
).- genomeBuild
character(1)
. Ensembl genome build assembly name (e.g."GRCh38"
). If setNULL
, defaults to the most recent build available. Note: don't pass in UCSC build IDs (e.g."hg38"
).- ensemblRelease
integer(1)
. Ensembl release version (e.g.100
). We recommend setting this value if possible, for improved reproducibility. When left unset, the latest release available via AnnotationHub/ensembldb is used. Note that the latest version available can vary, depending on the versions of AnnotationHub and ensembldb in use.- gffFile
character(1)
. GFF/GTF (General Feature Format) file. Generally, we recommend using GTF (GFFv2) instead of GFFv3.- transgeneNames
character
. Vector indicating which assay rows denote transgenes (e.g. EGFP, TDTOMATO).- countsFromAbundance
character(1)
. Whether to generate estimated counts using abundance estimates (recommended by default).lengthScaledTPM
is a suitable default, and counts are scaled using the average transcript length over samples and then the library size. Refer totximport::tximport()
for more information on this parameter, but it should only ever be changed when loading some datasets at transcript level (e.g. for DTU analsyis).- interestingGroups
character
. Groups of interest to use for visualization. Corresponds to factors describing the columns of the object.- fast
logical(1)
. Fast mode. Skip internal DESeq2 calculations and transformations. Don't enable this setting when using the quality control R Markdown template. Note that some plotting functions, such asplotPca()
will not work when this mode is enabled.
Details
Automatically imports RNA-seq counts, metadata, and the program versions used from a bcbio RNA-seq run. Simply point to the final upload directory generated by bcbio, and this generator function will take care of the rest.
Sample metadata
When loading a bcbio RNA-seq run, the sample metadata will be imported
automatically from the project-summary.yaml
file in the final upload
directory. If you notice any typos in your metadata after completing the run,
these can be corrected by editing the YAML file.
Alternatively, you can pass in a sample metadata file into the
bcbioRNASeq()
function call using the sampleMetadataFile
argument. This
requires either a CSV or Excel spreadsheet.
The samples in the bcbio run must map to the description
column. The values
provided in description
must be unique. These values will be sanitized into
syntactically valid names (see make.names()
for more information), and
assigned as the column names of the bcbioRNASeq
object. The original values
are stored as the sampleName
column in colData
, and are used for all
plotting functions. Do not attempt to set a sampleId
column, as this is
used internally by the package.
Here is a minimal example of a properly formatted sample metadata file:
description | genotype |
sample1 | wildtype |
sample2 | knockout |
sample3 | wildtype |
sample4 | knockout |
Valid names
R is strict about values that are considered valid for use in names()
and
dimnames()
(i.e. rownames()
and colnames()
. Non-alphanumeric
characters, spaces, and dashes are not valid. Use either underscores or
periods in place of dashes when working in R. Also note that names should
not begin with a number, and will be prefixed with an X
when sanitized.
Consult the documentation in the make.names()
function for more
information. We strongly recommend adhering to these conventions when
labeling samples, to help avoid unexpected downstream behavior in R due to
dimnames()
mismatches.
Genome annotations
bcbioRNASeq()
provides support for automatic import of genome annotations,
which internally get processed into genomic ranges (GRanges
) and are
slotted into the rowRanges()
of the object. Currently, we offer support for
(1) Ensembl genome annotations from AnnotationHub via ensembldb
(recommended); or (2) direct import from a GTF/GFF file using
rtracklayer.
ensembldb requires the organism
and ensemblRelease
arguments to be
defined. When both of these are set, bcbioRNASeq
will attempt to
download and use the pre-built Ensembl genome annotations from
AnnotationHub. This method is preferred over direct loading of a GTF/GFF
file because the AnnotationHub annotations contain additional rich
metadata not defined in a GFF file, specifically description
and entrezId
values.
Alternatively, if you are working with a non-standard or poorly annotated genome that isn't available on AnnotationHub, we provide fall back support for loading the genome annotations directly from the GTF file used by the bcbio RNA-seq pipeline. This should be fully automatic for an R session active on the same server used to run bcbio.
Example bcbio GTF path: genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf
.
In the event that you are working from a remote environment that doesn't
have file system access to the bcbio genomes
directory, we provide
additional fall back support for importing genome annotations from a GTF/GFF
directly with the gffFile
argument.
Internally, genome annotations are imported via the AcidGenomes package, specifically with either of these functions:
Genome build
Ensure that the organism and genome build used with bcio match correctly here
in the function call. In particular, for the legacy Homo sapiens
GRCh37/hg19 genome build, ensure that genomeBuild = "GRCh37"
. Otherwise,
the genomic ranges set in rowRanges()
will mismatch. It is recommended for
current projects that GRCh38/hg38 is used in place of GRCh37/hg19 if
possible.
DESeq2
DESeq2 is run automatically when bcbioRNASeq()
is called, unless
fast = TRUE
is set. Internally, this automatically slots normalized counts
into assays()
, and generates variance-stabilized counts.
Remote connections
When working on a local machine, it is possible to load bcbio run data over a
remote connection using sshfs. When loading a large number of samples, it
is preferable to call bcbioRNASeq()
directly in R on the remote server, if
possible.
Examples
uploadDir <- system.file("extdata/bcbio", package = "bcbioRNASeq")
## Gene level.
object <- bcbioRNASeq(
uploadDir = uploadDir,
level = "genes",
caller = "salmon",
organism = "Mus musculus",
ensemblRelease = 87L
)
#>
#> ────────────────────────────────────────────────────────────────────────────────
#> bcbioRNASeq
#> ────────────────────────────────────────────────────────────────────────────────
#> ℹ Importing bcbio-nextgen RNA-seq run.
#> ── Run info
#> uploadDir:
#> /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio
#> projectDir: 2018-03-18_GSE65267-merged
#> ℹ 6 samples detected:
#> • control_rep1
#> • control_rep2
#> • control_rep3
#> • fa_day7_rep1
#> • fa_day7_rep2
#> • fa_day7_rep3
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/project-summary.yaml using yaml::`read_yaml()`.
#> ! Data versions are missing.
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/programs.txt using base::`read.table()`.
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/bcbio-nextgen.log using base::`readLines()`.
#> ! bcbio-nextgen.log file is empty.
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/bcbio-nextgen-commands.log using base::`readLines()`.
#> ! bcbio-nextgen-commands.log file is empty.
#> ── Sample metadata
#> → Getting sample metadata from YAML.
#> → Getting sample quality control metrics from YAML.
#> ── Counts
#> ─── tximport
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/tx2gene.csv using base::`read.table()`.
#> ! No transcript versions to modify.
#> → Importing salmon transcript-level counts from quant.sf files using tximport 1.30.0.
#> countsFromAbundance: lengthScaledTPM
#> txOut: FALSE
#> reading in files with read_tsv
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#>
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> ─── featureCounts
#> → Importing aligned counts from featureCounts.
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/combined.counts using base::`read.table()`.
#> ── Feature metadata
#> → Making <GRanges> from Ensembl.
#> → Getting <EnsDb> from AnnotationHub 3.10.0 (2023-10-23).
#> ℹ "AH53222": Ensembl 87 EnsDb for Mus Musculus.
#> → Making <GRanges> from <EnsDb>.
#> Organism: Mus musculus
#> Genome build: GRCm38
#> Release: 87
#> Level: genes
#> → Adding broad class annotations.
#> → Downloading extra gene-level metadata from Ensembl.
#> → Importing /Users/mike/.cache/R/AcidGenomes/BiocFileCache/4b135d467278_gene.txt.gz using base::`read.table()`.
#> → Importing /Users/mike/.cache/R/AcidGenomes/BiocFileCache/4b131d5de02_external_synonym.txt.gz using base::`read.table()`.
#> → Importing /Users/mike/.cache/R/AcidGenomes/BiocFileCache/4b13648c4209_Mus_musculus.GRCm38.87.entrez.tsv.gz using base::`read.table()`.
#> ℹ Removing 2728 `TEC` identifiers.
#> → Defining names by `geneId` column in `mcols()`.
#> ── Metadata
#> ── DESeq2 normalizations
#> → `estimateSizeFactors()`
#> → `DESeq()`
#> using pre-existing size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> → `varianceStabilizingTransformation()`
#> → `fpkm()`
#> ✔ bcbio RNA-seq run imported successfully.
print(object)
#> bcbioRNASeq 0.6.2 with 100 rows and 6 columns
#> uploadDir: /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio
#> dates(2): 2018-03-18 2024-03-28
#> level: genes
#> caller: salmon
#> organism: Mus musculus
#> ensemblRelease: 87
#> interestingGroups: sampleName
#> class: RangedSummarizedExperiment
#> dim: 100 6
#> metadata(27): allSamples bcbioCommandsLog ... wd yaml
#> assays(7): counts aligned ... vst fpkm
#> rownames(100): ENSMUSG00000000001 ENSMUSG00000000003 ...
#> ENSMUSG00000062661 ENSMUSG00000074340
#> rowData names(8): broadClass description ... ncbiGeneId seqCoordSystem
#> colnames(6): control_rep1 control_rep2 ... fa_day7_rep2 fa_day7_rep3
#> colData names(26): averageInsertSize averageReadLength ... treatment
#> x5x3Bias
## Transcript level.
object <- bcbioRNASeq(
uploadDir = uploadDir,
level = "transcripts",
caller = "salmon",
organism = "Mus musculus",
ensemblRelease = 87L
)
#>
#> ────────────────────────────────────────────────────────────────────────────────
#> bcbioRNASeq
#> ────────────────────────────────────────────────────────────────────────────────
#> ℹ Importing bcbio-nextgen RNA-seq run.
#> ── Run info
#> uploadDir:
#> /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio
#> projectDir: 2018-03-18_GSE65267-merged
#> ℹ 6 samples detected:
#> • control_rep1
#> • control_rep2
#> • control_rep3
#> • fa_day7_rep1
#> • fa_day7_rep2
#> • fa_day7_rep3
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/project-summary.yaml using yaml::`read_yaml()`.
#> ! Data versions are missing.
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/programs.txt using base::`read.table()`.
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/bcbio-nextgen.log using base::`readLines()`.
#> ! bcbio-nextgen.log file is empty.
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/bcbio-nextgen-commands.log using base::`readLines()`.
#> ! bcbio-nextgen-commands.log file is empty.
#> ── Sample metadata
#> → Getting sample metadata from YAML.
#> → Getting sample quality control metrics from YAML.
#> ── Counts
#> ─── tximport
#> → Importing salmon transcript-level counts from quant.sf files using tximport 1.30.0.
#> countsFromAbundance: lengthScaledTPM
#> txOut: TRUE
#> reading in files with read_tsv
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#>
#> ── Feature metadata
#> → Making <GRanges> from Ensembl.
#> → Getting <EnsDb> from AnnotationHub 3.10.0 (2023-10-23).
#> ℹ "AH53222": Ensembl 87 EnsDb for Mus Musculus.
#> → Making <GRanges> from <EnsDb>.
#> Organism: Mus musculus
#> Genome build: GRCm38
#> Release: 87
#> Level: transcripts
#> → Adding broad class annotations.
#> → Downloading extra gene-level metadata from Ensembl.
#> → Importing /Users/mike/.cache/R/AcidGenomes/BiocFileCache/4b135d467278_gene.txt.gz using base::`read.table()`.
#> → Importing /Users/mike/.cache/R/AcidGenomes/BiocFileCache/4b131d5de02_external_synonym.txt.gz using base::`read.table()`.
#> → Importing /Users/mike/.cache/R/AcidGenomes/BiocFileCache/4b13648c4209_Mus_musculus.GRCm38.87.entrez.tsv.gz using base::`read.table()`.
#> ℹ Removing 2768 `TEC` identifiers.
#> → Defining names by `txId` column in `mcols()`.
#> ── Metadata
#> ✔ bcbio RNA-seq run imported successfully.
print(object)
#> bcbioRNASeq 0.6.2 with 100 rows and 6 columns
#> uploadDir: /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio
#> dates(2): 2018-03-18 2024-03-28
#> level: transcripts
#> caller: salmon
#> organism: Mus musculus
#> ensemblRelease: 87
#> interestingGroups: sampleName
#> fast: TRUE
#> class: RangedSummarizedExperiment
#> dim: 100 6
#> metadata(26): allSamples bcbioCommandsLog ... wd yaml
#> assays(3): counts avgTxLength tpm
#> rownames(100): ENSMUST00000000001 ENSMUST00000000003 ...
#> ENSMUST00000000674 ENSMUST00000000687
#> rowData names(16): broadClass description ... txName txSupportLevel
#> colnames(6): control_rep1 control_rep2 ... fa_day7_rep2 fa_day7_rep3
#> colData names(26): averageInsertSize averageReadLength ... treatment
#> x5x3Bias
## Fast mode.
object <- bcbioRNASeq(uploadDir = uploadDir, fast = TRUE)
#>
#> ────────────────────────────────────────────────────────────────────────────────
#> bcbioRNASeq
#> ────────────────────────────────────────────────────────────────────────────────
#> ℹ Importing bcbio-nextgen RNA-seq run.
#> ── Run info
#> uploadDir:
#> /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio
#> projectDir: 2018-03-18_GSE65267-merged
#> ℹ 6 samples detected:
#> • control_rep1
#> • control_rep2
#> • control_rep3
#> • fa_day7_rep1
#> • fa_day7_rep2
#> • fa_day7_rep3
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/project-summary.yaml using yaml::`read_yaml()`.
#> ! Data versions are missing.
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/programs.txt using base::`read.table()`.
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/bcbio-nextgen.log using base::`readLines()`.
#> ! bcbio-nextgen.log file is empty.
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/bcbio-nextgen-commands.log using base::`readLines()`.
#> ! bcbio-nextgen-commands.log file is empty.
#> ── Sample metadata
#> → Getting sample metadata from YAML.
#> → Getting sample quality control metrics from YAML.
#> ── Counts
#> ─── tximport
#> → Importing /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/Rtmp7t0ccV/temp_libpathb0f511e0f2fd/bcbioRNASeq/extdata/bcbio/2018-03-18_GSE65267-merged/tx2gene.csv using base::`read.table()`.
#> ! No transcript versions to modify.
#> → Importing salmon transcript-level counts from quant.sf files using tximport 1.30.0.
#> countsFromAbundance: lengthScaledTPM
#> txOut: FALSE
#> reading in files with read_tsv
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#>
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> ── Feature metadata
#> bcbio GTF file:
#> /n/app/bcbio/dev/genomes/Mmusculus/GRCm38_90/rnaseq/ref-transcripts.gtf
#> ! bcbio GTF file is not accessible.
#> ! Slotting empty ranges into `rowRanges()`.
#> ── Metadata
#> ✔ bcbio RNA-seq run imported successfully.