This function imports user-defined sample metadata saved in a spreadsheet.
Usage
importSampleData(
    file,
    lanes = 0L,
    pipeline = c("none", "bcbio", "cellranger"),
    autopadZeros = FALSE,
    ...
)
Arguments
- file
  character(1). File path.
- lanes
  integer(1). Number of lanes used to split the samples into technical
  replicates, identified by a lane suffix (i.e. _LXXX).
- pipeline
  character(1). Analysis pipeline:
  - "none": Simple mode, requiring only the "sampleId" column.
  - "bcbio": bcbio mode. See the "bcbio pipeline" section below for details.
  - "cellranger": Cell Ranger mode. Currently requires the "directory" column.
    Used by the Chromium R package.
- autopadZeros
  logical(1). Autopad zeros in sample identifiers, for improved sorting.
  Currently supported only for non-multiplexed samples. For example:
  sample_1, sample_2, ... sample_10 becomes sample_01, sample_02, ... sample_10.
- ...
  Passthrough arguments to the import method. For example, supports the sheet
  argument for Microsoft Excel files.
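The zero padding described for autopadZeros can be illustrated with a minimal base R sketch. The helper name padSampleIds is hypothetical and is not part of the package; it assumes each identifier ends in a run of digits and pads those digits to the width of the longest run. The actual implementation may handle more cases.

```r
## Hypothetical helper for illustration only; not the package's implementation.
padSampleIds <- function(x) {
    stopifnot(is.character(x))
    ## Match a prefix followed by a trailing run of digits.
    pattern <- "^(.*?)([0-9]+)$"
    hasDigits <- grepl(pattern, x, perl = TRUE)
    if (!any(hasDigits)) {
        return(x)
    }
    prefix <- sub(pattern, "\\1", x[hasDigits], perl = TRUE)
    digits <- sub(pattern, "\\2", x[hasDigits], perl = TRUE)
    ## Pad every number to the width of the longest one.
    width <- max(nchar(digits))
    padded <- formatC(as.integer(digits), width = width, flag = "0")
    x[hasDigits] <- paste0(prefix, padded)
    x
}

ids <- c("sample_1", "sample_2", "sample_10")
paddedIds <- padSampleIds(ids)
print(paddedIds)
#> [1] "sample_01" "sample_02" "sample_10"
```

After padding, a plain sort() returns the identifiers in numeric order, which is the sorting improvement the argument refers to.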
bcbio pipeline
Required column names. The "description" column is always required, and must
match the bcbio per-sample directory names exactly. Inclusion of the "fileName"
column isn't required but is recommended for data provenance. Note that some
bcbio examples on readthedocs use "samplename" (note case) instead of
"fileName". This function checks for that and will rename the column to
"fileName" automatically. We're using the sampleName column (note case) to
define unique sample names, in the event that bcbio has processed multiplexed
samples.
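The column handling described above can be sketched in base R as follows. This is an assumption-laden illustration: standardizeColumns is a hypothetical name, and a plain data.frame stands in for whatever object the function uses internally.

```r
## Hypothetical helper for illustration only; not the package's implementation.
standardizeColumns <- function(data) {
    stopifnot(is.data.frame(data))
    cols <- colnames(data)
    ## Rename the lowercase "samplename" column used in some bcbio examples.
    if ("samplename" %in% cols && !("fileName" %in% cols)) {
        cols[cols == "samplename"] <- "fileName"
        colnames(data) <- cols
    }
    ## The "description" column is always required.
    if (!("description" %in% colnames(data))) {
        stop("The \"description\" column is required.")
    }
    data
}

meta <- data.frame(
    samplename = c("sample1_R1.fastq.gz", "sample2_R1.fastq.gz"),
    description = c("sample1", "sample2")
)
meta <- standardizeColumns(meta)
print(colnames(meta))
```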
Demultiplexed samples. The samples in the bcbio run must map to the
"description" column. The values provided in description for demultiplexed
samples must be unique. They must also be syntactically valid, meaning that
they cannot contain illegal characters (e.g. spaces, non-alphanumerics,
dashes) or begin with a number. Consult the documentation in
help(topic = "make.names") for more information on valid names in R.
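A quick way to pre-check description values before import is to compare them against the output of base R's make.names(). The sketch below uses a hypothetical helper, checkDescription. Note that make.names() permits dots and underscores, so the function's own validation may be stricter than this check.

```r
## Hypothetical helper for illustration only; not the package's implementation.
checkDescription <- function(description) {
    description <- as.character(description)
    list(
        ## TRUE when no value is repeated.
        unique = anyDuplicated(description) == 0L,
        ## Values that make.names() would have to modify.
        invalid = description[description != make.names(description)]
    )
}

out <- checkDescription(c("sample1", "sample-2", "3sample", "sample 4"))
print(out[["unique"]])
print(out[["invalid"]])
```

Here "sample-2" and "sample 4" contain an illegal character, and "3sample" begins with a number, so all three are reported as invalid.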
Multiplexed samples. This applies to some single-cell RNA-seq formats,
including inDrops. In this case, bcbio will output per-sample directories
with this structure: description-revcomp. The function checks to see if the
"description" column is unique. If the values are duplicated, the function
assumes that bcbio processed multiplexed FASTQs, where multiple samples of
interest are barcoded inside a single FASTQ. In this case, you must supply
additional "index", "sequence", and "sampleName" columns.
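The detection logic described above can be sketched in base R. The helper detectMultiplexed is hypothetical and works on a plain data.frame; it treats duplicated "description" values as the signal for multiplexed samples and then checks for the extra columns.

```r
## Hypothetical helper for illustration only; not the package's implementation.
detectMultiplexed <- function(data) {
    stopifnot(is.data.frame(data), "description" %in% colnames(data))
    ## Duplicated descriptions indicate multiplexed FASTQs.
    multiplexed <- anyDuplicated(data[["description"]]) > 0L
    if (multiplexed) {
        required <- c("index", "sequence", "sampleName")
        missingCols <- setdiff(required, colnames(data))
        if (length(missingCols) > 0L) {
            stop(
                "Multiplexed samples require columns: ",
                toString(missingCols)
            )
        }
    }
    multiplexed
}

demultiplexed <- data.frame(description = c("sample1", "sample2"))
multiplexed <- data.frame(
    description = c("indrops1", "indrops1"),
    index = c(1L, 2L),
    sequence = c("CTCTCTAT", "TATCCTCT"),
    sampleName = c("sample1_1", "sample2_1")
)
print(detectMultiplexed(demultiplexed))
print(detectMultiplexed(multiplexed))
```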
Note that bcbio currently outputs the reverse complement index sequence in
the sample directory names (e.g. "sample-ATAGAGAG"). Define the forward
index barcode in the sequence column here, not the reverse complement. The
reverse complement will be calculated automatically and added as the
revcomp column in the sample metadata.
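To check by hand which directory name a forward barcode maps to, the reverse complement can be computed in base R. This is a minimal sketch with a hypothetical helper, reverseComplement, that assumes uppercase A, C, G, and T only; the package may calculate the revcomp column differently.

```r
## Hypothetical helper for illustration only; not the package's implementation.
reverseComplement <- function(x) {
    x <- as.character(x)
    ## Complement each base, then reverse the order of the characters.
    complement <- chartr("ACGT", "TGCA", x)
    vapply(
        X = strsplit(complement, split = ""),
        FUN = function(chars) paste(rev(chars), collapse = ""),
        FUN.VALUE = character(1L),
        USE.NAMES = FALSE
    )
}

sequence <- c("TATCCTCT", "CTCTCTAT", "GTAAGGAG", "ACTGCATA")
revcomp <- reverseComplement(sequence)
print(revcomp)
#> [1] "AGAGGATA" "ATAGAGAG" "CTCCTTAC" "TATGCAGT"

## Directory names follow the description-revcomp structure.
print(paste("indrops1", revcomp[[1L]], sep = "-"))
#> [1] "indrops1-AGAGGATA"
```

These values match the sequence and revcomp columns shown in the multiplexed example output.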
Examples
## Demultiplexed ====
file <- file.path(
    AcidExperimentTestsUrl,
    "bcbio-metadata-demultiplexed.csv"
)
x <- importSampleData(file, pipeline = "bcbio")
#> → Downloading <https://r.acidgenomics.com/testdata/acidexperiment/bcbio-metadata-demultiplexed.csv> to /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/RtmpZ1UL1t/7ROfTQF34V-170173567631355/pipette-f8a2476f6bb.csv.
#> → Importing <https://r.acidgenomics.com/testdata/acidexperiment/bcbio-metadata-demultiplexed.csv> using base::`read.table()`.
print(x)
#> DataFrame with 4 rows and 4 columns
#>         sampleName            fileName description genotype
#>           <factor>            <factor>    <factor> <factor>
#> sample1    sample1 sample1_R1.fastq.gz     sample1 wildtype
#> sample2    sample2 sample2_R1.fastq.gz     sample2 knockout
#> sample3    sample3 sample3_R1.fastq.gz     sample3 wildtype
#> sample4    sample4 sample4_R1.fastq.gz     sample4 knockout
## Multiplexed ====
file <- file.path(
    AcidExperimentTestsUrl,
    "bcbio-metadata-multiplexed-indrops.csv"
)
x <- importSampleData(file, pipeline = "bcbio")
#> → Downloading <https://r.acidgenomics.com/testdata/acidexperiment/bcbio-metadata-multiplexed-indrops.csv> to /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/RtmpZ1UL1t/ImczQpvX9e-170173567657616/pipette-f8a21e407147.csv.
#> → Importing <https://r.acidgenomics.com/testdata/acidexperiment/bcbio-metadata-multiplexed-indrops.csv> using base::`read.table()`.
#> ℹ Multiplexed samples detected.
print(x)
#> DataFrame with 8 rows and 8 columns
#>                   sampleName             fileName       description    index
#>                     <factor>             <factor>          <factor> <factor>
#> indrops1_AGAGGATA  sample2_1 indrops1_R1.fastq.gz indrops1-AGAGGATA        2
#> indrops1_ATAGAGAG  sample1_1 indrops1_R1.fastq.gz indrops1-ATAGAGAG        1
#> indrops1_CTCCTTAC  sample3_1 indrops1_R1.fastq.gz indrops1-CTCCTTAC        3
#> indrops1_TATGCAGT  sample4_1 indrops1_R1.fastq.gz indrops1-TATGCAGT        4
#> indrops2_AGAGGATA  sample2_2 indrops2_R1.fastq.gz indrops2-AGAGGATA        2
#> indrops2_ATAGAGAG  sample1_2 indrops2_R1.fastq.gz indrops2-ATAGAGAG        1
#> indrops2_CTCCTTAC  sample3_2 indrops2_R1.fastq.gz indrops2-CTCCTTAC        3
#> indrops2_TATGCAGT  sample4_2 indrops2_R1.fastq.gz indrops2-TATGCAGT        4
#>                   sequence aggregate genotype  revcomp
#>                   <factor>  <factor> <factor> <factor>
#> indrops1_AGAGGATA TATCCTCT   sample2 knockout AGAGGATA
#> indrops1_ATAGAGAG CTCTCTAT   sample1 wildtype ATAGAGAG
#> indrops1_CTCCTTAC GTAAGGAG   sample3 wildtype CTCCTTAC
#> indrops1_TATGCAGT ACTGCATA   sample4 knockout TATGCAGT
#> indrops2_AGAGGATA TATCCTCT   sample2 knockout AGAGGATA
#> indrops2_ATAGAGAG CTCTCTAT   sample1 wildtype ATAGAGAG
#> indrops2_CTCCTTAC GTAAGGAG   sample3 wildtype CTCCTTAC
#> indrops2_TATGCAGT ACTGCATA   sample4 knockout TATGCAGT