Import sample metadata — importSampleData • AcidExperiment

This function imports user-defined sample metadata saved in a spreadsheet.

Usage

importSampleData(
  file,
  lanes = 0L,
  pipeline = c("none", "bcbio", "cellranger"),
  autopadZeros = FALSE,
  ...
)

Arguments

file

character(1). File path.

lanes

integer(1). Number of lanes used to split the samples into technical replicates, each marked with a lane suffix (i.e. _LXXX).

pipeline

character(1). Analysis pipeline:

"none": Simple mode, requiring only "sampleId" column.
"bcbio": bcbio mode. See the "bcbio pipeline" section below for details.
"cellranger": Cell Ranger mode. Currently requires "directory" column. Used by Chromium R package.

autopadZeros

logical(1). Autopad zeros in sample identifiers, for improved sorting. Currently supported only for non-multiplexed samples. For example: sample_1, sample_2, ... sample_10 becomes sample_01, sample_02, ... sample_10.

...

Passthrough arguments to import method. For example, supports sheet argument for Microsoft Excel files.
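
The zero padding described for autopadZeros can be sketched in base R. This is only an illustration of the idea, not the package's implementation: padZeros is a hypothetical helper, and it assumes every identifier ends in a run of digits.

```r
## Hypothetical helper (not part of AcidExperiment): left-pad the trailing
## number in each identifier so that all numbers have the same width.
padZeros <- function(x) {
    ## Assumption: every identifier ends with at least one digit.
    stopifnot(is.character(x), all(grepl("[0-9]+$", x)))
    ## Everything before the trailing digits.
    prefix <- sub("[0-9]+$", "", x)
    ## The trailing digits themselves.
    digits <- substring(x, nchar(prefix) + 1L)
    ## Pad to the width of the longest number.
    width <- max(nchar(digits))
    paste0(prefix, formatC(as.integer(digits), width = width, flag = "0"))
}

ids <- c("sample_1", "sample_2", "sample_10")
padded <- padZeros(ids)
print(padded)
```

After padding, a plain alphabetical sort keeps sample_10 after sample_02, which is the sorting improvement the argument description refers to.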

Value

DFrame.

Note

Works with local or remote files.

Updated 2023-10-04.

bcbio pipeline

Required column names. The "description" column is always required, and its values must match the bcbio per-sample directory names exactly. The "fileName" column is optional but recommended for data provenance. Note that some bcbio examples on readthedocs use "samplename" (note case) instead of "fileName"; this function checks for that and renames the column to "fileName" automatically. The "sampleName" column (note case) defines unique sample names, in the event that bcbio has processed multiplexed samples.
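
The column handling described above can be approximated in base R as follows. This is a minimal sketch rather than the function's actual code; the toy meta data frame and its values are made up for illustration.

```r
## Toy metadata using the readthedocs-style "samplename" column.
## All values here are illustrative assumptions.
meta <- data.frame(
    samplename = c("sample1_R1.fastq.gz", "sample2_R1.fastq.gz"),
    description = c("sample1", "sample2"),
    genotype = c("wildtype", "knockout"),
    stringsAsFactors = FALSE
)

## The "description" column is always required.
stopifnot("description" %in% colnames(meta))

## Rename "samplename" to "fileName" when present.
colnames(meta)[colnames(meta) == "samplename"] <- "fileName"

print(colnames(meta))
```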

Demultiplexed samples. The samples in the bcbio run must map to the "description" column. The values provided in description for demultiplexed samples must be unique. They must also be syntactically valid, meaning that they cannot contain illegal characters (e.g. spaces, non-alphanumerics, dashes) or begin with a number. Consult the documentation in help(topic = "make.names") for more information on valid names in R.
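
One way to check "description" values before import is to compare them against the output of base R's make.names(), which leaves syntactically valid names unchanged. The descriptions vector below is a made-up example, not taken from the package's test data.

```r
## Made-up description values: two valid, three invalid.
descriptions <- c("sample1", "sample_2", "sample-3", "4sample", "sample 5")

## A value is syntactically valid if make.names() does not alter it.
isValid <- descriptions == make.names(descriptions)

## Demultiplexed descriptions must also be unique.
isUnique <- anyDuplicated(descriptions) == 0L

print(data.frame(
    description = descriptions,
    valid = isValid,
    suggested = make.names(descriptions)
))
```

Here the dash, the leading digit, and the space each make a value invalid, while the underscore is accepted.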

Multiplexed samples. This applies to some single-cell RNA-seq formats, including inDrops. In this case, bcbio will output per-sample directories with this structure: description-revcomp. The function checks to see if the "description" column is unique. If the values are duplicated, the function assumes that bcbio processed multiplexed FASTQs, where multiple samples of interest are barcoded inside a single FASTQ. In this case, you must supply additional "index", "sequence", and "sampleName" columns. Note that bcbio currently outputs the reverse complement index sequence in the sample directory names (e.g. "sample-ATAGAGAG"). Define the forward index barcode in the "sequence" column here, not the reverse complement. The reverse complement will be calculated automatically and added as the "revcomp" column in the sample metadata.
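
The multiplexing check and the reverse complement step can be sketched in base R. This is an illustration under stated assumptions, not the package's own code: revcomp is a hypothetical helper that handles only uppercase A, C, G and T, and the two-row meta data frame is a made-up example.

```r
## Hypothetical helper: reverse complement of uppercase DNA barcodes.
revcomp <- function(x) {
    stopifnot(is.character(x))
    ## Complement each base, then reverse the order of the bases.
    complemented <- chartr("ACGT", "TGCA", x)
    vapply(
        X = strsplit(complemented, split = ""),
        FUN = function(bases) paste(rev(bases), collapse = ""),
        FUN.VALUE = character(1L),
        USE.NAMES = FALSE
    )
}

## Made-up metadata: two barcoded samples sharing one FASTQ description.
meta <- data.frame(
    description = c("indrops1", "indrops1"),
    sequence = c("CTCTCTAT", "TATCCTCT"),
    stringsAsFactors = FALSE
)

## Duplicated descriptions indicate multiplexed samples.
isMultiplexed <- anyDuplicated(meta[["description"]]) > 0L

## Forward barcodes go in "sequence"; the reverse complement is derived.
meta[["revcomp"]] <- revcomp(meta[["sequence"]])

## Expected bcbio per-sample directory names: description-revcomp.
dirs <- paste(meta[["description"]], meta[["revcomp"]], sep = "-")
print(dirs)
```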

Author

Michael Steinbaugh

Examples

## Demultiplexed ====
file <- file.path(
    AcidExperimentTestsUrl,
    "bcbio-metadata-demultiplexed.csv"
)
x <- importSampleData(file, pipeline = "bcbio")
#> → Downloading <https://r.acidgenomics.com/testdata/acidexperiment/bcbio-metadata-demultiplexed.csv> to /private/var/folders/9b/4gh0pghx1b71jjd0wjh5mj880000gn/T/RtmpeEDvKY/EUrhFDv6iY-174285370725089/pipette-14b38231ca4e6.csv.
#> → Importing <https://r.acidgenomics.com/testdata/acidexperiment/bcbio-metadata-demultiplexed.csv> using base::`read.table()`.
print(x)
#> DataFrame with 4 rows and 4 columns
#>         sampleName            fileName description genotype
#>           <factor>            <factor>    <factor> <factor>
#> sample1    sample1 sample1_R1.fastq.gz     sample1 wildtype
#> sample2    sample2 sample2_R1.fastq.gz     sample2 knockout
#> sample3    sample3 sample3_R1.fastq.gz     sample3 wildtype
#> sample4    sample4 sample4_R1.fastq.gz     sample4 knockout

## Multiplexed ====
file <- file.path(
    AcidExperimentTestsUrl,
    "bcbio-metadata-multiplexed-indrops.csv"
)
x <- importSampleData(file, pipeline = "bcbio")
#> → Downloading <https://r.acidgenomics.com/testdata/acidexperiment/bcbio-metadata-multiplexed-indrops.csv> to /private/var/folders/9b/4gh0pghx1b71jjd0wjh5mj880000gn/T/RtmpeEDvKY/VqGf0FI9J6-174285370745151/pipette-14b382d8e0a0c.csv.
#> → Importing <https://r.acidgenomics.com/testdata/acidexperiment/bcbio-metadata-multiplexed-indrops.csv> using base::`read.table()`.
#> ℹ Multiplexed samples detected.
print(x)
#> DataFrame with 8 rows and 8 columns
#>                   sampleName             fileName       description    index
#>                     <factor>             <factor>          <factor> <factor>
#> indrops1_AGAGGATA  sample2_1 indrops1_R1.fastq.gz indrops1-AGAGGATA        2
#> indrops1_ATAGAGAG  sample1_1 indrops1_R1.fastq.gz indrops1-ATAGAGAG        1
#> indrops1_CTCCTTAC  sample3_1 indrops1_R1.fastq.gz indrops1-CTCCTTAC        3
#> indrops1_TATGCAGT  sample4_1 indrops1_R1.fastq.gz indrops1-TATGCAGT        4
#> indrops2_AGAGGATA  sample2_2 indrops2_R1.fastq.gz indrops2-AGAGGATA        2
#> indrops2_ATAGAGAG  sample1_2 indrops2_R1.fastq.gz indrops2-ATAGAGAG        1
#> indrops2_CTCCTTAC  sample3_2 indrops2_R1.fastq.gz indrops2-CTCCTTAC        3
#> indrops2_TATGCAGT  sample4_2 indrops2_R1.fastq.gz indrops2-TATGCAGT        4
#>                   sequence aggregate genotype  revcomp
#>                   <factor>  <factor> <factor> <factor>
#> indrops1_AGAGGATA TATCCTCT   sample2 knockout AGAGGATA
#> indrops1_ATAGAGAG CTCTCTAT   sample1 wildtype ATAGAGAG
#> indrops1_CTCCTTAC GTAAGGAG   sample3 wildtype CTCCTTAC
#> indrops1_TATGCAGT ACTGCATA   sample4 knockout TATGCAGT
#> indrops2_AGAGGATA TATCCTCT   sample2 knockout AGAGGATA
#> indrops2_ATAGAGAG CTCTCTAT   sample1 wildtype ATAGAGAG
#> indrops2_CTCCTTAC GTAAGGAG   sample3 wildtype CTCCTTAC
#> indrops2_TATGCAGT ACTGCATA   sample4 knockout TATGCAGT