## nolint start
suppressPackageStartupMessages({
library(DT)
library(Cellosaurus)
})
## nolint end
Introduction
The package is designed to import Cellosaurus annotations and serve as a cell line identifier mapping toolkit. The main Cellosaurus() generator function returns a Cellosaurus object, which extends the S4 DFrame class from Bioconductor. Additional functions are defined to work with the Cellosaurus object, notably mapCells(), which provides mapping support to Cellosaurus identifiers directly from cell line names, DepMap identifiers, and ATCC identifiers.
Cellosaurus table
The primary generator function imports annotations from the cellosaurus.txt file on the FTP server.
object <- Cellosaurus()
print(object)
## Cellosaurus 0.8.2 with 152231 rows and 36 columns
## cells(152231): CVCL_0001 CVCL_0002 ... CVCL_ZZ98 CVCL_ZZ99
## date: 2024-03-11
## release: 48
The object is structured as cells in rows, metadata in columns.
## [1] "accession" "ageAtSampling"
## [3] "atccId" "category"
## [5] "cellLineName" "comments"
## [7] "crossReferences" "date"
## [9] "depmapId" "diseases"
## [11] "hierarchy" "isCancer"
## [13] "isContaminated" "isProblematic"
## [15] "misspellings" "msiStatus"
## [17] "ncbiTaxonomyId" "ncitDiseaseId"
## [19] "ncitDiseaseName" "oncotreeCode"
## [21] "oncotreeLevel" "oncotreeMainType"
## [23] "oncotreeName" "oncotreeParent"
## [25] "oncotreeTissue" "organism"
## [27] "originateFromSameIndividual" "population"
## [29] "referencesIdentifiers" "samplingSite"
## [31] "sangerModelId" "secondaryAccession"
## [33] "sexOfCell" "strProfileData"
## [35] "synonyms" "webPages"
Data is encoded using run-length encoding (Rle from S4Vectors) to lower memory overhead.
This approach provides simple spreadsheet-like access to Cellosaurus annotations, which is more intuitive to users than nested JSON-style lists.
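As a rough illustration of the run-length encoding mentioned above, the sketch below uses base R's rle() on a toy organism vector. This is not the package's code: the Cellosaurus object relies on the Rle class from S4Vectors, which applies the same idea to table columns, and the values here are made up for the example.

```r
## Toy column with long runs of repeated values (hypothetical data).
toyOrganism <- c(
    rep("Homo sapiens", times = 5L),
    rep("Mus musculus", times = 3L),
    rep("Homo sapiens", times = 2L)
)
## Base R run-length encoding: one value and one length per run.
toyEncoded <- rle(toyOrganism)
print(toyEncoded[["values"]])
print(toyEncoded[["lengths"]])
## Decoding restores the original vector exactly.
toyDecoded <- inverse.rle(toyEncoded)
stopifnot(identical(toyDecoded, toyOrganism))
```

Ten elements collapse to three runs here. The saving in practice depends on how many consecutive rows share the same value.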
i <- seq(from = 1L, to = 10L)
j <- which(vapply(
X = as.data.frame(object),
FUN = is.atomic,
FUN.VALUE = logical(1L)
))
datatable(
data = as.data.frame(object)[i, j],
options = list(scrollX = TRUE)
)
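The vapply() call above keeps only atomic columns before the table is passed to datatable(). A minimal base R sketch of that filter follows, using a hypothetical two-row data frame with one list-column. The column names mirror the object, but the toy values are not real Cellosaurus records.

```r
## Hypothetical two-row table; values are placeholders.
toyTable <- data.frame(
    accession = c("TOY_0001", "TOY_0002"),
    isCancer = c(TRUE, FALSE),
    stringsAsFactors = FALSE
)
## Add a nested list-column, similar in spirit to a synonyms column.
toyTable[["synonyms"]] <- list(c("a", "b"), character(0L))
## Flag atomic columns; the list-column returns FALSE.
toyAtomic <- vapply(
    X = toyTable,
    FUN = is.atomic,
    FUN.VALUE = logical(1L)
)
toyCols <- which(toyAtomic)
## Keep only the flat columns.
toyFlat <- toyTable[, toyCols, drop = FALSE]
print(colnames(toyFlat))
```

Dropping list-columns this way leaves a flat table in which every cell holds a single value.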
Mapping cell lines
It remains a common problem in cancer research that cell line inventories are maintained only by cell line name, without systematic standardization against a reference database such as Cellosaurus, ATCC, or DepMap. The advantage of standardizing on Cellosaurus identifiers as the primary research resource identifier is that the Cellosaurus database is a superset that includes all cells in ATCC and DepMap. The Cellosaurus database also provides curated metadata on problematic cell lines, notably contamination, and on common misspellings of cell line names that persist across a number of commercial vendor databases.
The mapCells() function in the package aims to simplify mapping of cell line names and identifiers from other databases, notably DepMap and ATCC, to Cellosaurus identifiers. The function is designed to be as simple as possible and support mixed input in a single call.
cells <- c("ACH-000551", "CCL-240", "Duadi", "THP1", "RAW 264.7")
i <- mapCells(object, cells = cells)
print(i)
## ACH-000551 CCL-240 Duadi THP1 RAW 264.7
## "CVCL_0004" "CVCL_0002" "CVCL_0008" "CVCL_0006" "CVCL_0493"
datatable(
data = as.data.frame(object)[i, j],
options = list(scrollX = TRUE)
)
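To make the idea of mixed-input mapping concrete, here is a minimal base R sketch that matches normalized keys against several identifier columns in turn. It is an assumption-laden toy, not how mapCells() is implemented: the helper names (toyNormalize, toyMapCells) are hypothetical, and the lookup table contains only the five accessions from the example output above, with just the fields needed for the match filled in.

```r
## Toy lookup table; only the fields needed for this example are filled in.
toyLookup <- data.frame(
    accession = c(
        "CVCL_0004", "CVCL_0002", "CVCL_0008", "CVCL_0006", "CVCL_0493"
    ),
    depmapId = c("ACH-000551", NA, NA, NA, NA),
    atccId = c(NA, "CCL-240", NA, NA, NA),
    cellLineName = c(NA, NA, "Daudi", "THP-1", "RAW 264.7"),
    misspellings = c(NA, NA, "Duadi", NA, NA),
    stringsAsFactors = FALSE
)
## Hypothetical helper: uppercase and strip punctuation and spaces.
toyNormalize <- function(x) {
    out <- toupper(gsub(pattern = "[^[:alnum:]]", replacement = "", x = x))
    out[is.na(x)] <- NA_character_
    out
}
## Hypothetical helper: try each key column until a match is found.
toyMapCells <- function(table, cells) {
    keyCols <- c("depmapId", "atccId", "cellLineName", "misspellings")
    out <- rep(NA_character_, times = length(cells))
    names(out) <- cells
    query <- toyNormalize(cells)
    for (col in keyCols) {
        idx <- match(
            x = query,
            table = toyNormalize(table[[col]]),
            incomparables = NA_character_
        )
        fill <- is.na(out) & !is.na(idx)
        out[fill] <- table[["accession"]][idx[fill]]
    }
    out
}
toyCells <- c("ACH-000551", "CCL-240", "Duadi", "THP1", "RAW 264.7")
toyMapped <- toyMapCells(table = toyLookup, cells = toyCells)
print(toyMapped)
```

Normalizing both sides lets "THP1" match "THP-1", while the misspelling "Duadi" resolves through the dedicated misspellings column. Unmatched input stays NA in this sketch; how the real function handles mapping failures is not shown here.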
Excluding problematic cell lines
The Cellosaurus database keeps track of known issues with cell lines, which fall roughly into two classes: “problematic” and “contaminated”. It can be useful for downstream analysis to exclude these cell lines. In particular, we recommend generally excluding any contaminated cell lines, but not necessarily all problematic cell lines.
subset <- excludeProblematicCells(object)
print(nrow(subset))
## [1] 150923
## [1] FALSE
## [1] FALSE
Note that “contaminated” cell lines are a subset of “problematic” cell lines in the database. Contaminated cell lines should generally be avoided in downstream workflows.
subset <- excludeContaminatedCells(object)
print(nrow(subset))
## [1] 151465
## [1] FALSE
## [1] TRUE
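The subset relationship between the two flags can be sketched on a toy table with base R logical indexing. This only illustrates the filtering logic, assuming logical isProblematic and isContaminated columns like those listed earlier. The accessions are placeholders, and the exclude functions themselves may do more than this.

```r
## Hypothetical four-row table; accessions are placeholders.
toyFlags <- data.frame(
    accession = c("TOY_0001", "TOY_0002", "TOY_0003", "TOY_0004"),
    isProblematic = c(FALSE, TRUE, TRUE, FALSE),
    isContaminated = c(FALSE, FALSE, TRUE, FALSE),
    stringsAsFactors = FALSE
)
## Every contaminated line is also flagged as problematic.
stopifnot(all(toyFlags[["isProblematic"]][toyFlags[["isContaminated"]]]))
## Drop contaminated lines only; other problematic lines remain.
toyNoContam <- toyFlags[!toyFlags[["isContaminated"]], , drop = FALSE]
print(nrow(toyNoContam))
print(any(toyNoContam[["isProblematic"]]))
## Drop all problematic lines, which also removes contaminated ones.
toyNoProblem <- toyFlags[!toyFlags[["isProblematic"]], , drop = FALSE]
print(nrow(toyNoProblem))
print(any(toyNoProblem[["isContaminated"]]))
```

Filtering on the contaminated flag alone is the less aggressive option: it removes fewer rows and keeps lines that are problematic for other reasons.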
For reference, search the Cellosaurus website for “problematic” cell lines with "Problematic cell line" and “contaminated” cell lines with "Problematic cell line: Contaminated".
R session information
utils::sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Cellosaurus_0.8.2 DT_0.32
##
## loaded via a namespace (and not attached):
## [1] xfun_0.42 bslib_0.6.1 htmlwidgets_1.6.4
## [4] lattice_0.22-5 crosstalk_1.2.1 vctrs_0.6.5
## [7] tools_4.3.3 bitops_1.0-7 generics_0.1.3
## [10] curl_5.2.1 stats4_4.3.3 parallel_4.3.3
## [13] tibble_3.2.1 fansi_1.0.6 RSQLite_2.3.5
## [16] blob_1.2.4 pkgconfig_2.0.3 syntactic_0.7.1
## [19] Matrix_1.6-5 AcidCLI_0.3.0 dbplyr_2.4.0
## [22] desc_1.4.3 S4Vectors_0.40.2 lifecycle_1.0.4
## [25] GenomeInfoDbData_1.2.11 compiler_4.3.3 textshaping_0.3.7
## [28] GenomeInfoDb_1.38.7 htmltools_0.5.7 sass_0.4.8
## [31] RCurl_1.98-1.14 yaml_2.3.8 pipette_0.15.2
## [34] pkgdown_2.0.7 pillar_1.9.0 crayon_1.5.2
## [37] jquerylib_0.1.4 ellipsis_0.3.2 cachem_1.0.8
## [40] AcidGenerics_0.7.7.9000 AcidBase_0.7.3 goalie_0.7.7
## [43] tidyselect_1.2.0 digest_0.6.34 stringi_1.8.3
## [46] dplyr_1.1.4 purrr_1.0.2 fastmap_1.1.1
## [49] grid_4.3.3 cli_3.6.2 magrittr_2.0.3
## [52] utf8_1.2.4 withr_3.0.0 filelock_1.0.3
## [55] bit64_4.0.5 httr_1.4.7 rmarkdown_2.26
## [58] XVector_0.42.0 bit_4.0.5 ragg_1.2.7
## [61] memoise_2.0.1 evaluate_0.23 knitr_1.45
## [64] GenomicRanges_1.54.1 IRanges_2.36.0 BiocFileCache_2.10.1
## [67] rlang_1.1.3 glue_1.7.0 DBI_1.2.2
## [70] AcidPlyr_0.5.4.9000 BiocGenerics_0.48.1 jsonlite_1.8.8
## [73] R6_2.5.1 systemfonts_1.0.6 fs_1.6.3
## [76] zlibbioc_1.48.0