## nolint start
suppressPackageStartupMessages({
library(syntactic)
})
## nolint end
data(syntactic, package = "AcidTest")
object <- syntactic[["character"]]
Introduction
The syntactic package returns syntactically valid names from user-defined sample and other biological metadata. The package improves upon the make.names function defined in base R, specifically by adding smart handling of mixed-case acronyms (e.g. mRNA, RNAi), decimals, and other conventions commonly used in the life sciences.
The package is intended to work in two modes: string mode (default) and file rename mode.
There are five primary naming functions:
- camelCase (e.g. "helloWorld").
- dottedCase (e.g. "hello.world").
- snakeCase (e.g. "hello_world").
- kebabCase (e.g. "hello-world").
- upperCamelCase (e.g. "HelloWorld").
Recommended naming conventions
Unsure how to name variables and/or functions in R? Here are my current recommendations, for scientists and bioinformaticians:
- When in doubt, refer to the Bioconductor coding style guide.
- Use snakeCase as your daily driver, inside of scripts and R Markdown files. This convention is always legible and consistent. Don’t use this convention inside of packages, however.
- For packages, switch to camelCase. Use this consistently for function names, arguments, and internal variable names. Camel case is the preferred convention for Bioconductor, which uses the S4 class system for object-oriented programming. S4 generics on Bioconductor are primarily named in camel case; refer to BiocGenerics for details. Note that snake case is used for function names in RStudio / tidyverse packages, but these build on top of the S3 class system, which isn’t used inside of most biological R packages.
- upperCamelCase should only ever be used for S4 class definitions, such as SummarizedExperiment. Avoid naming functions with this convention.
- Use kebabCase for file names. Dashes (hyphens) serve as consistent word boundaries across platforms, whereas underscores do not. This applies in particular to URLs. This post by Jeff Atwood explains nicely why you should use dashes instead of underscores for file names.
- Avoid using dottedCase in R whenever possible. It’s the original naming convention defined in R, but it’s smarter to use snakeCase and its convention of underscores instead. The S3 class system uses a naming convention of generic.method, which can get mixed up with variables containing periods (dots) in the name.
- Valid names in R can’t start with a number. This is often an issue when importing sequencing data (e.g. FASTQ files). The naming functions will add an “x” prefix in this case.
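Two of the points above can be checked in base R alone. The following is a small sketch, not part of syntactic: format.report and the "report" class are hypothetical names chosen only to show how a dotted function name gets picked up as an S3 method, and make.names() shows the prefix base R adds to names that start with a digit.

```r
# Hypothetical helper, intended as an ordinary function with a dotted name.
format.report <- function(x, ...) {
    "helper called"
}

# Because the name matches the generic.method pattern, S3 dispatch treats it
# as the format() method for any object of the (hypothetical) class "report".
obj <- structure(list(), class = "report")
dispatched <- format(obj)
print(dispatched)

# Names can't start with a digit, so base R prepends "X" to make them valid.
prefixed <- make.names(c("123", "10uM"))
print(prefixed)
```

Naming the helper format_report instead would keep it out of S3 dispatch entirely, which is the main practical argument for underscores over dots.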
String mode
In general, stick with snakeCase or camelCase when sanitizing character strings in R.
print(object)
## [1] "%GC" "10uM" "5'-3' bias" "5prime"
## [5] "G2M.Score" "hello world" "HELLO WORLD" "Mazda RX4"
## [9] "nCount" "RNAi clones" "tx2gene" "TX2GeneID"
## [13] "worfdbHTMLRemap" "123"
Use snake case formatting inside of scripts.
snakeCase(object)
## [1] "percent_gc" "x10um" "x5_3_bias"
## [4] "x5prime" "g2m_score" "hello_world"
## [7] "hello_world" "mazda_rx4" "n_count"
## [10] "rnai_clones" "tx2gene" "tx2_gene_id"
## [13] "worfdb_html_remap" "x123"
We recommend using camel case inside of packages. The syntactic package offers two variants: relaxed (default) or strict mode. We prefer relaxed mode for function names, which generally returns acronyms (e.g. ID) more legibly.
camelCase(object, strict = FALSE)
## [1] "percentGC" "x10um" "x5x3Bias" "x5prime"
## [5] "g2mScore" "helloWorld" "helloWORLD" "mazdaRX4"
## [9] "nCount" "rnaiClones" "tx2gene" "tx2GeneID"
## [13] "worfdbHTMLRemap" "x123"
If you’re more old school and prefer using strict camel conventions, that’s also an option.
camelCase(object, strict = TRUE)
## [1] "percentGc" "x10um" "x5x3Bias" "x5prime"
## [5] "g2mScore" "helloWorld" "helloWorld" "mazdaRx4"
## [9] "nCount" "rnaiClones" "tx2gene" "tx2GeneId"
## [13] "worfdbHtmlRemap" "x123"
Here’s the default convention in R, for comparison:
make.names(object)
## [1] "X.GC" "X10uM" "X5..3..bias" "X5prime"
## [5] "G2M.Score" "hello.world" "HELLO.WORLD" "Mazda.RX4"
## [9] "nCount" "RNAi.clones" "tx2gene" "TX2GeneID"
## [13] "worfdbHTMLRemap" "X123"
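To make the contrast with make.names() more concrete, here is a minimal base R sketch of underscore-style sanitization. naiveSnake is a hypothetical helper written for illustration only; it is not how syntactic is implemented, and it deliberately skips the acronym, symbol, and case-boundary handling that snakeCase provides.

```r
# Hypothetical helper for illustration; not the syntactic implementation.
naiveSnake <- function(x) {
    # Collapse every run of non-alphanumeric characters into one underscore.
    x <- gsub("[^A-Za-z0-9]+", "_", x)
    # Drop leading and trailing underscores left over from the step above.
    x <- gsub("^_+|_+$", "", x)
    x <- tolower(x)
    # Names can't start with a digit, so add an "x" prefix where needed.
    startsWithDigit <- grepl("^[0-9]", x)
    x[startsWithDigit] <- paste0("x", x[startsWithDigit])
    x
}

examples <- c("hello world", "Mazda RX4", "G2M.Score", "5'-3' bias", "123")
result <- naiveSnake(examples)
print(result)
```

Unlike snakeCase, this sketch doesn't split on case boundaries or expand symbols, so "nCount" becomes "ncount" and "%GC" becomes "gc" rather than "n_count" and "percent_gc".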
Additionally, the package exports these string functions:
- capitalize(): Capitalize the first letter of all words in a string.
- sentenceCase(): Convert a string into sentence case.
- makeNames(): A modern variant of make.names() that sanitizes using underscores instead of dots.
File rename mode
The package also supports file name sanitization, using the syntacticRename function. This currently includes support for kebabCase (recommended), snakeCase, and camelCase, via the fun argument.
Here’s an example of how to quickly rename files on disk into kebab case:
input <- c(
"mRNA Extraction.pdf",
"inDrops v3 Library Prep.pdf"
)
invisible(file.create(input))
output <- syntacticRename(input, fun = "kebabCase")
## → Renaming /Users/mike/git/monorepo/r-packages/syntactic/vignettes/mRNA Extraction.pdf to /Users/mike/git/monorepo/r-packages/syntactic/vignettes/mrna-extraction.pdf.
## → Renaming /Users/mike/git/monorepo/r-packages/syntactic/vignettes/inDrops v3 Library Prep.pdf to /Users/mike/git/monorepo/r-packages/syntactic/vignettes/indrops-v3-library-prep.pdf.
print(output)
## $from
## [1] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/mRNA Extraction.pdf"
## [2] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/inDrops v3 Library Prep.pdf"
##
## $to
## [1] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/mrna-extraction.pdf"
## [2] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/indrops-v3-library-prep.pdf"
invisible(file.remove(output[["to"]]))
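For readers curious about what a rename like this involves, the following is a rough base R sketch that works inside a temporary directory. naiveKebabPath is a hypothetical helper, not the logic used by syntacticRename; it assumes a single-part file extension and doesn't handle multi-part extensions such as .fastq.gz.

```r
# Work in a throwaway directory so no real files are touched.
tmp <- tempfile("rename-sketch-")
dir.create(tmp)
from <- file.path(tmp, c("mRNA Extraction.pdf", "inDrops v3 Library Prep.pdf"))
invisible(file.create(from))

# Hypothetical helper: sanitize the file stem only, keeping the extension.
naiveKebabPath <- function(path) {
    ext <- tools::file_ext(path)
    stem <- tools::file_path_sans_ext(basename(path))
    # Replace runs of non-alphanumeric characters with a single dash.
    stem <- tolower(gsub("[^A-Za-z0-9]+", "-", stem))
    stem <- gsub("^-+|-+$", "", stem)
    file.path(dirname(path), paste0(stem, ".", ext))
}

to <- naiveKebabPath(from)
renamed <- file.rename(from = from, to = to)
existsAfter <- file.exists(to)
print(basename(to))

# Clean up the temporary directory.
unlink(tmp, recursive = TRUE)
```

Sanitizing only the stem keeps the extension intact, so the renamed files still open with the same applications.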
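File names that begin with a prefix considered illegal in R, such as a leading digit, can be allowed, which is often useful for sequencing data:
input <- c(
"1_sample_A.fastq.gz",
"2_sample_B.fastq.gz",
"3_sample_C.fastq.gz",
"4_sample_D.fastq.gz"
)
print(input)
## [1] "1_sample_A.fastq.gz" "2_sample_B.fastq.gz" "3_sample_C.fastq.gz"
## [4] "4_sample_D.fastq.gz"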
invisible(file.create(input))
output <- syntacticRename(input, fun = "kebabCase")
## → Renaming /Users/mike/git/monorepo/r-packages/syntactic/vignettes/1_sample_A.fastq.gz to /Users/mike/git/monorepo/r-packages/syntactic/vignettes/1-sample-a.fastq.gz.
## → Renaming /Users/mike/git/monorepo/r-packages/syntactic/vignettes/2_sample_B.fastq.gz to /Users/mike/git/monorepo/r-packages/syntactic/vignettes/2-sample-b.fastq.gz.
## → Renaming /Users/mike/git/monorepo/r-packages/syntactic/vignettes/3_sample_C.fastq.gz to /Users/mike/git/monorepo/r-packages/syntactic/vignettes/3-sample-c.fastq.gz.
## → Renaming /Users/mike/git/monorepo/r-packages/syntactic/vignettes/4_sample_D.fastq.gz to /Users/mike/git/monorepo/r-packages/syntactic/vignettes/4-sample-d.fastq.gz.
print(output)
## $from
## [1] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/1_sample_A.fastq.gz"
## [2] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/2_sample_B.fastq.gz"
## [3] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/3_sample_C.fastq.gz"
## [4] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/4_sample_D.fastq.gz"
##
## $to
## [1] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/1-sample-a.fastq.gz"
## [2] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/2-sample-b.fastq.gz"
## [3] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/3-sample-c.fastq.gz"
## [4] "/Users/mike/git/monorepo/r-packages/syntactic/vignettes/4-sample-d.fastq.gz"
invisible(file.remove(output[["to"]]))
Recursion inside of directories is supported using the recursive = TRUE argument.
Our koopa shell bootloader uses these functions internally for quick interactive file renaming. In that package, refer to the kebab-case, snake-case, and/or camel-case documentation for details.
Additional methods
The syntactic package only contains S4 methods defined for character vectors, to keep the package lightweight with few dependencies. Additional S4 methods for Bioconductor classes, including DataFrame, GenomicRanges, and SummarizedExperiment, are defined in the basejump package.
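The split described above, a core character method plus methods for other classes defined elsewhere, can be sketched with the methods package. naiveKebabCase is a hypothetical generic used only for illustration; it is not exported by syntactic or basejump, and the factor method stands in for the kind of class-specific method a downstream package might add.

```r
library(methods)

# Hypothetical generic for illustration; not exported by syntactic or basejump.
setGeneric(
    name = "naiveKebabCase",
    def = function(object, ...) {
        standardGeneric("naiveKebabCase")
    }
)

# Core method for character vectors.
setMethod(
    f = "naiveKebabCase",
    signature = signature(object = "character"),
    definition = function(object, ...) {
        object <- gsub("[^A-Za-z0-9]+", "-", object)
        object <- gsub("^-+|-+$", "", object)
        tolower(object)
    }
)

# Method for another class that reuses the character method on its levels.
setMethod(
    f = "naiveKebabCase",
    signature = signature(object = "factor"),
    definition = function(object, ...) {
        levels(object) <- naiveKebabCase(levels(object))
        object
    }
)

chr <- naiveKebabCase(c("Hello World", "RNAi clones"))
fct <- naiveKebabCase(factor(c("Hello World", "RNAi clones")))
print(chr)
print(levels(fct))
```

Because the factor method delegates to the character method, the sanitization rules live in one place, and additional classes can be supported without touching the core package.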
Related packages
If syntactic doesn’t work quite right for your workflow, these popular packages also provide excellent sanitization support:
- janitor by Sam Firke.
- lettercase by Christopher Brown.
- snakecase by Malte Grosser.
R session information
utils::sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.0
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] syntactic_0.7.0
##
## loaded via a namespace (and not attached):
## [1] crayon_1.5.2 vctrs_0.6.3 cli_3.6.1
## [4] knitr_1.44 rlang_1.1.1 xfun_0.40
## [7] stringi_1.7.12 purrr_1.0.2 textshaping_0.3.6
## [10] jsonlite_1.8.7 AcidBase_0.7.0 glue_1.6.2
## [13] S4Vectors_0.39.2 rprojroot_2.0.3 htmltools_0.5.6
## [16] stats4_4.3.1 ragg_1.2.5 sass_0.4.7
## [19] rmarkdown_2.25 evaluate_0.22 jquerylib_0.1.4
## [22] fastmap_1.1.1 yaml_2.3.7 lifecycle_1.0.3
## [25] memoise_2.0.1 stringr_1.5.0 compiler_4.3.1
## [28] fs_1.6.3 systemfonts_1.0.4 digest_0.6.33
## [31] R6_2.5.1 goalie_0.7.0 magrittr_2.0.3
## [34] AcidCLI_0.3.0 bslib_0.5.1 tools_4.3.1
## [37] AcidGenerics_0.7.0 BiocGenerics_0.47.0 pkgdown_2.0.7
## [40] cachem_1.0.8 desc_1.4.2