library(syntactic)
data(syntactic, package = "AcidTest")
object <- syntactic$character

Introduction

The syntactic package returns syntactically valid names from user-defined sample and other biological metadata. The package improves upon the make.names function defined in base R, specifically by adding smart handling of mixed-case acronyms (e.g. mRNA, RNAi), decimals, and other conventions commonly used in the life sciences.

The package is intended to work in two modes: string mode (default) and file rename mode.

There are five primary naming functions:

  • camelCase (e.g. “helloWorld”).
  • dottedCase (e.g. “hello.world”).
  • snakeCase (e.g. “hello_world”).
  • kebabCase (e.g. “hello-world”).
  • upperCamelCase (e.g. “HelloWorld”).

String mode

In general, stick with snakeCase or camelCase when sanitizing character strings in R.

print(object)
##  [1] "%GC"             "10uM"            "5'-3' bias"      "5prime"         
##  [5] "G2M.Score"       "hello world"     "HELLO WORLD"     "Mazda RX4"      
##  [9] "nCount"          "RNAi clones"     "tx2gene"         "TX2GeneID"      
## [13] "worfdbHTMLRemap" "123"

Use snake case formatting inside of scripts.

snakeCase(object)
##  [1] "percent_gc"        "x10um"             "x5_3_bias"        
##  [4] "x5prime"           "g2m_score"         "hello_world"      
##  [7] "hello_world"       "mazda_rx4"         "n_count"          
## [10] "rnai_clones"       "tx2gene"           "tx2_gene_id"      
## [13] "worfdb_html_remap" "x123"

We recommend using camel case inside of packages. The syntactic package offers two variants: relaxed mode (default) and strict mode. We prefer relaxed mode for function names because it generally keeps acronyms (e.g. ID) in upper case, which reads more legibly.

camelCase(object, strict = FALSE)
##  [1] "percentGC"       "x10um"           "x5x3Bias"        "x5prime"        
##  [5] "g2mScore"        "helloWorld"      "helloWORLD"      "mazdaRX4"       
##  [9] "nCount"          "rnaiClones"      "tx2gene"         "tx2GeneID"      
## [13] "worfdbHTMLRemap" "x123"

If you’re more old school and prefer using strict camel conventions, that’s also an option.

camelCase(object, strict = TRUE)
##  [1] "percentGc"       "x10um"           "x5x3Bias"        "x5prime"        
##  [5] "g2mScore"        "helloWorld"      "helloWorld"      "mazdaRx4"       
##  [9] "nCount"          "rnaiClones"      "tx2gene"         "tx2GeneId"      
## [13] "worfdbHtmlRemap" "x123"

Here’s the default convention in R, for comparison:

make.names(object)
##  [1] "X.GC"            "X10uM"           "X5..3..bias"     "X5prime"        
##  [5] "G2M.Score"       "hello.world"     "HELLO.WORLD"     "Mazda.RX4"      
##  [9] "nCount"          "RNAi.clones"     "tx2gene"         "TX2GeneID"      
## [13] "worfdbHTMLRemap" "X123"

Additionally, the package exports these string functions:

  • capitalize(): Capitalize the first letter of all words in a string.
  • sentenceCase(): Convert a string into sentence case.
  • makeNames(): A modern variant of make.names() that sanitizes using underscores instead of dots.
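To make the difference between dots and underscores concrete, here is a minimal base R sketch. The helper underscoreNames() is hypothetical and is not the package's makeNames() implementation; it only post-processes make.names() output, so results from makeNames() itself may differ.

```r
# Hypothetical helper for illustration only (not part of syntactic).
# It sanitizes with base R's make.names(), then swaps dots for underscores.
underscoreNames <- function(x) {
    gsub(".", "_", make.names(x), fixed = TRUE)
}

result <- underscoreNames(c("hello world", "Mazda RX4", "123"))
print(result)
## [1] "hello_world" "Mazda_RX4"   "X123"
```

Underscores are legal inside R names, so the result is still syntactically valid.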

File rename mode

The package also supports file name sanitization, using the rename = TRUE argument for supported functions. This currently includes kebabCase (recommended), snakeCase, and camelCase.

Here’s an example of how to quickly rename files on disk into kebab case:

input <- c(
    "mRNA Extraction.pdf",
    "inDrops v3 Library Prep.pdf"
)
invisible(file.create(input))
output <- kebabCase(input, rename = TRUE)
## Renaming '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/mRNA Extraction.pdf' to '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/mrna-extraction.pdf'.
## Renaming '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/inDrops v3 Library Prep.pdf' to '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/indrops-v3-library-prep.pdf'.
basename(output)
## [1] "mrna-extraction.pdf"         "indrops-v3-library-prep.pdf"

File names that start with a character R does not allow at the beginning of a name, such as a digit, can be kept unchanged by setting prefix = FALSE. This is often useful for sequencing data:

input <- paste0(seq(4), "_sample_", LETTERS[seq(4)], ".fastq.gz")
print(input)
## [1] "1_sample_A.fastq.gz" "2_sample_B.fastq.gz" "3_sample_C.fastq.gz"
## [4] "4_sample_D.fastq.gz"
invisible(file.create(input))
output <- kebabCase(input, rename = TRUE, prefix = FALSE)
## Renaming '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/1_sample_A.fastq.gz' to '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/1-sample-a.fastq.gz'.
## Renaming '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/2_sample_B.fastq.gz' to '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/2-sample-b.fastq.gz'.
## Renaming '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/3_sample_C.fastq.gz' to '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/3-sample-c.fastq.gz'.
## Renaming '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/4_sample_D.fastq.gz' to '/Volumes/widener/git/monorepo/r-packages/syntactic/vignettes/4-sample-d.fastq.gz'.
basename(output)
## [1] "1-sample-a.fastq.gz" "2-sample-b.fastq.gz" "3-sample-c.fastq.gz"
## [4] "4-sample-d.fastq.gz"
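For context on why a leading digit needs special handling, this short base R check shows how make.names() treats such names. It uses only base R and illustrative identifiers, and it says nothing about how syntactic implements the prefix argument internally.

```r
# Illustrative identifiers: one starts with a digit, one does not.
ids <- c("1_sample_A", "sample_A")

# Base R prepends "X" when a name does not start with a letter or dot.
sanitized <- make.names(ids)
print(sanitized)
## [1] "X1_sample_A" "sample_A"

# Only the second identifier was already a valid R name.
unchanged <- ids == sanitized
print(unchanged)
## [1] FALSE  TRUE
```

File names are not R symbols, so dropping the prefix is safe when the strings are only used on disk.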

Recursion inside of directories is supported using the recursive = TRUE argument.
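The sketch below illustrates what recursive renaming means, using base R only inside a temporary directory. The helper toKebab() is a crude, hypothetical stand-in for kebabCase() and does not reproduce the package's acronym handling; the directory and file names are made up for the example.

```r
# Hypothetical, simplified kebab-case converter for file names.
toKebab <- function(x) {
    ext <- tools::file_ext(x)
    stem <- tools::file_path_sans_ext(x)
    # Collapse runs of non-alphanumeric characters into single dashes.
    stem <- tolower(gsub("[^A-Za-z0-9]+", "-", stem))
    stem <- gsub("^-+|-+$", "", stem)
    ifelse(nzchar(ext), paste0(stem, ".", tolower(ext)), stem)
}

# Build a small directory tree in a temporary location.
root <- tempfile("syntactic-demo-")
dir.create(file.path(root, "sub"), recursive = TRUE)
invisible(file.create(file.path(root, "mRNA Extraction.pdf")))
invisible(file.create(file.path(root, "sub", "Library Prep.pdf")))

# Walk the tree and rename every file in place (directories are left alone).
from <- list.files(root, recursive = TRUE, full.names = TRUE)
to <- file.path(dirname(from), toKebab(basename(from)))
invisible(file.rename(from, to))

renamed <- list.files(root, recursive = TRUE)
print(renamed)
## [1] "mrna-extraction.pdf"  "sub/library-prep.pdf"
```

Working in a temporary directory keeps the experiment away from real data; try any bulk rename on a copy first.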

Our koopa shell bootloader uses these functions internally for quick interactive file renaming. In that package, refer to the kebab-case, snake-case, and camel-case documentation for details.

Additional methods

The syntactic package defines S4 methods for character vectors only, to keep the package lightweight with few dependencies. Additional S4 methods for Bioconductor classes, including DataFrame, GenomicRanges, and SummarizedExperiment, are defined in the basejump package.
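As a rough illustration of this split, the sketch below defines an S4 generic with a character method and then adds a method for another class separately. The generic toUnderscore() is hypothetical and exists only for this example; it is not exported by syntactic or basejump, and data.frame stands in for the Bioconductor classes to avoid extra dependencies.

```r
library(methods)

# Hypothetical generic, defined only for this example.
setGeneric("toUnderscore", function(object, ...) {
    standardGeneric("toUnderscore")
})

# Core method: operates directly on character vectors.
setMethod("toUnderscore", signature(object = "character"),
    function(object, ...) {
        gsub("[^A-Za-z0-9]+", "_", object)
    }
)

# A second method, which could live in another package, reuses the
# character method to sanitize the column names of a data frame.
setMethod("toUnderscore", signature(object = "data.frame"),
    function(object, ...) {
        names(object) <- toUnderscore(names(object))
        object
    }
)

chr <- toUnderscore("hello world")
print(chr)
## [1] "hello_world"

df <- data.frame("hello world" = 1, check.names = FALSE)
cols <- names(toUnderscore(df))
print(cols)
## [1] "hello_world"
```

Keeping class-specific methods in a separate package means the core package never has to import the heavier class definitions.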

R session information

utils::sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] syntactic_0.4.3
## 
## loaded via a namespace (and not attached):
##  [1] AcidGenerics_0.4.0 rprojroot_1.3-2    digest_0.6.25      crayon_1.3.4      
##  [5] assertthat_0.2.1   R6_2.4.1           backports_1.1.10   magrittr_1.5      
##  [9] evaluate_0.14      stringi_1.5.3      rlang_0.4.7        rstudioapi_0.11   
## [13] fs_1.5.0           goalie_0.4.9       ragg_0.3.1         rmarkdown_2.4     
## [17] pkgdown_1.6.1      desc_1.2.0         tools_4.0.2        stringr_1.4.0     
## [21] yaml_2.2.1         xfun_0.18          compiler_4.0.2     systemfonts_0.3.2 
## [25] AcidBase_0.2.0     memoise_1.1.0      htmltools_0.5.0    knitr_1.30