Repository: campbio/celda Branch: master Commit: d9a104fb0959 Files: 301 Total size: 3.4 MB Directory structure: gitextract_ojjfe_ko/ ├── .Rbuildignore ├── .github/ │ ├── .gitignore │ └── workflows/ │ ├── BioC-check.yaml │ └── check-standard.yaml ├── .gitignore ├── CONDUCT.md ├── DESCRIPTION ├── LICENSE ├── NAMESPACE ├── NEWS.md ├── NOTICE ├── R/ │ ├── RcppExports.R │ ├── aaa.R │ ├── accessors.R │ ├── celdaGridSearch.R │ ├── celdaProbabilityMap.R │ ├── celdaUMAP.R │ ├── celda_C.R │ ├── celda_CG.R │ ├── celda_G.R │ ├── celda_functions.R │ ├── celda_heatmap.R │ ├── celdatSNE.R │ ├── celdatosce.R │ ├── clusterProbability.R │ ├── data.R │ ├── decon.R │ ├── elbow.R │ ├── factorizeMatrix.R │ ├── featureModuleLookup.R │ ├── geneSetEnrich.R │ ├── initialize_clusters.R │ ├── loglikelihood.R │ ├── matrixSums.R │ ├── misc.R │ ├── moduleHeatmap.R │ ├── perplexity.R │ ├── plotHeatmap.R │ ├── plot_decontx.R │ ├── plot_dr.R │ ├── recursiveSplit.R │ ├── reorderCelda.R │ ├── reports.R │ ├── selectFeatures.R │ ├── semi_pheatmap.R │ ├── simulateCells.R │ ├── splitModule.R │ ├── split_clusters.R │ └── topRank.R ├── README.md ├── _pkgdown.yml ├── data/ │ ├── celdaCGGridSearchRes.rda │ ├── celdaCGMod.rda │ ├── celdaCGSim.rda │ ├── celdaCMod.rda │ ├── celdaCSim.rda │ ├── celdaGMod.rda │ ├── celdaGSim.rda │ ├── contaminationSim.rda │ ├── sampleCells.rda │ ├── sceCeldaC.rda │ ├── sceCeldaCG.rda │ ├── sceCeldaCGGridSearch.rda │ └── sceCeldaG.rda ├── docs/ │ ├── 404.html │ ├── CONDUCT.html │ ├── LICENSE-text.html │ ├── articles/ │ │ ├── articles/ │ │ │ ├── celda_pbmc3k.html │ │ │ ├── celda_pbmc3k_files/ │ │ │ │ ├── accessible-code-block-0.0.1/ │ │ │ │ │ └── empty-anchor.js │ │ │ │ ├── header-attrs-2.7/ │ │ │ │ │ └── header-attrs.js │ │ │ │ ├── kePrint-0.0.1/ │ │ │ │ │ └── kePrint.js │ │ │ │ └── lightable-0.0.1/ │ │ │ │ └── lightable.css │ │ │ ├── decontX_pbmc4k.html │ │ │ ├── decontX_pbmc4k_files/ │ │ │ │ ├── accessible-code-block-0.0.1/ │ │ │ │ │ └── empty-anchor.js │ │ │ │ └── header-attrs-2.7/ │ │ │ │ └── header-attrs.js │ │ │ ├── installation.html │ │ │ └── installation_files/ │ │ │ ├── accessible-code-block-0.0.1/ │ │ │ │ └── empty-anchor.js │ │ │ └── header-attrs-2.7/ │ │ │ └── header-attrs.js │ │ ├── celda.html │ │ ├── celda_files/ │ │ │ ├── accessible-code-block-0.0.1/ │ │ │ │ └── empty-anchor.js │ │ │ └── header-attrs-2.7/ │ │ │ └── header-attrs.js │ │ ├── celda_pbmc3k.html │ │ ├── celda_pbmc3k_files/ │ │ │ ├── accessible-code-block-0.0.1/ │ │ │ │ └── empty-anchor.js │ │ │ ├── kePrint-0.0.1/ │ │ │ │ └── kePrint.js │ │ │ └── lightable-0.0.1/ │ │ │ └── lightable.css │ │ ├── decontX.html │ │ ├── decontX_files/ │ │ │ ├── accessible-code-block-0.0.1/ │ │ │ │ └── empty-anchor.js │ │ │ └── header-attrs-2.7/ │ │ │ └── header-attrs.js │ │ ├── decontX_pbmc4k.html │ │ ├── decontX_pbmc4k_files/ │ │ │ └── accessible-code-block-0.0.1/ │ │ │ └── empty-anchor.js │ │ ├── index.html │ │ ├── installation.html │ │ └── installation_files/ │ │ └── accessible-code-block-0.0.1/ │ │ └── empty-anchor.js │ ├── authors.html │ ├── bootstrap-toc.css │ ├── bootstrap-toc.js │ ├── docsearch.css │ ├── docsearch.js │ ├── index.html │ ├── news/ │ │ └── index.html │ ├── pkgdown.css │ ├── pkgdown.js │ ├── pkgdown.yml │ ├── reference/ │ │ ├── appendCeldaList.html │ │ ├── availableModels.html │ │ ├── bestLogLikelihood.html │ │ ├── celda.html │ │ ├── celdaCGGridSearchRes.html │ │ ├── celdaCGMod.html │ │ ├── celdaCGSim.html │ │ ├── celdaCMod.html │ │ ├── celdaCSim.html │ │ ├── celdaClusters.html │ │ ├── celdaGMod.html │ │ ├── celdaGSim.html │ │ ├── celdaGridSearch.html │ │ ├── celdaHeatmap.html │ │ ├── celdaModel.html │ │ ├── celdaModules.html │ │ ├── celdaPerplexity-celdaList-method.html │ │ ├── celdaPerplexity.html │ │ ├── celdaProbabilityMap.html │ │ ├── celdaTsne.html │ │ ├── celdaUmap.html │ │ ├── celda_C.html │ │ ├── celda_CG.html │ │ ├── celda_G.html │ │ ├── celdatosce.html │ │ ├── clusterProbability.html │ │ ├── compareCountMatrix.html │ │ ├── contaminationSim.html │ │ ├── countChecksum-celdaList-method.html │ │ ├── countChecksum.html │ │ ├── decontX.html │ │ ├── decontXcounts.html │ │ ├── distinctColors.html │ │ ├── eigenMatMultInt.html │ │ ├── eigenMatMultNumeric.html │ │ ├── factorizeMatrix.html │ │ ├── fastNormProp.html │ │ ├── fastNormPropLog.html │ │ ├── fastNormPropSqrt.html │ │ ├── featureModuleLookup.html │ │ ├── featureModuleTable.html │ │ ├── geneSetEnrich.html │ │ ├── index.html │ │ ├── logLikelihood.html │ │ ├── logLikelihoodHistory.html │ │ ├── matrixNames.html │ │ ├── moduleHeatmap.html │ │ ├── nonzero.html │ │ ├── normalizeCounts.html │ │ ├── params.html │ │ ├── perplexity.html │ │ ├── plotCeldaViolin.html │ │ ├── plotDecontXContamination.html │ │ ├── plotDecontXMarkerExpression.html │ │ ├── plotDecontXMarkerPercentage.html │ │ ├── plotDimReduceCluster.html │ │ ├── plotDimReduceFeature.html │ │ ├── plotDimReduceGrid.html │ │ ├── plotDimReduceModule.html │ │ ├── plotGridSearchPerplexity.html │ │ ├── plotHeatmap.html │ │ ├── plotRPC.html │ │ ├── recodeClusterY.html │ │ ├── recodeClusterZ.html │ │ ├── recursiveSplitCell.html │ │ ├── recursiveSplitModule.html │ │ ├── reorderCelda.html │ │ ├── reportceldaCG.html │ │ ├── resList.html │ │ ├── resamplePerplexity.html │ │ ├── retrieveFeatureIndex.html │ │ ├── runParams.html │ │ ├── sampleCells.html │ │ ├── sampleLabel.html │ │ ├── sceCeldaC.html │ │ ├── sceCeldaCG.html │ │ ├── sceCeldaCGGridSearch.html │ │ ├── sceCeldaG.html │ │ ├── selectBestModel.html │ │ ├── selectFeatures.html │ │ ├── semiPheatmap.html │ │ ├── simulateCells.html │ │ ├── simulateContamination.html │ │ ├── splitModule.html │ │ ├── subsetCeldaList.html │ │ └── topRank.html │ └── sitemap.xml ├── inst/ │ └── rmarkdown/ │ ├── CeldaCG_PlotResults.Rmd │ └── CeldaCG_Run.Rmd ├── man/ │ ├── appendCeldaList.Rd │ ├── availableModels.Rd │ ├── bestLogLikelihood.Rd │ ├── celda.Rd │ ├── celdaCGGridSearchRes.Rd │ ├── celdaCGMod.Rd │ ├── celdaCGSim.Rd │ ├── celdaCMod.Rd │ ├── celdaCSim.Rd │ ├── celdaClusters.Rd │ ├── celdaGMod.Rd │ ├── celdaGSim.Rd │ ├── celdaGridSearch.Rd │ ├── celdaHeatmap.Rd │ ├── celdaModel.Rd │ ├── celdaModules.Rd │ ├── celdaPerplexity-celdaList-method.Rd │ ├── celdaPerplexity.Rd │ ├── celdaProbabilityMap.Rd │ ├── celdaTsne.Rd │ ├── celdaUmap.Rd │ ├── celda_C.Rd │ ├── celda_CG.Rd │ ├── celda_G.Rd │ ├── celdatosce.Rd │ ├── clusterProbability.Rd │ ├── compareCountMatrix.Rd │ ├── contaminationSim.Rd │ ├── countChecksum-celdaList-method.Rd │ ├── countChecksum.Rd │ ├── decontX.Rd │ ├── decontXcounts.Rd │ ├── distinctColors.Rd │ ├── eigenMatMultInt.Rd │ ├── eigenMatMultNumeric.Rd │ ├── factorizeMatrix.Rd │ ├── fastNormProp.Rd │ ├── fastNormPropLog.Rd │ ├── fastNormPropSqrt.Rd │ ├── featureModuleLookup.Rd │ ├── featureModuleTable.Rd │ ├── geneSetEnrich.Rd │ ├── logLikelihood.Rd │ ├── logLikelihoodHistory.Rd │ ├── matrixNames.Rd │ ├── moduleHeatmap.Rd │ ├── nonzero.Rd │ ├── normalizeCounts.Rd │ ├── params.Rd │ ├── perplexity.Rd │ ├── plotCeldaViolin.Rd │ ├── plotDecontXContamination.Rd │ ├── plotDecontXMarkerExpression.Rd │ ├── plotDecontXMarkerPercentage.Rd │ ├── plotDimReduceCluster.Rd │ ├── plotDimReduceFeature.Rd │ ├── plotDimReduceGrid.Rd │ ├── plotDimReduceModule.Rd │ ├── plotGridSearchPerplexity.Rd │ ├── plotHeatmap.Rd │ ├── plotRPC.Rd │ ├── recodeClusterY.Rd │ ├── recodeClusterZ.Rd │ ├── recursiveSplitCell.Rd │ ├── recursiveSplitModule.Rd │ ├── reorderCelda.Rd │ ├── reportceldaCG.Rd │ ├── resList.Rd │ ├── resamplePerplexity.Rd │ ├── retrieveFeatureIndex.Rd │ ├── runParams.Rd │ ├── sampleCells.Rd │ ├── sampleLabel.Rd │ ├── sceCeldaC.Rd │ ├── sceCeldaCG.Rd │ ├── sceCeldaCGGridSearch.Rd │ ├── sceCeldaG.Rd │ ├── selectBestModel.Rd │ ├── selectFeatures.Rd │ ├── semiPheatmap.Rd │ ├── simulateCells.Rd │ ├── simulateContamination.Rd │ ├── splitModule.Rd │ ├── subsetCeldaList.Rd │ └── topRank.Rd ├── src/ │ ├── DecontX.cpp │ ├── Makevars │ ├── Makevars.win │ ├── RcppExports.cpp │ ├── cG_calcGibbsProbY.cpp │ ├── eigenMatMultInt.cpp │ ├── matrixNorm.cpp │ ├── matrixSums.c │ ├── matrixSumsSparse.cpp │ └── perplexity.c ├── tests/ │ ├── testthat/ │ │ ├── test-celda-functions.R │ │ ├── test-celda_C.R │ │ ├── test-celda_CG.R │ │ ├── test-celda_G.R │ │ ├── test-decon.R │ │ ├── test-intialize_cluster.R │ │ ├── test-matrixSums.R │ │ └── test-with_seed.R │ └── testthat.R └── vignettes/ ├── articles/ │ ├── celda_pbmc3k.Rmd │ ├── decontX_pbmc4k.Rmd │ └── installation.Rmd ├── celda.Rmd └── decontX.Rmd ================================================ FILE CONTENTS ================================================ ================================================ FILE: .Rbuildignore ================================================ ^renv$ ^renv\.lock$ ^.*\.Rproj$ ^\.Rproj\.user$ ^CONDUCT\.md$ .travis.yml NOTICE _pkgdown.yml ^doc$ ^Meta$ renv* ^_pkgdown\.yml$ ^docs$ ^pkgdown$ ^\.github$ ================================================ FILE: .github/.gitignore ================================================ *.html ================================================ FILE: .github/workflows/BioC-check.yaml ================================================ # Workflow derived from https://github.com/r-lib/actions/tree/master/examples # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help on: push: branches: [devel, master] pull_request: branches: [devel, master] name: BioC-check jobs: R-CMD-check: runs-on: ${{ matrix.config.os }} name: ${{ matrix.config.os }} (${{ matrix.config.r }}) strategy: fail-fast: false matrix: config: - {os: macOS-latest, r: 'release'} env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} R_KEEP_PKG_SOURCE: yes steps: - uses: actions/checkout@v2 - uses: r-lib/actions/setup-pandoc@v2 - uses: r-lib/actions/setup-r@v2 with: r-version: ${{ matrix.config.r }} http-user-agent: ${{ matrix.config.http-user-agent }} use-public-rspm: true - uses: r-lib/actions/setup-r-dependencies@v2 with: extra-packages: | any::rcmdcheck url::https://cran.r-project.org/src/contrib/Archive/dbplyr/dbplyr_2.3.4.tar.gz - name: Install XQuartz on macOS if: runner.os == 'macOS' run: brew install xquartz --cask - name: Install fftw3 on macOS if: runner.os == 'macOS' run: brew install fftw - name: Run BiocCheck run: | BiocManager::install("BiocCheck") library(BiocCheck) BiocCheck::BiocCheck(".", `quit-with-status` = TRUE, `no-check-R-ver` = TRUE, `no-check-bioc-help` = TRUE ) shell: Rscript {0} ================================================ FILE: .github/workflows/check-standard.yaml ================================================ # Workflow derived from https://github.com/r-lib/actions/tree/master/examples # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help on: push: branches: [devel, master] pull_request: branches: [devel, master] name: R-CMD-check jobs: R-CMD-check: runs-on: ${{ matrix.config.os }} name: ${{ matrix.config.os }} (${{ matrix.config.r }}) strategy: fail-fast: false matrix: config: - {os: macOS-latest, r: 'release'} - {os: windows-latest, r: 'release'} - {os: ubuntu-latest, r: 'release'} env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} R_KEEP_PKG_SOURCE: yes steps: - uses: actions/checkout@v2 - uses: r-lib/actions/setup-pandoc@v2 - uses: r-lib/actions/setup-r@v2 with: r-version: ${{ matrix.config.r }} http-user-agent: ${{ matrix.config.http-user-agent }} use-public-rspm: false - name: Install XQuartz on macOS if: runner.os == 'macOS' run: brew install xquartz --cask - name: Install fftw3 on macOS if: runner.os == 'macOS' run: brew install fftw - uses: r-lib/actions/setup-r-dependencies@v2 with: extra-packages: | any::rcmdcheck any::tinytex url::https://cran.r-project.org/src/contrib/Archive/dbplyr/dbplyr_2.3.4.tar.gz - uses: r-lib/actions/setup-tinytex@v2 - uses: r-lib/actions/check-r-package@v2 - name: Show testthat output if: always() run: find check -name 'testthat.Rout*' -exec cat '{}' \; || true shell: bash - name: Upload check results if: failure() uses: actions/upload-artifact@main with: name: ${{ runner.os }}-r${{ matrix.config.r }}-results path: check ================================================ FILE: .gitignore ================================================ # History files .Rhistory .Rapp.history # Session Data files .RData # Example code in package build process *-Ex.R # Output files from R CMD build /*.tar.gz # Output files from R CMD check /*.Rcheck/ # RStudio files .Rproj.user/ # produced vignettes vignettes/*.html vignettes/*.pdf vignettes/*.log inst/rmarkdown/*.html inst/rmarkdown/*.rds inst/rmarkdown/*.csv # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3 .httr-oauth # knitr and R markdown default cache directories /*_cache/ /cache/ # Temporary files created by R markdown *.utf8.md *.knit.md .Rproj.user celda.Rproj .DS_Store src/*.o src/*.dll src/*.so src-i386/* src-x64/* etc/* # Celda log files with default prefix Celda_chain.*log.txt inst/doc doc Meta .Rprofile renv/ renv.lock ================================================ FILE: CONDUCT.md ================================================ # Contributor Code of Conduct As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities. We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct. Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team. Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers. This Code of Conduct is adapted from the Contributor Covenant (http:contributor-covenant.org), version 1.0.0, available at http://contributor-covenant.org/version/1/0/0/ ================================================ FILE: DESCRIPTION ================================================ Package: celda Title: CEllular Latent Dirichlet Allocation Version: 1.18.2 Authors@R: c(person("Joshua", "Campbell", email = "camp@bu.edu", role = c("aut", "cre")), person("Shiyi", "Yang", email="syyang@bu.edu", role = c("aut")), person("Zhe", "Wang", email="zhe@bu.edu", role = c("aut")), person("Sean", "Corbett", email = "scorbett@bu.edu", role = c("aut")), person("Yusuke", "Koga", email="ykoga07@bu.edu", role = c("aut"))) Description: Celda is a suite of Bayesian hierarchical models for clustering single-cell RNA-sequencing (scRNA-seq) data. It is able to perform "bi-clustering" and simultaneously cluster genes into gene modules and cells into cell subpopulations. It also contains DecontX, a novel Bayesian method to computationally estimate and remove RNA contamination in individual cells without empty droplet information. A variety of scRNA-seq data visualization functions is also included. Depends: R (>= 4.0), SingleCellExperiment, Matrix VignetteBuilder: knitr Imports: plyr, foreach, ggplot2, RColorBrewer, grid, scales, gtable, grDevices, graphics, matrixStats, doParallel, digest, methods, reshape2, S4Vectors, data.table, Rcpp, RcppEigen, uwot, enrichR, SummarizedExperiment, MCMCprecision, ggrepel, Rtsne, withr, scater (>= 1.14.4), scran, dbscan, DelayedArray, stringr, ComplexHeatmap, gridExtra, circlize Suggests: testthat, knitr, roxygen2, rmarkdown, biomaRt, covr, BiocManager, BiocStyle, TENxPBMCData, singleCellTK, M3DExampleData LinkingTo: Rcpp, RcppEigen License: MIT + file LICENSE Encoding: UTF-8 LazyData: true RoxygenNote: 7.3.1 BugReports: https://github.com/campbio/celda/issues biocViews: SingleCell, GeneExpression, Clustering, Sequencing, Bayesian, ImmunoOncology, DataImport NeedsCompilation: yes ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2018 Joshua D Campbell Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: NAMESPACE ================================================ # Generated by roxygen2: do not edit by hand export("celdaClusters<-") export("celdaModules<-") export("decontXcounts<-") export("sampleLabel<-") export(appendCeldaList) export(availableModels) export(bestLogLikelihood) export(celda) export(celdaClusters) export(celdaGridSearch) export(celdaHeatmap) export(celdaModel) export(celdaModules) export(celdaPerplexity) export(celdaProbabilityMap) export(celdaTsne) export(celdaUmap) export(celda_C) export(celda_CG) export(celda_G) export(celdatosce) export(clusterProbability) export(compareCountMatrix) export(countChecksum) export(decontX) export(decontXcounts) export(distinctColors) export(factorizeMatrix) export(featureModuleLookup) export(featureModuleTable) export(geneSetEnrich) export(logLikelihood) export(logLikelihoodHistory) export(matrixNames) export(moduleHeatmap) export(normalizeCounts) export(params) export(perplexity) export(plotCeldaViolin) export(plotDecontXContamination) export(plotDecontXMarkerExpression) export(plotDecontXMarkerPercentage) export(plotDimReduceCluster) export(plotDimReduceFeature) export(plotDimReduceGrid) export(plotDimReduceModule) export(plotGridSearchPerplexity) export(plotHeatmap) export(plotRPC) export(recodeClusterY) export(recodeClusterZ) export(recursiveSplitCell) export(recursiveSplitModule) export(reorderCelda) export(reportCeldaCGPlotResults) export(reportCeldaCGRun) export(resList) export(resamplePerplexity) export(retrieveFeatureIndex) export(runParams) export(sampleLabel) export(selectBestModel) export(selectFeatures) export(simulateCells) export(simulateContamination) export(splitModule) export(subsetCeldaList) export(topRank) exportMethods("celdaClusters<-") exportMethods("celdaModules<-") exportMethods("decontXcounts<-") exportMethods("sampleLabel<-") exportMethods(bestLogLikelihood) exportMethods(celdaClusters) exportMethods(celdaGridSearch) exportMethods(celdaHeatmap) exportMethods(celdaModel) exportMethods(celdaModules) exportMethods(celdaPerplexity) exportMethods(celdaProbabilityMap) exportMethods(celdaTsne) exportMethods(celdaUmap) exportMethods(celda_C) exportMethods(celda_CG) exportMethods(celda_G) exportMethods(celdatosce) exportMethods(clusterProbability) exportMethods(compareCountMatrix) exportMethods(countChecksum) exportMethods(decontX) exportMethods(decontXcounts) exportMethods(factorizeMatrix) exportMethods(featureModuleLookup) exportMethods(geneSetEnrich) exportMethods(logLikelihood) exportMethods(logLikelihoodHistory) exportMethods(matrixNames) exportMethods(moduleHeatmap) exportMethods(params) exportMethods(perplexity) exportMethods(plotCeldaViolin) exportMethods(plotDimReduceCluster) exportMethods(plotDimReduceFeature) exportMethods(plotDimReduceGrid) exportMethods(plotDimReduceModule) exportMethods(plotGridSearchPerplexity) exportMethods(plotRPC) exportMethods(recursiveSplitCell) exportMethods(recursiveSplitModule) exportMethods(reorderCelda) exportMethods(resList) exportMethods(resamplePerplexity) exportMethods(runParams) exportMethods(sampleLabel) exportMethods(selectBestModel) exportMethods(selectFeatures) exportMethods(splitModule) exportMethods(subsetCeldaList) import(Rcpp) import(RcppEigen) import(foreach) import(grDevices) import(graphics) import(grid) import(uwot) importClassesFrom(Matrix,dgCMatrix) importClassesFrom(SingleCellExperiment,SingleCellExperiment) importFrom(MCMCprecision,fit_dirichlet) importFrom(Matrix,colSums) importFrom(Matrix,rowSums) importFrom(Matrix,t) importFrom(RColorBrewer,brewer.pal) importFrom(Rtsne,Rtsne) importFrom(data.table,as.data.table) importFrom(digest,digest) importFrom(doParallel,registerDoParallel) importFrom(enrichR,enrichr) importFrom(enrichR,listEnrichrDbs) importFrom(ggrepel,geom_text_repel) importFrom(grDevices,colorRampPalette) importFrom(grDevices,colors) importFrom(grDevices,hsv) importFrom(grDevices,rgb2hsv) importFrom(grid,grid.pretty) importFrom(gtable,gtable) importFrom(gtable,gtable_add_grob) importFrom(gtable,gtable_height) importFrom(gtable,gtable_width) importFrom(matrixStats,logSumExp) importFrom(methods,.hasSlot) importFrom(methods,is) importFrom(methods,new) importFrom(plyr,mapvalues) importFrom(reshape2,melt) importFrom(scales,brewer_pal) importFrom(scales,dscale) importFrom(scales,hue_pal) importFrom(withr,with_seed) importMethodsFrom(Matrix,"%*%") useDynLib(celda,"_colSumByGroup") useDynLib(celda,"_colSumByGroupChange") useDynLib(celda,"_colSumByGroupChange_numeric") useDynLib(celda,"_colSumByGroup_numeric") useDynLib(celda,"_perplexityG") useDynLib(celda,"_rowSumByGroup") useDynLib(celda,"_rowSumByGroupChange") useDynLib(celda,"_rowSumByGroupChange_numeric") useDynLib(celda,"_rowSumByGroup_numeric") ================================================ FILE: NEWS.md ================================================ # celda v1.18.2 (2024-04-02) * Updated Makevar files to new CRAN standards * Fixed unit test causing error # celda v1.18.1 (2023-11-05) * Update to match Bioconductor release version * Removed multipanelfigure as a dependency # celda v1.14.2 (2023-01-19) * Update to match Bioconductor release version # celda v1.13.0 (2022-10-20) * Bug fixes related to cluster labels stored as factors and plotting * Updated sparse matrix conversion to work with Matrix v1.4-2 # celda v1.12.0 (2022-04-30) * Update to match Bioconductor 3.15 release version # celda v1.11.1 (2022-03-31) * Fixes to reports * Use smoothe splines for perplexity and RPC plots # celda v1.11.0 (2022-03-31) * Improvments to decontX vignette * Added ability to subsample to speed up perplexity calculations * Added ability to use batch parameter with the raw matrix in decontX # celda v1.10.0 (2021-12-28) * Update to match Bioconductor release version # celda v1.9.3 (2021-10-04) * Fixed bug in checking background matrix with decontX * Switched to using Github Actions for Continuous Integration * Fixed plotting bugs in celda results reports * Speed up final step in decontX when creating final decontaminated matrix # celda v1.9.2 (2021-07-19) * Added a `NEWS.md` file to track changes to the package. * Added new tutorials and documentation generated with pkgdown. * Removed warnings in plotRPC functions. * Added use of "displayName" to several functions that show feature names. * Minor bug fix when the input matrix was sparse and contained non-integer values. * Several improvements to plotting functions. # celda v1.7.7 (2021-04-12): * Added handling for sparse matrices # celda v1.7.6 (2021-04-04): * Added functions for creating HTML reports * Fixed bug in decontX plotting # celda v1.7.4 (2021-03-09): * Enable input of raw/droplet matrix into decontX to estimate ambient RNA # celda v1.1.6 (2019-07-16): * Add multiclass decision tree # celda v1.1.4 (2019-05-28): * Add Alternate headings support for plotDimReduceFeature # celda v1.1.3 (2019-05-14): * Add multiclass decision tree (MCDT) cell cluster annotation # celda v1.1.2 (2019-05-14): * Fix a bug in celdaHeatmap # celda v1.0.1 (2019-05-09): * Default seed setting to maintain reproducibility # celda v0.99.34 (2019-04-23): * Minor changes to the vignettes # celda v0.99.23 (2019-04-10): * Remove pheatmap import # celda v0.99.22 (2019-04-09): * Package celda, for bi-clustering of single-cell 'omics data. # celda v0.99.8 (2019-03-11): * Second submission to Bioconductor # celda v0.99.0 (2018-05-15): * First submission to Bioconductor ================================================ FILE: NOTICE ================================================ The celda package incldues other open source software components, including functions adapted from other R libraries. The use of these components is annotated throughout the codebase. The following is a list of these components; their corresponding licenses are listed below. - gtools - pheatmap gtools, pheatmap ----------------- GNU GENERAL PUBLIC LICENSE Version 2, June 1991 Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Lesser General Public License instead.) You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. {description} Copyright (C) {year} {fullname} This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. Also add information on how to contact you by electronic and paper mail. If the program is interactive, make it output a short notice like this when it starts in an interactive mode: Gnomovision version 69, Copyright (C) year name of author Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program. You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names: Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (which makes passes at compilers) written by James Hacker. {signature of Ty Coon}, 1 April 1989 Ty Coon, President of Vice This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. ================================================ FILE: R/RcppExports.R ================================================ # Generated by using Rcpp::compileAttributes() -> do not edit by hand # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 decontXEM <- function(counts, counts_colsums, theta, estimate_eta, eta, phi, z, estimate_delta, delta, pseudocount) { .Call('_celda_decontXEM', PACKAGE = 'celda', counts, counts_colsums, theta, estimate_eta, eta, phi, z, estimate_delta, delta, pseudocount) } decontXLogLik <- function(counts, theta, eta, phi, z, pseudocount) { .Call('_celda_decontXLogLik', PACKAGE = 'celda', counts, theta, eta, phi, z, pseudocount) } decontXInitialize <- function(counts, theta, z, pseudocount) { .Call('_celda_decontXInitialize', PACKAGE = 'celda', counts, theta, z, pseudocount) } calculateNativeMatrix <- function(counts, theta, eta, phi, z, pseudocount) { .Call('_celda_calculateNativeMatrix', PACKAGE = 'celda', counts, theta, eta, phi, z, pseudocount) } cG_calcGibbsProbY_Simple <- function(counts, nGbyTS, nTSbyC, nbyTS, nbyG, y, L, index, gamma, beta, delta) { .Call('_celda_cG_calcGibbsProbY_Simple', PACKAGE = 'celda', counts, nGbyTS, nTSbyC, nbyTS, nbyG, y, L, index, gamma, beta, delta) } cG_CalcGibbsProbY_ori <- function(index, counts, nTSbyC, nbyTS, nGbyTS, nbyG, y, L, nG, lg_beta, lg_gamma, lg_delta, delta) { .Call('_celda_cG_CalcGibbsProbY_ori', PACKAGE = 'celda', index, counts, nTSbyC, nbyTS, nGbyTS, nbyG, y, L, nG, lg_beta, lg_gamma, lg_delta, delta) } cG_CalcGibbsProbY_fastRow <- function(index, counts, nTSbyC, nbyTS, nGbyTS, nbyG, y, L, nG, lg_beta, lg_gamma, lg_delta, delta) { .Call('_celda_cG_CalcGibbsProbY_fastRow', PACKAGE = 'celda', index, counts, nTSbyC, nbyTS, nGbyTS, nbyG, y, L, nG, lg_beta, lg_gamma, lg_delta, delta) } cG_CalcGibbsProbY <- function(index, counts, nTSbyC, nbyTS, nGbyTS, nbyG, y, L, nG, lg_beta, lg_gamma, lg_delta, delta) { .Call('_celda_cG_CalcGibbsProbY', PACKAGE = 'celda', index, counts, nTSbyC, nbyTS, nGbyTS, nbyG, y, L, nG, lg_beta, lg_gamma, lg_delta, delta) } #' Fast matrix multiplication for double x int #' #' @param A a double matrix #' @param B an integer matrix #' @return An integer matrix representing the product of A and B eigenMatMultInt <- function(A, B) { .Call('_celda_eigenMatMultInt', PACKAGE = 'celda', A, B) } #' Fast matrix multiplication for double x double #' #' @param A a double matrix #' @param B an integer matrix #' @return An integer matrix representing the product of A and B eigenMatMultNumeric <- function(A, B) { .Call('_celda_eigenMatMultNumeric', PACKAGE = 'celda', A, B) } #' Fast normalization for numeric matrix #' #' @param R_counts An integer matrix #' @param R_alpha A double value to be added to the matrix as a pseudocount #' @return A numeric matrix where the columns have been normalized to proportions fastNormProp <- function(R_counts, R_alpha) { .Call('_celda_fastNormProp', PACKAGE = 'celda', R_counts, R_alpha) } #' Fast normalization for numeric matrix #' #' @param R_counts An integer matrix #' @param R_alpha A double value to be added to the matrix as a pseudocount #' @return A numeric matrix where the columns have been normalized to proportions fastNormPropLog <- function(R_counts, R_alpha) { .Call('_celda_fastNormPropLog', PACKAGE = 'celda', R_counts, R_alpha) } #' Fast normalization for numeric matrix #' #' @param R_counts An integer matrix #' @param R_alpha A double value to be added to the matrix as a pseudocount #' @return A numeric matrix where the columns have been normalized to proportions fastNormPropSqrt <- function(R_counts, R_alpha) { .Call('_celda_fastNormPropSqrt', PACKAGE = 'celda', R_counts, R_alpha) } #' get row and column indices of none zero elements in the matrix #' #' @param R_counts A matrix #' @return An integer matrix where each row is a row, column indices pair nonzero <- function(R_counts) { .Call('_celda_nonzero', PACKAGE = 'celda', R_counts) } colSumByGroupSparse <- function(counts, group, K) { .Call('_celda_colSumByGroupSparse', PACKAGE = 'celda', counts, group, K) } rowSumByGroupSparse <- function(counts, group, L) { .Call('_celda_rowSumByGroupSparse', PACKAGE = 'celda', counts, group, L) } colSumByGroupChangeSparse <- function(counts, px, group, pgroup, K) { .Call('_celda_colSumByGroupChangeSparse', PACKAGE = 'celda', counts, px, group, pgroup, K) } rowSumByGroupChangeSparse <- function(counts, px, group, pgroup, L) { .Call('_celda_rowSumByGroupChangeSparse', PACKAGE = 'celda', counts, px, group, pgroup, L) } ================================================ FILE: R/aaa.R ================================================ setClass("celdaModel", slots = c(params = "list", # K, L, model priors, checksum names = "list", completeLogLik = "numeric", finalLogLik = "numeric", clusters = "list") ) # z and or y setClass("celda_C", representation(sampleLabel = "factor"), contains = "celdaModel") setClass("celda_G", contains = "celdaModel") setClass("celda_CG", contains = c("celda_C", "celda_G")) setClass("celdaList", slots = c(runParams = "data.frame", resList = "list", countChecksum = "character", perplexity = "matrix", celdaGridSearchParameters = "list") ) ================================================ FILE: R/accessors.R ================================================ #' @title Get or set the cell cluster labels from a celda #' \linkS4class{SingleCellExperiment} object or celda model #' object. #' @description Return or set the cell cluster labels determined #' by \link{celda_C} or \link{celda_CG} models. #' @param x Can be one of #' \itemize{ #' \item A \linkS4class{SingleCellExperiment} object returned by #' \link{celda_C}, or \link{celda_CG}, with the matrix #' located in the \code{useAssay} assay slot. The #' a \link{altExp} slot with name \code{altExpName} will #' be used. Rows represent features and columns represent cells. #' \item Celda model object.} #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param value Character vector of cell cluster labels for replacements. Works #' only if \code{x} is a \linkS4class{SingleCellExperiment} object. #' @return One of #' \itemize{ #' \item Character vector if \code{x} is a #' \linkS4class{SingleCellExperiment} object. #' Contains cell cluster labels for each cell in x. #' \item List if \code{x} is a celda model object. Contains cell cluster #' labels (for celda_C and celdaCG #' Models) and/or feature module labels (for celda_G and celdaCG Models).} #' @export setGeneric("celdaClusters", function(x, altExpName = "featureSubset") { standardGeneric("celdaClusters") }) #' @rdname celdaClusters #' @examples #' data(sceCeldaCG) #' celdaClusters(sceCeldaCG) #' @export setMethod("celdaClusters", signature(x = "SingleCellExperiment"), function(x, altExpName = "featureSubset") { altExp <- SingleCellExperiment::altExp(x, altExpName) return(SummarizedExperiment::colData(altExp)$celda_cell_cluster) }) #' @examples #' data(celdaCGMod) #' celdaClusters(celdaCGMod) #' @rdname celdaClusters #' @export setMethod("celdaClusters", signature(x = "celdaModel"), function(x) { return(x@clusters) } ) #' @rdname celdaClusters #' @export setGeneric("celdaClusters<-", function(x, altExpName = "featureSubset", value) { standardGeneric("celdaClusters<-") } ) #' @rdname celdaClusters #' @export setMethod("celdaClusters<-", signature(x = "SingleCellExperiment"), function(x, altExpName = "featureSubset", value) { altExp <- SingleCellExperiment::altExp(x, altExpName) if (!is.factor(value)) { message("Cluster labels are converted to factors.") value <- as.factor(value) } SummarizedExperiment::colData(altExp)$celda_cell_cluster <- value SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) }) #' @title Get or set the feature module labels from a celda #' \linkS4class{SingleCellExperiment} object. #' @description Return or set the feature module cluster labels determined #' by \link{celda_G} or \link{celda_CG} models. #' @param sce A \linkS4class{SingleCellExperiment} object returned by #' \link{celda_G}, or \link{celda_CG}, with the matrix #' located in the \code{useAssay} assay slot. #' Rows represent features and columns represent cells. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param value Character vector of feature module labels for replacements. #' Works only if \code{x} is a \linkS4class{SingleCellExperiment} object. #' @return Character vector. Contains feature module labels for each #' feature in x. #' @export setGeneric("celdaModules", function(sce, altExpName = "featureSubset") { standardGeneric("celdaModules") }) #' @rdname celdaModules #' @examples #' data(sceCeldaCG) #' celdaModules(sceCeldaCG) #' @export setMethod("celdaModules", signature(sce = "SingleCellExperiment"), function(sce, altExpName = "featureSubset") { altExp <- SingleCellExperiment::altExp(sce, altExpName) return(SummarizedExperiment::rowData(altExp)$celda_feature_module) }) #' @rdname celdaModules #' @export setGeneric("celdaModules<-", function(sce, altExpName = "featureSubset", value) { standardGeneric("celdaModules<-") } ) #' @rdname celdaModules #' @export setMethod("celdaModules<-", signature(sce = "SingleCellExperiment"), function(sce, altExpName = "featureSubset", value) { altExp <- SingleCellExperiment::altExp(sce, altExpName) if (!is.factor(value)) { message("Module labels are converted to factors.") value <- as.factor(value) } SummarizedExperiment::rowData(altExp)$celda_feature_module <- value SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) }) #' @title Get or set sample labels from a celda #' \linkS4class{SingleCellExperiment} object #' @description Return or set the sample labels for the cells in \code{sce}. #' @param x Can be one of #' \itemize{ #' \item A \linkS4class{SingleCellExperiment} object returned by #' \link{celda_C}, \link{celda_G}, or \link{celda_CG}, with the matrix #' located in the \code{useAssay} assay slot. #' Rows represent features and columns represent cells. #' \item A celda model object.} #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param value Character vector of sample labels for replacements. Works #' only is \code{x} is a \linkS4class{SingleCellExperiment} object. #' @return Character vector. Contains the sample labels provided at model #' creation, or those automatically generated by celda. #' @export setGeneric("sampleLabel", function(x, altExpName = "featureSubset") { standardGeneric("sampleLabel") }) #' @rdname sampleLabel #' @examples #' data(sceCeldaCG) #' sampleLabel(sceCeldaCG) #' @export setMethod("sampleLabel", signature(x = "SingleCellExperiment"), function(x, altExpName = "featureSubset") { altExp <- SingleCellExperiment::altExp(x, altExpName) return(SummarizedExperiment::colData(altExp)$celda_sample_label) }) #' @rdname sampleLabel #' @export setGeneric("sampleLabel<-", function(x, altExpName = "featureSubset", value) { standardGeneric("sampleLabel<-") } ) #' @rdname sampleLabel #' @export setMethod("sampleLabel<-", signature(x = "SingleCellExperiment"), function(x, altExpName = "featureSubset", value) { altExp <- SingleCellExperiment::altExp(x, altExpName) if (!is.factor(value)) { message("Sample labels are converted to factors.") value <- as.factor(value) } SummarizedExperiment::colData(altExp)$celda_sample_label <- value SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) }) #' @examples #' data(celdaCGMod) #' sampleLabel(celdaCGMod) #' @rdname sampleLabel #' @export setMethod("sampleLabel", signature(x = "celdaModel"), function(x) { x@sampleLabel } ) #' @title Get parameter values provided for celdaModel creation #' @description Retrieves the K/L, model priors (e.g. alpha, beta), #' and count matrix checksum parameters provided during the creation of the #' provided celdaModel. #' @param celdaMod celdaModel. Options available in #' \code{celda::availableModels}. #' @return List. Contains the model-specific parameters for the provided celda #' model object depending on its class. #' @export setGeneric( "params", function(celdaMod) { standardGeneric("params") } ) #' @rdname params #' @examples #' data(celdaCGMod) #' params(celdaCGMod) #' @export setMethod("params", signature = c(celdaMod = "celdaModel"), function(celdaMod) { celdaMod@params } ) #' @title Get feature, cell and sample names from a celdaModel #' @description Retrieves the row, column, and sample names used to generate #' a celdaModel. #' @param celdaMod celdaModel. Options available in `celda::availableModels`. #' @return List. Contains row, column, and sample character vectors #' corresponding to the values provided when the celdaModel was generated. #' @export setGeneric( "matrixNames", function(celdaMod) { standardGeneric("matrixNames") } ) #' @rdname matrixNames #' @examples #' data(celdaCGMod) #' matrixNames(celdaCGMod) #' @export setMethod("matrixNames", signature = c(celdaMod = "celdaModel"), function(celdaMod) { celdaMod@names } ) #' @title Get run parameters from a celda model #' \code{SingleCellExperiment} or \code{celdaList} object #' @description Returns details on the clustering parameters and model #' priors from the celdaList object when it was created. #' @param x An object of class \linkS4class{SingleCellExperiment} or class #' \code{celdaList}. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @return Data Frame. Contains details on the various K/L parameters, chain #' parameters, seed, and final log-likelihoods derived for each model in the #' provided celdaList. #' @export setGeneric("runParams", function(x, altExpName = "featureSubset") { standardGeneric("runParams") } ) #' @examples #' data(sceCeldaCGGridSearch) #' runParams(sceCeldaCGGridSearch) #' @rdname runParams #' @export setMethod("runParams", signature(x = "SingleCellExperiment"), function(x, altExpName = "featureSubset") { altExp <- SingleCellExperiment::altExp(x, altExpName) return(altExp@metadata$celda_grid_search@runParams) } ) #' @examples #' data(celdaCGGridSearchRes) #' runParams(celdaCGGridSearchRes) #' @rdname runParams #' @export setMethod("runParams", signature(x = "celdaList"), function(x) { return(x@runParams) } ) #' @title Get final celdaModels from a celda model \code{SCE} or celdaList #' object #' @description Returns all celda models generated during a #' \link{celdaGridSearch} run. #' @param x An object of class \linkS4class{SingleCellExperiment} or #' \code{celdaList}. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @return List. Contains one celdaModel object for each of the parameters #' specified in \code{runParams(x)}. #' @export setGeneric( "resList", function(x, altExpName = "featureSubset") { standardGeneric("resList") } ) #' @examples #' data(sceCeldaCGGridSearch) #' celdaCGGridModels <- resList(sceCeldaCGGridSearch) #' @rdname resList #' @export setMethod("resList", signature(x = "SingleCellExperiment"), function(x, altExpName = "featureSubset") { altExp <- SingleCellExperiment::altExp(x, altExpName) return(altExp@metadata$celda_grid_search@resList) } ) #' @examples #' data(celdaCGGridSearchRes) #' celdaCGGridModels <- resList(celdaCGGridSearchRes) #' @rdname resList #' @export setMethod("resList", signature(x = "celdaList"), function(x) { return(x@resList) } ) #' @title Get celda model from a celda #' \link[SingleCellExperiment]{SingleCellExperiment} object #' @description Return the celda model for \code{sce} returned by #' \link{celda_C}, \link{celda_G} or \link{celda_CG}. #' @param sce A \link[SingleCellExperiment]{SingleCellExperiment} object #' returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @return Character. The celda model. Can be one of "celda_C", "celda_G", or #' "celda_CG". #' @examples #' data(sceCeldaCG) #' celdaModel(sceCeldaCG) #' @export setGeneric("celdaModel", function(sce, altExpName = "featureSubset") { standardGeneric("celdaModel") }) #' @rdname celdaModel #' @export setMethod("celdaModel", signature(sce = "SingleCellExperiment"), function(sce, altExpName = "featureSubset") { if (!altExpName %in% SingleCellExperiment::altExpNames(sce)) { stop(altExpName, " not in 'altExpNames(sce)'. Run ", "selectFeatures(sce) first!") } altExp <- SingleCellExperiment::altExp(sce, altExpName) tryCatch( if (S4Vectors::metadata(altExp)$celda_parameters$model %in% c("celda_C", "celda_G", "celda_CG")) { return(S4Vectors::metadata(altExp)$celda_parameters$model) } else { stop("S4Vectors::metadata(altExp(sce,", " altExpName))$celda_parameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'") }, error = function(e) { message("S4Vectors::metadata(altExp(sce,", " altExpName))$celda_parameters$model must", " exist! Try running celda model (celda_C, celda_CG, or", " celda_G) first.") stop(e) }) }) #' @title Get perplexity for every model in a celdaList #' @description Returns perplexity for each model in a celdaList as calculated #' by `perplexity().` #' @param celdaList An object of class celdaList. #' @return List. Contains one celdaModel object for each of the parameters #' specified in the `runParams()` of the provided celda list. #' @examples #' data(celdaCGGridSearchRes) #' celdaCGGridModelPerplexities <- celdaPerplexity(celdaCGGridSearchRes) #' @export setGeneric( "celdaPerplexity", function(celdaList) { standardGeneric("celdaPerplexity") } ) #' @title Get perplexity for every model in a celdaList #' @description Returns perplexity for each model in a celdaList as calculated #' by `perplexity().` #' @param celdaList An object of class celdaList. #' @return List. Contains one celdaModel object for each of the parameters #' specified in the `runParams()` of the provided celda list. #' @examples #' data(celdaCGGridSearchRes) #' celdaCGGridModelPerplexities <- celdaPerplexity(celdaCGGridSearchRes) #' @export setMethod("celdaPerplexity", signature = c(celdaList = "celdaList"), function(celdaList) { celdaList@perplexity } ) #' @title Get the MD5 hash of the count matrix from the celdaList #' @description Returns the MD5 hash of the count matrix used to generate the #' celdaList. #' @param celdaList An object of class celdaList. #' @return A character string of length 32 containing the MD5 digest of #' the count matrix. #' @examples #' data(celdaCGGridSearchRes) #' countChecksum <- countChecksum(celdaCGGridSearchRes) #' @export setGeneric( "countChecksum", function(celdaList) { standardGeneric("countChecksum") } ) #' @title Get the MD5 hash of the count matrix from the celdaList #' @description Returns the MD5 hash of the count matrix used to generate the #' celdaList. #' @param celdaList An object of class celdaList. #' @return A character string of length 32 containing the MD5 digest of #' the count matrix. #' @examples #' data(celdaCGGridSearchRes) #' countChecksum <- countChecksum(celdaCGGridSearchRes) #' @export setMethod("countChecksum", signature = c(celdaList = "celdaList"), function(celdaList) { celdaList@countChecksum } ) ================================================ FILE: R/celdaGridSearch.R ================================================ #' @title Run Celda in parallel with multiple parameters #' @description Run Celda with different combinations of parameters and #' multiple chains in parallel. The variable \link{availableModels} contains #' the potential models that can be utilized. Different parameters to be tested #' should be stored in a list and passed to the argument \code{paramsTest}. #' Fixed parameters to be used in all models, such as \code{sampleLabel}, can #' be passed as a list to the argument \code{paramsFixed}. When #' \code{verbose = TRUE}, output from each chain will be sent to a log file #' but not be displayed in \code{stdout}. #' @param x A numeric \link{matrix} of counts or a #' \linkS4class{SingleCellExperiment} #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. #' @param useAssay A string specifying the name of the #' \link{assay} slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param model Celda model. Options available in \link{availableModels}. #' @param paramsTest List. A list denoting the combinations of parameters to #' run in a celda model. For example, #' \code{list(K = seq(5, 10), L = seq(15, 20))} #' will run all combinations of K from 5 to 10 and L from 15 to 20 in model #' \link{celda_CG}. #' @param paramsFixed List. A list denoting additional parameters to use in #' each celda model. Default NULL. #' @param maxIter Integer. Maximum number of iterations of sampling to #' perform. Default 200. #' @param nchains Integer. Number of random cluster initializations. Default 3. #' @param cores Integer. The number of cores to use for parallel estimation of #' chains. Default 1. #' @param bestOnly Logical. Whether to return only the chain with the highest #' log likelihood per combination of parameters or return all chains. Default #' TRUE. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. Seed values #' \code{seq(seed, (seed + nchains - 1))} will be supplied to each chain in #' \code{nchains}. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @param perplexity Logical. Whether to calculate perplexity for each model. #' If FALSE, then perplexity can be calculated later with #' \link{resamplePerplexity}. Default TRUE. #' @param verbose Logical. Whether to print log messages during celda chain #' execution. Default TRUE. #' @param logfilePrefix Character. Prefix for log files from worker threads #' and main process. Default "Celda". #' @return A \linkS4class{SingleCellExperiment} object. Function #' parameter settings and celda model results are stored in the #' \link{metadata} \code{"celda_grid_search"} slot. #' @seealso \link{celda_G} for feature clustering, \link{celda_C} for #' clustering of cells, and \link{celda_CG} for simultaneous clustering of #' features and cells. \link{subsetCeldaList} can subset the \code{celdaList} #' object. \link{selectBestModel} can get the best model for each combination #' of parameters. #' @import foreach #' @importFrom doParallel registerDoParallel #' @importFrom methods is #' @examples #' \dontrun{ #' data(celdaCGSim) #' ## Run various combinations of parameters with 'celdaGridSearch' #' celdaCGGridSearchRes <- celdaGridSearch(celdaCGSim$counts, #' model = "celda_CG", #' paramsTest = list(K = seq(4, 6), L = seq(9, 11)), #' paramsFixed = list(sampleLabel = celdaCGSim$sampleLabel), #' bestOnly = TRUE, #' nchains = 1, #' cores = 1) #' } #' @export setGeneric("celdaGridSearch", function( x, useAssay = "counts", altExpName = "featureSubset", model, paramsTest, paramsFixed = NULL, maxIter = 200, nchains = 3, cores = 1, bestOnly = TRUE, seed = 12345, perplexity = TRUE, verbose = TRUE, logfilePrefix = "Celda") { standardGeneric("celdaGridSearch")}) #' @rdname celdaGridSearch #' @export setMethod("celdaGridSearch", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", model, paramsTest, paramsFixed = NULL, maxIter = 200, nchains = 3, cores = 1, bestOnly = TRUE, seed = 12345, perplexity = TRUE, verbose = TRUE, logfilePrefix = "Celda") { xClass <- "SingleCellExperiment" if (!altExpName %in% SingleCellExperiment::altExpNames(x)) { stop(altExpName, " not in 'altExpNames(x)'. Run ", "selectFeatures(x) first!") } altExp <- SingleCellExperiment::altExp(x, altExpName) if (!useAssay %in% SummarizedExperiment::assayNames(altExp)) { stop(useAssay, " not in assayNames(altExp(x, altExpName))") } counts <- SummarizedExperiment::assay(altExp, i = useAssay) celdaList <- .celdaGridSearch(counts = counts, model = paste0(".", model), paramsTest = paramsTest, paramsFixed = paramsFixed, maxIter = maxIter, nchains = nchains, cores = cores, bestOnly = bestOnly, seed = seed, perplexity = perplexity, verbose = verbose, logfilePrefix = logfilePrefix) altExp <- .createSCEceldaGridSearch(celdaList = celdaList, sce = altExp, xClass = xClass, useAssay = useAssay, model = model, paramsTest = paramsTest, paramsFixed = paramsFixed, maxIter = maxIter, seed = seed, nchains = nchains, cores = cores, bestOnly = bestOnly, perplexity = perplexity, verbose = verbose, logfilePrefix = logfilePrefix) SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) }) #' @rdname celdaGridSearch #' @export setMethod("celdaGridSearch", signature(x = "matrix"), function(x, useAssay = "counts", altExpName = "featureSubset", model, paramsTest, paramsFixed = NULL, maxIter = 200, nchains = 3, cores = 1, bestOnly = TRUE, seed = 12345, perplexity = TRUE, verbose = TRUE, logfilePrefix = "Celda") { ls <- list() ls[[useAssay]] <- x sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) SingleCellExperiment::altExp(sce, altExpName) <- sce xClass <- "matrix" celdaList <- .celdaGridSearch(counts = x, model = paste0(".", model), paramsTest = paramsTest, paramsFixed = paramsFixed, maxIter = maxIter, nchains = nchains, cores = cores, bestOnly = bestOnly, seed = seed, perplexity = perplexity, verbose = verbose, logfilePrefix = logfilePrefix) altExp <- .createSCEceldaGridSearch(celdaList = celdaList, sce = SingleCellExperiment::altExp(sce, altExpName), xClass = xClass, useAssay = useAssay, model = model, paramsTest = paramsTest, paramsFixed = paramsFixed, maxIter = maxIter, seed = seed, nchains = nchains, cores = cores, bestOnly = bestOnly, perplexity = perplexity, verbose = verbose, logfilePrefix = logfilePrefix) SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) }) .celdaGridSearch <- function(counts, model, paramsTest, paramsFixed, maxIter, nchains, cores, bestOnly, seed, perplexity, verbose, logfilePrefix) { ## Check parameters .validateCounts(counts) modelParams <- as.list(formals(model)) if (!all(names(paramsTest) %in% names(modelParams))) { badParams <- setdiff(names(paramsTest), names(modelParams)) stop( "The following elements in 'paramsTest' are not arguments of '", substring(model, 2), "': ", paste(badParams, collapse = ",") ) } if (!is.null(paramsFixed) && !all(names(paramsFixed) %in% names(modelParams))) { badParams <- setdiff(names(paramsFixed), names(modelParams)) stop( "The following elements in 'paramsFixed' are not arguments", " of '", substring(model, 2), "': ", paste(badParams, collapse = ",") ) } modelParamsRequired <- setdiff( names(modelParams[modelParams == ""]), "counts" ) if (!all(modelParamsRequired %in% c( names(paramsTest), names(paramsFixed) ))) { missing.params <- setdiff( modelParamsRequired, c(names(paramsTest), names(paramsFixed)) ) stop( "The following arguments are not in 'paramsTest' or 'paramsFixed'", " but are required for '", substring(model, 2), "': ", paste(missing.params, collapse = ",") ) } if (any(c("z.init", "y.init", "sampleLabel") %in% names(paramsTest))) { stop( "Setting parameters such as 'z.init', 'y.init', and 'sampleLabel'", " in 'paramsTest' is not currently supported." ) } if (any(c("nchains") %in% names(paramsTest))) { warning( "Parameter 'nchains' should not be used within the paramsTest", " list" ) paramsTest[["nchains"]] <- NULL } # Pre-generate a set of random seeds to be used for each chain if (is.null(seed)) { allSeeds <- NULL } else { allSeeds <- seq(seed, (seed + nchains - 1)) } # Set up parameter combinations for each individual chain runParams <- base::expand.grid(c( chain = list(seq_len(nchains)), paramsTest )) runParams <- cbind(index = seq_len(nrow(runParams)), runParams) if (is.null(allSeeds)) { runParams <- cbind(runParams, seed = rep("NULL", nrow(runParams))) } else { runParams <- cbind(runParams, seed = rep(allSeeds, nrow(runParams) / nchains)) } .logMessages(paste(rep("-", 50), collapse = ""), logfile = NULL, append = FALSE, verbose = verbose ) .logMessages("Starting celdaGridSearch with", substring(model, 2), logfile = NULL, append = TRUE, verbose = verbose ) .logMessages("Number of cores:", cores, logfile = NULL, append = TRUE, verbose = verbose ) .logMessages(paste(rep("-", 50), collapse = ""), logfile = NULL, append = TRUE, verbose = verbose ) startTime <- Sys.time() # An MD5 checksum of the count matrix. Passed to models so # later on, we can check on celda_* model objects which # count matrix was used. counts <- .processCounts(counts) countChecksum <- .createCountChecksum(counts) ## Use DoParallel to loop through each combination of parameters cl <- parallel::makeCluster(cores) doParallel::registerDoParallel(cl) i <- NULL # Setting visible binding for R CMD CHECK resList <- foreach( i = seq_len(nrow(runParams)), .export = model, .combine = c, .multicombine = TRUE ) %dopar% { ## Set up chain parameter list current.run <- c(runParams[i, ]) chainParams <- list() for (j in names(paramsTest)) { chainParams[[j]] <- current.run[[j]] } chainParams$counts <- counts chainParams$maxIter <- maxIter chainParams$nchain <- 1 chainParams$countChecksum <- countChecksum chainParams$verbose <- verbose chainParams$logfile <- paste0( logfilePrefix, "_", paste(paste( colnames(runParams), runParams[i, ], sep = "-" ), collapse = "_"), "_Seed-", ifelse(is.null(chainParams$seed), "NULL", chainParams$seed), "_log.txt" ) ## Run model if (is.null(seed)) { res <- do.call(model, c(chainParams, paramsFixed)) } else { chainSeed <- allSeeds[ifelse(i %% nchains == 0, nchains, i %% nchains)] res <- with_seed(chainSeed, do.call(model, c(chainParams, paramsFixed))) } return(list(res)) } parallel::stopCluster(cl) logliks <- vapply(resList, function(mod) { bestLogLikelihood(mod) }, double(1)) runParams <- cbind(runParams, logLikelihood = logliks) celdaRes <- methods::new( "celdaList", runParams = runParams, resList = resList, countChecksum = countChecksum ) if (isTRUE(bestOnly)) { celdaRes <- selectBestModel(celdaRes, asList = TRUE) } if (isTRUE(perplexity)) { .logMessages( date(), ".. Calculating perplexity", append = TRUE, verbose = verbose, logfile = NULL ) celdaRes <- resamplePerplexity(counts, celdaRes, seed = seed) } endTime <- Sys.time() .logMessages(paste(rep("-", 50), collapse = ""), logfile = NULL, append = TRUE, verbose = verbose ) .logMessages("Completed celdaGridSearch. Total time:", format(difftime(endTime, startTime)), logfile = NULL, append = TRUE, verbose = verbose ) .logMessages(paste(rep("-", 50), collapse = ""), logfile = NULL, append = TRUE, verbose = verbose ) return(celdaRes) } #' @title Subset celda model from SCE object returned from #' \code{celdaGridSearch} #' @description Select a subset of models from a #' \linkS4class{SingleCellExperiment} object generated by #' \link{celdaGridSearch} that match the criteria in the argument #' \code{params}. #' @param x Can be one of #' \itemize{ #' \item A \linkS4class{SingleCellExperiment} object returned from #' \code{celdaGridSearch}, \code{recursiveSplitModule}, #' or \code{recursiveSplitCell}. Must contain a list named #' \code{"celda_grid_search"} in \code{metadata(x)}. #' \item celdaList object.} #' @param params List. List of parameters used to subset the matching celda #' models in list \code{"celda_grid_search"} in \code{metadata(x)}. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @return One of #' \itemize{ #' \item A new \linkS4class{SingleCellExperiment} object containing #' all models matching the #' provided criteria in \code{params}. If only one celda model result in the #' \code{"celda_grid_search"} slot in \code{metadata(x)} matches #' the given criteria, a new \linkS4class{SingleCellExperiment} object #' with the matching model stored in the #' \link{metadata} #' \code{"celda_parameters"} slot will be returned. Otherwise, a new #' \linkS4class{SingleCellExperiment} object with the subset models stored #' in the \link{metadata} #' \code{"celda_grid_search"} slot will be returned. #' \item A new \code{celdaList} object containing all models matching the #' provided criteria in \code{params}. If only one item in the #' \code{celdaList} matches the given criteria, the matching model will be #' returned directly instead of a \code{celdaList} object.} #' @seealso \link{celdaGridSearch} can run Celda with multiple parameters and #' chains in parallel. \link{selectBestModel} can get the best model for each #' combination of parameters. #' @export setGeneric("subsetCeldaList", function(x, params, altExpName = "featureSubset") { standardGeneric("subsetCeldaList")}) #' @rdname subsetCeldaList #' @examples #' data(sceCeldaCGGridSearch) #' sceK5L10 <- subsetCeldaList(sceCeldaCGGridSearch, #' params = list(K = 5, L = 10)) #' @export setMethod("subsetCeldaList", signature(x = "SingleCellExperiment"), function(x, params, altExpName = "featureSubset") { ## Check for bad parameter names if (!all(names(params) %in% colnames(runParams(x, altExpName = altExpName)))) { badParams <- setdiff(names(params), colnames(runParams(x, altExpName = altExpName))) stop("The following elements in 'params' are not columns in", " runParams(x, altExpName = altExpName) ", paste(badParams, collapse = ",") ) } ## Subset 'runParams' based on items in 'params' newRunParams <- runParams(x, altExpName = altExpName) for (i in names(params)) { newRunParams <- subset(newRunParams, newRunParams[, i] %in% params[[i]]) if (nrow(newRunParams) == 0) { stop("No runs matched the criteria given in 'params'. Check", " 'runParams(x, altExpName = altExpName)' for complete", " list of parameters used", " to generate 'x'.") } } ## Get index of selected models, subset celdaList, and return ix <- match(newRunParams$index, runParams(x, altExpName = altExpName)$index) altExp <- SingleCellExperiment::altExp(x, altExpName) if (length(ix) == 1) { altExp <- .subsetCeldaListSCE(altExp, ix) } else { altExp@metadata$celda_grid_search@runParams <- as.data.frame(newRunParams) altExp@metadata$celda_grid_search@resList <- altExp@metadata$celda_grid_search@resList[ix] } SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) } ) #' @rdname subsetCeldaList #' @examples #' data(celdaCGGridSearchRes) #' resK5L10 <- subsetCeldaList(celdaCGGridSearchRes, #' params = list(K = 5, L = 10)) #' @export setMethod("subsetCeldaList", signature(x = "celdaList"), function(x, params) { ## Check for bad parameter names if (!all(names(params) %in% colnames(runParams(x)))) { badParams <- setdiff(names(params), colnames(runParams(x))) stop("The following elements in 'params' are not columns in", " runParams (x) ", paste(badParams, collapse = ",") ) } ## Subset 'runParams' based on items in 'params' newRunParams <- runParams(x) for (i in names(params)) { newRunParams <- subset(newRunParams, newRunParams[, i] %in% params[[i]]) if (nrow(newRunParams) == 0) { stop("No runs matched the criteria given in 'params'. Check", " 'runParams(x)' for complete list of parameters used", " to generate 'x'.") } } ## Get index of selected models, subset celdaList, and return ix <- match(newRunParams$index, runParams(x)$index) if (length(ix) == 1) { return(resList(x)[[ix]]) } else { x@runParams <- as.data.frame(newRunParams) x@resList <- resList(x)[ix] return(x) } } ) #' @title Select best chain within each combination of parameters #' @description Select the chain with the best log likelihood for each #' combination of tested parameters from a \code{SCE} object gererated by #' \link{celdaGridSearch} or from a \code{celdaList} object. #' @param x Can be one of #' \itemize{ #' \item A \linkS4class{SingleCellExperiment} object returned from #' \code{celdaGridSearch}, \code{recursiveSplitModule}, #' or \code{recursiveSplitCell}. Must contain a list named #' \code{"celda_grid_search"} in \code{metadata(x)}. #' \item celdaList object.} #' @param asList \code{TRUE} or \code{FALSE}. Whether to return the #' best model as a #' \code{celdaList} object or not. If \code{FALSE}, return the best model as a #' corresponding celda model object. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @return One of #' \itemize{ #' \item A new \linkS4class{SingleCellExperiment} object containing #' one model with the best log-likelihood for each set of parameters in #' \code{metadata(x)}. If there is only one set of parameters, #' a new \linkS4class{SingleCellExperiment} object #' with the matching model stored in the #' \link{metadata} #' \code{"celda_parameters"} slot will be returned. Otherwise, a new #' \linkS4class{SingleCellExperiment} object with the subset models stored #' in the \link{metadata} #' \code{"celda_grid_search"} slot will be returned. #' \item A new \code{celdaList} object containing one model with the best #' log-likelihood for each set of parameters. If only one set of parameters #' is in the \code{celdaList}, the best model will be returned directly #' instead of a \code{celdaList} object.} #' @seealso \link{celdaGridSearch} \link{subsetCeldaList} #' @export setGeneric("selectBestModel", function(x, asList = FALSE, altExpName = "featureSubset") { standardGeneric("selectBestModel")}) #' @rdname selectBestModel #' @examples #' data(sceCeldaCGGridSearch) #' ## Returns same result as running celdaGridSearch with "bestOnly = TRUE" #' sce <- selectBestModel(sceCeldaCGGridSearch) #' @importFrom data.table as.data.table #' @export setMethod("selectBestModel", signature(x = "SingleCellExperiment"), function(x, asList = FALSE, altExpName = "featureSubset") { altExp <- SingleCellExperiment::altExp(x, altExpName) logLikelihood <- NULL group <- setdiff(colnames(runParams(x, altExpName = altExpName)), c("index", "chain", "logLikelihood", "mean_perplexity", "seed")) runParams <- S4Vectors::metadata(altExp)$celda_grid_search@runParams dt <- data.table::as.data.table(runParams) .SD <- NULL # fix check note newRunParams <- as.data.frame(dt[, .SD[which.max(logLikelihood)], by = group]) newRunParams <- newRunParams[, colnames(runParams)] ix <- match(newRunParams$index, runParams$index) if (nrow(newRunParams) == 1 & !asList) { altExp <- .subsetCeldaListSCE(altExp, ix) } else { altExp@metadata$celda_grid_search@runParams <- as.data.frame(newRunParams) altExp@metadata$celda_grid_search@resList <- altExp@metadata$celda_grid_search@resList[ix] } SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) } ) #' @rdname selectBestModel #' @examples #' data(celdaCGGridSearchRes) #' ## Returns same result as running celdaGridSearch with "bestOnly = TRUE" #' cgsBest <- selectBestModel(celdaCGGridSearchRes) #' @importFrom data.table as.data.table #' @export setMethod("selectBestModel", signature(x = "celdaList"), function(x, asList = FALSE) { logLikelihood <- NULL group <- setdiff(colnames(runParams(x)), c("index", "chain", "logLikelihood", "mean_perplexity", "seed")) dt <- data.table::as.data.table(runParams(x)) .SD <- NULL # fix check note newRunParams <- as.data.frame(dt[, .SD[which.max(logLikelihood)], by = group]) newRunParams <- newRunParams[, colnames(runParams(x))] ix <- match(newRunParams$index, runParams(x)$index) if (nrow(newRunParams) == 1 & !asList) { return(resList(x)[[ix]]) } else { x@runParams <- as.data.frame(newRunParams) x@resList <- resList(x)[ix] return(x) } } ) .createSCEceldaGridSearch <- function(celdaList, sce, xClass, useAssay, model, paramsTest, paramsFixed, maxIter, seed, nchains, cores, bestOnly, perplexity, verbose, logfilePrefix) { S4Vectors::metadata(sce)[["celda_grid_search"]] <- celdaList S4Vectors::metadata(sce)$celda_grid_search@celdaGridSearchParameters <- list(xClass = xClass, useAssay = useAssay, model = model, paramsTest = paramsTest, paramsFixed = paramsFixed, maxIter = maxIter, seed = seed, nchains = nchains, cores = cores, bestOnly = bestOnly, perplexity = perplexity, verbose = verbose, logfilePrefix = logfilePrefix) return(sce) } .subsetCeldaListSCE <- function(x, ix) { cgsparam <- x@metadata$celda_grid_search@celdaGridSearchParameters if (cgsparam$model == "celda_C") { x <- .createSCEceldaC(celdaCMod = x@metadata$celda_grid_search@resList[[ix]], sce = x, xClass = cgsparam$xClass, useAssay = cgsparam$useAssay, algorithm = cgsparam$algorithm, stopIter = cgsparam$stopIter, maxIter = cgsparam$maxIter, splitOnIter = cgsparam$splitOnIter, splitOnLast = cgsparam$splitOnLast, nchains = cgsparam$nchains, zInitialize = cgsparam[["zInitialize"]], zInit = cgsparam[["zInit"]], logfile = cgsparam$logfile, verbose = cgsparam$verbose) } else if (cgsparam$model == "celda_G") { x <- .createSCEceldaG(celdaGMod = x@metadata$celda_grid_search@resList[[ix]], sce = x, xClass = cgsparam$xClass, useAssay = cgsparam$useAssay, stopIter = cgsparam$stopIter, maxIter = cgsparam$maxIter, splitOnIter = cgsparam$splitOnIter, splitOnLast = cgsparam$splitOnLast, nchains = cgsparam$nchains, yInitialize = cgsparam[["yInitialize"]], yInit = cgsparam[["yInit"]], logfile = cgsparam$logfile, verbose = cgsparam$verbose) } else if (cgsparam$model == "celda_CG") { x <- .createSCEceldaCG(celdaCGMod = x@metadata$celda_grid_search@resList[[ix]], sce = x, xClass = cgsparam$xClass, useAssay = cgsparam$useAssay, algorithm = cgsparam$algorithm, stopIter = cgsparam$stopIter, maxIter = cgsparam$maxIter, splitOnIter = cgsparam$splitOnIter, splitOnLast = cgsparam$splitOnLast, nchains = cgsparam$nchains, zInitialize = cgsparam[["zInitialize"]], yInitialize = cgsparam[["yInitialize"]], zInit = cgsparam[["zInit"]], yInit = cgsparam[["yInit"]], logfile = cgsparam$logfile, verbose = cgsparam$verbose) } else { stop("S4Vectors::metadata(altExp(x, altExpName))$celda_grid_search@", "celdaGridSearchParameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'") } return(x) } ================================================ FILE: R/celdaProbabilityMap.R ================================================ #' @title Probability map for a celda model #' @description Renders probability and relative expression heatmaps to #' visualize the relationship between features and cell populations (or cell #' populations and samples). #' @param sce A \link[SingleCellExperiment]{SingleCellExperiment} object #' returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}. #' @param useAssay A string specifying which \link{assay} #' slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param level Character. One of "cellPopulation" or "Sample". #' "cellPopulation" will display the absolute probabilities and relative #' normalized expression of each module in each cell population. #' \strong{\code{level = "cellPopulation"} only works for celda_CG \code{sce} #' objects}. "sample" will display the absolute probabilities and relative #' normalized abundance of each cell population in each sample. Default #' "cellPopulation". #' @param ncols The number of colors (>1) to be in the color palette of #' the absolute probability heatmap. #' @param col2 Passed to \code{col} argument of \link[ComplexHeatmap]{Heatmap}. #' Set color boundaries and colors for the relative expression heatmap. #' @param title1 Passed to \code{column_title} argument of #' \link[ComplexHeatmap]{Heatmap}. Figure title for the absolute probability #' heatmap. #' @param title2 Passed to \code{column_title} argument of #' \link[ComplexHeatmap]{Heatmap}. Figure title for the relative expression #' heatmap. #' @param showColumnNames Passed to \code{show_column_names} argument of #' \link[ComplexHeatmap]{Heatmap}. Show column names. #' @param showRowNames Passed to \code{show_row_names} argument of #' \link[ComplexHeatmap]{Heatmap}. Show row names. #' @param rowNamesgp Passed to \code{row_names_gp} argument of #' \link[ComplexHeatmap]{Heatmap}. Set row name font. #' @param colNamesgp Passed to \code{column_names_gp} argument of #' \link[ComplexHeatmap]{Heatmap}. Set column name font. #' @param clusterRows Passed to \code{cluster_rows} argument of #' \link[ComplexHeatmap]{Heatmap}. Cluster rows. #' @param clusterColumns Passed to \code{cluster_columns} argument of #' \link[ComplexHeatmap]{Heatmap}. Cluster columns. #' @param showHeatmapLegend Passed to \code{show_heatmap_legend} argument of #' \link[ComplexHeatmap]{Heatmap}. Show heatmap legend. #' @param heatmapLegendParam Passed to \code{heatmap_legend_param} argument of #' \link[ComplexHeatmap]{Heatmap}. Heatmap legend parameters. #' @param ... Additional parameters passed to \link[ComplexHeatmap]{Heatmap}. #' @seealso \link{celda_C} for clustering cells. \link{celda_CG} for #' clustering features and cells #' @return A \link[ComplexHeatmap]{HeatmapList} object containing 2 #' \link[ComplexHeatmap]{Heatmap-class} objects #' @export setGeneric("celdaProbabilityMap", function(sce, useAssay = "counts", altExpName = "featureSubset", level = c("cellPopulation", "sample"), ncols = 100, col2 = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")), title1 = "Absolute probability", title2 = "Relative expression", showColumnNames = TRUE, showRowNames = TRUE, rowNamesgp = grid::gpar(fontsize = 8), colNamesgp = grid::gpar(fontsize = 12), clusterRows = FALSE, clusterColumns = FALSE, showHeatmapLegend = TRUE, heatmapLegendParam = list(title = NULL, legend_height = grid::unit(6, "cm")), ...) { standardGeneric("celdaProbabilityMap") }) #' @rdname celdaProbabilityMap #' @importFrom RColorBrewer brewer.pal #' @importFrom grDevices colorRampPalette #' @examples #' data(sceCeldaCG) #' celdaProbabilityMap(sceCeldaCG) #' @export setMethod("celdaProbabilityMap", signature(sce = "SingleCellExperiment"), function(sce, useAssay = "counts", altExpName = "featureSubset", level = c("cellPopulation", "sample"), ncols = 100, col2 = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")), title1 = "Absolute probability", title2 = "Relative expression", showColumnNames = TRUE, showRowNames = TRUE, rowNamesgp = grid::gpar(fontsize = 8), colNamesgp = grid::gpar(fontsize = 12), clusterRows = FALSE, clusterColumns = FALSE, showHeatmapLegend = TRUE, heatmapLegendParam = list(title = NULL, legend_height = grid::unit(6, "cm")), ...) { level <- match.arg(level) if (celdaModel(sce, altExpName = altExpName) == "celda_C") { if (level == "cellPopulation") { warning("'level' has been set to 'sample'") } pm <- .celdaProbabilityMapC(sce = sce, useAssay = useAssay, altExpName = altExpName, level = "sample", ncols = ncols, col2 = col2, title1 = title1, title2 = title2, showColumnNames = showColumnNames, showRowNames = showRowNames, rowNamesgp = rowNamesgp, colNamesgp = colNamesgp, clusterRows = clusterRows, clusterColumns = clusterColumns, showHeatmapLegend = showHeatmapLegend, heatmapLegendParam = heatmapLegendParam, ...) } else if (celdaModel(sce, altExpName = altExpName) == "celda_CG") { pm <- .celdaProbabilityMapCG(sce = sce, useAssay = useAssay, altExpName = altExpName, level = level, ncols = ncols, col2 = col2, title1 = title1, title2 = title2, showColumnNames = showColumnNames, showRowNames = showRowNames, rowNamesgp = rowNamesgp, colNamesgp = colNamesgp, clusterRows = clusterRows, clusterColumns = clusterColumns, showHeatmapLegend = showHeatmapLegend, heatmapLegendParam = heatmapLegendParam, ...) } else { stop("S4Vectors::metadata(altExp(sce,", " altExpName))$celda_parameters$model must be", " one of 'celda_C', or 'celda_CG'!") } return(pm) } ) .celdaProbabilityMapC <- function(sce, useAssay, altExpName, level, ncols, col2, title1, title2, showColumnNames, showRowNames, rowNamesgp, colNamesgp, clusterRows, clusterColumns, showHeatmapLegend, heatmapLegendParam, ...) { altExp <- SingleCellExperiment::altExp(sce, altExpName) zInclude <- which(tabulate(SummarizedExperiment::colData( altExp)$celda_cell_cluster, S4Vectors::metadata(altExp)$celda_parameters$K) > 0) factorized <- factorizeMatrix(x = sce, useAssay = useAssay, type = "proportion") samp <- factorized$proportions$sample[zInclude, , drop = FALSE] col1 <- grDevices::colorRampPalette(c("white", "blue", "midnightblue", "springgreen4", "yellowgreen", "yellow", "orange", "red"))(100) breaks <- seq(0, 1, length.out = length(col1)) g1 <- ComplexHeatmap::Heatmap(matrix = samp, col = circlize::colorRamp2(breaks, col1), column_title = title1, show_column_names = showColumnNames, show_row_names = showRowNames, row_names_gp = rowNamesgp, column_names_gp = colNamesgp, cluster_rows = clusterRows, cluster_columns = clusterColumns, show_heatmap_legend = showHeatmapLegend, heatmap_legend_param = heatmapLegendParam, ...) if (ncol(samp) > 1) { sampNorm <- normalizeCounts(samp, normalize = "proportion", transformationFun = sqrt, scaleFun = base::scale) g2 <- ComplexHeatmap::Heatmap(matrix = sampNorm, col = col2, column_title = title2, show_column_names = showColumnNames, show_row_names = showRowNames, row_names_gp = rowNamesgp, column_names_gp = colNamesgp, cluster_rows = clusterRows, cluster_columns = clusterColumns, show_heatmap_legend = showHeatmapLegend, heatmap_legend_param = heatmapLegendParam, ...) return(g1 + g2) } else { return(g1) } } .celdaProbabilityMapCG <- function(sce, useAssay, altExpName, level, ncols, col2, title1, title2, showColumnNames, showRowNames, rowNamesgp, colNamesgp, clusterRows, clusterColumns, showHeatmapLegend, heatmapLegendParam, ...) { altExp <- SingleCellExperiment::altExp(sce, altExpName) factorized <- factorizeMatrix(x = sce, useAssay = useAssay, altExpName = altExpName, type = c("counts", "proportion")) zInclude <- which(tabulate(SummarizedExperiment::colData( altExp)$celda_cell_cluster, S4Vectors::metadata(altExp)$celda_parameters$K) > 0) yInclude <- which(tabulate(SummarizedExperiment::rowData( altExp)$celda_feature_module, S4Vectors::metadata(altExp)$celda_parameters$L) > 0) if (level == "cellPopulation") { pop <- factorized$proportions$cellPopulation[yInclude, zInclude, drop = FALSE] popNorm <- normalizeCounts(pop, normalize = "proportion", transformationFun = sqrt, scaleFun = base::scale) percentile9 <- round(stats::quantile(pop, .9), digits = 2) * 100 cols11 <- grDevices::colorRampPalette(c("white", RColorBrewer::brewer.pal(n = 9, name = "Blues")))(percentile9) cols12 <- grDevices::colorRampPalette(c("midnightblue", c("springgreen4", "Yellowgreen", "Yellow", "Orange", "Red")))(ncols - percentile9) col1 <- c(cols11, cols12) breaks <- seq(0, 1, length.out = length(col1)) g1 <- ComplexHeatmap::Heatmap(matrix = pop, col = circlize::colorRamp2(breaks, col1), column_title = title1, show_column_names = showColumnNames, show_row_names = showRowNames, row_names_gp = rowNamesgp, column_names_gp = colNamesgp, cluster_rows = clusterRows, cluster_columns = clusterColumns, show_heatmap_legend = showHeatmapLegend, heatmap_legend_param = heatmapLegendParam, ...) g2 <- ComplexHeatmap::Heatmap(matrix = popNorm, col = col2, column_title = title2, show_column_names = showColumnNames, show_row_names = showRowNames, row_names_gp = rowNamesgp, column_names_gp = colNamesgp, cluster_rows = clusterRows, cluster_columns = clusterColumns, show_heatmap_legend = showHeatmapLegend, heatmap_legend_param = heatmapLegendParam, ...) return(g1 + g2) } else { samp <- factorized$proportions$sample col1 <- grDevices::colorRampPalette(c( "white", "blue", "#08306B", "#006D2C", "yellowgreen", "yellow", "orange", "red" ))(100) breaks <- seq(0, 1, length.out = length(col1)) g1 <- ComplexHeatmap::Heatmap(matrix = samp, col = circlize::colorRamp2(breaks, col1), column_title = title1, show_column_names = showColumnNames, show_row_names = showRowNames, row_names_gp = rowNamesgp, column_names_gp = colNamesgp, cluster_rows = clusterRows, cluster_columns = clusterColumns, show_heatmap_legend = showHeatmapLegend, heatmap_legend_param = heatmapLegendParam, ...) if (ncol(samp) > 1) { sampNorm <- normalizeCounts(factorized$counts$sample, normalize = "proportion", transformationFun = sqrt, scaleFun = base::scale) g2 <- ComplexHeatmap::Heatmap(matrix = sampNorm, col = col2, column_title = title2, show_column_names = showColumnNames, show_row_names = showRowNames, row_names_gp = rowNamesgp, column_names_gp = colNamesgp, cluster_rows = clusterRows, cluster_columns = clusterColumns, show_heatmap_legend = showHeatmapLegend, heatmap_legend_param = heatmapLegendParam, ...) return(g1 + g2) } else { return(g1 + g2) } } } ================================================ FILE: R/celdaUMAP.R ================================================ #' @title Uniform Manifold Approximation and Projection (UMAP) dimension #' reduction for celda \code{sce} object #' @description Embeds cells in two dimensions using \link[uwot]{umap} based on #' a celda model. For celda_C \code{sce} objects, PCA on the normalized counts #' is used to reduce the number of features before applying UMAP. For celda_CG #' \code{sce} object, UMAP is run on module probabilities to reduce the number #' of features instead of using PCA. Module probabilities are square-root #' transformed before applying UMAP. #' @param sce A \link[SingleCellExperiment]{SingleCellExperiment} object #' returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}. #' @param useAssay A string specifying which \link{assay} #' slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param maxCells Integer. Maximum number of cells to plot. Cells will be #' randomly subsampled if \code{ncol(sce) > maxCells}. Larger numbers of cells #' requires more memory. If NULL, no subsampling will be performed. #' Default NULL. #' @param minClusterSize Integer. Do not subsample cell clusters below this #' threshold. Default 100. #' @param modules Integer vector. Determines which features modules to use for #' UMAP. If NULL, all modules will be used. Default NULL. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @param nNeighbors The size of local neighborhood used for #' manifold approximation. Larger values result in more global #' views of the manifold, while smaller values result in more #' local data being preserved. Default 30. #' See \link[uwot]{umap} for more information. #' @param minDist The effective minimum distance between embedded points. #' Smaller values will result in a more clustered/clumped #' embedding where nearby points on the manifold are drawn #' closer together, while larger values will result on a more #' even dispersal of points. Default 0.75. #' See \link[uwot]{umap} for more information. #' @param spread The effective scale of embedded points. In combination with #' \code{min_dist}, this determines how clustered/clumped the #' embedded points are. Default 1. See \link[uwot]{umap} for more information. #' @param pca Logical. Whether to perform #' dimensionality reduction with PCA before UMAP. Only works for celda_C #' \code{sce} objects. #' @param initialDims Integer. Number of dimensions from PCA to use as #' input in UMAP. Default 50. Only works for celda_C \code{sce} objects. #' @param normalize Character. Passed to \link{normalizeCounts} in #' normalization step. Divides counts by the library sizes for each #' cell. One of 'proportion', 'cpm', 'median', or 'mean'. 'proportion' uses #' the total counts for each cell as the library size. 'cpm' divides the #' library size of each cell by one million to produce counts per million. #' 'median' divides the library size of each cell by the median library size #' across all cells. 'mean' divides the library size of each cell by the mean #' library size across all cells. #' @param scaleFactor Numeric. Sets the scale factor for cell-level #' normalization. This scale factor is multiplied to each cell after the #' library size of each cell had been adjusted in \code{normalize}. Default #' \code{NULL} which means no scale factor is applied. #' @param transformationFun Function. Applys a transformation such as 'sqrt', #' 'log', 'log2', 'log10', or 'log1p'. If \code{NULL}, no transformation will #' be applied. Occurs after applying normalization and scale factor. Default #' \code{NULL}. #' @param cores Number of threads to use. Default 1. #' @param ... Additional parameters to pass to \link[uwot]{umap}. #' @return \code{sce} with UMAP coordinates #' (columns "celda_UMAP1" & "celda_UMAP2") added to #' \code{\link{reducedDim}(sce, "celda_UMAP")}. #' @export setGeneric("celdaUmap", function(sce, useAssay = "counts", altExpName = "featureSubset", maxCells = NULL, minClusterSize = 100, modules = NULL, seed = 12345, nNeighbors = 30, minDist = 0.75, spread = 1, pca = TRUE, initialDims = 50, normalize = "proportion", scaleFactor = NULL, transformationFun = sqrt, cores = 1, ...) { standardGeneric("celdaUmap") }) #' @rdname celdaUmap #' @examples #' data(sceCeldaCG) #' umapRes <- celdaUmap(sceCeldaCG) #' @export setMethod("celdaUmap", signature(sce = "SingleCellExperiment"), function(sce, useAssay = "counts", altExpName = "featureSubset", maxCells = NULL, minClusterSize = 100, modules = NULL, seed = 12345, nNeighbors = 30, minDist = 0.75, spread = 1, pca = TRUE, initialDims = 50, normalize = "proportion", scaleFactor = NULL, transformationFun = sqrt, cores = 1, ...) { if (is.null(seed)) { sce <- .celdaUmap(sce = sce, useAssay = useAssay, altExpName = altExpName, maxCells = maxCells, minClusterSize = minClusterSize, modules = modules, seed = seed, nNeighbors = nNeighbors, minDist = minDist, spread = spread, pca = pca, initialDims = initialDims, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun, cores = cores, ...) } else { with_seed(seed, sce <- .celdaUmap(sce = sce, useAssay = useAssay, altExpName = altExpName, maxCells = maxCells, minClusterSize = minClusterSize, modules = modules, seed = seed, nNeighbors = nNeighbors, minDist = minDist, spread = spread, pca = pca, initialDims = initialDims, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun, cores = cores, ...)) } return(sce) }) .celdaUmap <- function(sce, useAssay, altExpName, maxCells, minClusterSize, modules, seed, nNeighbors, minDist, spread, pca, initialDims, cores, normalize, scaleFactor, transformationFun, ...) { celdaMod <- celdaModel(sce, altExpName = altExpName) altExp <- SingleCellExperiment::altExp(sce, altExpName) if (celdaMod == "celda_C") { res <- .celdaUmapC(sce = altExp, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, nNeighbors = nNeighbors, minDist = minDist, spread = spread, pca = pca, initialDims = initialDims, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun, cores = cores, ...) } else if (celdaMod == "celda_CG") { res <- .celdaUmapCG(sce = altExp, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, modules = modules, seed = seed, nNeighbors = nNeighbors, minDist = minDist, spread = spread, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun, cores = cores, ...) } else if (celdaMod == "celda_G") { res <- .celdaUmapG(sce = altExp, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, modules = modules, seed = seed, nNeighbors = nNeighbors, minDist = minDist, spread = spread, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun, cores = cores, ...) } else { stop("S4Vectors::metadata(altExp(sce, altExpName))$", "celda_parameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'") } SingleCellExperiment::reducedDim(altExp, "celda_UMAP") <- res SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } .celdaUmapC <- function(sce, useAssay, maxCells, minClusterSize, nNeighbors, minDist, spread, pca, initialDims, normalize, scaleFactor, transformationFun, cores, ...) { preparedCountInfo <- .prepareCountsForDimReductionCeldaC(sce = sce, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun) umapRes <- .calculateUmap(preparedCountInfo$norm, nNeighbors = nNeighbors, minDist = minDist, spread = spread, pca = pca, initialDims = initialDims, cores = cores, ... ) final <- matrix(NA, nrow = ncol(sce), ncol = 2) final[preparedCountInfo$cellIx, ] <- umapRes rownames(final) <- colnames(sce) colnames(final) <- c("celda_UMAP1", "celda_UMAP2") return(final) } .celdaUmapCG <- function(sce, useAssay, maxCells, minClusterSize, modules, seed, nNeighbors, minDist, spread, normalize, scaleFactor, transformationFun, cores, ...) { preparedCountInfo <- .prepareCountsForDimReductionCeldaCG(sce = sce, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, modules = modules, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun) umapRes <- .calculateUmap(preparedCountInfo$norm, nNeighbors = nNeighbors, minDist = minDist, spread = spread, cores = cores, ...) final <- matrix(NA, nrow = ncol(sce), ncol = 2) final[preparedCountInfo$cellIx, ] <- umapRes rownames(final) <- colnames(sce) colnames(final) <- c("celda_UMAP1", "celda_UMAP2") return(final) } .celdaUmapG <- function(sce, useAssay, maxCells, minClusterSize, modules, seed, nNeighbors, minDist, spread, normalize, scaleFactor, transformationFun, cores, ...) { preparedCountInfo <- .prepareCountsForDimReductionCeldaG(sce = sce, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, modules = modules, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun) umapRes <- .calculateUmap(preparedCountInfo$norm, nNeighbors = nNeighbors, minDist = minDist, spread = spread, cores = cores, ...) final <- matrix(NA, nrow = ncol(sce), ncol = 2) final[preparedCountInfo$cellIx, ] <- umapRes rownames(final) <- colnames(sce) colnames(final) <- c("celda_UMAP1", "celda_UMAP2") return(final) } # Run the UMAP algorithm for dimensionality reduction # @param norm Normalized count matrix. # @param nNeighbors The size of local neighborhood used for # manifold approximation. Larger values result in more global # views of the manifold, while smaller values result in more # local data being preserved. Default 30. # See `?uwot::umap` for more information. # @param minDist The effective minimum distance between embedded points. # Smaller values will result in a more clustered/clumped # embedding where nearby points on the manifold are drawn # closer together, while larger values will result on a more # even dispersal of points. Default 0.2. # See `?uwot::umap` for more information. # @param spread The effective scale of embedded points. In combination with # 'min_dist', this determines how clustered/clumped the # embedded points are. Default 1. # See `?uwot::umap` for more information. # @param pca Logical. Whether to perform # dimensionality reduction with PCA before UMAP. # @param initialDims Integer. Number of dimensions from PCA to use as # input in UMAP. Default 50. # @param cores Number of threads to use. Default 1. # @param ... Other parameters to pass to `uwot::umap`. #' @import uwot .calculateUmap <- function(norm, nNeighbors = 30, minDist = 0.75, spread = 1, pca = FALSE, initialDims = 50, cores = 1, ...) { if (isTRUE(pca)) { doPCA <- initialDims } else { doPCA <- NULL } res <- uwot::umap(norm, n_neighbors = nNeighbors, min_dist = minDist, spread = spread, n_threads = cores, n_sgd_threads = 1, pca = doPCA, ... ) return(res) } ================================================ FILE: R/celda_C.R ================================================ #' @title Cell clustering with Celda #' @description Clusters the columns of a count matrix containing single-cell #' data into K subpopulations. The #' \code{useAssay} \link{assay} slot in #' \code{altExpName} \link{altExp} slot will be used if #' it exists. Otherwise, the \code{useAssay} #' \link{assay} slot in \code{x} will be used if #' \code{x} is a \linkS4class{SingleCellExperiment} object. #' @param x A \linkS4class{SingleCellExperiment} #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. Alternatively, #' any matrix-like object that can be coerced to a sparse matrix of class #' "dgCMatrix" can be directly used as input. The matrix will automatically be #' converted to a \linkS4class{SingleCellExperiment} object. #' @param useAssay A string specifying the name of the #' \link{assay} slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param sampleLabel Vector or factor. Denotes the sample label for each cell #' (column) in the count matrix. #' @param K Integer. Number of cell populations. #' @param alpha Numeric. Concentration parameter for Theta. Adds a pseudocount #' to each cell population in each sample. Default 1. #' @param beta Numeric. Concentration parameter for Phi. Adds a pseudocount to #' each feature in each cell population. Default 1. #' @param algorithm String. Algorithm to use for clustering cell subpopulations. #' One of 'EM' or 'Gibbs'. The EM algorithm is faster, especially for larger #' numbers of cells. However, more chains may be required to ensure a good #' solution is found. If 'EM' is selected, then 'stopIter' will be #' automatically set to 1. Default 'EM'. #' @param stopIter Integer. Number of iterations without improvement in the #' log likelihood to stop inference. Default 10. #' @param maxIter Integer. Maximum number of iterations of Gibbs sampling or #' EM to perform. Default 200. #' @param splitOnIter Integer. On every `splitOnIter` iteration, a heuristic #' will be applied to determine if a cell population should be reassigned and #' another cell population should be split into two clusters. To disable #' splitting, set to -1. Default 10. #' @param splitOnLast Integer. After `stopIter` iterations have been #' performed without improvement, a heuristic will be applied to determine if #' a cell population should be reassigned and another cell population should be #' split into two clusters. If a split occurs, then `stopIter` will be reset. #' Default TRUE. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @param nchains Integer. Number of random cluster initializations. Default 3. #' @param zInitialize Character. One of 'random', 'split', or 'predefined'. #' With 'random', cells are randomly assigned to a populations. With 'split', #' cells will be split into sqrt(K) populations and then each population will #' be subsequently split into another sqrt(K) populations. With 'predefined', #' values in `zInit` will be used to initialize `z`. Default 'split'. #' @param zInit Integer vector. Sets initial starting values of z. 'zInit' #' is only used when `zInitialize = 'predfined'`. Default NULL. #' @param countChecksum Character. An MD5 checksum for the `counts` matrix. #' Default NULL. #' @param logfile Character. Messages will be redirected to a file named #' `logfile`. If NULL, messages will be printed to stdout. Default NULL. #' @param verbose Logical. Whether to print log messages. Default TRUE. #' @return A \link[SingleCellExperiment]{SingleCellExperiment} object. Function #' parameter settings are stored in the \link{metadata} #' \code{"celda_parameters"} slot. #' Columns \code{celda_sample_label} and \code{celda_cell_cluster} in #' \link{colData} contain sample labels and celda cell #' population clusters. #' @seealso \link{celda_G} for feature clustering and \link{celda_CG} for #' simultaneous clustering of features and cells. \link{celdaGridSearch} can #' be used to run multiple values of K and multiple chains in parallel. #' @examples #' data(celdaCSim) #' sce <- celda_C(celdaCSim$counts, #' K = celdaCSim$K, #' sampleLabel = celdaCSim$sampleLabel, #' nchains = 1) #' @import Rcpp RcppEigen #' @importFrom withr with_seed #' @export setGeneric("celda_C", function(x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, alpha = 1, beta = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, logfile = NULL, verbose = TRUE) { standardGeneric("celda_C")}) #' @rdname celda_C #' @export setMethod("celda_C", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, alpha = 1, beta = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, logfile = NULL, verbose = TRUE) { xClass <- "SingleCellExperiment" if (!altExpName %in% SingleCellExperiment::altExpNames(x)) { stop(altExpName, " not in 'altExpNames(x)'. Run ", "selectFeatures(x) first!") } altExp <- SingleCellExperiment::altExp(x, altExpName) if (!useAssay %in% SummarizedExperiment::assayNames(altExp)) { stop(useAssay, " not in assayNames(altExp(x, altExpName))") } counts <- SummarizedExperiment::assay(altExp, i = useAssay) altExp <- .celdaCWithSeed(counts = counts, xClass = xClass, useAssay = useAssay, sce = altExp, sampleLabel = sampleLabel, K = K, alpha = alpha, beta = beta, algorithm = match.arg(algorithm), stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, seed = seed, nchains = nchains, zInitialize = match.arg(zInitialize), countChecksum = countChecksum, zInit = zInit, logfile = logfile, verbose = verbose) SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) } ) #' @rdname celda_C #' @export setMethod("celda_C", signature(x = "ANY"), function(x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, alpha = 1, beta = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, logfile = NULL, verbose = TRUE) { # Convert to sparse matrix x <- methods::as(x, "CsparseMatrix") ls <- list() ls[[useAssay]] <- x sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) SingleCellExperiment::altExp(sce, altExpName) <- sce xClass <- "matrix" altExp <- .celdaCWithSeed(counts = x, xClass = xClass, useAssay = useAssay, sce = SingleCellExperiment::altExp(sce, altExpName), sampleLabel = sampleLabel, K = K, alpha = alpha, beta = beta, algorithm = match.arg(algorithm), stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, seed = seed, nchains = nchains, zInitialize = match.arg(zInitialize), countChecksum = countChecksum, zInit = zInit, logfile = logfile, verbose = verbose) SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } ) .celdaCWithSeed <- function(counts, xClass, useAssay, sce, sampleLabel, K, alpha, beta, algorithm, stopIter, maxIter, splitOnIter, splitOnLast, seed, nchains, zInitialize, countChecksum, zInit, logfile, verbose) { .validateCounts(counts) if (is.null(seed)) { celdaCMod <- .celda_C(counts = counts, sampleLabel = sampleLabel, K = K, alpha = alpha, beta = beta, algorithm = algorithm, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, nchains = nchains, zInitialize = zInitialize, countChecksum = countChecksum, zInit = zInit, logfile = logfile, verbose = verbose, reorder = TRUE) } else { with_seed(seed, celdaCMod <- .celda_C(counts = counts, sampleLabel = sampleLabel, K = K, alpha = alpha, beta = beta, algorithm = algorithm, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, nchains = nchains, zInitialize = zInitialize, countChecksum = countChecksum, zInit = zInit, logfile = logfile, verbose = verbose, reorder = TRUE)) } sce <- .createSCEceldaC(celdaCMod = celdaCMod, sce = sce, xClass = xClass, useAssay = useAssay, algorithm = algorithm, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, nchains = nchains, zInitialize = zInitialize, zInit = zInit, logfile = logfile, verbose = verbose) return(sce) } # celda_C main function .celda_C <- function(counts, sampleLabel = NULL, K, alpha = 1, beta = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, nchains = 3, zInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, logfile = NULL, verbose = TRUE, reorder = TRUE) { .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = FALSE, verbose = verbose) .logMessages("Starting Celda_C: Clustering cells.", logfile = logfile, append = TRUE, verbose = verbose) .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose) startTime <- Sys.time() ## Error checking and variable processing counts <- .processCounts(counts) if (is.null(countChecksum)) { countChecksum <- .createCountChecksum(counts) } sampleLabel <- .processSampleLabels(sampleLabel, ncol(counts)) s <- as.integer(sampleLabel) algorithm <- match.arg(algorithm) if (algorithm == "EM") { stopIter <- 1 } algorithmFun <- ifelse(algorithm == "Gibbs", ".cCCalcGibbsProbZ", ".cCCalcEMProbZ" ) zInitialize <- match.arg(zInitialize) allChains <- seq(nchains) bestResult <- NULL for (i in allChains) { ## Initialize cluster labels .logMessages(date(), ".. Initializing 'z' in chain", i, "with", paste0("'", zInitialize, "' "), logfile = logfile, append = TRUE, verbose = verbose ) if (zInitialize == "predefined") { if (is.null(zInit)) { stop("'zInit' needs to specified when initilize.z == 'given'.") } z <- .initializeCluster(K, ncol(counts), initial = zInit, fixed = NULL ) } else if (zInitialize == "split") { z <- .initializeSplitZ(counts, K = K, alpha = alpha, beta = beta ) } else { z <- .initializeCluster(K, ncol(counts), initial = NULL, fixed = NULL ) } zBest <- z ## Calculate counts one time up front p <- .cCDecomposeCounts(counts, s, z, K) nS <- p$nS nG <- p$nG nM <- p$nM mCPByS <- p$mCPByS nGByCP <- p$nGByCP nCP <- p$nCP nByC <- p$nByC ll <- .cCCalcLL( mCPByS = mCPByS, nGByCP = nGByCP, s = s, K = K, nS = nS, nG = nG, alpha = alpha, beta = beta ) iter <- 1L numIterWithoutImprovement <- 0L doCellSplit <- TRUE while (iter <= maxIter & numIterWithoutImprovement <= stopIter) { nextZ <- do.call(algorithmFun, list( counts = counts, mCPByS = mCPByS, nGByCP = nGByCP, nByC = nByC, nCP = nCP, z = z, s = s, K = K, nG = nG, nM = nM, alpha = alpha, beta = beta )) mCPByS <- nextZ$mCPByS nGByCP <- nextZ$nGByCP nCP <- nextZ$nCP z <- nextZ$z ## Perform split on i-th iteration of no improvement in log ## likelihood tempLl <- .cCCalcLL( mCPByS = mCPByS, nGByCP = nGByCP, s = s, K = K, nS = nS, nG = nG, alpha = alpha, beta = beta ) if (K > 2 & iter != maxIter & ((((numIterWithoutImprovement == stopIter & !all(tempLl >= ll))) & isTRUE(splitOnLast)) | (splitOnIter > 0 & iter %% splitOnIter == 0 & isTRUE(doCellSplit)))) { .logMessages(date(), " .... Determining if any cell clusters should be split.", logfile = logfile, append = TRUE, sep = "", verbose = verbose ) res <- .cCSplitZ( counts, mCPByS, nGByCP, nCP, s, z, K, nS, nG, alpha, beta, zProb = t(as.matrix(nextZ$probs)), maxClustersToTry = K, minCell = 3 ) .logMessages(res$message, logfile = logfile, append = TRUE, verbose = verbose ) # Reset convergence counter if a split occured if (!isTRUE(all.equal(z, res$z))) { numIterWithoutImprovement <- 0L doCellSplit <- TRUE } else { doCellSplit <- FALSE } ## Re-calculate variables z <- res$z mCPByS <- res$mCPByS nGByCP <- res$nGByCP nCP <- res$nCP } ## Calculate complete likelihood tempLl <- .cCCalcLL( mCPByS = mCPByS, nGByCP = nGByCP, s = s, K = K, nS = nS, nG = nG, alpha = alpha, beta = beta ) if ((all(tempLl > ll)) | iter == 1) { zBest <- z llBest <- tempLl numIterWithoutImprovement <- 1L } else { numIterWithoutImprovement <- numIterWithoutImprovement + 1L } ll <- c(ll, tempLl) .logMessages(date(), ".... Completed iteration:", iter, "| logLik:", tempLl, logfile = logfile, append = TRUE, verbose = verbose ) iter <- iter + 1 } names <- list( row = rownames(counts), column = colnames(counts), sample = levels(sampleLabel) ) result <- list( z = zBest, completeLogLik = ll, finalLogLik = llBest, K = K, sampleLabel = sampleLabel, alpha = alpha, beta = beta, countChecksum = countChecksum, names = names ) if (is.null(bestResult) || result$finalLogLik > bestResult$finalLogLik) { bestResult <- result } .logMessages(date(), ".. Finished chain", i, logfile = logfile, append = TRUE, verbose = verbose ) } bestResult <- methods::new("celda_C", clusters = list(z = bestResult$z), params = list( K = as.integer(bestResult$K), alpha = bestResult$alpha, beta = bestResult$beta, countChecksum = bestResult$countChecksum ), sampleLabel = bestResult$sampleLabel, completeLogLik = bestResult$completeLogLik, finalLogLik = bestResult$finalLogLik, names = bestResult$names ) if (isTRUE(reorder)) { bestResult <- .reorderCeldaC(counts = counts, res = bestResult) } endTime <- Sys.time() .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages("Completed Celda_C. Total time:", format(difftime(endTime, startTime)), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) return(bestResult) } # Gibbs sampling for the celda_C Model .cCCalcGibbsProbZ <- function(counts, mCPByS, nGByCP, nByC, nCP, z, s, K, nG, nM, alpha, beta, doSample = TRUE) { ## Set variables up front outside of loop probs <- matrix(NA, ncol = nM, nrow = K) ix <- sample(seq(nM)) for (i in ix) { ## Subtract cell counts from current population assignment # nGByCP1 <- nGByCP # nGByCP1[, z[i]] <- nGByCP[, z[i]] - counts[, i] # nGByCP1 <- .colSums(lgamma(nGByCP1 + beta), nrow(nGByCP), ncol(nGByCP)) # nCP1 <- nCP # nCP1[z[i]] <- nCP1[z[i]] - nByC[i] # nCP1 <- lgamma(nCP1 + (nG * beta)) ## Add cell counts to all other populations # nGByCP2 <- nGByCP # otherIx <- seq(K)[-z[i]] # nGByCP2[, otherIx] <- nGByCP2[, otherIx] + counts[, i] # nGByCP2 <- .colSums(lgamma(nGByCP2 + beta), nrow(nGByCP), ncol(nGByCP)) # nCP2 <- nCP # nCP2[otherIx] <- nCP2[otherIx] + nByC[i] # nCP2 <- lgamma(nCP2 + (nG * beta)) mCPByS[z[i], s[i]] <- mCPByS[z[i], s[i]] - 1L ## Calculate probabilities for each state ## when consider a specific cluster fo this cell, ## no need to calculate cells in other cluster for (j in seq_len(K)) { # otherIx <- seq(K)[-j] if (j != z[i]) { # when j is not current population assignment ## Theta simplified probs[j, i] <- log(mCPByS[j, s[i]] + alpha) + # if adding this cell -- Phi Numerator sum(lgamma(nGByCP[, j] + counts[, i] + beta)) - # if adding this cell -- Phi Denominator lgamma(nCP[j] + nByC[i] + nG * beta) - # if without this cell -- Phi Numerator sum(lgamma(nGByCP[, j] + beta)) + # if without this cell -- Phi Denominator lgamma(nCP[j] + nG * beta) # sum(nGByCP1[otherIx]) + ## Phi Numerator (other cells) # nGByCP2[j] - ## Phi Numerator (current cell) # sum(nCP1[otherIx]) - ## Phi Denominator (other cells) # nCP2[j] - ## Phi Denominator (current cell) } else { # when j is current population assignment ## Theta simplified probs[j, i] <- log(mCPByS[j, s[i]] + alpha) + sum(lgamma(nGByCP[, j] + beta)) - lgamma(nCP[j] + nG * beta) - sum(lgamma(nGByCP[, j] - counts[, i] + beta)) + lgamma(nCP[j] - nByC[i] + nG * beta) } } ## Sample next state and add back counts prevZ <- z[i] if (isTRUE(doSample)) { z[i] <- .sampleLl(probs[, i]) } if (prevZ != z[i]) { nGByCP[, prevZ] <- nGByCP[, prevZ] - counts[, i] nGByCP[, z[i]] <- nGByCP[, z[i]] + counts[, i] nCP[prevZ] <- nCP[prevZ] - nByC[i] nCP[z[i]] <- nCP[z[i]] + nByC[i] } mCPByS[z[i], s[i]] <- mCPByS[z[i], s[i]] + 1L } return(list( mCPByS = mCPByS, nGByCP = nGByCP, nCP = nCP, z = z, probs = probs )) } .cCCalcEMProbZ <- function(counts, mCPByS, nGByCP, nByC, nCP, z, s, K, nG, nM, alpha, beta, doSample = TRUE) { ## Expectation given current cell population labels theta <- fastNormPropLog(mCPByS, alpha) phi <- fastNormPropLog(nGByCP, beta) ## Maximization to find best label for each cell probs <- .countsTimesProbs(counts, phi) + theta[, s] if (isTRUE(doSample)) { zPrevious <- z z <- apply(probs, 2, which.max) ## Recalculate counts based on new label p <- .cCReDecomposeCounts(counts, s, z, zPrevious, nGByCP, K) mCPByS <- p$mCPByS nGByCP <- p$nGByCP nCP <- p$nCP } return(list( mCPByS = mCPByS, nGByCP = nGByCP, nCP = nCP, z = z, probs = probs )) } # Calculate log-likelihood for celda_C model .cCCalcLL <- function(mCPByS, nGByCP, s, z, K, nS, nG, alpha, beta) { ## Calculate for "Theta" component a <- nS * lgamma(K * alpha) b <- sum(lgamma(mCPByS + alpha)) c <- -nS * K * lgamma(alpha) d <- -sum(lgamma(colSums(mCPByS + alpha))) thetaLl <- a + b + c + d ## Calculate for "Phi" component a <- K * lgamma(nG * beta) b <- sum(lgamma(nGByCP + beta)) c <- -K * nG * lgamma(beta) d <- -sum(lgamma(colSums(nGByCP + beta))) phiLl <- a + b + c + d final <- thetaLl + phiLl return(final) } # Takes raw counts matrix and converts it to a series of matrices needed for # log likelihood calculation # @param counts Integer matrix. Rows represent features and columns represent # cells. # @param s Integer vector. Contains the sample label for each cell (column) in # the count matrix. # @param z Numeric vector. Denotes cell population labels. # @param K Integer. Number of cell populations. #' @importFrom Matrix colSums .cCDecomposeCounts <- function(counts, s, z, K) { nS <- length(unique(s)) nG <- nrow(counts) nM <- ncol(counts) mCPByS <- matrix(as.integer(table(factor(z, levels = seq(K)), s)), ncol = nS ) nGByCP <- .colSumByGroup(counts, group = z, K = K) nCP <- .colSums(nGByCP, nrow(nGByCP), ncol(nGByCP)) nByC <- colSums(counts) return(list( mCPByS = mCPByS, nGByCP = nGByCP, nCP = nCP, nByC = nByC, nS = nS, nG = nG, nM = nM )) } #' @importFrom Matrix colSums .cCReDecomposeCounts <- function(counts, s, z, previousZ, nGByCP, K) { ## Recalculate counts based on new label nGByCP <- .colSumByGroupChange(counts, nGByCP, z, previousZ, K) nCP <- colSums(nGByCP) nS <- length(unique(s)) mCPByS <- matrix(as.integer(table(factor(z, levels = seq(K)), s)), ncol = nS ) return(list( mCPByS = mCPByS, nGByCP = nGByCP, nCP = nCP )) } .prepareCountsForDimReductionCeldaC <- function(sce, useAssay, maxCells, minClusterSize, normalize, scaleFactor, transformationFun) { counts <- SummarizedExperiment::assay(sce, i = useAssay) counts <- .processCounts(counts) ## Checking if maxCells and minClusterSize will work if (!is.null(maxCells)) { if ((maxCells < ncol(counts)) & (maxCells / minClusterSize < S4Vectors::metadata(sce)$celda_parameters$K)) { stop("Cannot distribute ", maxCells, " cells among ", S4Vectors::metadata(sce)$celda_parameters$K, " clusters while maintaining a minumum of ", minClusterSize, " cells per cluster. Try increasing 'maxCells' or decreasing", " 'minClusterSize'.") } } else { maxCells <- ncol(counts) } ## Select a subset of cells to sample if greater than 'maxCells' totalCellsToRemove <- ncol(counts) - maxCells zInclude <- rep(TRUE, ncol(counts)) if (totalCellsToRemove > 0) { zTa <- tabulate(SummarizedExperiment::colData(sce)$celda_cell_cluster, S4Vectors::metadata(sce)$celda_parameters$K) ## Number of cells that can be sampled from each cluster without ## going below the minimum threshold clusterCellsToSample <- zTa - minClusterSize clusterCellsToSample[clusterCellsToSample < 0] <- 0 ## Number of cells to sample after exluding smaller clusters ## Rounding can cause number to be off by a few, so ceiling is ## used with a second round of subtraction clusterNToSample <- ceiling((clusterCellsToSample / sum(clusterCellsToSample)) * totalCellsToRemove) diff <- sum(clusterNToSample) - totalCellsToRemove clusterNToSample[which.max(clusterNToSample)] <- clusterNToSample[which.max(clusterNToSample)] - diff ## Perform sampling for each cluster for (i in which(clusterNToSample > 0)) { zInclude[sample(which( SummarizedExperiment::colData(sce)$celda_cell_cluster == i), clusterNToSample[i])] <- FALSE } } cellIx <- which(zInclude) norm <- t(normalizeCounts(counts[, cellIx], normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun)) return(list(norm = norm, cellIx = cellIx)) } .createSCEceldaC <- function(celdaCMod, sce, xClass, useAssay, algorithm, stopIter, maxIter, splitOnIter, splitOnLast, nchains, zInitialize, zInit, logfile, verbose) { # add metadata S4Vectors::metadata(sce)[["celda_parameters"]] <- list( model = "celda_C", xClass = xClass, useAssay = useAssay, sampleLevels = celdaCMod@names$sample, K = celdaCMod@params$K, alpha = celdaCMod@params$alpha, beta = celdaCMod@params$beta, algorithm = algorithm, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, seed = celdaCMod@params$seed, nchains = nchains, zInitialize = zInitialize, countChecksum = celdaCMod@params$countChecksum, zInit = zInit, logfile = logfile, verbose = verbose, completeLogLik = celdaCMod@completeLogLik, finalLogLik = celdaCMod@finalLogLik, cellClusterLevels = sort(unique(celdaClusters(celdaCMod)$z))) SummarizedExperiment::rowData(sce)["rownames"] <- celdaCMod@names$row SummarizedExperiment::colData(sce)["colnames"] <- celdaCMod@names$column SummarizedExperiment::colData(sce)["celda_sample_label"] <- as.factor(celdaCMod@sampleLabel) SummarizedExperiment::colData(sce)["celda_cell_cluster"] <- as.factor(celdaClusters(celdaCMod)$z) return(sce) } # #' @name countsTimesProbs # #' @title Counts matrix times cell population probabilies # #' @param counts feature-by-cell matrix # #' @param phi feature-by-probability matrix #' @importMethodsFrom Matrix %*% .countsTimesProbs <- function(counts, phi) { ## Maximization to find best label for each cell if (inherits(counts, "matrix") & is.integer(counts)) { probs <- eigenMatMultInt(phi, counts) } else if (inherits(counts, "matrix") & is.numeric(counts)) { probs <- eigenMatMultNumeric(phi, counts) } else { probs <- (t(phi) %*% counts) } return(probs) } ================================================ FILE: R/celda_CG.R ================================================ #' @title Cell and feature clustering with Celda #' @description Clusters the rows and columns of a count matrix containing #' single-cell data into L modules and K subpopulations, respectively. The #' \code{useAssay} \link{assay} slot in #' \code{altExpName} \link{altExp} slot will be used if #' it exists. Otherwise, the \code{useAssay} #' \link{assay} slot in \code{x} will be used if #' \code{x} is a \linkS4class{SingleCellExperiment} object. #' @param x A \linkS4class{SingleCellExperiment} #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. Alternatively, #' any matrix-like object that can be coerced to a sparse matrix of class #' "dgCMatrix" can be directly used as input. The matrix will automatically be #' converted to a \linkS4class{SingleCellExperiment} object. #' @param useAssay A string specifying the name of the #' \link{assay} slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param sampleLabel Vector or factor. Denotes the sample label for each cell #' (column) in the count matrix. #' @param K Integer. Number of cell populations. #' @param L Integer. Number of feature modules. #' @param alpha Numeric. Concentration parameter for Theta. Adds a pseudocount #' to each cell population in each sample. Default 1. #' @param beta Numeric. Concentration parameter for Phi. Adds a pseudocount to #' each feature module in each cell population. Default 1. #' @param delta Numeric. Concentration parameter for Psi. Adds a pseudocount to #' each feature in each module. Default 1. #' @param gamma Numeric. Concentration parameter for Eta. Adds a pseudocount to #' the number of features in each module. Default 1. #' @param algorithm String. Algorithm to use for clustering cell subpopulations. #' One of 'EM' or 'Gibbs'. The EM algorithm for cell clustering is faster, #' especially for larger numbers of cells. However, more chains may be required #' to ensure a good solution is found. Default 'EM'. #' @param stopIter Integer. Number of iterations without improvement in the log #' likelihood to stop inference. Default 10. #' @param maxIter Integer. Maximum number of iterations of Gibbs sampling to #' perform. Default 200. #' @param splitOnIter Integer. On every \code{splitOnIter} iteration, #' a heuristic #' will be applied to determine if a cell population or feature module should #' be reassigned and another cell population or feature module should be split #' into two clusters. To disable splitting, set to -1. Default 10. #' @param splitOnLast Integer. After \code{stopIter} iterations have been #' performed without improvement, a heuristic will be applied to determine if #' a cell population or feature module should be reassigned and another cell #' population or feature module should be split into two clusters. If a split #' occurs, then 'stopIter' will be reset. Default TRUE. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @param nchains Integer. Number of random cluster initializations. Default 3. #' @param zInitialize Chararacter. One of 'random', 'split', or 'predefined'. #' With 'random', cells are randomly assigned to a populations. With 'split', #' cells will be split into sqrt(K) populations and then each population will #' be subsequently split into another sqrt(K) populations. With 'predefined', #' values in \code{zInit} will be used to initialize \code{z}. Default 'split'. #' @param yInitialize Character. One of 'random', 'split', or 'predefined'. #' With 'random', features are randomly assigned to a modules. With 'split', #' features will be split into sqrt(L) modules and then each module will be #' subsequently split into another sqrt(L) modules. With 'predefined', values #' in \code{yInit} will be used to initialize \code{y}. Default 'split'. #' @param zInit Integer vector. Sets initial starting values of z. 'zInit' #' is only used when `zInitialize = 'predfined'`. Default NULL. #' @param yInit Integer vector. Sets initial starting values of y. #' 'yInit' is only be used when `yInitialize = "predefined"`. Default NULL. #' @param countChecksum Character. An MD5 checksum for the counts matrix. #' Default NULL. #' @param logfile Character. Messages will be redirected to a file named #' `logfile`. If NULL, messages will be printed to stdout. Default NULL. #' @param verbose Logical. Whether to print log messages. Default TRUE. #' @return A \linkS4class{SingleCellExperiment} object. Function #' parameter settings are stored in \link{metadata} #' \code{"celda_parameters"} in \link{altExp} slot. #' In \link{altExp} slot, #' columns \code{celda_sample_label} and \code{celda_cell_cluster} in #' \link{colData} contain sample labels and celda cell #' population clusters. Column \code{celda_feature_module} in #' \link{rowData} contains feature modules. #' @seealso \link{celda_G} for feature clustering and \link{celda_C} for #' clustering cells. \link{celdaGridSearch} can be used to run multiple #' values of K/L and multiple chains in parallel. #' @import Rcpp RcppEigen #' @export setGeneric("celda_CG", function(x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, L, alpha = 1, beta = 1, delta = 1, gamma = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), yInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, yInit = NULL, logfile = NULL, verbose = TRUE) { standardGeneric("celda_CG")}) #' @rdname celda_CG #' @export setMethod("celda_CG", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, L, alpha = 1, beta = 1, delta = 1, gamma = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), yInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, yInit = NULL, logfile = NULL, verbose = TRUE) { xClass <- "SingleCellExperiment" if (!altExpName %in% SingleCellExperiment::altExpNames(x)) { stop(altExpName, " not in 'altExpNames(x)'. Run ", "selectFeatures(x) first!") } altExp <- SingleCellExperiment::altExp(x, altExpName) if (!useAssay %in% SummarizedExperiment::assayNames(altExp)) { stop(useAssay, " not in assayNames(altExp(x, altExpName))") } counts <- SummarizedExperiment::assay(altExp, i = useAssay) altExp <- .celdaCGWithSeed(counts = counts, xClass = xClass, useAssay = useAssay, sce = altExp, sampleLabel = sampleLabel, K = K, L = L, alpha = alpha, beta = beta, delta = delta, gamma = gamma, algorithm = match.arg(algorithm), stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, seed = seed, nchains = nchains, zInitialize = match.arg(zInitialize), yInitialize = match.arg(yInitialize), countChecksum = countChecksum, zInit = zInit, yInit = yInit, logfile = logfile, verbose = verbose) SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) } ) #' @rdname celda_CG #' @examples #' data(celdaCGSim) #' sce <- celda_CG(celdaCGSim$counts, #' K = celdaCGSim$K, #' L = celdaCGSim$L, #' sampleLabel = celdaCGSim$sampleLabel, #' nchains = 1) #' @export setMethod("celda_CG", signature(x = "ANY"), function(x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, L, alpha = 1, beta = 1, delta = 1, gamma = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), yInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, yInit = NULL, logfile = NULL, verbose = TRUE) { # Convert to sparse matrix x <- methods::as(x, "CsparseMatrix") ls <- list() ls[[useAssay]] <- x sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) SingleCellExperiment::altExp(sce, altExpName) <- sce xClass <- "matrix" altExp <- .celdaCGWithSeed(counts = x, xClass = xClass, useAssay = useAssay, sce = SingleCellExperiment::altExp(sce, altExpName), sampleLabel = sampleLabel, K = K, L = L, alpha = alpha, beta = beta, delta = delta, gamma = gamma, algorithm = match.arg(algorithm), stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, seed = seed, nchains = nchains, zInitialize = match.arg(zInitialize), yInitialize = match.arg(yInitialize), countChecksum = countChecksum, zInit = zInit, yInit = yInit, logfile = logfile, verbose = verbose) SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } ) .celdaCGWithSeed <- function(counts, xClass, useAssay, sce, sampleLabel, K, L, alpha, beta, delta, gamma, algorithm, stopIter, maxIter, splitOnIter, splitOnLast, seed, nchains, zInitialize, yInitialize, countChecksum, zInit, yInit, logfile, verbose) { .validateCounts(counts) if (is.null(seed)) { celdaCGMod <- .celda_CG( counts = counts, sampleLabel = sampleLabel, K = K, L = L, alpha = alpha, beta = beta, delta = delta, gamma = gamma, algorithm = algorithm, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, nchains = nchains, zInitialize = zInitialize, yInitialize = yInitialize, countChecksum = countChecksum, zInit = zInit, yInit = yInit, logfile = logfile, verbose = verbose, reorder = TRUE ) } else { with_seed( seed, celdaCGMod <- .celda_CG( counts = counts, sampleLabel = sampleLabel, K = K, L = L, alpha = alpha, beta = beta, delta = delta, gamma = gamma, algorithm = algorithm, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, nchains = nchains, zInitialize = zInitialize, yInitialize = yInitialize, countChecksum = countChecksum, zInit = zInit, yInit = yInit, logfile = logfile, verbose = verbose, reorder = TRUE ) ) } sce <- .createSCEceldaCG(celdaCGMod = celdaCGMod, sce = sce, xClass = xClass, useAssay = useAssay, algorithm = algorithm, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, nchains = nchains, zInitialize = zInitialize, yInitialize = yInitialize, zInit = zInit, yInit = yInit, logfile = logfile, verbose = verbose) return(sce) } .celda_CG <- function(counts, sampleLabel = NULL, K, L, alpha = 1, beta = 1, delta = 1, gamma = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, nchains = 3, zInitialize = c("split", "random", "predefined"), yInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, yInit = NULL, logfile = NULL, verbose = TRUE, reorder = TRUE) { .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = FALSE, verbose = verbose ) .logMessages("Starting Celda_CG: Clustering cells and genes.", logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) startTime <- Sys.time() counts <- .processCounts(counts) if (is.null(countChecksum)) { countChecksum <- .createCountChecksum(counts) } sampleLabel <- .processSampleLabels(sampleLabel, ncol(counts)) s <- as.integer(sampleLabel) algorithm <- match.arg(algorithm) algorithmFun <- ifelse(algorithm == "Gibbs", ".cCCalcGibbsProbZ", ".cCCalcEMProbZ" ) zInitialize <- match.arg(zInitialize) yInitialize <- match.arg(yInitialize) allChains <- seq(nchains) # Pre-compute lgamma values lggamma <- lgamma(seq(0, nrow(counts) + L) + gamma) lgdelta <- c(NA, lgamma((seq(nrow(counts) + L) * delta))) bestResult <- NULL for (i in allChains) { ## Initialize cluster labels .logMessages(date(), ".. Initializing 'z' in chain", i, "with", paste0("'", zInitialize, "' "), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(date(), ".. Initializing 'y' in chain", i, "with", paste0("'", yInitialize, "' "), logfile = logfile, append = TRUE, verbose = verbose ) if (zInitialize == "predefined") { if (is.null(zInit)) { stop("'zInit' needs to specified when initilize.z == 'given'.") } z <- .initializeCluster(K, ncol(counts), initial = zInit, fixed = NULL ) } else if (zInitialize == "split") { z <- .initializeSplitZ( counts, K = K, alpha = alpha, beta = beta ) } else { z <- .initializeCluster(K, ncol(counts), initial = NULL, fixed = NULL ) } if (yInitialize == "predefined") { if (is.null(yInit)) { stop("'yInit' needs to specified when initilize.y == 'given'.") } y <- .initializeCluster(L, nrow(counts), initial = yInit, fixed = NULL ) } else if (yInitialize == "split") { y <- .initializeSplitY(counts, L, beta = beta, delta = delta, gamma = gamma ) } else { y <- .initializeCluster(L, nrow(counts), initial = NULL, fixed = NULL ) } zBest <- z yBest <- y ## Calculate counts one time up front p <- .cCGDecomposeCounts(counts, s, z, y, K, L) mCPByS <- p$mCPByS nTSByC <- p$nTSByC nTSByCP <- p$nTSByCP nCP <- p$nCP nByG <- p$nByG nByC <- p$nByC nByTS <- p$nByTS nGByTS <- p$nGByTS nGByCP <- p$nGByCP nM <- p$nM nG <- p$nG nS <- p$nS rm(p) ll <- .cCGCalcLL( K = K, L = L, mCPByS = mCPByS, nTSByCP = nTSByCP, nByG = nByG, nByTS = nByTS, nGByTS = nGByTS, nS = nS, nG = nG, alpha = alpha, beta = beta, delta = delta, gamma = gamma ) iter <- 1L numIterWithoutImprovement <- 0L doCellSplit <- TRUE doGeneSplit <- TRUE while (iter <= maxIter & numIterWithoutImprovement <= stopIter) { ## Gibbs sampling for each gene lgbeta <- lgamma(seq(0, max(nCP)) + beta) nextY <- .cGCalcGibbsProbY( counts = nGByCP, nTSByC = nTSByCP, nByTS = nByTS, nGByTS = nGByTS, nByG = nByG, y = y, L = L, nG = nG, beta = beta, delta = delta, gamma = gamma, lgbeta = lgbeta, lggamma = lggamma, lgdelta = lgdelta ) nTSByCP <- nextY$nTSByC nGByTS <- nextY$nGByTS nByTS <- nextY$nByTS nTSByC <- .rowSumByGroupChange(counts, nTSByC, nextY$y, y, L) y <- nextY$y ## Gibbs or EM sampling for each cell nextZ <- do.call(algorithmFun, list( counts = nTSByC, mCPByS = mCPByS, nGByCP = nTSByCP, nCP = nCP, nByC = nByC, z = z, s = s, K = K, nG = L, nM = nM, alpha = alpha, beta = beta )) mCPByS <- nextZ$mCPByS nTSByCP <- nextZ$nGByCP nCP <- nextZ$nCP nGByCP <- .colSumByGroupChange(counts, nGByCP, nextZ$z, z, K) z <- nextZ$z ## Perform split on i-th iteration defined by splitOnIter tempLl <- .cCGCalcLL( K = K, L = L, mCPByS = mCPByS, nTSByCP = nTSByCP, nByG = nByG, nByTS = nByTS, nGByTS = nGByTS, nS = nS, nG = nG, alpha = alpha, beta = beta, delta = delta, gamma = gamma ) if (L > 2 & iter != maxIter & (((numIterWithoutImprovement == stopIter & !all(tempLl >= ll)) & isTRUE(splitOnLast)) | (splitOnIter > 0 & iter %% splitOnIter == 0 & isTRUE(doGeneSplit)))) { .logMessages(date(), " .... Determining if any gene clusters should be split.", logfile = logfile, append = TRUE, sep = "", verbose = verbose ) res <- .cCGSplitY(counts, y, mCPByS, nGByCP, nTSByC, nTSByCP, nByG, nByTS, nGByTS, nCP, s, z, K, L, nS, nG, alpha, beta, delta, gamma, yProb = t(nextY$probs), maxClustersToTry = max(L / 2, 10), minCell = 3 ) .logMessages(res$message, logfile = logfile, append = TRUE, verbose = verbose ) # Reset convergence counter if a split occured if (!isTRUE(all.equal(y, res$y))) { numIterWithoutImprovement <- 1L doGeneSplit <- TRUE } else { doGeneSplit <- FALSE } ## Re-calculate variables y <- res$y nTSByCP <- res$nTSByCP nByTS <- res$nByTS nGByTS <- res$nGByTS nTSByC <- .rowSumByGroup(counts, group = y, L = L) } if (K > 2 & iter != maxIter & (((numIterWithoutImprovement == stopIter & !all(tempLl > ll)) & isTRUE(splitOnLast)) | (splitOnIter > 0 & iter %% splitOnIter == 0 & isTRUE(doCellSplit)))) { .logMessages(date(), " .... Determining if any cell clusters should be split.", logfile = logfile, append = TRUE, sep = "", verbose = verbose ) res <- .cCGSplitZ(counts, mCPByS, nTSByC, nTSByCP, nByG, nByTS, nGByTS, nCP, s, z, K, L, nS, nG, alpha, beta, delta, gamma, zProb = t(nextZ$probs), maxClustersToTry = K, minCell = 3 ) .logMessages(res$message, logfile = logfile, append = TRUE, verbose = verbose ) # Reset convergence counter if a split occured if (!isTRUE(all.equal(z, res$z))) { numIterWithoutImprovement <- 0L doCellSplit <- TRUE } else { doCellSplit <- FALSE } ## Re-calculate variables z <- res$z mCPByS <- res$mCPByS nTSByCP <- res$nTSByCP nCP <- res$nCP nGByCP <- .colSumByGroup(counts, group = z, K = K) } ## Calculate complete likelihood tempLl <- .cCGCalcLL( K = K, L = L, mCPByS = mCPByS, nTSByCP = nTSByCP, nByG = nByG, nByTS = nByTS, nGByTS = nGByTS, nS = nS, nG = nG, alpha = alpha, beta = beta, delta = delta, gamma = gamma ) if ((all(tempLl > ll)) | iter == 1) { zBest <- z yBest <- y llBest <- tempLl numIterWithoutImprovement <- 1L } else { numIterWithoutImprovement <- numIterWithoutImprovement + 1L } ll <- c(ll, tempLl) .logMessages(date(), " .... Completed iteration: ", iter, " | logLik: ", tempLl, logfile = logfile, append = TRUE, sep = "", verbose = verbose ) iter <- iter + 1L } names <- list( row = rownames(counts), column = colnames(counts), sample = levels(sampleLabel) ) result <- list( z = zBest, y = yBest, completeLogLik = ll, finalLogLik = llBest, K = K, L = L, alpha = alpha, beta = beta, delta = delta, gamma = gamma, sampleLabel = sampleLabel, names = names, countChecksum = countChecksum ) class(result) <- "celda_CG" if (is.null(bestResult) || result$finalLogLik > bestResult$finalLogLik) { bestResult <- result } .logMessages(date(), ".. Finished chain", i, logfile = logfile, append = TRUE, verbose = verbose ) } ## Peform reordering on final Z and Y assigments: bestResult <- methods::new("celda_CG", clusters = list(z = zBest, y = yBest), params = list( K = as.integer(K), L = as.integer(L), alpha = alpha, beta = beta, delta = delta, gamma = gamma, countChecksum = countChecksum ), completeLogLik = ll, finalLogLik = llBest, sampleLabel = sampleLabel, names = names ) if (isTRUE(reorder)) { bestResult <- .reorderCeldaCG(counts = counts, res = bestResult) } endTime <- Sys.time() .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages("Completed Celda_CG. Total time:", format(difftime(endTime, startTime)), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) return(bestResult) } # Calculate the loglikelihood for the celda_CG model .cCGCalcLL <- function(K, L, mCPByS, nTSByCP, nByG, nByTS, nGByTS, nS, nG, alpha, beta, delta, gamma) { nG <- sum(nGByTS) ## Calculate for "Theta" component a <- nS * lgamma(K * alpha) b <- sum(lgamma(mCPByS + alpha)) c <- -nS * K * lgamma(alpha) d <- -sum(lgamma(colSums(mCPByS + alpha))) thetaLl <- a + b + c + d ## Calculate for "Phi" component a <- K * lgamma(L * beta) b <- sum(lgamma(nTSByCP + beta)) c <- -K * L * lgamma(beta) d <- -sum(lgamma(colSums(nTSByCP + beta))) phiLl <- a + b + c + d ## Calculate for "Psi" component a <- sum(lgamma(nGByTS * delta)) b <- sum(lgamma(nByG + delta)) c <- -nG * lgamma(delta) d <- -sum(lgamma(nByTS + (nGByTS * delta))) psiLl <- a + b + c + d ## Calculate for "Eta" side a <- lgamma(L * gamma) b <- sum(lgamma(nGByTS + gamma)) c <- -L * lgamma(gamma) d <- -lgamma(sum(nGByTS + gamma)) etaLl <- a + b + c + d final <- thetaLl + phiLl + psiLl + etaLl return(final) } # Takes raw counts matrix and converts it to a series of matrices needed for # log likelihood calculation # @param counts Integer matrix. Rows represent features and columns represent # cells. # @param s Integer vector. Contains the sample label for each cell (column) in # the count matrix. # @param z Numeric vector. Denotes cell population labels. # @param y Numeric vector. Denotes feature module labels. # @param K Integer. Number of cell populations. # @param L Integer. Number of feature modules. #' @importFrom Matrix colSums rowSums .cCGDecomposeCounts <- function(counts, s, z, y, K, L) { nS <- length(unique(s)) mCPByS <- matrix(as.integer(table(factor(z, levels = seq(K)), s)), ncol = nS ) nTSByC <- .rowSumByGroup(counts, group = y, L = L) nGByCP <- .colSumByGroup(counts, group = z, K = K) nTSByCP <- .colSumByGroup(nTSByC, group = z, K = K) nByC <- colSums(counts) nByG <- rowSums(counts) nByTS <- .rowSumByGroup(matrix(nByG, ncol = 1), group = y, L = L) nCP <- .colSums(nTSByCP, nrow(nTSByCP), ncol(nTSByCP)) nGByTS <- tabulate(y, L) + 1 ## Add pseudogene to each module nG <- nrow(counts) nM <- ncol(counts) return(list( mCPByS = mCPByS, nTSByC = nTSByC, nTSByCP = nTSByCP, nCP = nCP, nByG = nByG, nByC = nByC, nByTS = nByTS, nGByTS = nGByTS, nGByCP = nGByCP, nM = nM, nG = nG, nS = nS )) } .prepareCountsForDimReductionCeldaCG <- function(sce, useAssay, maxCells, minClusterSize, modules, normalize, scaleFactor, transformationFun) { counts <- SummarizedExperiment::assay(sce, i = useAssay) counts <- .processCounts(counts) K <- S4Vectors::metadata(sce)$celda_parameters$K z <- as.integer(SummarizedExperiment::colData(sce)$celda_cell_cluster) y <- as.integer(SummarizedExperiment::rowData(sce)$celda_feature_module) L <- S4Vectors::metadata(sce)$celda_parameters$L alpha <- S4Vectors::metadata(sce)$celda_parameters$alpha beta <- S4Vectors::metadata(sce)$celda_parameters$beta delta <- S4Vectors::metadata(sce)$celda_parameters$delta gamma <- S4Vectors::metadata(sce)$celda_parameters$gamma sampleLabel <- SummarizedExperiment::colData(sce)$celda_sample_label cNames <- colnames(sce) rNames <- rownames(sce) sNames <- S4Vectors::metadata(sce)$celda_parameters$sampleLevels ## Checking if maxCells and minClusterSize will work if (!is.null(maxCells)) { if ((maxCells < ncol(counts)) & (maxCells / minClusterSize < K)) { stop("Cannot distribute ", maxCells, " cells among ", K, " clusters while maintaining a minumum of ", minClusterSize, " cells per cluster. Try increasing 'maxCells' or", " decreasing 'minClusterSize'.") } } else { maxCells <- ncol(counts) } fm <- .factorizeMatrixCG( counts = counts, K = K, z = z, y = y, L = L, alpha = alpha, beta = beta, delta = delta, gamma = gamma, sampleLabel = sampleLabel, cNames = cNames, rNames = rNames, sNames = sNames, type = "counts") modulesToUse <- seq(nrow(fm$counts$cell)) if (!is.null(modules)) { if (!all(modules %in% modulesToUse)) { stop("'modules' must be a vector of numbers between 1 and ", modulesToUse, ".") } modulesToUse <- modules } ## Select a subset of cells to sample if greater than 'maxCells' totalCellsToRemove <- ncol(counts) - maxCells zInclude <- rep(TRUE, ncol(counts)) if (totalCellsToRemove > 0) { zTa <- tabulate(z, K) ## Number of cells that can be sampled from each cluster without ## going below the minimum threshold clusterCellsToSample <- zTa - minClusterSize clusterCellsToSample[clusterCellsToSample < 0] <- 0 ## Number of cells to sample after exluding smaller clusters ## Rounding can cause number to be off by a few, so ceiling is used ## with a second round of subtraction clusterNToSample <- ceiling((clusterCellsToSample / sum(clusterCellsToSample)) * totalCellsToRemove) diff <- sum(clusterNToSample) - totalCellsToRemove clusterNToSample[which.max(clusterNToSample)] <- clusterNToSample[which.max(clusterNToSample)] - diff ## Perform sampling for each cluster for (i in which(clusterNToSample > 0)) { zInclude[sample(which(z == i), clusterNToSample[i])] <- FALSE } } cellIx <- which(zInclude) norm <- t(normalizeCounts(fm$counts$cell[modulesToUse, cellIx], normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun)) return(list(norm = norm, cellIx = cellIx)) } .createSCEceldaCG <- function(celdaCGMod, sce, xClass, useAssay, algorithm, stopIter, maxIter, splitOnIter, splitOnLast, nchains, zInitialize, yInitialize, zInit, yInit, logfile, verbose) { # add metadata S4Vectors::metadata(sce)[["celda_parameters"]] <- list( model = "celda_CG", xClass = xClass, useAssay = useAssay, sampleLevels = celdaCGMod@names$sample, K = celdaCGMod@params$K, L = celdaCGMod@params$L, alpha = celdaCGMod@params$alpha, beta = celdaCGMod@params$beta, delta = celdaCGMod@params$delta, gamma = celdaCGMod@params$gamma, algorithm = algorithm, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, seed = celdaCGMod@params$seed, nchains = nchains, zInitialize = zInitialize, yInitialize = yInitialize, countChecksum = celdaCGMod@params$countChecksum, zInit = zInit, yInit = yInit, logfile = logfile, verbose = verbose, completeLogLik = celdaCGMod@completeLogLik, finalLogLik = celdaCGMod@finalLogLik, cellClusterLevels = sort(unique(celdaClusters(celdaCGMod)$z)), featureModuleLevels = sort(unique(celdaClusters(celdaCGMod)$y))) SummarizedExperiment::rowData(sce)["rownames"] <- celdaCGMod@names$row SummarizedExperiment::colData(sce)["colnames"] <- celdaCGMod@names$column SummarizedExperiment::colData(sce)["celda_sample_label"] <- as.factor(celdaCGMod@sampleLabel) SummarizedExperiment::colData(sce)["celda_cell_cluster"] <- as.factor(celdaClusters(celdaCGMod)$z) SummarizedExperiment::rowData(sce)["celda_feature_module"] <- as.factor(celdaClusters(celdaCGMod)$y) return(sce) } ================================================ FILE: R/celda_G.R ================================================ #' @title Feature clustering with Celda #' @description Clusters the rows of a count matrix containing single-cell data #' into L modules. The #' \code{useAssay} \link{assay} slot in #' \code{altExpName} \link{altExp} slot will be used if #' it exists. Otherwise, the \code{useAssay} #' \link{assay} slot in \code{x} will be used if #' \code{x} is a \linkS4class{SingleCellExperiment} object. #' @param x A \linkS4class{SingleCellExperiment} #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. Alternatively, #' any matrix-like object that can be coerced to a sparse matrix of class #' "dgCMatrix" can be directly used as input. The matrix will automatically be #' converted to a \linkS4class{SingleCellExperiment} object. #' @param useAssay A string specifying the name of the #' \link{assay} slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param L Integer. Number of feature modules. #' @param beta Numeric. Concentration parameter for Phi. Adds a pseudocount to #' each feature module in each cell. Default 1. #' @param delta Numeric. Concentration parameter for Psi. Adds a pseudocount to #' each feature in each module. Default 1. #' @param gamma Numeric. Concentration parameter for Eta. Adds a pseudocount to #' the number of features in each module. Default 1. #' @param stopIter Integer. Number of iterations without improvement in the #' log likelihood to stop inference. Default 10. #' @param maxIter Integer. Maximum number of iterations of Gibbs sampling to #' perform. Default 200. #' @param splitOnIter Integer. On every `splitOnIter` iteration, a heuristic #' will be applied to determine if a feature module should be reassigned and #' another feature module should be split into two clusters. To disable #' splitting, set to -1. Default 10. #' @param splitOnLast Integer. After `stopIter` iterations have been #' performed without improvement, a heuristic will be applied to determine if #' a cell population should be reassigned and another cell population should be #' split into two clusters. If a split occurs, then `stopIter` will be reset. #' Default TRUE. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @param nchains Integer. Number of random cluster initializations. Default 3. #' @param yInitialize Chararacter. One of 'random', 'split', or 'predefined'. #' With 'random', features are randomly assigned to a modules. With 'split', #' features will be split into sqrt(L) modules and then each module will be #' subsequently split into another sqrt(L) modules. With 'predefined', values #' in `yInit` will be used to initialize `y`. Default 'split'. #' @param yInit Integer vector. Sets initial starting values of y. #' `yInit` can only be used when `yInitialize = 'predefined'`. Default NULL. #' @param countChecksum Character. An MD5 checksum for the `counts` matrix. #' Default NULL. #' @param logfile Character. Messages will be redirected to a file named #' \code{logfile}. If NULL, messages will be printed to stdout. Default NULL. #' @param verbose Logical. Whether to print log messages. Default TRUE. #' @return A \linkS4class{SingleCellExperiment} object. Function #' parameter settings are stored in the \link{metadata} #' \code{"celda_parameters"} slot. Column \code{celda_feature_module} in #' \link{rowData} contains feature modules. #' @seealso \link{celda_C} for cell clustering and \link{celda_CG} for #' simultaneous clustering of features and cells. \link{celdaGridSearch} can #' be used to run multiple values of L and multiple chains in parallel. #' @examples #' data(celdaGSim) #' sce <- celda_G(celdaGSim$counts, L = celdaGSim$L, nchains = 1) #' @export setGeneric("celda_G", function(x, useAssay = "counts", altExpName = "featureSubset", L, beta = 1, delta = 1, gamma = 1, stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, yInitialize = c("split", "random", "predefined"), countChecksum = NULL, yInit = NULL, logfile = NULL, verbose = TRUE) { standardGeneric("celda_G")}) #' @rdname celda_G #' @export setMethod("celda_G", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", L, beta = 1, delta = 1, gamma = 1, stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, yInitialize = c("split", "random", "predefined"), countChecksum = NULL, yInit = NULL, logfile = NULL, verbose = TRUE) { xClass <- "SingleCellExperiment" if (!altExpName %in% SingleCellExperiment::altExpNames(x)) { stop(altExpName, " not in 'altExpNames(x)'. Run ", "selectFeatures(x) first!") } altExp <- SingleCellExperiment::altExp(x, altExpName) if (!useAssay %in% SummarizedExperiment::assayNames(altExp)) { stop(useAssay, " not in assayNames(altExp(x, altExpName))") } counts <- SummarizedExperiment::assay(altExp, i = useAssay) altExp <- .celdaGWithSeed(counts = counts, xClass = xClass, useAssay = useAssay, sce = altExp, L = L, beta = beta, delta = delta, gamma = gamma, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, seed = seed, nchains = nchains, yInitialize = match.arg(yInitialize), countChecksum = countChecksum, yInit = yInit, logfile = logfile, verbose = verbose) SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) } ) #' @rdname celda_G #' @export setMethod("celda_G", signature(x = "ANY"), function(x, useAssay = "counts", altExpName = "featureSubset", L, beta = 1, delta = 1, gamma = 1, stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, yInitialize = c("split", "random", "predefined"), countChecksum = NULL, yInit = NULL, logfile = NULL, verbose = TRUE) { # Convert to sparse matrix x <- methods::as(x, "CsparseMatrix") ls <- list() ls[[useAssay]] <- x sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) SingleCellExperiment::altExp(sce, altExpName) <- sce xClass <- "matrix" altExp <- .celdaGWithSeed(counts = x, xClass = xClass, useAssay = useAssay, sce = SingleCellExperiment::altExp(sce, altExpName), L = L, beta = beta, delta = delta, gamma = gamma, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, seed = seed, nchains = nchains, yInitialize = match.arg(yInitialize), countChecksum = countChecksum, yInit = yInit, logfile = logfile, verbose = verbose) SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } ) .celdaGWithSeed <- function(counts, xClass, useAssay, sce, L, beta, delta, gamma, stopIter, maxIter, splitOnIter, splitOnLast, seed, nchains, yInitialize, countChecksum, yInit, logfile, verbose) { .validateCounts(counts) if (is.null(seed)) { celdaGMod <- .celda_G(counts = counts, L = L, beta = beta, delta = delta, gamma = gamma, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, nchains = nchains, yInitialize = yInitialize, countChecksum = countChecksum, yInit = yInit, logfile = logfile, verbose = verbose, reorder = TRUE) } else { with_seed( seed, celdaGMod <- .celda_G(counts = counts, L = L, beta = beta, delta = delta, gamma = gamma, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, nchains = nchains, yInitialize = yInitialize, countChecksum = countChecksum, yInit = yInit, logfile = logfile, verbose = verbose, reorder = TRUE) ) } sce <- .createSCEceldaG(celdaGMod = celdaGMod, sce = sce, xClass = xClass, useAssay = useAssay, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, nchains = nchains, yInitialize = yInitialize, yInit = yInit, logfile = logfile, verbose = verbose) return(sce) } #' @importFrom Matrix colSums .celda_G <- function(counts, L, beta = 1, delta = 1, gamma = 1, stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, nchains = 3, yInitialize = c("split", "random", "predefined"), countChecksum = NULL, yInit = NULL, logfile = NULL, verbose = TRUE, reorder = TRUE) { .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = FALSE, verbose = verbose ) .logMessages("Starting Celda_G: Clustering genes.", logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) start.time <- Sys.time() ## Error checking and variable processing counts <- .processCounts(counts) if (is.null(countChecksum)) { countChecksum <- .createCountChecksum(counts) } yInitialize <- match.arg(yInitialize) allChains <- seq(nchains) # Pre-compute lgamma values cs <- colSums(counts) lgbeta <- lgamma(seq(0, max(cs)) + beta) lggamma <- lgamma(seq(0, nrow(counts) + L) + gamma) lgdelta <- c(NA, lgamma((seq(nrow(counts) + L) * delta))) bestResult <- NULL for (i in allChains) { ## Randomly select y or y to supplied initial values ## Initialize cluster labels .logMessages(date(), ".. Initializing 'y' in chain", i, "with", paste0("'", yInitialize, "' "), logfile = logfile, append = TRUE, verbose = verbose ) if (yInitialize == "predefined") { if (is.null(yInit)) { stop("'yInit' needs to specified when initilize.y == 'given'.") } y <- .initializeCluster(L, nrow(counts), initial = yInit, fixed = NULL ) } else if (yInitialize == "split") { y <- .initializeSplitY(counts, L, beta = beta, delta = delta, gamma = gamma ) } else { y <- .initializeCluster(L, nrow(counts), initial = NULL, fixed = NULL ) } yBest <- y ## Calculate counts one time up front p <- .cGDecomposeCounts(counts = counts, y = y, L = L) nTSByC <- p$nTSByC nByG <- p$nByG nByTS <- p$nByTS nGByTS <- p$nGByTS nM <- p$nM nG <- p$nG rm(p) ## Calculate initial log likelihood ll <- .cGCalcLL( nTSByC = nTSByC, nByTS = nByTS, nByG = nByG, nGByTS = nGByTS, nM = nM, nG = nG, L = L, beta = beta, delta = delta, gamma = gamma ) iter <- 1L numIterWithoutImprovement <- 0L doGeneSplit <- TRUE while (iter <= maxIter & numIterWithoutImprovement <= stopIter) { nextY <- .cGCalcGibbsProbY( counts = counts, nTSByC = nTSByC, nByTS = nByTS, nGByTS = nGByTS, nByG = nByG, y = y, nG = nG, L = L, beta = beta, delta = delta, gamma = gamma, lgbeta = lgbeta, lggamma = lggamma, lgdelta = lgdelta ) nTSByC <- nextY$nTSByC nGByTS <- nextY$nGByTS nByTS <- nextY$nByTS y <- nextY$y ## Perform split on i-th iteration of no improvement in log ## likelihood tempLl <- .cGCalcLL( nTSByC = nTSByC, nByTS = nByTS, nByG = nByG, nGByTS = nGByTS, nM = nM, nG = nG, L = L, beta = beta, delta = delta, gamma = gamma ) if (L > 2 & iter != maxIter & ((((numIterWithoutImprovement == stopIter & !all(tempLl >= ll))) & isTRUE(splitOnLast)) | (splitOnIter > 0 & iter %% splitOnIter == 0 & isTRUE(doGeneSplit)))) { .logMessages(date(), " .... Determining if any gene clusters should be split.", logfile = logfile, append = TRUE, sep = "", verbose = verbose ) res <- .cGSplitY(counts, y, nTSByC, nByTS, nByG, nGByTS, nM, nG, L, beta, delta, gamma, yProb = t(nextY$probs), minFeature = 3, maxClustersToTry = max(L / 2, 10) ) .logMessages(res$message, logfile = logfile, append = TRUE, verbose = verbose ) # Reset convergence counter if a split occured if (!isTRUE(all.equal(y, res$y))) { numIterWithoutImprovement <- 1L doGeneSplit <- TRUE } else { doGeneSplit <- FALSE } ## Re-calculate variables y <- res$y nTSByC <- res$nTSByC nByTS <- res$nByTS nGByTS <- res$nGByTS } ## Calculate complete likelihood tempLl <- .cGCalcLL( nTSByC = nTSByC, nByTS = nByTS, nByG = nByG, nGByTS = nGByTS, nM = nM, nG = nG, L = L, beta = beta, delta = delta, gamma = gamma ) if ((all(tempLl > ll)) | iter == 1) { yBest <- y llBest <- tempLl numIterWithoutImprovement <- 1L } else { numIterWithoutImprovement <- numIterWithoutImprovement + 1L } ll <- c(ll, tempLl) .logMessages(date(), ".... Completed iteration:", iter, "| logLik:", tempLl, logfile = logfile, append = TRUE, verbose = verbose ) iter <- iter + 1 } names <- list(row = rownames(counts), column = colnames(counts)) result <- list( y = yBest, completeLogLik = ll, finalLogLik = llBest, L = L, beta = beta, delta = delta, gamma = gamma, countChecksum = countChecksum, names = names ) if (is.null(bestResult) || result$finalLogLik > bestResult$finalLogLik) { bestResult <- result } .logMessages(date(), ".. Finished chain", i, logfile = logfile, append = TRUE, verbose = verbose ) } bestResult <- methods::new("celda_G", clusters = list(y = yBest), params = list( L = as.integer(L), beta = beta, delta = delta, gamma = gamma, countChecksum = countChecksum ), completeLogLik = ll, finalLogLik = llBest, names = names ) if (isTRUE(reorder)) { bestResult <- .reorderCeldaG(counts = counts, res = bestResult) } endTime <- Sys.time() .logMessages(paste0(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages("Completed Celda_G. Total time:", format(difftime(endTime, start.time)), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(paste0(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) return(bestResult) } # Calculate Log Likelihood For Single Set of Cluster Assignments # (Gene Clustering) # This function calculates the log-likelihood of a given set of cluster # assigments for the samples # represented in the provided count matrix. # @param nTSByC Number of counts in each Transcriptional State per Cell. # @param nByTS Number of counts per Transcriptional State. # @param nGByTS Number of genes in each Transcriptional State. # @param nG.in.Y Number of genes in each of the cell cluster. # @param gamma Numeric. Concentration parameter for Eta. Adds a pseudocount to # the number of features in each module. Default 1. # @param delta Numeric. Concentration parameter for Psi. Adds a pseudocount to # each feature in each module. Default 1. # @param beta Numeric. Concentration parameter for Phi. Adds a pseudocount to # each feature module in each cell. Default 1. # @keywords log likelihood .cGCalcGibbsProbY <- function(counts, nTSByC, nByTS, nGByTS, nByG, y, L, nG, beta, delta, gamma, lgbeta, lggamma, lgdelta, doSample = TRUE) { ## Set variables up front outside of loop probs <- matrix(NA, ncol = nG, nrow = L) ix <- sample(seq(nG)) for (i in ix) { probs[, i] <- cG_CalcGibbsProbY(index = i, counts = as.numeric(counts[i, ]), nTSbyC = nTSByC, nbyTS = nByTS, nGbyTS = nGByTS, nbyG = nByG, y = y, L = L, nG = nG, lg_beta = lgbeta, lg_gamma = lggamma, lg_delta = lgdelta, delta = delta ) ## Sample next state and add back counts if (isTRUE(doSample)) { prevY <- y[i] y[i] <- .sampleLl(probs[, i]) if (prevY != y[i]) { nTSByC[prevY, ] <- nTSByC[prevY, ] - counts[i, ] nGByTS[prevY] <- nGByTS[prevY] - 1L nByTS[prevY] <- nByTS[prevY] - nByG[i] nTSByC[y[i], ] <- nTSByC[y[i], ] + counts[i, ] nGByTS[y[i]] <- nGByTS[y[i]] + 1L nByTS[y[i]] <- nByTS[y[i]] + nByG[i] } } } return(list( nTSByC = nTSByC, nGByTS = nGByTS, nByTS = nByTS, y = y, probs = probs )) } # Calculate log-likelihood of celda_CG model .cGCalcLL <- function(nTSByC, nByTS, nByG, nGByTS, nM, nG, L, beta, delta, gamma) { nG <- sum(nGByTS) ## Calculate for "Phi" component a <- nM * lgamma(L * beta) b <- sum(lgamma(nTSByC + beta)) c <- -nM * L * lgamma(beta) d <- -sum(lgamma(colSums(nTSByC + beta))) phiLl <- a + b + c + d ## Calculate for "Psi" component a <- sum(lgamma(nGByTS * delta)) b <- sum(lgamma(nByG + delta)) c <- -nG * lgamma(delta) d <- -sum(lgamma(nByTS + (nGByTS * delta))) psiLl <- a + b + c + d ## Calculate for "Eta" component a <- lgamma(L * gamma) b <- sum(lgamma(nGByTS + gamma)) c <- -L * lgamma(gamma) d <- -sum(lgamma(sum(nGByTS + gamma))) etaLl <- a + b + c + d final <- phiLl + psiLl + etaLl return(final) } # Takes raw counts matrix and converts it to a series of matrices needed for # log likelihood calculation # @param counts Integer matrix. Rows represent features and columns represent # cells. # @param y Numeric vector. Denotes feature module labels. # @param L Integer. Number of feature modules. #' @importFrom Matrix rowSums .cGDecomposeCounts <- function(counts, y, L) { if (any(y > L)) { stop("Assigned value of feature module greater than the total number", " of feature modules!") } nTSByC <- .rowSumByGroup(counts, group = y, L = L) nByG <- rowSums(counts) nByTS <- .rowSumByGroup(matrix(nByG, ncol = 1), group = y, L = L) nGByTS <- tabulate(y, L) + 1 ## Add pseudogene to each state nM <- ncol(counts) nG <- nrow(counts) return(list( nTSByC = nTSByC, nByG = nByG, nByTS = nByTS, nGByTS = nGByTS, nM = nM, nG = nG )) } .cGReDecomposeCounts <- function(counts, y, previousY, nTSByC, nByG, L) { ## Recalculate counts based on new label nTSByC <- .rowSumByGroupChange(counts, nTSByC, y, previousY, L) nByTS <- .rowSumByGroup(matrix(nByG, ncol = 1), group = y, L = L) nGByTS <- tabulate(y, L) + 1 return(list( nTSByC = nTSByC, nByTS = nByTS, nGByTS = nGByTS )) } .prepareCountsForDimReductionCeldaG <- function(sce, useAssay, maxCells, minClusterSize, modules, normalize, scaleFactor, transformationFun) { counts <- SummarizedExperiment::assay(sce, i = useAssay) counts <- .processCounts(counts) y <- as.integer(SummarizedExperiment::rowData(sce)$celda_feature_module) L <- S4Vectors::metadata(sce)$celda_parameters$L beta <- S4Vectors::metadata(sce)$celda_parameters$beta delta <- S4Vectors::metadata(sce)$celda_parameters$delta gamma <- S4Vectors::metadata(sce)$celda_parameters$gamma cNames <- colnames(sce) rNames <- rownames(sce) if (is.null(maxCells) || maxCells > ncol(counts)) { maxCells <- ncol(counts) cellIx <- seq_len(ncol(counts)) } else { cellIx <- sample(seq(ncol(counts)), maxCells) } fm <- .factorizeMatrixG( counts = counts, y = y, L = L, beta = beta, delta = delta, gamma = gamma, cNames = cNames, rNames = rNames, type = "counts") modulesToUse <- seq(nrow(fm$counts$cell)) if (!is.null(modules)) { if (!all(modules %in% modulesToUse)) { stop( "'modules' must be a vector of numbers between 1 and ", modulesToUse, "." ) } modulesToUse <- modules } norm <- t(normalizeCounts(fm$counts$cell[modulesToUse, cellIx], normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun)) return(list(norm = norm, cellIx = cellIx)) } .createSCEceldaG <- function(celdaGMod, sce, xClass, useAssay, stopIter, maxIter, splitOnIter, splitOnLast, nchains, yInitialize, yInit, logfile, verbose) { # add metadata S4Vectors::metadata(sce)[["celda_parameters"]] <- list( model = "celda_G", xClass = xClass, useAssay = useAssay, L = celdaGMod@params$L, beta = celdaGMod@params$beta, delta = celdaGMod@params$delta, gamma = celdaGMod@params$gamma, stopIter = stopIter, maxIter = maxIter, splitOnIter = splitOnIter, splitOnLast = splitOnLast, seed = celdaGMod@params$seed, nchains = nchains, yInitialize = yInitialize, countChecksum = celdaGMod@params$countChecksum, yInit = yInit, logfile = logfile, verbose = verbose, completeLogLik = celdaGMod@completeLogLik, finalLogLik = celdaGMod@finalLogLik, featureModuleLevels = sort(unique(celdaClusters(celdaGMod)$y))) SummarizedExperiment::rowData(sce)["rownames"] <- celdaGMod@names$row SummarizedExperiment::colData(sce)["colnames"] <- celdaGMod@names$column SummarizedExperiment::rowData(sce)["celda_feature_module"] <- as.factor(celdaClusters(celdaGMod)$y) return(sce) } ================================================ FILE: R/celda_functions.R ================================================ .sampleLl <- function(llProbs) { probsSub <- exp(llProbs - max(llProbs)) probsNorm <- probsSub / sum(probsSub) probsSelect <- sample.int( length(probsNorm), size = 1L, replace = TRUE, prob = probsNorm ) return(probsSelect) } .cosineDist <- function(x) { x <- t(x) y <- (1 - .cosine(x)) / 2 return(stats::as.dist(y)) } .cosine <- function(x) { y <- x %*% t(x) / (sqrt(rowSums(x^2) %*% t(rowSums(x^2)))) return(y) } .spearmanDist <- function(x) { y <- (1 - stats::cor(x, method = "spearman")) / 2 return(stats::as.dist(y)) } .hellingerDist <- function(x) { y <- stats::dist(t(sqrt(x)), method = "euclidean") * 1 / sqrt(2) return(y) } .normalizeLogProbs <- function(llProbs) { llProbs <- exp(sweep(llProbs, 1, base::apply(llProbs, 1, max), "-")) probs <- sweep(llProbs, 1, rowSums(llProbs), "/") return(probs) } #' @title Normalization of count data #' @description Performs normalization, transformation, and/or scaling of a #' counts matrix #' @param counts Integer, Numeric or Sparse matrix. Rows represent features #' and columns represent cells. #' @param normalize Character. #' Divides counts by the library sizes for each cell. One of 'proportion', #' 'cpm', 'median', or 'mean'. 'proportion' uses the total counts for each #' cell as the library size. 'cpm' divides the library size of each cell by #' one million to produce counts per million. 'median' divides the library #' size of each cell by the median library size across all cells. 'mean' #' divides the library size of each cell by the mean library size across all #' cells. #' @param scaleFactor Numeric. Sets the scale factor for cell-level #' normalization. This scale factor is multiplied to each cell after the #' library size of each cell had been adjusted in \code{normalize}. Default #' \code{NULL} which means no scale factor is applied. #' @param transformationFun Function. Applys a transformation such as #' \link{sqrt}, \link{log}, \link{log2}, \link{log10}, or \link{log1p}. #' If NULL, no transformation will be applied. Occurs after normalization. #' Default NULL. #' @param scaleFun Function. Scales the rows of the normalized and transformed #' count matrix. For example, 'scale' can be used to z-score normalize the #' rows. Default NULL. #' @param pseudocountNormalize Numeric. Add a pseudocount to counts before #' normalization. Default 0. #' @param pseudocountTransform Numeric. Add a pseudocount to normalized counts #' before applying the transformation function. Adding a pseudocount #' can be useful before applying a log transformation. Default 0. #' @return Numeric Matrix. A normalized matrix. #' @examples #' data(celdaCGSim) #' normalizedCounts <- normalizeCounts(celdaCGSim$counts, "proportion", #' pseudocountNormalize = 1) #' @importFrom Matrix colSums #' @export normalizeCounts <- function(counts, normalize = c("proportion", "cpm", "median", "mean"), scaleFactor = NULL, transformationFun = NULL, scaleFun = NULL, pseudocountNormalize = 0, pseudocountTransform = 0) { normalize <- match.arg(normalize) if (!is.null(transformationFun) && !is.function(transformationFun)) { stop("'transformationFun' needs to be of class 'function'") } if (!is.null(scaleFun) && !is.function(scaleFun)) { stop("'scaleFun' needs to be of class 'function'") } # Perform normalization if (normalize == "proportion" & inherits(counts, "matrix")) { norm <- fastNormProp(counts, pseudocountNormalize) } else { counts <- counts + pseudocountNormalize cs <- colSums(counts) norm <- switch( normalize, "proportion" = sweep(counts, 2, cs, "/"), "cpm" = sweep(counts, 2, cs / 1e6, "/"), "median" = sweep(counts, 2, cs / stats::median(cs), "/"), "mean" = sweep(counts, 2, cs / mean(cs), "/") ) } if (!is.null(scaleFactor)) { norm <- norm * scaleFactor } if (!is.null(transformationFun)) { norm <- do.call( transformationFun, list(norm + pseudocountTransform) ) } if (!is.null(scaleFun)) { norm <- t(base::apply(norm, 1, scaleFun)) } colnames(norm) <- colnames(counts) rownames(norm) <- rownames(counts) return(norm) } #' @title Recode cell cluster labels #' @description Recode cell subpopulaton clusters using a mapping in the #' \code{from} and \code{to} arguments. #' @param sce \linkS4class{SingleCellExperiment} object returned from #' \link{celda_C} or \link{celda_CG}. Must contain column #' \code{celda_cell_cluster} in #' \code{\link{colData}(altExp(sce, altExpName))}. #' @param from Numeric vector. Unique values in the range of #' \code{seq(max(as.integer(celdaClusters(sce, altExpName = altExpName))))} #' that correspond to the original cluster #' labels in \code{sce}. #' @param to Numeric vector. Unique values in the range of #' \code{seq(max(as.integer(celdaClusters(sce, altExpName = altExpName))))} #' that correspond to the new cluster labels. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @return \linkS4class{SingleCellExperiment} object with recoded cell #' cluster labels. #' @examples #' data(sceCeldaCG) #' sceReorderedZ <- recodeClusterZ(sceCeldaCG, c(1, 3), c(3, 1)) #' @importFrom plyr mapvalues #' @export recodeClusterZ <- function(sce, from, to, altExpName = "featureSubset") { if (length(setdiff(from, to)) != 0) { stop("All values in 'from' must have a mapping in 'to'") } if (is.null(celdaClusters(sce, altExpName = altExpName))) { stop("Provided 'sce' argument does not have a 'celda_cell_cluster'", " column in 'colData(altExp(sce, altExpName))'") } z <- as.integer(celdaClusters(sce, altExpName = altExpName)) new.clusters <- plyr::mapvalues(z, from = from, to = to) new.clusters <- factor(new.clusters, levels = sort(as.numeric(unique(new.clusters)))) celdaClusters(sce, altExpName = altExpName) <- new.clusters return(sce) } # for deprecated celda model objects .recodeClusterZ <- function(celdaMod, from, to) { if (length(setdiff(from, to)) != 0) { stop("All values in 'from' must have a mapping in 'to'") } if (is.null(celdaClusters(celdaMod)$z)) { stop("Provided celdaMod argument does not have a z attribute") } z <- as.integer(celdaClusters(celdaMod)$z) new.clusters <- plyr::mapvalues(z, from = from, to = to) new.clusters <- factor(new.clusters, levels = sort(as.numeric(unique(new.clusters)))) celdaMod@clusters$z <- new.clusters return(celdaMod) } #' @title Recode feature module labels #' @description Recode feature module clusters using a mapping in the #' \code{from} and \code{to} arguments. #' @param sce \linkS4class{SingleCellExperiment} object returned from #' \link{celda_G} or \link{celda_CG}. Must contain column #' \code{celda_feature_module} in #' \code{\link{rowData}(altExp(sce, altExpName))}. #' @param from Numeric vector. Unique values in the range of #' \code{seq(celdaModules(sce))} that correspond to the original module labels #' in \code{sce}. #' @param to Numeric vector. Unique values in the range of #' \code{seq(celdaModules(sce))} that correspond to the new module labels. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @return @return \linkS4class{SingleCellExperiment} object with recoded #' feature module labels. #' @examples #' data(sceCeldaCG) #' sceReorderedY <- recodeClusterY(sceCeldaCG, c(1, 3), c(3, 1)) #' @export recodeClusterY <- function(sce, from, to, altExpName = "featureSubset") { if (length(setdiff(from, to)) != 0) { stop("All values in 'from' must have a mapping in 'to'") } if (is.null(celdaModules(sce, altExpName = altExpName))) { stop("Provided 'sce' argument does not have a 'celda_feature_module'", " column in 'rowData(altExp(sce, altExpName))'") } y <- as.integer(celdaModules(sce, altExpName = altExpName)) new.clusters <- plyr::mapvalues(y, from = from, to = to) new.clusters <- factor(new.clusters, levels = sort(as.numeric(unique(new.clusters)))) celdaModules(sce, altExpName = altExpName) <- new.clusters return(sce) } # for deprecated celda model objects .recodeClusterY <- function(celdaMod, from, to) { if (length(setdiff(from, to)) != 0) { stop("All values in 'from' must have a mapping in 'to'") } if (is.null(celdaClusters(celdaMod)$y)) { stop("Provided celdaMod argument does not have a y attribute") } y <- as.integer(celdaClusters(celdaMod)$y) new.clusters <- plyr::mapvalues(y, from = from, to = to) new.clusters <- factor(new.clusters, levels = sort(as.numeric(unique(new.clusters)))) celdaMod@clusters$y <- new.clusters return(celdaMod) } #' @title Check count matrix consistency #' @description Checks if the counts matrix is the same one used to generate #' the celda model object by comparing dimensions and MD5 checksum. #' @param counts Integer , Numeric, or Sparse matrix. Rows represent features #' and columns represent cells. #' @param celdaMod A \code{celdaModel} or \code{celdaList} object. #' @param errorOnMismatch Logical. Whether to throw an error in the event of #' a mismatch. Default TRUE. #' @return Returns TRUE if provided count matrix matches the one used in the #' celda object and/or \code{errorOnMismatch = FALSE}, FALSE otherwise. #' @export setGeneric("compareCountMatrix", function(counts, celdaMod, errorOnMismatch = TRUE) { standardGeneric("compareCountMatrix")}) #' @rdname compareCountMatrix #' @examples #' data(celdaCGSim, celdaCGMod) #' compareCountMatrix(celdaCGSim$counts, celdaCGMod, errorOnMismatch = FALSE) #' @export setMethod("compareCountMatrix", signature(celdaMod = "celdaModel"), function(counts, celdaMod, errorOnMismatch = TRUE) { if ("y" %in% names(celdaClusters(celdaMod))) { if (nrow(counts) != length(celdaClusters(celdaMod)$y)) { stop( "The provided celda object was generated from a counts", " matrix with a different number of features than the one", " provided." ) } } if ("z" %in% names(celdaClusters(celdaMod))) { if (ncol(counts) != length(celdaClusters(celdaMod)$z)) { stop( "The provided celda object was generated from a counts", " matrix with a different number of cells than the one", " provided." ) } } celdaChecksum <- params(celdaMod)$countChecksum counts <- .processCounts(counts) # Checksums are generated in celdaGridSearch and model after processing count.md5 <- .createCountChecksum(counts) res <- isTRUE(count.md5 == celdaChecksum) if (res) { return(TRUE) } if (!res && errorOnMismatch) { stop( "There was a mismatch between the provided count matrix and", " the count matrix used to generate the provided celda result." ) } else if (!res && !errorOnMismatch) { warning("There was a mismatch between the provided count matrix", " and the count matrix used to generate the provided celda", " result.") return(FALSE) } } ) #' @rdname compareCountMatrix #' @examples #' data(celdaCGSim, celdaCGGridSearchRes) #' compareCountMatrix(celdaCGSim$counts, celdaCGGridSearchRes, #' errorOnMismatch = FALSE) #' @export setMethod("compareCountMatrix", signature(celdaMod = "celdaList"), function(counts, celdaMod, errorOnMismatch = TRUE) { if ("y" %in% names(celdaMod@resList[[1]]@clusters)) { if (nrow(counts) != length(celdaMod@resList[[1]]@clusters$y)) { stop( "The provided celda object was generated from a counts", " matrix with a different number of features than the one", " provided." ) } } if ("z" %in% names(celdaMod@resList[[1]]@clusters)) { if (ncol(counts) != length(celdaMod@resList[[1]]@clusters$z)) { stop( "The provided celda object was generated from a counts", " matrix with a different number of cells than the one", " provided." ) } } celdaChecksum <- celdaMod@countChecksum counts <- .processCounts(counts) # Checksums are generated in celdaGridSearch and model after processing count.md5 <- .createCountChecksum(counts) res <- isTRUE(count.md5 == celdaChecksum) if (res) { return(TRUE) } if (!res && errorOnMismatch) { stop( "There was a mismatch between the provided count matrix and", " the count matrix used to generate the provided celda result." ) } else if (!res && !errorOnMismatch) { warning("There was a mismatch between the provided count matrix", " and the count matrix used to generate the provided celda", " result.") return(FALSE) } } ) .logMessages <- function(..., sep = " ", logfile = NULL, append = FALSE, verbose = TRUE) { if (isTRUE(verbose)) { if (!is.null(logfile)) { if (!is.character(logfile) || length(logfile) > 1) { stop( "The log file parameter needs to be a single character", " string." ) } cat(paste(..., "\n", sep = sep), file = logfile, append = append ) } else { message(paste(..., sep = sep)) } } } #' @title Create a color palette #' @description Generate a palette of `n` distinct colors. #' @param n Integer. Number of colors to generate. #' @param hues Character vector. Colors available from `colors()`. These will #' be used as the base colors for the clustering scheme in HSV. Different #' saturations and values will be generated for each hue. Default c("red", #' "cyan", "orange", "blue", "yellow", "purple", "green", "magenta"). #' @param saturationRange Numeric vector. A vector of length 2 denoting the #' saturation for HSV. Values must be in [0,1]. Default: c(0.25, 1). #' @param valueRange Numeric vector. A vector of length 2 denoting the range #' of values for HSV. Values must be in [0,1]. Default: `c(0.5, 1)`. #' @return A vector of distinct colors that have been converted to HEX from HSV. #' @examples #' colorPal <- distinctColors(6) # can be used in plotting functions #' @importFrom grDevices colors #' @importFrom grDevices rgb2hsv #' @importFrom grDevices hsv #' @export distinctColors <- function(n, hues = c( "red", "cyan", "orange", "blue", "yellow", "purple", "green", "magenta" ), saturationRange = c(0.7, 1), valueRange = c(0.7, 1)) { if (!(all(hues %in% grDevices::colors()))) { stop( "Only color names listed in the 'color' function can be used in", " 'hues'" ) } # Convert R colors to RGB and then to HSV color format huesHsv <- grDevices::rgb2hsv(grDevices::col2rgb(hues)) # Calculate all combination of saturation/value pairs # Note that low saturation with low value (i.e. high darkness) is too dark # for all hues. Likewise, high saturation with high value (i.e. low # darkness) is hard to distinguish. Therefore, saturation and value are # set to be anticorrelated numVs <- ceiling(n / length(hues)) s <- seq( from = saturationRange[1], to = saturationRange[2], length = numVs ) v <- seq( from = valueRange[2], to = valueRange[1], length = numVs ) # Create all combination of hues with saturation/value pairs list <- lapply(seq(numVs), function(x) { rbind(huesHsv[1, ], s[x], v[x]) }) newHsv <- do.call(cbind, list) # Convert to hex col <- grDevices::hsv(newHsv[1, ], newHsv[2, ], newHsv[3, ]) return(col[seq(n)]) } .processCounts <- function(counts) { if (!((inherits(counts, "matrix") & (is.integer(counts) | is.numeric(counts))) | inherits(counts, "dgCMatrix"))) { stop("'counts' must be a sparse dgCMatrix ", "from the 'Matrix' package or a matrix containing integer or numeric ", "values.") } return(counts) } # Perform some simple checks on the counts matrix, to ensure celda modeling # expectations are met #' @importFrom Matrix rowSums colSums .validateCounts <- function(counts) { countRowSum <- rowSums(counts) countColSum <- colSums(counts) if (sum(countRowSum == 0) > 0 | sum(countColSum == 0) > 0) { stop( "Each row and column of the count matrix must have at least", " one count" ) } } # Wrapper function, creates checksum for matrix. # Feature names, cell names are not taken into account. #' @importFrom digest digest .createCountChecksum <- function(counts) { rownames(counts) <- NULL colnames(counts) <- NULL countChecksum <- digest::digest(counts, algo = "md5") return(countChecksum) } # Generate n random deviates from the Dirichlet function with shape parameters # alpha. Adapted from gtools v3.5 .rdirichlet <- function(n, alpha) { l <- length(alpha) x <- matrix(stats::rgamma(l * n, alpha), ncol = l, byrow = TRUE ) # Check for case where all sampled entries are zero due to round off # One entry will be randomly chosen to be one isZero <- rowSums(x) == 0 assignment <- sample(seq(l), size = sum(isZero), replace = TRUE) x[cbind(which(isZero), assignment)] <- 1 # Normalize sm <- x %*% rep(1, l) y <- x / as.vector(sm) return(y) } # Make sure provided sample labels are the right type, # or generate some if none were provided .processSampleLabels <- function(sampleLabel, numCells) { if (is.null(sampleLabel)) { sampleLabel <- as.factor(rep("Sample_1", numCells)) } else { if (length(sampleLabel) != numCells) { stop( "'sampleLabel' must be the same length as the number of", " columns in the 'counts' matrix." ) } } if (!is.factor(sampleLabel)) { sampleLabel <- as.factor(sampleLabel) } return(sampleLabel) } #' @title Output a feature module table #' @description Creates a table that contains the list of features in #' each feature module. #' @param sce A \linkS4class{SingleCellExperiment} object returned by #' \link{celda_G}, or \link{celda_CG}, with the matrix #' located in the \code{useAssay} assay slot. #' Rows represent features and columns represent cells. #' @param useAssay A string specifying which \link{assay} #' slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param displayName Character. The column name of #' \code{rowData(sce)} that specifies the display names for #' the features. Default \code{NULL}, which displays the row names. #' @param outputFile File name for feature module table. If NULL, file will #' not be created. Default NULL. #' @return Matrix. Contains a list of features per each column (feature module) #' @examples #' data(sceCeldaCG) #' featureModuleTable(sceCeldaCG) #' @export featureModuleTable <- function(sce, useAssay = "counts", altExpName = "featureSubset", displayName = NULL, outputFile = NULL) { factorizeMatrix <- factorizeMatrix(sce, useAssay = useAssay, altExpName = altExpName, type = "proportion") altExp <- SingleCellExperiment::altExp(sce, altExpName) allGenes <- topRank(factorizeMatrix$proportions$module, n = nrow(altExp)) maxlen <- max(vapply(allGenes$names, length, integer(1))) if (is.null(displayName)) { res <- vapply(allGenes$names, FUN = "[", FUN.VALUE = character(length = maxlen), seq(maxlen)) } else { dn <- lapply(allGenes$index, FUN = function(v) { SummarizedExperiment::rowData(altExp)[[displayName]][v] }) res <- vapply(dn, FUN = "[", FUN.VALUE = character(length = maxlen), seq(maxlen)) } res <- apply(res, c(1, 2), function(x) { if (is.na(x)) { return("") } else { return(x) } }) if (is.null(outputFile)) { return(res) } else { utils::write.table( res, file = outputFile, sep = "\t", row.names = FALSE, quote = FALSE ) } } #' @title Retrieve row index for a set of features #' @description This will return indices of features among the rownames #' or rowData of a data.frame, matrix, or a \linkS4class{SummarizedExperiment} #' object including a \linkS4class{SingleCellExperiment}. #' Partial matching (i.e. grepping) can be used by setting #' \code{exactMatch = FALSE}. #' @param features Character vector of feature names to find in the rows of #' \code{x}. #' @param x A data.frame, matrix, or \linkS4class{SingleCellExperiment} #' object to search. #' @param by Character. Where to search for features in \code{x}. If set to #' \code{"rownames"} then the features will be searched for among #' \code{rownames(x)}. If \code{x} inherits from class #' \linkS4class{SummarizedExperiment}, then \code{by} can be one of the #' fields in the row annotation data.frame (i.e. one of #' \code{colnames(rowData(x))}). #' @param exactMatch Boolean. Whether to only identify exact matches #' or to identify partial matches using \code{\link{grep}}. #' @param removeNA Boolean. If set to \code{FALSE}, features not found in #' \code{x} will be given \code{NA} and the returned vector will be the same #' length as \code{features}. If set to \code{TRUE}, then the \code{NA} #' values will be removed from the returned vector. Default \code{FALSE}. #' @return A vector of row indices for the matching features in \code{x}. #' @author Yusuke Koga, Joshua Campbell #' @seealso '\link[scater]{retrieveFeatureInfo}' from package \code{'scater'} #' and \code{link{regex}} for how to use regular expressions when #' \code{exactMatch = FALSE}. #' @examples #' data(celdaCGSim) #' retrieveFeatureIndex(c("Gene_1", "Gene_5"), celdaCGSim$counts) #' retrieveFeatureIndex(c("Gene_1", "Gene_5"), celdaCGSim$counts, #' exactMatch = FALSE) #' @export retrieveFeatureIndex <- function(features, x, by = "rownames", exactMatch = TRUE, removeNA = FALSE) { # Extract vector to search through if (by == "rownames") { if (is.null(rownames(x))) { stop("'rownames' of 'x' are 'NULL'. Please set 'rownames' or change", " 'by' to search a different column in 'x'.") } search <- rownames(x) } else if (length(ncol(x)) > 0) { if (inherits(x, "SummarizedExperiment")) { if (!(by %in% colnames(SummarizedExperiment::rowData(x)))) { stop("'by' is not a column in 'rowData(x)'.") } search <- SummarizedExperiment::rowData(x)[, by] } else { if (!(by %in% colnames(x))) { stop("'by' is not a column in 'x'.") } search <- x[, by] } } else { search <- as.character(x) } # Match each element of 'pattern' in vector 'search' if (!isTRUE(exactMatch)) { featuresIndices <- rep(NA, length(features)) for (i in seq_along(features)) { g <- grep(features[i], search) if (length(g) == 1) { featuresIndices[i] <- g } else if (length(g) > 1) { warning( "Feature '", features[i], "' matched multiple items in '", by, "': ", paste(search[g], collapse = ","), ". Only the first match will be selected." ) featuresIndices[i] <- g[1] } } } else { featuresIndices <- match(features, search) } if (sum(is.na(featuresIndices)) > 0) { if (sum(is.na(featuresIndices)) == length(features)) { if (isTRUE(exactMatch)) { stop( "None of the provided features had matching", " items in '", by, "' within 'x'. ", "Check the spelling or try setting", " 'exactMatch = FALSE'." ) } else { stop( "None of the provided features had matching", " items in '", by, "' within 'x'. ", "Check the spelling and make sure 'by' is set", " to the appropriate place in 'x'." ) } } warning( "The following features were not present in 'x': ", paste(features[which(is.na(featuresIndices))], collapse = "," ) ) } if (isTRUE(removeNA)) { featuresIndices <- featuresIndices[!is.na(featuresIndices)] } return(featuresIndices) } ================================================ FILE: R/celda_heatmap.R ================================================ #' @title Plot celda Heatmap #' @description Render a stylable heatmap of count data based on celda #' clustering results. #' @param sce A \link[SingleCellExperiment]{SingleCellExperiment} object #' returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}. #' @param useAssay A string specifying which \link{assay} #' slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param featureIx Integer vector. Select features for display in heatmap. If #' NULL, no subsetting will be performed. Default NULL. \strong{Only used for #' \code{sce} containing celda_C model result returned by \link{celda_C}.} #' @param nfeatures Integer. Maximum number of features to select for each #' gene module. Default 25. \strong{Only used for \code{sce} containing #' celda_CG or celda_G model results returned by \link{celda_CG} or #' \link{celda_G}.} #' @param ... Additional parameters passed to \link{plotHeatmap}. #' @seealso `celdaTsne()` for generating 2-dimensional tSNE coordinates #' @return list A list containing dendrogram information and the heatmap grob #' @export setGeneric("celdaHeatmap", function(sce, useAssay = "counts", altExpName = "featureSubset", featureIx = NULL, nfeatures = 25, ...) { standardGeneric("celdaHeatmap") }) #' @rdname celdaHeatmap #' @examples #' data(sceCeldaCG) #' celdaHeatmap(sceCeldaCG) #' @export setMethod("celdaHeatmap", signature(sce = "SingleCellExperiment"), function(sce, useAssay = "counts", altExpName = "featureSubset", featureIx = NULL, nfeatures = 25, ...) { counts <- SummarizedExperiment::assay(sce, i = useAssay) counts <- .processCounts(counts) model <- celdaModel(sce, altExpName = altExpName) if (model == "celda_C") { z <- as.integer(celdaClusters(sce, altExpName = altExpName)) g <- .celdaHeatmapCelda_C( counts = counts, z = z, featureIx = featureIx, ...) } else if (model == "celda_CG") { fm <- factorizeMatrix(x = sce, useAssay = useAssay, altExpName = altExpName, type = "proportion") z <- as.integer(celdaClusters(sce, altExpName = altExpName)) y <- as.integer(celdaModules(sce, altExpName = altExpName)) g <- .celdaHeatmapCelda_CG( counts = counts, fm = fm, z = z, y = y, nfeatures = nfeatures, ...) } else if (model == "celda_G") { fm <- factorizeMatrix(x = sce, useAssay = useAssay, altExpName = altExpName, type = "proportion") y <- as.integer(celdaModules(sce, altExpName = altExpName)) g <- .celdaHeatmapCelda_G(counts, fm, y, nfeatures = nfeatures, ...) } else { stop("S4Vectors::metadata(altExp(sce, altExpName))$", "celda_parameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'") } return(g) } ) .celdaHeatmapCelda_C <- function( counts, z, featureIx, ...) { norm <- normalizeCounts(counts, normalize = "proportion", transformationFun = sqrt) if (is.null(featureIx)) { return(plotHeatmap(norm, z, ...)) } return(plotHeatmap(norm[featureIx, ], z = z, ...)) } .celdaHeatmapCelda_CG <- function( counts = counts, fm = fm, z = z, y = y, nfeatures, ...) { top <- topRank(fm$proportions$module, n = nfeatures) ix <- unlist(top$index) rn <- unlist(top$names) norm <- normalizeCounts(counts, normalize = "proportion", transformationFun = sqrt) plt <- plotHeatmap(norm[rn, ], z = z, y = y[ix], ...) return(plt) } .celdaHeatmapCelda_G <- function(counts, fm, y, nfeatures, ...) { top <- topRank(fm$proportions$module, n = nfeatures) ix <- unlist(top$index) rn <- unlist(top$names) norm <- normalizeCounts(counts, normalize = "proportion", transformationFun = sqrt) plt <- plotHeatmap(norm[rn, ], y = y[ix], ...) return(plt) } ================================================ FILE: R/celdatSNE.R ================================================ #' @title t-Distributed Stochastic Neighbor Embedding (t-SNE) dimension #' reduction for celda \code{sce} object #' @description Embeds cells in two dimensions using \link[Rtsne]{Rtsne} based #' on a celda model. For celda_C \code{sce} objects, PCA on the normalized #' counts is used to reduce the number of features before applying t-SNE. For #' celda_CG and celda_G \code{sce} objects, tSNE is run on module #' probabilities to reduce the number of features instead of using PCA. #' Module probabilities are square-root transformed before applying tSNE. #' @param sce A \linkS4class{SingleCellExperiment} object #' returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}. #' @param useAssay A string specifying which \link{assay} #' slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param maxCells Integer. Maximum number of cells to plot. Cells will be #' randomly subsampled if \code{ncol(counts) > maxCells}. Larger numbers of #' cells requires more memory. If \code{NULL}, no subsampling will be #' performed. Default \code{NULL}. #' @param minClusterSize Integer. Do not subsample cell clusters below this #' threshold. Default 100. #' @param initialDims Integer. PCA will be used to reduce the dimensionality #' of the dataset. The top 'initialDims' principal components will be used #' for tSNE. Default 20. #' @param modules Integer vector. Determines which feature modules to use for #' tSNE. If \code{NULL}, all modules will be used. Default \code{NULL}. #' @param perplexity Numeric. Perplexity parameter for tSNE. Default 20. #' @param maxIter Integer. Maximum number of iterations in tSNE generation. #' Default 2500. #' @param normalize Character. Passed to \link{normalizeCounts} in #' normalization step. Divides counts by the library sizes for each #' cell. One of 'proportion', 'cpm', 'median', or 'mean'. 'proportion' uses #' the total counts for each cell as the library size. 'cpm' divides the #' library size of each cell by one million to produce counts per million. #' 'median' divides the library size of each cell by the median library size #' across all cells. 'mean' divides the library size of each cell by the mean #' library size across all cells. #' @param scaleFactor Numeric. Sets the scale factor for cell-level #' normalization. This scale factor is multiplied to each cell after the #' library size of each cell had been adjusted in \code{normalize}. Default #' \code{NULL} which means no scale factor is applied. #' @param transformationFun Function. Applys a transformation such as 'sqrt', #' 'log', 'log2', 'log10', or 'log1p'. If \code{NULL}, no transformation will #' be applied. Occurs after applying normalization and scale factor. Default #' \code{NULL}. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @return \code{sce} with t-SNE coordinates #' (columns "celda_tSNE1" & "celda_tSNE2") added to #' \code{\link{reducedDim}(sce, "celda_tSNE")}. #' @export setGeneric("celdaTsne", function(sce, useAssay = "counts", altExpName = "featureSubset", maxCells = NULL, minClusterSize = 100, initialDims = 20, modules = NULL, perplexity = 20, maxIter = 2500, normalize = "proportion", scaleFactor = NULL, transformationFun = sqrt, seed = 12345) { standardGeneric("celdaTsne") }) #' @rdname celdaTsne #' @examples #' data(sceCeldaCG) #' tsneRes <- celdaTsne(sceCeldaCG) #' @export setMethod("celdaTsne", signature(sce = "SingleCellExperiment"), function(sce, useAssay = "counts", altExpName = "featureSubset", maxCells = NULL, minClusterSize = 100, initialDims = 20, modules = NULL, perplexity = 20, maxIter = 2500, normalize = "proportion", scaleFactor = NULL, transformationFun = sqrt, seed = 12345) { if (is.null(seed)) { sce <- .celdaTsne(sce = sce, useAssay = useAssay, altExpName = altExpName, maxCells = maxCells, minClusterSize = minClusterSize, initialDims = initialDims, modules = modules, perplexity = perplexity, maxIter = maxIter, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun) } else { with_seed(seed, sce <- .celdaTsne(sce = sce, useAssay = useAssay, altExpName = altExpName, maxCells = maxCells, minClusterSize = minClusterSize, initialDims = initialDims, modules = modules, perplexity = perplexity, maxIter = maxIter, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun)) } return(sce) }) .celdaTsne <- function(sce, useAssay, altExpName, maxCells, minClusterSize, initialDims, modules, perplexity, maxIter, normalize, scaleFactor, transformationFun) { celdaMod <- celdaModel(sce, altExpName = altExpName) altExp <- SingleCellExperiment::altExp(sce, altExpName) if (celdaMod == "celda_C") { res <- .celdaTsneC(sce = altExp, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, initialDims = initialDims, perplexity = perplexity, maxIter = maxIter, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun) } else if (celdaMod == "celda_CG") { res <- .celdaTsneCG(sce = altExp, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, initialDims = initialDims, modules = modules, perplexity = perplexity, maxIter = maxIter, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun) } else if (celdaMod == "celda_G") { res <- .celdaTsneG(sce = altExp, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, initialDims = initialDims, modules = modules, perplexity = perplexity, maxIter = maxIter, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun) } else { stop("S4Vectors::metadata(altExp(sce, altExpName))$", "celda_parameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'") } SingleCellExperiment::reducedDim(altExp, "celda_tSNE") <- res SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } .celdaTsneC <- function(sce, useAssay, maxCells, minClusterSize, initialDims, perplexity, maxIter, normalize, scaleFactor, transformationFun) { preparedCountInfo <- .prepareCountsForDimReductionCeldaC(sce = sce, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun) res <- .calculateTsne(preparedCountInfo$norm, perplexity = perplexity, maxIter = maxIter, doPca = TRUE, initialDims = initialDims) final <- matrix(NA, nrow = ncol(sce), ncol = 2) final[preparedCountInfo$cellIx, ] <- res rownames(final) <- colnames(sce) colnames(final) <- c("celda_tSNE1", "celda_tSNE2") return(final) } .celdaTsneCG <- function(sce, useAssay, maxCells, minClusterSize, initialDims, modules, perplexity, maxIter, normalize, scaleFactor, transformationFun) { preparedCountInfo <- .prepareCountsForDimReductionCeldaCG(sce = sce, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, modules = modules, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun) norm <- preparedCountInfo$norm res <- .calculateTsne(norm, doPca = FALSE, perplexity = perplexity, maxIter = maxIter, initialDims = initialDims) final <- matrix(NA, nrow = ncol(sce), ncol = 2) final[preparedCountInfo$cellIx, ] <- res rownames(final) <- colnames(sce) colnames(final) <- c("celda_tSNE1", "celda_tSNE2") return(final) } .celdaTsneG <- function(sce, useAssay, maxCells, minClusterSize, initialDims, modules, perplexity, maxIter, normalize, scaleFactor, transformationFun) { preparedCountInfo <- .prepareCountsForDimReductionCeldaG(sce = sce, useAssay = useAssay, maxCells = maxCells, minClusterSize = minClusterSize, modules = modules, normalize = normalize, scaleFactor = scaleFactor, transformationFun = transformationFun) res <- .calculateTsne(preparedCountInfo$norm, perplexity = perplexity, maxIter = maxIter, doPca = FALSE, initialDims = initialDims) final <- matrix(NA, nrow = ncol(sce), ncol = 2) final[preparedCountInfo$cellIx, ] <- res rownames(final) <- colnames(sce) colnames(final) <- c("celda_tSNE1", "celda_tSNE2") return(final) } # Run the t-SNE algorithm for dimensionality reduction # @param norm Normalized count matrix. # @param perplexity Numeric vector. Determines perplexity for tsne. Default 20. # @param maxIter Numeric vector. Determines iterations for tsne. Default 1000. # @param doPca Logical. Whether to perform # dimensionality reduction with PCA before tSNE. # @param initialDims Integer. Number of dimensions from PCA to use as # input in tSNE. Default 50. #' @importFrom Rtsne Rtsne .calculateTsne <- function(norm, perplexity, maxIter, doPca, initialDims) { res <- Rtsne::Rtsne( norm, pca = doPca, max_iter = maxIter, perplexity = perplexity, check_duplicates = FALSE, is_distance = FALSE, initial_dims = initialDims)$Y return(res) } ================================================ FILE: R/celdatosce.R ================================================ #' @title Convert old celda model object to \code{SCE} object #' @description Convert a old celda model object (\code{celda_C}, #' \code{celda_G}, or \code{celda_CG} object) to a #' \linkS4class{SingleCellExperiment} object containing celda model #' information in \code{metadata} slot. Counts matrix is stored in the #' \code{"counts"} assay slot in \code{assays}. #' @param celdaModel A \code{celdaModel} or \code{celdaList} object generated #' using older versions of \code{celda}. #' @param counts A numeric \link{matrix} of counts used to generate #' \code{celdaModel}. Dimensions and MD5 checksum will be checked by #' \link{compareCountMatrix}. #' @param useAssay A string specifying the name of the #' \link{assay} slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @return A \linkS4class{SingleCellExperiment} object. Function #' parameter settings are stored in the \link{metadata} #' \code{"celda_parameters"} slot. #' Columns \code{celda_sample_label} and \code{celda_cell_cluster} in #' \link{colData} contain sample labels and celda cell #' population clusters. Column \code{celda_feature_module} in #' \link{rowData} contain feature modules. #' @export setGeneric("celdatosce", function(celdaModel, counts, useAssay = "counts", altExpName = "featureSubset") { standardGeneric("celdatosce")}) #' @rdname celdatosce #' @examples #' data(celdaCMod, celdaCSim) #' sce <- celdatosce(celdaCMod, celdaCSim$counts) #' @export setMethod("celdatosce", signature(celdaModel = "celda_C"), function(celdaModel, counts, useAssay = "counts", altExpName = "featureSubset") { compareCountMatrix(counts, celdaModel, errorOnMismatch = FALSE) ls <- list() ls[[useAssay]] <- counts sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) SingleCellExperiment::altExp(sce, altExpName) <- sce xClass <- "matrix" altExp <- .createSCEceldaC(celdaCMod = celdaModel, sce = SingleCellExperiment::altExp(sce, altExpName), xClass = xClass, useAssay = useAssay, algorithm = NULL, stopIter = NULL, maxIter = NULL, splitOnIter = NULL, splitOnLast = NULL, nchains = NULL, zInitialize = NULL, zInit = NULL, logfile = NULL, verbose = NULL) SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } ) #' @rdname celdatosce #' @examples #' data(celdaGMod, celdaGSim) #' sce <- celdatosce(celdaGMod, celdaGSim$counts) #' @export setMethod("celdatosce", signature(celdaModel = "celda_G"), function(celdaModel, counts, useAssay = "counts", altExpName = "featureSubset") { compareCountMatrix(counts, celdaModel, errorOnMismatch = FALSE) ls <- list() ls[[useAssay]] <- counts sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) SingleCellExperiment::altExp(sce, altExpName) <- sce xClass <- "matrix" altExp <- .createSCEceldaG(celdaGMod = celdaModel, sce = SingleCellExperiment::altExp(sce, altExpName), xClass = xClass, useAssay = useAssay, stopIter = NULL, maxIter = NULL, splitOnIter = NULL, splitOnLast = NULL, nchains = NULL, yInitialize = NULL, yInit = NULL, logfile = NULL, verbose = NULL) SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } ) #' @rdname celdatosce #' @examples #' data(celdaCGMod, celdaCGSim) #' sce <- celdatosce(celdaCGMod, celdaCGSim$counts) #' @export setMethod("celdatosce", signature(celdaModel = "celda_CG"), function(celdaModel, counts, useAssay = "counts", altExpName = "featureSubset") { compareCountMatrix(counts, celdaModel, errorOnMismatch = FALSE) ls <- list() ls[[useAssay]] <- counts sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) SingleCellExperiment::altExp(sce, altExpName) <- sce xClass <- "matrix" altExp <- .createSCEceldaCG(celdaCGMod = celdaModel, sce = SingleCellExperiment::altExp(sce, altExpName), xClass = xClass, useAssay = useAssay, algorithm = NULL, stopIter = NULL, maxIter = NULL, splitOnIter = NULL, splitOnLast = NULL, nchains = NULL, zInitialize = NULL, yInitialize = NULL, zInit = NULL, yInit = NULL, logfile = NULL, verbose = NULL) SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } ) #' @rdname celdatosce #' @examples #' data(celdaCGGridSearchRes, celdaCGSim) #' sce <- celdatosce(celdaCGGridSearchRes, celdaCGSim$counts) #' @export setMethod("celdatosce", signature(celdaModel = "celdaList"), function(celdaModel, counts, useAssay = "counts", altExpName = "featureSubset") { compareCountMatrix(counts, celdaModel, errorOnMismatch = FALSE) ls <- list() ls[[useAssay]] <- counts sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) SingleCellExperiment::altExp(sce, altExpName) <- sce xClass <- "matrix" model <- celdaModel@celdaGridSearchParameters$model paramsTest <- celdaModel@celdaGridSearchParameters$paramsTest paramsFixed <- celdaModel@celdaGridSearchParameters$paramsFixed maxIter <- celdaModel@celdaGridSearchParameters$maxIter nchains <- celdaModel@celdaGridSearchParameters$nchains cores <- celdaModel@celdaGridSearchParameters$cores bestOnly <- celdaModel@celdaGridSearchParameters$bestOnly perplexity <- celdaModel@celdaGridSearchParameters$perplexity verbose <- celdaModel@celdaGridSearchParameters$verbose logfilePrefix <- celdaModel@celdaGridSearchParameters$logfilePrefix altExp <- .createSCEceldaGridSearch(celdaList = celdaModel, sce = SingleCellExperiment::altExp(sce, altExpName), xClass = xClass, useAssay = useAssay, model = model, paramsTest = paramsTest, paramsFixed = paramsFixed, maxIter = maxIter, seed = NULL, nchains = nchains, cores = cores, bestOnly = bestOnly, perplexity = perplexity, verbose = verbose, logfilePrefix = logfilePrefix) SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } ) ================================================ FILE: R/clusterProbability.R ================================================ #' @title Get the conditional probabilities of cell in subpopulations from celda #' model #' @description Calculate the conditional probability of each cell belonging to #' each subpopulation given all other cell cluster assignments and/or #' each feature belonging to each module given all other feature cluster #' assignments in a celda model. #' @param sce A \linkS4class{SingleCellExperiment} object returned by #' \link{celda_C}, \link{celda_G}, or \link{celda_CG}, with the matrix #' located in the \code{useAssay} assay slot. #' Rows represent features and columns represent cells. #' @param useAssay A string specifying which \link{assay} #' slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param log Logical. If \code{FALSE}, then the normalized conditional #' probabilities will be returned. If \code{TRUE}, then the unnormalized log #' probabilities will be returned. Default \code{FALSE}. #' @examples #' data(sceCeldaCG) #' clusterProb <- clusterProbability(sceCeldaCG, log = TRUE) #' @return A list containging a matrix for the conditional cell subpopulation #' cluster and/or feature module probabilities. #' @export setGeneric("clusterProbability", function(sce, useAssay = "counts", altExpName = "featureSubset", log = FALSE) { standardGeneric("clusterProbability") }) #' @seealso `celda_C()` for clustering cells #' @examples #' data(sceCeldaC) #' clusterProb <- clusterProbability(sceCeldaC) #' @rdname clusterProbability #' @export setMethod("clusterProbability", signature(sce = "SingleCellExperiment"), function(sce, useAssay = "counts", altExpName = "featureSubset", log = FALSE) { model <- celdaModel(sce, altExpName = altExpName) altExp <- SingleCellExperiment::altExp(sce, altExpName) counts <- SummarizedExperiment::assay(altExp, i = useAssay) beta <- S4Vectors::metadata(altExp)$celda_parameters$beta if (model == "celda_C") { s <- as.integer( SummarizedExperiment::colData(altExp)$celda_sample_label) z <- as.integer( SummarizedExperiment::colData(altExp)$celda_cell_cluster) K <- S4Vectors::metadata(altExp)$celda_parameters$K alpha <- S4Vectors::metadata(altExp)$celda_parameters$alpha cp <- .clusterProbabilityCeldaC( counts = counts, z = z, s = s, K = K, alpha = alpha, beta = beta, log = log) } else if (model == "celda_CG") { s <- as.integer( SummarizedExperiment::colData(altExp)$celda_sample_label) z <- as.integer( SummarizedExperiment::colData(altExp)$celda_cell_cluster) K <- S4Vectors::metadata(altExp)$celda_parameters$K y <- as.integer( SummarizedExperiment::rowData(altExp)$celda_feature_module) L <- S4Vectors::metadata(altExp)$celda_parameters$L alpha <- S4Vectors::metadata(altExp)$celda_parameters$alpha delta <- S4Vectors::metadata(altExp)$celda_parameters$delta gamma <- S4Vectors::metadata(altExp)$celda_parameters$gamma cp <- .clusterProbabilityCeldaCG( counts = counts, s = s, z = z, y = y, K = K, L = L, alpha = alpha, delta = delta, beta = beta, gamma = gamma, log = log) } else if (model == "celda_G") { y <- as.integer( SummarizedExperiment::rowData(altExp)$celda_feature_module) L <- S4Vectors::metadata(altExp)$celda_parameters$L delta <- S4Vectors::metadata(altExp)$celda_parameters$delta gamma <- S4Vectors::metadata(altExp)$celda_parameters$gamma cp <- .clusterProbabilityCeldaG( counts = counts, y = y, L = L, delta = delta, beta = beta, gamma = gamma, log = log) } else { stop("S4Vectors::metadata(altExp(sce, altExpName))$", "celda_parameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'!") } return(cp) } ) .clusterProbabilityCeldaC <- function( counts, z, s, K, alpha, beta, log) { p <- .cCDecomposeCounts(counts, s, z, K) nextZ <- .cCCalcGibbsProbZ(counts = counts, mCPByS = p$mCPByS, nGByCP = p$nGByCP, nByC = p$nByC, nCP = p$nCP, z = z, s = s, K = K, nG = p$nG, nM = p$nM, alpha = alpha, beta = beta, doSample = FALSE) zProb <- t(nextZ$probs) if (!isTRUE(log)) { zProb <- .normalizeLogProbs(zProb) } return(list(zProbability = zProb)) } .clusterProbabilityCeldaCG <- function( counts, s, z, y, K, L, alpha, delta, beta, gamma, log) { p <- .cCGDecomposeCounts(counts, s, z, y, K, L) lgbeta <- lgamma(seq(0, max(p$nCP)) + beta) lggamma <- lgamma(seq(0, nrow(counts) + L) + gamma) lgdelta <- c(NA, lgamma((seq(nrow(counts) + L) * delta))) nextZ <- .cCCalcGibbsProbZ( counts = p$nTSByC, mCPByS = p$mCPByS, nGByCP = p$nTSByCP, nCP = p$nCP, nByC = p$nByC, z = z, s = s, K = K, nG = L, nM = p$nM, alpha = alpha, beta = beta, doSample = FALSE ) zProb <- t(nextZ$probs) ## Gibbs sampling for each gene nextY <- .cGCalcGibbsProbY( counts = p$nGByCP, nTSByC = p$nTSByCP, nByTS = p$nByTS, nGByTS = p$nGByTS, nByG = p$nByG, y = y, L = L, nG = p$nG, lgbeta = lgbeta, lgdelta = lgdelta, lggamma = lggamma, delta = delta, doSample = FALSE ) yProb <- t(nextY$probs) if (!isTRUE(log)) { zProb <- .normalizeLogProbs(zProb) yProb <- .normalizeLogProbs(yProb) } return(list(zProbability = zProb, yProbability = yProb)) } .clusterProbabilityCeldaG <- function( counts, y, L, delta, beta, gamma, log) { ## Calculate counts one time up front p <- .cGDecomposeCounts(counts = counts, y = y, L = L) lgbeta <- lgamma(seq(0, max(.colSums( counts, nrow(counts), ncol(counts) ))) + beta) lggamma <- lgamma(seq(0, nrow(counts) + L) + gamma) lgdelta <- c(NA, lgamma(seq(nrow(counts) + L) * delta)) nextY <- .cGCalcGibbsProbY( counts = counts, nTSByC = p$nTSByC, nByTS = p$nByTS, nGByTS = p$nGByTS, nByG = p$nByG, y = y, nG = p$nG, L = L, lgbeta = lgbeta, lgdelta = lgdelta, lggamma = lggamma, delta = delta, doSample = FALSE ) yProb <- t(nextY$probs) if (!isTRUE(log)) { yProb <- .normalizeLogProbs(yProb) } return(list(yProbability = yProb)) } ================================================ FILE: R/data.R ================================================ #' @title available models #' @export availableModels <- c("celda_C", "celda_G", "celda_CG") #' A toy count matrix for use with celda. #' #' @title sampleCells #' @description A matrix of simulated gene counts. #' @format A matrix of simulated gene counts with 10 rows (genes) and 10 #' columns (cells). #' @details Generated by Josh Campbell. #' @source \url{http://github.com/campbio/celda} "sampleCells" #' @title celdaCGSim #' @description An deprecated example of simulated count matrix from the #' celda_CG model. #' @format A list of counts and properties as returned from old simulateCells(). "celdaCGSim" #' @title celdaCGmod #' @description celda_CG model object generated from \code{celdaCGSim} using #' old \code{celda_CG} function. #' @format A celda_CG object # @examples # data(celdaCGSim) # celdaCGMod <- celda_CG(celdaCGSim$counts, # K = celdaCGSim$K, # L = celdaCGSim$L, # nchains = 1) "celdaCGMod" #' @title celdaCGGridSearchRes #' @description Example results of old celdaGridSearch on celdaCGSim #' @format An object as returned from old celdaGridSearch() # @examples # data(celdaCGSim) # celdaCGGridSearchRes <- celdaGridSearch(celdaCGSim$counts, # model = "celda_CG", # paramsTest = list(K = seq(4, 6), L = seq(9, 11)), # paramsFixed = list(sampleLabel = celdaCGSim$sampleLabel), # bestOnly = TRUE, # nchains = 1, # cores = 2) "celdaCGGridSearchRes" #' @title celdaCSim #' @description An old example simulated count matrix from the celda_C model. #' @format A list of counts and properties as returned from old simulateCells(). # @examples # celdaCSim <- simulateCells("celda_C") "celdaCSim" #' @title celdaCMod #' @description Old celda_C results generated from celdaCSim #' @format A celda_C object # @examples # data(celdaCSim) # celdaCMod <- celda_C(celdaCSim$counts, K = celdaCSim$K, nchains = 1) "celdaCMod" #' @title celdaGSim #' @description An old example simulated count matrix from the celda_G model. #' @format A list of counts and properties as returned from old simulateCells() # @examples # celdaGSim <- simulateCells("celda_G") "celdaGSim" #' @title celdaGMod #' @description Old celda_G results generated from celdaGsim #' @format A celda_G object # @examples # data(celdaGSim) # celdaGMod <- celda_G(celdaGSim$counts, L = celdaGSim$L, nchains = 1) "celdaGMod" #' @title contaminationSim #' @description A toy contamination data generated by #' \link{simulateContamination} #' @format A list "contaminationSim" #' @title sceCeldaC #' @description A \linkS4class{SingleCellExperiment} object containing the #' results of running \link{selectFeatures} and \link{celda_C} on #' \link{celdaCSim}. #' @format A \linkS4class{SingleCellExperiment} object #' @examples #' data(celdaCSim) #' sceCeldaC <- selectFeatures(celdaCSim$counts) #' sceCeldaC <- celda_C(sceCeldaC, #' K = celdaCSim$K, #' sampleLabel = celdaCSim$sampleLabel, #' nchains = 1) "sceCeldaC" #' @title sceCeldaG #' @description A \linkS4class{SingleCellExperiment} object containing the #' results of running \link{selectFeatures} and \link{celda_G} on #' \link{celdaGSim}. #' @format A \linkS4class{SingleCellExperiment} object #' @examples #' data(celdaGSim) #' sceCeldaG <- selectFeatures(celdaGSim$counts) #' sceCeldaG <- celda_G(sceCeldaG, L = celdaGSim$L, nchains = 1) "sceCeldaG" #' @title sceCeldaCG #' @description A \linkS4class{SingleCellExperiment} object containing the #' results of running \link{selectFeatures} and \link{celda_CG} on #' \link{celdaCGSim}. #' @format A \linkS4class{SingleCellExperiment} object #' @examples #' data(celdaCGSim) #' sceCeldaCG <- selectFeatures(celdaCGSim$counts) #' sceCeldaCG <- celda_CG(sceCeldaCG, #' K = celdaCGSim$K, #' L = celdaCGSim$L, #' sampleLabel = celdaCGSim$sampleLabel, #' nchains = 1) "sceCeldaCG" #' @title sceCeldaCGGridSearch #' @description A \linkS4class{SingleCellExperiment} object containing the #' results of running \link{selectFeatures} and \link{celdaGridSearch} on #' \link{celdaCGSim}. #' @format A \linkS4class{SingleCellExperiment} object #' @examples #' data(celdaCGSim) #' sce <- selectFeatures(celdaCGSim$counts) #' sceCeldaCGGridSearch <- celdaGridSearch(sce, #' model = "celda_CG", #' paramsTest = list(K = seq(4, 6), L = seq(9, 11)), #' paramsFixed = list(sampleLabel = celdaCGSim$sampleLabel), #' bestOnly = TRUE, #' nchains = 1, #' cores = 1, #' verbose = FALSE) "sceCeldaCGGridSearch" ================================================ FILE: R/decon.R ================================================ #' @title Contamination estimation with decontX #' #' @description Identifies contamination from factors such as ambient RNA #' in single cell genomic datasets. #' #' @name decontX #' #' @param x A numeric matrix of counts or a \linkS4class{SingleCellExperiment} #' with the matrix located in the assay slot under \code{assayName}. #' Cells in each batch will be subsetted and converted to a sparse matrix #' of class \code{dgCMatrix} from package \link{Matrix} before analysis. This #' object should only contain filtered cells after cell calling. Empty #' cell barcodes (low expression droplets before cell calling) are not needed #' to run DecontX. #' @param assayName Character. Name of the assay to use if \code{x} is a #' \linkS4class{SingleCellExperiment}. #' @param z Numeric or character vector. Cell cluster labels. If NULL, #' PCA will be used to reduce the dimensionality of the dataset initially, #' '\link[uwot]{umap}' from the 'uwot' package #' will be used to further reduce the dataset to 2 dimenions and #' the '\link[dbscan]{dbscan}' function from the 'dbscan' package #' will be used to identify clusters of broad cell types. Default NULL. #' @param batch Numeric or character vector. Batch labels for cells. #' If batch labels are supplied, DecontX is run on cells from each #' batch separately. Cells run in different channels or assays #' should be considered different batches. Default NULL. #' @param background A numeric matrix of counts or a #' \linkS4class{SingleCellExperiment} with the matrix located in the assay #' slot under \code{assayName}. It should have the same data format as \code{x} #' except it contains the empty droplets instead of cells. When supplied, #' empirical distribution of transcripts from these empty droplets #' will be used as the contamination distribution. Default NULL. #' @param bgAssayName Character. Name of the assay to use if \code{background} #' is a \linkS4class{SingleCellExperiment}. Default to same as #' \code{assayName}. #' @param bgBatch Numeric or character vector. Batch labels for #' \code{background}. Its unique values should be the same as those in #' \code{batch}, such that each batch of cells have their corresponding batch #' of empty droplets as background, pointed by this parameter. Default to NULL. #' @param maxIter Integer. Maximum iterations of the EM algorithm. Default 500. #' @param convergence Numeric. The EM algorithm will be stopped if the maximum #' difference in the contamination estimates between the previous and #' current iterations is less than this. Default 0.001. #' @param iterLogLik Integer. Calculate log likelihood every \code{iterLogLik} #' iteration. Default 10. #' @param delta Numeric Vector of length 2. Concentration parameters for #' the Dirichlet prior for the contamination in each cell. The first element #' is the prior for the native counts while the second element is the prior for #' the contamination counts. These essentially act as pseudocounts for the #' native and contamination in each cell. If \code{estimateDelta = TRUE}, #' this is only used to produce a random sample of proportions for an initial #' value of contamination in each cell. Then #' \code{\link[MCMCprecision]{fit_dirichlet}} is used to update #' \code{delta} in each iteration. #' If \code{estimateDelta = FALSE}, then \code{delta} is fixed with these #' values for the entire inference procedure. Fixing \code{delta} and #' setting a high number in the second element will force \code{decontX} #' to be more aggressive and estimate higher levels of contamination at #' the expense of potentially removing native expression. #' Default \code{c(10, 10)}. #' @param estimateDelta Boolean. Whether to update \code{delta} at each #' iteration. #' @param varGenes Integer. The number of variable genes to use in #' dimensionality reduction before clustering. Variability is calcualted using #' \code{\link[scran]{modelGeneVar}} function from the 'scran' package. #' Used only when z is not provided. Default 5000. #' @param dbscanEps Numeric. The clustering resolution parameter #' used in '\link[dbscan]{dbscan}' to estimate broad cell clusters. #' Used only when z is not provided. Default 1. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @param logfile Character. Messages will be redirected to a file named #' `logfile`. If NULL, messages will be printed to stdout. Default NULL. #' @param verbose Logical. Whether to print log messages. Default TRUE. #' @param ... For the generic, further arguments to pass to each method. #' #' @return If \code{x} is a matrix-like object, a list will be returned #' with the following items: #' \describe{ #' \item{\code{decontXcounts}:}{The decontaminated matrix. Values obtained #' from the variational inference procedure may be non-integer. However, #' integer counts can be obtained by rounding, #' e.g. \code{round(decontXcounts)}.} #' \item{\code{contamination}:}{Percentage of contamination in each cell.} #' \item{\code{estimates}:}{List of estimated parameters for each batch. If z #' was not supplied, then the UMAP coordinates used to generated cell #' cluster labels will also be stored here.} #' \item{\code{z}:}{Cell population/cluster labels used for analysis.} #' \item{\code{runParams}:}{List of arguments used in the function call.} #' } #' #' If \code{x} is a \linkS4class{SingleCellExperiment}, then the decontaminated #' counts will be stored as an assay and can be accessed with #' \code{decontXcounts(x)}. The contamination values and cluster labels #' will be stored in \code{colData(x)}. \code{estimates} and \code{runParams} #' will be stored in \code{metadata(x)$decontX}. The UMAPs used to generated #' cell cluster labels will be stored in #' \code{reducedDims} slot in \code{x}. #' #' @author Shiyi Yang, Yuan Yin, Joshua Campbell #' #' @examples #' # Generate matrix with contamination #' s <- simulateContamination(seed = 12345) #' #' library(SingleCellExperiment) #' sce <- SingleCellExperiment(list(counts = s$observedCounts)) #' sce <- decontX(sce) #' #' # Plot contamination on UMAP #' plotDecontXContamination(sce) #' #' # Plot decontX cluster labels #' umap <- reducedDim(sce) #' plotDimReduceCluster(x = sce$decontX_clusters, #' dim1 = umap[, 1], dim2 = umap[, 2], ) #' #' # Plot percentage of marker genes detected #' # in each cell cluster before decontamination #' s$markers #' plotDecontXMarkerPercentage(sce, markers = s$markers, assayName = "counts") #' #' # Plot percentage of marker genes detected #' # in each cell cluster after contamination #' plotDecontXMarkerPercentage(sce, markers = s$markers, #' assayName = "decontXcounts") #' #' # Plot percentage of marker genes detected in each cell #' # comparing original and decontaminated counts side-by-side #' plotDecontXMarkerPercentage(sce, markers = s$markers, #' assayName = c("counts", "decontXcounts")) #' #' # Plot raw counts of indiviual markers genes before #' # and after decontamination #' plotDecontXMarkerExpression(sce, unlist(s$markers)) NULL #' @export #' @rdname decontX setGeneric("decontX", function(x, ...) standardGeneric("decontX")) ######################### # Setting up S4 methods # ######################### #' @export #' @rdname decontX #' @importClassesFrom SingleCellExperiment SingleCellExperiment #' @importClassesFrom Matrix dgCMatrix setMethod("decontX", "SingleCellExperiment", function(x, assayName = "counts", z = NULL, batch = NULL, background = NULL, bgAssayName = NULL, bgBatch = NULL, maxIter = 500, delta = c(10, 10), estimateDelta = TRUE, convergence = 0.001, iterLogLik = 10, varGenes = 5000, dbscanEps = 1, seed = 12345, logfile = NULL, verbose = TRUE) { countsBackground <- NULL if (!is.null(background)) { # Remove cells with the same ID between x and the background matrix # Also update bgBatch when background is updated and bgBatch is not null temp <- .checkBackground(x = x, background = background, bgBatch = bgBatch, logfile = logfile, verbose = verbose) background <- temp$background bgBatch <- temp$bgBatch if (is.null(bgAssayName)) { bgAssayName <- assayName } countsBackground <- SummarizedExperiment::assay(background, i = bgAssayName) } mat <- SummarizedExperiment::assay(x, i = assayName) result <- .decontX( counts = mat, z = z, batch = batch, countsBackground = countsBackground, batchBackground = bgBatch, maxIter = maxIter, convergence = convergence, iterLogLik = iterLogLik, delta = delta, estimateDelta = estimateDelta, varGenes = varGenes, dbscanEps = dbscanEps, seed = seed, logfile = logfile, verbose = verbose ) ## Add results into column annotation SummarizedExperiment::colData(x)$decontX_contamination <- result$contamination SummarizedExperiment::colData(x)$decontX_clusters <- as.factor(result$z) ## Put estimated UMAPs into SCE batchIndex <- unique(result$runParams$batch) if (length(batchIndex) > 1) { for (i in batchIndex) { ## Each individual UMAP will only be for one batch so need ## to put NAs in for cells in other batches tempUMAP <- matrix(NA, ncol = 2, nrow = ncol(mat)) tempUMAP[result$runParams$batch == i, ] <- result$estimates[[i]]$UMAP colnames(tempUMAP) <- c("UMAP_1", "UMAP_2") rownames(tempUMAP) <- colnames(mat) SingleCellExperiment::reducedDim( x, paste("decontX", i, "UMAP", sep = "_") ) <- tempUMAP } } else { SingleCellExperiment::reducedDim(x, "decontX_UMAP") <- result$estimates[[batchIndex]]$UMAP } ## Save the rest of the result object into metadata decontXcounts(x) <- result$decontXcounts result$decontXcounts <- NULL S4Vectors::metadata(x)$decontX <- result return(x) }) #' @export #' @rdname decontX setMethod("decontX", "ANY", function(x, z = NULL, batch = NULL, background = NULL, bgBatch = NULL, maxIter = 500, delta = c(10, 10), estimateDelta = TRUE, convergence = 0.001, iterLogLik = 10, varGenes = 5000, dbscanEps = 1, seed = 12345, logfile = NULL, verbose = TRUE) { countsBackground <- NULL if (!is.null(background)) { # Remove cells with the same ID between x and the background matrix # Also update bgBatch when background is updated and bgBatch is not null temp <- .checkBackground(x = x, background = background, bgBatch = bgBatch, logfile = logfile, verbose = verbose) background <- temp$background countsBackground <- background bgBatch <- temp$bgBatch } .decontX( counts = x, z = z, batch = batch, countsBackground = countsBackground, batchBackground = bgBatch, maxIter = maxIter, convergence = convergence, iterLogLik = iterLogLik, delta = delta, estimateDelta = estimateDelta, varGenes = varGenes, dbscanEps = dbscanEps, seed = seed, logfile = logfile, verbose = verbose ) }) ## Copied from SingleCellExperiment Package GET_FUN <- function(exprs_values, ...) { (exprs_values) # To ensure evaluation function(object, ...) { SummarizedExperiment::assay(object, i = exprs_values, ...) } } SET_FUN <- function(exprs_values, ...) { (exprs_values) # To ensure evaluation function(object, ..., value) { SummarizedExperiment::assay(object, i = exprs_values, ...) <- value object } } #' @title Get or set decontaminated counts matrix #' #' @description Gets or sets the decontaminated counts matrix from a #' a \linkS4class{SingleCellExperiment} object. #' @name decontXcounts #' @param object A \linkS4class{SingleCellExperiment} object. #' @param value A matrix to save as an assay called \code{decontXcounts} #' @param ... For the generic, further arguments to pass to each method. #' @return If getting, the assay from \code{object} with the name #' \code{decontXcounts} will be returned. If setting, a #' \linkS4class{SingleCellExperiment} object will be returned with #' \code{decontXcounts} listed in the \code{assay} slot. #' @seealso \code{\link{assay}} and \code{\link{assay<-}} NULL #' @export #' @rdname decontXcounts setGeneric("decontXcounts", function(object, ...) { standardGeneric("decontXcounts") }) #' @export #' @rdname decontXcounts setGeneric("decontXcounts<-", function(object, ..., value) { standardGeneric("decontXcounts<-") }) #' @export #' @rdname decontXcounts setMethod("decontXcounts", "SingleCellExperiment", GET_FUN("decontXcounts")) #' @export #' @rdname decontXcounts setMethod( "decontXcounts<-", c("SingleCellExperiment", "ANY"), SET_FUN("decontXcounts") ) ########################## # Core Decontx Functions # ########################## .decontX <- function(counts, z = NULL, batch = NULL, countsBackground = NULL, batchBackground = NULL, maxIter = 200, convergence = 0.001, iterLogLik = 10, delta = c(10, 10), estimateDelta = TRUE, varGenes = NULL, dbscanEps = NULL, seed = 12345, logfile = NULL, verbose = TRUE) { startTime <- Sys.time() .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages("Starting DecontX", logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) runParams <- list( z = z, batch = batch, batchBackground = batchBackground, maxIter = maxIter, delta = delta, estimateDelta = estimateDelta, convergence = convergence, varGenes = varGenes, dbscanEps = dbscanEps, logfile = logfile, verbose = verbose ) totalGenes <- nrow(counts) totalCells <- ncol(counts) geneNames <- rownames(counts) nC <- ncol(counts) allCellNames <- colnames(counts) ## Set up final decontaminated matrix estRmat <- Matrix::Matrix( data = 0, ncol = totalCells, nrow = totalGenes, sparse = TRUE, dimnames = list(geneNames, allCellNames) ) ## Generate batch labels if none were supplied if (is.null(batch)) { batch <- rep("all_cells", nC) # If batch null, bgBatch has to be null if (!is.null(batchBackground)) { stop( "When experiment default to no bacth, background should ", "also default to no batch." ) } if (!is.null(countsBackground)) { batchBackground <- rep("all_cells", ncol(countsBackground)) } } else { # If batch not null and countsBackground supplied, # user has to supply batchBackground as well if (!is.null(countsBackground) & is.null(batchBackground)) { stop( "Cell batch, and background are supplied. Please also ", "supply background batch." ) } } runParams$batch <- batch runParams$batchBackground <- batchBackground batchIndex <- unique(batch) ## Set result lists upfront for all cells from different batches logLikelihood <- c() estConp <- rep(NA, nC) returnZ <- rep(NA, nC) resBatch <- list() ## Cycle through each sample/batch and run DecontX for (bat in batchIndex) { if (length(batchIndex) == 1) { .logMessages( date(), ".. Analyzing all cells", logfile = logfile, append = TRUE, verbose = verbose ) } else { .logMessages( date(), " .. Analyzing cells in batch '", bat, "'", sep = "", logfile = logfile, append = TRUE, verbose = verbose ) } zBat <- NULL countsBat <- counts[, batch == bat] bgBat <- countsBackground[, batchBackground == bat] ## Convert to sparse matrix if (!inherits(countsBat, "dgCMatrix")) { .logMessages( date(), ".... Converting to sparse matrix", logfile = logfile, append = TRUE, verbose = verbose ) countsBat <- methods::as(countsBat, "CsparseMatrix") } if (!is.null(bgBat)) { if (!inherits(bgBat, "dgCMatrix")) { bgBat <- methods::as(bgBat, "CsparseMatrix") } } if (!is.null(z)) { zBat <- z[batch == bat] } if (is.null(seed)) { res <- .decontXoneBatch( counts = countsBat, z = zBat, batch = bat, countsBackground = bgBat, maxIter = maxIter, delta = delta, estimateDelta = estimateDelta, convergence = convergence, iterLogLik = iterLogLik, logfile = logfile, verbose = verbose, varGenes = varGenes, dbscanEps = dbscanEps, seed = seed ) } else { withr::with_seed( seed, res <- .decontXoneBatch( counts = countsBat, z = zBat, batch = bat, countsBackground = bgBat, maxIter = maxIter, delta = delta, estimateDelta = estimateDelta, convergence = convergence, iterLogLik = iterLogLik, logfile = logfile, verbose = verbose, varGenes = varGenes, dbscanEps = dbscanEps, seed = seed ) ) } ## Try to convert class of new matrix to class of original matrix .logMessages( date(), ".. Calculating final decontaminated matrix", logfile = logfile, append = TRUE, verbose = verbose ) estRmat.temp <- calculateNativeMatrix( counts = countsBat, theta = res$theta, eta = res$eta, phi = res$phi, z = as.integer(res$z), pseudocount = 1e-20 ) # Speed up sparse matrix value assignment by cbind -> order recovery allCol <- paste0("col_", seq_len(ncol(estRmat))) colnames(estRmat) <- allCol subCol <- paste0("col_", which(batch == bat)) colnames(estRmat.temp) <- subCol estRmat <- estRmat[, !(allCol %in% subCol)] estRmat <- cbind(estRmat, estRmat.temp) # Recover order and set names estRmat <- estRmat[, allCol] dimnames(estRmat) <- list(geneNames, allCellNames) resBatch[[bat]] <- list( z = res$z, phi = res$phi, eta = res$eta, delta = res$delta, theta = res$theta, contamination = res$contamination, logLikelihood = res$logLikelihood, UMAP = res$UMAP, z = res$z, iteration = res$iteration ) estConp[batch == bat] <- res$contamination if (length(batchIndex) > 1) { returnZ[batch == bat] <- paste0(bat, "-", res$z) } else { returnZ[batch == bat] <- res$z } } names(resBatch) <- batchIndex returnResult <- list( "runParams" = runParams, "estimates" = resBatch, "decontXcounts" = estRmat, "contamination" = estConp, "z" = returnZ ) if (inherits(counts, c("DelayedMatrix", "DelayedArray"))) { .logMessages( date(), ".. Converting decontaminated matrix to", class(counts), logfile = logfile, append = TRUE, verbose = verbose ) ## Determine class of seed in DelayedArray seed.class <- unique(DelayedArray::seedApply(counts, class))[[1]] if (seed.class == "HDF5ArraySeed") { returnResult$decontXcounts <- methods::as(returnResult$decontXcounts, "HDF5Matrix") } else { if (isTRUE(methods::canCoerce(returnResult$decontXcounts, seed.class))) { returnResult$decontXcounts <- methods::as(returnResult$decontXcounts, seed.class) } } returnResult$decontXcounts <- DelayedArray::DelayedArray(returnResult$decontXcounts) } else { try({ if (methods::canCoerce(returnResult$decontXcounts, class(counts))) { returnResult$decontXcounts <- methods::as(returnResult$decontXcounts, class(counts)) } }, silent = TRUE ) } endTime <- Sys.time() .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages("Completed DecontX. Total time:", format(difftime(endTime, startTime)), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(paste(rep("-", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) return(returnResult) } # This function updates decontamination for one batch # seed passed to this function is to be furhter passed to # function .decontxInitializeZ() .decontXoneBatch <- function(counts, z = NULL, batch = NULL, countsBackground = NULL, maxIter = 200, delta = c(10, 10), estimateDelta = TRUE, convergence = 0.01, iterLogLik = 10, logfile = NULL, verbose = TRUE, varGenes = NULL, dbscanEps = NULL, seed = 12345) { .checkCountsDecon(counts) .checkDelta(delta) # nG <- nrow(counts) nC <- ncol(counts) deconMethod <- "clustering" ## Generating UMAP and cell cluster labels if none are provided umap <- NULL if (is.null(z)) { m <- ".... Generating UMAP and estimating cell types" estimateCellTypes <- TRUE } else { m <- ".... Generating UMAP" estimateCellTypes <- FALSE } .logMessages( date(), m, logfile = logfile, append = TRUE, verbose = verbose ) varGenes <- .processvarGenes(varGenes) dbscanEps <- .processdbscanEps(dbscanEps) celda.init <- .decontxInitializeZ( object = counts, varGenes = varGenes, dbscanEps = dbscanEps, estimateCellTypes = estimateCellTypes, seed = seed ) if (is.null(z)) { z <- celda.init$z } umap <- celda.init$umap colnames(umap) <- c( "DecontX_UMAP_1", "DecontX_UMAP_2" ) rownames(umap) <- colnames(counts) z <- .processCellLabels(z, numCells = nC) K <- length(unique(z)) iter <- 1L numIterWithoutImprovement <- 0L stopIter <- 3L .logMessages( date(), ".... Estimating contamination", logfile = logfile, append = TRUE, verbose = verbose ) if (deconMethod == "clustering") { ## Initialization theta <- stats::rbeta( n = nC, shape1 = delta[1], shape2 = delta[2] ) nextDecon <- decontXInitialize( counts = counts, theta = theta, z = z, pseudocount = 1e-20 ) phi <- nextDecon$phi eta <- nextDecon$eta # if countsBackground is not null, use empirical dist. to replace eta if (!is.null(countsBackground)) { # Add pseudocount to each gene in eta eta_tilda <- Matrix::rowSums(countsBackground) + 1e-20 eta <- eta_tilda / sum(eta_tilda) # Make eta a matrix same dimension as phi eta <- matrix(eta, length(eta), dim(phi)[2]) } ll <- c() llRound <- decontXLogLik( counts = counts, z = z, phi = phi, eta = eta, theta = theta, pseudocount = 1e-20 ) ## EM updates theta.previous <- theta converged <- FALSE counts.colsums <- Matrix::colSums(counts) while (iter <= maxIter & !isTRUE(converged) & numIterWithoutImprovement <= stopIter) { if (is.null(countsBackground)) { nextDecon <- decontXEM( counts = counts, counts_colsums = counts.colsums, phi = phi, estimate_eta = TRUE, eta = eta, theta = theta, z = z, estimate_delta = isTRUE(estimateDelta), delta = delta, pseudocount = 1e-20 ) } else { nextDecon <- decontXEM( counts = counts, counts_colsums = counts.colsums, phi = phi, estimate_eta = FALSE, eta = eta, theta = theta, z = z, estimate_delta = isTRUE(estimateDelta), delta = delta, pseudocount = 1e-20 ) } theta <- nextDecon$theta phi <- nextDecon$phi eta <- nextDecon$eta delta <- nextDecon$delta max.divergence <- max(abs(theta.previous - theta)) if (max.divergence < convergence) { converged <- TRUE } theta.previous <- theta ## Calculate likelihood and check for convergence if (iter %% iterLogLik == 0 || converged) { llTemp <- decontXLogLik( counts = counts, z = z, phi = phi, eta = eta, theta = theta, pseudocount = 1e-20 ) ll <- c(ll, llTemp) .logMessages(date(), "...... Completed iteration:", iter, "| converge:", signif(max.divergence, 4), logfile = logfile, append = TRUE, verbose = verbose ) } iter <- iter + 1L } } resConp <- nextDecon$contamination names(resConp) <- colnames(counts) return(list( "logLikelihood" = ll, "contamination" = resConp, "theta" = theta, "delta" = delta, "phi" = phi, "eta" = eta, "UMAP" = umap, "iteration" = iter - 1L, "z" = z )) } # This function calculates the log-likelihood # # counts Numeric/Integer matrix. Observed count matrix, rows represent features # and columns represent cells # z Integer vector. Cell population labels # phi Numeric matrix. Rows represent features and columns represent cell # populations # eta Numeric matrix. Rows represent features and columns represent cell # populations # theta Numeric vector. Proportion of truely expressed transcripts .deconCalcLL <- function(counts, z, phi, eta, theta) { # ll = sum( t(counts) * log( (1-conP )*geneDist[z,] + conP * conDist[z, ] + # 1e-20 ) ) # when dist_mat are K x G matrices ll <- sum(Matrix::t(counts) * log(theta * t(phi)[z, ] + (1 - theta) * t(eta)[z, ] + 1e-20)) return(ll) } # DEPRECATED. This is not used, but is kept as it might be useful in the future # This function calculates the log-likelihood of background distribution # decontamination # bgDist Numeric matrix. Rows represent feature and columns are the times that # the background-distribution has been replicated. .bgCalcLL <- function(counts, globalZ, cbZ, phi, eta, theta) { # ll <- sum(t(counts) * log(theta * t(cellDist) + # (1 - theta) * t(bgDist) + 1e-20)) ll <- sum(t(counts) * log(theta * t(phi)[cbZ, ] + (1 - theta) * t(eta)[globalZ, ] + 1e-20)) return(ll) } # This function updates decontamination # phi Numeric matrix. Rows represent features and columns represent cell # populations # eta Numeric matrix. Rows represent features and columns represent cell # populations # theta Numeric vector. Proportion of truely expressed transctripts #' @importFrom MCMCprecision fit_dirichlet .cDCalcEMDecontamination <- function(counts, phi, eta, theta, z, K, delta) { ## Notes: use fix-point iteration to update prior for theta, no need ## to feed delta anymore logPr <- log(t(phi)[z, ] + 1e-20) + log(theta + 1e-20) logPc <- log(t(eta)[z, ] + 1e-20) + log(1 - theta + 1e-20) Pr.e <- exp(logPr) Pc.e <- exp(logPc) Pr <- Pr.e / (Pr.e + Pc.e) estRmat <- t(Pr) * counts rnGByK <- .colSumByGroupNumeric(estRmat, z, K) cnGByK <- rowSums(rnGByK) - rnGByK counts.cs <- colSums(counts) estRmat.cs <- colSums(estRmat) estRmat.cs.n <- estRmat.cs / counts.cs estCmat.cs.n <- 1 - estRmat.cs.n temp <- cbind(estRmat.cs.n, estCmat.cs.n) deltaV2 <- MCMCprecision::fit_dirichlet(temp)$alpha ## Update parameters theta <- (estRmat.cs + deltaV2[1]) / (counts.cs + sum(deltaV2)) phi <- normalizeCounts(rnGByK, normalize = "proportion", pseudocountNormalize = 1e-20 ) eta <- normalizeCounts(cnGByK, normalize = "proportion", pseudocountNormalize = 1e-20 ) return(list( "estRmat" = estRmat, "theta" = theta, "phi" = phi, "eta" = eta, "delta" = deltaV2 )) } # DEPRECATED. This is not used, but is kept as it might be useful in the # feature. # This function updates decontamination using background distribution .cDCalcEMbgDecontamination <- function(counts, globalZ, cbZ, trZ, phi, eta, theta) { logPr <- log(t(phi)[cbZ, ] + 1e-20) + log(theta + 1e-20) logPc <- log(t(eta)[globalZ, ] + 1e-20) + log(1 - theta + 1e-20) Pr <- exp(logPr) / (exp(logPr) + exp(logPc)) Pc <- 1 - Pr deltaV2 <- MCMCprecision::fit_dirichlet(matrix(c(Pr, Pc), ncol = 2))$alpha estRmat <- t(Pr) * counts phiUnnormalized <- .colSumByGroupNumeric(estRmat, cbZ, max(cbZ)) etaUnnormalized <- rowSums(phiUnnormalized) - .colSumByGroupNumeric( phiUnnormalized, trZ, max(trZ) ) ## Update paramters theta <- (colSums(estRmat) + deltaV2[1]) / (colSums(counts) + sum(deltaV2)) phi <- normalizeCounts(phiUnnormalized, normalize = "proportion", pseudocountNormalize = 1e-20 ) eta <- normalizeCounts(etaUnnormalized, normalize = "proportion", pseudocountNormalize = 1e-20 ) return(list( "estRmat" = estRmat, "theta" = theta, "phi" = phi, "eta" = eta, "delta" = deltaV2 )) } ## Make sure provided count matrix is the right type .checkCountsDecon <- function(counts) { if (sum(is.na(counts)) > 0) { stop("Missing value in 'counts' matrix.") } if (is.null(dim(counts))) { stop("At least 2 genes need to have non-zero expressions.") } } ## Make sure provided cell labels are the right type #' @importFrom plyr mapvalues .processCellLabels <- function(z, numCells) { if (length(z) != numCells) { stop( "'z' must be of the same length as the number of cells in the", " 'counts' matrix." ) } if (length(unique(z)) < 2) { stop( "No need to decontaminate when only one cluster", " is in the dataset." ) # Even though # everything runs smoothly when length(unique(z)) == 1, result is not # trustful } if (!is.factor(z)) { z <- plyr::mapvalues(z, unique(z), seq(length(unique(z)))) z <- as.factor(z) } return(z) } ## Add two (veried-length) vectors of logLikelihood addLogLikelihood <- function(llA, llB) { lengthA <- length(llA) lengthB <- length(llB) if (lengthA >= lengthB) { llB <- c(llB, rep(llB[lengthB], lengthA - lengthB)) ll <- llA + llB } else { llA <- c(llA, rep(llA[lengthA], lengthB - lengthA)) ll <- llA + llB } return(ll) } .decontxInitializeZ <- function(object, varGenes = 2000, dbscanEps = 1, estimateCellTypes = TRUE, seed = 12345) { if (!is(object, "SingleCellExperiment")) { sce <- SingleCellExperiment::SingleCellExperiment( assays = list(counts = object) ) } sce <- scater::logNormCounts(sce, log = TRUE) if (!is.null(seed)) { with_seed( seed, resUmap <- scater::calculateUMAP(sce, ntop = varGenes, n_threads = 1, exprs_values = "logcounts") ) } else { resUmap <- scater::calculateUMAP(sce, ntop = varGenes, n_threads = 1, exprs_values = "logcounts") } z <- NULL if (isTRUE(estimateCellTypes)) { # Find clusters with dbSCAN totalClusters <- 1 iter <- 1 while (totalClusters <= 1 & dbscanEps > 0 & iter < 10) { resDbscan <- dbscan::dbscan(resUmap, dbscanEps) dbscanEps <- dbscanEps - (0.25 * dbscanEps) totalClusters <- length(unique(resDbscan$cluster)) iter <- iter + 1 } # If dbscan was not able to get more than 2 clusters, # use kmeans to force 2 clusters as a last resort if (totalClusters == 1) { cl <- stats::kmeans(t(SingleCellExperiment::logcounts(sce)), 2) z <- cl$cluster } else { z <- resDbscan$cluster } } return(list( "z" = z, "umap" = resUmap )) } ## Initialization of cell labels for DecontX when they are not given .decontxInitializeZ_prevous <- function(object, # object is either a sce object or a count matrix varGenes = 5000, dbscanEps = 1.0, verbose = TRUE, seed = 12345, logfile = NULL) { if (!is(object, "SingleCellExperiment")) { sce <- SingleCellExperiment::SingleCellExperiment( assays = list(counts = object) ) } sce <- sce[Matrix::rowSums(SingleCellExperiment::counts(sce)) > 0, ] sce <- scater::logNormCounts(sce, log = TRUE) # sce <- scater::normalize(sce) if (nrow(sce) <= varGenes) { topVariableGenes <- seq_len(nrow(sce)) } else if (nrow(sce) > varGenes) { sce.var <- scran::modelGeneVar(sce) topVariableGenes <- order(sce.var$bio, decreasing = TRUE )[seq(varGenes)] } countsFiltered <- as.matrix(SingleCellExperiment::counts( sce[topVariableGenes, ] )) storage.mode(countsFiltered) <- "integer" .logMessages( date(), "...... Collapsing features into", L, "modules", logfile = logfile, append = TRUE, verbose = verbose ) ## Celda clustering using recursive module splitting L <- min(L, nrow(countsFiltered)) if (is.null(seed)) { initialModuleSplit <- recursiveSplitModule(countsFiltered, initialL = L, maxL = L, perplexity = FALSE, verbose = FALSE ) } else { with_seed(seed, initialModuleSplit <- recursiveSplitModule(countsFiltered, initialL = L, maxL = L, perplexity = FALSE, verbose = FALSE )) } initialModel <- subsetCeldaList(initialModuleSplit, list(L = L)) .logMessages( date(), "...... Reducing dimensionality with UMAP", logfile = logfile, append = TRUE, verbose = verbose ) ## Louvan graph-based method to reduce dimension into 2 cluster nNeighbors <- min(15, ncol(countsFiltered)) # resUmap <- uwot::umap(t(sqrt(fm)), n_neighbors = nNeighbors, # min_dist = 0.01, spread = 1) # rm(fm) resUmap <- celdaUmap(countsFiltered, initialModel, minDist = 0.01, spread = 1, nNeighbors = nNeighbors, seed = seed ) .logMessages( date(), " ...... Determining cell clusters with DBSCAN (Eps=", dbscanEps, ")", sep = "", logfile = logfile, append = TRUE, verbose = verbose ) # Use dbSCAN on the UMAP to identify broad cell types totalClusters <- 1 while (totalClusters <= 1 & dbscanEps > 0) { resDbscan <- dbscan::dbscan(resUmap, dbscanEps) dbscanEps <- dbscanEps - (0.25 * dbscanEps) totalClusters <- length(unique(resDbscan$cluster)) } return(list( "z" = resDbscan$cluster, "umap" = resUmap )) } ## process varGenes .processvarGenes <- function(varGenes) { if (is.null(varGenes)) { varGenes <- 5000 } else { if (varGenes < 2 | length(varGenes) > 1) { stop("Parameter 'varGenes' must be an integer larger than 1.") } } return(varGenes) } ## process dbscanEps for resolusion threshold using DBSCAN .processdbscanEps <- function(dbscanEps) { if (is.null(dbscanEps)) { dbscanEps <- 1 } else { if (dbscanEps < 0) { stop("Parameter 'dbscanEps' needs to be non-negative.") } } return(dbscanEps) } .checkDelta <- function(delta) { if (!is.numeric(delta) | length(delta) != 2 | any(delta < 0)) { stop("'delta' needs to be a numeric vector of length 2", " containing positive values.") } return(delta) } ######################### # Simulating Data # ######################### #' @title Simulate contaminated count matrix #' @description This function generates a list containing two count matrices -- #' one for real expression, the other one for contamination, as well as other #' parameters used in the simulation which can be useful for running #' decontamination. #' @param C Integer. Number of cells to be simulated. Default \code{300}. #' @param G Integer. Number of genes to be simulated. Default \code{100}. #' @param K Integer. Number of cell populations to be simulated. #' Default \code{3}. #' @param NRange Integer vector. A vector of length 2 that specifies the lower #' and upper bounds of the number of counts generated for each cell. Default #' \code{c(500, 1000)}. #' @param beta Numeric. Concentration parameter for Phi. Default \code{0.1}. #' @param delta Numeric or Numeric vector. Concentration parameter for Theta. #' If input as a single numeric value, symmetric values for beta #' distribution are specified; if input as a vector of lenght 2, the two #' values will be the shape1 and shape2 paramters of the beta distribution #' respectively. Default \code{c(1, 5)}. #' @param numMarkers Integer. Number of markers for each cell population. #' Default \code{3}. #' @param seed Integer. Passed to \code{\link[withr]{with_seed}}. #' For reproducibility, a default value of 12345 is used. If NULL, no calls to #' \code{\link[withr]{with_seed}} are made. #' @return A list containing the \code{nativeMatirx} (real expression), #' \code{observedMatrix} (real expression + contamination), as well as other #' parameters used in the simulation. #' @author Shiyi Yang, Yuan Yin, Joshua Campbell #' @examples #' contaminationSim <- simulateContamination(K = 3, delta = c(1, 10)) #' @export simulateContamination <- function(C = 300, G = 100, K = 3, NRange = c(500, 1000), beta = 0.1, delta = c(1, 10), numMarkers = 3, seed = 12345) { if (is.null(seed)) { res <- .simulateContaminatedMatrix( C = C, G = G, K = K, NRange = NRange, beta = beta, delta = delta, numMarkers = numMarkers ) } else { with_seed( seed, res <- .simulateContaminatedMatrix( C = C, G = G, K = K, NRange = NRange, beta = beta, delta = delta, numMarkers = numMarkers ) ) } return(res) } .simulateContaminatedMatrix <- function(C = 300, G = 100, K = 3, NRange = c(500, 1000), beta = 0.5, delta = c(1, 2), numMarkers = 3) { if (length(delta) == 1) { cpByC <- stats::rbeta( n = C, shape1 = delta, shape2 = delta ) } else { cpByC <- stats::rbeta( n = C, shape1 = delta[1], shape2 = delta[2] ) } z <- sample(seq(K), size = C, replace = TRUE) if (length(unique(z)) < K) { warning( "Only ", length(unique(z)), " clusters are simulated. Try to increase numebr of cells 'C' if", " more clusters are needed" ) K <- length(unique(z)) z <- plyr::mapvalues(z, unique(z), seq(length(unique(z)))) } NbyC <- sample(seq(min(NRange), max(NRange)), size = C, replace = TRUE ) cNbyC <- vapply(seq(C), function(i) { stats::rbinom( n = 1, size = NbyC[i], p = cpByC[i] ) }, integer(1)) rNbyC <- NbyC - cNbyC phi <- .rdirichlet(K, rep(beta, G)) ## Select random genes to be markers in each cell population ## by setting their values to zero. if (K * numMarkers > G) { stop("The number of markers ('numMarkers') times the number of cell", " populations ('K') cannot be greater than the number of", " genes ('G').") } markerKIndex <- rep(seq(K), each = numMarkers) markerRowIndex <- sample(seq(G), numMarkers * K) for (i in seq(K)) { ix <- markerRowIndex[markerKIndex == i] phi[i, ix] <- max(phi[i, ]) for (j in setdiff(seq(K), i)) { phi[j, ix] <- 0 } } phi <- prop.table(phi, margin = 1) ## sample real expressed count matrix cellRmat <- vapply(seq(C), function(i) { stats::rmultinom(1, size = rNbyC[i], prob = phi[z[i], ]) }, integer(G)) rownames(cellRmat) <- paste0("Gene_", seq(G)) colnames(cellRmat) <- paste0("Cell_", seq(C)) ## Get list of marker names markerNames <- list() for (i in seq(K)) { markerNames[[i]] <- rownames(cellRmat)[markerRowIndex[markerKIndex == i]] } names(markerNames) <- paste0("CellType_", seq(K), "_Markers") ## sample contamination count matrix nGByK <- rowSums(cellRmat) - .colSumByGroup(cellRmat, group = z, K = K) eta <- normalizeCounts(counts = nGByK, normalize = "proportion") cellCmat <- vapply(seq(C), function(i) { stats::rmultinom(1, size = cNbyC[i], prob = eta[, z[i]]) }, integer(G)) cellOmat <- cellRmat + cellCmat contamination <- colSums(cellCmat) / colSums(cellOmat) rownames(cellOmat) <- paste0("Gene_", seq(G)) colnames(cellOmat) <- paste0("Cell_", seq(C)) return( list( "nativeCounts" = cellRmat, "observedCounts" = cellOmat, "NByC" = NbyC, "z" = z, "eta" = eta, "phi" = t(phi), "markers" = markerNames, "numMarkers" = numMarkers, "contamination" = contamination ) ) } .checkBackground <- function(x, background, bgBatch, logfile = NULL, verbose = FALSE) { # Remove background barcodes that have already appeared in x # If bgBatch param is supplied, also remove duplicate bgBatch if (!is.null(colnames(background))) { dupBarcode <- colnames(background) %in% colnames(x) } else { dupBarcode <- FALSE warning("No column names were found for the 'background' matrix. ", "No checking was performed between the ids in the 'backgroud' ", "matrix and 'x'.", " Please ensure that no true cells are included in the background ", "matrix. Otherwise, results will be incorrect.") } if (any(dupBarcode)) { .logMessages( date(), ".. ", sum(dupBarcode), " cells in the background matrix were removed as they were found in", " the filtered matrix.", logfile = logfile, append = TRUE, verbose = verbose ) background <- background[, !(dupBarcode), drop = FALSE] if (!is.null(bgBatch)) { if (length(bgBatch) != length(dupBarcode)) { stop( "Length of bgBatch must be equal to the number of columns", "of background matrix." ) } bgBatch <- bgBatch[!(dupBarcode)] } } re <- list(background = background, bgBatch = bgBatch) return(re) } ================================================ FILE: R/elbow.R ================================================ # https://stackoverflow.com/questions/35194048/using-r-how-to-calculate #-the-distance-from-one-point-to-a-line # http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html # Kimberling, C. "Triangle Centers and Central Triangles." Congr. # Numer. 129, 1-295, 1998. .dist2d <- function(a, b, c) { v1 <- b - c v2 <- a - b m <- cbind(v1, v2) d <- abs(det(m)) / sqrt(sum(v1 * v1)) return(d) } .secondDerivativeEstimate <- function(v) { nv <- length(v) res <- rep(NA, nv) for (i in seq(2, nv - 1)) { res[i] <- v[i + 1] + v[i - 1] - (2 * v[i]) } return(res) } .curveElbow <- function(var, perplexity, pvalCutoff = 0.05) { len <- length(perplexity) a <- c(var[1], perplexity[1]) b <- c(var[len], perplexity[len]) res <- rep(NA, len) for (i in seq_along(var)) { res[i] <- .dist2d(c(var[i], perplexity[i]), a, b) } elbow <- which.max(res) ix <- var > var[elbow] perplexitySde <- .secondDerivativeEstimate(perplexity) perplexitySdeSd <- stats::sd(perplexitySde[ix], na.rm = TRUE) perplexitySdeMean <- mean(perplexitySde[ix], na.rm = TRUE) perplexitySdePval <- stats::pnorm(perplexitySde, mean = perplexitySdeMean, sd = perplexitySdeSd, lower.tail = FALSE ) # other <- which(ix & perplexitySdePval < pvalCutoff) return(list(elbow = var[elbow])) } ================================================ FILE: R/factorizeMatrix.R ================================================ #' @title Generate factorized matrices showing each feature's influence on cell #' / gene clustering #' @description Generates factorized matrices showing the contribution of each #' feature in each cell population or each cell population in each sample. #' @param x Can be one of #' \itemize{ #' \item A \linkS4class{SingleCellExperiment} object returned by #' \link{celda_C}, \link{celda_G} or \link{celda_CG}, with the matrix #' located in the \code{useAssay} assay slot in \code{altExp(x, altExpName)}. #' Rows represent features and columns represent cells. #' \item Integer counts matrix. Rows represent features and columns represent #' cells. This matrix should be the same as the one used to generate #' \code{celdaMod}.} #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. #' Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param celdaMod Celda model object. Only works if \code{x} is an integer #' counts matrix. #' @param type Character vector. A vector containing one or more of "counts", #' "proportion", or "posterior". "counts" returns the raw number of counts for #' each factorized matrix. "proportions" returns the normalized probabilities #' for each factorized matrix, which are calculated by dividing the raw counts #' in each factorized matrix by the total counts in each column. "posterior" #' returns the posterior estimates which include the addition of the Dirichlet #' concentration parameter (essentially as a pseudocount). Default #' \code{"counts"}. #' @export setGeneric("factorizeMatrix", function(x, celdaMod, useAssay = "counts", altExpName = "featureSubset", type = c("counts", "proportion", "posterior")) { standardGeneric("factorizeMatrix")}) #' @examples #' data(sceCeldaCG) #' factorizedMatrices <- factorizeMatrix(sceCeldaCG, type = "posterior") #' @rdname factorizeMatrix #' @export setMethod("factorizeMatrix", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", type = c("counts", "proportion", "posterior")) { altExp <- SingleCellExperiment::altExp(x, e = altExpName) counts <- SummarizedExperiment::assay(altExp, i = useAssay) counts <- .processCounts(counts) beta <- S4Vectors::metadata(altExp)$celda_parameters$beta rNames <- rownames(altExp) if (celdaModel(x, altExpName = altExpName) == "celda_C") { z <- as.integer( SummarizedExperiment::colData(altExp)$celda_cell_cluster) K <- S4Vectors::metadata(altExp)$celda_parameters$K alpha <- S4Vectors::metadata(altExp)$celda_parameters$alpha sampleLabel <- SummarizedExperiment::colData(altExp)$celda_sample_label sNames <- S4Vectors::metadata(altExp)$celda_parameters$sampleLevels res <- .factorizeMatrixC( counts = counts, z = z, K = K, alpha = alpha, beta = beta, sampleLabel = sampleLabel, rNames = rNames, sNames = sNames, type = type) } else if (celdaModel(x, altExpName = altExpName) == "celda_CG") { K <- S4Vectors::metadata(altExp)$celda_parameters$K z <- as.integer( SummarizedExperiment::colData(altExp)$celda_cell_cluster) y <- as.integer( SummarizedExperiment::rowData(altExp)$celda_feature_module) L <- S4Vectors::metadata(altExp)$celda_parameters$L alpha <- S4Vectors::metadata(altExp)$celda_parameters$alpha delta <- S4Vectors::metadata(altExp)$celda_parameters$delta gamma <- S4Vectors::metadata(altExp)$celda_parameters$gamma sampleLabel <- SummarizedExperiment::colData(altExp)$celda_sample_label cNames <- colnames(altExp) sNames <- S4Vectors::metadata(altExp)$celda_parameters$sampleLevels res <- .factorizeMatrixCG( counts = counts, K = K, z = z, y = y, L = L, alpha = alpha, beta = beta, delta = delta, gamma = gamma, sampleLabel = sampleLabel, cNames = cNames, rNames = rNames, sNames = sNames, type = type) } else if (celdaModel(x, altExpName = altExpName) == "celda_G") { y <- as.integer( SummarizedExperiment::rowData(altExp)$celda_feature_module) L <- S4Vectors::metadata(altExp)$celda_parameters$L delta <- S4Vectors::metadata(altExp)$celda_parameters$delta gamma <- S4Vectors::metadata(altExp)$celda_parameters$gamma cNames <- colnames(altExp) res <- .factorizeMatrixG( counts = counts, y = y, L = L, beta = beta, delta = delta, gamma = gamma, cNames = cNames, rNames = rNames, type = type) } else { stop("S4Vectors::metadata(altExp(x, altExpName))$", "celda_parameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'") } return(res) }) #' @return For celda_CG model, A list with elements for "counts", "proportions", #' or "posterior" probabilities. Each element will be a list containing #' factorized matrices for "module", "cellPopulation", and "sample". #' Additionally, the contribution of each module in each individual cell will #' be included in the "cell" element of "counts" and "proportions" elements. #' @examples #' data(celdaCGSim, celdaCGMod) #' factorizedMatrices <- factorizeMatrix( #' celdaCGSim$counts, #' celdaCGMod, #' "posterior") #' @rdname factorizeMatrix #' @export setMethod("factorizeMatrix", signature(x = "ANY", celdaMod = "celda_CG"), function(x, celdaMod, type = c("counts", "proportion", "posterior")) { counts <- .processCounts(x) compareCountMatrix(counts, celdaMod) z <- as.integer(celdaClusters(celdaMod)$z) y <- as.integer(celdaClusters(celdaMod)$y) # Sometimes, fewer clusters get returned by celda_C/G # Taking the max(z)/max(y) rather than # the original K/L will prevent errors # K <- params(celdaMod)$K; L <- params(celdaMod)$L K <- max(z) L <- max(y) alpha <- params(celdaMod)$alpha beta <- params(celdaMod)$beta delta <- params(celdaMod)$delta gamma <- params(celdaMod)$gamma sampleLabel <- sampleLabel(celdaMod) cNames <- matrixNames(celdaMod)$column rNames <- matrixNames(celdaMod)$row sNames <- matrixNames(celdaMod)$sample res <- .factorizeMatrixCG( counts = counts, K = K, z = z, y = y, L = L, alpha = alpha, beta = beta, delta = delta, gamma = gamma, sampleLabel = sampleLabel, cNames = cNames, rNames = rNames, sNames = sNames, type = type) return(res) } ) .factorizeMatrixCG <- function(counts, K, z, y, L, alpha, beta, delta, gamma, sampleLabel, cNames, rNames, sNames, type) { s <- as.integer(sampleLabel) ## Calculate counts one time up front p <- .cCGDecomposeCounts(counts, s, z, y, K, L) nS <- p$nS nG <- p$nG nM <- p$nM mCPByS <- p$mCPByS nTSByC <- p$nTSByC nTSByCP <- p$nTSByCP nByG <- p$nByG nByTS <- p$nByTS nGByTS <- p$nGByTS nGByTS[nGByTS == 0] <- 1 GByTS <- matrix(0, nrow = length(y), ncol = L) GByTS[cbind(seq(nG), y)] <- p$nByG LNames <- paste0("L", seq(L)) KNames <- paste0("K", seq(K)) colnames(nTSByC) <- cNames rownames(nTSByC) <- LNames colnames(GByTS) <- LNames rownames(GByTS) <- rNames rownames(mCPByS) <- KNames colnames(mCPByS) <- sNames colnames(nTSByCP) <- KNames rownames(nTSByCP) <- LNames countsList <- c() propList <- c() postList <- c() res <- list() if (any("counts" %in% type)) { countsList <- list( sample = mCPByS, cellPopulation = nTSByCP, cell = nTSByC, module = GByTS, geneDistribution = nGByTS ) res <- c(res, list(counts = countsList)) } if (any("proportion" %in% type)) { ## Need to avoid normalizing cell/gene states with zero cells/genes uniqueZ <- sort(unique(z)) tempNTSByCP <- nTSByCP tempNTSByCP[, uniqueZ] <- normalizeCounts(tempNTSByCP[, uniqueZ], normalize = "proportion" ) uniqueY <- sort(unique(y)) tempGByTS <- GByTS tempGByTS[, uniqueY] <- normalizeCounts(tempGByTS[, uniqueY], normalize = "proportion" ) tempNGByTS <- nGByTS / sum(nGByTS) propList <- list( sample = normalizeCounts(mCPByS, normalize = "proportion" ), cellPopulation = tempNTSByCP, cell = normalizeCounts(nTSByC, normalize = "proportion"), module = tempGByTS, geneDistribution = tempNGByTS ) res <- c(res, list(proportions = propList)) } if (any("posterior" %in% type)) { gs <- GByTS gs[cbind(seq(nG), y)] <- gs[cbind(seq(nG), y)] + delta gs <- normalizeCounts(gs, normalize = "proportion") tempNGByTS <- (nGByTS + gamma) / sum(nGByTS + gamma) postList <- list( sample = normalizeCounts(mCPByS + alpha, normalize = "proportion" ), cellPopulation = normalizeCounts(nTSByCP + beta, normalize = "proportion" ), module = gs, geneDistribution = tempNGByTS ) res <- c(res, posterior = list(postList)) } return(res) } #' @examples #' data(celdaCSim, celdaCMod) #' factorizedMatrices <- factorizeMatrix( #' celdaCSim$counts, #' celdaCMod, "posterior" #' ) #' @return For celda_C model, a list with elements for "counts", "proportions", #' or "posterior" probabilities. Each element will be a list containing #' factorized matrices for "module" and "sample". #' @rdname factorizeMatrix #' @export setMethod("factorizeMatrix", signature(x = "ANY", celdaMod = "celda_C"), function(x, celdaMod, type = c("counts", "proportion", "posterior")) { counts <- .processCounts(x) compareCountMatrix(counts, celdaMod) z <- as.integer(celdaClusters(celdaMod)$z) # Sometimes, fewer clusters get returned by celda_C # Taking the max(z) rather than the # original K will prevent errors # K <- params(celdaMod)$K K <- max(z) alpha <- params(celdaMod)$alpha beta <- params(celdaMod)$beta sampleLabel <- sampleLabel(celdaMod) rNames <- matrixNames(celdaMod)$row sNames <- matrixNames(celdaMod)$sample res <- .factorizeMatrixC( counts = counts, z = z, K = K, alpha = alpha, beta = beta, sampleLabel = sampleLabel, rNames = rNames, sNames = sNames, type = type) return(res) } ) .factorizeMatrixC <- function( counts, z, K, alpha, beta, sampleLabel, rNames, sNames, type) { s <- as.integer(sampleLabel) p <- .cCDecomposeCounts(counts, s, z, K) mCPByS <- p$mCPByS nGByCP <- p$nGByCP KNames <- paste0("K", seq(K)) rownames(nGByCP) <- rNames colnames(nGByCP) <- KNames rownames(mCPByS) <- KNames colnames(mCPByS) <- sNames countsList <- c() propList <- c() postList <- c() res <- list() if (any("counts" %in% type)) { countsList <- list(sample = mCPByS, module = nGByCP) res <- c(res, list(counts = countsList)) } if (any("proportion" %in% type)) { ## Need to avoid normalizing cell/gene states with zero cells/genes uniqueZ <- sort(unique(z)) tempNGByCP <- nGByCP tempNGByCP[, uniqueZ] <- normalizeCounts(tempNGByCP[, uniqueZ], normalize = "proportion" ) propList <- list( sample = normalizeCounts(mCPByS, normalize = "proportion" ), module = tempNGByCP ) res <- c(res, list(proportions = propList)) } if (any("posterior" %in% type)) { postList <- list( sample = normalizeCounts(mCPByS + alpha, normalize = "proportion" ), module = normalizeCounts(nGByCP + beta, normalize = "proportion" ) ) res <- c(res, posterior = list(postList)) } return(res) } #' @return For celda_G model, a list with elements for "counts", "proportions", #' or "posterior" probabilities. Each element will be a list containing #' factorized matrices for "module" and "cell". #' @examples #' data(celdaGSim, celdaGMod) #' factorizedMatrices <- factorizeMatrix( #' celdaGSim$counts, #' celdaGMod, "posterior" #' ) #' @rdname factorizeMatrix #' @export setMethod("factorizeMatrix", signature(x = "ANY", celdaMod = "celda_G"), function(x, celdaMod, type = c("counts", "proportion", "posterior")) { counts <- .processCounts(x) compareCountMatrix(counts, celdaMod) y <- as.integer(celdaClusters(celdaMod)$y) # Sometimes, fewer clusters get returned by celda_G # Taking the max(y) rather than the original # L will prevent errors # L <- params(celdaMod)$L L <- max(y) beta <- params(celdaMod)$beta delta <- params(celdaMod)$delta gamma <- params(celdaMod)$gamma cNames <- matrixNames(celdaMod)$column rNames <- matrixNames(celdaMod)$row res <- .factorizeMatrixG( counts = counts, y = y, L = L, beta = beta, delta = delta, gamma = gamma, cNames = cNames, rNames = rNames, type = type) return(res) } ) .factorizeMatrixG <- function( counts, y, L, beta, delta, gamma, cNames, rNames, type) { p <- .cGDecomposeCounts(counts = counts, y = y, L = L) nTSByC <- p$nTSByC nByG <- p$nByG nByTS <- p$nByTS nGByTS <- p$nGByTS nGByTS[nGByTS == 0] <- 1 nM <- p$nM nG <- p$nG rm(p) GByTS <- matrix(0, nrow = length(y), ncol = L) GByTS[cbind(seq(nG), y)] <- nByG LNames <- paste0("L", seq(L)) colnames(nTSByC) <- cNames rownames(nTSByC) <- LNames colnames(GByTS) <- LNames rownames(GByTS) <- rNames names(nGByTS) <- LNames countsList <- c() propList <- c() postList <- c() res <- list() if (any("counts" %in% type)) { countsList <- list( cell = nTSByC, module = GByTS, geneDistribution = nGByTS ) res <- c(res, list(counts = countsList)) } if (any("proportion" %in% type)) { ## Need to avoid normalizing cell/gene states with zero cells/genes uniqueY <- sort(unique(y)) tempGByTS <- GByTS tempGByTS[, uniqueY] <- normalizeCounts(tempGByTS[, uniqueY], normalize = "proportion" ) tempNGByTS <- nGByTS / sum(nGByTS) propList <- list( cell = normalizeCounts(nTSByC, normalize = "proportion" ), module = tempGByTS, geneDistribution = tempNGByTS ) res <- c(res, list(proportions = propList)) } if (any("posterior" %in% type)) { gs <- GByTS gs[cbind(seq(nG), y)] <- gs[cbind(seq(nG), y)] + delta gs <- normalizeCounts(gs, normalize = "proportion") tempNGByTS <- (nGByTS + gamma) / sum(nGByTS + gamma) postList <- list( cell = normalizeCounts(nTSByC + beta, normalize = "proportion" ), module = gs, geneDistribution = tempNGByTS ) res <- c(res, posterior = list(postList)) } return(res) } ================================================ FILE: R/featureModuleLookup.R ================================================ #' @title Obtain the gene module of a gene of interest #' @description This function will output the corresponding feature module for #' a specified vector of genes from a celda_CG or celda_G \code{celdaModel}. #' \code{features} must match the rownames of \code{sce}. #' @param sce A \linkS4class{SingleCellExperiment} object returned by #' \link{celda_G}, or \link{celda_CG}, with the matrix #' located in the \code{useAssay} assay slot. #' Rows represent features and columns represent cells. #' @param features Character vector. Identify feature modules for the specified #' feature names. \code{feature} must match the rownames of \code{sce}. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param exactMatch Logical. Whether to look for exactMatch of the gene name #' within counts matrix. Default \code{TRUE}. #' @param by Character. Where to search for \code{features} in the sce object. #' If set to \code{"rownames"} then the features will be searched for among #' rownames(sce). This can also be set to one of the \code{colnames} of #' rowData(sce). Default \code{"rownames"}. #' @return Numeric vector containing the module numbers for each feature. If #' the feature was not found, then an \code{NA} value will be returned in that #' position. If no features were found, then an error will be given. #' @export setGeneric("featureModuleLookup", function(sce, features, altExpName = "featureSubset", exactMatch = TRUE, by = "rownames") { standardGeneric("featureModuleLookup")}) #' @examples #' data(sceCeldaCG) #' module <- featureModuleLookup(sce = sceCeldaCG, #' features = c("Gene_1", "Gene_XXX")) #' @export #' @rdname featureModuleLookup setMethod("featureModuleLookup", signature(sce = "SingleCellExperiment"), function(sce, features, altExpName = "featureSubset", exactMatch = TRUE, by = "rownames") { modules <- as.numeric(celdaModules(sce, altExpName = altExpName)) if (celdaModel(sce, altExpName = altExpName) %in% c("celda_CG", "celda_G")) { altExp <- SingleCellExperiment::altExp(sce, altExpName) featureIndex <- retrieveFeatureIndex(features, x = altExp, exactMatch = exactMatch, by = by) featureModules <- modules[featureIndex] names(featureModules) <- features } else { stop("S4Vectors::metadata(altExp(sce, altExpName))$", "celda_parameters$model must be", " one of 'celda_G', or 'celda_CG'") } return(featureModules) } ) ================================================ FILE: R/geneSetEnrich.R ================================================ #' @title Gene set enrichment #' @description Identify and return significantly-enriched terms for each gene #' module in a Celda object or a \linkS4class{SingleCellExperiment} object. #' Performs gene set enrichment analysis for Celda #' identified modules using the \link[enrichR]{enrichr}. #' @author Ahmed Youssef, Zhe Wang #' @param x A numeric \link{matrix} of counts or a #' \linkS4class{SingleCellExperiment} #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. Rownames of the #' matrix or \linkS4class{SingleCellExperiment} object should be gene names. #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a #' \linkS4class{SingleCellExperiment} object. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param celdaModel Celda object of class \code{celda_G} or \code{celda_CG}. #' @param databases Character vector. Name of reference database. Available #' databases can be viewed by \link[enrichR]{listEnrichrDbs}. #' @param fdr False discovery rate (FDR). Numeric. Cutoff value for adjusted #' p-value, terms with FDR below this value are considered significantly #' enriched. #' @return List of length 'L' where each member contains the significantly #' enriched terms for the corresponding module. #' @importFrom enrichR enrichr #' @importFrom enrichR listEnrichrDbs #' @export setGeneric("geneSetEnrich", function(x, celdaModel, useAssay = "counts", altExpName = "featureSubset", databases, fdr = 0.05) { standardGeneric("geneSetEnrich")}) #' @rdname geneSetEnrich #' @examples #' library(M3DExampleData) #' counts <- M3DExampleData::Mmus_example_list$data #' # subset 500 genes for fast clustering #' counts <- counts[seq(1501, 2000), ] #' # cluster genes into 10 modules for quick demo #' sce <- celda_G(x = as.matrix(counts), L = 10, verbose = FALSE) #' gse <- geneSetEnrich(sce, #' databases = c("GO_Biological_Process_2018", "GO_Molecular_Function_2018")) #' @export setMethod("geneSetEnrich", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", databases, fdr = 0.05) { altExp <- SingleCellExperiment::altExp(x, e = altExpName) # initialize list with one entry for each gene module modules <- vector("list", length = S4Vectors::metadata(altExp)$celda_parameters$L) # create dataframe with gene-module associations genes <- data.frame(gene = rownames(altExp), module = celdaModules(x, altExpName = altExpName)) # iterate over each module, get genes in that module, add to list for (i in seq_len(S4Vectors::metadata(altExp)$celda_parameters$L)) { modules[[i]] <- as.character(genes[genes$module == i, "gene"]) } # enrichment analysis enrichment <- lapply(modules, function(module) { invisible(utils::capture.output(table <- enrichR::enrichr( genes = module, databases = databases))) table <- Reduce(f = rbind, x = table) table[table$Adjusted.P.value < fdr, "Term"] }) # return results as a list return(enrichment) } ) #' @rdname geneSetEnrich #' @export setMethod("geneSetEnrich", signature(x = "matrix"), function(x, celdaModel, databases, fdr = 0.05) { # check for correct celda object if (!(class(celdaModel) %in% c("celda_G", "celda_CG"))) { stop( "No gene modules in celda object. ", "Please provide object of class celda_G or celda_CG." ) } # initialize list with one entry for each gene module modules <- vector("list", length = params(celdaModel)$L) # create dataframe with gene-module associations genes <- data.frame(gene = rownames(x), module = celdaClusters(celdaModel)$y) # iterate over each module, get genes in that module, add to list for (i in seq_len(params(celdaModel)$L)) { modules[[i]] <- as.character(genes[genes$module == i, "gene"]) } # enrichment analysis enrichment <- lapply(modules, function(module) { invisible(utils::capture.output(table <- enrichR::enrichr( genes = module, databases = databases ))) table <- Reduce(f = rbind, x = table) table[table$Adjusted.P.value < fdr, "Term"] }) # return results as a list return(enrichment) } ) ================================================ FILE: R/initialize_clusters.R ================================================ .initializeCluster <- function(N, len, z = NULL, initial = NULL, fixed = NULL) { # If initial values are given, then they will not be randomly initialized if (!is.null(initial)) { initValues <- sort(unique(initial)) if (length(unique(initial)) != N || length(initial) != len || !all(initValues %in% seq(N))) { stop( "'initial' needs to be a vector of length 'len'", " containing N unique values." ) } z <- as.integer(as.factor(initial)) } else { z <- rep(NA, len) } # Set any values that need to be fixed during sampling if (!is.null(fixed)) { fixedValues <- sort(unique(fixed)) if (length(fixed) != len || !all(fixedValues %in% seq(N))) { stop( "'fixed' to be a vector of length 'len' where each entry is", " one of N unique values or NA." ) } fixedIx <- !is.na(fixed) z[fixedIx] <- fixed[fixedIx] zNotUsed <- setdiff(seq(N), unique(fixed[fixedIx])) } else { zNotUsed <- seq(N) fixedIx <- rep(FALSE, len) } # Randomly sample remaining values zNa <- which(is.na(z)) if (length(zNa) > 0) { z[zNa] <- sample(zNotUsed, length(zNa), replace = TRUE) } # Check to ensure each value is in the vector at least once missing <- setdiff(seq(N), z) for (i in missing) { ta <- sort(table(z[!fixedIx]), decreasing = TRUE) if (ta[1] == 1) { stop("'len' is not long enough to accomodate 'N' unique values") } ix <- which(z == as.integer(names(ta))[1] & !fixedIx) z[sample(ix, 1)] <- i } return(z) } .initializeSplitZ <- function(counts, K, KSubcluster = NULL, alpha = 1, beta = 1, minCell = 3) { s <- rep(1, ncol(counts)) if (is.null(KSubcluster)) { KSubcluster <- ceiling(sqrt(K)) } # Initialize the model with KSubcluster clusters res <- .celda_C( counts, K = min(KSubcluster, ncol(counts)), maxIter = 20, zInitialize = "random", alpha = alpha, beta = beta, splitOnIter = -1, splitOnLast = FALSE, verbose = FALSE, reorder = FALSE ) overallZ <- as.integer(as.factor(celdaClusters(res)$z)) currentK <- max(overallZ) counter <- 0 while (currentK < K & counter < 25) { # Determine which clusters are split-able # KRemaining <- K - currentK KPerCluster <- min(ceiling(K / currentK), KSubcluster) KToUse <- ifelse(KPerCluster < 2, 2, KPerCluster) zTa <- tabulate(overallZ, max(overallZ)) zToSplit <- which(zTa > minCell & zTa > KToUse) if (length(zToSplit) > 1) { zToSplit <- sample(zToSplit) } else if (length(zToSplit) == 0) { break } # Cycle through each splitable cluster and split it up into # K.sublcusters for (i in zToSplit) { clustLabel <- .celda_C(counts[, overallZ == i, drop = FALSE], K = KToUse, zInitialize = "random", alpha = alpha, beta = beta, maxIter = 20, splitOnIter = -1, splitOnLast = FALSE, verbose = FALSE, reorder = FALSE) tempZ <- as.integer(as.factor(celdaClusters(clustLabel)$z)) # Reassign clusters with label > 1 splitIx <- tempZ > 1 ix <- overallZ == i newZ <- overallZ[ix] newZ[splitIx] <- currentK + tempZ[splitIx] - 1 overallZ[ix] <- newZ currentK <- max(overallZ) # Ensure that the maximum number of clusters does not get too large' if (currentK > K + 10) { break } } counter <- counter + 1 } # Decompose counts for likelihood calculation p <- .cCDecomposeCounts(counts, s, overallZ, currentK) nS <- p$nS nG <- p$nG nM <- p$nM mCPByS <- p$mCPByS nGByCP <- p$nGByCP nCP <- p$nCP nByC <- p$nByC # Remove clusters 1-by-1 until K is reached while (currentK > K) { # Find second best assignment give current assignments for each cell probs <- .cCCalcEMProbZ(counts, s = s, z = overallZ, K = currentK, mCPByS = mCPByS, nGByCP = nGByCP, nByC = nByC, nCP = nCP, nG = nG, nM = nM, alpha = alpha, beta = beta, doSample = FALSE) zProb <- t(as.matrix(probs$probs)) zProb[cbind(seq(nrow(zProb)), overallZ)] <- NA zSecond <- apply(zProb, 1, which.max) zTa <- tabulate(overallZ, currentK) zNonEmpty <- which(zTa > 0) # Find worst cluster by logLik to remove previousZ <- overallZ llShuffle <- rep(NA, currentK) for (i in zNonEmpty) { ix <- overallZ == i newZ <- overallZ newZ[ix] <- zSecond[ix] p <- .cCReDecomposeCounts( counts, s, newZ, previousZ, nGByCP, currentK) nGByCP <- p$nGByCP mCPByS <- p$mCPByS llShuffle[i] <- .cCCalcLL( mCPByS, nGByCP, s, newZ, currentK, nS, nG, alpha, beta) previousZ <- newZ } # Remove the cluster which had the the largest likelihood after removal zToRemove <- which.max(llShuffle) ix <- overallZ == zToRemove overallZ[ix] <- zSecond[ix] p <- .cCReDecomposeCounts(counts, s, overallZ, previousZ, nGByCP, currentK) nGByCP <- p$nGByCP[, -zToRemove, drop = FALSE] mCPByS <- p$mCPByS[-zToRemove, , drop = FALSE] overallZ <- as.integer(as.factor(overallZ)) currentK <- currentK - 1 } return(overallZ) } .initializeSplitY <- function(counts, L, LSubcluster = NULL, tempK = 100, beta = 1, delta = 1, gamma = 1, minFeature = 3) { if (is.null(LSubcluster)) { LSubcluster <- ceiling(sqrt(L)) } # Collapse cells to managable number of clusters if (!is.null(tempK) && ncol(counts) > tempK) { z <- .initializeSplitZ(counts, K = tempK) counts <- .colSumByGroup(counts, z, length(unique(z))) } # Initialize the model with KSubcluster clusters res <- .celda_G(counts, L = LSubcluster, maxIter = 10, yInitialize = "random", beta = beta, delta = delta, gamma = gamma, splitOnIter = -1, splitOnLast = FALSE, verbose = FALSE, reorder = FALSE) overallY <- as.integer(as.factor(celdaClusters(res)$y)) currentL <- max(overallY) counter <- 0 while (currentL < L & counter < 25) { # Determine which clusters are split-able yTa <- tabulate(overallY, max(overallY)) yToSplit <- sample(which(yTa > minFeature & yTa > LSubcluster)) if (length(yToSplit) == 0) { break } # Cycle through each splitable cluster and split it up into # LSublcusters for (i in yToSplit) { # make sure the colSums of subset counts is not 0 countsY <- counts[overallY == i, , drop = FALSE] countsY <- countsY[, !(colSums(countsY) == 0)] if (ncol(countsY) == 0) { next } clustLabel <- .celda_G( countsY, L = min(LSubcluster, nrow(countsY)), yInitialize = "random", beta = beta, delta = delta, gamma = gamma, maxIter = 20, splitOnIter = -1, splitOnLast = FALSE, verbose = FALSE, reorder = FALSE ) tempY <- as.integer(as.factor(celdaClusters(clustLabel)$y)) # Reassign clusters with label > 1 splitIx <- tempY > 1 ix <- overallY == i newY <- overallY[ix] newY[splitIx] <- currentL + tempY[splitIx] - 1 overallY[ix] <- newY currentL <- max(overallY) # Ensure that the maximum number of clusters does not get too large if (currentL > L + 10) { break } } counter <- counter + 1 } ## Decompose counts for likelihood calculation p <- .cGDecomposeCounts(counts = counts, y = overallY, L = currentL) nTSByC <- p$nTSByC nByG <- p$nByG nByTS <- p$nByTS nGByTS <- p$nGByTS nM <- p$nM nG <- p$nG rm(p) # Pre-compute lgamma values lgbeta <- lgamma((seq(0, max(colSums(counts)))) + beta) lggamma <- lgamma(seq(0, nrow(counts) + L) + gamma) lgdelta <- c(NA, lgamma(seq(nrow(counts) + L) * delta)) # Remove clusters 1-by-1 until L is reached while (currentL > L) { # Find second best assignment give current assignments for each cell probs <- .cGCalcGibbsProbY( counts = counts, y = overallY, L = currentL, nTSByC = nTSByC, nByTS = nByTS, nGByTS = nGByTS, nByG = nByG, nG = nG, beta = beta, delta = delta, gamma = gamma, lgbeta = lgbeta, lggamma = lggamma, lgdelta = lgdelta, doSample = FALSE) yProb <- t(probs$probs) yProb[cbind(seq(nrow(yProb)), overallY)] <- NA ySecond <- apply(yProb, 1, which.max) yTa <- tabulate(overallY, currentL) yNonEmpty <- which(yTa > 0) # Find worst cluster by logLik to remove previousY <- overallY llShuffle <- rep(NA, currentL) for (i in yNonEmpty) { ix <- overallY == i newY <- overallY newY[ix] <- ySecond[ix] # Move arounds counts for likelihood calculation p <- .cGReDecomposeCounts( counts, newY, previousY, nTSByC, nByG, currentL) nTSByC <- p$nTSByC nGByTS <- p$nGByTS nByTS <- p$nByTS llShuffle[i] <- .cGCalcLL( nTSByC, nByTS, nByG, nGByTS, nM, nG, currentL, beta, delta, gamma) previousY <- newY } # Remove the cluster which had the the largest likelihood after removal yToRemove <- which.max(llShuffle) ix <- overallY == yToRemove overallY[ix] <- ySecond[ix] # Move around counts and remove module p <- .cGReDecomposeCounts( counts, overallY, previousY, nTSByC, nByG, currentL) nTSByC <- p$nTSByC[-yToRemove, , drop = FALSE] nGByTS <- p$nGByTS[-yToRemove] nByTS <- p$nByTS[-yToRemove] overallY <- as.integer(as.factor(overallY)) currentL <- currentL - 1 } return(overallY) } ================================================ FILE: R/loglikelihood.R ================================================ #' @title Calculate the Log-likelihood of a celda model #' @description Calculate the log-likelihood for cell population #' and feature module cluster assignments on the count matrix, per celda model. #' @param x A \linkS4class{SingleCellExperiment} object returned by #' \link{celda_C}, \link{celda_G}, or \link{celda_CG}, with the matrix #' located in the \code{useAssay} assay slot. #' Rows represent features and columns represent cells. #' @param useAssay A string specifying which \link{assay} #' slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param celdaMod celda model object. Ignored if \code{x} is a #' \linkS4class{SingleCellExperiment} object. #' @return The log-likelihood of the cluster assignment for the #' provided \linkS4class{SingleCellExperiment}. #' @seealso `celda_C()` for clustering cells #' @export setGeneric("logLikelihood", function(x, celdaMod, useAssay = "counts", altExpName = "featureSubset") { standardGeneric("logLikelihood") }) #' @rdname logLikelihood #' @examples #' data(sceCeldaC, sceCeldaCG) #' loglikC <- logLikelihood(sceCeldaC) #' loglikCG <- logLikelihood(sceCeldaCG) #' @export setMethod("logLikelihood", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset") { altExp <- SingleCellExperiment::altExp(x, altExpName) counts <- SummarizedExperiment::assay(altExp, i = useAssay) sampleLabel <- sampleLabel(x, altExpName = altExpName) z <- as.integer(celdaClusters(x, altExpName = altExpName)) y <- as.integer(celdaModules(x, altExpName = altExpName)) K <- S4Vectors::metadata(altExp)$celda_parameters$K L <- S4Vectors::metadata(altExp)$celda_parameters$L alpha <- S4Vectors::metadata(altExp)$celda_parameters$alpha beta <- S4Vectors::metadata(altExp)$celda_parameters$beta delta <- S4Vectors::metadata(altExp)$celda_parameters$delta gamma <- S4Vectors::metadata(altExp)$celda_parameters$gamma if (celdaModel(x, altExpName = altExpName) == "celda_C") { ll <- .logLikelihoodcelda_C(counts = counts, sampleLabel = sampleLabel, z = z, K = K, alpha = alpha, beta = beta) } else if (celdaModel(x, altExpName = altExpName) == "celda_CG") { ll <- .logLikelihoodcelda_CG(counts = counts, sampleLabel = sampleLabel, z = z, y = y, K = K, L = L, alpha = alpha, beta = beta, delta = delta, gamma = gamma) } else if (celdaModel(x, altExpName = altExpName) == "celda_G") { ll <- .logLikelihoodcelda_G(counts = counts, y = y, L = L, beta = beta, delta = delta, gamma = gamma) } else { stop("S4Vectors::metadata(altExp(x, altExpName))$", "celda_parameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'!") } return(ll) } ) #' @rdname logLikelihood #' @export setMethod("logLikelihood", signature(x = "matrix", celdaMod = "celda_C"), function(x, celdaMod) { sampleLabel <- sampleLabel(celdaMod) z <- as.integer(celdaClusters(celdaMod)$z) K <- params(celdaMod)$K alpha <- params(celdaMod)$alpha beta <- params(celdaMod)$beta ll <- .logLikelihoodcelda_C(counts = x, sampleLabel = sampleLabel, z = z, K = K, alpha = alpha, beta = beta) return(ll) } ) #' @rdname logLikelihood #' @export setMethod("logLikelihood", signature(x = "matrix", celdaMod = "celda_G"), function(x, celdaMod) { y <- as.integer(celdaClusters(celdaMod)$y) L <- params(celdaMod)$L beta <- params(celdaMod)$beta delta <- params(celdaMod)$delta gamma <- params(celdaMod)$gamma ll <- .logLikelihoodcelda_G(counts = x, y = y, L = L, beta = beta, delta = delta, gamma = gamma) return(ll) } ) #' @rdname logLikelihood #' @export setMethod("logLikelihood", signature(x = "matrix", celdaMod = "celda_CG"), function(x, celdaMod) { sampleLabel <- sampleLabel(celdaMod) z <- as.integer(celdaClusters(celdaMod)$z) y <- as.integer(celdaClusters(celdaMod)$y) K <- params(celdaMod)$K L <- params(celdaMod)$L alpha <- params(celdaMod)$alpha beta <- params(celdaMod)$beta delta <- params(celdaMod)$delta gamma <- params(celdaMod)$gamma ll <- .logLikelihoodcelda_CG(counts = x, sampleLabel = sampleLabel, z = z, y = y, K = K, L = L, alpha = alpha, beta = beta, delta = delta, gamma = gamma) return(ll) } ) #' @title Get log-likelihood history #' @description Retrieves the complete log-likelihood from all iterations of #' Gibbs sampling used to generate a celda model. #' @param x A \linkS4class{SingleCellExperiment} object #' returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}, or a celda #' model object. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @return Numeric. The log-likelihood at each step of Gibbs sampling used to #' generate the model. #' @export setGeneric( "logLikelihoodHistory", function(x, altExpName = "featureSubset") { standardGeneric("logLikelihoodHistory") } ) #' @rdname logLikelihoodHistory #' @examples #' data(sceCeldaCG) #' logLikelihoodHistory(sceCeldaCG) #' @export setMethod("logLikelihoodHistory", signature(x = "SingleCellExperiment"), function(x, altExpName = "featureSubset") { altExp <- SingleCellExperiment::altExp(x, altExpName) cll <- S4Vectors::metadata(altExp)$celda_parameters$completeLogLik return(cll) } ) #' @rdname logLikelihoodHistory #' @examples #' data(celdaCGMod) #' logLikelihoodHistory(celdaCGMod) #' @export setMethod("logLikelihoodHistory", signature(x = "celdaModel"), function(x) { cll <- x@completeLogLik return(cll) } ) #' @title Get the log-likelihood #' @description Retrieves the final log-likelihood from all iterations of Gibbs #' sampling used to generate a celdaModel. #' @return Numeric. The log-likelihood at the final step of Gibbs sampling used #' to generate the model. #' @param x A \linkS4class{SingleCellExperiment} object #' returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}, or a celda #' model object. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @export setGeneric("bestLogLikelihood", function(x, altExpName = "featureSubset") { standardGeneric("bestLogLikelihood") } ) #' @rdname bestLogLikelihood #' @examples #' data(sceCeldaCG) #' bestLogLikelihood(sceCeldaCG) #' @export setMethod("bestLogLikelihood", signature(x = "SingleCellExperiment"), function(x, altExpName = "featureSubset") { altExp <- SingleCellExperiment::altExp(x, altExpName) fll <- S4Vectors::metadata(altExp)$celda_parameters$finalLogLik return(fll) } ) #' @rdname bestLogLikelihood #' @examples #' data(celdaCGMod) #' bestLogLikelihood(celdaCGMod) #' @export setMethod("bestLogLikelihood", signature(x = "celdaModel"), function(x) { fll <- x@finalLogLik return(fll) } ) .logLikelihoodcelda_C <- function(counts, sampleLabel, z, K, alpha, beta) { if (sum(z > K) > 0) { stop("Assigned value of cell cluster greater than the total number of", " cell clusters!") } sampleLabel <- .processSampleLabels(sampleLabel, ncol(counts)) s <- as.integer(sampleLabel) p <- .cCDecomposeCounts(counts, s, z, K) final <- .cCCalcLL( mCPByS = p$mCPByS, nGByCP = p$nGByCP, s = s, z = z, K = K, nS = p$nS, nG = p$nG, alpha = alpha, beta = beta ) return(final) } .logLikelihoodcelda_CG <- function(counts, sampleLabel, z, y, K, L, alpha, beta, delta, gamma) { if (sum(z > K) > 0) { stop("Assigned value of cell cluster greater than the total number of", " cell clusters!") } if (sum(y > L) > 0) { stop("Assigned value of feature module greater than the total number", " of feature modules!") } sampleLabel <- .processSampleLabels(sampleLabel, ncol(counts)) s <- as.integer(sampleLabel) p <- .cCGDecomposeCounts(counts, s, z, y, K, L) final <- .cCGCalcLL( K = K, L = L, mCPByS = p$mCPByS, nTSByCP = p$nTSByCP, nByG = p$nByG, nByTS = p$nByTS, nGByTS = p$nGByTS, nS = p$nS, nG = p$nG, alpha = alpha, beta = beta, delta = delta, gamma = gamma) return(final) } .logLikelihoodcelda_G <- function(counts, y, L, beta, delta, gamma) { if (sum(y > L) > 0) { stop("Assigned value of feature module greater than the total number", " of feature modules!") } p <- .cGDecomposeCounts(counts = counts, y = y, L = L) final <- .cGCalcLL( nTSByC = p$nTSByC, nByTS = p$nByTS, nByG = p$nByG, nGByTS = p$nGByTS, nM = p$nM, nG = p$nG, L = L, beta = beta, delta = delta, gamma = gamma ) return(final) } ================================================ FILE: R/matrixSums.R ================================================ .rowSumByGroup <- function(counts, group, L) { if (inherits(counts, "matrix") & is.integer(counts)) { res <- .rowSumByGroupInteger(counts, group, L) } else if (inherits(counts, "matrix") & is.numeric(counts)) { res <- .rowSumByGroupNumeric(counts, group, L) } else if (inherits(counts, "dgCMatrix")) { res <- rowSumByGroupSparse(counts, group, L) } else { stop("'counts' must be an integer, numeric, or dgCMatrix matrix.") } return(res) } .rowSumByGroupChange <- function(counts, pcounts, group, pgroup, L) { if (inherits(counts, "matrix") & is.integer(counts)) { res <- .rowSumByGroupChangeInteger(counts, pcounts, group, pgroup, L) } else if (inherits(counts, "matrix") & is.numeric(counts)) { res <- .rowSumByGroupChangeNumeric(counts, pcounts, group, pgroup, L) } else if (inherits(counts, "dgCMatrix")) { res <- rowSumByGroupChangeSparse(counts, pcounts, group, pgroup, L) } else { stop("'counts' must be an integer, numeric, or dgCMatrix matrix.") } return(res) } .colSumByGroup <- function(counts, group, K) { if (inherits(counts, "matrix") & is.integer(counts)) { res <- .colSumByGroupInteger(counts, group, K) } else if (inherits(counts, "matrix") & is.numeric(counts)) { res <- .colSumByGroupNumeric(counts, group, K) } else if (inherits(counts, "dgCMatrix")) { res <- colSumByGroupSparse(counts, group, K) } else { stop("'counts' must be an integer, numeric, or dgCMatrix matrix.") } return(res) } .colSumByGroupChange <- function(counts, pcounts, group, pgroup, K) { if (inherits(counts, "matrix") & is.integer(counts)) { res <- .colSumByGroupChangeInteger(counts, pcounts, group, pgroup, K) } else if (inherits(counts, "matrix") & is.numeric(counts)) { res <- .colSumByGroupChangeNumeric(counts, pcounts, group, pgroup, K) } else if (inherits(counts, "dgCMatrix")) { res <- colSumByGroupChangeSparse(counts, pcounts, group, pgroup, K) } else { stop("'counts' must be an integer, numeric, or dgCMatrix matrix.") } return(res) } #' @useDynLib celda _rowSumByGroup .rowSumByGroupInteger <- function(x, group, L) { group <- factor(group, levels = seq(L)) res <- .Call("_rowSumByGroup", x, group) return(res) } #' @useDynLib celda _rowSumByGroupChange .rowSumByGroupChangeInteger <- function(x, px, group, pgroup, L) { group <- factor(group, levels = seq(L)) pgroup <- factor(pgroup, levels = seq(L)) res <- .Call("_rowSumByGroupChange", x, px, group, pgroup) return(res) } #' @useDynLib celda _colSumByGroup .colSumByGroupInteger <- function(x, group, K) { group <- factor(group, levels = seq(K)) res <- .Call("_colSumByGroup", x, group) return(res) } #' @useDynLib celda _colSumByGroupChange .colSumByGroupChangeInteger <- function(x, px, group, pgroup, K) { group <- factor(group, levels = seq(K)) pgroup <- factor(pgroup, levels = seq(K)) res <- .Call("_colSumByGroupChange", x, px, group, pgroup) return(res) } #' @useDynLib celda _rowSumByGroup_numeric .rowSumByGroupNumeric <- function(x, group, L) { group <- factor(group, levels = seq(L)) res <- .Call("_rowSumByGroup_numeric", x, group) return(res) } #' @useDynLib celda _colSumByGroup_numeric .colSumByGroupNumeric <- function(x, group, K) { group <- factor(group, levels = seq(K)) res <- .Call("_colSumByGroup_numeric", x, group) return(res) } #' @useDynLib celda _rowSumByGroupChange_numeric .rowSumByGroupChangeNumeric <- function(x, px, group, pgroup, L) { group <- factor(group, levels = seq(L)) pgroup <- factor(pgroup, levels = seq(L)) res <- .Call("_rowSumByGroupChange_numeric", x, px, group, pgroup) return(res) } #' @useDynLib celda _colSumByGroupChange_numeric .colSumByGroupChangeNumeric <- function(x, px, group, pgroup, K) { group <- factor(group, levels = seq(K)) pgroup <- factor(pgroup, levels = seq(K)) res <- .Call("_colSumByGroupChange_numeric", x, px, group, pgroup) return(res) } #' @useDynLib celda _perplexityG .perplexityGLogPx <- function(x, phi, psi, group, L) { group <- factor(group, levels = seq(L)) res <- .Call("_perplexityG", x, phi, psi, group) return(res) } ================================================ FILE: R/misc.R ================================================ #' @title Celda models #' @description List of available Celda models with correpsonding descriptions. #' @export #' @examples #' celda() #' @return None celda <- function() { message( "celda_C: Clusters the columns of a count matrix containing", " single-cell data into K subpopulations." ) message( "celda_G: Clusters the rows of a count matrix containing", " single-cell data into L modules." ) message( "celda_CG: Clusters the rows and columns of a count matrix", " containing single-cell data into L modules and K subpopulations,", " respectively." ) message( "celdaGridSearch: Run Celda with different combinations of", " parameters and multiple chains in parallel." ) } #' @title Append two celdaList objects #' @description Returns a single celdaList representing the combination of two #' provided celdaList objects. #' @return A celdaList object. This object contains all resList entries and #' runParam records from both lists. #' @param list1 A celda_list object #' @param list2 A celda_list object to be joined with list_1 #' @examples #' data(celdaCGGridSearchRes) #' appendedList <- appendCeldaList( #' celdaCGGridSearchRes, #' celdaCGGridSearchRes #' ) #' @importFrom methods new #' @export appendCeldaList <- function(list1, list2) { if (!is.element("celdaList", class(list1)) | !is.element("celdaList", class(list2))) { stop("Both parameters to appendCeldaList must be of class celdaList.") } if (!(countChecksum(list1) == countChecksum(list2))) { warning( "Provided lists have different countChecksums and may have", " been generated from different count matrices. Using checksum", " from first list..." ) } newList <- methods::new( "celdaList", runParams = rbind(runParams(list1), runParams(list2)), resList = c(resList(list1), resList(list2)), countChecksum = countChecksum(list1), perplexity = matrix(nrow = 0, ncol = 0) ) return(newList) } ================================================ FILE: R/moduleHeatmap.R ================================================ #' @title Heatmap for featureModules #' @description Renders a heatmap for selected \code{featureModule}. Cells are #' ordered from those with the lowest probability of the module on the left to #' the highest probability on the right. Features are ordered from those #' with the highest probability in the module #' on the top to the lowest probability on the bottom. #' @param x A numeric \link{matrix} of counts or a #' \linkS4class{SingleCellExperiment} #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. Celda #' results must be present under \code{metadata(altExp(x, altExpName))}. #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a #' \linkS4class{SingleCellExperiment} object. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param modules Integer Vector. The featureModule(s) to display. #' Multiple modules can be included in a vector. Default \code{NULL} which #' plots all module heatmaps. #' @param featureModule Same as \code{modules}. Either can be used to specify #' the modules to display. #' @param col Passed to \link[ComplexHeatmap]{Heatmap}. Set color boundaries #' and colors. #' @param topCells Integer. Number of cells with the highest and lowest #' probabilities for each module to include in the heatmap. For example, if #' \code{topCells = 50}, the 50 cells with the lowest probabilities and #' the 50 cells #' with the highest probabilities for each featureModule will be included. If #' NULL, all cells will be plotted. Default 100. #' @param topFeatures Integer. Plot `topFeatures` features with the highest #' probabilities in the module heatmap for each featureModule. If \code{NULL}, #' plot all features in the module. Default \code{NULL}. #' @param normalizedCounts Integer matrix. Rows represent features and columns #' represent cells. If you have a normalized matrix result from #' \link{normalizeCounts}, you can pass through the result here to #' skip the normalization step in this function. Make sure the colnames and #' rownames match the object in x. This matrix should #' correspond to one generated from this count matrix #' \code{assay(altExp(x, altExpName), i = useAssay)}. If \code{NA}, #' normalization will be carried out in the following form #' \code{normalizeCounts(assay(altExp(x, altExpName), i = useAssay), #' normalize = "proportion", transformationFun = sqrt)}. #' Use of this parameter is particularly useful for plotting many #' module heatmaps, where normalizing the counts matrix repeatedly would #' be too time consuming. Default NA. #' @param normalize Character. Passed to \link{normalizeCounts} if #' \code{normalizedCounts} is \code{NA}. #' Divides counts by the library sizes for each cell. One of 'proportion', #' 'cpm', 'median', or 'mean'. 'proportion' uses the total counts for each #' cell as the library size. 'cpm' divides the library size of each cell by #' one million to produce counts per million. 'median' divides the library #' size of each cell by the median library size across all cells. 'mean' #' divides the library size of each cell by the mean library size across all #' cells. Default "proportion". #' @param transformationFun Function. Passed to \link{normalizeCounts} if #' \code{normalizedCounts} is \code{NA}. Applies a transformation such as #' \link{sqrt}, \link{log}, \link{log2}, \link{log10}, or \link{log1p}. #' If \code{NULL}, no transformation will be applied. Occurs after #' normalization. Default \link{sqrt}. #' @param scaleRow Function. Which function to use to scale each individual #' row. Set to NULL to disable. Occurs after normalization and log #' transformation. For example, \link{scale} will Z-score transform each row. #' Default \link{scale}. #' @param showFeatureNames Logical. Whether feature names should be displayed. #' Default TRUE. #' @param displayName Character. The column name of #' \code{rowData(altExp(x, altExpName))} that specifies the display names for #' the features. Default \code{NULL}, which displays the row names. Only works #' if \code{showFeaturenames} is \code{TRUE} and \code{x} is a #' \linkS4class{SingleCellExperiment} object. #' @param trim Numeric vector. Vector of length two that specifies the lower #' and upper bounds for plotting the data. This threshold is applied #' after row scaling. Set to NULL to disable. Default \code{c(-2,2)}. #' @param rowFontSize Numeric. Font size for feature names. If \code{NULL}, #' then the size will automatically be determined. Default \code{NULL}. #' @param showHeatmapLegend Passed to \link[ComplexHeatmap]{Heatmap}. Show #' legend for expression levels. #' @param showTopAnnotationLegend Passed to #' \link[ComplexHeatmap]{HeatmapAnnotation}. Show legend for cell annotation. #' @param showTopAnnotationName Passed to #' \link[ComplexHeatmap]{HeatmapAnnotation}. Show heatmap top annotation name. #' @param topAnnotationHeight Passed to #' \link[ComplexHeatmap]{HeatmapAnnotation}. Column annotation height. #' \link[ComplexHeatmap]{rowAnnotation}. Show legend for module annotation. #' @param showModuleLabel Show left side module labels. #' @param moduleLabel The left side row titles for module heatmap. Must be #' vector of the same length as \code{featureModule}. Default "auto", which #' automatically pulls module labels from \code{x}. #' @param moduleLabelSize Passed to \link{gpar}. The size of text (in points). #' @param byrow Passed to \link{matrix}. logical. If \code{FALSE} (the default) #' the figure panel is filled by columns, otherwise the figure panel is filled #' by rows. #' @param top Passed to \link[gridExtra]{marrangeGrob}. The title for each page. #' @param unit Passed to \link[grid]{unit}. Single #' character object defining the unit of all dimensions defined. #' @param ncol Integer. Number of columns of module heatmaps. If \code{NULL}, #' then this will be automatically calculated so that the number of columns #' and rows will be approximately the same. Default \code{NULL}. #' @param useRaster Boolean. Rasterizing will make the heatmap a single object #' and reduced the memory of the plot and the size of a file. If \code{NULL}, #' then rasterization will be automatically determined by the underlying #' \link[ComplexHeatmap]{Heatmap} function. Default \code{TRUE}. #' @param returnAsList Boolean. If \code{TRUE}, then a list of plots will be #' returned instead of a single multi-panel figure. These plots can be #' displayed using the \link[grid]{grid.draw} function. Default \code{FALSE}. #' @param ... Additional parameters passed to \link[ComplexHeatmap]{Heatmap}. #' @return A list object if plotting #' more than one module heatmaps. Otherwise a #' \link[ComplexHeatmap]{HeatmapList} object is returned. #' @importFrom methods .hasSlot #' @export setGeneric("moduleHeatmap", function(x, useAssay = "counts", altExpName = "featureSubset", modules = NULL, featureModule = NULL, col = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")), topCells = 100, topFeatures = NULL, normalizedCounts = NA, normalize = "proportion", transformationFun = sqrt, scaleRow = scale, showFeatureNames = TRUE, displayName = NULL, trim = c(-2, 2), rowFontSize = NULL, showHeatmapLegend = FALSE, showTopAnnotationLegend = FALSE, showTopAnnotationName = FALSE, topAnnotationHeight = 5, showModuleLabel = TRUE, moduleLabel = "auto", moduleLabelSize = NULL, byrow = TRUE, top = NA, unit = "mm", ncol = NULL, useRaster = TRUE, returnAsList = FALSE, ...) { standardGeneric("moduleHeatmap")}) #' @rdname moduleHeatmap #' @examples #' data(sceCeldaCG) #' moduleHeatmap(sceCeldaCG, displayName = "rownames") #' @export setMethod("moduleHeatmap", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", modules = NULL, featureModule = NULL, col = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")), topCells = 100, topFeatures = NULL, normalizedCounts = NA, normalize = "proportion", transformationFun = sqrt, scaleRow = scale, showFeatureNames = TRUE, displayName = NULL, trim = c(-2, 2), rowFontSize = NULL, showHeatmapLegend = FALSE, showTopAnnotationLegend = FALSE, showTopAnnotationName = FALSE, topAnnotationHeight = 5, showModuleLabel = TRUE, moduleLabel = "auto", moduleLabelSize = NULL, byrow = TRUE, top = NA, unit = "mm", ncol = NULL, useRaster = TRUE, returnAsList = FALSE, ...) { # 'modules' is an easier parameter name to remember so we include # support for both. if(!is.null(modules)) { featureModule <- modules } altExp <- SingleCellExperiment::altExp(x, altExpName) counts <- SummarizedExperiment::assay(altExp, i = useAssay) if (is.null(colnames(counts))) { stop("colnames(altExp(x, altExpName)) is NULL!", " Please assign column names to x and", " try again.") } if (is.null(rownames(counts))) { stop("rownames(altExp(x, altExpName)) is NULL!", " Please assign row names to x and", " try again.") } if (!(S4Vectors::metadata(altExp)$celda_parameters$model %in% c("celda_G", "celda_CG"))) { stop("metadata(altExp(x, altExpName))$", "celda_parameters$model must be 'celda_G' or", " 'celda_CG'") } if (is.null(featureModule)) { featureModule <- sort(unique(celdaModules(x))) } if (length(featureModule) == 1) { returnHeatmap <- TRUE } else { returnHeatmap <- FALSE } if (moduleLabel == "auto") { moduleLabel <- paste0("Module ", as.character(featureModule)) } else if (length(moduleLabel) == 1 & length(featureModule) > 1) { moduleLabel <- rep(moduleLabel, length(featureModule)) } else if (length(moduleLabel) != length(featureModule)) { stop("Invalid 'moduleLabel' length") } # factorize counts matrix factorizedMatrix <- factorizeMatrix(x, useAssay = useAssay, altExpName = altExpName, type = "proportion") allCellStates <- factorizedMatrix$proportions$cell if (is.na(normalizedCounts)) { normCounts <- normalizeCounts(counts, normalize = normalize, transformationFun = transformationFun) } else { normCounts <- normalizedCounts } # take topRank if (!is.null(topFeatures) && (is.numeric(topFeatures)) | is.integer(topFeatures)) { topRanked <- topRank( matrix = factorizedMatrix$proportions$module, n = topFeatures) } else { topRanked <- topRank( matrix = factorizedMatrix$proportions$module, n = nrow(factorizedMatrix$proportions$module)) } # filter topRank using featureModule into featureIndices featureIndices <- lapply( featureModule, function(module) { topRanked$index[[module]] } ) # Set up displayName variable if specified if (is.null(displayName)) { displayNames <- rownames(altExp) } else { displayNames <- SummarizedExperiment::rowData(altExp)[[ displayName]] } z <- celdaClusters(x, altExpName = altExpName) y <- celdaModules(x, altExpName = altExpName) # Get max rowFontSize if multiple modules are selected if (is.null(rowFontSize)) { if (length(featureIndices) > 1 & !isTRUE(returnAsList)) { # If there is more than 1 module selected, then the miniumum size # size will be caculated for each module. This will ensure that # all modules will have the same rowFontSize and the module # heatmaps will have the same width. maxlen <- max(unlist(lapply(featureIndices, length))) maxlen <- maxlen * sqrt(length(featureIndices)) rowFontSize <- rep(min(200 / maxlen, 20), length(featureIndices)) } else { # If there is only one plot or each plot will be generated # separately and returned in a list, then the size of the labels, # will be caculated for each module separately. len <- unlist(lapply(featureIndices, length)) rowFontSize <- pmin(200 / len, 20) } } plts <- vector("list", length = length(featureModule)) for (i in seq(length(featureModule))) { plts[[i]] <- .plotModuleHeatmap(normCounts = normCounts, col = col, allCellStates = allCellStates, featureIndices = featureIndices[[i]], featureModule = featureModule[i], z = z, y = y, topCells = topCells, altExpName = altExpName, scaleRow = scaleRow, showFeatureNames = showFeatureNames, displayNames = displayNames[featureIndices[[i]]], trim = trim, rowFontSize = rowFontSize[i], showHeatmapLegend = showHeatmapLegend, showTopAnnotationLegend = showTopAnnotationLegend, showTopAnnotationName = showTopAnnotationName, topAnnotationHeight = topAnnotationHeight, showModuleLabel = showModuleLabel, moduleLabel = moduleLabel[i], moduleLabelSize = moduleLabelSize, useRaster = useRaster, unit = unit, ... = ...) } if (isTRUE(returnHeatmap)) { return(plts[[1]]) } else { if (is.null(ncol)) { ncol <- floor(sqrt(length(plts))) } nrow <- ceiling(length(plts) / ncol) for (i in seq(length(plts))) { plts[[i]] <- grid::grid.grabExpr( ComplexHeatmap::draw(plts[[i]]), wrap.grobs = TRUE) } if (isTRUE(returnAsList)) { figure <- plts } else { figure <- gridExtra::marrangeGrob(plts, ncol = ncol, nrow = nrow, layout_matrix = matrix(seq_len(nrow * ncol), nrow = nrow, ncol = ncol, byrow = TRUE), top = NA) } suppressWarnings(return(figure)) } } ) .plotModuleHeatmap <- function(normCounts, col, allCellStates, featureIndices, featureModule, z, y, topCells, altExpName, scaleRow, showFeatureNames, displayNames, trim, rowFontSize, showHeatmapLegend, showTopAnnotationLegend, showTopAnnotationName, topAnnotationHeight, showModuleLabel, moduleLabel, moduleLabelSize, useRaster, unit, ...) { # Determine cell order from factorizedMatrix$proportions$cell cellStates <- allCellStates[featureModule, , drop = TRUE] singleModuleOrdered <- order(cellStates, decreasing = TRUE) if (!is.null(topCells)) { if (topCells * 2 < ncol(allCellStates)) { cellIndices <- c( utils::head(singleModuleOrdered, n = topCells), utils::tail(singleModuleOrdered, n = topCells)) } else { cellIndices <- singleModuleOrdered } } else { cellIndices <- singleModuleOrdered } cellIndices <- rev(cellIndices) # filter counts based on featureIndices filteredNormCounts <- normCounts[featureIndices, cellIndices, drop = FALSE] # Show/hide features with 0 counts in these cells in the module # filteredNormCounts <- # filteredNormCounts[rowSums(filteredNormCounts > 0) > 0, , # drop = FALSE] geneIx <- match(rownames(filteredNormCounts), rownames(normCounts)) cellIx <- match(colnames(filteredNormCounts), colnames(normCounts)) zToPlot <- z[cellIx] uniquezToPlot <- sort(unique(zToPlot)) ccols <- distinctColors(length(unique(z)))[uniquezToPlot] names(ccols) <- uniquezToPlot yToPlot <- y[geneIx] uniqueyToPlot <- sort(unique(yToPlot)) rcols <- distinctColors(length(y))[uniqueyToPlot] names(rcols) <- uniqueyToPlot # scale indivisual rows by scaleRow if (!is.null(scaleRow)) { if (is.function(scaleRow)) { cn <- colnames(filteredNormCounts) filteredNormCounts <- t(base::apply(filteredNormCounts, 1, scaleRow)) colnames(filteredNormCounts) <- cn } else { stop("'scaleRow' needs to be of class 'function'") } # If the standard deviation was 0 then the values will be NA # Replacing the NAs with zero will keep the row the middle color # rather than grey (default with ComplexHeatmap) filteredNormCounts[is.na(filteredNormCounts)] <- 0 } if (!is.null(trim)) { if (length(trim) != 2) { stop( "'trim' should be a 2 element vector specifying the lower", " and upper boundaries" ) } trim <- sort(trim) filteredNormCounts[filteredNormCounts < trim[1]] <- trim[1] filteredNormCounts[filteredNormCounts > trim[2]] <- trim[2] } if (isTRUE(showModuleLabel)) { plt <- ComplexHeatmap::Heatmap(matrix = filteredNormCounts, col = col, row_title = moduleLabel, row_title_gp = grid::gpar(fontsize = moduleLabelSize), show_column_names = FALSE, show_row_names = showFeatureNames, row_labels = displayNames, row_names_gp = grid::gpar(fontsize = rowFontSize), cluster_rows = FALSE, cluster_columns = FALSE, heatmap_legend_param = list(title = "Expression"), show_heatmap_legend = showHeatmapLegend, use_raster = useRaster, top_annotation = ComplexHeatmap::HeatmapAnnotation( cell = factor(zToPlot, levels = stringr::str_sort(unique(zToPlot), numeric = TRUE)), show_legend = showTopAnnotationLegend, show_annotation_name = showTopAnnotationName, col = list(cell = ccols), simple_anno_size = grid::unit(topAnnotationHeight, unit), simple_anno_size_adjust = TRUE), ...) } else { plt <- ComplexHeatmap::Heatmap(matrix = filteredNormCounts, col = col, show_column_names = FALSE, show_row_names = showFeatureNames, row_labels = displayNames, row_names_gp = grid::gpar(fontsize = rowFontSize), cluster_rows = FALSE, cluster_columns = FALSE, heatmap_legend_param = list(title = "Expression"), show_heatmap_legend = showHeatmapLegend, use_raster = useRaster, top_annotation = ComplexHeatmap::HeatmapAnnotation( cell = factor(zToPlot, levels = stringr::str_sort(unique(zToPlot), numeric = TRUE)), show_legend = showTopAnnotationLegend, show_annotation_name = showTopAnnotationName, col = list(cell = ccols), simple_anno_size = grid::unit(topAnnotationHeight, unit), simple_anno_size_adjust = TRUE), ...) } return(plt) } ================================================ FILE: R/perplexity.R ================================================ #' @title Calculate the perplexity of a celda model #' @description Perplexity is a statistical measure of how well a probability #' model can predict new data. Lower perplexity indicates a better model. #' @param x Can be one of #' \itemize{ #' \item A \linkS4class{SingleCellExperiment} object returned by #' \link{celda_C}, \link{celda_G} or \link{celda_CG}, with the matrix #' located in the \code{useAssay} assay slot. #' Rows represent features and columns represent cells. #' \item Integer counts matrix. Rows represent features and columns represent #' cells. This matrix should be the same as the one used to generate #' \code{celdaMod}.} #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. #' Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param celdaMod Celda model object. Only works if \code{x} is an integer #' counts matrix. #' @param newCounts A new counts matrix used to calculate perplexity. If NULL, #' perplexity will be calculated for the matrix in \code{useAssay} slot in #' \code{x}. Default NULL. #' @return Numeric. The perplexity for the provided \code{x} (and #' \code{celdaModel}). #' @export setGeneric("perplexity", function(x, celdaMod, useAssay = "counts", altExpName = "featureSubset", newCounts = NULL) { standardGeneric("perplexity")}) #' @importFrom matrixStats logSumExp #' @examples #' data(sceCeldaCG) #' perplexity <- perplexity(sceCeldaCG) #' @rdname perplexity #' @export setMethod("perplexity", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", newCounts = NULL) { altExp <- SingleCellExperiment::altExp(x, altExpName) counts <- SummarizedExperiment::assay(altExp, i = useAssay) counts <- .processCounts(counts) if (celdaModel(x, altExpName = altExpName) == "celda_C") { factorized <- factorizeMatrix(x = x, useAssay = useAssay, altExpName = altExpName, type = "posterior") s <- as.integer(sampleLabel(x, altExpName = altExpName)) p <- .perplexityCelda_C( counts = counts, factorized = factorized, s = s, newCounts = newCounts) } else if (celdaModel(x, altExpName = altExpName) == "celda_CG") { factorized <- factorizeMatrix(x = x, useAssay = useAssay, altExpName = altExpName, type = c("posterior", "counts")) s <- as.integer(sampleLabel(x, altExpName = altExpName)) p <- .perplexityCelda_CG(counts = counts, factorized = factorized, s = s, newCounts = newCounts) } else if (celdaModel(x, altExpName = altExpName) == "celda_G") { factorized <- factorizeMatrix(x = x, useAssay = useAssay, altExpName = altExpName, type = c("posterior", "counts")) L <- S4Vectors::metadata(altExp)$celda_parameters$L y <- celdaModules(x, altExpName = altExpName) p <- .perplexityCelda_G(counts, factorized, L = L, y = y, beta = S4Vectors::metadata(altExp)$celda_parameters$beta, newCounts = newCounts) } else { stop("S4Vectors::metadata(altExp(x, altExpName))$", "celda_parameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'") } return(p) }) #' @importFrom matrixStats logSumExp #' @examples #' data(celdaCGSim, celdaCGMod) #' perplexity <- perplexity(celdaCGSim$counts, celdaCGMod) #' @rdname perplexity #' @export setMethod("perplexity", signature(x = "ANY", celdaMod = "celda_CG"), function(x, celdaMod, newCounts = NULL) { if (!("celda_CG" %in% class(celdaMod))) { stop("The celdaMod provided was not of class celda_CG.") } counts <- .processCounts(x) compareCountMatrix(counts, celdaMod) if (is.null(newCounts)) { newCounts <- counts } else { newCounts <- .processCounts(newCounts) } if (nrow(newCounts) != nrow(counts)) { stop("newCounts should have the same number of rows as counts.") } factorized <- factorizeMatrix( x = counts, celdaMod = celdaMod, type = c("posterior", "counts") ) theta <- log(factorized$posterior$sample) phi <- factorized$posterior$cellPopulation psi <- factorized$posterior$module s <- as.integer(sampleLabel(celdaMod)) eta <- factorized$posterior$geneDistribution nGByTS <- factorized$counts$geneDistribution etaProb <- log(eta) * nGByTS geneByPopProb <- log(psi %*% phi) innerLogProb <- .countsTimesProbs(newCounts, geneByPopProb) + theta[, s] # innerLogProb = (t(geneByPopProb) %*% newCounts) + theta[, s] log.px <- sum(apply(innerLogProb, 2, matrixStats::logSumExp)) # + sum(etaProb) perplexity <- exp(- (log.px / sum(newCounts))) return(perplexity) } ) #' @examples #' data(celdaCSim, celdaCMod) #' perplexity <- perplexity(celdaCSim$counts, celdaCMod) #' @importFrom matrixStats logSumExp #' @rdname perplexity #' @export setMethod( "perplexity", signature(x = "ANY", celdaMod = "celda_C"), function(x, celdaMod, newCounts = NULL) { if (!("celda_C" %in% class(celdaMod))) { stop("The celdaMod provided was not of class celda_C.") } counts <- .processCounts(x) compareCountMatrix(counts, celdaMod) if (is.null(newCounts)) { newCounts <- counts } else { newCounts <- .processCounts(newCounts) } if (nrow(newCounts) != nrow(counts)) { stop("newCounts should have the same number of rows as counts.") } factorized <- factorizeMatrix( x = counts, celdaMod = celdaMod, type = "posterior" ) theta <- log(factorized$posterior$sample) phi <- log(factorized$posterior$module) s <- as.integer(sampleLabel(celdaMod)) # inner.log.prob = (t(phi) %*% newCounts) + theta[, s] inner.log.prob <- .countsTimesProbs(newCounts, phi) + theta[, s] logPx <- sum(apply(inner.log.prob, 2, matrixStats::logSumExp)) perplexity <- exp(- (logPx / sum(newCounts))) return(perplexity) } ) #' @examples #' data(celdaGSim, celdaGMod) #' perplexity <- perplexity(celdaGSim$counts, celdaGMod) #' @rdname perplexity #' @export setMethod( "perplexity", signature(x = "ANY", celdaMod = "celda_G"), function(x, celdaMod, newCounts = NULL) { counts <- .processCounts(x) compareCountMatrix(counts, celdaMod) factorized <- factorizeMatrix( x = counts, celdaMod = celdaMod, type = c("posterior", "counts") ) psi <- factorized$posterior$module phi <- factorized$posterior$cell eta <- factorized$posterior$geneDistribution nGByTS <- factorized$counts$geneDistribution if (is.null(newCounts)) { newCounts <- counts } else { newCounts <- .processCounts(newCounts) } if (nrow(newCounts) != nrow(counts)) { stop("newCounts should have the same number of rows as counts.") } #etaProb <- log(eta) * nGByTS #gene.by.cell.prob = log(psi %*% phi) #logPx = sum(gene.by.cell.prob * newCounts) # + sum(etaProb) gene.by.cell.prob <- log(psi %*% phi) logPx <- sum(gene.by.cell.prob * newCounts) # + sum(etaProb) perplexity <- exp(- (logPx / sum(newCounts))) return(perplexity) } ) .perplexityCelda_C <- function( counts, factorized, s, newCounts) { if (is.null(newCounts)) { newCounts <- counts } else { newCounts <- .processCounts(newCounts) } if (nrow(newCounts) != nrow(counts)) { stop("'newCounts' should have the same number of rows as", " 'assay(altExp(x, altExpName), i = useAssay)'.") } theta <- log(factorized$posterior$sample) phi <- log(factorized$posterior$module) # inner.log.prob = (t(phi) %*% newCounts) + theta[, s] inner.log.prob <- .countsTimesProbs(newCounts, phi) + theta[, s] logPx <- sum(apply(inner.log.prob, 2, matrixStats::logSumExp)) perplexity <- exp(- (logPx / sum(newCounts))) return(perplexity) } .perplexityCelda_CG <- function( counts, factorized, s, newCounts) { if (is.null(newCounts)) { newCounts <- counts } else { newCounts <- .processCounts(newCounts) } if (nrow(newCounts) != nrow(counts)) { stop("newCounts should have the same number of rows as", " 'assay(altExp(x, altExpName), i = useAssay)'.") } theta <- log(factorized$posterior$sample) phi <- factorized$posterior$cellPopulation psi <- factorized$posterior$module eta <- factorized$posterior$geneDistribution nGByTS <- factorized$counts$geneDistribution etaProb <- log(eta) * nGByTS geneByPopProb <- log(psi %*% phi) innerLogProb <- .countsTimesProbs(newCounts, geneByPopProb) + theta[, s] # innerLogProb = (t(geneByPopProb) %*% newCounts) + theta[, s] log.px <- sum(apply(innerLogProb, 2, matrixStats::logSumExp)) # + sum(etaProb) perplexity <- exp(- (log.px / sum(newCounts))) return(perplexity) } .perplexityCelda_G <- function(counts, factorized, L, y, beta, newCounts) { psi <- factorized$posterior$module phi <- factorized$posterior$cell #eta <- factorized$posterior$geneDistribution #nGByTS <- factorized$counts$geneDistribution if (is.null(newCounts)) { newCounts <- counts } else { newCounts <- .processCounts(newCounts) } if (nrow(newCounts) != nrow(counts)) { stop("newCounts should have the same number of rows as counts.") } #etaProb <- log(eta) * nGByTS #gene.by.cell.prob <- log(psi %*% phi) gene.by.cell.prob <- log(psi %*% phi) logPx <- sum(gene.by.cell.prob * newCounts) # + sum(etaProb) perplexity <- exp(- (logPx / sum(newCounts))) return(perplexity) } #' @title Calculate and visualize perplexity of all models in a celdaList #' @description Calculates the perplexity of each model's cluster assignments #' given the provided countMatrix, as well as resamplings of that count #' matrix, providing a distribution of perplexities and a better sense of the #' quality of a given K/L choice. #' @param x A numeric \link{matrix} of counts or a #' \linkS4class{SingleCellExperiment} returned from \link{celdaGridSearch} #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. Must contain #' "celda_grid_search" slot in \code{metadata(x)} if \code{x} is a #' \linkS4class{SingleCellExperiment} object. #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a #' \linkS4class{SingleCellExperiment} object. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param celdaList Object of class 'celdaList'. Used only if \code{x} is a #' matrix object. #' @param doResampling Boolean. If \code{TRUE}, then each cell in the counts #' matrix will be resampled according to a multinomial distribution to introduce #' noise before calculating perplexity. Default \code{FALSE}. #' @param numResample Integer. The number of times to resample the counts matrix #' for evaluating perplexity if \code{doResampling} is set to \code{TRUE}. #' Default \code{5}. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of \code{12345} is used. If \code{NULL}, no calls to #' \link[withr]{with_seed} are made. #' @return A \linkS4class{SingleCellExperiment} object or #' \code{celdaList} object with a \code{perplexity} #' property, detailing the perplexity of all K/L combinations that appeared in #' the celdaList's models. #' @export setGeneric("resamplePerplexity", function(x, celdaList, useAssay = "counts", altExpName = "featureSubset", doResampling = FALSE, numResample = 5, seed = 12345) { standardGeneric("resamplePerplexity")}) #' @rdname resamplePerplexity #' @examples #' data(sceCeldaCGGridSearch) #' sce <- resamplePerplexity(sceCeldaCGGridSearch) #' plotGridSearchPerplexity(sce) #' @export setMethod("resamplePerplexity", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", doResampling = FALSE, numResample = 5, seed = 12345) { altExp <- SingleCellExperiment::altExp(x, altExpName) counts <- SummarizedExperiment::assay(altExp, i = useAssay) celdaList <- S4Vectors::metadata(altExp)$celda_grid_search if (is.null(seed)) { res <- .resamplePerplexity( counts = counts, celdaList = celdaList, doResampling = doResampling, numResample = numResample) } else { with_seed(seed, res <- .resamplePerplexity( counts = counts, celdaList = celdaList, doResampling = doResampling, numResample = numResample)) } S4Vectors::metadata(altExp)$celda_grid_search <- res SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) } ) #' @rdname resamplePerplexity #' @examples #' data(celdaCGSim, celdaCGGridSearchRes) #' celdaCGGridSearchRes <- resamplePerplexity( #' celdaCGSim$counts, #' celdaCGGridSearchRes #' ) #' plotGridSearchPerplexity(celdaCGGridSearchRes) #' @export setMethod("resamplePerplexity", signature(x = "ANY"), function(x, celdaList, doResampling = FALSE, numResample = 5, seed = 12345) { if (is.null(seed)) { res <- .resamplePerplexity( counts = x, celdaList = celdaList, doResampling = doResampling, numResample = numResample) } else { with_seed(seed, res <- .resamplePerplexity( counts = x, celdaList = celdaList, doResampling = doResampling, numResample = numResample)) } return(res) } ) .resamplePerplexity <- function(counts, celdaList, doResampling = FALSE, numResample = 5) { if (!methods::is(celdaList, "celdaList")) { stop("celdaList parameter was not of class celdaList.") } if (!isTRUE(is.logical(doResampling))) { stop("The 'doResampling' parameter needs to be logical (TRUE/FALSE).") } if (!isTRUE(doResampling) & (!is.numeric(numResample) || numResample < 1)) { stop("The 'numResample' parameter needs to be an integer greater ", "than 0.") } if(isTRUE(doResampling)) { perpRes <- matrix(NA, nrow = length(resList(celdaList)), ncol = numResample) for (j in seq(numResample)) { newCounts <- .resampleCountMatrix(counts) for (i in seq(length(resList(celdaList)))) { perpRes[i, j] <- perplexity(x = counts, celdaMod = resList(celdaList)[[i]], newCounts = newCounts) } } celdaList@perplexity <- perpRes } else { perpRes <- matrix(NA, nrow = length(resList(celdaList)), ncol = 1) for (i in seq(length(resList(celdaList)))) { perpRes[i,1] <- perplexity(x = counts, celdaMod = resList(celdaList)[[i]], newCounts = counts) } } # Add perplexity data.frame to celda list object celdaList@perplexity <- perpRes ## Add mean perplexity to runParams perpMean <- apply(perpRes, 1, mean) celdaList@runParams$mean_perplexity <- perpMean return(celdaList) } #' @title Visualize perplexity of a list of celda models #' @description Visualize perplexity of every model in a celdaList, by unique #' K/L combinations #' @param x Can be one of #' \itemize{ #' \item A \linkS4class{SingleCellExperiment} object returned from #' \code{celdaGridSearch}, \code{recursiveSplitModule}, #' or \code{recursiveSplitCell}. Must contain a list named #' \code{"celda_grid_search"} in \code{metadata(x)}. #' \item celdaList object.} #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". Only works if \code{x} is a #' \linkS4class{SingleCellExperiment} object. #' @param sep Numeric. Breaks in the x axis of the resulting plot. #' @param alpha Numeric. Passed to \link{geom_jitter}. Opacity of the points. #' Values of alpha range from 0 to 1, with lower values corresponding #' to more transparent colors. #' @return A ggplot plot object showing perplexity as a function of clustering #' parameters. #' @export setGeneric("plotGridSearchPerplexity", function(x, altExpName = "featureSubset", sep = 5, alpha = 0.5) { standardGeneric("plotGridSearchPerplexity")}) #' @rdname plotGridSearchPerplexity #' @examples #' data(sceCeldaCGGridSearch) #' sce <- resamplePerplexity(sceCeldaCGGridSearch) #' plotGridSearchPerplexity(sce) #' @export setMethod("plotGridSearchPerplexity", signature(x = "SingleCellExperiment"), function(x, altExpName = "featureSubset", sep = 5, alpha = 0.5) { altExp <- SingleCellExperiment::altExp(x, altExpName) celdaList <- S4Vectors::metadata(altExp)$celda_grid_search g <- do.call(paste0(".plotGridSearchPerplexity", as.character(class(resList(x, altExpName = altExpName)[[1]]))), args = list(celdaList, sep, alpha)) return(g) } ) #' @rdname plotGridSearchPerplexity #' @examples #' data(celdaCGSim, celdaCGGridSearchRes) #' ## Run various combinations of parameters with 'celdaGridSearch' #' celdaCGGridSearchRes <- resamplePerplexity( #' celdaCGSim$counts, #' celdaCGGridSearchRes) #' plotGridSearchPerplexity(celdaCGGridSearchRes) #' @export setMethod("plotGridSearchPerplexity", signature(x = "celdaList"), function(x, sep = 5, alpha = 0.5) { g <- do.call(paste0(".plotGridSearchPerplexity", as.character(class(resList(x)[[1]]))), args = list(x, sep, alpha)) return(g) } ) .plotGridSearchPerplexitycelda_CG <- function(celdaList, sep, alpha) { if (!all(c("K", "L") %in% colnames(runParams(celdaList)))) { stop("runParams(celdaList) needs K and L columns.") } if (is.null(celdaPerplexity(celdaList))) { stop("No perplexity measurements available. First run", " 'resamplePerplexity' with celdaList object.") } ix1 <- rep(seq(nrow(celdaPerplexity(celdaList))), each = ncol(celdaPerplexity(celdaList)) ) ix2 <- rep( seq(ncol(celdaPerplexity(celdaList))), nrow(celdaPerplexity(celdaList)) ) df <- data.frame(runParams(celdaList)[ix1, ], perplexity = celdaPerplexity(celdaList)[cbind(ix1, ix2)] ) df$K <- as.factor(df$K) df$L <- as.factor(df$L) lMeansByK <- stats::aggregate(df$perplexity, by = list(df$K, df$L), FUN = mean ) colnames(lMeansByK) <- c("K", "L", "mean_perplexity") lMeansByK$K <- as.factor(lMeansByK$K) lMeansByK$L <- as.factor(lMeansByK$L) if (nlevels(df$K) > 1) { if (nlevels(df$L) > 1) { plot <- ggplot2::ggplot( df, ggplot2::aes_string(x = "K", y = "perplexity") ) + ggplot2::geom_jitter( height = 0, width = 0.1, alpha = alpha, ggplot2::aes_string(color = "L") ) + ggplot2::scale_color_discrete(name = "L") + ggplot2::geom_path(data = lMeansByK, ggplot2::aes_string( x = "K", y = "mean_perplexity", group = "L", color = "L" )) + ggplot2::ylab("Perplexity") + ggplot2::xlab("K") + ggplot2::scale_x_discrete(breaks = seq( min(runParams(celdaList)$K), max(runParams(celdaList)$K), sep )) + ggplot2::theme_bw() + ggplot2::theme(panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank()) } else { plot <- ggplot2::ggplot( df, ggplot2::aes_string(x = "K", y = "perplexity")) + ggplot2::geom_jitter(height = 0, width = 0.1, color = "grey", alpha = alpha) + ggplot2::scale_color_manual(name = "L", values = "black") + ggplot2::geom_path(data = lMeansByK, ggplot2::aes_string( x = "K", y = "mean_perplexity", group = "L", color = "L" )) + ggplot2::ylab("Perplexity") + ggplot2::xlab("K") + ggplot2::scale_x_discrete(breaks = seq( min(runParams(celdaList)$K), max(runParams(celdaList)$K), sep )) + ggplot2::theme_bw() + ggplot2::theme(panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank()) } } else { plot <- ggplot2::ggplot( df, ggplot2::aes_string(x = "L", y = "perplexity")) + ggplot2::geom_jitter(height = 0, width = 0.1, color = "grey", alpha = alpha) + ggplot2::geom_path(data = lMeansByK, ggplot2::aes_string(x = "L", y = "mean_perplexity", group = "K", color = "K")) + ggplot2::scale_color_manual(name = "K", values = "black") + ggplot2::ylab("Perplexity") + ggplot2::xlab("L") + ggplot2::scale_x_discrete(breaks = seq(min(runParams(celdaList)$L), max(runParams(celdaList)$L), sep)) + ggplot2::theme_bw() + ggplot2::theme(panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank()) } return(plot) } .plotGridSearchPerplexitycelda_C <- function(celdaList, sep, alpha) { if (!all(c("K") %in% colnames(runParams(celdaList)))) { stop("runParams(celdaList) needs the column K.") } if (is.null(celdaPerplexity(celdaList))) { stop( "No perplexity measurements available. First run", " 'resamplePerplexity' with celdaList object." ) } ix1 <- rep(seq(nrow(celdaPerplexity(celdaList))), each = ncol(celdaPerplexity(celdaList)) ) ix2 <- rep( seq(ncol(celdaPerplexity(celdaList))), nrow(celdaPerplexity(celdaList)) ) df <- data.frame(runParams(celdaList)[ix1, ], perplexity = celdaPerplexity(celdaList)[cbind(ix1, ix2)] ) df$K <- as.factor(df$K) meansByK <- stats::aggregate(df$perplexity, by = list(df$K), FUN = mean) colnames(meansByK) <- c("K", "mean_perplexity") meansByK$K <- as.factor(meansByK$K) plot <- ggplot2::ggplot(df, ggplot2::aes_string(x = "K", y = "perplexity")) + ggplot2::geom_jitter(height = 0, width = 0.1, color = "grey", alpha = alpha) + ggplot2::geom_path( data = meansByK, ggplot2::aes_string(x = "K", y = "mean_perplexity", group = 1) ) + ggplot2::ylab("Perplexity") + ggplot2::xlab("K") + ggplot2::scale_x_discrete(breaks = seq( min(runParams(celdaList)$K), max(runParams(celdaList)$K), sep )) + ggplot2::theme_bw() + ggplot2::theme(panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank()) return(plot) } .plotGridSearchPerplexitycelda_G <- function(celdaList, sep, alpha) { if (!all(c("L") %in% colnames(runParams(celdaList)))) { stop("runParams(celdaList) needs the column L.") } if (length(celdaPerplexity(celdaList)) == 0) { stop( "No perplexity measurements available. First run", " 'resamplePerplexity' with celdaList object." ) } ix1 <- rep(seq(nrow(celdaPerplexity(celdaList))), each = ncol(celdaPerplexity(celdaList)) ) ix2 <- rep( seq(ncol(celdaPerplexity(celdaList))), nrow(celdaPerplexity(celdaList)) ) df <- data.frame(runParams(celdaList)[ix1, ], perplexity = celdaPerplexity(celdaList)[cbind(ix1, ix2)] ) df$L <- as.factor(df$L) meansByL <- stats::aggregate(df$perplexity, by = list(df$L), FUN = mean) colnames(meansByL) <- c("L", "mean_perplexity") meansByL$L <- as.factor(meansByL$L) plot <- ggplot2::ggplot(df, ggplot2::aes_string(x = "L", y = "perplexity")) + ggplot2::geom_jitter(height = 0, width = 0.1, color = "grey", alpha = alpha) + ggplot2::geom_path( data = meansByL, ggplot2::aes_string(x = "L", y = "mean_perplexity", group = 1) ) + ggplot2::ylab("Perplexity") + ggplot2::xlab("L") + ggplot2::scale_x_discrete(breaks = seq( min(runParams(celdaList)$L), max(runParams(celdaList)$L), sep )) + ggplot2::theme_bw() + ggplot2::theme(panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank()) return(plot) } # Resample a counts matrix for evaluating perplexity #' @importFrom Matrix colSums t .resampleCountMatrix <- function(countMatrix) { colsums <- colSums(countMatrix) prob <- t(t(countMatrix) / colsums) resample <- vapply(seq(ncol(countMatrix)), function(idx) { stats::rmultinom( n = 1, size = colsums[idx], prob = prob[, idx] ) }, integer(nrow(countMatrix))) return(resample) } #' @title Visualize perplexity differences of a list of celda models #' @description Visualize perplexity differences of every model in a celdaList, #' by unique K/L combinations. #' @param x Can be one of #' \itemize{ #' \item A \linkS4class{SingleCellExperiment} object returned from #' \code{celdaGridSearch}, \code{recursiveSplitModule}, #' or \code{recursiveSplitCell}. Must contain a list named #' \code{"celda_grid_search"} in \code{metadata(x)}. #' \item celdaList object.} #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param sep Numeric. Breaks in the x axis of the resulting plot. #' @param alpha Numeric. Passed to \link{geom_jitter}. Opacity of the points. #' Values of alpha range from 0 to 1, with lower values corresponding #' to more transparent colors. #' @return A ggplot plot object showing perplexity differences as a function of #' clustering parameters. #' @export setGeneric("plotRPC", function(x, altExpName = "featureSubset", sep = 5, alpha = 0.5) { standardGeneric("plotRPC")}) #' @rdname plotRPC #' @examples #' data(sceCeldaCGGridSearch) #' sce <- resamplePerplexity(sceCeldaCGGridSearch) #' plotRPC(sce) #' @export setMethod("plotRPC", signature(x = "SingleCellExperiment"), function(x, altExpName = "featureSubset", sep = 5, alpha = 0.5) { altExp <- SingleCellExperiment::altExp(x, altExpName) model <- altExp@metadata$celda_grid_search@celdaGridSearchParameters$ model celdaList <- S4Vectors::metadata(altExp)$celda_grid_search if (model == "celda_C") { g <- .plotRPCC(celdaList, sep, alpha = alpha) } else if (model == "celda_G") { g <- .plotRPCG(celdaList, sep, alpha = alpha) } else if (model == "celda_CG") { g <- .plotRPCCG(celdaList, sep, alpha = alpha) } else { stop("S4Vectors::metadata(altExp(x, altExpName))$", "celda_grid_search@", "celdaGridSearchParameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'") } return(g) } ) #' @rdname plotRPC #' @examples #' data(celdaCGSim, celdaCGGridSearchRes) #' ## Run various combinations of parameters with 'celdaGridSearch' #' celdaCGGridSearchRes <- resamplePerplexity( #' celdaCGSim$counts, #' celdaCGGridSearchRes) #' plotRPC(celdaCGGridSearchRes) #' @export setMethod("plotRPC", signature(x = "celdaList"), function(x, sep = 5, alpha = 0.5) { g <- do.call(paste0(".plotRPC", unlist(strsplit(as.character(class(resList(x)[[1]])), "_"))[[2]]), args = list(x, sep, alpha)) return(g) } ) .plotRPCCG <- function(celdaList, sep, alpha) { # fix check note K <- L <- perpdiffK <- meanperpdiffK <- perpdiffL <- meanperpdiffL <- NULL if (!all(c("K", "L") %in% colnames(runParams(celdaList)))) { stop("runParams(celdaList) needs K and L columns.") } if (is.null(celdaPerplexity(celdaList))) { stop("No perplexity measurements available. First run", " 'resamplePerplexity' with celdaList object.") } ix1 <- rep(seq(nrow(celdaPerplexity(celdaList))), each = ncol(celdaPerplexity(celdaList))) ix2 <- rep(seq(ncol(celdaPerplexity(celdaList))), nrow(celdaPerplexity(celdaList))) dt <- data.table::data.table(runParams(celdaList)[ix1, ], perplexity = celdaPerplexity(celdaList)[cbind(ix1, ix2)]) dt$K <- as.factor(dt$K) dt$L <- as.factor(dt$L) if (nlevels(dt$K) > 1) { for (i in seq(nlevels(dt$L))) { for (j in seq(2, nlevels(dt$K))) { p1 <- dt[K == levels(dt$K)[j - 1] & L == levels(dt$L)[i], perplexity] p2 <- dt[K == levels(dt$K)[j] & L == levels(dt$L)[i], perplexity] dt[K == levels(dt$K)[j] & L == levels(dt$L)[i], "perpdiffK"] <- p2 - p1 } } diffMeansByK <- data.table::data.table(stats::aggregate( dt$perpdiffK, by = list(dt$K, dt$L), FUN = mean)) colnames(diffMeansByK) <- c("K", "L", "meanperpdiffK") diffMeansByK$K <- as.factor(diffMeansByK$K) diffMeansByK$L <- as.factor(diffMeansByK$L) diffMeansByK <- diffMeansByK[stats::complete.cases(diffMeansByK), ] diffMeansByK$spline <- stats::smooth.spline(diffMeansByK$meanperpdiffK)$y if (nlevels(dt$L) > 1) { plot <- ggplot2::ggplot(dt[!is.na(perpdiffK), ], ggplot2::aes_string(x = "K", y = "perpdiffK")) + ggplot2::geom_jitter(height = 0, width = 0.1, alpha = alpha, ggplot2::aes_string(color = "L")) + ggplot2::scale_color_discrete(name = "L") + ggplot2::geom_path(data = diffMeansByK, ggplot2::aes_string(x = "K", y = "spline", group = "L", color = "L"), size = 1) + ggplot2::ylab("Rate of perplexity change") + ggplot2::xlab("K") + ggplot2::scale_x_discrete( breaks = seq(min(as.integer(levels(dt$K))), max(as.integer(levels(dt$K))), sep)) + ggplot2::theme_bw() + ggplot2::theme(panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank()) } else { plot <- ggplot2::ggplot(dt[!is.na(perpdiffK), ], ggplot2::aes_string(x = "K", y = "perpdiffK")) + ggplot2::geom_jitter(height = 0, width = 0.1, color = "grey", alpha = alpha) + ggplot2::scale_color_manual(name = "L", values = "black") + ggplot2::geom_path(data = diffMeansByK, ggplot2::aes_string(x = "K", y = "spline", group = "L", color = "L"), size = 1) + ggplot2::ylab("Rate of perplexity change") + ggplot2::xlab("K") + ggplot2::scale_x_discrete( breaks = seq(min(as.integer(levels(dt$K))), max(as.integer(levels(dt$K))), sep)) + ggplot2::theme_bw() + ggplot2::theme(panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank()) } } else if (nlevels(dt$L) > 1) { for (j in seq(2, nlevels(dt$L))) { p1 <- dt[K == levels(dt$K) & L == levels(dt$L)[j - 1], perplexity] p2 <- dt[K == levels(dt$K) & L == levels(dt$L)[j], perplexity] dt[K == levels(dt$K) & L == levels(dt$L)[j], "perpdiffL"] <- p2 - p1 } diffMeansByL <- data.table::data.table(stats::aggregate(dt$perpdiffL, by = list(dt$K, dt$L), FUN = mean)) colnames(diffMeansByL) <- c("K", "L", "meanperpdiffL") diffMeansByL$K <- as.factor(diffMeansByL$K) diffMeansByL$L <- as.factor(diffMeansByL$L) diffMeansByL <- diffMeansByL[stats::complete.cases(diffMeansByL), ] diffMeansByL$spline <- stats::smooth.spline(diffMeansByL$meanperpdiffL)$y plot <- ggplot2::ggplot(dt[!is.na(perpdiffL), ], ggplot2::aes_string(x = "L", y = "perpdiffL")) + ggplot2::geom_jitter(height = 0, width = 0.1, color = "grey", alpha = alpha) + ggplot2::scale_color_manual(name = "K", values = "black") + ggplot2::geom_path( data = diffMeansByL, ggplot2::aes_string( x = "L", y = "spline", group = "K", color = "K"), size = 1) + ggplot2::ylab("Rate of perplexity change") + ggplot2::xlab("L") + ggplot2::scale_x_discrete( breaks = seq(min(as.integer(levels(dt$L))), max(as.integer(levels(dt$L))), sep)) + ggplot2::theme_bw() + ggplot2::theme(panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank()) } else { stop("Only one combination of K and L available! Unable to calculate", " perplexity differences.") } return(plot) } .plotRPCC <- function(celdaList, sep, alpha) { K <- perpdiffK <- meanperpdiffK <- NULL # fix check note if (!all(c("K") %in% colnames(runParams(celdaList)))) { stop("runParams(celdaList) needs the column K.") } if (is.null(celdaPerplexity(celdaList))) { stop( "No perplexity measurements available. First run", " 'resamplePerplexity' with celdaList object." ) } ix1 <- rep(seq(nrow(celdaPerplexity(celdaList))), each = ncol(celdaPerplexity(celdaList))) ix2 <- rep(seq(ncol(celdaPerplexity(celdaList))), nrow(celdaPerplexity(celdaList))) dt <- data.table::data.table(runParams(celdaList)[ix1, ], perplexity = celdaPerplexity(celdaList)[cbind(ix1, ix2)]) dt$K <- as.factor(dt$K) if (nlevels(dt$K) > 1) { for (i in seq(2, nlevels(dt$K))) { p1 <- dt[K == levels(dt$K)[i - 1], perplexity] p2 <- dt[K == levels(dt$K)[i], perplexity] dt[K == levels(dt$K)[i], "perpdiffK"] <- p2 - p1 } diffMeansByK <- data.table::data.table(stats::aggregate(dt$perpdiffK, by = list(dt$K), FUN = mean)) colnames(diffMeansByK) <- c("K", "meanperpdiffK") diffMeansByK$K <- as.factor(diffMeansByK$K) diffMeansByK <- diffMeansByK[stats::complete.cases(diffMeansByK), ] diffMeansByK$spline <- stats::smooth.spline(diffMeansByK$meanperpdiffK)$y plot <- ggplot2::ggplot(dt[!is.na(perpdiffK), ], ggplot2::aes_string(x = "K", y = "perpdiffK")) + ggplot2::geom_jitter(height = 0, width = 0.1, color = "grey", alpha = alpha) + ggplot2::geom_path(data = diffMeansByK, ggplot2::aes_string(x = "K", y = "spline", group = 1), size = 1) + ggplot2::ylab("Perplexity difference compared to previous K") + ggplot2::xlab("K") + ggplot2::scale_x_discrete( breaks = seq(min(as.integer(levels(dt$K))), max(as.integer(levels(dt$K))), sep)) + ggplot2::theme_bw() + ggplot2::theme(panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank()) } else { stop("Only one unique K value available! Unable to calculate", " perplexity differences.") } return(plot) } .plotRPCG <- function(celdaList, sep, alpha) { L <- perpdiffL <- meanperpdiffL <- NULL # fix check note if (!all(c("L") %in% colnames(runParams(celdaList)))) { stop("runParams(celdaList) needs the column L.") } if (length(celdaPerplexity(celdaList)) == 0) { stop( "No perplexity measurements available. First run", " 'resamplePerplexity' with celdaList object." ) } ix1 <- rep(seq(nrow(celdaPerplexity(celdaList))), each = ncol(celdaPerplexity(celdaList))) ix2 <- rep(seq(ncol(celdaPerplexity(celdaList))), nrow(celdaPerplexity(celdaList))) dt <- data.table::data.table(runParams(celdaList)[ix1, ], perplexity = celdaPerplexity(celdaList)[cbind(ix1, ix2)]) dt$L <- as.factor(dt$L) if (nlevels(dt$L) > 1) { for (i in seq(2, nlevels(dt$L))) { p1 <- dt[L == levels(dt$L)[i - 1], perplexity] p2 <- dt[L == levels(dt$L)[i], perplexity] dt[L == levels(dt$L)[i], "perpdiffL"] <- p2 - p1 } diffMeansByL <- data.table::data.table(stats::aggregate(dt$perpdiffL, by = list(dt$L), FUN = mean)) colnames(diffMeansByL) <- c("L", "meanperpdiffL") diffMeansByL$L <- as.factor(diffMeansByL$L) diffMeansByL <- diffMeansByL[stats::complete.cases(diffMeansByL), ] diffMeansByL$spline <- stats::smooth.spline(diffMeansByL$meanperpdiffL)$y plot <- ggplot2::ggplot(dt[!is.na(perpdiffL), ], ggplot2::aes_string(x = "L", y = "perpdiffL")) + ggplot2::geom_jitter(height = 0, width = 0.1, color = "grey", alpha = alpha) + ggplot2::geom_path(data = diffMeansByL, ggplot2::aes_string(x = "L", y = "spline", group = 1), size = 1) + ggplot2::ylab("Perplexity difference compared to previous L") + ggplot2::xlab("L") + ggplot2::scale_x_discrete( breaks = seq(min(as.integer(levels(dt$L))), max(as.integer(levels(dt$L))), sep)) + ggplot2::theme_bw() + ggplot2::theme(panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank()) } else { stop("Only one unique L value available! Unable to calculate", " perplexity differences.") } return(plot) } ================================================ FILE: R/plotHeatmap.R ================================================ #' @title Plots heatmap based on Celda model #' @description Renders a heatmap based on a matrix of counts where rows are #' features and columns are cells. #' @param counts Numeric or sparse matrix. Normalized counts matrix where rows #' represent features and columns represent cells. . #' @param z Numeric vector. Denotes cell population labels. #' @param y Numeric vector. Denotes feature module labels. #' @param featureIx Integer vector. Select features for display in heatmap. If #' NULL, no subsetting will be performed. Default NULL. #' @param cellIx Integer vector. Select cells for display in heatmap. If NULL, #' no subsetting will be performed. Default NULL. #' @param scaleRow Function. A function to scale each individual row. Set to #' NULL to disable. Occurs after normalization and log transformation. Defualt #' is 'scale' and thus will Z-score transform each row. #' @param trim Numeric vector. Vector of length two that specifies the lower #' and upper bounds for the data. This threshold is applied after row scaling. #' Set to NULL to disable. Default c(-2,2). #' @param clusterFeature Logical. Determines whether rows should be clustered. #' Default TRUE. #' @param clusterCell Logical. Determines whether columns should be clustered. #' Default TRUE. #' @param annotationCell Data frame. Additional annotations for each cell will #' be shown in the column color bars. The format of the data frame should be #' one row for each cell and one column for each annotation. Numeric variables #' will be displayed as continuous color bars and factors will be displayed as #' discrete color bars. Default NULL. #' @param annotationFeature A data frame for the feature annotations (rows). #' @param annotationColor List. Contains color scheme for all annotations. See #' `?pheatmap` for more details. #' @param colorScheme Character. One of "divergent" or "sequential". A #' "divergent" scheme is best for highlighting relative data (denoted by #' 'colorSchemeCenter') such as gene expression data that has been normalized #' and centered. A "sequential" scheme is best for highlighting data that #' are ordered low to high such as raw counts or probabilities. Default #' "divergent". #' @param colorSchemeSymmetric Logical. When the colorScheme is "divergent" #' and the data contains both positive and negative numbers, TRUE indicates #' that the color scheme should be symmetric from #' \code{[-max(abs(data)), max(abs(data))]}. For example, if the data ranges #' goes from -1.5 to 2, then setting this to TRUE will force the color scheme #' to range from -2 to 2. Default TRUE. #' @param colorSchemeCenter Numeric. Indicates the center of a "divergent" #' colorScheme. Default 0. #' @param col Color for the heatmap. #' @param breaks Numeric vector. A sequence of numbers that covers the range #' of values in the normalized `counts`. Values in the normalized `matrix` are #' assigned to each bin in `breaks`. Each break is assigned to a unique color #' from `col`. If NULL, then breaks are calculated automatically. Default NULL. #' @param legend Logical. Determines whether legend should be drawn. Default #' TRUE. #' @param annotationLegend Logical. Whether legend for all annotations should #' be drawn. Default TRUE. #' @param annotationNamesFeature Logical. Whether the names for features should #' be shown. Default TRUE. #' @param annotationNamesCell Logical. Whether the names for cells should be #' shown. Default TRUE. #' @param showNamesFeature Logical. Specifies if feature names should be shown. #' Default TRUE. #' @param showNamesCell Logical. Specifies if cell names should be shown. #' Default FALSE. #' @param rowGroupOrder Vector. Specifies the order of feature clusters when #' semisupervised clustering is performed on the \code{y} labels. #' @param colGroupOrder Vector. Specifies the order of cell clusters when #' semisupervised clustering is performed on the \code{z} labels. #' @param hclustMethod Character. Specifies the method to use for the 'hclust' #' function. See `?hclust` for possible values. Default "ward.D2". #' @param treeheightFeature Numeric. Width of the feature dendrogram. Set to 0 #' to disable plotting of this dendrogram. Default: if clusterFeature == TRUE, #' then treeheightFeature = 50, else treeheightFeature = 0. #' @param treeheightCell Numeric. Height of the cell dendrogram. Set to 0 to #' disable plotting of this dendrogram. Default: if clusterCell == TRUE, then #' treeheightCell = 50, else treeheightCell = 0. #' @param silent Logical. Whether to plot the heatmap. #' @param ... Other arguments to be passed to underlying pheatmap function. #' @examples #' data(celdaCGSim, celdaCGMod) #' plotHeatmap(celdaCGSim$counts, #' z = celdaClusters(celdaCGMod)$z, y = celdaClusters(celdaCGMod)$y #' ) #' @return list A list containing dendrogram information and the heatmap grob #' @import graphics #' @import grid #' @export plotHeatmap <- function(counts, z = NULL, y = NULL, scaleRow = scale, trim = c(-2, 2), featureIx = NULL, cellIx = NULL, clusterFeature = TRUE, clusterCell = TRUE, colorScheme = c("divergent", "sequential"), colorSchemeSymmetric = TRUE, colorSchemeCenter = 0, col = NULL, annotationCell = NULL, annotationFeature = NULL, annotationColor = NULL, breaks = NULL, legend = TRUE, annotationLegend = TRUE, annotationNamesFeature = TRUE, annotationNamesCell = TRUE, showNamesFeature = FALSE, showNamesCell = FALSE, rowGroupOrder = NULL, colGroupOrder = NULL, hclustMethod = "ward.D2", treeheightFeature = ifelse(clusterFeature, 50, 0), treeheightCell = ifelse(clusterCell, 50, 0), silent = FALSE, ...) { # Check for same lengths for z and y group variables if (!is.null(z) & length(z) != ncol(counts)) { stop("Length of z must match number of columns in counts matrix") } if (!is.null(y) & length(y) != nrow(counts)) { stop("Length of y must match number of rows in counts matrix") } colorScheme <- match.arg(colorScheme) ## Create cell annotation if (!is.null(annotationCell) & !is.null(z)) { if (is.null(rownames(annotationCell))) { rownames(annotationCell) <- colnames(counts) } else { if (any(rownames(annotationCell) != colnames(counts))) { stop("Row names of 'annotationCell' are different than the column names of 'counts'") } } annotationCell <- data.frame(cell = as.factor(z), annotationCell) } else if (is.null(annotationCell) & !is.null(z)) { annotationCell <- data.frame(cell = as.factor(z)) rownames(annotationCell) <- colnames(counts) } else { annotationCell <- NA } # Set feature annotation if (!is.null(annotationFeature) & !is.null(y)) { if (is.null(rownames(annotationFeature))) { rownames(annotationFeature) <- rownames(counts) } else { if (any(rownames(annotationFeature) != rownames(counts))) { stop("Row names of 'annotationFeature' are different than the row names of 'counts'") } } annotationFeature <- data.frame( module = as.factor(y), annotationFeature ) } else if (is.null(annotationFeature) & !is.null(y)) { annotationFeature <- data.frame(module = as.factor(y)) rownames(annotationFeature) <- rownames(counts) } else { annotationFeature <- NA } ## Select subsets of features/cells if (!is.null(featureIx)) { counts <- counts[featureIx, , drop = FALSE] if (!is.null(annotationFeature) && !is.null(ncol(annotationFeature))) { annotationFeature <- annotationFeature[featureIx, , drop = FALSE] } if (!is.null(y)) { y <- y[featureIx] } } if (!is.null(cellIx)) { counts <- counts[, cellIx, drop = FALSE] if (!is.null(annotationCell) && !is.null(ncol(annotationCell))) { annotationCell <- annotationCell[cellIx, , drop = FALSE] } if (!is.null(z)) { z <- z[cellIx] } } ## Set annotation colors if (!is.null(z)) { if (is.factor(z)) { K <- levels(z) } else { K <- unique(z) } K <- stringr::str_sort(K, numeric = TRUE) kCol <- distinctColors(length(K)) names(kCol) <- K if (!is.null(annotationColor)) { if (!("cell" %in% names(annotationColor))) { annotationColor <- c(list(cell = kCol), annotationColor) } } else { annotationColor <- list(cell = kCol) } } if (!is.null(y)) { if (is.factor(y)) { L <- levels(y) } else { L <- unique(y) } L <- stringr::str_sort(L, numeric = TRUE) lCol <- distinctColors(length(L)) names(lCol) <- L if (!is.null(annotationColor)) { if (!("module" %in% names(annotationColor))) { annotationColor <- c(list(module = lCol), annotationColor) } } else { annotationColor <- list(module = lCol) } } # scale indivisual rows by scaleRow if (!is.null(scaleRow)) { if (is.function(scaleRow)) { cn <- colnames(counts) counts <- t(base::apply(counts, 1, scaleRow)) colnames(counts) <- cn } else { stop("'scaleRow' needs to be of class 'function'") } } if (!is.null(trim)) { if (length(trim) != 2) { stop( "'trim' should be a 2 element vector specifying the lower", " and upper boundaries" ) } trim <- sort(trim) counts[counts < trim[1]] <- trim[1] counts[counts > trim[2]] <- trim[2] } ## Set color scheme and breaks uBoundRange <- max(counts, na.rm = TRUE) lboundRange <- min(counts, na.rm = TRUE) if (colorScheme == "divergent") { if (colorSchemeSymmetric == TRUE) { uBoundRange <- max(abs(uBoundRange), abs(lboundRange)) lboundRange <- -uBoundRange } if (is.null(col)) { col <- colorRampPalette(c("#1E90FF", "#FFFFFF", "#CD2626"), space = "Lab" )(100) } colLen <- length(col) if (is.null(breaks)) { breaks <- c( seq( lboundRange, colorSchemeCenter, length.out = round(colLen / 2) + 1 ), seq( colorSchemeCenter + 1e-6, uBoundRange, length.out = colLen - round(colLen / 2) ) ) } } else { # Sequential color scheme if (is.null(col)) { col <- colorRampPalette(c("#FFFFFF", brewer.pal( n = 9, name = "Blues" )))(100) } colLen <- length(col) if (is.null(breaks)) { breaks <- seq(lboundRange, uBoundRange, length.out = colLen) } } sp <- semiPheatmap( mat = counts, color = col, breaks = breaks, clusterCols = clusterCell, clusterRows = clusterFeature, annotationRow = annotationFeature, annotationCol = annotationCell, annotationColors = annotationColor, legend = legend, annotationLegend = annotationLegend, annotationNamesRow = annotationNamesFeature, annotationNamesCol = annotationNamesCell, showRownames = showNamesFeature, showColnames = showNamesCell, clusteringMethod = hclustMethod, treeHeightRow = treeheightFeature, treeHeightCol = treeheightCell, rowLabel = y, colLabel = z, rowGroupOrder = rowGroupOrder, colGroupOrder = colGroupOrder, silent = TRUE, ... ) return(sp) } ================================================ FILE: R/plot_decontx.R ================================================ #' @title Plots contamination on UMAP coordinates #' @description A scatter plot of the UMAP dimensions generated by DecontX with #' cells colored by the estimated percentation of contamation. #' @param x Either a \linkS4class{SingleCellExperiment} with \code{decontX} #' results stored in \code{metadata(x)$decontX} or the result from running #' decontX on a count matrix. #' @param batch Character. Batch of cells to plot. If \code{NULL}, then #' the first batch in the list will be selected. Default \code{NULL}. #' @param colorScale Character vector. Contains the color spectrum to be passed #' to \code{scale_colour_gradientn} from package 'ggplot2'. Default #' c("blue","green","yellow","orange","red"). #' @param size Numeric. Size of points in the scatterplot. Default 1. #' @return Returns a \code{ggplot} object. #' @author Shiyi Yang, Joshua Campbell #' @seealso See \code{\link{decontX}} for a full example of how to estimate #' and plot contamination. #' @export plotDecontXContamination <- function(x, batch = NULL, colorScale = c( "blue", "green", "yellow", "orange", "red" ), size = 1) { if (inherits(x, "SingleCellExperiment")) { estimates <- S4Vectors::metadata(x)$decontX$estimates } else { estimates <- x$estimates } if (is.null(estimates)) { stop("decontX estimates not found. Estimates will be found in 'metadata(x)$decontX$estimates' if 'x' is a SingleCellExperiment or 'x$estimates' if decontX was run on a count matrix. Are you sure 'x' is output from decontX?") } batches <- names(estimates) if (is.null(batch)) { i <- batches[1] } else { if (!(batch %in% batches)) { stop( "'", batch, "' is not one of the batches in 'x'. Batches available", " for plotting: '", paste(batches, collapse = ","), "'" ) } i <- batch } contamin <- estimates[[i]]$contamination umap <- estimates[[i]]$UMAP ## Create data.frame df <- data.frame(umap, "Contamination" = contamin) naIx <- is.na(umap[, 1]) | is.na(umap[, 2]) df <- df[!naIx, ] ## Generate ggplot scatterplot gg <- ggplot2::ggplot( df, ggplot2::aes_string( x = colnames(umap)[1], y = colnames(umap)[2] ) ) + ggplot2::geom_point( stat = "identity", size = size, ggplot2::aes_string(color = "Contamination") ) + ggplot2::theme_bw() + ggplot2::scale_colour_gradientn( colors = colorScale, name = "Contamination", limits = c(0, 1) ) + ggplot2::theme( panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank(), axis.text = ggplot2::element_text(size = 15), axis.title = ggplot2::element_text(size = 15) ) return(gg) } #' @title Plots percentage of cells cell types expressing markers #' @description Generates a barplot that shows the percentage of #' cells within clusters or cell types that have detectable levels #' of given marker genes. Can be used to view the expression of #' marker genes in different cell types before and after #' decontamination with \code{\link{decontX}}. #' @param x Either a \linkS4class{SingleCellExperiment} or #' a matrix-like object of counts. #' @param markers List. A named list indicating the marker genes #' for each cell type of #' interest. Multiple markers can be supplied for each cell type. For example, #' \code{list(Tcell_Markers=c("CD3E", "CD3D"), #' Bcell_Markers=c("CD79A", "CD79B", "MS4A1")} #' would specify markers for human T-cells and B-cells. #' A cell will be considered #' "positive" for a cell type if it has a count greater than \code{threshold} #' for at least one of the marker genes in the list. #' @param groupClusters List. A named list that allows #' cell clusters labels coded in #' \code{z} to be regrouped and renamed on the fly. For example, #' \code{list(Tcells=c(1, 2), Bcells=7)} would recode #' clusters 1 and 2 to "Tcells" #' and cluster 7 to "Bcells". Note that if this is #' used, clusters in \code{z} not found #' in \code{groupClusters} will be excluded from the barplot. #' Default \code{NULL}. #' @param assayName Character vector. Name(s) of the assay(s) to #' plot if \code{x} is a #' \linkS4class{SingleCellExperiment}. If more than one assay #' is listed, then side-by-side barplots will be generated. #' Default \code{c("counts", "decontXcounts")}. #' @param z Character, Integer, or Vector. Indicates the cluster labels #' for each cell. #' If \code{x} is a \linkS4class{SingleCellExperiment} and \code{z = NULL}, #' then the cluster labels from \code{\link{decontX}} will be retived from the #' \code{colData} of \code{x} (i.e. \code{colData(x)$decontX_clusters}). #' If \code{z} is a single character or integer, #' then that column will be retrived #' from \code{colData} of \code{x}. (i.e. \code{colData(x)[,z]}). If \code{x} #' is a counts matrix, then \code{z} will need #' to be a vector the same length as #' the number of columns in \code{x} that indicate #' the cluster to which each cell #' belongs. Default \code{NULL}. #' @param threshold Numeric. Markers greater than or equal to this value will #' be considered detected in a cell. Default 1. #' @param exactMatch Boolean. Whether to only identify exact matches #' for the markers or to identify partial matches using \code{\link{grep}}. See #' \code{\link{retrieveFeatureIndex}} for more details. Default \code{TRUE}. #' @param by Character. Where to search for the markers if \code{x} is a #' \linkS4class{SingleCellExperiment}. See \code{\link{retrieveFeatureIndex}} #' for more details. If \code{x} is a matrix, #' then this must be set to \code{"rownames"}.Default \code{"rownames"}. #' @param ncol Integer. Number of columns to make in the plot. #' Default \code{round(sqrt(length(markers))}. #' @param labelBars Boolean. Whether to display percentages above each bar #' Default \code{TRUE}. #' @param labelSize Numeric. Size of the percentage labels in the barplot. #' Default 3. #' @return Returns a \code{ggplot} object. #' @author Shiyi Yang, Joshua Campbell #' @seealso See \code{\link{decontX}} for a full example of how to estimate #' and plot contamination. #' @export plotDecontXMarkerPercentage <- function(x, markers, groupClusters = NULL, assayName = c( "counts", "decontXcounts" ), z = NULL, threshold = 1, exactMatch = TRUE, by = "rownames", ncol = round(sqrt(length(markers))), labelBars = TRUE, labelSize = 3) { cellTypeLabels <- percent <- NULL # fix check note legend <- "none" # Check that list arguments are named if (!is(markers, "list") || is.null(names(markers))) { stop("'markers' needs to be a named list.") } temp <- .processPlotDecontXMarkerInupt( x = x, z = z, markers = markers, groupClusters = groupClusters, by = by, exactMatch = exactMatch ) x <- temp$x z <- temp$z geneMarkerIndex <- temp$geneMarkerIndex geneMarkerCellTypeIndex <- temp$geneMarkerCellTypeIndex groupClusters <- temp$groupClusters xlab <- temp$xlab if (inherits(x, "SingleCellExperiment")) { # If 'x' is SingleCellExperiment, then get percentage # for each matrix in 'assayName' df.list <- list() for (i in seq_along(assayName)) { counts <- SummarizedExperiment::assay( x[geneMarkerIndex, ], assayName[i] ) df <- .calculateDecontXBarplotPercent( counts, z, geneMarkerCellTypeIndex, threshold ) df.list[[i]] <- cbind(df, assay = assayName[i]) } df <- do.call(rbind, df.list) assay <- as.factor(df$assay) if (length(assayName) > 1) { legend <- "right" } } else { ## If 'x' is matrix, then calculate percentages directly counts <- x[geneMarkerIndex, ] df <- .calculateDecontXBarplotPercent( counts, z, geneMarkerCellTypeIndex, threshold ) assay <- "red3" legend <- "none" } # Build data.frame for ggplots df <- cbind(df, cellTypeLabels = names(groupClusters)[df$cellType]) df$cellTypeLabels <- factor(df$cellTypeLabels, levels = names(groupClusters) ) df <- cbind(df, markerLabels = names(markers)[df$markers]) df$markerLabels <- factor(df$markerLabels, levels = names(markers)) plt <- ggplot2::ggplot(df, ggplot2::aes_string( x = "cellTypeLabels", y = "percent", fill = "assay" )) + ggplot2::geom_bar( stat = "identity", position = ggplot2::position_dodge2(width = 0.9, preserve = "single") ) + ggplot2::xlab(xlab) + ggplot2::ylab(paste0("Percentage of cells expressing markers")) + ggplot2::facet_wrap(. ~ df$markerLabels, ncol = ncol) + ggplot2::theme( panel.background = ggplot2::element_rect( fill = "white", color = "grey" ), panel.grid = ggplot2::element_line("grey"), legend.position = legend, legend.key = ggplot2::element_rect( fill = "white", color = "white" ), panel.grid.minor = ggplot2::element_blank(), panel.grid.major = ggplot2::element_blank(), text = ggplot2::element_text(size = 10), axis.text.x = ggplot2::element_text( size = 8, angle = 45, hjust = 1 ), axis.text.y = ggplot2::element_text(size = 9), legend.key.size = grid::unit(8, "mm"), legend.text = ggplot2::element_text(size = 10), strip.text.x = ggplot2::element_text(size = 10) ) if (isTRUE(labelBars)) { plt <- plt + ggplot2::geom_text(ggplot2::aes( x = cellTypeLabels, y = percent + 2.5, label = percent ), position = ggplot2::position_dodge2(width = 0.9, preserve = "single"), size = labelSize ) } return(plt) } #' @title Plots expression of marker genes before and after decontamination #' @description Generates a violin plot that shows the counts of marker #' genes in cells across specific clusters or cell types. Can be used to view #' the expression of marker genes in different cell types before and after #' decontamination with \code{\link{decontX}}. #' @param x Either a \linkS4class{SingleCellExperiment} #' or a matrix-like object of counts. #' @param markers Character Vector or List. A character vector #' or list of character vectors #' with the names of the marker genes of interest. #' @param groupClusters List. A named list that allows #' cell clusters labels coded in #' \code{z} to be regrouped and renamed on the fly. For example, #' \code{list(Tcells=c(1, 2), Bcells=7)} would recode clusters #' 1 and 2 to "Tcells" #' and cluster 7 to "Bcells". Note that if this is used, clusters #' in \code{z} not found #' in \code{groupClusters} will be excluded. Default \code{NULL}. #' @param assayName Character vector. Name(s) of the assay(s) to #' plot if \code{x} is a #' \linkS4class{SingleCellExperiment}. If more than one assay is listed, then #' side-by-side violin plots will be generated. #' Default \code{c("counts", "decontXcounts")}. #' @param z Character, Integer, or Vector. #' Indicates the cluster labels for each cell. #' If \code{x} is a \linkS4class{SingleCellExperiment} and \code{z = NULL}, #' then the cluster labels from \code{\link{decontX}} will be retreived from the #' \code{colData} of \code{x} (i.e. \code{colData(x)$decontX_clusters}). #' If \code{z} is a single character or integer, then that column will be #' retrived from \code{colData} of \code{x}. (i.e. \code{colData(x)[,z]}). #' If \code{x} is a counts matrix, then \code{z} will need to be a vector #' the same length as the number of columns in \code{x} that indicate #' the cluster to which each cell belongs. Default \code{NULL}. #' @param exactMatch Boolean. Whether to only identify exact matches #' for the markers or to identify partial matches using \code{\link{grep}}. #' See \code{\link{retrieveFeatureIndex}} for more details. #' Default \code{TRUE}. #' @param by Character. Where to search for the markers if \code{x} is a #' \linkS4class{SingleCellExperiment}. See \code{\link{retrieveFeatureIndex}} #' for more details. If \code{x} is a matrix, then this must be set to #' \code{"rownames"}. Default \code{"rownames"}. #' @param log1p Boolean. Whether to apply the function \code{log1p} to the data #' before plotting. This function will add a pseudocount of 1 and then log #' transform the expression values. Default \code{FALSE}. #' @param ncol Integer. Number of columns to make in the plot. #' Default \code{NULL}. #' @param plotDots Boolean. If \code{TRUE}, the #' expression of features will be plotted as points in addition to the violin #' curve. Default \code{FALSE}. #' @param dotSize Numeric. Size of points if \code{plotDots = TRUE}. #' Default \code{0.1}. #' @return Returns a \code{ggplot} object. #' @author Shiyi Yang, Joshua Campbell #' @seealso See \code{\link{decontX}} for a full example of how to estimate #' and plot contamination. #' @export plotDecontXMarkerExpression <- function(x, markers, groupClusters = NULL, assayName = c( "counts", "decontXcounts" ), z = NULL, exactMatch = TRUE, by = "rownames", log1p = FALSE, ncol = NULL, plotDots = FALSE, dotSize = 0.1) { legend <- "none" temp <- .processPlotDecontXMarkerInupt( x = x, z = z, markers = markers, groupClusters = groupClusters, by = by, exactMatch = exactMatch ) x <- temp$x z <- temp$z geneMarkerIndex <- temp$geneMarkerIndex groupClusters <- temp$groupClusters xlab <- temp$xlab if (inherits(x, "SingleCellExperiment")) { # If 'x' is SingleCellExperiment, then get percentage # for each matrix in 'assayName' df.list <- list() for (i in seq_along(assayName)) { counts <- SummarizedExperiment::assay( x[geneMarkerIndex, ], assayName[i] ) df <- reshape2::melt(as.matrix(counts), varnames = c("Marker", "Cell"), value.name = "Expression" ) df.list[[i]] <- cbind(df, assay = assayName[i]) } df <- do.call(rbind, df.list) assay <- factor(df$assay, levels = assayName) if (length(assayName) > 1) { legend <- "right" } } else { ## If 'x' is matrix, then calculate percentages directly counts <- x[geneMarkerIndex, ] df <- reshape2::melt(counts, varnames = c("Marker", "Cell"), value.name = "Expression" ) assay <- "red3" legend <- "none" } # Create data.frame and add cell type groups back in names(z) <- colnames(x) df <- cbind(df, Cluster = z[df$Cell]) ylab <- "Expression" if (isTRUE(log1p)) { df$Expression <- log1p(df$Expression) ylab <- "Expression (log1p)" } Expression <- df$Expression Marker <- df$Marker Assay <- factor(df$assay, levels = assayName) Cluster <- df$Cluster if (!is.null(groupClusters)) { df <- cbind(df, Cell_Type = names(groupClusters)[Cluster]) Cell_Type <- factor(df$Cell_Type, levels = names(groupClusters)) plt <- ggplot2::ggplot(df, ggplot2::aes( x = Cell_Type, y = Expression, fill = Assay )) + ggplot2::facet_wrap(~ Cell_Type + Marker, scales = "free", labeller = ggplot2::label_context, ncol = ncol ) } else { plt <- ggplot2::ggplot(df, ggplot2::aes( x = Cluster, y = Expression, fill = Assay )) + ggplot2::facet_wrap(~ Cluster + Marker, scales = "free", labeller = ggplot2::label_context, ncol = ncol ) } plt <- plt + ggplot2::geom_violin( trim = TRUE, scale = "width" ) + ggplot2::theme_bw() + ggplot2::theme( axis.text.x = ggplot2::element_blank(), axis.ticks.x = ggplot2::element_blank(), axis.title.x = ggplot2::element_blank(), strip.text = ggplot2::element_text(size = 8), panel.grid = ggplot2::element_blank(), legend.position = legend ) + ggplot2::ylab(ylab) if (isTRUE(plotDots)) { plt <- plt + ggplot2::geom_jitter(height = 0, size = dotSize) } return(plt) } .processPlotDecontXMarkerInupt <- function(x, z, markers, groupClusters, by, exactMatch) { # Process z and convert to a factor if (is.null(z) & inherits(x, "SingleCellExperiment")) { cn <- colnames(SummarizedExperiment::colData(x)) if (!("decontX_clusters" %in% cn)) { stop("'decontX_clusters' not found in 'colData(x)'. Make sure you have run 'decontX' or supply 'z' directly.") } z <- SummarizedExperiment::colData(x)$decontX_clusters } else if (length(z) == 1 & inherits(x, "SingleCellExperiment")) { if (!(z %in% colnames(SummarizedExperiment::colData(x)))) { stop("'", z, "' not found in 'colData(x)'.") } z <- SummarizedExperiment::colData(x)[, z] } else if (length(z) != ncol(x)) { stop("If 'x' is a SingleCellExperiment, then 'z' needs to be", " a single character or integer specifying the column in", " 'colData(x)'. Alternatively to specify the cell cluster", " labels directly as a vector, the length of 'z' needs to", " be the same as the number of columns in 'x'. This is", " required if 'x' is a matrix.") } if (!is.factor(z)) { z <- as.factor(z) } if (!is.null(groupClusters)) { if (!is(groupClusters, "list") || is.null(names(groupClusters))) { stop("'groupClusters' needs to be a named list.") } # Check that groupClusters are found in 'z' cellMappings <- unlist(groupClusters) if (any(!(cellMappings %in% z))) { missing <- cellMappings[!(cellMappings %in% z)] stop( "'groupClusters' not found in 'z': ", paste(missing, collapse = ",") ) } # Check for duplicate groupClusters ta <- table(unlist(groupClusters)) if (any(ta > 1)) { dup <- names(ta)[ta > 1] stop("'groupClusters' had duplicate values for the following clusters: ", paste(dup, collapse = ","), ". Clusters need be assigned to a", "single group.") } labels <- rep(NA, ncol(x)) for (i in seq_along(groupClusters)) { labels[z %in% groupClusters[[i]]] <- names(groupClusters)[i] } na.ix <- is.na(labels) labels <- labels[!na.ix] x <- x[, !na.ix] z <- as.integer(factor(labels, levels = names(groupClusters))) xlab <- "Cell types" } else { labels <- as.factor(z) groupClusters <- levels(labels) names(groupClusters) <- levels(labels) xlab <- "Clusters" } # Find index of each feature in 'x' geneMarkerCellTypeIndex <- rep( seq(length(markers)), lapply(markers, length) ) geneMarkerIndex <- retrieveFeatureIndex(unlist(markers), x, by = by, removeNA = FALSE, exactMatch = exactMatch ) # Remove genes that did not match na.ix <- is.na(geneMarkerIndex) geneMarkerCellTypeIndex <- geneMarkerCellTypeIndex[!na.ix] geneMarkerIndex <- geneMarkerIndex[!na.ix] return(list( x = x, z = z, geneMarkerIndex = geneMarkerIndex, geneMarkerCellTypeIndex = geneMarkerCellTypeIndex, groupClusters = groupClusters, xlab = xlab )) } .calculateDecontXBarplotPercent <- function(counts, z, geneMarkerCellTypeIndex, threshold) { # Get counts matrix and convert to DelayedMatrix counts <- DelayedArray::DelayedArray(counts) # Convert to boolean matrix and sum markers in same cell type # The "+ 0" is to convert boolean to numeric counts <- counts >= threshold countsByMarker <- DelayedArray::rowsum(counts + 0, geneMarkerCellTypeIndex) countsByCellType <- DelayedArray::colsum((countsByMarker > 0) + 0, z) # Calculate percentages within each cell cluster zTotals <- tabulate(z) percentByCellType <- round(sweep(countsByCellType, 2, zTotals, "/") * 100) df <- reshape2::melt(percentByCellType, varnames = c("markers", "cellType"), value.name = "percent" ) return(df) } ================================================ FILE: R/plot_dr.R ================================================ #' @title Mapping the dimension reduction plot #' @description Creates a scatterplot given two dimensions from a data #' dimension reduction tool (e.g tSNE) output. #' @param x Numeric matrix or a \linkS4class{SingleCellExperiment} object #' with the matrix located in the assay slot under \code{useAssay}. Each #' row of the matrix will be plotted as a separate facet. #' @param reducedDimName The name of the dimension reduction slot in #' \code{reducedDimNames(x)} if \code{x} is a #' \linkS4class{SingleCellExperiment} object. Ignored if both \code{dim1} and #' \code{dim2} are set. #' @param dim1 Numeric vector. Second dimension from data dimension #' reduction output. #' @param dim2 Numeric vector. Second dimension from data dimension #' reduction output. #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a #' \linkS4class{SingleCellExperiment} object. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param size Numeric. Sets size of point on plot. Default 1. #' @param xlab Character vector. Label for the x-axis. Default 'Dimension_1'. #' @param ylab Character vector. Label for the y-axis. Default 'Dimension_2'. #' @param limits Passed to \link{scale_colour_gradient2}. The range #' of color scale. #' @param colorLow Character. A color available from `colors()`. #' The color will be used to signify the lowest values on the scale. #' Default "blue4". #' @param colorMid Character. A color available from `colors()`. #' The color will be used to signify the midpoint on the scale. Default #' "grey90". #' @param colorHigh Character. A color available from `colors()`. #' The color will be used to signify the highest values on the scale. #' Default "firebrick1". #' @param midpoint Numeric. The value indicating the midpoint of the #' diverging color scheme. If \code{NULL}, defaults to the mean #' with 10 percent of values trimmed. Default \code{0}. #' @param varLabel Character vector. Title for the color legend. #' @param ncol Integer. Passed to \link[ggplot2]{facet_wrap}. Specify the #' number of columns for facet wrap. #' @param headers Character vector. If `NULL`, the corresponding rownames are #' used as labels. Otherwise, these headers are used to label the genes. #' @param decreasing logical. Specifies the order of plotting the points. #' If \code{FALSE}, the points will be plotted in increasing order where #' the points with largest values will be on top. \code{TRUE} otherwise. #' If \code{NULL}, no sorting is performed. Points will be plotted in their #' current order in \code{x}. Default \code{FALSE}. #' @return The plot as a ggplot object #' @export setGeneric("plotDimReduceGrid", function(x, reducedDimName, dim1 = NULL, dim2 = NULL, useAssay = "counts", altExpName = "featureSubset", size = 1, xlab = "Dimension_1", ylab = "Dimension_2", limits = c(-2, 2), colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, varLabel = NULL, ncol = NULL, headers = NULL, decreasing = FALSE) { standardGeneric("plotDimReduceGrid") }) #' @rdname plotDimReduceGrid #' @examples #' data(sceCeldaCG) #' sce <- celdaTsne(sceCeldaCG) #' plotDimReduceGrid(x = sce, #' reducedDimName = "celda_tSNE", #' xlab = "Dimension1", #' ylab = "Dimension2", #' varLabel = "tSNE") #' @export setMethod("plotDimReduceGrid", signature(x = "SingleCellExperiment"), function(x, reducedDimName, dim1 = NULL, dim2 = NULL, useAssay = "counts", altExpName = "featureSubset", size = 1, xlab = "Dimension_1", ylab = "Dimension_2", limits = c(-2, 2), colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, varLabel = NULL, ncol = NULL, headers = NULL, decreasing = FALSE) { altExp <- SingleCellExperiment::altExp(x, altExpName) matrix <- SummarizedExperiment::assay(x, i = useAssay) if (is.null(dim1)) { dim1 <- SingleCellExperiment::reducedDim(altExp, reducedDimName)[, 1] } if (is.null(dim2)) { dim2 <- SingleCellExperiment::reducedDim(altExp, reducedDimName)[, 2] } g <- .plotDimReduceGrid( dim1 = dim1, dim2 = dim2, matrix = matrix, size = size, xlab = xlab, ylab = ylab, limits = limits, colorLow = colorLow, colorMid = colorMid, colorHigh = colorHigh, midpoint = midpoint, varLabel = varLabel, ncol = ncol, headers = headers, decreasing = decreasing ) return(g) }) #' @rdname plotDimReduceGrid #' @examples #' library(SingleCellExperiment) #' data(sceCeldaCG) #' sce <- celdaTsne(sceCeldaCG) #' plotDimReduceGrid(x = counts(sce), #' dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1], #' dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2], #' xlab = "Dimension1", #' ylab = "Dimension2", #' varLabel = "tSNE") #' @export setMethod("plotDimReduceGrid", signature(x = "ANY"), function(x, dim1, dim2, size = 1, xlab = "Dimension_1", ylab = "Dimension_2", limits = c(-2, 2), colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, varLabel = NULL, ncol = NULL, headers = NULL, decreasing = FALSE) { x <- as.matrix(x) g <- .plotDimReduceGrid( dim1 = dim1, dim2 = dim2, matrix = x, size = size, xlab = xlab, ylab = ylab, limits = limits, colorLow = colorLow, colorMid = colorMid, colorHigh = colorHigh, midpoint = midpoint, varLabel = varLabel, ncol = ncol, headers = headers, decreasing = decreasing ) return(g) }) #' @importFrom reshape2 melt .plotDimReduceGrid <- function(dim1, dim2, matrix, size, xlab, ylab, limits, colorLow, colorMid, colorHigh, midpoint, varLabel, ncol, headers, decreasing) { df <- data.frame(dim1, dim2, t(as.data.frame(matrix)), check.names = FALSE) naIx <- is.na(dim1) | is.na(dim2) df <- df[!naIx, ] m <- reshape2::melt(df, id.vars = c("dim1", "dim2")) colnames(m) <- c(xlab, ylab, "facet", "Expression") if (!is.null(decreasing)) { m <- m[order(m$facet, m$Expression, decreasing = decreasing), ] } if (is.null(midpoint)) { midpoint <- mean(m[, 4], trim = 0.1) } varLabel <- gsub("_", " ", varLabel) if (isFALSE(is.null(headers))) { names(headers) <- levels(m$facet) headers <- ggplot2::as_labeller(headers) g <- ggplot2::ggplot(m, ggplot2::aes_string(x = xlab, y = ylab)) + ggplot2::geom_point(stat = "identity", size = size, ggplot2::aes_string(color = m$Expression)) + ggplot2::theme_bw() + ggplot2::scale_colour_gradient2( limits = limits, low = colorLow, high = colorHigh, mid = colorMid, midpoint = midpoint, name = varLabel ) + ggplot2::theme( strip.background = ggplot2::element_blank(), panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank(), panel.spacing = unit(0, "lines"), panel.background = ggplot2::element_blank(), axis.line = ggplot2::element_line(colour = "black") ) if (isFALSE(is.null(ncol))) { g <- g + ggplot2::facet_wrap(~ facet, labeller = headers, ncol = ncol) } else { g <- g + ggplot2::facet_wrap(~ facet, labeller = headers) } } else { g <- ggplot2::ggplot(m, ggplot2::aes_string(x = xlab, y = ylab)) + ggplot2::geom_point(stat = "identity", size = size, ggplot2::aes_string(color = m$Expression)) + ggplot2::facet_wrap(~ facet) + ggplot2::theme_bw() + ggplot2::scale_colour_gradient2( limits = limits, low = colorLow, high = colorHigh, mid = colorMid, midpoint = midpoint, name = varLabel ) + ggplot2::theme( strip.background = ggplot2::element_blank(), panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank(), panel.spacing = unit(0, "lines"), panel.background = ggplot2::element_blank(), axis.line = ggplot2::element_line(colour = "black") ) if (isFALSE(is.null(ncol))) { g <- g + ggplot2::facet_wrap(~ facet, ncol = ncol) } else { g <- g + ggplot2::facet_wrap(~ facet) } } return(g) } #' @title Plotting feature expression on a dimension reduction plot #' @description Create a scatterplot for each row of a normalized gene #' expression matrix where x and y axis are from a data dimension #' reduction tool. The cells are colored by expression of #' the specified feature. #' @param x Numeric matrix or a \linkS4class{SingleCellExperiment} object #' with the matrix located in the assay slot under \code{useAssay}. Rows #' represent features and columns represent cells. #' @param features Character vector. Features in the rownames of counts to plot. #' @param reducedDimName The name of the dimension reduction slot in #' \code{reducedDimNames(x)} if \code{x} is a #' \linkS4class{SingleCellExperiment} object. If \code{NULL}, then both #' \code{dim1} and \code{dim2} need to be set. Default \code{NULL}. #' @param displayName Character. The column name of #' \code{rowData(x)} that specifies the display names for #' the features. Default \code{NULL}, which displays the row names. Only works #' if \code{x} is a \linkS4class{SingleCellExperiment} object. Overwrites #' \code{headers}. #' @param dim1 Integer or numeric vector. If \code{reducedDimName} is supplied, #' then, this will be used as an index to determine which dimension will be #' plotted on the x-axis. If \code{reducedDimName} is not supplied, then this #' should be a vector which will be plotted on the x-axis. Default \code{1}. #' @param dim2 Integer or numeric vector. If \code{reducedDimName} is supplied, #' then, this will be used as an index to determine which dimension will be #' plotted on the y-axis. If \code{reducedDimName} is not supplied, then this #' should be a vector which will be plotted on the y-axis. Default \code{2}. #' @param headers Character vector. If \code{NULL}, the corresponding #' rownames are used as labels. Otherwise, these headers are used to label #' the features. Only works if \code{displayName} is \code{NULL} and #' \code{exactMatch} is \code{FALSE}. #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a #' \linkS4class{SingleCellExperiment} object. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param normalize Logical. Whether to normalize the columns of `counts`. #' Default \code{FALSE}. #' @param zscore Logical. Whether to scale each feature to have a mean 0 #' and standard deviation of 1. Default \code{TRUE}. #' @param exactMatch Logical. Whether an exact match or a partial match using #' \code{grep()} is used to look up the feature in the rownames of the counts #' matrix. Default TRUE. #' @param trim Numeric vector. Vector of length two that specifies the lower #' and upper bounds for the data. This threshold is applied after row scaling. #' Set to NULL to disable. Default \code{c(-1,1)}. #' @param limits Passed to \link{scale_colour_gradient2}. The range #' of color scale. #' @param size Numeric. Sets size of point on plot. Default 1. #' @param xlab Character vector. Label for the x-axis. If \code{reducedDimName} #' is used, then this will be set to the column name of the first dimension of #' that object. Default "Dimension_1". #' @param ylab Character vector. Label for the y-axis. If \code{reducedDimName} #' is used, then this will be set to the column name of the second dimension of #' that object. Default "Dimension_2". #' @param colorLow Character. A color available from `colors()`. The color #' will be used to signify the lowest values on the scale. #' @param colorMid Character. A color available from `colors()`. The color #' will be used to signify the midpoint on the scale. #' @param colorHigh Character. A color available from `colors()`. The color #' will be used to signify the highest values on the scale. #' @param midpoint Numeric. The value indicating the midpoint of the #' diverging color scheme. If \code{NULL}, defaults to the mean #' with 10 percent of values trimmed. Default \code{0}. #' @param ncol Integer. Passed to \link[ggplot2]{facet_wrap}. Specify the #' number of columns for facet wrap. #' @param decreasing logical. Specifies the order of plotting the points. #' If \code{FALSE}, the points will be plotted in increasing order where #' the points with largest values will be on top. \code{TRUE} otherwise. #' If \code{NULL}, no sorting is performed. Points will be plotted in their #' current order in \code{x}. Default \code{FALSE}. #' @return The plot as a ggplot object #' @export setGeneric("plotDimReduceFeature", function(x, features, reducedDimName = NULL, displayName = NULL, dim1 = NULL, dim2 = NULL, headers = NULL, useAssay = "counts", altExpName = "featureSubset", normalize = FALSE, zscore = TRUE, exactMatch = TRUE, trim = c(-2, 2), limits = c(-2, 2), size = 0.5, xlab = NULL, ylab = NULL, colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, ncol = NULL, decreasing = FALSE) { standardGeneric("plotDimReduceFeature") }) #' @rdname plotDimReduceFeature #' @examples #' data(sceCeldaCG) #' sce <- celdaTsne(sceCeldaCG) #' plotDimReduceFeature(x = sce, #' reducedDimName = "celda_tSNE", #' normalize = TRUE, #' features = c("Gene_98", "Gene_99"), #' exactMatch = TRUE) #' @export setMethod("plotDimReduceFeature", signature(x = "SingleCellExperiment"), function(x, features, reducedDimName, displayName = NULL, dim1 = 1, dim2 = 2, headers = NULL, useAssay = "counts", altExpName = "featureSubset", normalize = FALSE, zscore = TRUE, exactMatch = TRUE, trim = c(-2, 2), limits = c(-2, 2), size = 0.5, xlab = NULL, ylab = NULL, colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, ncol = NULL, decreasing = FALSE) { altExp <- SingleCellExperiment::altExp(x, altExpName) counts <- SummarizedExperiment::assay(x, i = useAssay) reddim <- .processReducedDim( x = altExp, reducedDimName = reducedDimName, dim1 = dim1, dim2 = dim2, xlab = xlab, ylab = ylab ) if (isFALSE(is.null(displayName))) { featuresIx <- retrieveFeatureIndex(features, x, by = displayName, exactMatch = exactMatch) headers <- SummarizedExperiment::rowData(x)[[displayName]][featuresIx] } else { featuresIx <- retrieveFeatureIndex(features, counts, by = "rownames", exactMatch = exactMatch) if (isFALSE(is.null(headers))) { if (length(headers) != length(features)) { stop("Headers ", headers, " should be the same length as features ", features) } if (isFALSE(exactMatch)) { warning("exactMatch is FALSE. headers will not be used!") headers <- NULL } } } g <- .plotDimReduceFeature( dim1 = reddim$dim1, dim2 = reddim$dim2, counts = counts, features = features, headers = headers, normalize = normalize, zscore = zscore, featuresIx = featuresIx, trim = trim, limits = limits, size = size, xlab = reddim$xlab, ylab = reddim$ylab, colorLow = colorLow, colorMid = colorMid, colorHigh = colorHigh, midpoint = midpoint, ncol = ncol, decreasing = decreasing ) return(g) }) #' @rdname plotDimReduceFeature #' @examples #' library(SingleCellExperiment) #' data(sceCeldaCG) #' sce <- celdaTsne(sceCeldaCG) #' plotDimReduceFeature(x = counts(sce), #' dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1], #' dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2], #' normalize = TRUE, #' features = c("Gene_98", "Gene_99"), #' exactMatch = TRUE) #' @export setMethod("plotDimReduceFeature", signature(x = "ANY"), function(x, features, dim1, dim2, headers = NULL, normalize = FALSE, zscore = TRUE, exactMatch = TRUE, trim = c(-2, 2), limits = c(-2, 2), size = 0.5, xlab = "Dimension_1", ylab = "Dimension_2", colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, ncol = NULL, decreasing = FALSE) { x <- as.matrix(x) if (isFALSE(is.null(headers))) { if (length(headers) != length(features)) { stop("Headers ", headers, " should be the same length as features ", features) } if (isFALSE(exactMatch)) { warning("exactMatch is FALSE. headers will not be used!") headers <- NULL } } featuresIx <- retrieveFeatureIndex(features, x, by = "rownames", exactMatch = exactMatch) g <- .plotDimReduceFeature( dim1 = dim1, dim2 = dim2, counts = x, features = features, headers = headers, normalize = normalize, zscore = zscore, featuresIx = featuresIx, trim = trim, limits = limits, size = size, xlab = xlab, ylab = ylab, colorLow = colorLow, colorMid = colorMid, colorHigh = colorHigh, midpoint = midpoint, ncol = ncol, decreasing = decreasing ) return(g) }) .plotDimReduceFeature <- function(dim1, dim2, counts, features, headers, normalize, zscore, featuresIx, trim, limits, size, xlab, ylab, colorLow, colorMid, colorHigh, midpoint, ncol, decreasing) { # Perform checks if (is.null(features)) { stop("at least one feature is required to create a plot") } ## Normalize data if needed if (isTRUE(normalize)) { counts <- normalizeCounts(counts, transformationFun = sqrt) } # After normalization, features can be selected featuresIx <- featuresIx[stats::complete.cases(featuresIx)] counts <- as.matrix(counts[featuresIx, , drop = FALSE]) # Scale/zscore data if needed varLabel <- "Expression" if (isTRUE(zscore)) { counts <- t(scale(t(counts))) varLabel <- "Scaled\nExpression" } if (!is.null(trim)) { if (length(trim) != 2) { stop("'trim' should be a 2 element vector", "specifying the lower and upper boundaries") } trim <- sort(trim) counts[counts < trim[1]] <- trim[1] counts[counts > trim[2]] <- trim[2] } .plotDimReduceGrid( dim1 = dim1, dim2 = dim2, matrix = counts, size = size, xlab = xlab, ylab = ylab, limits = limits, colorLow = colorLow, colorMid = colorMid, colorHigh = colorHigh, varLabel = varLabel, midpoint = midpoint, ncol = ncol, headers = headers, decreasing = decreasing ) } #' @title Plotting Celda module probability on a #' dimension reduction plot #' @description Create a scatterplot for each row of a normalized #' gene expression matrix where x and y axis are from a data #' dimension reduction tool. #' The cells are colored by the module probability. #' @param x Numeric matrix or a \linkS4class{SingleCellExperiment} object #' with the matrix located in the assay slot under \code{useAssay}. Rows #' represent features and columns represent cells. #' @param reducedDimName The name of the dimension reduction slot in #' \code{reducedDimNames(x)} if \code{x} is a #' \linkS4class{SingleCellExperiment} object. Ignored if both \code{dim1} and #' \code{dim2} are set. #' @param dim1 Integer or numeric vector. If \code{reducedDimName} is supplied, #' then, this will be used as an index to determine which dimension will be #' plotted on the x-axis. If \code{reducedDimName} is not supplied, then this #' should be a vector which will be plotted on the x-axis. Default \code{1}. #' @param dim2 Integer or numeric vector. If \code{reducedDimName} is supplied, #' then, this will be used as an index to determine which dimension will be #' plotted on the y-axis. If \code{reducedDimName} is not supplied, then this #' should be a vector which will be plotted on the y-axis. Default \code{2}. #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a #' \linkS4class{SingleCellExperiment} object. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param celdaMod Celda object of class "celda_G" or "celda_CG". Used only if #' \code{x} is a matrix object. #' @param modules Character vector. Module(s) from celda model to be plotted. #' e.g. c("1", "2"). #' @param size Numeric. Sets size of point on plot. Default 0.5. #' @param xlab Character vector. Label for the x-axis. Default "Dimension_1". #' @param ylab Character vector. Label for the y-axis. Default "Dimension_2". #' @param colorLow Character. A color available from `colors()`. #' The color will be used to signify the lowest values on the scale. #' @param rescale Logical. #' Whether rows of the matrix should be rescaled to [0, 1]. Default TRUE. #' @param limits Passed to \link{scale_colour_gradient}. The range #' of color scale. #' @param colorHigh Character. A color available from `colors()`. #' The color will be used to signify the highest values on the scale. #' @param ncol Integer. Passed to \link[ggplot2]{facet_wrap}. Specify the #' number of columns for facet wrap. #' @param decreasing logical. Specifies the order of plotting the points. #' If \code{FALSE}, the points will be plotted in increasing order where #' the points with largest values will be on top. \code{TRUE} otherwise. #' If \code{NULL}, no sorting is performed. Points will be plotted in their #' current order in \code{x}. Default \code{FALSE}. #' @return The plot as a ggplot object #' @export setGeneric("plotDimReduceModule", function(x, reducedDimName, useAssay = "counts", altExpName = "featureSubset", celdaMod, modules = NULL, dim1 = NULL, dim2 = NULL, size = 0.5, xlab = NULL, ylab = NULL, rescale = TRUE, limits = c(0, 1), colorLow = "grey90", colorHigh = "firebrick1", ncol = NULL, decreasing = FALSE) { standardGeneric("plotDimReduceModule") }) #' @rdname plotDimReduceModule #' @examples #' data(sceCeldaCG) #' sce <- celdaTsne(sceCeldaCG) #' plotDimReduceModule(x = sce, #' reducedDimName = "celda_tSNE", #' modules = c("1", "2")) #' @export setMethod("plotDimReduceModule", signature(x = "SingleCellExperiment"), function(x, reducedDimName, useAssay = "counts", altExpName = "featureSubset", modules = NULL, dim1 = 1, dim2 = 2, size = 0.5, xlab = NULL, ylab = NULL, rescale = TRUE, limits = c(0, 1), colorLow = "grey90", colorHigh = "firebrick1", ncol = NULL, decreasing = FALSE) { # Get reduced dim object altExp <- SingleCellExperiment::altExp(x, altExpName) reddim <- .processReducedDim( x = altExp, reducedDimName = reducedDimName, dim1 = dim1, dim2 = dim2, xlab = xlab, ylab = ylab ) factorized <- factorizeMatrix(x, useAssay = useAssay, altExpName = altExpName, type = "proportion") g <- .plotDimReduceModule( dim1 = reddim$dim1, dim2 = reddim$dim2, factorized = factorized, modules = modules, rescale = rescale, limits = limits, size = size, xlab = reddim$xlab, ylab = reddim$ylab, colorLow = colorLow, colorHigh = colorHigh, ncol = ncol, decreasing = decreasing ) return(g) }) #' @rdname plotDimReduceModule #' @examples #' library(SingleCellExperiment) #' data(sceCeldaCG, celdaCGMod) #' sce <- celdaTsne(sceCeldaCG) #' plotDimReduceModule(x = counts(sce), #' dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1], #' dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2], #' celdaMod = celdaCGMod, #' modules = c("1", "2")) #' @export setMethod("plotDimReduceModule", signature(x = "ANY"), function(x, celdaMod, modules = NULL, dim1, dim2, size = 0.5, xlab = "Dimension_1", ylab = "Dimension_2", rescale = TRUE, limits = c(0, 1), colorLow = "grey90", colorHigh = "firebrick1", ncol = NULL, decreasing = FALSE) { factorized <- factorizeMatrix(x = x, celdaMod = celdaMod) reddim <- .processReducedDim( x = x, dim1 = dim1, dim2 = dim2, xlab = xlab, ylab = ylab ) g <- .plotDimReduceModule( dim1 = reddim$dim1, dim2 = reddim$dim2, factorized = factorized, modules = modules, rescale = rescale, limits = limits, size = size, xlab = reddim$xlab, ylab = reddim$ylab, colorLow = colorLow, colorHigh = colorHigh, ncol = ncol, decreasing = decreasing ) return(g) }) .plotDimReduceModule <- function(dim1, dim2, factorized, modules, rescale, limits, size, xlab, ylab, colorLow, colorHigh, ncol, decreasing) { matrix <- factorized$proportions$cell if (rescale == TRUE) { for (x in seq(nrow(matrix))) { matrix[x, ] <- matrix[x, ] - min(matrix[x, ]) matrix[x, ] <- matrix[x, ] / max(matrix[x, ]) varLabel <- "Scaled Probability" } } else { varLabel <- "Probability" } rownames(matrix) <- gsub("L", "", rownames(matrix)) if (!is.null(modules)) { if (length(rownames(matrix)[rownames(matrix) %in% modules]) < 1) { stop("All modules selected do not exist in the model.") } matrix <- matrix[which(rownames(matrix) %in% modules), , drop = FALSE] matrix <- matrix[match(rownames(matrix), modules), , drop = FALSE] } rownames(matrix) <- paste0("L", rownames(matrix)) df <- data.frame(dim1, dim2, t(as.data.frame(matrix)), check.names = FALSE) naIx <- is.na(dim1) | is.na(dim2) df <- df[!naIx, ] m <- reshape2::melt(df, id.vars = c("dim1", "dim2")) colnames(m) <- c(xlab, ylab, "facet", "Expression") if (!is.null(decreasing)) { m <- m[order(m$facet, m$Expression, decreasing = decreasing), ] } g <- ggplot2::ggplot(m, ggplot2::aes_string(x = xlab, y = ylab)) + ggplot2::geom_point(stat = "identity", size = size, ggplot2::aes_string(color = m$Expression)) + ggplot2::facet_wrap(~ facet) + ggplot2::theme_bw() + ggplot2::scale_colour_gradient( limits = limits, low = colorLow, high = colorHigh, name = varLabel ) + ggplot2::theme( strip.background = ggplot2::element_blank(), panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank(), panel.spacing = unit(0, "lines"), panel.background = ggplot2::element_blank(), axis.line = ggplot2::element_line(colour = "black") ) if (isFALSE(is.null(ncol))) { g <- g + ggplot2::facet_wrap(~ facet, ncol = ncol) } else { g <- g + ggplot2::facet_wrap(~ facet) } return(g) } # Labeling code adapted from Seurat (https://github.com/satijalab/seurat) #' @title Plotting the cell labels on a dimension reduction plot #' @description Create a scatterplot for each row of a normalized #' gene expression matrix where x and y axis are from a #' data dimension reduction tool. #' The cells are colored by "celda_cell_cluster" column in #' \code{colData(altExp(x, altExpName))} if \code{x} is a #' \linkS4class{SingleCellExperiment} object, or \code{x} if \code{x} is #' a integer vector of cell cluster labels. #' @param x Integer vector of cell cluster labels or a #' \linkS4class{SingleCellExperiment} object #' containing cluster labels for each cell in \code{"celda_cell_cluster"} #' column in \code{colData(x)}. #' @param reducedDimName The name of the dimension reduction slot in #' \code{reducedDimNames(x)} if \code{x} is a #' \linkS4class{SingleCellExperiment} object. Ignored if both \code{dim1} and #' \code{dim2} are set. #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param dim1 Integer or numeric vector. If \code{reducedDimName} is supplied, #' then, this will be used as an index to determine which dimension will be #' plotted on the x-axis. If \code{reducedDimName} is not supplied, then this #' should be a vector which will be plotted on the x-axis. Default \code{1}. #' @param dim2 Integer or numeric vector. If \code{reducedDimName} is supplied, #' then, this will be used as an index to determine which dimension will be #' plotted on the y-axis. If \code{reducedDimName} is not supplied, then this #' should be a vector which will be plotted on the y-axis. Default \code{2}. #' @param size Numeric. Sets size of point on plot. Default \code{0.5}. #' @param xlab Character vector. Label for the x-axis. Default \code{NULL}. #' @param ylab Character vector. Label for the y-axis. Default \code{NULL}. #' @param specificClusters Numeric vector. #' Only color cells in the specified clusters. #' All other cells will be grey. #' If NULL, all clusters will be colored. Default \code{NULL}. #' @param labelClusters Logical. Whether the cluster labels are plotted. #' Default FALSE. #' @param groupBy Character vector. Contains sample labels for each cell. #' If NULL, all samples will be plotted together. Default NULL. #' @param labelSize Numeric. Sets size of label if labelClusters is TRUE. #' Default 3.5. #' @return The plot as a ggplot object #' @importFrom ggrepel geom_text_repel #' @export setGeneric("plotDimReduceCluster", function(x, reducedDimName, altExpName = "featureSubset", dim1 = NULL, dim2 = NULL, size = 0.5, xlab = NULL, ylab = NULL, specificClusters = NULL, labelClusters = FALSE, groupBy = NULL, labelSize = 3.5) { standardGeneric("plotDimReduceCluster") }) #' @rdname plotDimReduceCluster #' @examples #' data(sceCeldaCG) #' sce <- celdaTsne(sceCeldaCG) #' plotDimReduceCluster(x = sce, #' reducedDimName = "celda_tSNE", #' specificClusters = c(1, 2, 3)) #' @export setMethod("plotDimReduceCluster", signature(x = "SingleCellExperiment"), function(x, reducedDimName, altExpName = "featureSubset", dim1 = 1, dim2 = 2, size = 0.5, xlab = NULL, ylab = NULL, specificClusters = NULL, labelClusters = FALSE, groupBy = NULL, labelSize = 3.5) { altExp <- SingleCellExperiment::altExp(x, altExpName) if (!("celda_cell_cluster" %in% colnames(SummarizedExperiment::colData(altExp)))) { stop("Must have column 'celda_cell_cluster' in", " colData(altExp(x, altExpName))!") } cluster <- SummarizedExperiment::colData(altExp)[["celda_cell_cluster"]] reddim <- .processReducedDim( x = altExp, reducedDimName = reducedDimName, dim1 = dim1, dim2 = dim2, xlab = xlab, ylab = ylab ) g <- .plotDimReduceCluster( dim1 = reddim$dim1, dim2 = reddim$dim2, cluster = cluster, size = size, xlab = reddim$xlab, ylab = reddim$ylab, specificClusters = specificClusters, labelClusters = labelClusters, groupBy = groupBy, labelSize = labelSize ) return(g) }) #' @rdname plotDimReduceCluster #' @examples #' library(SingleCellExperiment) #' data(sceCeldaCG, celdaCGMod) #' sce <- celdaTsne(sceCeldaCG) #' plotDimReduceCluster(x = celdaClusters(celdaCGMod)$z, #' dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1], #' dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2], #' specificClusters = c(1, 2, 3)) #' @export setMethod("plotDimReduceCluster", signature(x = "vector"), function(x, dim1, dim2, size = 0.5, xlab = "Dimension_1", ylab = "Dimension_2", specificClusters = NULL, labelClusters = FALSE, groupBy = NULL, labelSize = 3.5) { reddim <- .processReducedDim( x = x, dim1 = dim1, dim2 = dim2, xlab = xlab, ylab = ylab ) g <- .plotDimReduceCluster( dim1 = reddim$dim1, dim2 = reddim$dim2, cluster = x, size = size, xlab = reddim$xlab, ylab = reddim$ylab, specificClusters = specificClusters, labelClusters = labelClusters, groupBy = groupBy, labelSize = labelSize ) return(g) }) .plotDimReduceCluster <- function(dim1, dim2, cluster, size, xlab, ylab, specificClusters, labelClusters, groupBy, labelSize) { if (!is.null(groupBy)) { df <- data.frame(dim1, dim2, cluster, groupBy) colnames(df) <- c(xlab, ylab, "Cluster", "Sample") } else { df <- data.frame(dim1, dim2, cluster) colnames(df) <- c(xlab, ylab, "Cluster") } naIx <- is.na(dim1) | is.na(dim2) df <- df[!naIx, ] df[3] <- as.factor(df[[3]]) clusterColors <- distinctColors(nlevels(as.factor(cluster))) if (!is.null(specificClusters)) { clusterColors[!levels(df[[3]]) %in% specificClusters] <- "gray92" } g <- ggplot2::ggplot(df, ggplot2::aes_string(x = xlab, y = ylab)) + ggplot2::geom_point(stat = "identity", size = size, ggplot2::aes_string(color = "Cluster")) + ggplot2::theme( panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank(), panel.background = ggplot2::element_blank(), axis.line = ggplot2::element_line(color = "black") ) + ggplot2::scale_color_manual(values = clusterColors) + ggplot2::guides(color = ggplot2::guide_legend(override.aes = list(size = 1))) if (isTRUE(labelClusters)) { # centroidList <- lapply(seq(length(unique(cluster))), function(x) { centroidList <- lapply(unique(cluster), function(x) { df.sub <- df[df$Cluster == x, ] median1 <- stats::median(df.sub[, xlab]) median2 <- stats::median(df.sub[, ylab]) data.frame(median1 = median1, median2 = median2, x = x) }) centroid <- do.call(rbind, centroidList) centroid <- data.frame( Dimension_1 = as.numeric(centroid[, 1]), Dimension_2 = as.numeric(centroid[, 2]), Cluster = centroid[, 3] ) colnames(centroid)[seq(2)] <- c(xlab, ylab) g <- g + ggplot2::geom_point( data = centroid, mapping = ggplot2::aes_string(x = xlab, y = ylab), size = 0, alpha = 0 ) + ggrepel::geom_text_repel( data = centroid, mapping = ggplot2::aes_string(label = "Cluster"), size = labelSize, max.overlaps = Inf ) } if (!is.null(x = groupBy)) { g <- g + ggplot2::facet_wrap(facets = ggplot2::vars(!!ggplot2::sym(x = "Sample"))) + ggplot2::theme(strip.background = ggplot2::element_blank()) } return(g) } #' @title Feature Expression Violin Plot #' @description Outputs a violin plot for feature expression data. #' @param x Numeric matrix or a \linkS4class{SingleCellExperiment} object #' with the matrix located in the assay slot under \code{useAssay}. Rows #' represent features and columns represent cells. #' @param features Character vector. Uses these genes for plotting. #' @param displayName Character. The column name of #' \code{rowData(x)} that specifies the display names for #' the features. Default \code{NULL}, which displays the row names. Only works #' if \code{x} is a \linkS4class{SingleCellExperiment} object. #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a #' \linkS4class{SingleCellExperiment} object. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param celdaMod Celda object of class "celda_G" or "celda_CG". Used only if #' \code{x} is a matrix object. #' @param exactMatch Logical. Whether an exact match or a partial match using #' \code{grep()} is used to look up the feature in the rownames of the counts #' matrix. Default \code{TRUE}. #' @param plotDots Boolean. If \code{TRUE}, the #' expression of features will be plotted as points in addition to the violin #' curve. Default \code{TRUE}. #' @param dotSize Numeric. Size of points if \code{plotDots = TRUE}. #' Default \code{0.1}. #' @return Violin plot for each feature, grouped by celda cluster #' @export setGeneric("plotCeldaViolin", function(x, celdaMod, features, displayName = NULL, useAssay = "counts", altExpName = "featureSubset", exactMatch = TRUE, plotDots = TRUE, dotSize = 0.1) { standardGeneric("plotCeldaViolin") }) #' @rdname plotCeldaViolin #' @examples #' data(sceCeldaCG) #' plotCeldaViolin(x = sceCeldaCG, features = "Gene_1") #' @export setMethod("plotCeldaViolin", signature(x = "SingleCellExperiment"), function(x, features, displayName = NULL, useAssay = "counts", altExpName = "featureSubset", exactMatch = TRUE, plotDots = TRUE, dotSize = 0.1) { counts <- SummarizedExperiment::assay(x, i = useAssay) cluster <- as.integer(celdaClusters(x, altExpName = altExpName)) if (is.null(displayName)) { featuresIx <- retrieveFeatureIndex(features, counts, by = "rownames", exactMatch = exactMatch) rnames <- rownames(x)[featuresIx] } else { featuresIx <- retrieveFeatureIndex(features, x, by = displayName, exactMatch = exactMatch) rnames <- SummarizedExperiment::rowData(x)[featuresIx, displayName] } g <- .plotCeldaViolin( counts = counts, cluster = cluster, features = features, featuresIx = featuresIx, rnames = rnames, exactMatch = exactMatch, plotDots = plotDots, dotSize = dotSize ) return(g) }) #' @rdname plotCeldaViolin #' @examples #' data(celdaCGSim, celdaCGMod) #' plotCeldaViolin(x = celdaCGSim$counts, #' celdaMod = celdaCGMod, #' features = "Gene_1") #' @export setMethod("plotCeldaViolin", signature(x = "ANY"), function(x, celdaMod, features, exactMatch = TRUE, plotDots = TRUE, dotSize = 0.1) { x <- as.matrix(x) cluster <- celdaClusters(celdaMod)$z featuresIx <- retrieveFeatureIndex(features, x, by = "rownames", exactMatch = exactMatch) rnames <- rownames(x)[featuresIx] g <- .plotCeldaViolin( counts = x, cluster = cluster, features = features, featuresIx = featuresIx, rnames = rnames, exactMatch = exactMatch, plotDots = plotDots, dotSize = dotSize ) return(g) }) .plotCeldaViolin <- function(counts, cluster, features, featuresIx, rnames, exactMatch = TRUE, plotDots = TRUE, dotSize = 0.1) { dataFeature <- as.matrix(counts[featuresIx, , drop = FALSE]) rownames(dataFeature) <- rnames dataFeature <- as.data.frame(t(dataFeature)) df <- cbind(cluster, dataFeature) df$cluster <- as.factor(df$cluster) m <- reshape2::melt(df, id.vars = c("cluster")) colnames(m) <- c("Cluster", "Feature", "Expression") colorPal <- distinctColors(length(unique(cluster))) p <- ggplot2::ggplot(m, ggplot2::aes_string(x = "Cluster", y = "Expression", fill = "Cluster")) + ggplot2::facet_wrap(~ Feature) + ggplot2::geom_violin(trim = TRUE, scale = "width") + ggplot2::scale_fill_manual(values = colorPal) + ggplot2::theme( strip.background = ggplot2::element_blank(), panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank(), panel.spacing = grid::unit(0, "lines"), panel.background = ggplot2::element_blank(), axis.line = ggplot2::element_line(colour = "black") ) if (isTRUE(plotDots)) { p <- p + ggplot2::geom_jitter(height = 0, size = dotSize) } return(p) } .processReducedDim <- function(x, reducedDimName = NULL, dim1 = NULL, dim2 = NULL, xlab = NULL, ylab = NULL) { if (inherits(x, "SingleCellExperiment") & !is.null(reducedDimName)) { reddim <- SingleCellExperiment::reducedDim(x, reducedDimName) # Get dims to retrieve from redDim object if (is.null(dim1)) dim1 <- 1 if (is.null(dim2)) dim2 <- 2 # Get labels xlab <- colnames(reddim)[dim1] ylab <- colnames(reddim)[dim2] # Set up return object res <- list( dim1 = reddim[, dim1], dim2 = reddim[, dim2], xlab = xlab, ylab = ylab ) } else if (!is.null(dim1) & !is.null(dim2)) { if (inherits(x, c("matrix", "Matrix"))) { if (length(dim1) != ncol(x)) { stop("'dim1' needs to be the same length as 'x'.") } if (length(dim2) != ncol(x)) { stop("'dim2' needs to be the same length as 'x'.") } } else { if (length(dim1) != length(x)) { stop("'dim1' needs to be the same length as 'x'.") } if (length(dim2) != length(x)) { stop("'dim2' needs to be the same length as 'x'.") } } if (is.null(xlab)) xlab <- "Dimension 1" if (is.null(ylab)) ylab <- "Dimension 2" res <- list( dim1 = dim1, dim2 = dim2, xlab = xlab, ylab = ylab ) } else { stop( "'x' can be supplied as a SingleCelExperiment along with ", "'reducedDimName' and 'dim1' and 'dim2' can be used to specify which ", "dimensions to plot on the x- and y-axis, respectively. Alternatively, ", "'x', 'dim1' and 'dim2' can be supplied as vectors of the same length ", "where 'dim1' is the x-axis, 'dim2', is the y-axis, and 'x' will be used ", "to color the points." ) } return(res) } ================================================ FILE: R/recursiveSplit.R ================================================ .singleSplitZ <- function(counts, z, s, K, minCell = 3, alpha = 1, beta = 1) { zTa <- tabulate(z, K) zToSplit <- which(zTa > minCell) bestZ <- z bestLl <- -Inf for (i in zToSplit) { clustLabel <- .celda_C( counts[, z == i, drop = FALSE], K = 2, zInitialize = "random", splitOnIter = -1, splitOnLast = FALSE, verbose = FALSE ) if (length(unique(celdaClusters(clustLabel)$z)) == 2) { ix <- z == i newZ <- z newZ[ix] <- ifelse(celdaClusters(clustLabel)$z == 2, i, K) ll <- .logLikelihoodcelda_C(counts, s, newZ, K, alpha, beta) if (ll > bestLl) { bestZ <- newZ bestLl <- ll } } } return(list(ll = bestLl, z = bestZ)) } .singleSplitY <- function(counts, y, L, minFeature = 3, beta = 1, delta = 1, gamma = 1) { yTa <- tabulate(y, L) yToSplit <- which(yTa > minFeature) bestY <- y bestLl <- -Inf # previousY <- y for (i in yToSplit) { clustLabel <- .celda_G(counts[y == i, , drop = FALSE], L = 2, yInitialize = "random", splitOnIter = -1, splitOnLast = FALSE, nchains = 1, verbose = FALSE ) if (length(unique(celdaClusters(clustLabel)$y)) == 2) { ix <- y == i newY <- y newY[ix] <- ifelse(celdaClusters(clustLabel)$y == 2, i, L) ll <- .logLikelihoodcelda_G(counts, newY, L, beta, delta, gamma) if (ll > bestLl) { bestY <- newY bestLl <- ll } } } return(list(ll = bestLl, y = bestY)) } #' @title Recursive cell splitting #' @description Uses the \link{celda_C} model to cluster cells into #' population for range of possible K's. The cell population labels of the #' previous "K-1" model are used as the initial values in the current model #' with K cell populations. The best split of an existing cell population is #' found to create the K-th cluster. This procedure is much faster than #' randomly initializing each model with a different K. If module labels for #' each feature are given in 'yInit', the \link{celda_CG} model will be used to #' split cell populations based on those modules instead of individual #' features. Module labels will also be updated during sampling and thus #' may end up slightly different than \code{yInit}. #' @param x A numeric \link{matrix} of counts or a #' \linkS4class{SingleCellExperiment} #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. #' @param useAssay A string specifying the name of the #' \link{assay} #' slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param sampleLabel Vector or factor. Denotes the sample label for each cell #' (column) in the count matrix. #' @param initialK Integer. Initial number of cell populations to try. #' Default \code{5}. #' @param maxK Integer. Maximum number of cell populations to try. #' Default \code{25}. #' @param tempL Integer. Number of temporary modules to identify and use in cell #' splitting. Only used if \code{yInit = NULL}. Collapsing features to a #' relatively smaller number of modules will increase the speed of clustering #' and tend to produce better cell populations. This number should be larger #' than the number of true modules expected in the dataset. Default #' \code{NULL.} #' @param yInit Integer vector. Module labels for features. Cells will be #' clustered using the \link{celda_CG} model based on the modules specified in #' \code{yInit} rather than the counts of individual features. While the #' features will be initialized to the module labels in \code{yInit}, the #' labels will be allowed to move within each new model with a different K. #' @param alpha Numeric. Concentration parameter for Theta. Adds a pseudocount #' to each cell population in each sample. Default \code{1}. #' @param beta Numeric. Concentration parameter for Phi. Adds a pseudocount to #' each feature in each cell (if \code{yInit} is NULL) or to each module in #' each cell population (if \code{yInit} is set). Default \code{1}. #' @param delta Numeric. Concentration parameter for Psi. Adds a pseudocount #' to each feature in each module. Only used if \code{yInit} is set. Default 1. #' @param gamma Numeric. Concentration parameter for Eta. Adds a pseudocount #' to the number of features in each module. Only used if \code{yInit} is set. #' Default 1. #' @param minCell Integer. Only attempt to split cell populations with at #' least this many cells. #' @param reorder Logical. Whether to reorder cell populations using #' hierarchical clustering after each model has been created. If FALSE, cell #' populations numbers will correspond to the split which created the cell #' populations (i.e. 'K15' was created at split 15, 'K16' was created at split #' 16, etc.). Default TRUE. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @param perplexity Logical. Whether to calculate perplexity for each model. #' If FALSE, then perplexity can be calculated later with #' \link{resamplePerplexity}. Default TRUE. #' @param doResampling Boolean. If \code{TRUE}, then each cell in the counts #' matrix will be resampled according to a multinomial distribution to introduce #' noise before calculating perplexity. Default \code{FALSE}. #' @param numResample Integer. The number of times to resample the counts matrix #' for evaluating perplexity if \code{doResampling} is set to \code{TRUE}. #' Default \code{5}. #' @param verbose Logical. Whether to print log messages. Default TRUE. #' @param logfile Character. Messages will be redirected to a file named #' "logfile". If NULL, messages will be printed to stdout. Default NULL. #' @return A \linkS4class{SingleCellExperiment} object. Function #' parameter settings and celda model results are stored in the #' \link{metadata} \code{"celda_grid_search"} slot. The models in #' the list will be of class \code{celda_C} if \code{yInit = NULL} or #' \code{celda_CG} if \code{zInit} is set. #' @seealso \link{recursiveSplitModule} for recursive splitting of feature #' modules. #' @export setGeneric("recursiveSplitCell", function(x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, initialK = 5, maxK = 25, tempL = NULL, yInit = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minCell = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, logfile = NULL, verbose = TRUE) { standardGeneric("recursiveSplitCell")}) #' @rdname recursiveSplitCell #' @examples #' data(sceCeldaCG) #' ## Create models that range from K = 3 to K = 7 by recursively splitting #' ## cell populations into two to produce \link{celda_C} cell clustering models #' sce <- recursiveSplitCell(sceCeldaCG, initialK = 3, maxK = 7) #' #' ## Alternatively, first identify features modules using #' ## \link{recursiveSplitModule} #' moduleSplit <- recursiveSplitModule(sceCeldaCG, initialL = 3, maxL = 15) #' plotGridSearchPerplexity(moduleSplit) #' moduleSplitSelect <- subsetCeldaList(moduleSplit, list(L = 10)) #' #' ## Then use module labels for initialization in \link{recursiveSplitCell} to #' ## produce \link{celda_CG} bi-clustering models #' cellSplit <- recursiveSplitCell(sceCeldaCG, #' initialK = 3, maxK = 7, yInit = celdaModules(moduleSplitSelect)) #' plotGridSearchPerplexity(cellSplit) #' sce <- subsetCeldaList(cellSplit, list(K = 5, L = 10)) #' @export setMethod("recursiveSplitCell", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, initialK = 5, maxK = 25, tempL = NULL, yInit = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minCell = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, logfile = NULL, verbose = TRUE) { xClass <- "SingleCellExperiment" if (!altExpName %in% SingleCellExperiment::altExpNames(x)) { stop(altExpName, " not in 'altExpNames(x)'. Run ", "selectFeatures(x) first!") } altExp <- SingleCellExperiment::altExp(x, altExpName) if (!useAssay %in% SummarizedExperiment::assayNames(altExp)) { stop(useAssay, " not in assayNames(altExp(x, altExpName))") } counts <- SummarizedExperiment::assay(altExp, i = useAssay) if (!is.null(yInit)) { model <- "celda_CG" } else { model <- "celda_C" } celdaList <- .recursiveSplitCellWithSeed(counts = counts, sampleLabel = sampleLabel, initialK = initialK, maxK = maxK, tempL = tempL, yInit = yInit, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minCell = minCell, reorder = reorder, seed = seed, perplexity = perplexity, doResampling = doResampling, numResample = numResample, logfile = logfile, verbose = verbose) altExp <- .createSCERecursiveSplitCell(celdaList = celdaList, sce = altExp, xClass = xClass, useAssay = useAssay, model = model, sampleLabel = sampleLabel, initialK = initialK, maxK = maxK, tempL = tempL, yInit = yInit, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minCell = minCell, reorder = reorder, seed = seed, perplexity = perplexity, logfile = logfile, verbose = verbose) SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) } ) #' @rdname recursiveSplitCell #' @examples #' data(celdaCGSim, celdaCSim) #' ## Create models that range from K = 3 to K = 7 by recursively splitting #' ## cell populations into two to produce \link{celda_C} cell clustering models #' sce <- recursiveSplitCell(celdaCSim$counts, initialK = 3, maxK = 7) #' #' ## Alternatively, first identify features modules using #' ## \link{recursiveSplitModule} #' moduleSplit <- recursiveSplitModule(celdaCGSim$counts, #' initialL = 3, maxL = 15) #' plotGridSearchPerplexity(moduleSplit) #' moduleSplitSelect <- subsetCeldaList(moduleSplit, list(L = 10)) #' #' ## Then use module labels for initialization in \link{recursiveSplitCell} to #' ## produce \link{celda_CG} bi-clustering models #' cellSplit <- recursiveSplitCell(celdaCGSim$counts, #' initialK = 3, maxK = 7, yInit = celdaModules(moduleSplitSelect)) #' plotGridSearchPerplexity(cellSplit) #' sce <- subsetCeldaList(cellSplit, list(K = 5, L = 10)) #' @export setMethod("recursiveSplitCell", signature(x = "matrix"), function(x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, initialK = 5, maxK = 25, tempL = NULL, yInit = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minCell = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, logfile = NULL, verbose = TRUE) { ls <- list() ls[[useAssay]] <- x sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) SingleCellExperiment::altExp(sce, altExpName) <- sce xClass <- "matrix" if (!is.null(yInit)) { model <- "celda_CG" } else { model <- "celda_C" } celdaList <- .recursiveSplitCellWithSeed(counts = x, sampleLabel = sampleLabel, initialK = initialK, maxK = maxK, tempL = tempL, yInit = yInit, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minCell = minCell, reorder = reorder, seed = seed, perplexity = perplexity, doResampling = doResampling, numResample = numResample, logfile = logfile, verbose = verbose) altExp <- .createSCERecursiveSplitCell(celdaList = celdaList, sce = SingleCellExperiment::altExp(sce, altExpName), xClass = xClass, useAssay = useAssay, model = model, sampleLabel = sampleLabel, initialK = initialK, maxK = maxK, tempL = tempL, yInit = yInit, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minCell = minCell, reorder = reorder, seed = seed, perplexity = perplexity, logfile = logfile, verbose = verbose) SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } ) .recursiveSplitCellWithSeed <- function(counts, sampleLabel, initialK, maxK, tempL, yInit, alpha, beta, delta, gamma, minCell, reorder, seed, perplexity, doResampling, numResample, logfile, verbose) { if (is.null(seed)) { celdaList <- .recursiveSplitCell(counts = counts, sampleLabel = sampleLabel, initialK = initialK, maxK = maxK, tempL = tempL, yInit = yInit, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minCell = minCell, reorder = reorder, perplexity = perplexity, doResampling = doResampling, numResample = numResample, logfile = logfile, verbose = verbose) } else { with_seed( seed, celdaList <- .recursiveSplitCell(counts = counts, sampleLabel = sampleLabel, initialK = initialK, maxK = maxK, tempL = tempL, yInit = yInit, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minCell = minCell, reorder = reorder, perplexity = perplexity, doResampling = doResampling, numResample = numResample, logfile = logfile, verbose = verbose) ) } return(celdaList) } .recursiveSplitCell <- function(counts, sampleLabel, initialK, maxK, tempL, yInit, alpha, beta, delta, gamma, minCell, reorder, perplexity, doResampling, numResample, logfile, verbose) { .logMessages(paste(rep("=", 50), collapse = ""), logfile = logfile, append = FALSE, verbose = verbose ) .logMessages("Starting recursive cell population splitting.", logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(paste(rep("=", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) .validateCounts(counts) startTime <- Sys.time() counts <- .processCounts(counts) countChecksum <- .createCountChecksum(counts) sampleLabel <- .processSampleLabels(sampleLabel, numCells = ncol(counts)) s <- as.integer(sampleLabel) names <- list( row = rownames(counts), column = colnames(counts), sample = levels(sampleLabel) ) if (!is.null(yInit)) { # Create collapsed module matrix L <- length(unique(yInit)) .logMessages(date(), ".. Collapsing to", L, "modules", append = TRUE, verbose = verbose, logfile = logfile ) overallY <- .initializeCluster(L, nrow(counts), initial = yInit) countsY <- .rowSumByGroup(counts, overallY, L) # Create initial model with initialK and predifined y labels .logMessages(date(), ".. Initializing with", initialK, "populations", append = TRUE, verbose = verbose, logfile = logfile) modelInitial <- .celda_CG(counts, sampleLabel = sampleLabel, K = as.integer(initialK), L = as.integer(L), zInitialize = "split", yInitialize = "predefined", nchains = 1, yInit = overallY, alpha = alpha, beta = beta, gamma = gamma, delta = delta, verbose = FALSE, reorder = reorder) currentK <- length(unique(celdaClusters(modelInitial)$z)) + 1 overallZ <- as.integer(celdaClusters(modelInitial)$z) resList <- list(modelInitial) while (currentK <= maxK) { # previousY <- overallY tempSplit <- .singleSplitZ(countsY, overallZ, s, currentK, minCell = 3, alpha = alpha, beta = beta ) tempModel <- .celda_CG(counts, sampleLabel = sampleLabel, K = as.integer(currentK), L = as.integer(L), yInit = overallY, zInit = tempSplit$z, nchains = 1, zInitialize = "predefined", yInitialize = "predefined", splitOnLast = FALSE, stopIter = 5, alpha = alpha, beta = beta, gamma = gamma, delta = delta, verbose = FALSE, reorder = reorder ) # Calculate new decomposed counts matrix with new module labels # overallY = clusters(tempModel)$y # p = .cGReDecomposeCounts(counts, overallY, previousY, countsY, # nByG, L = as.integer(L)) # countsY = p$nTSByC # If the number of clusters is still "currentK", then keep the # reordering, otherwise keep the previous configuration if (length(unique(celdaClusters(tempModel)$z)) == currentK) { overallZ <- as.integer(celdaClusters(tempModel)$z) } else { overallZ <- tempSplit$z ll <- .logLikelihoodcelda_CG( counts, s, overallZ, as.integer(celdaClusters(tempModel)$y), currentK, L, alpha, beta, delta, gamma ) tempModel <- methods::new("celda_CG", clusters = list(z = overallZ, y = as.integer(celdaClusters(tempModel)$y)), params = list( K = as.integer(currentK), L = as.integer(L), alpha = alpha, beta = beta, delta = delta, gamma = gamma, countChecksum = countChecksum ), finalLogLik = ll, sampleLabel = sampleLabel, names = names ) } resList <- c(resList, list(tempModel)) .logMessages(date(), ".. Current cell population", currentK, "| logLik:", bestLogLikelihood(tempModel), append = TRUE, verbose = verbose, logfile = logfile ) currentK <- length(unique(overallZ)) + 1 } runK <- vapply(resList, function(mod) { params(mod)$K }, integer(1)) runL <- vapply(resList, function(mod) { params(mod)$L }, integer(1)) runParams <- data.frame( index = seq.int(1, length(resList)), L = as.integer(runL), K = as.integer(runK), stringsAsFactors = FALSE ) } else if (!is.null(tempL)) { L <- tempL .logMessages(date(), ".. Collapsing to", L, "temporary modules", append = TRUE, verbose = verbose, logfile = logfile ) tempY <- .initializeSplitY(counts, L = as.integer(L), tempK = max(100, maxK), minFeature = 3 ) tempY <- as.integer(as.factor(tempY)) L <- length(unique(tempY)) # Recalculate in case some modules are empty countsY <- .rowSumByGroup(counts, tempY, L) # Create initial model with initialK .logMessages(date(), ".. Initializing with", initialK, "populations", append = TRUE, verbose = verbose, logfile = logfile ) modelInitial <- .celda_C(countsY, sampleLabel = sampleLabel, K = as.integer(initialK), zInitialize = "split", nchains = 1, alpha = alpha, beta = beta, verbose = FALSE, reorder = reorder ) currentK <- length(unique(celdaClusters(modelInitial)$z)) + 1 overallZ <- as.integer(celdaClusters(modelInitial)$z) ll <- .logLikelihoodcelda_C( counts, s, overallZ, currentK, alpha, beta ) modelInitial@params$countChecksum <- countChecksum modelInitial@completeLogLik <- ll modelInitial@finalLogLik <- ll resList <- list(modelInitial) while (currentK <= maxK) { # Find next best split, then do a new celda_C run with that split tempSplit <- .singleSplitZ(countsY, overallZ, s, currentK, minCell = 3, alpha = alpha, beta = beta ) tempModel <- .celda_C(countsY, sampleLabel = sampleLabel, K = as.integer(currentK), nchains = 1, zInitialize = "random", alpha = alpha, beta = beta, stopIter = 5, splitOnLast = FALSE, verbose = FALSE, zInit = tempSplit$z, reorder = reorder ) # Handle rare cases where a population has no cells after running # the model if (length(unique(celdaClusters(tempModel)$z)) == currentK) { overallZ <- as.integer(celdaClusters(tempModel)$z) } else { overallZ <- tempSplit$z } # Need to change below line to use decompose counts to save time ll <- .logLikelihoodcelda_C( counts, s, overallZ, currentK, alpha, beta ) tempModel <- methods::new("celda_C", clusters = list(z = overallZ), params = list( K = as.integer(currentK), alpha = alpha, beta = beta, countChecksum = countChecksum ), finalLogLik = ll, sampleLabel = sampleLabel, names = names ) resList <- c(resList, list(tempModel)) .logMessages(date(), ".. Current cell population", currentK, "| logLik:", bestLogLikelihood(tempModel), append = TRUE, verbose = verbose, logfile = logfile ) currentK <- length(unique(overallZ)) + 1 } runK <- vapply(resList, function(mod) { params(mod)$K }, integer(1)) runParams <- data.frame( index = seq.int(1, length(resList)), K = as.integer(runK), stringsAsFactors = FALSE ) } else { # Create initial model with initialK .logMessages(date(), ".. Initializing with", initialK, "populations", append = TRUE, verbose = verbose, logfile = logfile ) modelInitial <- .celda_C(counts, sampleLabel = sampleLabel, K = as.integer(initialK), zInitialize = "split", nchains = 1, alpha = alpha, beta = beta, verbose = FALSE, reorder = reorder ) currentK <- length(unique(celdaClusters(modelInitial)$z)) + 1 overallZ <- as.integer(celdaClusters(modelInitial)$z) resList <- list(modelInitial) while (currentK <= maxK) { tempSplit <- .singleSplitZ(counts, overallZ, s, currentK, minCell = 3, alpha = alpha, beta = beta ) tempModel <- .celda_C(counts, sampleLabel = sampleLabel, K = as.integer(currentK), nchains = 1, zInitialize = "random", alpha = alpha, beta = beta, stopIter = 5, splitOnLast = FALSE, verbose = FALSE, zInit = tempSplit$z, reorder = reorder ) if (length(unique(celdaClusters(tempModel)$z)) == currentK) { overallZ <- as.integer(celdaClusters(tempModel)$z) } else { overallZ <- tempSplit$z ll <- .logLikelihoodcelda_C( counts, s, overallZ, currentK, alpha, beta ) tempModel <- methods::new("celda_C", clusters = list(z = overallZ), params = list( K = as.integer(currentK), alpha = alpha, beta = beta, countChecksum = countChecksum ), finalLogLik = ll, sampleLabel = sampleLabel, names = names ) } resList <- c(resList, list(tempModel)) .logMessages(date(), ".. Current cell population", currentK, "| logLik:", bestLogLikelihood(tempModel), append = TRUE, verbose = verbose, logfile = logfile ) currentK <- length(unique(overallZ)) + 1 } runK <- vapply(resList, function(mod) { params(mod)$K }, integer(1)) runParams <- data.frame( index = seq.int(1, length(resList)), K = as.integer(runK), stringsAsFactors = FALSE ) } # Summarize paramters of different models logliks <- vapply(resList, function(mod) { bestLogLikelihood(mod) }, double(1)) runParams <- data.frame(runParams, log_likelihood = logliks, stringsAsFactors = FALSE ) celdaRes <- methods::new("celdaList", runParams = runParams, resList = resList, countChecksum = countChecksum ) if (isTRUE(perplexity)) { .logMessages(date(), ".. Calculating perplexity", append = TRUE, verbose = verbose, logfile = NULL ) celdaRes <- resamplePerplexity(counts, celdaRes, doResampling = doResampling, numResample = numResample) } endTime <- Sys.time() .logMessages( paste(rep("=", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages( "Completed recursive cell population splitting. Total time:", format(difftime(endTime, startTime)), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages( paste(rep("=", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) return(celdaRes) } #' @title Recursive module splitting #' @description Uses the \link{celda_G} model to cluster features into modules #' for a range of possible L's. The module labels of the previous "L-1" model #' are used as the initial values in the current model with L modules. The best #' split of an existing module is found to create the L-th module. This #' procedure is much faster than randomly initializing each model with a #' different L. #' @param x A numeric \link{matrix} of counts or a #' \linkS4class{SingleCellExperiment} #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a #' \link[SingleCellExperiment]{SingleCellExperiment} object. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @param initialL Integer. Initial number of modules. #' @param maxL Integer. Maximum number of modules. #' @param tempK Integer. Number of temporary cell populations to identify and #' use in module splitting. Only used if \code{zInit = NULL} Collapsing cells #' to a relatively smaller number of cell popluations will increase the #' speed of module clustering and tend to produce better modules. This number #' should be larger than the number of true cell populations expected in the #' dataset. Default \code{100}. #' @param zInit Integer vector. Collapse cells to cell populations based on #' labels in \code{zInit} and then perform module splitting. If NULL, no #' collapsing will be performed unless \code{tempK} is specified. #' Default \code{NULL}. #' @param sampleLabel Vector or factor. Denotes the sample label for each cell #' (column) in the count matrix. Default \code{NULL}. #' @param alpha Numeric. Concentration parameter for Theta. Adds a pseudocount #' to each cell population in each sample. Only used if \code{zInit} is set. #' Default \code{1}. #' @param beta Numeric. Concentration parameter for Phi. Adds a pseudocount #' to each feature module in each cell. Default 1. #' @param delta Numeric. Concentration parameter for Psi. Adds a pseudocount #' to each feature in each module. Default 1. #' @param gamma Numeric. Concentration parameter for Eta. Adds a pseudocount #' to the number of features in each module. Default 1. #' @param minFeature Integer. Only attempt to split modules with at least this #' many features. #' @param reorder Logical. Whether to reorder modules using hierarchical #' clustering after each model has been created. If FALSE, module numbers will #' correspond to the split which created the module (i.e. 'L15' was created at #' split 15, 'L16' was created at split 16, etc.). Default TRUE. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @param perplexity Logical. Whether to calculate perplexity for each model. #' If FALSE, then perplexity can be calculated later with #' \link{resamplePerplexity}. Default \code{TRUE}. #' @param doResampling Boolean. If \code{TRUE}, then each cell in the counts #' matrix will be resampled according to a multinomial distribution to introduce #' noise before calculating perplexity. Default \code{FALSE}. #' @param numResample Integer. The number of times to resample the counts matrix #' for evaluating perplexity if \code{doResampling} is set to \code{TRUE}. #' Default \code{5}. #' @param verbose Logical. Whether to print log messages. Default TRUE. #' @param logfile Character. Messages will be redirected to a file named #' "logfile". If NULL, messages will be printed to stdout. Default NULL. #' @return A \linkS4class{SingleCellExperiment} object. Function #' parameter settings and celda model results are stored in the #' \link{metadata} \code{"celda_grid_search"} slot. The models in #' the list will be of class \link{celda_G} if \code{zInit = NULL} or #' \link{celda_CG} if \code{zInit} is set. #' @seealso \code{recursiveSplitCell} for recursive splitting of cell #' populations. #' @export setGeneric("recursiveSplitModule", function(x, useAssay = "counts", altExpName = "featureSubset", initialL = 10, maxL = 100, tempK = 100, zInit = NULL, sampleLabel = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minFeature = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, verbose = TRUE, logfile = NULL) { standardGeneric("recursiveSplitModule")}) #' @rdname recursiveSplitModule #' @examples #' data(sceCeldaCG) #' ## Create models that range from L=3 to L=20 by recursively splitting modules #' ## into two #' moduleSplit <- recursiveSplitModule(sceCeldaCG, initialL = 3, maxL = 20) #' #' ## Example results with perplexity #' plotGridSearchPerplexity(moduleSplit) #' #' ## Select model for downstream analysis #' celdaMod <- subsetCeldaList(moduleSplit, list(L = 10)) #' @export setMethod("recursiveSplitModule", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", initialL = 10, maxL = 100, tempK = 100, zInit = NULL, sampleLabel = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minFeature = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, verbose = TRUE, logfile = NULL) { xClass <- "SingleCellExperiment" if (!altExpName %in% SingleCellExperiment::altExpNames(x)) { stop(altExpName, " not in 'altExpNames(x)'. Run ", "selectFeatures(x) first!") } altExp <- SingleCellExperiment::altExp(x, altExpName) if (!useAssay %in% SummarizedExperiment::assayNames(altExp)) { stop(useAssay, " not in assayNames(altExp(x, altExpName))") } counts <- SummarizedExperiment::assay(altExp, i = useAssay) if (!is.null(zInit)) { model <- "celda_CG" } else { model <- "celda_G" } celdaList <- .recursiveSplitModuleWithSeed(counts = counts, initialL = initialL, maxL = maxL, tempK = tempK, zInit = zInit, sampleLabel = sampleLabel, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minFeature = minFeature, reorder = reorder, seed = seed, perplexity = perplexity, doResampling = doResampling, numResample = numResample, verbose = verbose, logfile = logfile) altExp <- .createSCERecursiveSplitModule(celdaList = celdaList, sce = altExp, xClass = xClass, useAssay = useAssay, model = model, initialL = initialL, maxL = maxL, tempK = tempK, zInit = zInit, sampleLabel = sampleLabel, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minFeature = minFeature, reorder = reorder, seed = seed, perplexity = perplexity, verbose = verbose, logfile = logfile) SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) } ) #' @rdname recursiveSplitModule #' @examples #' data(celdaCGSim) #' ## Create models that range from L=3 to L=20 by recursively splitting modules #' ## into two #' moduleSplit <- recursiveSplitModule(celdaCGSim$counts, #' initialL = 3, maxL = 20) #' #' ## Example results with perplexity #' plotGridSearchPerplexity(moduleSplit) #' #' ## Select model for downstream analysis #' celdaMod <- subsetCeldaList(moduleSplit, list(L = 10)) #' @export setMethod("recursiveSplitModule", signature(x = "matrix"), function(x, useAssay = "counts", altExpName = "featureSubset", initialL = 10, maxL = 100, tempK = 100, zInit = NULL, sampleLabel = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minFeature = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, verbose = TRUE, logfile = NULL) { ls <- list() ls[[useAssay]] <- x sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) SingleCellExperiment::altExp(sce, altExpName) <- sce xClass <- "matrix" if (!is.null(zInit)) { model <- "celda_CG" } else { model <- "celda_G" } celdaList <- .recursiveSplitModuleWithSeed(counts = x, initialL = initialL, maxL = maxL, tempK = tempK, zInit = zInit, sampleLabel = sampleLabel, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minFeature = minFeature, reorder = reorder, seed = seed, perplexity = perplexity, doResampling = doResampling, numResample = numResample, verbose = verbose, logfile = logfile) altExp <- .createSCERecursiveSplitModule(celdaList = celdaList, sce = SingleCellExperiment::altExp(sce, altExpName), xClass = xClass, useAssay = useAssay, model = model, initialL = initialL, maxL = maxL, tempK = tempK, zInit = zInit, sampleLabel = sampleLabel, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minFeature = minFeature, reorder = reorder, seed = seed, perplexity = perplexity, verbose = verbose, logfile = logfile) SingleCellExperiment::altExp(sce, altExpName) <- altExp return(sce) } ) .recursiveSplitModuleWithSeed <- function(counts, initialL, maxL, tempK, zInit, sampleLabel, alpha, beta, delta, gamma, minFeature, reorder, seed, perplexity, doResampling, numResample, verbose, logfile) { if (is.null(seed)) { celdaList <- .recursiveSplitModule( counts = counts, initialL = initialL, maxL = maxL, tempK = tempK, zInit = zInit, sampleLabel = sampleLabel, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minFeature = minFeature, reorder = reorder, perplexity = perplexity, verbose = verbose, logfile = logfile, doResampling = doResampling, numResample = numResample) } else { with_seed(seed, celdaList <- .recursiveSplitModule( counts = counts, initialL = initialL, maxL = maxL, tempK = tempK, zInit = zInit, sampleLabel = sampleLabel, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minFeature = minFeature, reorder = reorder, perplexity = perplexity, verbose = verbose, logfile = logfile, doResampling = doResampling, numResample = numResample) ) } return(celdaList) } .recursiveSplitModule <- function(counts, initialL = 10, maxL = 100, tempK = 100, zInit = NULL, sampleLabel = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minFeature = 3, reorder = TRUE, perplexity = TRUE, verbose = TRUE, logfile = NULL, doResampling = FALSE, numResample = 5) { .logMessages(paste(rep("=", 50), collapse = ""), logfile = logfile, append = FALSE, verbose = verbose ) .logMessages("Starting recursive module splitting.", logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(paste(rep("=", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) startTime <- Sys.time() .validateCounts(counts) counts <- .processCounts(counts) countChecksum <- .createCountChecksum(counts) names <- list(row = rownames(counts), column = colnames(counts)) sampleLabel <- .processSampleLabels(sampleLabel, numCells = ncol(counts) ) s <- as.integer(sampleLabel) if (!is.null(zInit)) { # Create collapsed module matrix K <- length(unique(zInit)) .logMessages( date(), ".. Collapsing to", K, "cell populations", append = TRUE, verbose = verbose, logfile = logfile ) overallZ <- .initializeCluster( N = K, len = ncol(counts), initial = zInit ) countsZ <- .colSumByGroup(counts, overallZ, K) # Create initial model with initialL and predifined z labels .logMessages(date(), ".. Initializing with", initialL, "modules", append = TRUE, verbose = verbose, logfile = logfile ) modelInitial <- .celda_CG(counts, sampleLabel = sampleLabel, L = initialL, K = K, zInitialize = "predefined", yInitialize = "split", nchains = 1, zInit = overallZ, alpha = alpha, beta = beta, gamma = gamma, delta = delta, verbose = FALSE, reorder = reorder) currentL <- length(unique(celdaClusters(modelInitial)$y)) + 1 overallY <- as.integer(celdaClusters(modelInitial)$y) resList <- list(modelInitial) while (currentL <= maxL) { # Allow features to cluster further with celda_CG tempSplit <- .singleSplitY( countsZ, overallY, currentL, minFeature = 3, beta = beta, delta = delta, gamma = gamma ) tempModel <- .celda_CG( counts, L = currentL, K = K, stopIter = 5, splitOnIter = -1, splitOnLast = FALSE, nchains = 1, verbose = FALSE, yInitialize = "predefined", zInitialize = "predefined", yInit = tempSplit$y, zInit = overallZ, reorder = reorder ) overallY <- as.integer(celdaClusters(tempModel)$y) ## Add new model to results list and increment L .logMessages( date(), ".. Created module", currentL, "| logLik:", bestLogLikelihood(tempModel), append = TRUE, verbose = verbose, logfile = NULL ) resList <- c(resList, list(tempModel)) currentL <- currentL + 1 } runK <- vapply(resList, function(mod) { params(mod)$K }, integer(1)) runL <- vapply(resList, function(mod) { params(mod)$L }, integer(1)) runParams <- data.frame( index = seq.int(1, length(resList)), L = runL, K = runK, stringsAsFactors = FALSE ) } else if (!is.null(tempK)) { K <- tempK .logMessages( date(), ".. Collapsing to", K, "temporary cell populations", append = TRUE, verbose = verbose, logfile = logfile ) z <- .initializeSplitZ(counts, K = K, minCell = 3 ) countsZ <- .colSumByGroup(counts, z, length(unique(z))) .logMessages( date(), ".. Initializing with", initialL, "modules", append = TRUE, verbose = verbose, logfile = logfile ) modelInitial <- .celda_G( countsZ, L = initialL, nchains = 1, verbose = FALSE ) modelInitial@params$countChecksum <- countChecksum currentL <- length(unique(as.integer(celdaClusters(modelInitial)$y))) + 1 overallY <- as.integer(celdaClusters(modelInitial)$y) ## Decomposed counts for full count matrix p <- .cGDecomposeCounts(counts, overallY, currentL) nTSByC <- p$nTSByC nByTS <- p$nByTS nGByTS <- p$nGByTS nByG <- p$nByG nG <- p$nG nM <- p$nM resList <- list(modelInitial) while (currentL <= maxL) { # Allow features to cluster further previousY <- overallY tempSplit <- .singleSplitY( countsZ, overallY, currentL, minFeature = 3, beta = beta, delta = delta, gamma = gamma ) tempModel <- .celda_G( countsZ, L = currentL, stopIter = 5, splitOnIter = -1, splitOnLast = FALSE, nchains = 1, verbose = FALSE, yInitialize = "predefined", yInit = tempSplit$y, reorder = reorder ) overallY <- as.integer(celdaClusters(tempModel)$y) # Adjust decomposed count matrices p <- .cGReDecomposeCounts(counts, overallY, previousY, nTSByC, nByG, L = currentL ) nTSByC <- p$nTSByC nByTS <- p$nByTS nGByTS <- p$nGByTS previousY <- overallY ## Create the final model object with correct info on full counts ## matrix tempModel@finalLogLik <- .cGCalcLL( nTSByC = nTSByC, nByTS = nByTS, nByG = nByG, nGByTS = nGByTS, nM = nM, nG = nG, L = currentL, beta = beta, delta = delta, gamma = gamma ) tempModel@completeLogLik <- bestLogLikelihood(tempModel) tempModel@params$countChecksum <- countChecksum tempModel@names <- names ## Add extra row/column for next round of L nTSByC <- rbind(nTSByC, rep(0L, ncol(nTSByC))) nByTS <- c(nByTS, 0L) nGByTS <- c(nGByTS, 0L) ## Add new model to results list and increment L .logMessages( date(), ".. Created module", currentL, "| logLik:", bestLogLikelihood(tempModel), append = TRUE, verbose = verbose, logfile = NULL ) resList <- c(resList, list(tempModel)) currentL <- currentL + 1 } runL <- vapply(resList, function(mod) { params(mod)$L }, integer(1)) runParams <- data.frame( index = seq.int(1, length(resList)), L = runL, stringsAsFactors = FALSE ) } else { .logMessages( date(), ".. Initializing with", initialL, "modules", append = TRUE, verbose = verbose, logfile = logfile ) modelInitial <- .celda_G( counts, L = initialL, maxIter = 20, nchains = 1, verbose = FALSE ) overallY <- as.integer(celdaClusters(modelInitial)$y) currentL <- length(unique(overallY)) + 1 ## Perform splitting for y labels resList <- list(modelInitial) while (currentL <= maxL) { # Allow features to cluster further previousY <- overallY tempSplit <- .singleSplitY( counts, overallY, currentL, minFeature = 3, beta = beta, delta = delta, gamma = gamma ) tempModel <- .celda_G( counts, L = currentL, stopIter = 5, splitOnIter = -1, splitOnLast = FALSE, nchains = 1, verbose = FALSE, yInitialize = "predefined", yInit = tempSplit$y, reorder = reorder ) overallY <- as.integer(celdaClusters(tempModel)$y) ## Add new model to results list and increment L .logMessages( date(), ".. Created module", currentL, "| logLik:", bestLogLikelihood(tempModel), append = TRUE, verbose = verbose, logfile = NULL ) resList <- c(resList, list(tempModel)) currentL <- currentL + 1 } runL <- vapply(resList, function(mod) { params(mod)$L }, integer(1)) runParams <- data.frame( index = seq.int(1, length(resList)), L = runL, stringsAsFactors = FALSE ) } ## Summarize parameters of different models logliks <- vapply(resList, function(mod) { bestLogLikelihood(mod) }, double(1)) runParams <- data.frame(runParams, log_likelihood = logliks, stringsAsFactors = FALSE ) celdaRes <- methods::new( "celdaList", runParams = runParams, resList = resList, countChecksum = countChecksum ) if (isTRUE(perplexity)) { .logMessages( date(), ".. Calculating perplexity", append = TRUE, verbose = verbose, logfile = NULL ) celdaRes <- resamplePerplexity(counts, celdaRes, doResampling = doResampling, numResample = numResample) } endTime <- Sys.time() .logMessages(paste(rep("=", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages("Completed recursive module splitting. Total time:", format(difftime(endTime, startTime)), logfile = logfile, append = TRUE, verbose = verbose ) .logMessages(paste(rep("=", 50), collapse = ""), logfile = logfile, append = TRUE, verbose = verbose ) return(celdaRes) } .createSCERecursiveSplitCell <- function(celdaList, sce, xClass, useAssay, model, sampleLabel, initialK, maxK, tempL, yInit, alpha, beta, delta, gamma, minCell, reorder, seed, perplexity, logfile, verbose) { S4Vectors::metadata(sce)[["celda_grid_search"]] <- celdaList S4Vectors::metadata(sce)$celda_grid_search@celdaGridSearchParameters <- list(xClass = xClass, useAssay = useAssay, model = model, sampleLabel = sampleLabel, initialK = initialK, maxK = maxK, tempL = tempL, yInit = yInit, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minCell = minCell, reorder = reorder, seed = seed, perplexity = perplexity, logfile = logfile, verbose = verbose) SummarizedExperiment::colData(sce)$"celda_sample_label" <- sampleLabel return(sce) } .createSCERecursiveSplitModule <- function(celdaList, sce, xClass, useAssay, model, initialL, maxL, tempK, zInit, sampleLabel, alpha, beta, delta, gamma, minFeature, reorder, seed, perplexity, verbose, logfile) { S4Vectors::metadata(sce)[["celda_grid_search"]] <- celdaList S4Vectors::metadata(sce)$celda_grid_search@celdaGridSearchParameters <- list(xClass = xClass, useAssay = useAssay, model = model, sampleLabel = sampleLabel, initialL = initialL, maxL = maxL, tempK = tempK, zInit = zInit, alpha = alpha, beta = beta, delta = delta, gamma = gamma, minFeature = minFeature, reorder = reorder, seed = seed, perplexity = perplexity, logfile = logfile, verbose = verbose) SummarizedExperiment::colData(sce)$"celda_sample_label" <- sampleLabel return(sce) } ================================================ FILE: R/reorderCelda.R ================================================ #' @title Reorder cells populations and/or features modules using #' hierarchical clustering #' @description Apply hierarchical clustering to reorder the cell populations #' and/or feature modules and group similar ones together based #' on the cosine distance of the factorized matrix #' from \link{factorizeMatrix}. #' @param x Can be one of #' \itemize{ #' \item A \linkS4class{SingleCellExperiment} object returned by #' \link{celda_C}, \link{celda_G} or \link{celda_CG}, with the matrix #' located in the \code{useAssay} assay slot in \code{altExp(x, altExpName)}. #' Rows represent features and columns represent cells. #' \item Integer count matrix. Rows represent features and columns represent #' cells. This matrix should be the same as the one used to generate #' \code{celdaMod}.} #' @param useAssay A string specifying which \link{assay} #' slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. #' Default "counts". #' @param altExpName The name for the \link{altExp} slot. #' Default "featureSubset". #' @param method Passed to \link{hclust}. The agglomeration method #' to be used to be used. Default "complete". #' @param celdaMod Celda model object. Only works if \code{x} is an integer #' counts matrix. Ignored if \code{x} is a #' \linkS4class{SingleCellExperiment} object. #' @return A \linkS4class{SingleCellExperiment} object (or Celda model object) #' with updated cell cluster and/or feature module labels. #' @export setGeneric("reorderCelda", function(x, celdaMod, useAssay = "counts", altExpName = "featureSubset", method = "complete") { standardGeneric("reorderCelda")}) #' @examples #' data(sceCeldaCG) #' reordersce <- reorderCelda(sceCeldaCG) #' @rdname reorderCelda #' @export setMethod("reorderCelda", signature(x = "SingleCellExperiment"), function(x, useAssay = "counts", altExpName = "featureSubset", method = "complete") { altExp <- SingleCellExperiment::altExp(x, e = altExpName) if (!useAssay %in% SummarizedExperiment::assayNames(altExp)) { stop(useAssay, " not in assayNames(altExp(x, altExpName))") } if (celdaModel(x, altExpName = altExpName) == "celda_C") { sce <- .reorderCeldaCsce(sce = x, useAssay = useAssay, altExpName = altExpName, method = method) } else if (celdaModel(x, altExpName = altExpName) == "celda_CG") { sce <- .reorderCeldaCGsce(sce = x, useAssay = useAssay, altExpName = altExpName, method = method) } else if (celdaModel(x, altExpName = altExpName) == "celda_G") { sce <- .reorderCeldaGsce(sce = x, useAssay = useAssay, altExpName = altExpName, method = method) } else { stop("S4Vectors::metadata(altExp(x, altExpName))$", "celda_parameters$model must be", " one of 'celda_C', 'celda_G', or 'celda_CG'") } return(sce) }) #' @examples #' data(celdaCGSim, celdaCGMod) #' reorderCeldaCG <- reorderCelda(celdaCGSim$counts, celdaCGMod) #' @rdname reorderCelda #' @export setMethod("reorderCelda", signature(x = "matrix", celdaMod = "celda_CG"), function(x, celdaMod, method = "complete") { res <- .reorderCeldaCG(x, celdaMod, method = method) return(res) }) #' @examples #' data(celdaCSim, celdaCMod) #' reorderCeldaC <- reorderCelda(celdaCSim$counts, celdaCMod) #' @rdname reorderCelda #' @export setMethod("reorderCelda", signature(x = "matrix", celdaMod = "celda_C"), function(x, celdaMod, method = "complete") { res <- .reorderCeldaC(x, celdaMod, method = method) return(res) }) #' @examples #' data(celdaGSim, celdaGMod) #' reorderCeldaG <- reorderCelda(celdaGSim$counts, celdaGMod) #' @rdname reorderCelda #' @export setMethod("reorderCelda", signature(x = "matrix", celdaMod = "celda_G"), function(x, celdaMod, method = "complete") { res <- .reorderCeldaG(x, celdaMod, method = method) return(res) }) .reorderCeldaC <- function(counts, res, method = "complete") { if (params(res)$K > 2 & isTRUE(length(unique(celdaClusters(res)$z)) > 1)) { res@clusters$z <- as.integer(as.factor(celdaClusters(res)$z)) fm <- factorizeMatrix(counts, res, type = "posterior") uniqueZ <- sort(unique(celdaClusters(res)$z)) d <- .cosineDist(fm$posterior$module[, uniqueZ]) h <- stats::hclust(d, method = method) res <- .recodeClusterZ(res, from = h$order, to = seq(length(h$order)) ) } return(res) } .reorderCeldaCsce <- function(sce, useAssay, altExpName, method = "complete") { if (S4Vectors::metadata(SingleCellExperiment::altExp(sce, altExpName))$celda_parameters$K > 2 & isTRUE(length(unique(celdaClusters(sce, altExpName))) > 1)) { celdaClusters(sce, altExpName) <- as.integer(as.factor(celdaClusters(sce, altExpName))) fm <- factorizeMatrix(sce, useAssay = useAssay, altExpName = altExpName, type = "posterior") uniqueZ <- sort(unique(celdaClusters(sce))) d <- .cosineDist(fm$posterior$module[, uniqueZ]) h <- stats::hclust(d, method = method) sce <- recodeClusterZ(sce, from = h$order, to = seq(length(h$order)), altExpName = altExpName) } return(sce) } .reorderCeldaG <- function(counts, res, method = "complete") { if (params(res)$L > 2 & isTRUE(length(unique(celdaClusters(res)$y)) > 1)) { res@clusters$y <- as.integer(as.factor(celdaClusters(res)$y)) fm <- factorizeMatrix(counts, res, type = "posterior") uniqueY <- sort(unique(celdaClusters(res)$y)) cs <- prop.table(t(fm$posterior$cell[uniqueY, ]), 2) d <- .cosineDist(cs) h <- stats::hclust(d, method = method) res <- .recodeClusterY(res, from = h$order, to = seq(length(h$order))) } return(res) } .reorderCeldaGsce <- function(sce, useAssay, altExpName, method = "complete") { if (S4Vectors::metadata(SingleCellExperiment::altExp(sce, altExpName))$celda_parameters$L > 2 & isTRUE(length(unique(celdaModules(sce, altExpName))) > 1)) { celdaModules(sce, altExpName) <- as.integer(as.factor(celdaModules(sce, altExpName))) fm <- factorizeMatrix(sce, useAssay = useAssay, altExpName = altExpName, type = "posterior") uniqueY <- sort(unique(celdaModules(sce))) cs <- prop.table(t(fm$posterior$cell[uniqueY, ]), 2) d <- .cosineDist(cs) h <- stats::hclust(d, method = method) sce <- recodeClusterY(sce, from = h$order, to = seq(length(h$order)), altExpName = altExpName) } return(sce) } .reorderCeldaCG <- function(counts, res, method = "complete") { # Reorder K if (params(res)$K > 2 & isTRUE(length(unique(celdaClusters(res)$z)) > 1)) { res@clusters$z <- as.integer(as.factor(celdaClusters(res)$z)) fm <- factorizeMatrix(counts, res, type = "posterior") uniqueZ <- sort(unique(celdaClusters(res)$z)) d <- .cosineDist(fm$posterior$cellPopulation[, uniqueZ]) h <- stats::hclust(d, method = method) res <- .recodeClusterZ(res, from = h$order, to = seq(length(h$order))) } # Reorder L if (params(res)$L > 2 & isTRUE(length(unique(celdaClusters(res)$y)) > 1)) { res@clusters$y <- as.integer(as.factor(celdaClusters(res)$y)) fm <- factorizeMatrix(counts, res, type = "posterior") uniqueY <- sort(unique(celdaClusters(res)$y)) cs <- prop.table(t(fm$posterior$cellPopulation[uniqueY, ]), 2) d <- .cosineDist(cs) h <- stats::hclust(d, method = method) res <- .recodeClusterY(res, from = h$order, to = seq(length(h$order))) } return(res) } .reorderCeldaCGsce <- function(sce, useAssay, altExpName, method = "complete") { # Reorder K if (S4Vectors::metadata(SingleCellExperiment::altExp(sce, altExpName))$celda_parameters$K > 2 & isTRUE(length(unique(celdaClusters(sce, altExpName))) > 1)) { celdaClusters(sce, altExpName) <- as.integer(as.factor(celdaClusters(sce, altExpName))) fm <- factorizeMatrix(sce, useAssay = useAssay, altExpName = altExpName, type = "posterior") uniqueZ <- sort(unique(celdaClusters(sce))) d <- .cosineDist(fm$posterior$cellPopulation[, uniqueZ]) h <- stats::hclust(d, method = method) sce <- recodeClusterZ(sce, from = h$order, to = seq(length(h$order)), altExpName = altExpName) } # Reorder L if (S4Vectors::metadata(SingleCellExperiment::altExp(sce, altExpName))$celda_parameters$L > 2 & isTRUE(length(unique(celdaModules(sce, altExpName))) > 1)) { celdaModules(sce, altExpName) <- as.integer(as.factor(celdaModules(sce, altExpName))) fm <- factorizeMatrix(sce, useAssay = useAssay, altExpName = altExpName, type = "posterior") uniqueY <- sort(unique(celdaModules(sce))) cs <- prop.table(t(fm$posterior$cellPopulation[uniqueY, ]), 2) d <- .cosineDist(cs) h <- stats::hclust(d, method = method) sce <- recodeClusterY(sce, from = h$order, to = seq(length(h$order)), altExpName = altExpName) } return(sce) } ================================================ FILE: R/reports.R ================================================ #' @title Generate an HTML report for celda_CG #' @name reportceldaCG #' @description \code{reportCeldaCGRun} will run \link{recursiveSplitModule} and #' \link{recursiveSplitCell} to find the number of modules (\code{L}) and the #' number of cell populations (\code{K}). A final \link{celda_CG} model will #' be selected from \link{recursiveSplitCell}. After a \link{celda_CG} model #' has been fit, \code{reportCeldaCGPlotResults} can be used to create an HTML #' report for visualization and exploration of the \link{celda_CG} model #' results. Some of the plotting and feature selection functions require the #' installation of the Bioconductor package \code{singleCellTK}. #' @param sce A \linkS4class{SingleCellExperiment} with the matrix located in #' the assay slot under \code{useAssay}. Rows represent features and columns #' represent cells. #' @param L Integer. Final number of feature modules. See \code{celda_CG} for #' more information. #' @param K Integer. Final number of cell populations. See \code{celda_CG} for #' more information. #' @param sampleLabel Vector or factor. Denotes the sample label for each cell #' (column) in the count matrix. #' @param altExpName The name for the \link{altExp} slot to use. Default #' \code{"featureSubset"}. #' @param useAssay A string specifying which \link{assay} slot to use. Default #' \code{"counts"}. #' @param initialL Integer. Minimum number of modules to try. See #' \link{recursiveSplitModule} for more information. Defailt \code{10}. #' @param maxL Integer. Maximum number of modules to try. See #' \link{recursiveSplitModule} for more information. Default \code{150}. #' @param initialK Integer. Initial number of cell populations to try. #' @param maxK Integer. Maximum number of cell populations to try. #' @param minCell Integer. Minimum number of cells required for feature #' selection. See \link{selectFeatures} for more information. Default #' \code{3}. #' @param minCount Integer. Minimum number of counts required for feature #' selection. See \link{selectFeatures} for more information. Default #' \code{3}. #' @param maxFeatures Integer. Maximum number of features to include. If the #' number of features after filtering for \code{minCell} and \code{minCount} #' are greater than \code{maxFeature}, then Seurat's VST function is used to #' select the top variable features. Default \code{5000}. #' @param reducedDimName Character. Name of the reduced dimensional object to be #' used in 2-D scatter plots throughout the report. Default \code{celda_UMAP}. #' @param features Character vector. Expression of these features will be #' displayed on a reduced dimensional plot defined by \code{reducedDimName}. #' If \code{NULL}, then no plotting of features on a reduced dimensinoal plot #' will be performed. Default \code{NULL}. #' @param displayName Character. The name to use for display in scatter plots #' and heatmaps. If \code{NULL}, then the rownames of the \code{sce} object #' will be used. This can also be set to the name of a column in the row data #' of \code{sce} or \code{altExp(sce, altExpName)}. Default \code{NULL}. #' @param cellAnnot Character vector. The cell-level annotations to display on #' the reduced dimensional plot. These variables should be present in the #' column data of the \code{sce} object. Default \code{NULL}. #' @param cellAnnotLabel Character vector. Additional cell-level annotations #' to display on the reduced dimensional plot. Variables will be treated #' as categorial and labels for each group will be placed on the plot. #' These variables should be present in the column data of the \code{sce} #' object. Default \code{NULL}. #' @param exactMatch Boolean. Whether to only identify exact matches or to #' identify partial matches using \code{\link{grep}}. Default \code{FALSE}. #' @param moduleFilePrefix Character. The features in each module will be #' written to a a csv file starting with this name. If \code{NULL}, then no #' file will be written. Default \code{"module_features"}. #' @param output_file Character. Prefix of the html file. Default #' \code{"CeldaCG_ResultReport"}. #' @param output_sce_prefix Character. The \code{sce} object with #' \code{celda_CG} results will be saved to an \code{.rds} file starting with #' this prefix. Default \code{celda_cg}. #' @param output_dir Character. Path to save the html file. Default \code{.}. #' @param pdf Boolean. Whether to create PDF versions of each plot in addition #' to PNGs. Default \code{FALSE}. #' @param showSetup Boolean. Whether to show the setup code at the beginning. #' Default \code{TRUE}. #' @param showSession Boolean. Whether to show the session information at the #' end. Default \code{TRUE}. #' @return .html file #' @rdname reportceldaCG #' @examples #' data(sceCeldaCG) #' \dontrun{ #' library(SingleCellExperiment) #' sceCeldaCG$sum <- colSums(counts(sceCeldaCG)) #' rowData(sceCeldaCG)$rownames <- rownames(sceCeldaCG) #' sceCeldaCG <- reportCeldaCGRun(sceCeldaCG, #' initialL = 5, maxL = 20, initialK = 5, #' maxK = 20, L = 10, K = 5) #' reportCeldaCGPlotResults(sce = sceCeldaCG, #' reducedDimName = "celda_UMAP", #' features = c("Gene_1", "Gene_100"), #' displayName = "rownames", #' cellAnnot="sum") #' } NULL #' @rdname reportceldaCG #' @export reportCeldaCGRun <- function(sce, L, K, sampleLabel = NULL, altExpName = "featureSubset", useAssay = "counts", initialL = 10, maxL = 150, initialK = 5, maxK = 50, minCell = 3, minCount = 3, maxFeatures = 5000, output_file = "CeldaCG_RunReport", output_sce_prefix = "celda_cg", output_dir = ".", pdf = FALSE, showSession = TRUE) { sceFile <- file.path(normalizePath(output_dir), paste0(output_sce_prefix, ".rds")) rmarkdown::render( system.file("rmarkdown/CeldaCG_Run.Rmd", package = "celda"), params = list( sce = sce, L = L, K = K, sampleLabel = sampleLabel, altExpName = altExpName, useAssay = useAssay, initialL = initialL, maxL = maxL, initialK = initialK, maxK = maxK, minCell = minCell, minCount = minCount, maxFeatures = maxFeatures, sceFile = sceFile, pdf = isTRUE(pdf), showSession = isTRUE(showSession) ), output_file = output_file, output_dir = output_dir, intermediates_dir = output_dir, knit_root_dir = output_dir ) if (!is.null(output_sce_prefix)) { if (file.exists(sceFile)) { sce <- readRDS(sceFile) invisible(sce) } else { warning( "The file '", sceFile, "' could not be found. The SCE with celda_CG results was not reloaded." ) } } } #' @rdname reportceldaCG #' @export reportCeldaCGPlotResults <- function(sce, reducedDimName, features = NULL, displayName = NULL, altExpName = "featureSubset", useAssay = "counts", cellAnnot = NULL, cellAnnotLabel = NULL, exactMatch = TRUE, moduleFilePrefix = "module_features", output_file = "CeldaCG_ResultReport", output_dir = ".", pdf = FALSE, showSetup = TRUE, showSession = TRUE) { moduleFileName <- NULL if (!is.null(moduleFilePrefix)) { moduleFileName <- file.path(normalizePath(output_dir), paste0(moduleFilePrefix, ".csv")) } rmarkdown::render( system.file("rmarkdown/CeldaCG_PlotResults.Rmd", package = "celda"), params = list( sce = sce, altExpName = altExpName, useAssay = useAssay, reducedDimName = reducedDimName, features = features, displayName = displayName, cellAnnot = cellAnnot, cellAnnotLabel = cellAnnotLabel, exactMatch = isTRUE(exactMatch), moduleFileName = moduleFileName, pdf = isTRUE(pdf), showSetup = isTRUE(showSetup), showSession = isTRUE(showSession) ), output_file = output_file, output_dir = output_dir, intermediates_dir = output_dir, knit_root_dir = output_dir ) } ================================================ FILE: R/selectFeatures.R ================================================ #' @title Simple feature selection by feature counts #' @description A simple heuristic feature selection procedure. #' Select features with at least \code{minCount} counts #' in at least \code{minCell} cells. A \linkS4class{SingleCellExperiment} #' object with subset features will be stored in the #' \link{altExp} slot with name \code{altExpName}. #' The name of the \code{assay} slot in \link{altExp} #' will be the same as \code{useAssay}. #' @param x A numeric \link{matrix} of counts or a #' \linkS4class{SingleCellExperiment} #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. #' @param minCount Minimum number of counts required for feature selection. #' @param minCell Minimum number of cells required for feature selection. #' @param useAssay A string specifying the name of the #' \link{assay} slot to use. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default "featureSubset". #' @return A \linkS4class{SingleCellExperiment} object with a #' \code{altExpName} \link{altExp} slot. Function #' parameter settings are stored in the \link{metadata} #' \code{"select_features"} slot. #' @export setGeneric("selectFeatures", function(x, minCount = 3, minCell = 3, useAssay = "counts", altExpName = "featureSubset") { standardGeneric("selectFeatures")}) #' @rdname selectFeatures #' @examples #' data(sceCeldaCG) #' sce <- selectFeatures(sceCeldaCG) #' @export setMethod("selectFeatures", signature(x = "SingleCellExperiment"), function(x, minCount = 3, minCell = 3, useAssay = "counts", altExpName = "featureSubset") { assay <- SummarizedExperiment::assay(x, i = useAssay) sceSubset <- x[Matrix::rowSums(assay >= minCount) >= minCell, ] SingleCellExperiment::altExp(x, altExpName) <- sceSubset S4Vectors::metadata(x)[["select_features"]] <- list( xClass = "SingleCellExperiment", minCount = minCount, minCell = minCell, useAssay = useAssay, altExpName = altExpName) return(x) } ) #' @rdname selectFeatures #' @examples #' data(celdaCGSim) #' sce <- selectFeatures(celdaCGSim$counts) #' @export setMethod("selectFeatures", signature(x = "matrix"), function(x, minCount = 3, minCell = 3, useAssay = "counts", altExpName = "featureSubset") { ls <- list() ls[[useAssay]] <- x sce <- SingleCellExperiment::SingleCellExperiment(assays = ls) sceSubset <- sce[rowSums(x >= minCount) >= minCell, ] SingleCellExperiment::altExp(sce, altExpName) <- sceSubset S4Vectors::metadata(sce)[["select_features"]] <- list( xClass = "matrix", minCount = minCount, minCell = minCell, useAssay = useAssay, altExpName = altExpName) return(sce) } ) ================================================ FILE: R/semi_pheatmap.R ================================================ # Adapted originally from the very excellent pheatmap package # (https://cran.r-project.org/web/packages/pheatmap/index.html) #' @importFrom gtable gtable .lo <- function(rown, coln, nrow, ncol, cellHeight = NA, cellWidth = NA, treeHeightCol, treeHeightRow, legend, annotationRow, annotationCol, annotationColors, annotationLegend, annotationNamesRow, annotationNamesCol, main, fontSize, fontSizeRow, fontSizeCol, gapsRow, gapsCol, ...) { # Get height of colnames and length of rownames if (!is.null(coln[1]) | (!.is.na2(annotationRow) & annotationNamesRow)) { if (!is.null(coln[1])) { t <- coln } else { t <- "" } tw <- strwidth(t, units = "in", cex = fontSizeCol / fontSize) if (annotationNamesRow) { t <- c(t, colnames(annotationRow)) tw <- c(tw, strwidth(colnames(annotationRow), units = "in")) } longestColn <- which.max(tw) gp <- list(fontSize = ifelse(longestColn <= length(coln), fontSizeCol, fontSize ), ...) colnHeight <- unit( 1, "grobheight", textGrob(t[longestColn], rot = 90, gp = do.call(gpar, gp) ) ) + unit(10, "bigpts") } else { colnHeight <- unit(5, "bigpts") } if (!is.null(rown[1])) { t <- rown tw <- strwidth(t, units = "in", cex = fontSizeRow / fontSize) if (annotationNamesCol) { t <- c(t, colnames(annotationCol)) tw <- c(tw, strwidth(colnames(annotationCol), units = "in")) } longestRown <- which.max(tw) gp <- list(fontSize = ifelse(longestRown <= length(rown), fontSizeRow, fontSize ), ...) rownWidth <- unit( 1, "grobwidth", textGrob(t[longestRown], gp = do.call(gpar, gp) ) ) + unit(10, "bigpts") } else { rownWidth <- unit(5, "bigpts") } gp <- list(fontSize = fontSize, ...) # Legend position if (!.is.na2(legend)) { longestBreak <- which.max(nchar(names(legend))) longestBreak <- unit( 1.1, "grobwidth", textGrob(as.character(names(legend))[longestBreak], gp = do.call(gpar, gp) ) ) titleLength <- unit( 1.1, "grobwidth", textGrob("Scale", gp = gpar( fontface = "bold", ... ) ) ) legendWidth <- unit(12, "bigpts") + longestBreak * 1.2 legendWidth <- max(titleLength, legendWidth) } else { legendWidth <- unit(0, "bigpts") } # Set main title height if (is.na(main)) { mainHeight <- unit(0, "npc") } else { mainHeight <- unit( 1.5, "grobheight", textGrob(main, gp = gpar( fontSize = 1.3 * fontSize, ... ) ) ) } # Column annotations textheight <- unit(fontSize, "bigpts") if (!.is.na2(annotationCol)) { # Column annotation height annotColHeight <- ncol(annotationCol) * (textheight + unit(2, "bigpts")) + unit(2, "bigpts") # Width of the correponding legend t <- c(as.vector(as.matrix(annotationCol)), colnames(annotationCol)) annotColLegendWidth <- unit( 1.2, "grobwidth", textGrob(t[which.max(nchar(t))], gp = gpar(...) ) ) + unit(12, "bigpts") if (!annotationLegend) { annotColLegendWidth <- unit(0, "npc") } } else { annotColHeight <- unit(0, "bigpts") annotColLegendWidth <- unit(0, "bigpts") } # Row annotations if (!.is.na2(annotationRow)) { # Row annotation width annotRowWidth <- ncol(annotationRow) * (textheight + unit(2, "bigpts")) + unit(2, "bigpts") # Width of the correponding legend t <- c( as.vector(as.matrix(annotationRow)), colnames(annotationRow) ) annotRowLegendWidth <- unit( 1.2, "grobwidth", textGrob(t[which.max(nchar(t))], gp = gpar(...) ) ) + unit( 12, "bigpts" ) if (!annotationLegend) { annotRowLegendWidth <- unit(0, "npc") } } else { annotRowWidth <- unit(0, "bigpts") annotRowLegendWidth <- unit(0, "bigpts") } annotLegendWidth <- max(annotRowLegendWidth, annotColLegendWidth) # Tree height treeHeightCol <- unit(treeHeightCol, "bigpts") + unit(5, "bigpts") treeHeightRow <- unit(treeHeightRow, "bigpts") + unit(5, "bigpts") # Set cell sizes if (is.na(cellWidth)) { matWidth <- unit(1, "npc") - rownWidth - legendWidth - treeHeightRow - annotRowWidth - annotLegendWidth } else { matWidth <- unit(cellWidth * ncol, "bigpts") + length(gapsCol) * unit(0, "bigpts") } if (is.na(cellHeight)) { matHeight <- unit(1, "npc") - mainHeight - colnHeight - treeHeightCol - annotColHeight } else { matHeight <- unit(cellHeight * nrow, "bigpts") + length(gapsRow) * unit(0, "bigpts") } # Produce gtable gt <- gtable::gtable( widths = unit.c( treeHeightRow, annotRowWidth, matWidth, rownWidth, legendWidth, annotLegendWidth ), heights = unit.c( mainHeight, treeHeightCol, annotColHeight, matHeight, colnHeight ), vp = viewport(gp = do.call(gpar, gp)) ) cw <- convertWidth(matWidth - (length(gapsCol) * unit(0, "bigpts")), "bigpts", valueOnly = TRUE ) / ncol ch <- convertHeight(matHeight - (length(gapsRow) * unit(0, "bigpts")), "bigpts", valueOnly = TRUE ) / nrow # Return minimal cell dimension in bigpts to decide if borders are drawn mindim <- min(cw, ch) res <- list(gt = gt, mindim = mindim) return(res) } .findCoordinates <- function(n, gaps, m = seq(1, n)) { if (length(gaps) == 0) { return(list( coord = unit(m / n, "npc"), size = unit(1 / n, "npc") )) } if (max(gaps) > n) { stop("Gaps do not match with matrix size") } size <- (1 / n) * (unit(1, "npc") - length(gaps) * unit("0", "bigpts")) gaps2 <- base::apply(vapply( gaps, function(gap, x) { x > gap }, integer(n), m ), 1, sum) coord <- m * size + (gaps2 * unit("0", "bigpts")) return(list(coord = coord, size = size)) } .drawDendrogram <- function(hc, gaps, horizontal = TRUE) { h <- hc$height / max(hc$height) / 1.05 m <- hc$merge o <- hc$order n <- length(o) m[m > 0] <- n + m[m > 0] m[m < 0] <- abs(m[m < 0]) dist <- matrix(0, nrow = 2 * n - 1, ncol = 2, dimnames = list(NULL, c("x", "y")) ) dist[seq(1, n), 1] <- 1 / n / 2 + (1 / n) * (match(seq(1, n), o) - 1) for (i in seq(1, nrow(m))) { dist[n + i, 1] <- (dist[m[i, 1], 1] + dist[m[i, 2], 1]) / 2 dist[n + i, 2] <- h[i] } drawConnection <- function(x1, x2, y1, y2, y) { res <- list( x = c(x1, x1, x2, x2), y = c(y1, y, y, y2) ) return(res) } x <- rep(NA, nrow(m) * 4) y <- rep(NA, nrow(m) * 4) id <- rep(seq(nrow(m)), rep(4, nrow(m))) for (i in seq(1, nrow(m))) { c <- drawConnection( dist[m[i, 1], 1], dist[m[i, 2], 1], dist[m[i, 1], 2], dist[m[i, 2], 2], h[i] ) k <- (i - 1) * 4 + 1 x[seq(k, k + 3)] <- c$x y[seq(k, k + 3)] <- c$y } x <- .findCoordinates(n, gaps, x * n)$coord y <- unit(y, "npc") if (!horizontal) { a <- x x <- unit(1, "npc") - y y <- unit(1, "npc") - a } res <- polylineGrob(x = x, y = y, id = id) return(res) } .drawMatrix <- function(matrix, borderColor, gapsRows, gapsCols, fmat, fontSizeNumber, numberColor) { n <- nrow(matrix) m <- ncol(matrix) coordX <- .findCoordinates(m, gapsCols) coordY <- .findCoordinates(n, gapsRows) x <- coordX$coord - 0.5 * coordX$size y <- unit(1, "npc") - (coordY$coord - 0.5 * coordY$size) coord <- expand.grid(y = y, x = x) res <- gList() res[["rect"]] <- rectGrob( x = coord$x, y = coord$y, width = coordX$size, height = coordY$size, gp = gpar(fill = matrix, col = borderColor) ) if (attr(fmat, "draw")) { res[["text"]] <- textGrob( x = coord$x, y = coord$y, label = fmat, gp = gpar(col = numberColor, fontSize = fontSizeNumber) ) } res <- gTree(children = res) return(res) } .drawColnames <- function(coln, gaps, ...) { coord <- .findCoordinates(length(coln), gaps) x <- coord$coord - 0.5 * coord$size res <- textGrob(coln, x = x, y = unit(1, "npc") - unit(3, "bigpts"), vjust = 0.5, hjust = 0, rot = 270, gp = gpar(...) ) return(res) } .drawRownames <- function(rown, gaps, ...) { coord <- .findCoordinates(length(rown), gaps) y <- unit(1, "npc") - (coord$coord - 0.5 * coord$size) res <- textGrob(rown, x = unit(3, "bigpts"), y = y, vjust = 0.5, hjust = 0, gp = gpar(...) ) return(res) } .drawLegend <- function(color, breaks, legend, ...) { height <- min(unit(1, "npc"), unit(150, "bigpts")) legendPos <- (legend - min(breaks)) / (max(breaks) - min(breaks)) legendPos <- height * legendPos + (unit(1, "npc") - height) breaks <- (breaks - min(breaks)) / (max(breaks) - min(breaks)) breaks <- height * breaks + (unit(1, "npc") - height) h <- breaks[-1] - breaks[-length(breaks)] rect <- rectGrob( x = 0, y = breaks[-length(breaks)], width = unit(10, "bigpts"), height = h, hjust = 0, vjust = 0, gp = gpar(fill = color, col = "#FFFFFF00") ) text <- textGrob(names(legend), x = unit(14, "bigpts"), y = legendPos, hjust = 0, gp = gpar(...) ) res <- grobTree(rect, text) return(res) } .convertAnnotations <- function(annotation, annotationColors) { new <- annotation for (i in seq(ncol(annotation))) { a <- annotation[, i] b <- annotationColors[[colnames(annotation)[i]]] if (is.character(a) | is.factor(a)) { a <- as.character(a) if (length(setdiff(setdiff(a, NA), names(b))) > 0) { stop(sprintf( "Factor levels on variable %s do not match with annotationColors", colnames(annotation)[i] )) } new[, i] <- b[a] } else { a <- cut(a, breaks = 100) new[, i] <- colorRampPalette(b)(100)[a] } } return(as.matrix(new)) } .drawAnnotations <- function(convertedAnnotations, borderColor, gaps, fontSize, horizontal) { n <- ncol(convertedAnnotations) m <- nrow(convertedAnnotations) coordX <- .findCoordinates(m, gaps) x <- coordX$coord - 0.5 * coordX$size # y = cumsum(rep(fontSize, n)) - 4 + cumsum(rep(2, n)) y <- cumsum(rep(fontSize, n)) + cumsum(rep(2, n)) - fontSize / 2 + 1 y <- unit(y, "bigpts") if (horizontal) { coord <- expand.grid(x = x, y = y) res <- rectGrob( x = coord$x, y = coord$y, width = coordX$size, height = unit(fontSize, "bigpts"), gp = gpar(fill = convertedAnnotations, col = borderColor) ) } else { a <- x x <- unit(1, "npc") - y y <- unit(1, "npc") - a coord <- expand.grid(y = y, x = x) res <- rectGrob( x = coord$x, y = coord$y, width = unit(fontSize, "bigpts"), height = coordX$size, gp = gpar(fill = convertedAnnotations, col = borderColor) ) } return(res) } .drawAnnotationNames <- function(annotations, fontSize, horizontal) { n <- ncol(annotations) x <- unit(3, "bigpts") y <- cumsum(rep(fontSize, n)) + cumsum(rep(2, n)) - fontSize / 2 + 1 y <- unit(y, "bigpts") if (horizontal) { res <- textGrob(colnames(annotations), x = x, y = y, hjust = 0, gp = gpar(fontSize = fontSize, fontface = 2) ) } else { a <- x x <- unit(1, "npc") - y y <- unit(1, "npc") - a res <- textGrob(colnames(annotations), x = x, y = y, vjust = 0.5, hjust = 0, rot = 270, gp = gpar(fontSize = fontSize, fontface = 2) ) } return(res) } .drawAnnotationLegend <- function(annotation, annotationColors, borderColor, ...) { y <- unit(1, "npc") textHeight <- unit( 1, "grobheight", textGrob("FGH", gp = gpar(...)) ) res <- gList() for (i in names(annotation)) { res[[i]] <- textGrob(i, x = 0, y = y, vjust = 1, hjust = 0, gp = gpar(fontface = "bold", ...) ) y <- y - 1.5 * textHeight if (is.character(annotation[[i]]) | is.factor(annotation[[i]])) { n <- length(annotationColors[[i]]) yy <- y - (seq(n) - 1) * 2 * textHeight res[[paste(i, "r")]] <- rectGrob( x = unit(0, "npc"), y = yy, hjust = 0, vjust = 1, height = 2 * textHeight, width = 2 * textHeight, gp = gpar(col = borderColor, fill = annotationColors[[i]]) ) res[[paste(i, "t")]] <- textGrob(names(annotationColors[[i]]), x = textHeight * 2.4, y = yy - textHeight, hjust = 0, vjust = 0.5, gp = gpar(...) ) y <- y - n * 2 * textHeight } else { yy <- y - 8 * textHeight + seq(0, 1, 0.25)[-1] * 8 * textHeight h <- 8 * textHeight * 0.25 res[[paste(i, "r")]] <- rectGrob( x = unit(0, "npc"), y = yy, hjust = 0, vjust = 1, height = h, width = 2 * textHeight, gp = gpar( col = NA, fill = colorRampPalette(annotationColors[[i]])(4) ) ) res[[paste(i, "r2")]] <- rectGrob( x = unit(0, "npc"), y = y, hjust = 0, vjust = 1, height = 8 * textHeight, width = 2 * textHeight, gp = gpar(col = borderColor, fill = NA) ) txt <- rev(range(grid::grid.pretty(range(annotation[[i]], na.rm = TRUE )))) yy <- y - c(1, 7) * textHeight res[[paste(i, "t")]] <- textGrob(txt, x = textHeight * 2.4, y = yy, hjust = 0, vjust = 0.5, gp = gpar(...) ) y <- y - 8 * textHeight } y <- y - 1.5 * textHeight } res <- gTree(children = res) return(res) } .drawMain <- function(text, ...) { res <- textGrob(text, gp = gpar(fontface = "bold", ...)) return(res) } vplayout <- function(x, y) { return(viewport(layout.pos.row = x, layout.pos.col = y)) } #' @importFrom gtable gtable_height #' @importFrom gtable gtable_width #' @importFrom gtable gtable_add_grob #' @import grDevices .heatmapMotor <- function(matrix, borderColor, cellWidth, cellHeight, treeCol, treeRow, treeHeightCol, treeHeightRow, fileName, width, height, breaks, color, legend, annotationRow, annotationCol, annotationColors, annotationLegend, annotationNamesRow, annotationNamesCol, main, fontSize, fontSizeRow, fontSizeCol, fmat, fontSizeNumber, numberColor, gapsCol, gapsRow, labelsRow, labelsCol, ...) { # Set layout lo <- .lo( coln = labelsCol, rown = labelsRow, nrow = nrow(matrix), ncol = ncol(matrix), cellWidth = cellWidth, cellHeight = cellHeight, treeHeightCol = treeHeightCol, treeHeightRow = treeHeightRow, legend = legend, annotationCol = annotationCol, annotationRow = annotationRow, annotationColors = annotationColors, annotationLegend = annotationLegend, annotationNamesRow = annotationNamesRow, annotationNamesCol = annotationNamesCol, main = main, fontSize = fontSize, fontSizeRow = fontSizeRow, fontSizeCol = fontSizeCol, gapsRow = gapsRow, gapsCol = gapsCol, ... ) res <- lo$gt mindim <- lo$mindim if (!is.na(fileName)) { if (is.na(height)) { height <- convertHeight(gtable::gtable_height(res), "inches", valueOnly = TRUE ) } if (is.na(width)) { width <- convertWidth(gtable::gtable_width(res), "inches", valueOnly = TRUE ) } # Get file type r <- regexpr("\\.[a-zA-Z]*$", fileName) if (r == -1) { stop("Improper fileName") } ending <- substr( fileName, r + 1, r + attr(r, "match.length") ) f <- switch(ending, pdf = function(x, ...) { pdf(x, ...) }, png = function(x, ...) { png(x, units = "in", res = 300, ... ) }, jpeg = function(x, ...) { jpeg(x, units = "in", res = 300, ... ) }, jpg = function(x, ...) { jpeg(x, units = "in", res = 300, ... ) }, tiff = function(x, ...) { tiff(x, units = "in", res = 300, compression = "lzw", ... ) }, bmp = function(x, ...) { bmp(x, units = "in", res = 300, ... ) }, stop("File type should be: pdf, png, bmp, jpg, tiff") ) # print(sprintf("height:%f width:%f", height, width)) # gt = .heatmapMotor(matrix, # cellWidth = cellWidth, # cellHeight = cellHeight, # borderColor = borderColor, # treeCol = treeCol, # treeRow = treeRow, # treeHeightCol = treeHeightCol, # treeHeightRow = treeHeightRow, # breaks = breaks, # color = color, # legend = legend, # annotationCol = annotationCol, # annotationRow = annotationRow, # annotationColors = annotationColors, # annotationLegend = annotationLegend, # fileName = NA, main = main, # fontSize = fontSize, # fontSizeRow = fontSizeRow, # fontSizeCol = fontSizeCol, # fmat = fmat, # fontSizeNumber = fontSizeNumber, # numberColor = numberColor, # labelsRow = labelsRow, # labelsCol = labelsCol, # gapsCol = gapsCol, # gapsRow = gapsRow, ...) f(fileName, height = height, width = width) gt <- .heatmapMotor(matrix, cellWidth = cellWidth, cellHeight = cellHeight, borderColor = borderColor, treeCol = treeCol, treeRow = treeRow, treeHeightCol = treeHeightCol, treeHeightRow = treeHeightRow, breaks = breaks, color = color, legend = legend, annotationCol = annotationCol, annotationRow = annotationRow, annotationColors = annotationColors, annotationLegend = annotationLegend, annotationNamesRow = annotationNamesRow, annotationNamesCol = annotationNamesCol, fileName = NA, main = main, fontSize = fontSize, fontSizeRow = fontSizeRow, fontSizeCol = fontSizeCol, fmat = fmat, fontSizeNumber = fontSizeNumber, numberColor = numberColor, labelsRow = labelsRow, labelsCol = labelsCol, gapsCol = gapsCol, gapsRow = gapsRow, ... ) grid.draw(gt) dev.off() return(gt) } # Omit border color if cell size is too small if (mindim < 3) { borderColor <- NA } # Draw title if (!is.na(main)) { elem <- .drawMain(main, fontSize = 1.3 * fontSize, ...) res <- gtable::gtable_add_grob(res, elem, t = 1, l = 3, name = "main", clip = "off" ) } # Draw tree for the columns if (!.is.na2(treeCol) & treeHeightCol != 0) { elem <- .drawDendrogram(treeCol, gapsCol, horizontal = TRUE) res <- gtable::gtable_add_grob(res, elem, t = 2, l = 3, name = "col_tree" ) } # Draw tree for the rows if (!.is.na2(treeRow) & treeHeightRow != 0) { elem <- .drawDendrogram(treeRow, gapsRow, horizontal = FALSE) res <- gtable::gtable_add_grob(res, elem, t = 4, l = 1, name = "row_tree" ) } # Draw matrix elem <- .drawMatrix( matrix, borderColor, gapsRow, gapsCol, fmat, fontSizeNumber, numberColor ) res <- gtable::gtable_add_grob(res, elem, t = 4, l = 3, clip = "off", name = "matrix" ) # Draw colnames if (length(labelsCol) != 0) { pars <- list(labelsCol, gaps = gapsCol, fontSize = fontSizeCol, ... ) elem <- do.call(.drawColnames, pars) res <- gtable::gtable_add_grob(res, elem, t = 5, l = 3, clip = "off", name = "col_names" ) } # Draw rownames if (length(labelsRow) != 0) { pars <- list(labelsRow, gaps = gapsRow, fontSize = fontSizeRow, ... ) elem <- do.call(.drawRownames, pars) res <- gtable::gtable_add_grob(res, elem, t = 4, l = 4, clip = "off", name = "row_names" ) } # Draw annotation tracks on cols if (!.is.na2(annotationCol)) { # Draw tracks convertedAnnotation <- .convertAnnotations( annotationCol, annotationColors ) elem <- .drawAnnotations(convertedAnnotation, borderColor, gapsCol, fontSize, horizontal = TRUE ) res <- gtable::gtable_add_grob(res, elem, t = 3, l = 3, clip = "off", name = "col_annotation" ) # Draw names if (annotationNamesCol) { elem <- .drawAnnotationNames(annotationCol, fontSize, horizontal = TRUE ) res <- gtable::gtable_add_grob(res, elem, t = 3, l = 4, clip = "off", name = "col_annotation_names" ) } } # Draw annotation tracks on rows if (!.is.na2(annotationRow)) { # Draw tracks convertedAnnotation <- .convertAnnotations( annotationRow, annotationColors ) elem <- .drawAnnotations(convertedAnnotation, borderColor, gapsRow, fontSize, horizontal = FALSE ) res <- gtable::gtable_add_grob(res, elem, t = 4, l = 2, clip = "off", name = "row_annotation" ) # Draw names if (annotationNamesRow) { elem <- .drawAnnotationNames(annotationRow, fontSize, horizontal = FALSE ) res <- gtable::gtable_add_grob(res, elem, t = 5, l = 2, clip = "off", name = "row_annotation_names" ) } } # Draw annotation legend annotation <- c( annotationCol[seq(length(annotationCol), 1)], annotationRow[seq(length(annotationRow), 1)] ) annotation <- annotation[unlist(lapply( annotation, function(x) !.is.na2(x) ))] if (length(annotation) > 0 & annotationLegend) { elem <- .drawAnnotationLegend(annotation, annotationColors, borderColor, fontSize = fontSize, ... ) t <- ifelse(is.null(labelsRow), 4, 3) res <- gtable::gtable_add_grob(res, elem, t = t, l = 6, b = 5, clip = "off", name = "annotationLegend" ) } # Draw legend if (!.is.na2(legend)) { elem <- .drawLegend(color, breaks, legend, fontSize = fontSize, ...) t <- ifelse(is.null(labelsRow), 4, 3) res <- gtable::gtable_add_grob(res, elem, t = t, l = 5, b = 5, clip = "off", name = "legend" ) } return(res) } .generateBreaks <- function(x, n, center = FALSE) { if (center) { m <- max(abs(c( min(x, na.rm = TRUE), max(x, na.rm = TRUE) ))) res <- seq(-m, m, length.out = n + 1) } else { res <- seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n + 1 ) } return(res) } .scaleVecColours <- function(x, col = rainbow(10), breaks = NA) { return(col[as.numeric(cut(x, breaks = breaks, include.lowest = TRUE))]) } .scaleColours <- function(mat, col = rainbow(10), breaks = NA) { mat <- as.matrix(mat) return(matrix( .scaleVecColours(as.vector(mat), col = col, breaks = breaks), nrow(mat), ncol(mat), dimnames = list(rownames(mat), colnames(mat)) )) } ## changed the original clusterMat() in the pheatmap.r #' @importFrom scales hue_pal .clusterMat <- function(mat, labels, distance, method) { # this funciton is going to change the .clusterMat() in pheatmap if (!(method %in% c( "ward.D", "ward.D2", "ward", "single", "complete", "average", "mcquitty", "median", "centroid" ))) { stop("clustering method has to one form the list: 'ward', 'ward.D', 'ward.D2', 'single', 'complete', 'average', 'mcquitty', 'median' or 'centroid'.") } class.label <- unique(labels) nGroup <- length(class.label) # [#group] # get "hclust" object for each group then wrap them up as group.hclust # distance function preparation dis <- function(mat, distance) { if (!(distance[1] %in% c( "correlation", "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski" )) & !methods::is(distance, "dist")) { stop( "distance has to be a dissimilarity structure as produced by", " dist or one measure form the list: 'correlation',", " 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary',", " 'minkowski'" ) } if (distance[1] == "correlation") { # this part should be confirmed whether being wrong? # ToDo: how is the correlation matrix converted to a dsit matrix d <- stats::as.dist(1 - stats::cor(t(mat))) } else { d <- stats::dist(mat, method = distance) } return(d) } # initiate the final returning value: a "hclust" object cum.hclust <- list() if (nGroup == 1) { # matrix has only 1 group if (length(labels) == 1) { stop("only one row/column for the matrix") } group.hclust <- stats::hclust(dis( mat = mat, distance = distance ), method = method ) cum.hclust <- group.hclust } else { # matrix has more than 1 groups group.hclust <- vapply( class.label, function(x) { # get the positions of class label class.pos <- which(labels == x) if (length(class.pos) == 1) { # if only 1 row in the group return a manually made "hclust" # object sub.hclust <- as.list(seq(7)) names(sub.hclust) <- c( "merge", "height", "order", "labels", "method", "call", "dist.method" ) class(sub.hclust) <- "hclust" sub.hclust$merge <- matrix(c(0, 0), nrow = 1) sub.hclust$height <- 0 sub.hclust$order <- 1 return(sub.hclust) } else if (length(class.pos) > 1) { # if >1 rows return the "hclust" object return(stats::hclust(dis( mat = mat[class.pos, ], distance = distance ), method = method )) } }, list( "merge" = 0, "height" = 0, "order" = 0, "labels" = 0, "method" = 0, "call" = 0, "dist.method" = 0 ) ) # the length(group.hclust) is the [#group] == nGroup , # group.hclust[[i]] to get each "hclust" # then modify the "hclust" object and get them merged into one # "hclust" object # initiate the final "hclust" object cum.hclust <- group.hclust[, nGroup] # merge function preparation mergeHclust <- function(hclust1, hclust2) { # "hclust" object modifying function preparation if (hclust1$merge[1, 1] == 0 & hclust2$merge[1, 1] == 0) { # both groups have only 1 row hclustCom <- as.list(seq(7)) names(hclustCom) <- c( "merge", "height", "order", "labels", "method", "call", "dist.method" ) class(hclustCom) <- "hclust" hclustCom$merge <- matrix(c(-1, -2), nrow = 1) # check for different matrix whether 1 should be good hclustCom$height <- 1 hclustCom$order <- c(1, 2) return(hclustCom) } else if (hclust1$merge[1, 1] != 0 & hclust2$merge[1, 1] != 0) { # both group have >1 rows # nodes in the hclust1 group, so actually the #rows should # be dim()[1]+1 row.1 <- dim(hclust1$merge)[1] # nodes in the hclust2 group row.2 <- dim(hclust2$merge)[1] hclustCom <- list() mer <- hclust2$merge # modify the hclust2$merge matrix hclustCom$merge <- (mer > 0) * (mer + row.1) + (mer < 0) * (mer - row.1 - 1) # combine the merge matrix from the 2 groups hclustCom$merge <- rbind( hclust1$merge, hclustCom$merge ) hclustCom$merge <- rbind( hclustCom$merge, c(row.1, row.1 + row.2) ) hclustCom$height <- c(hclust1$height, hclust2$height) # check for different matrix whether 1 should be good hclustCom$height <- c( hclustCom$height, max(hclustCom$height) + 1 ) hclustCom$order <- c( hclust1$order, hclust2$order + row.1 + 1 ) class(hclustCom) <- "hclust" } else { # one group has only 1 row, the other group has >1 rows if (hclust1$merge[1, 1] == 0) { # hclust1 has 1 row , hclust2 has >1 rows # nodes in the hclust2 group row.2 <- dim(hclust2$merge)[1] hclustCom <- as.list(seq(7)) names(hclustCom) <- c( "merge", "height", "order", "labels", "method", "call", "dist.method" ) class(hclustCom) <- "hclust" mer <- hclust2$merge hclustCom$merge <- (mer > 0) * (mer) + (mer < 0) * (mer - 1) hclustCom$merge <- rbind( hclustCom$merge, c(-1, row.2) ) # check for different matrix whether 1 should be good hclustCom$height <- c( hclust2$height, max(hclust2$height) + 1 ) hclustCom$order <- c(1, hclust2$order + 1) } else if (hclust2$merge[1, 1] == 0) { # the hclust1 has >1 rows , and hclust2 has 1 row # nodes in the hclust1 group row.1 <- dim(hclust1$merge)[1] hclustCom <- as.list(seq(1, 7)) names(hclustCom) <- c( "merge", "height", "order", "labels", "method", "call", "dist.method" ) class(hclustCom) <- "hclust" hclustCom$merge <- hclust1$merge hclustCom$merge <- rbind( hclustCom$merge, c(row.1, - (row.1 + 2)) ) hclustCom$height <- c( hclust1$height, max(hclust1$height) + 1 ) hclustCom$order <- c( hclust1$order, max(hclust1$order) + 1 ) } } return(hclustCom) } # merge the "hclust" object into the final one "hclust" object for (i in seq(nGroup - 1, 1, -1)) { cum.hclust <- mergeHclust(group.hclust[, i], cum.hclust) } } cum.hclust$labels <- NULL cum.hclust$call <- NULL cum.hclust$method <- NULL cum.hclust$dist.method <- NULL return(cum.hclust) } .scaleRows <- function(x) { m <- base::apply(x, 1, mean, na.rm = TRUE) s <- base::apply(x, 1, stats::sd, na.rm = TRUE) return((x - m) / s) } .scaleMat <- function(mat, scale) { if (!(scale %in% c("none", "row", "column"))) { stop("scale argument shoud take values: 'none', 'row' or 'column'") } mat <- switch(scale, none = mat, row = .scaleRows(mat), column = t(.scaleRows(t(mat))) ) return(mat) } #' @importFrom scales dscale #' @importFrom scales brewer_pal .generateAnnotationColours <- function(annotation, annotationColors, drop) { if (.is.na2(annotationColors)) { annotationColors <- list() } count <- 0 for (i in seq(length(annotation))) { annotation[[i]] <- annotation[[i]][!is.na(annotation[[i]])] if (is.character(annotation[[i]]) | is.factor(annotation[[i]])) { if (is.factor(annotation[[i]]) & !drop) { count <- count + length(levels(annotation[[i]])) } else { count <- count + length(unique(annotation[[i]])) } } } factorColors <- scales::dscale( factor(seq(1, count)), scales::hue_pal(l = 75) ) contCounter <- 2 for (i in seq(length(annotation))) { if (!(names(annotation)[i] %in% names(annotationColors))) { if (is.character(annotation[[i]]) | is.factor(annotation[[i]])) { n <- length(unique(annotation[[i]])) if (is.factor(annotation[[i]]) & !drop) { n <- length(levels(annotation[[i]])) } ind <- sample(seq_along(factorColors), n) annotationColors[[names(annotation)[i]]] <- factorColors[ind] l <- levels(as.factor(annotation[[i]])) l <- l[l %in% unique(annotation[[i]])] if (is.factor(annotation[[i]]) & !drop) { l <- levels(annotation[[i]]) } names(annotationColors[[names(annotation)[i]]]) <- l factorColors <- factorColors[-ind] } else { annotationColors[[names(annotation)[i]]] <- scales::brewer_pal("seq", contCounter)(5)[seq(4)] contCounter <- contCounter + 1 } } } return(annotationColors) } .findGaps <- function(tree, cutreeN) { v <- stats::cutree(tree, cutreeN)[tree$order] gaps <- which((v[-1] - v[-length(v)]) != 0) return(gaps) } .is.na2 <- function(x) { if (is.list(x) | length(x) > 1) { return(FALSE) } if (length(x) == 0) { return(TRUE) } return(is.na(x)) } .identity2 <- function(x, ...) { return(x) } #' @title A function to draw clustered heatmaps. #' @description A function to draw clustered heatmaps where one has better #' control over some graphical parameters such as cell size, etc. #' #' The function also allows to aggregate the rows using kmeans clustering. #' This is advisable if number of rows is so big that R cannot handle their #' hierarchical clustering anymore, roughly more than 1000. Instead of showing #' all the rows separately one can cluster the rows in advance and show only #' the cluster centers. The number of clusters can be tuned with parameter #' kmeansK. #' @param mat numeric matrix of the values to be plotted. #' @param color vector of colors used in heatmap. #' @param kmeansK the number of kmeans clusters to make, if we want to #' agggregate the rows before drawing heatmap. If NA then the rows are not #' aggregated. #' @param breaks Numeric vector. A sequence of numbers that covers the range #' of values in the normalized `counts`. Values in the normalized `matrix` are #' assigned to each bin in `breaks`. Each break is assigned to a unique color #' from `col`. If NULL, then breaks are calculated automatically. Default NULL. #' @param borderColor color of cell borders on heatmap, use NA if no border #' should be drawn. #' @param cellWidth individual cell width in points. If left as NA, then the #' values depend on the size of plotting window. #' @param cellHeight individual cell height in points. If left as NA, then the #' values depend on the size of plotting window. #' @param scale character indicating if the values should be centered and #' scaled in either the row direction or the column direction, or none. #' Corresponding values are \code{"row"}, \code{"column"} and \code{"none"}. #' @param clusterRows boolean values determining if rows should be clustered or #' \code{hclust} object, #' @param clusterCols boolean values determining if columns should be clustered #' or \code{hclust} object. #' @param clusteringDistanceRows distance measure used in clustering rows. #' Possible values are \code{"correlation"} for Pearson correlation and all #' the distances supported by \code{\link{dist}}, such as \code{"euclidean"}, #' etc. If the value is none of the above it is assumed that a distance matrix #' is provided. #' @param clusteringDistanceCols distance measure used in clustering columns. #' Possible values the same as for clusteringDistanceRows. #' @param clusteringMethod clustering method used. Accepts the same values as #' \code{\link{hclust}}. #' @param clusteringCallback callback function to modify the clustering. Is #' called with two parameters: original \code{hclust} object and the matrix #' used for clustering. Must return a \code{hclust} object. #' @param cutreeRows number of clusters the rows are divided into, based on the #' hierarchical clustering (using cutree), if rows are not clustered, the #' argument is ignored #' @param cutreeCols similar to \code{cutreeRows}, but for columns #' @param treeHeightRow the height of a tree for rows, if these are clustered. #' Default value 50 points. #' @param treeHeightCol the height of a tree for columns, if these are #' clustered. Default value 50 points. #' @param legend logical to determine if legend should be drawn or not. #' @param legendBreaks vector of breakpoints for the legend. #' @param legendLabels vector of labels for the \code{legendBreaks}. #' @param annotationRow data frame that specifies the annotations shown on left #' side of the heatmap. Each row defines the features for a specific row. The #' rows in the data and in the annotation are matched using corresponding row #' names. Note that color schemes takes into account if variable is continuous #' or discrete. #' @param annotationCol similar to annotationRow, but for columns. #' @param annotation deprecated parameter that currently sets the annotationCol #' if it is missing. #' @param annotationColors list for specifying annotationRow and #' annotationCol track colors manually. It is possible to define the colors #' for only some of the features. Check examples for details. #' @param annotationLegend boolean value showing if the legend for annotation #' tracks should be drawn. #' @param annotationNamesRow boolean value showing if the names for row #' annotation tracks should be drawn. #' @param annotationNamesCol boolean value showing if the names for column #' annotation tracks should be drawn. #' @param dropLevels logical to determine if unused levels are also shown in #' the legend. #' @param showRownames boolean specifying if column names are be shown. #' @param showColnames boolean specifying if column names are be shown. #' @param main the title of the plot #' @param fontSize base fontsize for the plot #' @param fontSizeRow fontsize for rownames (Default: fontsize) #' @param fontSizeCol fontsize for colnames (Default: fontsize) #' @param displayNumbers logical determining if the numeric values are also #' printed to the cells. If this is a matrix (with same dimensions as original #' matrix), the contents of the matrix are shown instead of original values. #' @param numberFormat format strings (C printf style) of the numbers shown in #' cells. For example "\code{\%.2f}" shows 2 decimal places and "\code{\%.1e}" #' shows exponential notation (see more in \code{\link{sprintf}}). #' @param numberColor color of the text #' @param fontSizeNumber fontsize of the numbers displayed in cells #' @param gapsRow vector of row indices that show shere to put gaps into #' heatmap. Used only if the rows are not clustered. See \code{cutreeRow} #' to see how to introduce gaps to clustered rows. #' @param gapsCol similar to gapsRow, but for columns. #' @param labelsRow custom labels for rows that are used instead of rownames. #' @param labelsCol similar to labelsRow, but for columns. #' @param fileName file path where to save the picture. Filetype is decided by #' the extension in the path. Currently following formats are supported: png, #' pdf, tiff, bmp, jpeg. Even if the plot does not fit into the plotting #' window, the file size is calculated so that the plot would fit there, #' unless specified otherwise. #' @param width manual option for determining the output file width in inches. #' @param height manual option for determining the output file height in inches. #' @param silent do not draw the plot (useful when using the gtable output) #' @param rowLabel row cluster labels for semi-clustering #' @param colLabel column cluster labels for semi-clustering #' @param rowGroupOrder Vector. Specifies the order of feature clusters when #' semisupervised clustering is performed on the \code{y} labels. #' @param colGroupOrder Vector. Specifies the order of cell clusters when #' semisupervised clustering is performed on the \code{z} labels. #' @param \dots graphical parameters for the text used in plot. Parameters #' passed to \code{\link{grid.text}}, see \code{\link{gpar}}. #' @return #' Invisibly a list of components #' \itemize{ #' \item \code{treeRow} the clustering of rows as \code{\link{hclust}} #' object #' \item \code{treeCol} the clustering of columns as \code{\link{hclust}} #' object #' \item \code{kmeans} the kmeans clustering of rows if parameter #' \code{kmeansK} was specified #' } #' @author Raivo Kolde #' #@examples #' # Create test matrix #' test = matrix(rnorm(200), 20, 10) #' test[seq(10), seq(1, 10, 2)] = test[seq(10), seq(1, 10, 2)] + 3 #' test[seq(11, 20), seq(2, 10, 2)] = test[seq(11, 20), seq(2, 10, 2)] + 2 #' test[seq(15, 20), seq(2, 10, 2)] = test[seq(15, 20), seq(2, 10, 2)] + 4 #' colnames(test) = paste("Test", seq(10), sep = "") #' rownames(test) = paste("Gene", seq(20), sep = "") #' #' # Draw heatmaps #' pheatmap(test) #' pheatmap(test, kmeansK = 2) #' pheatmap(test, scale = "row", clusteringDistanceRows = "correlation") #' pheatmap(test, color = colorRampPalette(c("navy", #' "white", "firebrick3"))(50)) #' pheatmap(test, cluster_row = FALSE) #' pheatmap(test, legend = FALSE) #' #' # Show text within cells #' pheatmap(test, displayNumbers = TRUE) #' pheatmap(test, displayNumbers = TRUE, numberFormat = "\%.1e") #' pheatmap(test, displayNumbers = matrix(ifelse(test > 5, #' "*", ""), nrow(test))) #' pheatmap(test, cluster_row = FALSE, #' legendBreaks = seq(-1, 4), legendLabels = c("0", #' "1e-4", "1e-3", "1e-2", "1e-1", "1")) #' #' # Fix cell sizes and save to file with correct size #' pheatmap(test, cellWidth = 15, cellHeight = 12, main = "Example heatmap") #' pheatmap(test, cellWidth = 15, cellHeight = 12, fontSize = 8, #' fileName = "test.pdf") #' #' # Generate annotations for rows and columns #' annotationCol = data.frame(CellType = factor(rep(c("CT1", "CT2"), 5)), #' Time = seq(5)) #' rownames(annotationCol) = paste("Test", seq(10), sep = "") #' #' annotationRow = data.frame(GeneClass = factor(rep(c("Path1", #' "Path2", #' "Path3"), #' c(10, 4, 6)))) #' rownames(annotationRow) = paste("Gene", seq(20), sep = "") #' #' # Display row and color annotations #' pheatmap(test, annotationCol = annotationCol) #' pheatmap(test, annotationCol = annotationCol, annotationLegend = FALSE) #' pheatmap(test, annotationCol = annotationCol, annotationRow = annotationRow) #' #' # Specify colors #' ann_colors = list(Time = c("white", "firebrick"), #' CellType = c(CT1 = "#1B9E77", CT2 = "#D95F02"), #' GeneClass = c(Path1 = "#7570B3", Path2 = "#E7298A", Path3 = "#66A61E")) #' #' pheatmap(test, annotationCol = annotationCol, annotationColors = ann_colors, #' main = "Title") #' pheatmap(test, annotationCol = annotationCol, annotationRow = annotationRow, #' annotationColors = ann_colors) #' pheatmap(test, annotationCol = annotationCol, #' annotationColors = ann_colors[2]) #' #' # Gaps in heatmaps #' pheatmap(test, annotationCol = annotationCol, clusterRows = FALSE, #' gapsRow = c(10, 14)) #' pheatmap(test, annotationCol = annotationCol, clusterRows = FALSE, #' gapsRow = c(10, 14), cutreeCol = 2) #' #' # Show custom strings as row/col names #' labelsRow = c("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", #' "", "", "Il10", "Il15", "Il1b") #' #' pheatmap(test, annotationCol = annotationCol, labelsRow = labelsRow) #' #' # Specifying clustering from distance matrix #' drows = stats::dist(test, method = "minkowski") #' dcols = stats::dist(t(test), method = "minkowski") #' pheatmap(test, #' clusteringDistanceRows = drows, #' clusteringDistanceCols = dcols) #' #' # Modify ordering of the clusters using clustering callback option #' callback = function(hc, mat) { #' sv = svd(t(mat))$v[, 1] #' dend = reorder(as.dendrogram(hc), wts = sv) #' as.hclust(dend) #' } #' #' pheatmap(test, clusteringCallback = callback) #' @importFrom grid grid.pretty #' @importFrom RColorBrewer brewer.pal semiPheatmap <- function(mat, color = colorRampPalette( rev(brewer.pal(n = 7, name = "RdYlBu")))(100), kmeansK = NA, breaks = NA, borderColor = "grey60", cellWidth = NA, cellHeight = NA, scale = "none", clusterRows = TRUE, clusterCols = TRUE, clusteringDistanceRows = "euclidean", clusteringDistanceCols = "euclidean", clusteringMethod = "complete", clusteringCallback = .identity2, cutreeRows = NA, cutreeCols = NA, treeHeightRow = ifelse(clusterRows, 50, 0), treeHeightCol = ifelse(clusterCols, 50, 0), legend = TRUE, legendBreaks = NA, legendLabels = NA, annotationRow = NA, annotationCol = NA, annotation = NA, annotationColors = NA, annotationLegend = TRUE, annotationNamesRow = TRUE, annotationNamesCol = TRUE, dropLevels = TRUE, showRownames = TRUE, showColnames = TRUE, main = NA, fontSize = 10, fontSizeRow = fontSize, fontSizeCol = fontSize, displayNumbers = FALSE, numberFormat = "%.2f", numberColor = "grey30", fontSizeNumber = 0.8 * fontSize, gapsRow = NULL, gapsCol = NULL, labelsRow = NULL, labelsCol = NULL, fileName = NA, width = NA, height = NA, silent = FALSE, rowLabel, colLabel, rowGroupOrder = NULL, colGroupOrder = NULL, ...) { # Set labels if (is.null(labelsRow) & !is.null(rownames(mat))) { labelsRow <- rownames(mat) } if (is.null(labelsRow) & is.null(rownames(mat))) { labelsRow <- seq(nrow(mat)) rownames(mat) <- seq(nrow(mat)) } if (is.null(labelsCol) & !is.null(colnames(mat))) { labelsCol <- colnames(mat) } if (is.null(labelsCol) & is.null(colnames(mat))) { labelsCol <- seq(ncol(mat)) colnames(mat) <- seq(ncol(mat)) } if (.is.na2(breaks)) { breaks <- .generateBreaks(mat, length(color), center = TRUE) } # Kmeans if (!is.na(kmeansK)) { # Cluster data km <- stats::kmeans(mat, kmeansK, iter.max = 100) mat <- km$centers # Compose rownames t <- table(km$cluster) labelsRow <- sprintf("Cluster: %s Size: %d", names(t), t) } else { km <- NA } # Format numbers to be displayed in cells if (is.matrix(displayNumbers) | is.data.frame(displayNumbers)) { if (nrow(displayNumbers) != nrow(mat) | ncol(displayNumbers) != ncol(mat)) { stop("If displayNumbers provided as matrix, its dimensions have to match with mat") } displayNumbers <- as.matrix(displayNumbers) fmat <- matrix(as.character(displayNumbers), nrow = nrow(displayNumbers), ncol = ncol(displayNumbers) ) fmatDraw <- TRUE } else { if (displayNumbers) { fmat <- matrix(sprintf(numberFormat, mat), nrow = nrow(mat), ncol = ncol(mat) ) fmatDraw <- TRUE } else { fmat <- matrix(NA, nrow = nrow(mat), ncol = ncol(mat)) fmatDraw <- FALSE } } # Do clustering for rows if (isTRUE(clusterRows)) { if (is.null(rowLabel)) { rowLabel <- rep(1, nrow(mat)) } else { # o <- order(rowLabel) o <- .Order(labels = rowLabel, groupOrder = rowGroupOrder) mat <- mat[o, , drop = FALSE] fmat <- fmat[o, , drop = FALSE] rowLabel <- rowLabel[o] if (!is.null(annotationRow) && !is.null(ncol(annotationRow))) { annotationRow <- annotationRow[o, , drop = FALSE] } } treeRow <- .clusterMat(mat, rowLabel, distance = clusteringDistanceRows, method = clusteringMethod ) treeRow <- clusteringCallback(treeRow, mat) mat <- mat[treeRow$order, , drop = FALSE] fmat <- fmat[treeRow$order, , drop = FALSE] labelsRow <- labelsRow[treeRow$order] if (!is.null(annotationRow) && !is.null(ncol(annotationRow))) { annotationRow <- annotationRow[treeRow$order, , drop = FALSE] } if (!is.na(cutreeRows)) { gapsRow <- .findGaps(treeRow, cutreeRows) } else { gapsRow <- NULL } } else { treeRow <- NA treeHeightRow <- 0 } ## Do clustering for columns if (isTRUE(clusterCols)) { if (is.null(colLabel)) { colLabel <- rep(1, ncol(mat)) } else { # o <- order(colLabel) o <- .Order(labels = colLabel, groupOrder = colGroupOrder) mat <- mat[, o, drop = FALSE] fmat <- fmat[, o, drop = FALSE] colLabel <- colLabel[o] if (!is.null(annotationCol) && !is.null(ncol(annotationCol))) { annotationCol <- annotationCol[o, , drop = FALSE] } } treeCol <- .clusterMat(t(mat), colLabel, distance = clusteringDistanceCols, method = clusteringMethod ) treeCol <- clusteringCallback(treeCol, t(mat)) mat <- mat[, treeCol$order, drop = FALSE] fmat <- fmat[, treeCol$order, drop = FALSE] labelsCol <- labelsCol[treeCol$order] if (!is.null(annotationCol) && !is.null(ncol(annotationCol))) { annotationCol <- annotationCol[treeCol$order, , drop = FALSE] } if (!is.na(cutreeCols)) { gapsCol <- .findGaps(treeCol, cutreeCols) } else { gapsCol <- NULL } } else { treeCol <- NA treeHeightCol <- 0 } attr(fmat, "draw") <- fmatDraw # Colors and scales if (!.is.na2(legendBreaks) & !.is.na2(legendLabels)) { if (length(legendBreaks) != length(legendLabels)) { stop("Lengths of legendBreaks and legendLabels must be the same") } } if (.is.na2(breaks)) { breaks <- .generateBreaks(as.vector(mat), length(color)) } if (legend & .is.na2(legendBreaks)) { legend <- grid::grid.pretty(range(as.vector(breaks))) names(legend) <- legend } else if (legend & !.is.na2(legendBreaks)) { legend <- legendBreaks[legendBreaks >= min(breaks) & legendBreaks <= max(breaks)] if (!.is.na2(legendLabels)) { legendLabels <- legendLabels[legendBreaks >= min(breaks) & legendBreaks <= max(breaks)] names(legend) <- legendLabels } else { names(legend) <- legend } } else { legend <- NA } mat <- .scaleColours(mat, col = color, breaks = breaks) annotation <- c(annotationRow, annotationCol) annotation <- annotation[unlist(lapply( annotation, function(x) !.is.na2(x) ))] if (length(annotation) != 0) { annotationColors <- .generateAnnotationColours(annotation, annotationColors, drop = dropLevels ) } else { annotationColors <- NA } labelsRow <- rownames(mat) labelsCol <- colnames(mat) if (!showRownames) { labelsRow <- NULL } if (!showColnames) { labelsCol <- NULL } # Draw heatmap gt <- .heatmapMotor(mat, borderColor = borderColor, cellWidth = cellWidth, cellHeight = cellHeight, treeHeightCol = treeHeightCol, treeHeightRow = treeHeightRow, treeCol = treeCol, treeRow = treeRow, fileName = fileName, width = width, height = height, breaks = breaks, color = color, legend = legend, annotationRow = annotationRow, annotationCol = annotationCol, annotationColors = annotationColors, annotationLegend = annotationLegend, annotationNamesRow = annotationNamesRow, annotationNamesCol = annotationNamesCol, main = main, fontSize = fontSize, fontSizeRow = fontSizeRow, fontSizeCol = fontSizeCol, fmat = fmat, fontSizeNumber = fontSizeNumber, numberColor = numberColor, gapsRow = gapsRow, gapsCol = gapsCol, labelsRow = labelsRow, labelsCol = labelsCol, ... ) return(gt) } # order function that order the row/column labels # based on the order of the group priority # return value is a vector of the ordered index # labels is a vector of any non-zero length # groupOrder, a column named dataframe/matrix # with the "groupName" column storing the group # name and the "groupIndex" storing the group priority .Order <- function(labels, groupOrder = NULL) { if (is.null(groupOrder)) { return(order(labels)) } else { # Throw error is length(unique(labels)) != nrow(groupOrder) olabels <- plyr::mapvalues( x = labels, from = groupOrder[, "groupName"], to = groupOrder[, "groupIndex"] ) # Make sure the olabels is integer for order() function olabels <- as.integer(olabels) return(order(olabels)) } } ================================================ FILE: R/simulateCells.R ================================================ #' @title Simulate count data from the celda generative models. #' @description This function generates a \linkS4class{SingleCellExperiment} #' containing a simulated counts matrix in the \code{"counts"} assay slot, as #' well as various parameters used in the simulation which can be #' useful for running celda and are stored in \code{metadata} slot. The user #' must provide the desired model (one of celda_C, celda_G, celda_CG) as well #' as any desired tuning parameters for those model's simulation functions #' as detailed below. #' @param model Character. Options available in \code{celda::availableModels}. #' Can be one of \code{"celda_CG"}, \code{"celda_C"}, or \code{"celda_G"}. #' Default \code{"celda_CG"}. #' @param S Integer. Number of samples to simulate. Default 5. Only used if #' \code{model} is one of \code{"celda_CG"} or \code{"celda_C"}. #' @param CRange Integer vector. A vector of length 2 that specifies the lower #' and upper bounds of the number of cells to be generated in each sample. #' Default c(50, 100). Only used if #' \code{model} is one of \code{"celda_CG"} or \code{"celda_C"}. #' @param NRange Integer vector. A vector of length 2 that specifies the lower #' and upper bounds of the number of counts generated for each cell. Default #' c(500, 1000). #' @param C Integer. Number of cells to simulate. Default 100. Only used if #' \code{model} is \code{"celda_G"}. #' @param G Integer. The total number of features to be simulated. Default 100. #' @param K Integer. Number of cell populations. Default 5. Only used if #' \code{model} is one of \code{"celda_CG"} or \code{"celda_C"}. #' @param L Integer. Number of feature modules. Default 10. Only used if #' \code{model} is one of \code{"celda_CG"} or \code{"celda_G"}. #' @param alpha Numeric. Concentration parameter for Theta. Adds a pseudocount #' to each cell population in each sample. Default 1. Only used if #' \code{model} is one of \code{"celda_CG"} or \code{"celda_C"}. #' @param beta Numeric. Concentration parameter for Phi. Adds a pseudocount to #' each feature module in each cell population. Default 1. #' @param gamma Numeric. Concentration parameter for Eta. Adds a pseudocount to #' the number of features in each module. Default 5. Only used if #' \code{model} is one of \code{"celda_CG"} or \code{"celda_G"}. #' @param delta Numeric. Concentration parameter for Psi. Adds a pseudocount to #' each feature in each module. Default 1. Only used if #' \code{model} is one of \code{"celda_CG"} or \code{"celda_G"}. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @return A \link[SingleCellExperiment]{SingleCellExperiment} object with #' simulated count matrix stored in the "counts" assay slot. Function #' parameter settings are stored in the \link{metadata} slot. For #' \code{"celda_CG"} and \code{"celda_C"} models, #' columns \code{celda_sample_label} and \code{celda_cell_cluster} in #' \link{colData} contain simulated sample labels and #' cell population clusters. For \code{"celda_CG"} and \code{"celda_G"} #' models, column \code{celda_feature_module} in #' \link{rowData} contains simulated gene modules. #' @examples #' sce <- simulateCells() #' @export simulateCells <- function( model = c("celda_CG", "celda_C", "celda_G"), S = 5, CRange = c(50, 100), NRange = c(500, 1000), C = 100, G = 100, K = 5, L = 10, alpha = 1, beta = 1, gamma = 5, delta = 1, seed = 12345) { model <- match.arg(model) if (model == "celda_C") { sce <- .simulateCellsMaincelda_C(model = model, S = S, CRange = CRange, NRange = NRange, G = G, K = K, alpha = alpha, beta = beta, seed = seed) } else if (model == "celda_CG") { sce <- .simulateCellsMaincelda_CG( model = model, S = S, CRange = CRange, NRange = NRange, G = G, K = K, L = L, alpha = alpha, beta = beta, gamma = gamma, delta = delta, seed = seed) } else if (model == "celda_G") { sce <- .simulateCellsMaincelda_G( model = model, C = C, L = L, NRange = NRange, G = G, beta = beta, delta = delta, gamma = gamma, seed = seed) } else { stop("'model' must be one of 'celda_C', 'celda_G', or 'celda_CG'") } return(sce) } .simulateCellsMaincelda_C <- function(model, S, CRange, NRange, G, K, alpha, beta, seed) { if (is.null(seed)) { res <- .simulateCellscelda_C(model = model, S = S, CRange = CRange, NRange = NRange, G = G, K = K, alpha = alpha, beta = beta) } else { res <- with_seed(seed, .simulateCellscelda_C(model = model, S = S, CRange = CRange, NRange = NRange, G = G, K = K, alpha = alpha, beta = beta)) } sce <- .createSCEsimulateCellsCeldaC(res, seed) return(sce) } .simulateCellscelda_C <- function(model, S = 5, CRange = c(50, 100), NRange = c(500, 1000), G = 100, K = 5, alpha = 1, beta = 1) { phi <- .rdirichlet(K, rep(beta, G)) theta <- .rdirichlet(S, rep(alpha, K)) ## Select the number of cells per sample nC <- sample(seq(CRange[1], CRange[2]), size = S, replace = TRUE) cellSampleLabel <- rep(seq(S), nC) ## Select state of the cells z <- unlist(lapply(seq(S), function(i) { sample(seq(K), size = nC[i], prob = theta[i, ], replace = TRUE) })) ## Select number of transcripts per cell nN <- sample(seq(NRange[1], NRange[2]), size = length(cellSampleLabel), replace = TRUE) ## Select transcript distribution for each cell cellCounts <- vapply(seq(length(cellSampleLabel)), function(i) { stats::rmultinom(1, size = nN[i], prob = phi[z[i], ]) }, integer(G)) rownames(cellCounts) <- paste0("Gene_", seq(nrow(cellCounts))) colnames(cellCounts) <- paste0("Cell_", seq(ncol(cellCounts))) cellSampleLabel <- paste0("Sample_", seq(S))[cellSampleLabel] cellSampleLabel <- factor(cellSampleLabel, levels = paste0("Sample_", seq(S))) ## Peform reordering on final Z and Y assigments: cellCounts <- .processCounts(cellCounts) names <- list(row = rownames(cellCounts), column = colnames(cellCounts), sample = unique(cellSampleLabel)) countChecksum <- .createCountChecksum(cellCounts) result <- methods::new("celda_C", clusters = list(z = z), params = list(K = as.integer(K), alpha = alpha, beta = beta, countChecksum = countChecksum), sampleLabel = cellSampleLabel, names = names) class(result) <- "celda_C" result <- .reorderCeldaC(counts = cellCounts, res = result) return(list(z = celdaClusters(result)$z, counts = cellCounts, sampleLabel = cellSampleLabel, G = G, K = K, alpha = alpha, beta = beta, CRange = CRange, NRange = NRange, S = S)) } .createSCEsimulateCellsCeldaC <- function(simList, seed) { sce <- SingleCellExperiment::SingleCellExperiment( assays = list(counts = simList$counts)) # add metadata S4Vectors::metadata(sce)[["celda_simulateCellscelda_C"]] <- list( model = "celda_C", sampleLevels = as.character(unique(simList$sampleLabel)), cellClusterLevels = sort(unique(simList$z)), S = simList$S, CRange = simList$CRange, NRange = simList$NRange, G = simList$G, K = simList$K, alpha = simList$alpha, beta = simList$beta, seed = seed) SummarizedExperiment::rowData(sce)["rownames"] <- rownames(simList$counts) SummarizedExperiment::colData(sce)["colnames"] <- colnames(simList$counts) SummarizedExperiment::colData(sce)["celda_sample_label"] <- as.factor(simList$sampleLabel) SummarizedExperiment::colData(sce)["celda_cell_cluster"] <- as.factor(simList$z) return(sce) } .simulateCellsMaincelda_CG <- function(model, S, CRange, NRange, G, K, L, alpha, beta, gamma, delta, seed) { if (is.null(seed)) { res <- .simulateCellscelda_CG( model = model, S = S, CRange = CRange, NRange = NRange, G = G, K = K, L = L, alpha = alpha, beta = beta, gamma = gamma, delta = delta) } else { with_seed( seed, res <- .simulateCellscelda_CG( model = model, S = S, CRange = CRange, NRange = NRange, G = G, K = K, L = L, alpha = alpha, beta = beta, gamma = gamma, delta = delta ) ) } sce <- .createSCEsimulateCellsCeldaCG(res, seed) return(sce) } .simulateCellscelda_CG <- function(model = model, S = S, CRange = CRange, NRange = NRange, G = G, K = K, L = L, alpha = alpha, beta = beta, gamma = gamma, delta = delta) { ## Number of cells per sample nC <- sample(seq(CRange[1], CRange[2]), size = S, replace = TRUE) nCSum <- sum(nC) cellSampleLabel <- rep(seq(S), nC) ## Select number of transcripts per cell nN <- sample(seq(NRange[1], NRange[2]), size = length(cellSampleLabel), replace = TRUE ) ## Generate cell population distribution for each sample theta <- t(.rdirichlet(S, rep(alpha, K))) ## Assign cells to cellular subpopulations z <- unlist(lapply(seq(S), function(i) { sample(seq(K), size = nC[i], prob = theta[, i], replace = TRUE ) })) ## Generate transcriptional state distribution for each cell subpopulation phi <- .rdirichlet(K, rep(beta, L)) ## Assign genes to gene modules eta <- .rdirichlet(1, rep(gamma, L)) y <- sample(seq(L), size = G, prob = eta, replace = TRUE ) if (length(table(y)) < L) { warning( "Some gene modules did not receive any genes after sampling.", " Try increasing G and/or making gamma larger." ) L <- length(table(y)) y <- as.integer(as.factor(y)) } psi <- matrix(0, nrow = G, ncol = L) for (i in seq(L)) { ind <- y == i psi[ind, i] <- .rdirichlet(1, rep(delta, sum(ind))) } ## Select transcript distribution for each cell cellCounts <- matrix(0, nrow = G, ncol = nCSum) for (i in seq(nCSum)) { transcriptionalStateDist <- as.integer(stats::rmultinom(1, size = nN[i], prob = phi[z[i], ] )) for (j in seq(L)) { if (transcriptionalStateDist[j] > 0) { cellCounts[, i] <- cellCounts[, i] + stats::rmultinom(1, size = transcriptionalStateDist[j], prob = psi[, j] ) } } } ## Ensure that there are no all-0 rows in the counts matrix, which violates ## a celda modeling ## constraint (columns are guarnteed at least one count): zeroRowIdx <- which(rowSums(cellCounts) == 0) if (length(zeroRowIdx > 0)) { cellCounts <- cellCounts[-zeroRowIdx, ] y <- y[-zeroRowIdx] } ## Assign gene/cell/sample names rownames(cellCounts) <- paste0("Gene_", seq(nrow(cellCounts))) colnames(cellCounts) <- paste0("Cell_", seq(ncol(cellCounts))) cellSampleLabel <- paste0("Sample_", seq(S))[cellSampleLabel] cellSampleLabel <- factor(cellSampleLabel, levels = paste0("Sample_", seq(S)) ) ## Peform reordering on final Z and Y assigments: cellCounts <- .processCounts(cellCounts) names <- list( row = rownames(cellCounts), column = colnames(cellCounts), sample = unique(cellSampleLabel) ) countChecksum <- .createCountChecksum(cellCounts) result <- methods::new("celda_CG", clusters = list(z = z, y = y), params = list( K = as.integer(K), L = as.integer(L), alpha = alpha, beta = beta, delta = delta, gamma = gamma, countChecksum = countChecksum ), sampleLabel = cellSampleLabel, names = names ) result <- .reorderCeldaCG(counts = cellCounts, res = result) return(list( z = celdaClusters(result)$z, y = celdaClusters(result)$y, counts = cellCounts, sampleLabel = cellSampleLabel, G = G, K = K, L = L, CRange = CRange, NRange = NRange, S = S, alpha = alpha, beta = beta, gamma = gamma, delta = delta )) } .createSCEsimulateCellsCeldaCG <- function(simList, seed) { sce <- SingleCellExperiment::SingleCellExperiment( assays = list(counts = simList$counts)) # add metadata S4Vectors::metadata(sce)[["celda_simulateCellscelda_CG"]] <- list( model = "celda_CG", sampleLevels = as.character(unique(simList$sampleLabel)), cellClusterLevels = sort(unique(simList$z)), featureModuleLevels = sort(unique(simList$y)), S = simList$S, CRange = simList$CRange, NRange = simList$NRange, G = simList$G, K = simList$K, L = simList$L, alpha = simList$alpha, beta = simList$beta, gamma = simList$gamma, delta = simList$delta, seed = seed) SummarizedExperiment::rowData(sce)["rownames"] <- rownames(simList$counts) SummarizedExperiment::colData(sce)["colnames"] <- colnames(simList$counts) SummarizedExperiment::colData(sce)["celda_sample_label"] <- as.factor(simList$sampleLabel) SummarizedExperiment::colData(sce)["celda_cell_cluster"] <- as.factor(simList$z) SummarizedExperiment::rowData(sce)["celda_feature_module"] <- as.factor(simList$y) return(sce) } .simulateCellsMaincelda_G <- function(model, C, L, NRange, G, beta, delta, gamma, seed) { if (is.null(seed)) { res <- .simulateCellscelda_G( model = model, C = C, L = L, NRange = NRange, G = G, beta = beta, delta = delta, gamma = gamma) } else { with_seed( seed, res <- .simulateCellscelda_G( model = model, C = C, L = L, NRange = NRange, G = G, beta = beta, delta = delta, gamma = gamma) ) } sce <- .createSCEsimulateCellsCeldaG(res, seed) return(sce) } .simulateCellscelda_G <- function(model, C = 100, L = 10, NRange = c(500, 1000), G = 100, beta = 1, gamma = 5, delta = 1, ...) { eta <- .rdirichlet(1, rep(gamma, L)) y <- sample(seq(L), size = G, prob = eta, replace = TRUE ) if (length(table(y)) < L) { stop( "Some states did not receive any features after sampling. Try", " increasing G and/or setting gamma > 1." ) } psi <- matrix(0, nrow = G, ncol = L) for (i in seq(L)) { ind <- y == i psi[ind, i] <- .rdirichlet(1, rep(delta, sum(ind))) } phi <- .rdirichlet(C, rep(beta, L)) ## Select number of transcripts per cell nN <- sample(seq(NRange[1], NRange[2]), size = C, replace = TRUE) ## Select transcript distribution for each cell cellCounts <- matrix(0, nrow = G, ncol = C) for (i in seq(C)) { cellDist <- stats::rmultinom(1, size = nN[i], prob = phi[i, ]) for (j in seq(L)) { cellCounts[, i] <- cellCounts[, i] + stats::rmultinom(1, size = cellDist[j], prob = psi[, j] ) } } ## Ensure that there are no all-0 rows in the counts matrix, which violates ## a celda modeling ## constraint (columns are guarnteed at least one count): zeroRowIdx <- which(rowSums(cellCounts) == 0) if (length(zeroRowIdx > 0)) { cellCounts <- cellCounts[-zeroRowIdx, ] y <- y[-zeroRowIdx] } rownames(cellCounts) <- paste0("Gene_", seq(nrow(cellCounts))) colnames(cellCounts) <- paste0("Cell_", seq(ncol(cellCounts))) ## Peform reordering on final Z and Y assigments: cellCounts <- .processCounts(cellCounts) names <- list( row = rownames(cellCounts), column = colnames(cellCounts) ) countChecksum <- .createCountChecksum(cellCounts) result <- methods::new("celda_G", clusters = list(y = y), params = list( L = as.integer(L), beta = beta, delta = delta, gamma = gamma, countChecksum = countChecksum ), names = names ) result <- .reorderCeldaG(counts = cellCounts, res = result) return(list( y = celdaClusters(result)$y, counts = cellCounts, C = C, G = G, L = L, NRange = NRange, beta = beta, delta = delta, gamma = gamma )) } .createSCEsimulateCellsCeldaG <- function(simList, seed) { sce <- SingleCellExperiment::SingleCellExperiment( assays = list(counts = simList$counts)) # add metadata S4Vectors::metadata(sce)[["celda_simulateCellscelda_G"]] <- list( model = "celda_G", featureModuleLevels = sort(unique(simList$y)), NRange = simList$NRange, C = simList$C, G = simList$G, L = simList$L, beta = simList$beta, gamma = simList$gamma, delta = simList$delta, seed = seed) SummarizedExperiment::rowData(sce)["rownames"] <- rownames(simList$counts) SummarizedExperiment::colData(sce)["colnames"] <- colnames(simList$counts) SummarizedExperiment::rowData(sce)["celda_feature_module"] <- as.factor(simList$y) return(sce) } ================================================ FILE: R/splitModule.R ================================================ #' @title Split celda feature module #' @description Manually select a celda feature module to split into 2 or #' more modules. Useful for splitting up modules that show divergent #' expression of features in multiple cell clusters. #' @param x A \linkS4class{SingleCellExperiment} object #' with the matrix located in the assay slot under \code{useAssay}. #' Rows represent features and columns represent cells. #' @param module Integer. The module to be split. #' @param useAssay A string specifying which \link{assay} #' slot to use for \code{x}. Default "counts". #' @param altExpName The name for the \link{altExp} slot #' to use. Default \code{"featureSubset"}. #' @param n Integer. How many modules should \code{module} be split into. #' Default \code{2}. #' @param seed Integer. Passed to \link[withr]{with_seed}. For reproducibility, #' a default value of 12345 is used. If NULL, no calls to #' \link[withr]{with_seed} are made. #' @return A updated \linkS4class{SingleCellExperiment} object with new #' feature modules stored in column \code{celda_feature_module} in #' \code{\link{rowData}(x)}. #' @export setGeneric("splitModule", function(x, module, useAssay = "counts", altExpName = "featureSubset", n = 2, seed = 12345) { standardGeneric("splitModule") }) #' @rdname splitModule #' @examples #' data(sceCeldaCG) #' # Split module 5 into 2 new modules. #' sce <- splitModule(sceCeldaCG, module = 5) #' @export setMethod("splitModule", signature(x = "SingleCellExperiment"), function(x, module, useAssay = "counts", altExpName = "featureSubset", n = 2, seed = 12345) { altExp <- SingleCellExperiment::altExp(x, altExpName) if (!module %in% celdaModules(x, altExpName = altExpName)) { stop("Module ", module, " is not found in celdaModules(x,", " altExpName = altExpName).", " Please specify a valid module.") } celdaGMod <- .splitModuleWithSeed(x = altExp, useAssay = useAssay, module = module, n = n, seed = seed) S4Vectors::metadata(altExp)[["celda_parameters"]]$L <- params(celdaGMod)$L S4Vectors::metadata(altExp)[["celda_parameters"]]$finalLogLik <- celdaGMod@finalLogLik S4Vectors::metadata(altExp)[["celda_parameters"]]$featureModuleLevels <- sort(unique(celdaClusters(celdaGMod)$y)) SummarizedExperiment::rowData(altExp)["celda_feature_module"] <- as.factor(celdaClusters(celdaGMod)$y) SingleCellExperiment::altExp(x, altExpName) <- altExp return(x) } ) .splitModuleWithSeed <- function(x, useAssay, module, n, seed) { if (is.null(seed)) { celdaGMod <- .splitModule(x, useAssay, module, n) } else { with_seed(seed, celdaGMod <- .splitModule(x, useAssay, module, n)) } return(celdaGMod) } .splitModule <- function(x, useAssay, module, n) { counts <- SummarizedExperiment::assay(x, i = useAssay) counts <- .processCounts(counts) .validateCounts(counts) L <- S4Vectors::metadata(x)$celda_parameters$L y <- as.numeric(SummarizedExperiment::rowData(x)$celda_feature_module) ix <- y == module if (sum(ix) < n) { stop("Module ", module, " contains less than ", n, " features. ", "Module splitting was not performed.") } tempModel <- .celda_G( counts = counts[ix, , drop = FALSE], L = n, yInitialize = "random", splitOnIter = -1, splitOnLast = FALSE, nchains = 1, verbose = FALSE ) # Need to set some of the features to the original module number. # The remaining features need to have "L + something" as they represent # a new module. Note that there may be more than 1 new module. splitY <- as.numeric(as.character(celdaClusters(tempModel)$y)) splitIx <- splitY > 1 splitY[splitIx] <- L + splitY[splitIx] - 1 splitY[!splitIx] <- module # Set up new y and L newY <- y newY[ix] <- splitY newL <- max(newY) newLl <- .logLikelihoodcelda_G( counts = counts, y = newY, L = newL, beta = S4Vectors::metadata(x)$celda_parameters$beta, delta = S4Vectors::metadata(x)$celda_parameters$delta, gamma = S4Vectors::metadata(x)$celda_parameters$gamma ) model <- methods::new( "celda_G", clusters = list(y = factor(newY, seq(newL))), params = list( L = newL, beta = S4Vectors::metadata(x)$celda_parameters$beta, delta = S4Vectors::metadata(x)$celda_parameters$delta, gamma = S4Vectors::metadata(x)$celda_parameters$gamma, countChecksum = .createCountChecksum(counts) ), names = list( row = rownames(x), column = colnames(x), sample = x@metadata$celda_parameters$sampleLevels ), finalLogLik = newLl ) return(model) } ================================================ FILE: R/split_clusters.R ================================================ # .cCCalcLL = function(mCPByS, nGByCP, s, z, K, nS, nG, alpha, beta) .cCSplitZ <- function(counts, mCPByS, nGByCP, nCP, s, z, K, nS, nG, alpha, beta, zProb, maxClustersToTry = 10, minCell = 3) { ## Identify clusters to split zTa <- tabulate(z, K) zToSplit <- which(zTa >= minCell) zNonEmpty <- which(zTa > 0) if (length(zToSplit) == 0) { m <- paste0( date(), " .... Cluster sizes too small. No additional splitting was", " performed." ) return(list( z = z, mCPByS, nGByCP, nCP = nCP, message = m )) } ## Loop through each split-able Z and perform split clustSplit <- vector("list", K) for (i in zToSplit) { clustLabel <- .celda_C( counts[, z == i], K = 2, zInitialize = "random", maxIter = 5, splitOnIter = -1, splitOnLast = FALSE, verbose = FALSE, reorder = FALSE ) clustSplit[[i]] <- as.integer(celdaClusters(clustLabel)$z) } ## Find second best assignment give current assignments for each cell zProb[cbind(seq(nrow(zProb)), z)] <- NA zSecond <- apply(zProb, 1, which.max) ## Set up initial variables zSplit <- matrix(NA, nrow = length(z), ncol = length(zToSplit) * maxClustersToTry ) zSplitLl <- rep(NA, times = length(zToSplit) * maxClustersToTry) zSplitLl[1] <- .cCCalcLL(mCPByS, nGByCP, s, z, K, nS, nG, alpha, beta) zSplit[, 1] <- z ## Select worst clusters to test for reshuffling previousZ <- z llShuffle <- rep(NA, K) for (i in zNonEmpty) { ix <- z == i newZ <- z newZ[ix] <- zSecond[ix] p <- .cCReDecomposeCounts(counts, s, newZ, previousZ, nGByCP, K) nGByCP <- p$nGByCP mCPByS <- p$mCPByS llShuffle[i] <- .cCCalcLL(mCPByS, nGByCP, s, z, K, nS, nG, alpha, beta) previousZ <- newZ } zToShuffle <- utils::head(order(llShuffle, decreasing = TRUE, na.last = NA), n = maxClustersToTry ) pairs <- c(NA, NA) splitIx <- 2 for (i in zToShuffle) { otherClusters <- setdiff(zToSplit, i) for (j in otherClusters) { newZ <- z ## Assign cluster i to the next most similar cluster (excluding ## cluster j) ## as defined above by the correlation ixToMove <- z == i newZ[ixToMove] <- zSecond[ixToMove] ## Split cluster j according to the clustering defined above ixToSplit <- z == j newZ[ixToSplit] <- ifelse(clustSplit[[j]] == 1, j, i) p <- .cCReDecomposeCounts(counts, s, newZ, previousZ, nGByCP, K) nGByCP <- p$nGByCP mCPByS <- p$mCPByS ## Calculate likelihood of split zSplitLl[splitIx] <- .cCCalcLL( mCPByS, nGByCP, s, z, K, nS, nG, alpha, beta ) zSplit[, splitIx] <- newZ splitIx <- splitIx + 1L previousZ <- newZ pairs <- rbind(pairs, c(i, j)) } } select <- which.max(zSplitLl) if (select == 1) { m <- paste0(date(), " .... No additional splitting was performed.") } else { m <- paste0( date(), " .... Cluster ", pairs[select, 1], " was reassigned and cluster ", pairs[select, 2], " was split in two." ) } p <- .cCReDecomposeCounts(counts, s, zSplit[, select], previousZ, nGByCP, K) return(list( z = zSplit[, select], mCPByS = p$mCPByS, nGByCP = p$nGByCP, nCP = p$nCP, message = m )) } # .cCGCalcLL = function(K, L, mCPByS, nTSByCP, nByG, nByTS, nGByTS, # nS, nG, alpha, beta, delta, gamma) .cCGSplitZ <- function(counts, mCPByS, nTSByC, nTSByCP, nByG, nByTS, nGByTS, nCP, s, z, K, L, nS, nG, alpha, beta, delta, gamma, zProb, maxClustersToTry = 10, minCell = 3) { ## Identify clusters to split zTa <- tabulate(z, K) zToSplit <- which(zTa >= minCell) zNonEmpty <- which(zTa > 0) if (length(zToSplit) == 0) { m <- paste0( date(), " .... Cluster sizes too small. No additional splitting was", " performed." ) return(list( z = z, mCPByS = mCPByS, nTSByCP = nTSByCP, nCP = nCP, message = m )) } ## Loop through each split-able Z and perform split clustSplit <- vector("list", K) for (i in zToSplit) { clustLabel <- .celda_C(counts[, z == i], K = 2, zInitialize = "random", maxIter = 5, splitOnIter = -1, splitOnLast = FALSE, verbose = FALSE, reorder = FALSE ) clustSplit[[i]] <- as.integer(celdaClusters(clustLabel)$z) } ## Find second best assignment give current assignments for each cell zProb[cbind(seq(nrow(zProb)), z)] <- NA zSecond <- apply(zProb, 1, which.max) ## Set up initial variables zSplit <- matrix(NA, nrow = length(z), ncol = length(zToSplit) * maxClustersToTry ) zSplitLl <- rep(NA, ncol = length(zToSplit) * maxClustersToTry) zSplitLl[1] <- .cCGCalcLL( K, L, mCPByS, nTSByCP, nByG, nByTS, nGByTS, nS, nG, alpha, beta, delta, gamma ) zSplit[, 1] <- z ## Select worst clusters to test for reshuffling previousZ <- z llShuffle <- rep(NA, K) for (i in zNonEmpty) { ix <- z == i newZ <- z newZ[ix] <- zSecond[ix] p <- .cCReDecomposeCounts(nTSByC, s, newZ, previousZ, nTSByCP, K) nTSByCP <- p$nGByCP mCPByS <- p$mCPByS llShuffle[i] <- .cCGCalcLL( K, L, mCPByS, nTSByCP, nByG, nByTS, nGByTS, nS, nG, alpha, beta, delta, gamma ) previousZ <- newZ } zToShuffle <- utils::head(order(llShuffle, decreasing = TRUE, na.last = NA), n = maxClustersToTry ) pairs <- c(NA, NA) splitIx <- 2 for (i in zToShuffle) { otherClusters <- setdiff(zToSplit, i) for (j in otherClusters) { newZ <- z ## Assign cluster i to the next most similar cluster (excluding ## cluster j) ## as defined above by the correlation ixToMove <- z == i newZ[ixToMove] <- zSecond[ixToMove] ## Split cluster j according to the clustering defined above ixToSplit <- z == j newZ[ixToSplit] <- ifelse(clustSplit[[j]] == 1, j, i) p <- .cCReDecomposeCounts(nTSByC, s, newZ, previousZ, nTSByCP, K) nTSByCP <- p$nGByCP mCPByS <- p$mCPByS ## Calculate likelihood of split zSplitLl[splitIx] <- .cCGCalcLL( K, L, mCPByS, nTSByCP, nByG, nByTS, nGByTS, nS, nG, alpha, beta, delta, gamma ) zSplit[, splitIx] <- newZ splitIx <- splitIx + 1L previousZ <- newZ pairs <- rbind(pairs, c(i, j)) } } select <- which.max(zSplitLl) if (select == 1) { m <- paste0(date(), " .... No additional splitting was performed.") } else { m <- paste0( date(), " .... Cluster ", pairs[select, 1], " was reassigned and cluster ", pairs[select, 2], " was split in two." ) } p <- .cCReDecomposeCounts( nTSByC, s, zSplit[, select], previousZ, nTSByCP, K ) return(list( z = zSplit[, select], mCPByS = p$mCPByS, nTSByCP = p$nGByCP, nCP = p$nCP, message = m )) } .cCGSplitY <- function(counts, y, mCPByS, nGByCP, nTSByC, nTSByCP, nByG, nByTS, nGByTS, nCP, s, z, K, L, nS, nG, alpha, beta, delta, gamma, yProb, maxClustersToTry = 10, KSubclusters = 10, minCell = 3) { ######################### ## First, the cell dimension of the original matrix will be reduced by ## splitting each z cluster into 'KSubclusters'. ######################### ## This will not be as big as the original matrix (which can take a lot of ## time to process with large number of cells), but not as small as the ## 'nGByCP' with current z assignments zTa <- tabulate(z, K) zNonEmpty <- which(zTa > 0) tempZ <- rep(0, length(z)) currentTopZ <- 0 for (i in zNonEmpty) { ix <- z == i if (zTa[i] <= KSubclusters) { tempZ[ix] <- seq(currentTopZ + 1, currentTopZ + zTa[i]) } else { clustLabel <- .celda_C(counts[, z == i], K = KSubclusters, zInitialize = "random", maxIter = 5, splitOnIter = -1, splitOnLast = FALSE, verbose = FALSE, reorder = FALSE ) tempZ[ix] <- as.integer(celdaClusters(clustLabel)$z) + currentTopZ } currentTopZ <- max(tempZ, na.rm = TRUE) } ## Decompose counts according to new/temp z labels tempNGByCP <- .colSumByGroup(counts, group = tempZ, K = currentTopZ) ######################### ## Second, different y splits will be estimated and tested ######################### ## Identify clusters to split yTa <- tabulate(y, L) yToSplit <- which(yTa >= minCell) yNonEmpty <- which(yTa > 0) if (length(yToSplit) == 0) { m <- paste0( date(), " .... Cluster sizes too small. No additional splitting was", " performed." ) return(list( y = y, mCPByS = mCPByS, nTSByCP = nTSByCP, nCP = nCP, message = m )) } ## Loop through each split-able Z and perform split clustSplit <- vector("list", L) for (i in yToSplit) { clustLabel <- .celda_G(tempNGByCP[y == i, ], L = 2, yInitialize = "random", maxIter = 5, splitOnIter = -1, splitOnLast = FALSE, verbose = FALSE, reorder = FALSE ) clustSplit[[i]] <- as.integer(celdaClusters(clustLabel)$y) } ## Find second best assignment give current assignments for each cell yProb[cbind(seq(nrow(yProb)), y)] <- NA ySecond <- apply(yProb, 1, which.max) ## Set up initial variables ySplit <- matrix(NA, nrow = length(y), ncol = length(yToSplit) * maxClustersToTry ) ySplitLl <- rep(NA, ncol = length(yToSplit) * maxClustersToTry) ySplitLl[1] <- .cCGCalcLL( K, L, mCPByS, nTSByCP, nByG, nByTS, nGByTS, nS, nG, alpha, beta, delta, gamma ) ySplit[, 1] <- y ## Select worst clusters to test for reshuffling previousY <- y llShuffle <- rep(NA, L) for (i in yNonEmpty) { ix <- y == i newY <- y newY[ix] <- ySecond[ix] p <- .cGReDecomposeCounts(nGByCP, newY, previousY, nTSByCP, nByG, L) nTSByCP <- p$nTSByC nByTS <- p$nByTS nGByTS <- p$nGByTS llShuffle[i] <- .cCGCalcLL( K, L, mCPByS, nTSByCP, nByG, nByTS, nGByTS, nS, nG, alpha, beta, delta, gamma ) previousY <- newY } yToShuffle <- utils::head(order(llShuffle, decreasing = TRUE, na.last = NA), n = maxClustersToTry ) pairs <- c(NA, NA) splitIx <- 2 for (i in yToShuffle) { otherClusters <- setdiff(yToSplit, i) for (j in otherClusters) { newY <- y ## Assign cluster i to the next most similar cluster (excluding ## cluster j) ## as defined above by the correlation ixToMove <- y == i newY[ixToMove] <- ySecond[ixToMove] ## Split cluster j according to the clustering defined above ixToSplit <- y == j newY[ixToSplit] <- ifelse(clustSplit[[j]] == 1, j, i) p <- .cGReDecomposeCounts(nGByCP, newY, previousY, nTSByCP, nByG, L) nTSByCP <- p$nTSByC nByTS <- p$nByTS nGByTS <- p$nGByTS ## Calculate likelihood of split ySplitLl[splitIx] <- .cCGCalcLL( K, L, mCPByS, nTSByCP, nByG, nByTS, nGByTS, nS, nG, alpha, beta, delta, gamma ) ySplit[, splitIx] <- newY splitIx <- splitIx + 1L previousY <- newY pairs <- rbind(pairs, c(i, j)) } } select <- which.max(ySplitLl) if (select == 1) { m <- paste0(date(), " .... No additional splitting was performed.") } else { m <- paste0( date(), " .... Cluster ", pairs[select, 1], " was reassigned and cluster ", pairs[select, 2], " was split in two." ) } p <- .cGReDecomposeCounts( nGByCP, ySplit[, select], previousY, nTSByCP, nByG, L ) return(list( y = ySplit[, select], nTSByCP = p$nTSByC, nByTS = p$nByTS, nGByTS = p$nGByTS, message = m )) } # .cGCalcLL = function(nTSByC, nByTS, nByG, nGByTS, nM, nG, L, beta, delta, # gamma) { .cGSplitY <- function(counts, y, nTSByC, nByTS, nByG, nGByTS, nM, nG, L, beta, delta, gamma, yProb, minFeature = 3, maxClustersToTry = 10) { ## Identify clusters to split yTa <- table(factor(y, levels = seq(L))) yToSplit <- which(yTa >= minFeature) yNonEmpty <- which(yTa > 0) if (length(yToSplit) == 0) { m <- paste0( date(), " .... Cluster sizes too small. No additional splitting was", " performed." ) return(list( y = y, nTSByC = nTSByC, nByTS = nByTS, nGByTS = nGByTS, message = m )) } ## Loop through each split-able y and find best split clustSplit <- vector("list", L) for (i in yToSplit) { clustLabel <- .celda_G(counts[y == i, ], L = 2, yInitialize = "random", maxIter = 5, splitOnIter = -1, splitOnLast = FALSE, verbose = FALSE, reorder = FALSE ) clustSplit[[i]] <- as.integer(celdaClusters(clustLabel)$y) } ## Find second best assignment give current assignments for each cell yProb[cbind(seq(nrow(yProb)), y)] <- NA ySecond <- apply(yProb, 1, which.max) ## Set up initial variables ySplit <- matrix(NA, nrow = length(y), ncol = length(yToSplit) * maxClustersToTry ) ySplitLl <- rep(NA, ncol = length(yToSplit) * maxClustersToTry) ySplitLl[1] <- .cGCalcLL( nTSByC, nByTS, nByG, nGByTS, nM, nG, L, beta, delta, gamma ) ySplit[, 1] <- y ## Select worst clusters to test for reshuffling llShuffle <- rep(NA, L) previousY <- y for (i in yNonEmpty) { ix <- y == i newY <- y newY[ix] <- ySecond[ix] p <- .cGReDecomposeCounts(counts, newY, previousY, nTSByC, nByG, L) llShuffle[i] <- .cGCalcLL( p$nTSByC, p$nByTS, nByG, p$nGByTS, nM, nG, L, beta, delta, gamma ) previousY <- newY } yToShuffle <- utils::head(order(llShuffle, decreasing = TRUE, na.last = NA), n = maxClustersToTry ) pairs <- c(NA, NA) splitIx <- 2 for (i in yToShuffle) { otherClusters <- setdiff(yToSplit, i) for (j in otherClusters) { newY <- y ## Assign cluster i to the next most similar cluster (excluding ## cluster j) ## as defined above by the spearman correlation ixToMove <- y == i newY[ixToMove] <- ySecond[ixToMove] ## Split cluster j according to the clustering defined above ixToSplit <- y == j newY[ixToSplit] <- ifelse(clustSplit[[j]] == 1, j, i) ## Calculate likelihood of split p <- .cGReDecomposeCounts(counts, newY, previousY, nTSByC, nByG, L) ySplitLl[splitIx] <- .cGCalcLL( p$nTSByC, p$nByTS, nByG, p$nGByTS, nM, nG, L, beta, delta, gamma ) ySplit[, splitIx] <- newY splitIx <- splitIx + 1L previousY <- newY pairs <- rbind(pairs, c(i, j)) } } select <- which.max(ySplitLl) if (select == 1) { m <- paste0(date(), " .... No additional splitting was performed.") } else { m <- paste0( date(), " .... Cluster ", pairs[select, 1], " was reassigned and cluster ", pairs[select, 2], " was split in two." ) } p <- .cGReDecomposeCounts( counts, ySplit[, select], previousY, nTSByC, nByG, L ) return(list( y = ySplit[, select], nTSByC = p$nTSByC, nByTS = p$nByTS, nGByTS = p$nGByTS, message = m )) } ================================================ FILE: R/topRank.R ================================================ #' @title Identify features with the highest influence on clustering. #' @description topRank() can quickly identify the top `n` rows for each column #' of a matrix. For example, this can be useful for identifying the top `n` #' features per cell. #' @param matrix Numeric matrix. #' @param n Integer. Maximum number of items above `threshold` returned for each #' ranked row or column. #' @param margin Integer. Dimension of `matrix` to rank, with 1 for rows, 2 for #' columns. Default 2. #' @param threshold Numeric. Only return ranked rows or columns in the matrix #' that are above this threshold. If NULL, then no threshold will be applied. #' Default 0. #' @param decreasing Logical. Specifies if the rank should be decreasing. #' Default TRUE. #' @return List. The `index` variable provides the top `n` row (feature) indices #' contributing the most to each column (cell). The `names` variable provides #' the rownames corresponding to these indexes. #' @examples #' data(sampleCells) #' topRanksPerCell <- topRank(sampleCells, n = 5) #' topFeatureNamesForCell <- topRanksPerCell$names[1] #' @export topRank <- function(matrix, n = 25, margin = 2, threshold = 0, decreasing = TRUE) { if (is.null(threshold) || is.na(threshold)) { threshold <- min(matrix) - 1 } # Function to sort values in a vector and return 'n' top results # If there are not 'n' top results above 'thresh', then the # number of entries in 'v' that are above 'thresh' will be returned .topFunction <- function(v, n, thresh) { vAboveThresh <- sum(v > thresh) nToSelect <- min(vAboveThresh, n) h <- NA if (nToSelect > 0) { h <- utils::head(order(v, decreasing = decreasing), nToSelect) } return(h) } # Parse top ranked indices from matrix topIx <- base::apply(matrix, margin, .topFunction, thresh = threshold, n = n) # Convert to list if apply converted to a matrix because all # elements had the same length if (is.matrix(topIx)) { topIx <- lapply(seq(ncol(topIx)), function(i) { topIx[, i] }) names(topIx) <- dimnames(matrix)[[margin]] } # Parse names from returned margin oppositeMargin <- ifelse(margin - 1 > 0, margin - 1, length(dim(matrix))) topNames <- NULL namesToParse <- dimnames(matrix)[[oppositeMargin]] if (!is.null(namesToParse) & all(!is.na(topIx))) { topNames <- lapply( seq(length(topIx)), function(i) { ifelse(is.na(topIx[[i]]), NA, namesToParse[topIx[[i]]]) } ) names(topNames) <- names(topIx) } return(list(index = topIx, names = topNames)) } ================================================ FILE: README.md ================================================ [![R-CMD-check](https://github.com/campbio/celda/workflows/R-CMD-check/badge.svg)](https://github.com/campbio/celda/actions) [![Coverage Status](https://coveralls.io/repos/github/campbio/celda/badge.svg?branch=master)](https://coveralls.io/github/campbio/celda?branch=master) # celda: CEllular Latent Dirichlet Allocation "celda" stands for "**CE**llular **L**atent **D**irichlet **A**llocation". It is a suite of Bayesian hierarchical models and supporting functions to perform gene and cell clustering for count data generated by single cell RNA-seq platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications. This package also includes a method called **DecontX** which can be used to estimate and remove contamination in single cell genomic data. ## Installation Instructions To install the latest stable release of **celda** from [Bioconductor](http://bioconductor.org/packages/celda/) (requires R version >= 3.6): ``` if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("celda") ``` The latest stable version of **celda** can be installed from GitHub using `devtools`: ``` library(devtools) install_github("campbio/celda") ``` The development version of **celda** can also be installed from GitHub using `devtools`: ``` library(devtools) install_github("campbio/celda@devel") ``` **NOTE** For MAC OSX users, `devtools::install_github()` requires installation of **libgit2.** This can be installed via homebrew: ``` brew install libgit2 ``` Also, if you receive installation errors when Rcpp is being installed and compiled, try following the steps outlined here to solve the issue: https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/ If you are running R 4.0.0 or later version on MacOS Catalina and you see error `'wchar.h' file not found`, you can try the method in this link: https://discourse.mc-stan.org/t/dealing-with-catalina-iii/12731/5 If you are trying to install on MacOS in an Apple Silicon computater and you see the following error: ``` ld: warning: directory not found for option '-L/opt/gfortran/lib/gcc/x86_64-apple-darwin20.0/12.2.0' ld: warning: directory not found for option '-L/opt/gfortran/lib' ld: library not found for -lgfortran clang: error: linker command failed with exit code 1 (use -v to see invocation) make: *** [celda.so] Error 1 ERROR: compilation failed for package ‘celda’ ``` You can solve this by downloading and installing the gfortran pkg located [here](https://mac.r-project.org/tools/gfortran-12.2-universal.pkg) and then running the following command: ``` sudo /opt/gfortran/bin/gfortran-update-sdk ``` **NOTE** If you are trying to install **celda** using Rstudio and get this error: `could not find tools necessary to compile a package`, you can try this: ``` options(buildtools.check = function(action) TRUE) ``` ## Vignettes and examples To build the vignettes for Celda and DecontX during installation from GitHub, use the following command: ``` library(devtools) install_github("campbio/celda", build_vignettes = TRUE) ``` Note that installation may take an extra 5-10 minutes for building of the vignettes. The Celda and DecontX vignettes can then be accessed via the following commands: ``` vignette("celda") vignette("decontX") ``` ## For developers Check out our [Wiki](https://github.com/campbio/celda/wiki) for developer's guide if you want to contribute! - [Celda Development Coding Style Guide](https://github.com/campbio/celda/wiki/Celda-Development-Coding-Style-Guide) - [Celda Development Robust and Efficient Code](https://github.com/campbio/celda/wiki/Celda-Development-Robust-and-Efficient-Code) - [Celda Development Rstudio configuration](https://github.com/campbio/celda/wiki/Celda-Development-Rstudio-configuration) - [FAQ on how to use celda](https://github.com/campbio/celda/wiki/FAQ-on-how-to-use-celda) - [FAQ on package development](https://github.com/campbio/celda/wiki/FAQ-on-package-development) ================================================ FILE: _pkgdown.yml ================================================ template: params: bootswatch: yeti reference: - title: Primary celda functions desc: Functions for clustering of cells contents: - celda_CG - celda_C - celda_G - reportCeldaCGRun - reportCeldaCGPlotResults - selectFeatures - splitModule - title: Visualization functions for celda results desc: Functions for displaying celda resuls on 2-D embeddings, heatmaps, and violin plots contents: - celdaUmap - celdaTsne - moduleHeatmap - celdaProbabilityMap - plotDimReduceCluster - plotDimReduceFeature - plotDimReduceModule - plotDimReduceGrid - plotCeldaViolin - celdaHeatmap - title: Primary decontX functions desc: Functions for estimating and displaying contamination with decontX contents: - decontX - plotDecontXContamination - plotDecontXMarkerExpression - plotDecontXMarkerPercentage - decontXcounts - title: Functions for determining the numbers of clusters in celda desc: Functions for running and comparing multiple celda models with different number of modules or cell populations contents: - recursiveSplitCell - recursiveSplitModule - plotRPC - celdaGridSearch - plotGridSearchPerplexity - perplexity - resamplePerplexity - selectBestModel - resList - subsetCeldaList - appendCeldaList - celdaPerplexity - title: Miscellaneous celda functions desc: Various functions for manipulation of celda results contents: - celdaClusters - celdaModules - recodeClusterY - recodeClusterZ - reorderCelda - featureModuleLookup - featureModuleTable - celda - params - runParams - factorizeMatrix - bestLogLikelihood - clusterProbability - geneSetEnrich - plotHeatmap - retrieveFeatureIndex - normalizeCounts - distinctColors - matrixNames - logLikelihood - logLikelihoodHistory - topRank - sampleLabel - title: Simulation functions desc: Functions for generating data from the generative process of each model contents: - simulateCells - simulateContamination - title: Data objects desc: Small data objects used in examples contents: - sceCeldaCG - sceCeldaC - sceCeldaG - sceCeldaCGGridSearch - celdaCGGridSearchRes - sampleCells - contaminationSim - title: internal contents: - availableModels - celdaCGMod - celdaCGSim - celdaCMod - celdaCSim - celdaGMod - celdaGSim - celdaModel - celdatosce - compareCountMatrix - countChecksum - eigenMatMultInt - eigenMatMultNumeric - fastNormProp - fastNormPropLog - fastNormPropSqrt - nonzero - semiPheatmap - '`celdaPerplexity,celdaList-method`' - '`countChecksum,celdaList-method`' navbar: title: "celda" left: - icon: fa-home fa-lg href: https://www.camplab.net/celda - text: "Installation" href: articles/articles/installation.html - text: "Vignettes" menu: - text: "Celda - Analysis of PBMC3K" href: articles/articles/celda_pbmc3k.html - text: "DecontX - Decontamination of PBMC4K" href: articles/articles/decontX_pbmc4k.html - text: "Reference" href: reference/index.html - text: "News" href: news/index.html - icon: fa-github href: https://github.com/campbio/celda ================================================ FILE: docs/404.html ================================================ Page not found (404) • celda
Content not found. Please use links in the navbar.

Site built with pkgdown 2.0.7.

================================================ FILE: docs/CONDUCT.html ================================================ Contributor Code of Conduct • celda

As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.

Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.

This Code of Conduct is adapted from the Contributor Covenant (http:contributor-covenant.org), version 1.0.0, available at http://contributor-covenant.org/version/1/0/0/

Site built with pkgdown 2.0.7.

================================================ FILE: docs/LICENSE-text.html ================================================ License • celda
MIT License

Copyright (c) 2018 Joshua D Campbell

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Site built with pkgdown 2.0.7.

================================================ FILE: docs/articles/articles/celda_pbmc3k.html ================================================ Celda - Analysis of PBMC3K • celda

Introduction

Celda is a Bayesian hierarchical model that can perform bi-clustering of features into modules and observations into subpopulations. In this tutorial, we will apply Celda to a real-world single-cell RNA sequencing (scRNA-seq) dataset of 2,700 Peripheral Blood Mononuclear Cells (PBMCs) collected from a healthy donor. This dataset (PBMC3K) is available from 10X Genomics and can be found on the 10X website.

The celda package uses the SingleCellExperiment (SCE) object for management of expression matrices, feature/cell annotation data, and metadata. All of the functions have an SCE object as the first input parameter. The functions operate on a matrix stored in the assay slot of the SCE object. The parameter useAssay can be used to specify which matrix to use (the default is "counts"). Matrices can be of class matrix or dgCMatrix from the Matrix package. While the primary clustering is performed with functions from the celda package, the singleCellTK package is used for some other tasks such as importing data, quality control, and marker identification with differential expression.

Importing data

The PBMC3K data can be easily loaded via the Bioconductor package TENxPBMCData. TENxPBMCData is an experiment package that provides resources for various PBMC datasets generated by 10X Genomics. When using this package, the column names of returned SCE object are NULL by default. For this example, we paste together the name of the sample with the cell barcode to generate column names for the SCE object. Additionally, the count matrix within sce object is converted from a DelayedMatrix object to a sparse matrix dgCMatrix object.

library(TENxPBMCData)
sce <- TENxPBMCData("pbmc3k")
colnames(sce) <- paste0("pbmc3k_", colData(sce)$Sequence)
counts(sce) <- as(counts(sce), "dgCMatrix")

If you have the singleCellTK package installed, then this dataset can be imported and converted with a single command:

To get your own data into a SingleCellExperiment object, the singleCellTK package has several importing functions for different preprocessing tools including CellRanger, STARsolo, BUStools, Optimus, DropEST, SEQC, and Alevin/Salmon. For example, the following code can be used as a template to read in multiple samples processed with CellRanger:

library(singleCellTK)
sce <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"))

Note: As a reminder, you can view the assays, column annotation, and row annotation stored in the SCE with the commands assays(sce), colData(sce), and rowData(sce), respectively.

Finally, we set the rownames of the SCE to the gene symbol:

rownames(sce) <- rowData(sce)$Symbol_TENx

Quality Control

Quality control and filtering of cells is often needed before down-stream analyses such as dimensionality reduction and clustering. Typical filtering procedures include exclusion of poor quality cells with low numbers of counts/UMIs, estimation and removal of ambient RNA, and identification of potential doublet/multiplets. Many tools and packages are available to perform these operations and users are free to apply their tool(s) of choice as the celda clustering functions will work with any matrix stored in an SCE object. The celda package does contain a Bayesian method called decontX to estimate and remove transcript contamination in individual cells in a scRNA-seq dataset.

To perform QC, we suggest using the runCellQC function in singleCellTK package. This is a wrapper for several methods for calculation of QC metrics, doublet detection, and estimation of ambient RNA (including decontX). Below is a quick example of how to perform standard QC before applying celda. If you have another preferred approach or your data has already been QC’ed, you can move to Feature selection section. For this tutorial, we will only run one doublet detection algorithm and one decontamination algorithms. For a full list of algorithms that this function runs by default, see ?runCellQC. We will also quantify the percentage of mitochondrial genes in each cell as this is often used as a measure of cell viability.

library(singleCellTK)

# Get list of mitochondrial genes
mito.genes <- grep("^MT-", rownames(sce), value = TRUE)

# Run QC
sce <- runCellQC(sce, sample = NULL, algorithms = c("QCMetrics", "scDblFinder", "decontX"), geneSetList = list(mito=mito.genes), geneSetListLocation = "rownames")

Note: If you have cells from multiple samples stored in the SCE object, make sure to supply the sample parameter as the QC tools need to be applied to cells from each sample individually.

Individual sets of QC metrics can be plotted with specific functions. For example to plot distributions of total numbers of UMIs derived from runPerCellQC, doublet scores from runScDblFinder, and contamination scores from runDecontX (all of which were run by the runCellQC function), the following plotting functions can be used:

plotScDblFinderResults(sce, reducedDimName = "decontX_UMAP")

plotDecontXResults(sce, reducedDimName = "decontX_UMAP")

An comprehensive HTML report can be generated to visualize and explore the QC metrics in greater detail:

After examining the distributions of various QC metrics, poor quality cells will need to be removed. Typically, thresholds for QC metrics should exclude cells that are outliers of the distribution (i.e. long tails in the violin or density plots). Cells can be removed using the subsetSCECols function. Metrics stored in the colData of the SCE object can be filtered using the colData parameter. Here we will limit to cells with at least 600 counts and 300 genes detected:

# Filter SCE
sce <- subsetSCECols(sce, colData = c("total > 600", "detected > 300"))

# See number of cells after filtering
ncol(sce)
## [1] 2675

Other common metrics to filter on include subsets_mito_percent for removal of cells with high mitochondrial percentage, decontX_contamination for removal of cells with higher levels of contamination from ambient RNA, scDblFinder_class to remove doublets (or calls from any of the other doublet detection algorithms). See the singleCellTK documentation For more information on performing comprehensive QC and filtering.

Feature selection

In general, removing features with low numbers of counts across all cells is recommended to reduce computational run time. A simple selection can be performed by removing features with a minimum number of counts in a minimum number of cells using the selectFeatures function:

# Select features with at least 3 counts in at least 3 cells
library(celda)
useAssay <- "counts"
altExpName <- "featureSubset"
sce <- selectFeatures(sce, minCount = 3, minCell = 3, useAssay = useAssay, altExpName = altExpName)

# See number of features after filtering
nrow(altExp(sce, altExpName))
## [1] 2639

The useAssay parameter is used to denote which assay/matrix within the SCE to use for filtering. The default raw counts matrix is traditionally stored in the "counts" assay. If decontX was previously run during QC, then the decontaminated counts can be used by setting this parameter to "decontXcounts". We will save this parameter in a variable called useAssay which will be used as input in several downstream functions.

Note: The subsetted matrix is stored in the “alternative experiment” slot (altExp) within the SCE. This allows for a matrix with a different number of rows to be stored within the same SCE object (rather than creating two SCE objects). The celda functions described in the next several sections operate on a matrix stored in the altExp slot. The default name given to the alternative experiment and used in all downstream celda functions is "featureSubset". If the altExpName parameter is changed here, then it will need to be supplied to downstream plotting functions as well. The list of alternative experiments in an SCE can be view with altExpNames(sce). If you have already have an SCE with selected features or do not want to perform feature selection, then you need to set the alternative experiment directly with a command like altExp(sce, "featureSubset") <- assay(sce, "counts"). In the future, this will be updated to be more simple by utilizing the ExperimentSubset package.

If the number of features is still relatively large (e.g. >5000), an alternative approach is to select highly variable features that can be used in the downstream clustering. The advantage of this approach is that it can greatly speed up celda and can improve with module detection among highly variable features with overall lower expression. The disadvantage of this approach is that features that do not fall into the highly variable group will not be clustered into modules. The celda package does not include methods for selection of highly variable genes (HVGs). However, the singleCellTK provides wrappers for methods used in Seurat and Scran. We recommend keeping at least 2,000-5,000 HVGs for clustering. Here is some example code of how to select the top 5,000 most variable genes and store it back in the SCE as an altExp:

library(singleCellTK)
sce <- seuratFindHVG(sce, useAssay = useAssay, hvgMethod = "vst")
g <- getTopHVG(sce, method = "vst", n = 5000)
altExp(sce, altExpName) <- sce[g, ]

For the rest of the analysis with the PBMC3K data, we will use the first approach where features with at least 3 counts in 3 cells were included.

Analysis with Celda

Bi-clustering with known numbers of clusters

As mentioned earlier, celda is discrete Bayesian model that is able to simultaneously bi-cluster features into modules and cells into cell clusters. The primary bi-clustering model can be accessed with the function celda_CG. This function operates on a matrix stored as an alternative experiment in the altExp slot. If you did not perform feature selection as recommended in the previous section and your matrix of interest is not currently located in an altExp slot, the following code can be used to copy a matrix in the main assay slot to the altExp slot:

useAssay <- "counts"
altExpName <- "featureSubset"
altExp(sce, altExpName) <- assay(sce, useAssay)`. 

The two major adjustable parameters in this model are L, the number of modules, and K, the number of cell populations. The following code bi-clusters the PBMC3K dataset into 100 modules and 15 cell populations:

sce <- celda_CG(sce, L = 100, K = 15, useAssay = useAssay, altExpName = altExpName)

However, in most cases, the number of feature modules (L) and the number of cell clusters (K) are not known beforehand. In the next sections, we outline procedures that can be used suggest reasonable choices for these parameters. If the data is clustered with the code above by supplying K and L directly to the celda_CG function, then you can skip the next section and proceed to Creating 2-D embeddings.

Finding the number of modules

In order to help choose a reasonable solutions for L and K, celda provides step-wise splitting procedures along with measurements of perplexity to suggest reasonable choices for L and K. First, the function recursiveSplitModule can be used to cluster features into modules for a range of L. Within each step, the best split of an existing module into 2 new modules is chosen to create the L-th module. The module labels of the previous model with \(L-1\) modules are used as the initial starting values in the next model with \(L\) modules. Note that the initialization step may take longer with larger numbers of cells in the dataset and the splitting procedure will take longer with larger numbers features in the dataset. Celda models with a L range between initialL = 10 and maxL = 150 are tested in the example below.

moduleSplit <- recursiveSplitModule(sce, useAssay = useAssay, altExpName = altExpName, initialL = 10, maxL = 150)

Perplexity has been commonly used in the topic models to measure how well a probabilistic model predicts observed samples (Blei et al., 2003). Here, we use perplexity to evaluate the performance of individual models by calculating the probability of observing expression counts given an estimated Celda model. Rather than performing cross-validation which is computationally expensive, a series of test sets are created by sampling the counts from each cell according to a multinomial distribution defined by dividing the counts for each gene in the cell by the total number of counts for that cell. Perplexity is then calculated on each test set and can be visualized using function plotGridSearchPerplexity. A lower perplexity indicates a better model fit.

plotGridSearchPerplexity(moduleSplit, altExpName = altExpName, sep = 10)

The perplexity alone often does not show a clear elbow or “leveling off”. However, the rate of perplexity change (RPC) can be more informative to determine when adding new modules does not add much additional information Zhao et al., 2015). An RPC closer to zero indicates that the addition of new modules or cell clusters is not substantially decreasing the perplexity. The RPC of models can be visualized using function plotRPC:

plotRPC(moduleSplit, altExpName = altExpName)

In this case, we will choose an L of 80 as the RPC curve tends to level off at this point:

L <- 80
Note: Perplexity and RPC are meant to be guides to give a sense of a possible starting point for L. However, they may not always give a clear “leveling off” depending of the complexity and quality of the dataset. Do not give up if the choice of L is unclear or imperfect! If the L to choose is unclear from these, then you can set a somewhat high number (e.g. 75) and move to the next step of selecting K. Later on, manual review of modules using functions such as moduleHeatmap can give a sense of whether individual modules should be further split up by selecting higher L. For example, you can start exploring the cell populations and modules with L = 75. If some modules need to be further split, you can then try L = 100, L = 125, and so on.

Finding the number of cell subpopulations

Now we extract the Celda model of L =\(L\) with function subsetCeldaList and run recursiveSplitCell to fit models with a range of K between 3 and 25:

temp <- subsetCeldaList(moduleSplit, list(L = L))
sce <- recursiveSplitCell(sce, useAssay = useAssay, altExpName = altExpName, initialK = 3, maxK = 25, yInit = celdaModules(temp))

The perplexities and RPC of models can be visualized using the same functions plotGridSearchPerplexity and plotRPC.

plotRPC(sce)

The perplexity continues to decrease with larger values of K. The RPC generally levels off between 13 and 16 and we choose the model with K = 14 for downstream analysis. The follow code selects the final celda_CG model with L = 80 and K = 14:

K <- 14
sce <- subsetCeldaList(sce, list(L = L, K = K))

Note: Similar to choosing L, you can guess an initial value of K based off of the perplexity and RPC plots and then move to the downstream exploratory analyses described in the next several sections. After reviewing the cell clusters on 2-D embeddings and module heatmaps, you may have to come back to tweak the choice of K until you have something that captures the cellular heterogeneity within the data without “over-clustering” cells into too many subpopulations. This may be an iterative procedure of going back-and-forth between choices of K and plotting the results. So do not let imperfect perplexity/PRC plots prevent you from moving on to the rest of the analysis. Often times, using an initial guess for K will allow you to move on in the analysis to get a sense of the major sources of biological heterogeneity present in the data.

Exploring cell populations

Creating 2-D embeddings

After selecting a celda model with specific values of L and K, we can then perform additional exploratory and downstream analyses to understand the biology of the transcriptional modules and cell populations. We can start by generating a dimension reduction plot with the Uniform Manifold Approximation and Projection (UMAP) method to visualize the relationships between the cells in a 2-D embedding. This can be done with function celdaUmap.

sce <- celdaUmap(sce, useAssay = useAssay, altExpName = altExpName)

Alternatively, a t-distributed stochastic neighbor embedding (t-SNE) can be generated using function celdaTsne. The UMAP and t-SNE plots generated by celdaUmap and celdaTsne are computed based on the module probabilities (analogous to using PCs from PCA). The calculated dimension reduction coordinates for the cells are stored under the reducedDim slot of the altExp slot in the original SCE object. The follow command lists the names of the dimensionality reductions that can be used in downstream plotting functions in the next few sections:

reducedDimNames(altExp(sce, altExpName))
## [1] "decontX_UMAP" "celda_UMAP"

Plotting cell population cluster labels

The function plotDimReduceCluster can be used to plot the cluster labels for cell populations identified by celda on the UMAP:

plotDimReduceCluster(sce, reducedDimName = "celda_UMAP", labelClusters = TRUE)

Plotting expression of specific features

Usually, biological features of some cell populations are known a priori and can be identified with known marker genes. The expression of selected marker genes can be plotted on the UMAP with the function plotDimReduceFeature.

markers <- c("CD3D", "IL7R", "CD4", "CD8B", "CD19", "FCGR3A", "CD14", "FCER1A", "PF4")

plotDimReduceFeature(x = sce, features = markers, reducedDimName = "celda_UMAP", useAssay = useAssay, altExpName = altExpName, normalize = TRUE)

The parameter displayName can be used to switch between IDs stored in the rownames of the SCE and columns of the rowData of the SCE. If the assay denoted by useAssay is a raw counts matrix, then setting normalize = TRUE is recommended (otherwise the z-score of the raw counts will be plotted). When set to TRUE, each count will be normalized by dividing by the total number of counts in each cell. An alternative approach is to perform normalization with another method and then point to the normalized assay with the useAssay parameter. For example, normalization can be performed with the scater package:

library(scater)
sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts")
plotDimReduceFeature(x = sce, features = markers, reducedDimName = "celda_UMAP", useAssay = "logcounts", altExpName = altExpName, normalize = FALSE)

This second approach may be faster if plotting a lot of marker genes or if the dataset is relatively large.

Plotting cell subpopulations with labels

Once we identify of various cell subpopulations using the known marker genes, these custom labels can be added on the UMAP colored by cluster:

g <- plotDimReduceCluster(sce, reducedDimName = "celda_UMAP", altExpName = altExpName, labelClusters = TRUE)

labels <- c("1: Megakaryocytes",
    "2: CD14+ Monocytes 1",
    "3: CD14+ Monocytes 2",
    "4: FCGR3A (CD16+) Monocytes",
    "5: CD14+ Monocytes 3",
    "6: CD8+ Cytotoxic T-cells",
    "7: CD4+ T-cells",
    "8: CD8+ Cytotoxic T-cells",
    "9: B-cells",
    "10: Naive CD8+ T-cells",
    "11: Naive CD4+ T-cells",
    "12: NK-cells",
    "13: Unknown T-cells",
    "14: Dendritic cells")

library(ggplot2)
g <- g + scale_color_manual(labels = labels,
    values = distinctColors(length(labels)))
print(g)

Exploring relationship between modules and cell populations

Celda has the ability to identify modules of co-expressed features and quantify the probability of these modules in each cell population. An overview of the relationships between modules and cell subpopulations can be explored with the function celdaProbabilityMap. The “Absolute probability” heatmap on the left shows the proportion of counts in each module for each cell population. The “Absolute probability” map gives insights into the absolute abundance of a module within a given cell subpopulation. The absolute heatmap can be used to explore which modules are higher than other modules within a cell population. The “Relative expression” map shows the standardized (z-scored) module probabilities across cell subpopulations. The relative heatmap can be used to explore which modules are relatively higher than other modules across cell populations.

celdaProbabilityMap(sce, useAssay = useAssay, altExpName = altExpName)

In this plot, we can see a variety of patterns. Modules 15 - 20 are highly expressed across most cell populations indicating that they may contain housekeeping genes (e.g. ribosomal). Other modules are specific to a cell population or groups of cell populations. For example, module 35 is only on in population 1 while module 70 is expressed across populations 2, 3, and to some degree in population 5. The unknown T-cell population 13 has highly specific levels of modules 30. In the next section, we can look at the genes in these modules to gain insights into the biological properties of each of these cell populations.

Exploring feature modules

The primary advantage of celda over other tools is that it can cluster features that are co-expressed across cells into modules. These modules are often more biologically coherent than features correlated with principal components from PCA. Below are several ways in which modules can be explored and visualized.

Table of features in each module

The function featureModuleTable can be used to get the names of all features in each module into a data.frame.

# Save to a data.frame
ta <- featureModuleTable(sce, useAssay = useAssay, altExpName = altExpName)
dim(ta)
## [1] 154  80
head(ta[,"L70"])
## [1] "S100A9"   "S100A8"   "S100A12"  "RBP7"     "FOLR3"    "C19orf59"

The parameter displayName can be used to switch between IDs stored in the rownames of the SCE and columns of the rowData of the SCE. The the outputFile parameter is set, the table will be saved to a tab-delimited text file instead of to a data.frame:

# Save to file called "modules.txt"
featureModuleTable(sce, useAssay = useAssay, altExpName = altExpName, outputFile = "modules.txt")

The modules for this model are shown below:

L1 L2 L3 L4 L5 L6 L7 L8 L9 L10 L11 L12 L13 L14 L15 L16 L17 L18 L19 L20 L21 L22 L23 L24 L25 L26 L27 L28 L29 L30 L31 L32 L33 L34 L35 L36 L37 L38 L39 L40 L41 L42 L43 L44 L45 L46 L47 L48 L49 L50 L51 L52 L53 L54 L55 L56 L57 L58 L59 L60 L61 L62 L63 L64 L65 L66 L67 L68 L69 L70 L71 L72 L73 L74 L75 L76 L77 L78 L79 L80
CCL3 GNLY CTSW NKG7 RPS19 MT-CO2 MT-CO3 DDX5 RPL28 RPL18A FOS EIF1 JUNB GIMAP7 RPL13A RPS6 RPS2 RPL10 RPL13 RPS14 RPSA RPS27 LTB PTPRCAP MALAT1 LDHB IL32 CD79B CD37 TUBA1B GAPDH PPIA ACTG1 CCL5 PPBP RGS10 OAZ1 TAGLN2 MT-ND1 MT-CO1 ARPC3 SH3BGRL3 CYBA PTMA TMSB10 LAPTM5 ARHGDIB HLA-B CFL1 SRGN ACTB TMSB4X C9orf142 ANXA1 UBB B2M MYL12A HLA-A FCGR3A IFITM2 FAM26F FCER1G AIF1 FTH1 FCER1A HLA-DQA1 HLA-DPB1 CD74 HLA-DRA S100A9 LYZ CST3 VIM NEAT1 S100A4 GSTP1 LGALS1 GABARAP TYROBP FTL
IGFBP7 GZMB CD247 GZMA NACA CD52 MT-ND4 TSC22D3 RPS9 RPL12 FXYD5 H3F3B TMEM66 GIMAP4 RPS18 RPS3 RPL19 RPL11 RPL32 EEF1A1 JUN RPL21 MYC CXCR4 MYLIP IL7R CD3D CD79A SNHG7 HMGB2 EIF4A1 HNRNPA2B1 CORO1A GZMK PF4 TUBA4A FKBP1A GDI2 PFDN5 LSP1 YBX1 SERF2 CLIC1 HNRNPA1 EIF3K SNX3 UBC SRP14 PSMB9 ITGB2 PFN1 GMFG APOBEC3G HCST RAC2 HLA-C HSPA8 CALM1 RHOC CTSC NCF1 FGR LST1 COTL1 CLEC10A HLA-DQB1 HLA-DPA1 IRF8 HLA-DMA S100A8 LGALS2 CFP S100A10 ISG15 S100A6 GPX1 TYMP TSPO FCN1 CTSS
HAVCR2 FGFBP2 GZMM CST7 NAP1L1 PPDPF MT-CYB TXNIP FAU RPL8 CD48 DUSP1 ZFP36L2 FYB RPS8 RPS12 RPLP1 RPL6 RPLP2 RPS4X NPM1 RPS3A SIT1 ISG20 ATM NOSIP CD3E MS4A1 SNX2 EIF1AY SLC25A5 HMGB1 CHCHD2 LAG3 HIST1H2AC CDC42SE2 TALDO1 ATP5C1 SLC25A6 ATP6V0E1 LY6E ARPC1B SUPT4H1 RPL36AL ATP5E UQCRH MYL12B PSME1 PPP1CA CD63 MYL6 CAPZB CDC37 ID2 UCP2 HLA-E EVL CD99 CDKN1C MYO1G LYN CD86 IFITM3 SAT1 ENHO HLA-DQA2 HLA-DRB1 LAT2 LY86 S100A12 MS4A6A CPVL NFKBIA ANXA2 S100A11 AP1S2 LGALS3 RAC1 NCF2 NPC2
CCL4L1 CCL4 LYAR PRF1 SOD1 ATP6V1G1 MT-ND2 CIRBP UBA52 RPL29 APRT ITM2B TMEM123 GIMAP1 RPL10A RPL3 RPL15 RPL26 RPS16 RPL27A RPSAP58 RPS27A BIRC3 CD69 ANKRD44 GIMAP5 CD7 TCL1A PRKCB MANF TPI1 HSP90AA1 ENO1 SDPR FERMT3 H3F3A PRDX6 ATP5G2 BRK1 RHOA ALDOA IFI35 COX7C ATP5L GABARAPL2 YWHAB RBM3 PSMB8 CTSD ARPC2 TUBA1A ARGLU1 RNF181 CD53 ARL6IP5 IFITM1 SEPT7 CKB ABI3 POU2F2 CD300C CFD PSAP DNASE1L3 CD1C HLA-DRB5 EAF2 HLA-DMB RBP7 CD14 IGSF6 AMICA1 PRELID1 GRN TNFSF13B PYCARD CDA BRI3
SPON2 HOPX GZMH CYTIP PRR13 MT-ATP6 LIMD2 RPS24 GNB2L1 GSTK1 KLF6 BTG2 CITED2 RPS5 RPS15A RPS15 RPL14 RPS28 TPT1 HINT1 RPL9 RIC3 STK17A CARS C12orf57 CD2 IGLL5 ARL4A HMGA1 PKM HMGN1 SLC25A3 TSC22D1 DNAJB6 GSTO1 DNAJC8 ZFP36 C11orf31 CALM2 LAMTOR4 IRF7 CNBP PSMA7 POLD4 OST4 C19orf43 TPM3 MYO1F EMP3 PGK1 TMCO1 CKLF PLAC8 BIN2 AES PDIA3 LYPD2 ATP1B3 SCPEP1 FAM49A SERPINA1 TIMP1 SERPINF1 FCGR2B PLD4 MEF2C RNASE6 FOLR3 ALDH2 RAB32 MYADM IFI6 CEBPD RNF130 TKT SLC7A7 CTSB
CLIC3 C12orf75 KLRD1 CLNS1A SYF2 SSR2 KLF2 COX4I1 RPS11 CD44 IER2 NDFIP1 SEPW1 RPS23 RPL23A RPS7 RPL27 RPL36 RPL7 EIF4A2 BTG1 STMN3 ACAP1 DNAJB1 LCK IGJ HVCN1 STMN1 COX8A SRSF7 LDHA NRGN LIMS1 SOD2 YWHAH SERP1 COPE RNASET2 COX5B MIR142 TMA7 COX6B1 SEP15 NDUFA4 SUMO2 COX6A1 FLNA PSME2 ARF1 ADRM1 IL10RA CD164 RAP1B RARRES3 LITAF VMO1 SPN SYNGR2 ARRB1 CD68 CEBPB LILRA4 PPP1R14A CCDC50 GAPT C19orf59 IL8 CD33 PPT1 C1orf162 FCGRT STX11 AP2S1 FPR1 BLVRA
XCL2 CD8A CD160 CCT7 EVI2B TMEM14B HNRNPDL BTF3 RPL23 PPP1R15A GPSM3 RWDD1 GIMAP2 RPL18 RPL30 RPL7A EEF1D RPL22 RPL35A SNRPD2 GLTSCR2 CRIP2 RHOH PRDX2 LAT LINC00926 SYPL1 PRDX4 PSMB3 RAN PPIB TUBB1 MAX POLE4 MRPL14 GPX4 COX5A DRAP1 TCEB2 LAP3 EIF3F VAMP8 LSM7 SKP1 EIF3G GUK1 GLIPR2 ARPC5 LRRFIP1 PPP2CA CMTM3 PPP1R18 HLA-F TRAF3IP3 XBP1 ADA LYST TCF7L2 SPI1 STXBP2 PHACTR1 CSF3R CD302 CORO1B RHOG CTSH MNDA ATP6V0B MAFB RHOB
TTC38 KLRG1 FCRL6 ATP1A1 YPEL5 C19orf24 ANAPC16 PABPC1 EEF2 PLP2 EDF1 SOCS3 DGCR6L RPL5 RPS25 RPL35 RPL24 C6orf48 RPL34 SELL FOXP1 TNFRSF4 BIN1 CD27 CD3G BANK1 ADK CHTF8 GNB2 HNRNPK SRSF3 GNG11 AP3S1 RAB10 SPINT2 ATP5D WDR83OS SH3BGRL BST2 TMEM179B UBE2D3 ARL6IP4 RASGRP2 COX6C TMBIM6 TRAPPC1 EFHD2 ATP5B CALM3 PDHB FYN SCP2 DHRS7 IL2RG ANXA6 SH3BP1 LYL1 IFI30 ASAH1 PLBD1 LILRB4 TMEM14C MT2A BLVRB FGL2 RGS2 FCGR2A PLAUR
AKR1C3 ZAP70 XCL1 ARID4B DAZAP2 MPC2 VAMP2 EIF3H NBEAL1 KHDRBS1 UQCR11 TRADD TMEM173 RPS10 RPL31 RPL4 RPL38 C21orf33 RPL17 PEBP1 RP11-796E2.4 RP11-706O15.1 OCIAD2 LEPROTL1 CD8B VPREB3 SMIM14 SP140 UQCRFS1 MIF PARK7 RGS18 CTSA PTPN18 H2AFJ HIGD2A FIS1 BLOC1S1 SF3B5 PLSCR1 ERP29 NEDD8 MX1 PTPRC SAP18 CAP1 PLEK MSN FKBP8 TPST2 ETHE1 TMEM9B TMEM50A CCND3 YWHAQ GUSB STX7 APOBEC3A BID ASGR1 RAB31 DHRS4L2 CSTB NUP214 KLF4 LINC00936 TMEM176B CTSZ
PRSS23 APMAP KLRC1 THYN1 SP110 TMED4 UBE2D2 PNRC1 EIF3L SURF1 SSR4 CHURC1 GBP1 RPS13 RPS20 RPL37A RPS4Y1 C1orf228 RPL36A CMPK1 KCNQ1OT1 KRT1 FAIM3 PIK3IP1 OPTN FCER2 MYCBP2 HIST1H4C ANAPC11 SRSF2 PSMB1 CLU NCOA4 GLUL TPM4 FOSB PSMB7 RNH1 GLRX TMEM205 GTF3A NDUFA13 MAF1 PTGES3 JTB LCP1 CD300A HN1 ATRAID PPP6C PTGER2 ZNF207 SIGIRR SRP9 ALOX5AP FAM110A YBX3 MS4A7 WARS TNFAIP2 CD4 C10orf54 IL1B ODF3B CNPY3 VSTM1 TBXAS1
GPR56 SAMD3 RP11-347P5.1 CCDC12 MRPL21 UXT UBXN1 ZFAS1 STAT1 MT-ND5 STAT3 DNAJA2 RPLP0 RPS29 RPL37 TOMM7 COMMD6 PPA1 LBH RGCC ITM2A PDLIM1 CYB561A3 PCNA PGAM1 HNRNPC H2AFZ CD9 RHEB RNPEP COMMD3 NDUFA11 SRSF9 PTPN6 SERPINB1 TLN1 PCBP2 ATP5G3 RCSD1 ACTR3 EID1 HCLS1 AOAH DOK2 M6PR ATXN10 CDC42EP3 NDUFS2 SH3KBP1 FUS TMED9 ASB8 SNX10 HCK NINJ1 BST1 UBE2Q1 ANXA5 PGD SMCO4 CAPG RETN G0S2
HBA1 NCR3 TTC3 STUB1 TMEM165 PNISR RPL39 HPCAL1 PCBP1 TNFAIP3 SORL1 EEF1B2 RPS26 EIF3D SRSF5 ST13 ANKRD12 GPR183 TCF7 RORA P2RX5 MBD4 NME1 TIMM13 SPCS1 GHITM ACRBP ODC1 FAM45A ETFA MCL1 ARF5 ZNF706 ATP6V1F ANXA4 HSP90AB1 UBL5 DDT COX7A2 NDUFB8 ATP5F1 CX3CR1 PTP4A2 C1orf43 TMX2 GYG1 PSMC4 RASSF5 DUSP2 SUN2 UTRN NOTCH2NL RP11-290F20.3 CPPED1 FCGR1A H2AFY MSRB1 TGFBI LGALS9 MTMR11 C5AR1
PTGDS CHST12 RBM4 C19orf70 LYPLA1 RBMX ZFP36L1 RBMS1 ICAM3 UBAC2 PTGER4 RPS21 ZFAND1 MZT2B CCNI CCDC109B SLC2A3 RCN2 SPOCK2 SH2D1A BLK IFT57 SNRNP25 MOB1A C1QBP PRDX1 MMD RBBP6 MPP1 CIR1 LAMTOR1 PPP4C GNG5 UBE2L6 LAIR1 ATP5O COX7A2L HERPUD1 CIB1 C9orf16 VPS28 MIR4435-1HG CAPZA2 GBP2 RPL7L1 IFI44 SRP72 SYTL1 GYPC TAP1 TMEM18 BTK LILRA5 DUSP6 QPCT PGLS VCAN NUDT16 SAT2 CSTA MPEG1
PTPN7 DSCR3 TNFRSF14 PCSK7 SON APEX1 USP3 NDUFB11 HSPB1 MT1X SNHG8 FBL RPL41 AIMP1 ARHGAP15 PPM1K CCR7 ETS1 FCRLA PPAPDC1B AHCY CYC1 TRMT112 SNRPB CA2 AMD1 PLEKHO1 ADIPOR1 SCAND1 EIF5 VASP POLR2L DDAH2 HNRNPA0 CRIP1 PAPOLA TMEM59 UFC1 DBI ACTN4 RAB8A VPS29 CLEC2B SMIM12 ZFAND6 RASAL3 PPP2R5C BTN3A2 INSIG1 UNC93B1 PILRA TESC PID1 CARD16 ID1 PRAM1 IFNGR2 CYBB
TIGIT BAZ1A POLR2I TBC1D10C EIF2S3 ACTR1B CHMP4A MRPS33 RCBTB2 EEF1G EBPL CUTA TNFAIP8 ARID5B AQP3 CDC25B MZB1 NAT9 MCM5 SNF8 ERH UBE2I PTCRA GRAP2 MTHFD2 FDFT1 GNAS LAMTOR5 RBX1 SEC11A PARP14 ANP32B ATP5H RTFDC1 HNRNPF ARF6 DYNLL1 ASCL2 IDH2 MKRN1 EMG1 FLOT1 PMAIP1 MAEA DDIT4 PRMT2 CUX1 SCIMP LRRC25 SLC16A3 CXCL2 CASP1 CD1D APLP2 SLC11A1
PRR5 MRPL18 ARF4 FAM65B MED30 SSU72 RP11-51J9.5 HNRNPH3 EIF3E PCNP PPP3CC FLT3LG BCL11B CD72 RBM5 FABP5 FIBP CNN2 SEC61B SPARC RSU1 SNN GRSF1 HSD17B11 REEP5 RGS19 CASP4 LMO4 ATP5A1 NDUFB10 LSM10 C11orf58 HNRNPM PSMB10 MPST CLTB MYH9 DPM1 PSTPIP1 RAB11B RAB37 TERF2IP BUB3 UBLCP1 ALOX5 LILRB2 CTSL EREG ARRB2 NCOR2 JUND CLEC7A
KLRB1 SMIM7 N4BP2L1 PPP1R2 RP11-349A22.5 RAP1A IL27RA EPB41L4A-AS1 RSL1D1 MZT2A LY9 TRAT1 GATA3 TSPAN13 TPD52 PTTG1 TMEM208 DAD1 NDUFA1 MYL9 NT5C3A TAX1BP3 ILK CHMP1B ATP5EP2 RTN4 NOP10 SKAP2 NDUFB9 UQCR10 RPL22L1 RBM39 HNRNPA3 PRDX5 TMEM140 PSMA4 NMI VPS25 LINC00152 NT5C SEPT1 HSPA5 CDIP1 HHEX C19orf38 NAGA TNFSF10 GSN C4orf48 SULT1A1
MATK TRIM22 GMPR2 NCOR1 KIAA0040 POLR1D PRMT10 IMPDH2 CCNL1 DDX24 RP1-313I6.12 PRKCQ-AS1 SIRPG HLA-DOB ITM2C MCM7 MX2 SERBP1 ABRACL GP9 R3HDM4 MFSD1 CNDP2 ERGIC3 POLR2J CHCHD10 C20orf24 EPN1 C19orf53 ATP5J2 ZC3H15 CDC42 EIF3I CSNK2B OASL MPC1 CRELD2 DNAJC1 ATF6B ARL4C WIPF1 PPM1N SYK SLC31A2 GNAI2 RASSF4 AGTRAP GNS
IL2RB GRK6 C10orf32 SP100 IFI44L UBE2K TAF1D NSA2 TSTD1 PITPNA-AS1 LEF1 GPR171 SPIB CXXC5 FEN1 TXNDC17 TUBB MDH2 F13A1 NGFRAP1 MARCH2 HIST1H2BK IDH3G SUMO3 NDUFB5 TMEM219 U2AF1 MRPL23 LSM2 YWHAZ RABAC1 NDUFB2 UBE2B MOB2 SMARCA4 FDPS TECR ARPC5L CD83 HMOX1 DYNLT1 ENTPD1 GCA ADAP2
S100B C1orf63 MESDC2 MEAF6 IGBP1 LGALS3BP ZNF331 ILF3-AS1 SBDS NUP54 MAL AC092580.4 PKIG MAP3K8 MAD2L1 TIMM8B SUB1 PSMB6 TMEM40 CD82 PARVB THOC6 MT-ND3 DBNL OAS1 CLTA ISCU ATP5I EIF1B C4orf3 ENSA RAB7A RNF149 CMC1 CTD-2035E11.3 GNG2 SEPT6 LAMP1 IFIT2 LILRA3 NAAA OSM C20orf27
CD320 PPCS POLR3K RNPS1 METTL9 CAMK2G YME1L1 PSIP1 CHI3L2 SUSD3 BLNK IL4R TYMS MRPL28 C14orf166 ATP5J TREML1 PTTG1IP CORO1C POP7 OXA1L SNX17 ATG3 NDUFS7 MORF4L1 NDUFA2 EZR SEPT9 TMEM258 FAM49B SQRDL RAB9A RAB27A PHF14 PIM1 ARL6IP1 SIGLEC10 CSF1R EPSTI1 SULF2 SCO2
SH2D2A ADAR TLE4 KRT10 MRFAP1 TMC8 DDX18 AAK1 KIAA0125 SWAP70 EZH2 SPCS3 CYCS UQCRQ ITGA2B ACTN1 H1F0 LMNA TMEM147 EIF6 CD55 AHNAK NDUFS5 ALKBH7 DECR1 PAIP2 RBM8A OSTF1 UPP1 MRPL19 RRAGC TMEM109 ADD3 MRPS6 RELT APOBEC3B OAZ2 MGST2 NAPRT1
PLEKHF1 DNAJC15 MORF4L2 EIF4B NDUFA3 ADSL NDNL2 LDLRAP1 POU2AF1 NXT2 KIAA0101 EIF3A XRCC6 CALR CMTM5 HIST1H1C SOX4 SLC39A3 C1orf86 XAF1 MARCKSL1 ZNHIT1 NHP2L1 RNF7 MRPL43 DNAJA1 AP2M1 ARPC4 MGAT1 DHX36 FAM105A MAPK1IP1L HDAC1 LSM14A KYNU GPBAR1 HSBP1 IER3 EIF4EBP1
ZNHIT3 SF3A1 ARHGEF1 CMTM7 CCT2 TTC39C AL928768.3 HELQ RRM2 ISOC2 EIF4G2 PSMD8 CLDN5 RUFY1 DAPP1 C14orf2 WAS ENY2 VAMP5 SMDT1 MRPL54 RNASEH2B VAPA IRF1 ARHGDIA MMP24-AS1 GFER CYB561D2 ANXA2R RNF167 PIK3AP1 RXRA ATP6V0D1 GMPR NAGK
PRKAR1A SRSF11 SQSTM1 FAM107B MGAT4A RP5-887A10.1 CENPN GGCT ATPIF1 AURKAIP1 ERV3-1 RIOK3 PQBP1 PFDN2 COX17 SDCBP CCDC85B VDAC1 YPEL3 C9orf78 TMEM230 TBCB ANAPC13 CXXC1 CELF1 HMOX2 C19orf66 KIAA0930 OSCAR EIF4E2 SRA1
APBB1IP TAF7 DPP7 FGFR1OP2 OXNAD1 TNFRSF13B GMNN YIF1B SPCS2 MTDH TUBA1C ZNF263 DNAJC7 ACTR2 LYSMD2 SAMHD1 EIF3M HADHA CCT6A DYNLRB1 KRTCAP2 BAX PLIN2 STX18 DTNBP1 RPA2 SLC9A3R1 HES1 LILRB1 ZYX MIR24-2
IKZF1 KMT2E FNBP1 PIM2 NUCB2 TNFRSF17 TK1 HAUS4 PSMC5 PSMA5 BBC3 DERA BNIP3L ERP44 GRB2 SDHB CSDE1 PSMG2 DRAM2 SELK PDCD6 SELT IFIT1 PGM1 BAZ2A NDUFA5 TSC22D4 HES4 TNFRSF1B NR4A1
SNAP23 TAPSAR1 MGST3 DNAJB9 RGL4 GINS2 HDGF RANBP1 H2AFV PLA2G12A PICALM NENF TWF2 LSM6 LAMTOR2 SF1 ETFB CAPZA1 PYURF VDAC2 TMBIM4 ARRDC1 DYNLL2 HSH2D DENND2D BZW1 CAMK1 CAPNS1 CECR1
BEST1 TCEA1 NUCB1 DEGS1 CD40LG ZWINT PGP DUT TOMM22 PGRMC1 APP AKR1A1 EMC7 ZFAND5 NDUFS6 TRA2B COX14 SMAP2 S1PR4 FAM96B PSMD4 IFI27 APOBEC3C TBCC DEF6 CD300E UBE2D1 MIDN
TAOK3 UBXN4 TGOLN2 XXbac-BPG299F13.17 TRABD2A BIK REEP3 TBCA NDUFC2 FHL1 MID1IP1 NDUFS3 CHMP4B CD97 NDUFV2 PSMB4 PLEKHJ1 BCAP31 LMAN2 WDR1 RSAD2 ZBP1 PBXIP1 C5orf56 CD300LF THEMIS2 GRINA
CARD8 MED10 PTPN2 PARP1 CD6 CCNA2 LMNB1 HSP90B1 LSM4 SLC40A1 HADHB BAG1 SMS TGFB1 UQCRC2 PSMD9 TMEM256 ICAM2 TAPBP NDUFB7 ATP5SL ABHD14B RSRC2 ARHGEF40 NANS PTPRE
KLF3 SEC62 EVI2A TOB1 SH3YL1 BIRC5 ECH1 PDIA6 PNMA1 TREX1 LRPAP1 CAT GLIPR1 EIF2S2 ADI1 BLOC1S2 TMED2 SUMO1 RAB5C COMMD10 RGS1 HDDC2 CXCL16 ATOX1 CARS2
LIPA CCT4 TNIP1 SVIP CAMK4 XRCC5 TUFM TPM1 PARL SHKBP1 LTA4H RILPL2 MINOS1 NAA10 WBP2 SSBP1 GADD45B CHMP2A CD38 IFRD1 SMARCE1 TPPP3 MAPKAPK3 DNTTIP1
ZNF394 KIF5B EMC10 TMEM261 SATB1 HNRNPU P4HB CCDC69 TRAPPC2L FAM32A SSR3 MRPL20 TXN2 EIF5B ANXA11 GADD45GIP1 SRI ITGB7 DDX6 LILRA2 CBR1 MYO9B
CLK3 STK4 MTIF3 BEX4 FHIT NCL DEK HTATIP2 C7orf50 PMVK UQCRC1 TOMM20 MRPL41 MMADHC CD47 ACP1 SIVA1 DNAJC19 SDF4 EMR2 TNFRSF1A UBE2R2
SNX9 POLR3GL FRG1 CYLD USP10 NHP2 UBE2L3 ARHGAP4 BNIP2 YWHAE SERPINB6 C19orf60 NDUFA7 SLTM NDUFB4 PSMA1 RALY RNF139 DERL1 MS4A4A ADRBK1 LACTB
OCIAD1 CSRNP1 ASF1A UXS1 PA2G4 C19orf10 MRPL40 MRPS23 PARVG LSMD1 IDS RNF187 HAGH NDUFA12 RPS19BP1 BSG FKBP11 PRPF31 CTD-2006K23.1 TCIRG1 CDKN1A
MPHOSPH8 SLC38A1 CISH NOL7 MYEOV2 MTHFS FAM173A ACAA1 PHPT1 ATF4 OLA1 COQ7 PSMA2 RPN2 TCF25 SKAP1 SPSB3 C1QA NAMPT CREG1
CRBN CDKN1B PASK MDH1 PSMB2 DNAJC4 HBP1 NFKBIZ NDUFB1 DGUOK MRPL52 FBXW5 SSB BANF1 POLR2G NAP1L4 EBP ZNF703 ZNF106 FUOM
TMEM243 LSM5 TNFRSF25 CCT8 MRPL51 PRPF8 TIMMDC1 VMP1 SPG21 MRPS21 MPG PNKD CELF2 HMGN3 NDUFS8 YTHDF2 GCHFR CEBPA AP2A1 FBP1
PPIG G3BP1 CCDC104 HSPE1 TXN ZNF581 ABTB1 CYTH4 IFI27L2 CAMLG NDUFA9 TINF2 RPS27L KARS BUD31 STT3B MFSD10 ALDH3B1 C11orf21 PDXK
MED4 MPHOSPH10 INTS12 PHB SNRPD1 ACAP2 ZNF511 MTPN VIMP PPP1CC EMC6 SMARCB1 NUDC SHISA5 RNF213 REXO2 UBA2 C1QB NRROS PLIN3
UGP2 THAP7 NELL2 CCT3 SNRPC RPL26L1 TPP1 MYD88 COMT NDUFAF3 MVP SET CAPN2 IMP3 ATP6AP2 RBL2 ALDH9A1 HCAR3 MANBA ATF3
DCK ID3 AKTIP EWSR1 MRPS34 TRMT1 IFNGR1 AKIRIN2 MAP2K3 ANXA7 NDUFC1 LRCH4 IK C16orf13 MAP1LC3B NOP58 ORAI1 CXCL3 MBOAT7
WTAP SLC25A45 LINC00176 HNRNPR PSMD7 FAM195A DOCK2 COMMD9 CHCHD5 HAX1 COX7B SIAH2 EIF4H CDC42SE1 RER1 CD96 SURF4 PRKCD
ORMDL1 RP11-489E7.4 CD28 CBX3 ILF2 FAM192A IFIT3 STX10 MBNL1 MRPS16 UBE2J1 HMHA1 SPAG7 FMNL1 SH3BP5 B3GAT3 SGK1
CCDC107 GPATCH4 SCGB3A1 CACYBP CCT5 CINP ERICH1 HM13 NR4A2 ZNRD1 C12orf10 JAK1 SDHC DCTN3 EMB PLGRKT RAB34
TAGAP TNRC6C OSTC ATP5G1 ARPC1A VKORC1 MRP63 MKKS SMCHD1 TRAM1 EMC4 PCMT1 PRKCH GBP5
PDCD4 RP11-291B21.2 THOC7 SHFM1 RFXANK ECHDC1 SF3B2 STRA13 TANK COPS6 APH1A CDKN2D NUDT16L1 PRKD2
TCP1 HAPLN3 PRMT1 ANP32A PCGF5 SLA RAD23A GPI STX5 RAB2A ARL5A RBCK1 ODF2L TRPV2
CCNG1 HSPD1 NDUFAB1 SNX5 SNAPIN TRAPPC6A ANAPC15 THRAP3 DSTN SF3B14 LAPTM4A C14orf1 STAMBP
NONO MATR3 NUTF2 ERCC1 RTN3 PHB2 KXD1 FAM96A CCDC115 PIN1 CSK CYB5B MLLT11
IL16 EIF5A HSD17B10 NCKAP1L PEPD RSL24D1 MRPL16 ISCA2 RHOF SRRM2 CHMP5 PSMD5-AS1 SYNE1
LUC7L3 PDCD5 POLR2E IL10RB STK38 STK17B SLIRP AAMP EAPP CFLAR SSNA1 ORMDL3 TFDP2
FNTA UBE2N PPP1R7 PFKL C9orf89 LINC00493 UQCRB CHPT1 COPS5 CCNDBP1 WASF2 CCDC167
N4BP2L2 MAGOH SEC11C UNC119 CDV3 GGNBP2 CTNNBL1 REL SNX6 COPZ1 KDELR2 GIMAP6
TPR PPHLN1 SNRPD3 ATP2B1 HEXB NSMCE1 FAM50A ITGAE SARS TMED10 COX6A1P2 PYCR2
G3BP2 VDAC3 PSMC3 RABGAP1L ZNF524 CWC15 KDELR1 USF2 FBXO7 DDOST SELPLG RNF115
ELOVL5 NUCKS1 OTUB1 NFYC VMA21 PDCD2 IRF2BP2 ITGA4 GNAI3 SDHD WSB1 PTPN4
SCAF11 AIP MRPS7 MTSS1 PLD3 MRPS18B PSMD11 NSFL1C CCM2 MRPL34 C7orf73 SYNE2
PRPF38B RBM17 PSMA3 RALB ATP6V1B2 PRRC2C HARS GRPEL1 SNHG15 TMEM160 CMTM6 TMEM87A
SLC3A2 GTF2A2 CDK2AP2 DPEP2 CHIC2 CNPY2 SEC13 SAMD9L ABT1 ZMAT2 SUCLG1 RBM38
DPY30 STRAP SRM SNAP29 OAS3 RP11-1143G9.4 FAM89B CCS RNF5 IFI16 PRDX3 THAP11
CYTH1 SNRPE UFD1L GCH1 CEBPG NECAP2 NIT2 ACO2 FAM162A TRAPPC3 MIEN1 OBFC1
CCDC59 EPC1 POMP GINM1 CBWD1 DDX17 FLYWCH2 MRPL12 CCNH C17orf62 RNASEH2C CD59
WHSC1L1 RNF126 MRPL11 RIPK2 TXNL1 PTOV1 STARD7 KTN1 PDAP1 SH3GLB1 CCND2
SF3B1 ADH5 GTF3C6 UTP6 H1FX FH TRIM38 LCP2 RNF166 AKAP13 PHACTR4
RBM23 PPM1G SDF2L1 PACSIN2 MED28 TRA2A BCL7B TXNDC12 ELOF1 GOLT1B
BBX SNRPF PSMC2 LINC01003 SRRM1 COMMD5 GLUD1 IDH3B NDUFA6 MTFP1
NAA38 WBSCR22 NDUFB3 RBM25 CKS1B ELF1 MRPL55 ATP6V1E1 CXCR3
IRF9 MAPRE1 TUBB4B DCXR DUSP22 COMMD8 POLR2F FKBP2 GALM
MAT2B METTL23 XRN2 SFPQ PTPN1 DCTN2 ARHGAP30 CAST ACD
GPBP1 SNRPA1 CARHSP1 CHMP3 SRSF6 TMEM126B DDX46 SFT2D1 TNFRSF18
CHD2 AKR1B1 C11orf48 DAP3 TEN1 NME4 COMMD7 CISD3
EGLN2 PPP1R12A NDUFS4 MRPS15 COMMD4 CCZ1 PPP1R11 NDUFB6
ARL2BP STOML2 TCEB1 EIF2A TSSC1 RPF1 AUP1 UBE2F
RBBP7 MRPL9 IMP4 MFNG CWC25 YY1 PPP2R1A PSENEN
RPAIN MFF MRPS18C PSMF1 CEPT1 RAB11A PSMD13 FAM204A
NKTR PITHD1 SEC61G PHF5A CHD9 KPNB1 PET100 SCAMP2
PNN ANAPC5 ZDHHC12 TAF9 CD81 PDCL3 ITGB1BP1 NAPA
DARS CDC123 PTBP1 NDUFA10 RFC2 GLG1 TEX264 OS9
NCBP2 SNRPG HNRNPD SNW1 ACP5 METTL5 CSNK1A1 ASNA1
AATF LYRM4 NUDT1 FDX1 UROS LSM1 MLX
RSF1 GLRX3 PAK2 ECHS1 MAGED2 GSDMD MEA1
CHCHD7 SRP19 SEPT2 MT-ND4L ZCRB1 RTF1 UBE2A
AL592183.1 C17orf89 AK2 DHPS CCDC53 NME3 EMC3
BTF3L4 SNRNP70 RPS6KB2 PMF1 APOL3 TAF12 STK10
FLI1 PSMD6 TIMM17A SUCLG2 SHARPIN PUF60 TPGS1
NR3C1 NUDT5 DCTPP1 PRKCSH PPA2 IAH1 TMEM141
ITSN2 DDX39A HMGN2 BABAM1 ZFAND2B PDLIM2 PKN1
ZC3HAV1 HSPA9 MTCH2 PABPC4 SLFN5 MGMT UBE2E3
MTERFD2 VCP MRPL47 ATG12 CCDC90B MRPS12 CCDC124
GRAMD1A SRSF1 UQCC2 NXT1 TMBIM1 BRD2 COA3
ZNF24 EIF4A3 MRPL36 PPIE TSNAX CANX RPN1
USE1 CLPP NKAP PIH1D1 SEPHS2 NAA20 RAB4A
USP16 TTC1 DAZAP1 UBE2E1 SLC44A2 NOP56 TADA3
ARFGAP2 IDI1 MRPL15 QARS ASCC2 FUNDC2 MRPS11
FRG1B MRPS26 TIMM10 GID8 BFAR MTCH1 DNAJC2
TIMM9 HNRNPUL1 DTYMK MRPL3 CCDC25 TXNL4A FKBP5
MARCH7 HINT2 TOP1 USP15 NRBP1 PTRHD1 RRP7A
LENG1 SLBP PSMG3 MRPL32 PHF3 SLC25A11 DNAJB11
CYB5A PSMD14 DLD MAPKAPK5-AS1 SYS1 COX16 YIPF3
SMIM19 FKBP3 DCPS PTP4A1 PPM1B NDUFV1 DIAPH1
LINC-PINT SDHA TFDP1 EIF1AX YAF2 ARHGAP9 HNRNPH2
OARD1 NCBP2-AS2 RP11-139H15.1 LNPEP LSM3 PFDN6
TTC14 SNRNP40 FXR1 PEX16 GPAA1
TMEM242 RAD21 EIF2B1 SRSF4 MAPK1
DDIT3 VBP1 ESD PHF11 ACAA2
SAFB2 C14orf119 SF3A3 VTI1B PANK2
CEBPZ-AS1 FAM177A1 MCTS1 AKR7A2 IFNAR2
BRD9 HPRT1 CCNC URM1 MAD2L2
PRDM2 SLC25A39 COA6 AHSA1 CCDC88C
CD84 RPA3 PHYKPL JAGN1 RAB4B
EGR1 SNRPA CAMTA1 GRHPR G6PD
GLRX5 C19orf25 ROMO1 ARFGAP3
TIAL1 CHCHD1 PFDN1 BCCIP
EXOSC8 EIF4E STX8 RABL6
LBR C11orf83 RNPEPL1 SMC4
ILF3 SREK1 TMEM248 MVD
SAP30BP ANKRD11 GGA1 MRPS28
ACTR10 TMEM138 IFT20
NUDT21 PTGES2 AKAP9
UBE2G1 LGALS8 C18orf32
LARP7 MLEC BRMS1
PPIH C5orf15 CAPN1
EBNA1BP2 FEM1B DDRGK1
VOPP1 RAD23B
GNL3 WDR33
CISD2 WDR61
SSRP1 SUGT1
PDIA4 LAGE3
SYNCRIP RBBP4
C1orf35 ESYT1
MAP7D1 APOA1BP
DNAJC9 MIF4GD
HAUS1 CFDP1
ILKAP UBE2J2
RPUSD3 MRPS5
CDKN2AIPNL SRPK2
POLD2 FAM200B
HNRNPAB C17orf49
TMEM106C NUBP2
CBX1 PRKAG1
NAA50 SURF2
MCM3 SSSCA1
CISD1 EI24
TFPT CSNK1D
UBA5 DCTD
LMAN1 PEX2
PGRMC2 PNKP
C19orf48 TMEM70
C14orf142 JMJD6
PAICS DCAF5
RCE1

Module lookup

If you want to quickly find which module a particular feature was assigned to, the featureModuleLookup function can be used. Here will will look up a marker gene for T-cells called “CD3E”:

mod <- featureModuleLookup(sce, feature = c("CD3E", "S100A8"))
mod
##   CD3E S100A8 
##     27     70

Module heatmaps

The function moduleHeatmap can be used to view the expression of features across cells for a specific module. The featureModule parameter denotes the module(s) to be displayed. Cells are ordered from those with the lowest probability of the module on the left to the highest probability on the right. Similarly, features are ordered from those with the highest probability within the module on the top to the lowest probability on the bottom.

moduleHeatmap(sce, featureModule = 27, useAssay = useAssay, altExpName = altExpName)

The parameter topCells can be used to control the number of cells included in the heatmap. By default, only the 100 cells with the lowest probabilities and the 100 cells with the highest probabilities for each selected module are included (i.e. topCells = 100 by default). To display all cells, this parameter can be set to NULL:

moduleHeatmap(sce, featureModule = 27, topCells = NULL, useAssay = useAssay, altExpName = altExpName)

Note: Multiple modules can be displayed by giving a vector of module indices to the parameter featureModule. If featureModule is not specified, then all modules will be plotted.

Module probabilities on 2-D embeddings

The function plotDimReduceModule can be used visualize the probabilities of a particular module or sets of modules on a reduced dimensional plot such as a UMAP. This can be another quick method to see how modules are expressed across various cells in 2-D space. As an example, we can look at module 70 which contained S100A8:

plotDimReduceModule(sce, modules = 70, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP")

Similarly, multiple modules can be plotting in a grid of UMAPs:

plotDimReduceModule(sce, modules = 70:78, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP")

In this grid, we can see that module 70 (which has high levels of S100A8 and S100A9) is highly expressed in cell populations 2 and 3, module 71 (which contains CD14) can be used to identify all CD14+ monocytes, module 72 (which contains CST3) is expressed across both CD14 and FCGR3A (CD16) expressing monocytes, and module 73 (which contains CD4) is expressed broadly across both monocytes and dendritic cells as well as some T-cell populations. If we were interesting in defining transcriptional programs active across all monocytes, we could examine the genes found in module 72. If we were interested in defining transcriptional programs for all CD14+ monocytes, we could examine the genes in module 71. These patterns can also be observed in the Probability Map

In the celda probability map, we saw that the unknown T-cell population 13 had high levels of module 30. We can examine both module heatmaps and module probability maps to further explore this:

moduleHeatmap(sce, featureModule = 30, useAssay = useAssay, altExpName = altExpName)

plotDimReduceModule(sce, modules = 30, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP")

Module 30 has high levels of genes associated with proliferation including HMGA1, STMN1, PCNA, HMGB2, and TUBA1B. We can therefore re-label these cells as “Proliferating T-cells”.

Identification and plotting of marker genes

In addition to examining modules, differential expression can be used to identify potential marker genes up-regulated in specific cell populations. The function findMarkerDiffExp in the singleCellTK package will find markers up-regulated in each cell population compared to all the others.

Differential expression to identify marker genes

# Normalize counts (if not performed previously)
library(scater)
sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts")

# Run differential expression analysis
sce <- findMarkerDiffExp(sce, useAssay = "logcounts", method = "wilcox", cluster = celdaClusters(sce), minMeanExpr = 0, fdrThreshold = 0.05, log2fcThreshold = 0, minClustExprPerc = 0, maxCtrlExprPerc = 1)

The function plotMarkerDiffExp can be used to plot the results in a heatmap. The topN parameter will plot the top N ranked genes for each cluster.

# Plot differentially expressed genes that pass additional thresholds 'minClustExprPerc' and 'maxCtrlExprPerc'
plotMarkerDiffExp(sce, topN = 5, log2fcThreshold = 0, rowLabel = TRUE, fdrThreshold = 0.05, minClustExprPerc = 0.6, maxCtrlExprPerc = 0.4, minMeanExpr = 0)

Other parameters such as minClustExprPerc (the minimum number of cells expressing the marker gene in the cluster) and maxCtrlExprPerc (the maximum number of cells expression the marker gene in other clusters) can be used to control how specific each marker gene is to each cell populations. Similarly, adding a log2 fold-change cutoff (e.g. 1) can select for markers that are more strongly up-regulated in a cell population.

Violin plots for marker genes

The plotCeldaViolin function can be used to examine the distribution of expression of various features across cell population clusters derived from celda. Here we can see that the gene CD79A has high expression in the B-cell cluster and HMGB2 has high expression in the proliferating T-cell population.

# Normalize counts if not performed in previous steps
library(scater)
sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts")

# Make violin plots for marker genes
plotCeldaViolin(sce, useAssay = "logcounts", features = c("CD79A", "HMGB2"))

Generating HTML reports

The celda package comes with two functions for generating comprehensive HTML reports that 1) capture the process of selecting K/L for a celda_CG model and 2) plot the results from the downstream analysis. The first report runs both recursiveSplitModule and recursiveSplitCell for selection of L and K, respectively. To recapitulate the complete analysis presented in this tutorial in the HTML report, the following command can be used:

sce <- reportCeldaCGRun(sce, sampleLabel = NULL, useAssay = useAssay, altExpName = altExpName, minCell = 3, minCount = 3, initialL = 10, maxL = 150, initialK = 3, maxK = 25, L = 80, K = 14)

All of the parameters in this function are the same that were used throughout this tutorial in the selectFeatures, recursiveSplitModule, and recursiveSplitCell functions. Note that this report does not do cell filtering, so that must be completed before running this function. The returned SCE object will have the celda_CG model with selected K and L which can be used in any of the downstream plotting functions as well as input into the second plotting report described next.

The second report takes in as input an SCE object with a fitted celda_CG model and systematically generates several plots that facilitate exploratory analysis including cell subpopulation cluster labels on 2-D embeddings, user-specified annotations on 2-D embeddings, module heatmaps, module probabilities, expression of marker genes on 2-D embeddings, and the celda probability map. The report can be generated with the following code:

reportCeldaCGPlotResults(sce, reducedDimName = "celda_UMAP", features = markers, useAssay = useAssay, altExpName = altExpName, cellAnnot = c("total", "detected", "decontX_contamination", "subsets_mito_percent"), cellAnnotLabel = "scDblFinder_class")

User-supplied annotations to plot on the 2-D embedding can be specified through the cellAnnot and cellAnnotLabel variables. Both parameters will allow for plotting of variables stored in the colData of the SCE on the 2-D embedding plot specified by reducedDimName parameter. For cellAnnot, integer and numeric variables will be plotted as as continuous variables while factors and characters will be plotted as categorical variables. For cellAnnotLabel, all variables will be coerced to a factor and the labels of the categories will be plotted on the scatter plot.

Other useful functions

Matrix factorization

The celda model factorizes the original matrix into three matrices:

1) module - The probability of each feature in each module (Psi)

2) cellPopulation - The probability of each module in each cell population (Phi)

3) sample - The probability of each cell population in each sample (Theta)

Additionally, we can calculate the probability of each module within each cell (cell). The cell matrix can essentially be used to replace PCs from PCA and is useful for downstream visualization (e.g. generating 2-D embeddings). All of these matrices can be retrieved with the factorizeMatrix function. The matrices are returned in three different versions: unnormalized counts, proportions (normalized by the total), or posterior estimates (where the Dirichlet concentration parameter is added in before normalization).

# Factorize the original counts matrix
fm <- factorizeMatrix(sce)

# Three different version of each matrix:
names(fm)
## [1] "counts"      "proportions" "posterior"
# Get normalized proportional matrices
dim(fm$proportions$cell) # Matrix of module probabilities for each cell
## [1]   80 2675
dim(fm$proportions$module) # Matrix of feature probabilities for each module
## [1] 2639   80
dim(fm$proportions$cellPopulation) # Matrix of module probabilities for each cell population
## [1] 80 14
dim(fm$proportions$sample) # Matrix of cell population probabilities in each sample
## [1] 14  1

Changing the feature display name

The parameter displayName can be used to change the labels of the rows from the rownames to a column in the rowData of the SCE object. The function is available in plotDimReduceFeature and moduleHeatmap. For example, if we did not change the rownames to Symbol_TENx in the beginning of the tutorial, the following code still could be run in moduleHeatmap to display the gene symbol even if the rownames were set to the original Ensembl IDs:

moduleHeatmap(sce, featureModule = 27, useAssay = useAssay, altExpName = altExpName, displayName = "Symbol_TENx")

Session information

sessionInfo()

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.18.6               kableExtra_1.3.4           
##  [3] knitr_1.31                  ggplot2_3.3.5              
##  [5] celda_1.12.0                singleCellTK_2.5.1         
##  [7] TENxPBMCData_1.8.0          HDF5Array_1.18.1           
##  [9] rhdf5_2.34.0                DelayedArray_0.16.2        
## [11] Matrix_1.3-2                SingleCellExperiment_1.12.0
## [13] SummarizedExperiment_1.20.0 Biobase_2.50.0             
## [15] GenomicRanges_1.42.0        GenomeInfoDb_1.26.4        
## [17] IRanges_2.24.1              S4Vectors_0.28.1           
## [19] BiocGenerics_0.36.0         MatrixGenerics_1.2.1       
## [21] matrixStats_0.58.0         
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.1                    reticulate_1.18              
##   [3] R.utils_2.10.1                tidyselect_1.1.0             
##   [5] RSQLite_2.2.4                 AnnotationDbi_1.52.0         
##   [7] grid_4.0.4                    combinat_0.0-8               
##   [9] BiocParallel_1.24.1           Rtsne_0.15                   
##  [11] scDblFinder_1.4.0             DropletUtils_1.10.3          
##  [13] munsell_0.5.0                 codetools_0.2-18             
##  [15] ragg_1.1.3                    statmod_1.4.35               
##  [17] scran_1.18.5                  xgboost_1.3.2.1              
##  [19] withr_2.4.1                   colorspace_2.0-0             
##  [21] highr_0.8                     rstudioapi_0.13              
##  [23] assertive.base_0.0-9          labeling_0.4.2               
##  [25] GenomeInfoDbData_1.2.4        GSVAdata_1.26.0              
##  [27] bit64_4.0.5                   farver_2.1.0                 
##  [29] rprojroot_2.0.2               vctrs_0.3.6                  
##  [31] generics_0.1.0                xfun_0.22                    
##  [33] BiocFileCache_1.14.0          fishpond_1.6.0               
##  [35] R6_2.5.0                      doParallel_1.0.16            
##  [37] ggbeeswarm_0.6.0              clue_0.3-58                  
##  [39] rsvd_1.0.3                    RcppEigen_0.3.3.9.1          
##  [41] locfit_1.5-9.4                bitops_1.0-6                 
##  [43] rhdf5filters_1.2.0            cachem_1.0.4                 
##  [45] gridGraphics_0.5-1            assertthat_0.2.1             
##  [47] promises_1.2.0.1              scales_1.1.1                 
##  [49] beeswarm_0.3.1                gtable_0.3.0                 
##  [51] beachmat_2.6.4                Cairo_1.5-12.2               
##  [53] rlang_0.4.10                  systemfonts_1.0.1            
##  [55] GlobalOptions_0.1.2           BiocManager_1.30.10          
##  [57] yaml_2.2.1                    reshape2_1.4.4               
##  [59] httpuv_1.5.5                  tools_4.0.4                  
##  [61] ellipsis_0.3.1                jquerylib_0.1.3              
##  [63] RColorBrewer_1.1-2            Rcpp_1.0.6                   
##  [65] plyr_1.8.6                    sparseMatrixStats_1.2.1      
##  [67] zlibbioc_1.36.0               purrr_0.3.4                  
##  [69] RCurl_1.98-1.2                dbscan_1.1-6                 
##  [71] GetoptLong_1.0.5              viridis_0.5.1                
##  [73] cowplot_1.1.1                 cluster_2.1.0                
##  [75] ggrepel_0.9.1                 fs_1.5.0                     
##  [77] magrittr_2.0.1                data.table_1.14.0            
##  [79] RSpectra_0.16-0               magick_2.7.0                 
##  [81] circlize_0.4.12               mime_0.10                    
##  [83] evaluate_0.14                 xtable_1.8-4                 
##  [85] gridExtra_2.3                 shape_1.4.5                  
##  [87] compiler_4.0.4                tibble_3.1.0                 
##  [89] crayon_1.4.1                  R.oo_1.24.0                  
##  [91] htmltools_0.5.1.1             later_1.1.0.1                
##  [93] MCMCprecision_0.4.0           DBI_1.1.1                    
##  [95] ExperimentHub_1.16.0          assertive.files_0.0-2        
##  [97] dbplyr_2.1.0                  ComplexHeatmap_2.6.2         
##  [99] rappdirs_0.3.3                assertive.numbers_0.0-2      
## [101] assertive.types_0.0-3         R.methodsS3_1.8.1            
## [103] igraph_1.2.6                  pkgconfig_2.0.3              
## [105] pkgdown_1.6.1                 scuttle_1.0.4                
## [107] xml2_1.3.2                    foreach_1.5.1                
## [109] svglite_2.0.0                 vipor_0.4.5                  
## [111] bslib_0.2.4                   dqrng_0.2.1                  
## [113] webshot_0.5.2                 XVector_0.30.0               
## [115] rvest_1.0.0                   stringr_1.4.0                
## [117] digest_0.6.27                 rmarkdown_2.7                
## [119] enrichR_3.0                   uwot_0.1.10                  
## [121] edgeR_3.32.1                  DelayedMatrixStats_1.12.3    
## [123] curl_4.3                      shiny_1.6.0                  
## [125] gtools_3.8.2                  rjson_0.2.20                 
## [127] lifecycle_1.0.0               jsonlite_1.7.2               
## [129] Rhdf5lib_1.12.1               BiocNeighbors_1.8.2          
## [131] desc_1.3.0                    viridisLite_0.3.0            
## [133] limma_3.46.0                  fansi_0.4.2                  
## [135] pillar_1.5.1                  lattice_0.20-41              
## [137] fastmap_1.1.0                 httr_1.4.2                   
## [139] interactiveDisplayBase_1.28.0 glue_1.4.2                   
## [141] FNN_1.1.3                     png_0.1-7                    
## [143] iterators_1.0.13              multipanelfigure_2.1.2       
## [145] bluster_1.0.0                 BiocVersion_3.12.0           
## [147] bit_4.0.4                     assertive.properties_0.0-4   
## [149] stringi_1.5.3                 sass_0.3.1                   
## [151] blob_1.2.1                    textshaping_0.3.5            
## [153] BiocSingular_1.6.0            AnnotationHub_2.22.0         
## [155] memoise_2.0.0                 dplyr_1.0.5                  
## [157] irlba_2.3.3

Site built with pkgdown 1.6.1.

================================================ FILE: docs/articles/articles/celda_pbmc3k_files/accessible-code-block-0.0.1/empty-anchor.js ================================================ // Hide empty tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); ================================================ FILE: docs/articles/articles/celda_pbmc3k_files/header-attrs-2.7/header-attrs.js ================================================ // Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); ================================================ FILE: docs/articles/articles/celda_pbmc3k_files/kePrint-0.0.1/kePrint.js ================================================ $(document).ready(function(){ if (typeof $('[data-toggle="tooltip"]').tooltip === 'function') { $('[data-toggle="tooltip"]').tooltip(); } if ($('[data-toggle="popover"]').popover === 'function') { $('[data-toggle="popover"]').popover(); } }); ================================================ FILE: docs/articles/articles/celda_pbmc3k_files/lightable-0.0.1/lightable.css ================================================ /*! * lightable v0.0.1 * Copyright 2020 Hao Zhu * Licensed under MIT (https://github.com/haozhu233/kableExtra/blob/master/LICENSE) */ .lightable-minimal { border-collapse: separate; border-spacing: 16px 1px; width: 100%; margin-bottom: 10px; } .lightable-minimal td { margin-left: 5px; margin-right: 5px; } .lightable-minimal th { margin-left: 5px; margin-right: 5px; } .lightable-minimal thead tr:last-child th { border-bottom: 2px solid #00000050; empty-cells: hide; } .lightable-minimal tbody tr:first-child td { padding-top: 0.5em; } .lightable-minimal.lightable-hover tbody tr:hover { background-color: #f5f5f5; } .lightable-minimal.lightable-striped tbody tr:nth-child(even) { background-color: #f5f5f5; } .lightable-classic { border-top: 0.16em solid #111111; border-bottom: 0.16em solid #111111; width: 100%; margin-bottom: 10px; margin: 10px 5px; } .lightable-classic tfoot tr td { border: 0; } .lightable-classic tfoot tr:first-child td { border-top: 0.14em solid #111111; } .lightable-classic caption { color: #222222; } .lightable-classic td { padding-left: 5px; padding-right: 5px; color: #222222; } .lightable-classic th { padding-left: 5px; padding-right: 5px; font-weight: normal; color: #222222; } .lightable-classic thead tr:last-child th { border-bottom: 0.10em solid #111111; } .lightable-classic.lightable-hover tbody tr:hover { background-color: #F9EEC1; } .lightable-classic.lightable-striped tbody tr:nth-child(even) { background-color: #f5f5f5; } .lightable-classic-2 { border-top: 3px double #111111; border-bottom: 3px double #111111; width: 100%; margin-bottom: 10px; } .lightable-classic-2 tfoot tr td { border: 0; } .lightable-classic-2 tfoot tr:first-child td { border-top: 3px double #111111; } .lightable-classic-2 caption { color: #222222; } .lightable-classic-2 td { padding-left: 5px; padding-right: 5px; color: #222222; } .lightable-classic-2 th { padding-left: 5px; padding-right: 5px; font-weight: normal; color: #222222; } .lightable-classic-2 tbody tr:last-child td { border-bottom: 3px double #111111; } .lightable-classic-2 thead tr:last-child th { border-bottom: 1px solid #111111; } .lightable-classic-2.lightable-hover tbody tr:hover { background-color: #F9EEC1; } .lightable-classic-2.lightable-striped tbody tr:nth-child(even) { background-color: #f5f5f5; } .lightable-material { min-width: 100%; white-space: nowrap; table-layout: fixed; font-family: Roboto, sans-serif; border: 1px solid #EEE; border-collapse: collapse; margin-bottom: 10px; } .lightable-material tfoot tr td { border: 0; } .lightable-material tfoot tr:first-child td { border-top: 1px solid #EEE; } .lightable-material th { height: 56px; padding-left: 16px; padding-right: 16px; } .lightable-material td { height: 52px; padding-left: 16px; padding-right: 16px; border-top: 1px solid #eeeeee; } .lightable-material.lightable-hover tbody tr:hover { background-color: #f5f5f5; } .lightable-material.lightable-striped tbody tr:nth-child(even) { background-color: #f5f5f5; } .lightable-material.lightable-striped tbody td { border: 0; } .lightable-material.lightable-striped thead tr:last-child th { border-bottom: 1px solid #ddd; } .lightable-material-dark { min-width: 100%; white-space: nowrap; table-layout: fixed; font-family: Roboto, sans-serif; border: 1px solid #FFFFFF12; border-collapse: collapse; margin-bottom: 10px; background-color: #363640; } .lightable-material-dark tfoot tr td { border: 0; } .lightable-material-dark tfoot tr:first-child td { border-top: 1px solid #FFFFFF12; } .lightable-material-dark th { height: 56px; padding-left: 16px; padding-right: 16px; color: #FFFFFF60; } .lightable-material-dark td { height: 52px; padding-left: 16px; padding-right: 16px; color: #FFFFFF; border-top: 1px solid #FFFFFF12; } .lightable-material-dark.lightable-hover tbody tr:hover { background-color: #FFFFFF12; } .lightable-material-dark.lightable-striped tbody tr:nth-child(even) { background-color: #FFFFFF12; } .lightable-material-dark.lightable-striped tbody td { border: 0; } .lightable-material-dark.lightable-striped thead tr:last-child th { border-bottom: 1px solid #FFFFFF12; } .lightable-paper { width: 100%; margin-bottom: 10px; color: #444; } .lightable-paper tfoot tr td { border: 0; } .lightable-paper tfoot tr:first-child td { border-top: 1px solid #00000020; } .lightable-paper thead tr:last-child th { color: #666; vertical-align: bottom; border-bottom: 1px solid #00000020; line-height: 1.15em; padding: 10px 5px; } .lightable-paper td { vertical-align: middle; border-bottom: 1px solid #00000010; line-height: 1.15em; padding: 7px 5px; } .lightable-paper.lightable-hover tbody tr:hover { background-color: #F9EEC1; } .lightable-paper.lightable-striped tbody tr:nth-child(even) { background-color: #00000008; } .lightable-paper.lightable-striped tbody td { border: 0; } ================================================ FILE: docs/articles/articles/decontX_pbmc4k.html ================================================ Decontamination of ambient RNA in single-cell genomic data with DecontX • celda

Introduction

Droplet-based microfluidic devices have become widely used to perform single-cell RNA sequencing (scRNA-seq). However, ambient RNA present in the cell suspension can be aberrantly counted along with a cell’s native mRNA and result in cross-contamination of transcripts between different cell populations. DecontX is a Bayesian method to estimate and remove contamination in individual cells. DecontX assumes the observed expression of a cell is a mixture of counts from two multinomial distributions: (1) a distribution of native transcript counts from the cell’s actual population and (2) a distribution of contaminating transcript counts from all other cell populations captured in the assay. Overall, computational decontamination of single cell counts can aid in downstream clustering and visualization.

The package can be loaded using the library command.

library(celda)

Importing data

DecontX can take either a SingleCellExperiment object or a counts matrix as input. decontX will attempt to convert any input matrix to class dgCMatrix from package Matrix before starting the analysis.

To import datasets directly into an SCE object, the singleCellTK package has several importing functions for different preprocessing tools including CellRanger, STARsolo, BUStools, Optimus, DropEST, SEQC, and Alevin/Salmon. For example, the following code can be used as a template to read in the filtered and raw matrices for multiple samples processed with CellRanger:

library(singleCellTK)
sce <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"))

Within each sample directory, there should be subfolders called "outs/filtered_feature_bc_matrix/" or "outs/raw_feature_bc_matrix/" with files called matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz. If these files are in different subdirectories, the importCellRangerV3Sample function can be used to import data from a different directory instead.

Optionally, the “raw” or “droplet” matrix can also be easily imported by setting the dataType argument to “raw”:

sce.raw <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"), dataType = "raw")

The raw matrix can be passed to the background parameter in decontX as described below. If using Seurat, go to the Working with Seurat section for details on how to convert between SCE and Seurat objects.

Load PBMC4k data from 10X

We will utilize the 10X PBMC 4K dataset as an example in this vignette. This data can be easily retrieved from the package TENxPBMCData. Make sure the the column names are set before running decontX.

# Load PBMC data
library(TENxPBMCData)
sce <- TENxPBMCData("pbmc4k")
colnames(sce) <- paste(sce$Sample, sce$Barcode, sep = "_")
rownames(sce) <- rowData(sce)$Symbol_TENx
counts(sce) <- as(counts(sce), "dgCMatrix")

Running decontX

A SingleCellExperiment (SCE) object or a sparse matrix containing the counts for filtered cells can be passed to decontX via the x parameter. The matrix to use in an SCE object can be specified with the assayName parameter, which is set to "counts" by default. There are two major ways to run decontX: with and without the raw/droplet matrix containing empty droplets. Here is an example of running decontX without supplying the background:

sce <- decontX(sce)

In this scenario, decontX will estimate the contamination distribution for each cell cluster based on the profiles of the other cell clusters in the filtered dataset. The estimated contamination results can be found in the colData(sce)$decontX_contamination and the decontaminated counts can be accessed with decontXcounts(sce). decontX will perform heuristic clustering to quickly define major cell clusters. However if you have your own cell cluster labels, they can be specified with the z parameter. These results will be used throughout the rest of the vignette.

The raw/droplet matrix can be used to empirically estimate the distribution of ambient RNA, which is especially useful when cells that contributed to the ambient RNA are not accurately represented in the filtered count matrix containing the cells. For example, cells that were removed via flow cytometry or that were more sensitive to lysis during dissociation may have contributed to the ambient RNA but were not measured in the filtered/cell matrix. The raw/droplet matrix can be input as an SCE object or a sparse matrix using the background parameter:

sce <- decontX(sce, background = sce.raw)

Only empty droplets in the background matrix should be used to estimate the ambient RNA. If any cell ids (i.e. colnames) in the raw/droplet matrix supplied to the background parameter are also found in the filtered counts matrix (x), decontX will automatically remove them from the raw matrix. However, if the cell ids are not available for the input matrices, decontX will treat the entire background input as empty droplets. All of the outputs are the same as when running decontX without setting the background parameter.

Note: If the input object is just a matrix and not an SCE object, make sure to save the output into a variable with a different name (e.g. result <- decontX(mat)). The result object will be a list with contamination in result$contamination and the decontaminated counts in result$decontXcounts.

Plotting DecontX results

Cluster labels on UMAP

DecontX creates a UMAP which we can use to plot the cluster labels automatically identified in the analysis. Note that the clustering approach used here is designed to find “broad” cell types rather than individual cell subpopulations within a cell type.

umap <- reducedDim(sce, "decontX_UMAP")
plotDimReduceCluster(x = sce$decontX_clusters,
    dim1 = umap[, 1], dim2 = umap[, 2])

Contamination on UMAP

The percentage of contamination in each cell can be plotting on the UMAP to visualize what what clusters may have higher levels of ambient RNA.

Expression of markers on UMAP

Known marker genes can also be plotted on the UMAP to identify the cell types for each cluster. We will use CD3D and CD3E for T-cells, LYZ, S100A8, and S100A9 for monocytes, CD79A, CD79B, and MS4A1 for B-cells, GNLY for NK-cells, and PPBP for megakaryocytes.

library(scater)
sce <- logNormCounts(sce)
plotDimReduceFeature(as.matrix(logcounts(sce)),
    dim1 = umap[, 1],
    dim2 = umap[, 2],
    features = c("CD3D", "CD3E", "GNLY",
        "LYZ", "S100A8", "S100A9",
        "CD79A", "CD79B", "MS4A1"),
    exactMatch = TRUE)

Barplot of markers detected in cell clusters

The percetage of cells within a cluster that have detectable expression of marker genes can be displayed in a barplot. Markers for cell types need to be supplied in a named list. First, the detection of marker genes in the original counts assay is shown:

markers <- list(Tcell_Markers = c("CD3E", "CD3D"),
    Bcell_Markers = c("CD79A", "CD79B", "MS4A1"),
    Monocyte_Markers = c("S100A8", "S100A9", "LYZ"),
    NKcell_Markers = "GNLY")
cellTypeMappings <- list(Tcells = 2, Bcells = 5, Monocytes = 1, NKcells = 6)
plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = "counts")

We can then look to see how much decontX removed aberrant expression of marker genes in each cell type by changing the assayName to decontXcounts:

plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = "decontXcounts")

Percentages of marker genes detected in other cell types were reduced or completely removed. For example, the percentage of cells that expressed Monocyte marker genes was greatly reduced in T-cells, B-cells, and NK-cells. The original counts and decontamined counts can be plotted side-by-side by listing multiple assays in the assayName parameter. This option is only available if the data is stored in SingleCellExperiment object.

plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = c("counts", "decontXcounts"))

Some helpful hints when using plotDecontXMarkerPercentage:

  1. Cell clusters can be renamed and re-grouped using the groupCluster parameter, which also needs to be a named list. If groupCluster is used, cell clusters not included in the list will be excluded in the barplot. For example, if we wanted to group T-cells and NK-cells together, we could set cellTypeMappings <- list(NK_Tcells = c(2,6), Bcells = 5, Monocytes = 1)
  2. The level a gene that needs to be expressed to be considered detected in a cell can be adjusted using the threshold parameter.
  3. If you are not using a SingleCellExperiment, then you will need to supply the original counts matrix or the decontaminated counts matrix as the first argument to generate the barplots.

Violin plot to compare the distributions of original and decontaminated counts

Another useful way to assess the amount of decontamination is to view the expression of marker genes before and after decontX across cell types. Here we view the monocyte markers in each cell type. The violin plot shows that the markers have been removed from T-cells, B-cells, and NK-cells, but are largely unaffected in monocytes.

plotDecontXMarkerExpression(sce,
    markers = markers[["Monocyte_Markers"]],
    groupClusters = cellTypeMappings,
    ncol = 3)

Some helpful hints when using plotDecontXMarkerExpression:

  1. groupClusters works the same way as in plotDecontXMarkerPercentage.
  2. This function will plot each pair of markers and clusters (or cell type specified by groupClusters). Therefore, you may want to keep the number of markers small in each plot and call the function multiple times for different sets of marker genes.
  3. You can also plot the individual points by setting plotDots = TRUE and/or log transform the points on the fly by setting log1p = TRUE.
  4. This function can plot any assay in a SingleCellExperiment. Therefore you could also examine normalized expression of the original and decontaminated counts. For example:
library(scater)
sce <- logNormCounts(sce,
    exprs_values = "decontXcounts",
    name = "decontXlogcounts")

plotDecontXMarkerExpression(sce,
    markers = markers[["Monocyte_Markers"]],
    groupClusters = cellTypeMappings,
    ncol = 3,
    assayName = c("logcounts", "decontXlogcounts"))

Other important notes

Choosing appropriate cell clusters

The ability of DecontX to accurately identify contamination is dependent on the cell cluster labels. DecontX assumes that contamination for a cell cluster comes from combination of counts from all other clusters. The default clustering approach used by DecontX tends to select fewer clusters that represent broader cell types. For example, all T-cells tend to be clustered together rather than splitting naive and cytotoxic T-cells into separate clusters. Custom cell type labels can be suppled via the z parameter if some cells are not being clustered appropriately by the default method.

Adjusting the priors to influence contamination estimates

There are ways to force decontX to estimate more or less contamination across a dataset by manipulating the priors. The delta parameter is a numeric vector of length two. It is the concentration parameter for the Dirichlet distribution which serves as the prior for the proportions of native and contamination counts in each cell. The first element is the prior for the proportion of native counts while the second element is the prior for the proportion of contamination counts. These essentially act as pseudocounts for the native and contamination in each cell. If estimateDelta = TRUE, delta is only used to produce a random sample of proportions for an initial value of contamination in each cell. Then delta is updated in each iteration. If estimateDelta = FALSE, then delta is fixed with these values for the entire inference procedure. Fixing delta and setting a high number in the second element will force decontX to be more aggressive and estimate higher levels of contamination in each cell at the expense of potentially removing native expression. For example, in the previous PBMC example, we can see what the estimated delta was by looking in the estimates:

metadata(sce)$decontX$estimates$all_cells$delta
## [1] 9.287164 1.038217

Setting a higher value in the second element of delta and estimateDelta = FALSE will force decontX to estimate higher levels of contamination per cell:

sce.delta <- decontX(sce, delta = c(9, 20), estimateDelta = FALSE)

plot(sce$decontX_contamination, sce.delta$decontX_contamination,
     xlab = "DecontX estimated priors",
     ylab = "Setting priors to estimate higher contamination")
abline(0, 1, col = "red", lwd = 2)

Working with Seurat

If you are using the Seurat package for downstream analysis, the following code can be used to read in a matrix and convert between Seurat and SCE objects:

# Read counts from CellRanger output
library(Seurat)
counts <- Read10X("sample/outs/filtered_feature_bc_matrix/")

# Create a SingleCellExperiment object and run decontX
sce <- SingleCellExperiment(list(counts = counts))
sce <- decontX(sce)

# Create a Seurat object from a SCE with decontX results
seuratObject <- CreateSeuratObject(round(decontXcounts(sce)))

Optionally, the “raw” matrix can be also be imported and used as the background:

counts.raw <- Read10X("sample/outs/raw_feature_bc_matrix/")
sce.raw <- SingleCellExperiment(list(counts = counts.raw))
sce <- decontX(sce, background = sce.raw)

Note that the decontaminated matrix of decontX consists of floating point numbers and must be rounded to integers before adding it to a Seurat object. If you already have a Seurat object containing the counts matrix and would like to run decontX, you can retrieve the count matrix, create a SCE object, and run decontX, and then add it back to the Seurat object:

counts <- GetAssayData(object = seuratObject, slot = "counts")
sce <- SingleCellExperiment(list(counts = counts))
sce <- decontX(sce)
seuratObj[["decontXcounts"]] <- CreateAssayObject(counts = decontXcounts(sce))

Session Information

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.18.6               ggplot2_3.3.5              
##  [3] TENxPBMCData_1.8.0          HDF5Array_1.18.1           
##  [5] rhdf5_2.34.0                DelayedArray_0.16.2        
##  [7] celda_1.12.0                Matrix_1.3-2               
##  [9] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
## [11] Biobase_2.50.0              GenomicRanges_1.42.0       
## [13] GenomeInfoDb_1.26.4         IRanges_2.24.1             
## [15] S4Vectors_0.28.1            BiocGenerics_0.36.0        
## [17] MatrixGenerics_1.2.1        matrixStats_0.58.0         
## [19] BiocStyle_2.18.1           
## 
## loaded via a namespace (and not attached):
##   [1] AnnotationHub_2.22.0          BiocFileCache_1.14.0         
##   [3] systemfonts_1.0.1             RcppEigen_0.3.3.9.1          
##   [5] plyr_1.8.6                    assertive.files_0.0-2        
##   [7] enrichR_3.0                   multipanelfigure_2.1.2       
##   [9] BiocParallel_1.24.1           digest_0.6.27                
##  [11] foreach_1.5.1                 htmltools_0.5.1.1            
##  [13] viridis_0.5.1                 magick_2.7.0                 
##  [15] fansi_0.4.2                   magrittr_2.0.1               
##  [17] memoise_2.0.0                 assertive.numbers_0.0-2      
##  [19] doParallel_1.0.16             pkgdown_1.6.1                
##  [21] colorspace_2.0-0              blob_1.2.1                   
##  [23] rappdirs_0.3.3                ggrepel_0.9.1                
##  [25] textshaping_0.3.5             xfun_0.22                    
##  [27] dplyr_1.0.5                   crayon_1.4.1                 
##  [29] RCurl_1.98-1.2                jsonlite_1.7.2               
##  [31] iterators_1.0.13              glue_1.4.2                   
##  [33] gtable_0.3.0                  zlibbioc_1.36.0              
##  [35] XVector_0.30.0                BiocSingular_1.6.0           
##  [37] Rhdf5lib_1.12.1               scales_1.1.1                 
##  [39] DBI_1.1.1                     Rcpp_1.0.6                   
##  [41] viridisLite_0.3.0             xtable_1.8-4                 
##  [43] gridGraphics_0.5-1            bit_4.0.4                    
##  [45] rsvd_1.0.3                    httr_1.4.2                   
##  [47] RColorBrewer_1.1-2            ellipsis_0.3.1               
##  [49] farver_2.1.0                  pkgconfig_2.0.3              
##  [51] scuttle_1.0.4                 sass_0.3.1                   
##  [53] uwot_0.1.10                   dbplyr_2.1.0                 
##  [55] utf8_1.2.1                    labeling_0.4.2               
##  [57] tidyselect_1.1.0              rlang_0.4.10                 
##  [59] reshape2_1.4.4                later_1.1.0.1                
##  [61] AnnotationDbi_1.52.0          munsell_0.5.0                
##  [63] BiocVersion_3.12.0            tools_4.0.4                  
##  [65] cachem_1.0.4                  dbscan_1.1-6                 
##  [67] generics_0.1.0                RSQLite_2.2.4                
##  [69] ExperimentHub_1.16.0          evaluate_0.14                
##  [71] stringr_1.4.0                 fastmap_1.1.0                
##  [73] yaml_2.2.1                    ragg_1.1.3                   
##  [75] knitr_1.31                    bit64_4.0.5                  
##  [77] fs_1.5.0                      purrr_0.3.4                  
##  [79] sparseMatrixStats_1.2.1       mime_0.10                    
##  [81] compiler_4.0.4                beeswarm_0.3.1               
##  [83] curl_4.3                      interactiveDisplayBase_1.28.0
##  [85] tibble_3.1.0                  bslib_0.2.4                  
##  [87] stringi_1.5.3                 highr_0.8                    
##  [89] RSpectra_0.16-0               desc_1.3.0                   
##  [91] lattice_0.20-41               assertive.base_0.0-9         
##  [93] vctrs_0.3.6                   pillar_1.5.1                 
##  [95] lifecycle_1.0.0               rhdf5filters_1.2.0           
##  [97] BiocManager_1.30.10           combinat_0.0-8               
##  [99] jquerylib_0.1.3               RcppAnnoy_0.0.18             
## [101] BiocNeighbors_1.8.2           data.table_1.14.0            
## [103] bitops_1.0-6                  irlba_2.3.3                  
## [105] httpuv_1.5.5                  assertive.types_0.0-3        
## [107] R6_2.5.0                      bookdown_0.21                
## [109] assertive.properties_0.0-4    promises_1.2.0.1             
## [111] gridExtra_2.3                 vipor_0.4.5                  
## [113] codetools_0.2-18              MCMCprecision_0.4.0          
## [115] assertthat_0.2.1              rprojroot_2.0.2              
## [117] rjson_0.2.20                  withr_2.4.1                  
## [119] GenomeInfoDbData_1.2.4        grid_4.0.4                   
## [121] beachmat_2.6.4                rmarkdown_2.7                
## [123] DelayedMatrixStats_1.12.3     Rtsne_0.15                   
## [125] shiny_1.6.0                   ggbeeswarm_0.6.0

Site built with pkgdown 1.6.1.

================================================ FILE: docs/articles/articles/decontX_pbmc4k_files/accessible-code-block-0.0.1/empty-anchor.js ================================================ // Hide empty tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); ================================================ FILE: docs/articles/articles/decontX_pbmc4k_files/header-attrs-2.7/header-attrs.js ================================================ // Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); ================================================ FILE: docs/articles/articles/installation.html ================================================ • celda

Introduction

“celda” stands for “CEllular Latent Dirichlet Allocation”. It is a suite of Bayesian hierarchical models and supporting functions to perform gene and cell clustering for count data generated by single cell RNA-seq platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications. This package also includes a method called decontX which can be used to estimate and remove contamination in single cell genomic data.

Installation

To install the latest stable release of celda from Bioconductor (requires R version >= 3.6):

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("celda")

The latest stable version of celda can be installed from GitHub using devtools:

library(devtools)
install_github("campbio/celda")

The development version of celda can also be installed from GitHub using devtools:

library(devtools)
install_github("campbio/celda@devel")

NOTE For MAC OSX users, devtools::install_github() requires installation of libgit2. This can be installed via homebrew:

brew install libgit2

Troubleshooting

  • If you receive installation errors when Rcpp is being installed and compiled, try following the steps outlined here to solve the issue
  • If you are running R 4.0.0 or later version on MacOS Catalina and you see error 'wchar.h' file not found, you can try the method in this link:
  • If you are trying to install celda using Rstudio and get this error: could not find tools necessary to compile a package, you can try typing this before running the install command:
options(buildtools.check = function(action) TRUE)

Site built with pkgdown 1.6.1.

================================================ FILE: docs/articles/articles/installation_files/accessible-code-block-0.0.1/empty-anchor.js ================================================ // Hide empty tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); ================================================ FILE: docs/articles/articles/installation_files/header-attrs-2.7/header-attrs.js ================================================ // Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); ================================================ FILE: docs/articles/celda.html ================================================ Analysis of single-cell genomic data with celda • celda

Introduction

CEllular Latent Dirichlet Allocation (celda) is a collection of Bayesian hierarchical models to perform feature and cell bi-clustering for count data generated by single-cell platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications and has shown good performance with sparse data. celda simultaneously clusters features (i.e. gene expression) into modules based on co-expression patterns across cells and cells into subpopulations based on the probabilities of the feature modules within each cell.

Starting from Bioconductor release 3.12 (celda version 1.6.0), celda makes use of SingleCellExperiment (SCE) objects for storing data and results. In this vignette we will demonstrate how to use celda to perform cell and feature clustering with a simple, small simulated dataset. This vignette does not include upstream importing of data, quality control, or filtering. To see a more complete analysis of larger real-world datasets, visit camplab.net/celda for additional vignettes.

Installation

celda can be installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("celda")

To load the package, type the following:

library(celda)

A complete list of help files are accessible using the help command with the package option.

help(package = celda)

To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.


Generation of a simulated single cell dataset

celda will take a matrix of counts where each row is a feature, each column is a cell, and each entry in the matrix is the number of counts of each feature in each cell. To illustrate the utility of celda, we will apply it to a simulated dataset.

In the function simulateCells, the K parameter designates the number of cell clusters, the L parameter determines the number of feature modules, the S parameter determines the number of samples in the simulated dataset, the G parameter determines the number of features to be simulated, and CRange specifies the lower and upper bounds of the number of cells to be generated in each sample.

To simulate a dataset of 5 samples with 5 cell populations, 10 feature modules, 200 features, and between 30 to 50 cells per sample using celda_CG model:

simsce <- simulateCells("celda_CG",
    S = 5, K = 5, L = 10, G = 200, CRange = c(30, 50))

The counts assay slot in simsce contains the counts matrix. The dimensions of counts matrix:

library(SingleCellExperiment)
dim(counts(simsce))
## [1] 200 207

Columns celda_sample_label and celda_cell_cluster in colData(simsce) contain sample labels and celda cell population cluster labels. Here are the numbers of cells in each subpopulation and in each sample:

table(colData(simsce)$celda_cell_cluster)
## 
##  1  2  3  4  5 
## 42 44 40 47 34
table(colData(simsce)$celda_sample_label)
## 
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 
##       43       48       45       40       31

Column celda_feature_module in rowData(simsce) contains feature module labels. Here is the number of features in each feature module:

table(rowData(simsce)$celda_feature_module)
## 
##  1  2  3  4  5  6  7  8  9 10 
## 23 39 17 15 21 22 19 12  4 28

Feature selection

A simple heuristic feature selection is performed to reduce the size of features used for clustering. To speed up the process, only features with at least 3 counts in at least 3 cells are included in downstream clustering for this data. A subset SingleCellExperiment object with filtered features is stored in altExp(simsce, "featureSubset") slot by default.

simsce <- selectFeatures(simsce)

If the number of features is still too large, then a smaller subset of features can be obtained by selecting the top number of most variable genes. For an example code, see the PBMC3K tutorial in the online celda documentation.

Performing bi-clustering with celda

There are currently three models within celda package: celda_C will cluster cells, celda_G will cluster features, and celda_CG will simultaneously cluster cells and features. Within the functions the K parameter will be the number of cell populations to be estimated, while the L parameter will be the number of feature modules to be estimated in the output model.

sce <- celda_CG(x = simsce, K = 5, L = 10, verbose = FALSE, nchains = 1)

Here is a comparison between the true cluster labels and the estimated cluster labels.

##    
##      1  2  3  4  5
##   1  0 44  0  0  0
##   2 42  0  0  0  0
##   3  0  0 40  0  0
##   4  0  0  0 47  0
##   5  0  0  0  0 34
##     
##       1  2  3  4  5  6  7  8  9 10
##   1   0 32  0  0  0  0  0  0  0  0
##   2  19  0  0  0  0  0  0  0  0  0
##   3   0  0 15  0  0  0  0  0  0  0
##   4   0  0  0 13  0  0  0  0  0  0
##   5   0  0  0  0 21  0  0  0  0  0
##   6   0  0  0  0  0 19  0  0  0  0
##   7   0  0  0  0  0  0  0 12  0  0
##   8   0  0  0  0  0  0 17  0  0  0
##   9   0  0  0  0  0  0  0  0  3  0
##   10  0  0  0  0  0  0  0  0  0 20

Visualization

Plotting cell populations on 2D-embeddings

celda contains its own wrapper function for tSNE and UMAP called celdaTsne and celdaUmap, respectively. Both of these functions can be used to embed cells into 2-dimensions. The output can be used in the downstream plotting functions plotDimReduceCluster, plotDimReduceModule, and plotDimReduceFeature to show cell population clusters, module probabilities, and expression of individual features, respectively.

sce <- celdaUmap(sce)
plotDimReduceCluster(x = sce, reducedDimName = "celda_UMAP")

plotDimReduceModule(x = sce, reducedDimName = "celda_UMAP", rescale = TRUE)

plotDimReduceFeature(x = sce, reducedDimName = "celda_UMAP",
    normalize = TRUE, features = "Gene_1")

Creating an expression heatmap

The clustering results can be viewed with a heatmap of the normalized counts using the function celdaHeatmap. The top nfeatures in each module will be selected according to the factorized module probability matrix.

plot(celdaHeatmap(sce = sce, nfeatures = 10))

Displaying relationships between modules and cell populations

The relationships between feature modules and cell populations can be visualized with celdaProbabilityMap. The absolute probabilities of each feature module in each cellular subpopulation is shown on the left. The normalized and z-scored expression of each module in each cell population is shown on the right.

Examining co-expression with module heatmaps

moduleHeatmap creates a heatmap using only the features from a specific feature module. Cells are ordered from those with the lowest probability of the module to the highest. If more than one module is used, then cells will be ordered by the probabilities of the first module.

moduleHeatmap(sce, featureModule = c(1,2), topCells = 100)

Identifying reasonable numbers of feature modules and cell subpopulations

In the previous example, the best K (the number of cell clusters) and L (the number of feature modules) was already known. However, the optimal K and L for each new dataset will likely not be known beforehand and multiple choices of K and L may need to be tried and compared. celda offers two sets of functions to determine the optimum K and L, recursiveSplitModule/recursiveSplitCell, and celdaGridSearch.

Using recursive splitting

Functions recursiveSplitModule and recursiveSplitCell offer a fast method to generate a celda model with optimum K and L. First, recursiveSplitModule is used to determine the optimal L. recursiveSplitModule first splits features into however many modules are specified in initialL. The module labels are then recursively split in a way that would generate the highest log-likelihood, all the way up to maxL.

moduleSplit <- recursiveSplitModule(simsce, initialL = 2, maxL = 15)

Perplexity is a statistical measure of how well a probability model can predict new data. Lower perplexity indicates a better model. The perplexity of each model can be visualized with plotGridSearchPerplexity. In general, visual inspection of the plot can be used to select the optimal number of modules (L) or cell populations (K) by identifying the “elbow” - where the rate of decrease in the perplexity starts to drop off.

In this example, the perplexity for L stops decreasing at L = 10, thus L = 10 would be a good choice. Sometimes the perplexity alone does not show a clear elbow or “leveling off”. However, the rate of perplexity change (RPC) can be more informative to determine when adding new modules does not add much additional information Zhao et al., 2015). An RPC closer to zero indicates that the addition of new modules or cell clusters is not substantially decreasing the perplexity. The RPC of models can be visualized using function plotRPC:

plotRPC(moduleSplit)

Once you have identified the optimal L (in this case, L is selected to be 10), the module labels are used for initialization in recursiveSplitCell. Similarly to recursiveSplitModule, cells are initially split into a small number of subpopulations, and the subpopulations are recursively split up.

moduleSplitSelect <- subsetCeldaList(moduleSplit, params = list(L = 10))

cellSplit <- recursiveSplitCell(moduleSplitSelect,
    initialK = 3,
    maxK = 12,
    yInit = celdaModules(moduleSplitSelect))

plotRPC(cellSplit)

In this plot, the perplexity for K stops decreasing at K = 5, with a final K/L combination of K = 5, L = 10. Generally, this method can be used to pick a reasonable L and a potential range of K. However, manual review of specific selections of K is often required to ensure results are biologically coherent.

Once users have chosen the K/L parameters for further analysis, the subsetCeldaList function can be used to subset the celda list SCE object to a single model SCE object.

sce <- subsetCeldaList(cellSplit, params = list(K = 5, L = 10))

Alternativley to recursive splitting, celda is able to run multiple combinations of K and L with multiple chains in parallel via the celdaGridSearch function.

cgs <- celdaGridSearch(simsce,
    paramsTest = list(K = seq(4, 6), L = seq(9, 11)),
    cores = 1,
    model = "celda_CG",
    nchains = 2,
    maxIter = 100,
    verbose = FALSE,
    bestOnly = TRUE)

Setting verbose to TRUE will print the output of each model to a text file. These results can be visualized with plotGridSearchPerplexity. The major goal is to pick the lowest K and L combination with relatively good perplexity. In general, visual inspection of the plot can be used to select the number of modules (L) or cell populations (K) where the rate of decrease in the perplexity starts to drop off. bestOnly = TRUE indicates that only the chain with the best log likelihood will be returned for each K/L combination.

In this example, the perplexity for L stops decreasing at L = 10 for the majority of K values. For the line corresponding to L = 10, the perplexity stops decreasing at K = 5. Thus L = 10 and K = 5 would be a good choice. Again, manual review of specific selections of K is often be required to ensure results are biologically coherent.

Once users have chosen the K/L parameters for further analysis, the subsetCeldaList function can be used to subset the celda list SCE object to a single model SCE object.

sce <- subsetCeldaList(cgs, params = list(K = 5, L = 10))

If the “bestOnly” parameter is set to FALSE in the celdaGridSearch, then the selectBestModel function can be used to select the chains with the lowest log likelihoods within each combination of parameters. Alternatively, users can select a specific chain by specifying the index within the subsetCeldaList function.

cgs <- celdaGridSearch(simsce,
    paramsTest = list(K = seq(4, 6), L = seq(9, 11)),
    cores = 1,
    model = "celda_CG",
    nchains = 2,
    maxIter = 100,
    verbose = FALSE,
    bestOnly = FALSE)

cgs <- resamplePerplexity(cgs, celdaList = cgs, resample = 2)

cgsK5L10 <- subsetCeldaList(cgs, params = list(K = 5, L = 10))

sce <- selectBestModel(cgsK5L10)

Miscellaneous utility functions

celda also contains several utility functions for the users’ convenience.

Finding the modules for feature with featureModuleLookup

featureModuleLookup can be used to look up the module a specific feature was clustered to.

featureModuleLookup(sce, feature = c("Gene_99"))
## Gene_99 
##       4

Reordering cluster labels with recodeClusterZ, recodeClusterY

recodeClusterZ and recodeClusterY allows the user to recode the cell and feature cluster labels, respectively.

sceZRecoded <- recodeClusterZ(sce,
    from = c(1, 2, 3, 4, 5), to = c(2, 1, 3, 4, 5))

The model prior to reordering cell labels compared to after reordering cell labels:

table(celdaClusters(sce), celdaClusters(sceZRecoded))
##    
##      1  2  3  4  5
##   1  0 44  0  0  0
##   2 42  0  0  0  0
##   3  0  0 40  0  0
##   4  0  0  0 47  0
##   5  0  0  0  0 34

Session Information

## R version 4.3.3 (2024-02-29)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] celda_1.18.2                Matrix_1.6-5               
##  [3] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
##  [5] Biobase_2.62.0              GenomicRanges_1.54.1       
##  [7] GenomeInfoDb_1.38.8         IRanges_2.36.0             
##  [9] S4Vectors_0.40.2            BiocGenerics_0.48.1        
## [11] MatrixGenerics_1.14.0       matrixStats_1.2.0          
## [13] BiocStyle_2.30.0           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7            gridExtra_2.3           rlang_1.1.3            
##  [4] magrittr_2.0.3          clue_0.3-65             GetoptLong_1.0.5       
##  [7] compiler_4.3.3          png_0.1-8               systemfonts_1.0.6      
## [10] vctrs_0.6.5             reshape2_1.4.4          combinat_0.0-8         
## [13] stringr_1.5.1           shape_1.4.6.1           pkgconfig_2.0.3        
## [16] crayon_1.5.2            fastmap_1.1.1           magick_2.8.2           
## [19] XVector_0.42.0          labeling_0.4.3          utf8_1.2.4             
## [22] rmarkdown_2.25          ragg_1.3.0              purrr_1.0.2            
## [25] xfun_0.41               WriteXLS_6.5.0          zlibbioc_1.48.2        
## [28] cachem_1.0.8            jsonlite_1.8.8          highr_0.10             
## [31] DelayedArray_0.28.0     cluster_2.1.6           irlba_2.3.5.1          
## [34] parallel_4.3.3          R6_2.5.1                bslib_0.6.1            
## [37] stringi_1.8.3           RColorBrewer_1.1-3      MCMCprecision_0.4.0    
## [40] jquerylib_0.1.4         Rcpp_1.0.12             bookdown_0.37          
## [43] iterators_1.0.14        knitr_1.45              FNN_1.1.4              
## [46] tidyselect_1.2.0        rstudioapi_0.15.0       abind_1.4-5            
## [49] yaml_2.3.8              enrichR_3.2             doParallel_1.0.17      
## [52] codetools_0.2-19        curl_5.2.1              lattice_0.22-5         
## [55] tibble_3.2.1            plyr_1.8.9              withr_3.0.0            
## [58] evaluate_0.23           Rtsne_0.17              desc_1.4.3             
## [61] circlize_0.4.16         RcppEigen_0.3.4.0.0     pillar_1.9.0           
## [64] BiocManager_1.30.22     foreach_1.5.2           generics_0.1.3         
## [67] RCurl_1.98-1.14         ggplot2_3.5.0           munsell_0.5.1          
## [70] scales_1.3.0            glue_1.7.0              tools_4.3.3            
## [73] data.table_1.15.4       fs_1.6.3                Cairo_1.6-2            
## [76] grid_4.3.3              colorspace_2.1-0        GenomeInfoDbData_1.2.11
## [79] cli_3.6.2               textshaping_0.3.7       fansi_1.0.6            
## [82] S4Arrays_1.2.1          ComplexHeatmap_2.18.0   dplyr_1.1.4            
## [85] uwot_0.1.16             gtable_0.3.4            sass_0.4.8             
## [88] digest_0.6.35           SparseArray_1.2.4       ggrepel_0.9.5          
## [91] rjson_0.2.21            farver_2.1.1            memoise_2.0.1          
## [94] htmltools_0.5.7         pkgdown_2.0.7           lifecycle_1.0.4        
## [97] httr_1.4.7              GlobalOptions_0.1.2

Site built with pkgdown 2.0.7.

================================================ FILE: docs/articles/celda_files/accessible-code-block-0.0.1/empty-anchor.js ================================================ // Hide empty tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); ================================================ FILE: docs/articles/celda_files/header-attrs-2.7/header-attrs.js ================================================ // Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); ================================================ FILE: docs/articles/celda_pbmc3k.html ================================================ Celda - Analysis of PBMC3K • celda

Introduction

Celda is a Bayesian hierarchical model that can perform bi-clustering of features into modules and observations into subpopulations. In this tutorial, we will apply Celda to a real-world single-cell RNA sequencing (scRNA-seq) dataset of 2,700 Peripheral Blood Mononuclear Cells (PBMCs) collected from a healthy donor. This dataset (PBMC3K) is available from 10X Genomics and can be found on the 10X website.

The celda package uses the SingleCellExperiment (SCE) object for management of expression matrices, feature/cell annotation data, and metadata. All of the functions have an SCE object as the first input parameter. The functions operate on a matrix stored in the assay slot of the SCE object. The parameter useAssay can be used to specify which matrix to use (the default is "counts"). Matrices can be of class matrix or dgCMatrix from the Matrix package. While the primary clustering is performed with functions from the celda package, the singleCellTK package is used for some other tasks such as importing data, quality control, and marker identification with differential expression.

Importing data

The PBMC3K data can be easily loaded via the Bioconductor package TENxPBMCData. TENxPBMCData is an experiment package that provides resources for various PBMC datasets generated by 10X Genomics. When using this package, the column names of returned SCE object are NULL by default. For this example, we paste together the name of the sample with the cell barcode to generate column names for the SCE object. Additionally, the count matrix within sce object is converted from a DelayedMatrix object to a sparse matrix dgCMatrix object.

library(TENxPBMCData)
sce <- TENxPBMCData("pbmc3k")
colnames(sce) <- paste0("pbmc3k_", colData(sce)$Sequence)
counts(sce) <- as(counts(sce), "dgCMatrix")

If you have the singleCellTK package installed, then this dataset can be imported and converted with a single command:

To get your own data into a SingleCellExperiment object, the singleCellTK package has several importing functions for different preprocessing tools including CellRanger, STARsolo, BUStools, Optimus, DropEST, SEQC, and Alevin/Salmon. For example, the following code can be used as a template to read in multiple samples processed with CellRanger:

library(singleCellTK)
sce <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"))

Note: As a reminder, you can view the assays, column annotation, and row annotation stored in the SCE with the commands assays(sce), colData(sce), and rowData(sce), respectively.

Finally, we set the rownames of the SCE to the gene symbol:

rownames(sce) <- rowData(sce)$Symbol_TENx

Quality Control

Quality control and filtering of cells is often needed before down-stream analyses such as dimensionality reduction and clustering. Typical filtering procedures include exclusion of poor quality cells with low numbers of counts/UMIs, estimation and removal of ambient RNA, and identification of potential doublet/multiplets. Many tools and packages are available to perform these operations and users are free to apply their tool(s) of choice as the celda clustering functions will work with any matrix stored in an SCE object. The celda package does contain a Bayesian method called decontX to estimate and remove transcript contamination in individual cells in a scRNA-seq dataset.

To perform QC, we suggest using the runCellQC function in singleCellTK package. This is a wrapper for several methods for calculation of QC metrics, doublet detection, and estimation of ambient RNA (including decontX). Below is a quick example of how to perform standard QC before applying celda. If you have another preferred approach or your data has already been QC’ed, you can move to Feature selection section. For this tutorial, we will only run one doublet detection algorithm and one decontamination algorithms. For a full list of algorithms that this function runs by default, see ?runCellQC. We will also quantify the percentage of mitochondrial genes in each cell as this is often used as a measure of cell viability.

library(singleCellTK)

# Get list of mitochondrial genes
mito.genes <- grep("^MT-", rownames(sce), value = TRUE)

# Run QC
sce <- runCellQC(sce, sample = NULL, algorithms = c("QCMetrics", "scDblFinder", "decontX"), geneSetList = list(mito=mito.genes), geneSetListLocation = "rownames")

Note: If you have cells from multiple samples stored in the SCE object, make sure to supply the sample parameter as the QC tools need to be applied to cells from each sample individually.

Individual sets of QC metrics can be plotted with specific functions. For example to plot distributions of total numbers of UMIs derived from runPerCellQC, doublet scores from runScDblFinder, and contamination scores from runDecontX (all of which were run by the runCellQC function), the following plotting functions can be used:

plotScDblFinderResults(sce, reducedDimName = "decontX_UMAP")

plotDecontXResults(sce, reducedDimName = "decontX_UMAP")

An comprehensive HTML report can be generated to visualize and explore the QC metrics in greater detail:

After examining the distributions of various QC metrics, poor quality cells will need to be removed. Typically, thresholds for QC metrics should exclude cells that are outliers of the distribution (i.e. long tails in the violin or density plots). Cells can be removed using the subsetSCECols function. Metrics stored in the colData of the SCE object can be filtered using the colData parameter. Here we will limit to cells with at least 600 counts and 300 genes detected:

# Filter SCE
sce <- subsetSCECols(sce, colData = c("total > 600", "detected > 300"))

# See number of cells after filtering
ncol(sce)
## [1] 2675

Other common metrics to filter on include subsets_mito_percent for removal of cells with high mitochondrial percentage, decontX_contamination for removal of cells with higher levels of contamination from ambient RNA, scDblFinder_class to remove doublets (or calls from any of the other doublet detection algorithms). See the singleCellTK documentation For more information on performing comprehensive QC and filtering.

Feature selection

In general, removing features with low numbers of counts across all cells is recommended to reduce computational run time. A simple selection can be performed by removing features with a minimum number of counts in a minimum number of cells using the selectFeatures function:

# Select features with at least 3 counts in at least 3 cells
library(celda)
useAssay <- "counts"
altExpName <- "featureSubset"
sce <- selectFeatures(sce, minCount = 3, minCell = 3, useAssay = useAssay, altExpName = altExpName)

# See number of features after filtering
nrow(altExp(sce, altExpName))
## [1] 2639

The useAssay parameter is used to denote which assay/matrix within the SCE to use for filtering. The default raw counts matrix is traditionally stored in the "counts" assay. If decontX was previously run during QC, then the decontaminated counts can be used by setting this parameter to "decontXcounts". We will save this parameter in a variable called useAssay which will be used as input in several downstream functions.

Note: The subsetted matrix is stored in the “alternative experiment” slot (altExp) within the SCE. This allows for a matrix with a different number of rows to be stored within the same SCE object (rather than creating two SCE objects). The celda functions described in the next several sections operate on a matrix stored in the altExp slot. The default name given to the alternative experiment and used in all downstream celda functions is "featureSubset". If the altExpName parameter is changed here, then it will need to be supplied to downstream plotting functions as well. The list of alternative experiments in an SCE can be view with altExpNames(sce). If you have already have an SCE with selected features or do not want to perform feature selection, then you need to set the alternative experiment directly with a command like altExp(sce, "featureSubset") <- assay(sce, "counts"). In the future, this will be updated to be more simple by utilizing the ExperimentSubset package.

If the number of features is still relatively large (e.g. >5000), an alternative approach is to select highly variable features that can be used in the downstream clustering. The advantage of this approach is that it can greatly speed up celda and can improve with module detection among highly variable features with overall lower expression. The disadvantage of this approach is that features that do not fall into the highly variable group will not be clustered into modules. The celda package does not include methods for selection of highly variable genes (HVGs). However, the singleCellTK provides wrappers for methods used in Seurat and Scran. We recommend keeping at least 2,000-5,000 HVGs for clustering. Here is some example code of how to select the top 5,000 most variable genes and store it back in the SCE as an altExp:

library(singleCellTK)
sce <- seuratFindHVG(sce, useAssay = useAssay, hvgMethod = "vst")
g <- getTopHVG(sce, method = "vst", n = 5000)
altExp(sce, altExpName) <- sce[g, ]

For the rest of the analysis with the PBMC3K data, we will use the first approach where features with at least 3 counts in 3 cells were included.

Analysis with Celda

Bi-clustering with known numbers of clusters

As mentioned earlier, celda is discrete Bayesian model that is able to simultaneously bi-cluster features into modules and cells into cell clusters. The primary bi-clustering model can be accessed with the function celda_CG. This function operates on a matrix stored as an alternative experiment in the altExp slot. If you did not perform feature selection as recommended in the previous section and your matrix of interest is not currently located in an altExp slot, the following code can be used to copy a matrix in the main assay slot to the altExp slot:

useAssay <- "counts"
altExpName <- "featureSubset"
altExp(sce, altExpName) <- assay(sce, useAssay)`. 

The two major adjustable parameters in this model are L, the number of modules, and K, the number of cell populations. The following code bi-clusters the PBMC3K dataset into 100 modules and 15 cell populations:

sce <- celda_CG(sce, L = 100, K = 15, useAssay = useAssay, altExpName = altExpName)

However, in most cases, the number of feature modules (L) and the number of cell clusters (K) are not known beforehand. In the next sections, we outline procedures that can be used suggest reasonable choices for these parameters. If the data is clustered with the code above by supplying K and L directly to the celda_CG function, then you can skip the next section and proceed to Creating 2-D embeddings.

Finding the number of modules

In order to help choose a reasonable solutions for L and K, celda provides step-wise splitting procedures along with measurements of perplexity to suggest reasonable choices for L and K. First, the function recursiveSplitModule can be used to cluster features into modules for a range of L. Within each step, the best split of an existing module into 2 new modules is chosen to create the L-th module. The module labels of the previous model with \(L-1\) modules are used as the initial starting values in the next model with \(L\) modules. Note that the initialization step may take longer with larger numbers of cells in the dataset and the splitting procedure will take longer with larger numbers features in the dataset. Celda models with a L range between initialL = 10 and maxL = 150 are tested in the example below.

moduleSplit <- recursiveSplitModule(sce, useAssay = useAssay, altExpName = altExpName, initialL = 10, maxL = 150)

Perplexity has been commonly used in the topic models to measure how well a probabilistic model predicts observed samples (Blei et al., 2003). Here, we use perplexity to evaluate the performance of individual models by calculating the probability of observing expression counts given an estimated Celda model. Rather than performing cross-validation which is computationally expensive, a series of test sets are created by sampling the counts from each cell according to a multinomial distribution defined by dividing the counts for each gene in the cell by the total number of counts for that cell. Perplexity is then calculated on each test set and can be visualized using function plotGridSearchPerplexity. A lower perplexity indicates a better model fit.

plotGridSearchPerplexity(moduleSplit, altExpName = altExpName, sep = 10)

The perplexity alone often does not show a clear elbow or “leveling off”. However, the rate of perplexity change (RPC) can be more informative to determine when adding new modules does not add much additional information Zhao et al., 2015). An RPC closer to zero indicates that the addition of new modules or cell clusters is not substantially decreasing the perplexity. The RPC of models can be visualized using function plotRPC:

plotRPC(moduleSplit, altExpName = altExpName)

In this case, we will choose an L of 80 as the RPC curve tends to level off at this point:

L <- 80
Note: Perplexity and RPC are meant to be guides to give a sense of a possible starting point for L. However, they may not always give a clear “leveling off” depending of the complexity and quality of the dataset. Do not give up if the choice of L is unclear or imperfect! If the L to choose is unclear from these, then you can set a somewhat high number (e.g. 75) and move to the next step of selecting K. Later on, manual review of modules using functions such as moduleHeatmap can give a sense of whether individual modules should be further split up by selecting higher L. For example, you can start exploring the cell populations and modules with L = 75. If some modules need to be further split, you can then try L = 100, L = 125, and so on.

Finding the number of cell subpopulations

Now we extract the Celda model of L =\(L\) with function subsetCeldaList and run recursiveSplitCell to fit models with a range of K between 3 and 25:

temp <- subsetCeldaList(moduleSplit, list(L = L))
sce <- recursiveSplitCell(sce, useAssay = useAssay, altExpName = altExpName, initialK = 3, maxK = 25, yInit = celdaModules(temp))

The perplexities and RPC of models can be visualized using the same functions plotGridSearchPerplexity and plotRPC.

plotRPC(sce, , altExpName = altExpName)

The perplexity continues to decrease with larger values of K. The RPC generally levels off between 13 and 16 and we choose the model with K = 14 for downstream analysis. The follow code selects the final celda_CG model with L = 80 and K = 14:

K <- 14
sce <- subsetCeldaList(sce, list(L = L, K = K))

Note: Similar to choosing L, you can guess an initial value of K based off of the perplexity and RPC plots and then move to the downstream exploratory analyses described in the next several sections. After reviewing the cell clusters on 2-D embeddings and module heatmaps, you may have to come back to tweak the choice of K until you have something that captures the cellular heterogeneity within the data without “over-clustering” cells into too many subpopulations. This may be an iterative procedure of going back-and-forth between choices of K and plotting the results. So do not let imperfect perplexity/PRC plots prevent you from moving on to the rest of the analysis. Often times, using an initial guess for K will allow you to move on in the analysis to get a sense of the major sources of biological heterogeneity present in the data.

Exploring cell populations

Creating 2-D embeddings

After selecting a celda model with specific values of L and K, we can then perform additional exploratory and downstream analyses to understand the biology of the transcriptional modules and cell populations. We can start by generating a dimension reduction plot with the Uniform Manifold Approximation and Projection (UMAP) method to visualize the relationships between the cells in a 2-D embedding. This can be done with function celdaUmap.

sce <- celdaUmap(sce, useAssay = useAssay, altExpName = altExpName)

Alternatively, a t-distributed stochastic neighbor embedding (t-SNE) can be generated using function celdaTsne. The UMAP and t-SNE plots generated by celdaUmap and celdaTsne are computed based on the module probabilities (analogous to using PCs from PCA). The calculated dimension reduction coordinates for the cells are stored under the reducedDim slot of the altExp slot in the original SCE object. The follow command lists the names of the dimensionality reductions that can be used in downstream plotting functions in the next few sections:

reducedDimNames(altExp(sce, altExpName))
## [1] "decontX_UMAP" "celda_UMAP"

Plotting cell population cluster labels

The function plotDimReduceCluster can be used to plot the cluster labels for cell populations identified by celda on the UMAP:

plotDimReduceCluster(sce, reducedDimName = "celda_UMAP", labelClusters = TRUE)

Plotting expression of specific features

Usually, biological features of some cell populations are known a priori and can be identified with known marker genes. The expression of selected marker genes can be plotted on the UMAP with the function plotDimReduceFeature.

markers <- c("CD3D", "IL7R", "CD4", "CD8B", "CD19", "FCGR3A", "CD14", "FCER1A", "PF4")

plotDimReduceFeature(x = sce, features = markers, reducedDimName = "celda_UMAP", useAssay = useAssay, altExpName = altExpName, normalize = TRUE)

The parameter displayName can be used to switch between IDs stored in the rownames of the SCE and columns of the rowData of the SCE. If the assay denoted by useAssay is a raw counts matrix, then setting normalize = TRUE is recommended (otherwise the z-score of the raw counts will be plotted). When set to TRUE, each count will be normalized by dividing by the total number of counts in each cell. An alternative approach is to perform normalization with another method and then point to the normalized assay with the useAssay parameter. For example, normalization can be performed with the scater package:

library(scater)
sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts")
plotDimReduceFeature(x = sce, features = markers, reducedDimName = "celda_UMAP", useAssay = "logcounts", altExpName = altExpName, normalize = FALSE)

This second approach may be faster if plotting a lot of marker genes or if the dataset is relatively large.

Plotting cell subpopulations with labels

Once we identify of various cell subpopulations using the known marker genes, these custom labels can be added on the UMAP colored by cluster:

g <- plotDimReduceCluster(sce, reducedDimName = "celda_UMAP", altExpName = altExpName, labelClusters = TRUE)

labels <- c("1: Megakaryocytes",
    "2: CD14+ Monocytes 1",
    "3: CD14+ Monocytes 2",
    "4: FCGR3A (CD16+) Monocytes",
    "5: CD14+ Monocytes 3",
    "6: CD8+ Cytotoxic T-cells",
    "7: CD4+ T-cells",
    "8: CD8+ Cytotoxic T-cells",
    "9: B-cells",
    "10: Naive CD8+ T-cells",
    "11: Naive CD4+ T-cells",
    "12: NK-cells",
    "13: Unknown T-cells",
    "14: Dendritic cells")

library(ggplot2)
g <- g + scale_color_manual(labels = labels,
    values = distinctColors(length(labels)))
print(g)

Exploring relationship between modules and cell populations

Celda has the ability to identify modules of co-expressed features and quantify the probability of these modules in each cell population. An overview of the relationships between modules and cell subpopulations can be explored with the function celdaProbabilityMap. The “Absolute probability” heatmap on the left shows the proportion of counts in each module for each cell population. The “Absolute probability” map gives insights into the absolute abundance of a module within a given cell subpopulation. The absolute heatmap can be used to explore which modules are higher than other modules within a cell population. The “Relative expression” map shows the standardized (z-scored) module probabilities across cell subpopulations. The relative heatmap can be used to explore which modules are relatively higher than other modules across cell populations.

celdaProbabilityMap(sce, useAssay = useAssay, altExpName = altExpName)

In this plot, we can see a variety of patterns. Modules 15 - 20 are highly expressed across most cell populations indicating that they may contain housekeeping genes (e.g. ribosomal). Other modules are specific to a cell population or groups of cell populations. For example, module 35 is only on in population 1 while module 70 is expressed across populations 2, 3, and to some degree in population 5. The unknown T-cell population 13 has highly specific levels of modules 30. In the next section, we can look at the genes in these modules to gain insights into the biological properties of each of these cell populations.

Exploring feature modules

The primary advantage of celda over other tools is that it can cluster features that are co-expressed across cells into modules. These modules are often more biologically coherent than features correlated with principal components from PCA. Below are several ways in which modules can be explored and visualized.

Table of features in each module

The function featureModuleTable can be used to get the names of all features in each module into a data.frame.

# Save to a data.frame
ta <- featureModuleTable(sce, useAssay = useAssay, altExpName = altExpName)
dim(ta)
## [1] 154  80
head(ta[,"L70"])
## [1] "S100A9"   "S100A8"   "S100A12"  "RBP7"     "FOLR3"    "C19orf59"

The parameter displayName can be used to switch between IDs stored in the rownames of the SCE and columns of the rowData of the SCE. The the outputFile parameter is set, the table will be saved to a tab-delimited text file instead of to a data.frame:

# Save to file called "modules.txt"
featureModuleTable(sce, useAssay = useAssay, altExpName = altExpName, outputFile = "modules.txt")

The modules for this model are shown below:

L1 L2 L3 L4 L5 L6 L7 L8 L9 L10 L11 L12 L13 L14 L15 L16 L17 L18 L19 L20 L21 L22 L23 L24 L25 L26 L27 L28 L29 L30 L31 L32 L33 L34 L35 L36 L37 L38 L39 L40 L41 L42 L43 L44 L45 L46 L47 L48 L49 L50 L51 L52 L53 L54 L55 L56 L57 L58 L59 L60 L61 L62 L63 L64 L65 L66 L67 L68 L69 L70 L71 L72 L73 L74 L75 L76 L77 L78 L79 L80
CCL3 GNLY CTSW NKG7 RPS19 MT-CO2 MT-CO3 DDX5 RPL28 RPL18A FOS EIF1 JUNB GIMAP7 RPL13A RPS6 RPS2 RPL10 RPL13 RPS14 RPSA RPS27 LTB PTPRCAP MALAT1 LDHB IL32 CD79B CD37 TUBA1B GAPDH PPIA ACTG1 CCL5 PPBP RGS10 OAZ1 TAGLN2 MT-ND1 MT-CO1 ARPC3 SH3BGRL3 CYBA PTMA TMSB10 LAPTM5 ARHGDIB HLA-B CFL1 SRGN ACTB TMSB4X C9orf142 ANXA1 UBB B2M MYL12A HLA-A FCGR3A IFITM2 FAM26F FCER1G AIF1 FTH1 FCER1A HLA-DQA1 HLA-DPB1 CD74 HLA-DRA S100A9 LYZ CST3 VIM NEAT1 S100A4 GSTP1 LGALS1 GABARAP TYROBP FTL
IGFBP7 GZMB CD247 GZMA NACA CD52 MT-ND4 TSC22D3 RPS9 RPL12 FXYD5 H3F3B TMEM66 GIMAP4 RPS18 RPS3 RPL19 RPL11 RPL32 EEF1A1 JUN RPL21 MYC CXCR4 MYLIP IL7R CD3D CD79A SNHG7 HMGB2 EIF4A1 HNRNPA2B1 CORO1A GZMK PF4 TUBA4A FKBP1A GDI2 PFDN5 LSP1 YBX1 SERF2 CLIC1 HNRNPA1 EIF3K SNX3 UBC SRP14 PSMB9 ITGB2 PFN1 GMFG APOBEC3G HCST RAC2 HLA-C HSPA8 CALM1 RHOC CTSC NCF1 FGR LST1 COTL1 CLEC10A HLA-DQB1 HLA-DPA1 IRF8 HLA-DMA S100A8 LGALS2 CFP S100A10 ISG15 S100A6 GPX1 TYMP TSPO FCN1 CTSS
HAVCR2 FGFBP2 GZMM CST7 NAP1L1 PPDPF MT-CYB TXNIP FAU RPL8 CD48 DUSP1 ZFP36L2 FYB RPS8 RPS12 RPLP1 RPL6 RPLP2 RPS4X NPM1 RPS3A SIT1 ISG20 ATM NOSIP CD3E MS4A1 SNX2 EIF1AY SLC25A5 HMGB1 CHCHD2 LAG3 HIST1H2AC CDC42SE2 TALDO1 ATP5C1 SLC25A6 ATP6V0E1 LY6E ARPC1B SUPT4H1 RPL36AL ATP5E UQCRH MYL12B PSME1 PPP1CA CD63 MYL6 CAPZB CDC37 ID2 UCP2 HLA-E EVL CD99 CDKN1C MYO1G LYN CD86 IFITM3 SAT1 ENHO HLA-DQA2 HLA-DRB1 LAT2 LY86 S100A12 MS4A6A CPVL NFKBIA ANXA2 S100A11 AP1S2 LGALS3 RAC1 NCF2 NPC2
CCL4L1 CCL4 LYAR PRF1 SOD1 ATP6V1G1 MT-ND2 CIRBP UBA52 RPL29 APRT ITM2B TMEM123 GIMAP1 RPL10A RPL3 RPL15 RPL26 RPS16 RPL27A RPSAP58 RPS27A BIRC3 CD69 ANKRD44 GIMAP5 CD7 TCL1A PRKCB MANF TPI1 HSP90AA1 ENO1 SDPR FERMT3 H3F3A PRDX6 ATP5G2 BRK1 RHOA ALDOA IFI35 COX7C ATP5L GABARAPL2 YWHAB RBM3 PSMB8 CTSD ARPC2 TUBA1A ARGLU1 RNF181 CD53 ARL6IP5 IFITM1 SEPT7 CKB ABI3 POU2F2 CD300C CFD PSAP DNASE1L3 CD1C HLA-DRB5 EAF2 HLA-DMB RBP7 CD14 IGSF6 AMICA1 PRELID1 GRN TNFSF13B PYCARD CDA BRI3
SPON2 HOPX GZMH CYTIP PRR13 MT-ATP6 LIMD2 RPS24 GNB2L1 GSTK1 KLF6 BTG2 CITED2 RPS5 RPS15A RPS15 RPL14 RPS28 TPT1 HINT1 RPL9 RIC3 STK17A CARS C12orf57 CD2 IGLL5 ARL4A HMGA1 PKM HMGN1 SLC25A3 TSC22D1 DNAJB6 GSTO1 DNAJC8 ZFP36 C11orf31 CALM2 LAMTOR4 IRF7 CNBP PSMA7 POLD4 OST4 C19orf43 TPM3 MYO1F EMP3 PGK1 TMCO1 CKLF PLAC8 BIN2 AES PDIA3 LYPD2 ATP1B3 SCPEP1 FAM49A SERPINA1 TIMP1 SERPINF1 FCGR2B PLD4 MEF2C RNASE6 FOLR3 ALDH2 RAB32 MYADM IFI6 CEBPD RNF130 TKT SLC7A7 CTSB
CLIC3 C12orf75 KLRD1 CLNS1A SYF2 SSR2 KLF2 COX4I1 RPS11 CD44 IER2 NDFIP1 SEPW1 RPS23 RPL23A RPS7 RPL27 RPL36 RPL7 EIF4A2 BTG1 STMN3 ACAP1 DNAJB1 LCK IGJ HVCN1 STMN1 COX8A SRSF7 LDHA NRGN LIMS1 SOD2 YWHAH SERP1 COPE RNASET2 COX5B MIR142 TMA7 COX6B1 SEP15 NDUFA4 SUMO2 COX6A1 FLNA PSME2 ARF1 ADRM1 IL10RA CD164 RAP1B RARRES3 LITAF VMO1 SPN SYNGR2 ARRB1 CD68 CEBPB LILRA4 PPP1R14A CCDC50 GAPT C19orf59 IL8 CD33 PPT1 C1orf162 FCGRT STX11 AP2S1 FPR1 BLVRA
XCL2 CD8A CD160 CCT7 EVI2B TMEM14B HNRNPDL BTF3 RPL23 PPP1R15A GPSM3 RWDD1 GIMAP2 RPL18 RPL30 RPL7A EEF1D RPL22 RPL35A SNRPD2 GLTSCR2 CRIP2 RHOH PRDX2 LAT LINC00926 SYPL1 PRDX4 PSMB3 RAN PPIB TUBB1 MAX POLE4 MRPL14 GPX4 COX5A DRAP1 TCEB2 LAP3 EIF3F VAMP8 LSM7 SKP1 EIF3G GUK1 GLIPR2 ARPC5 LRRFIP1 PPP2CA CMTM3 PPP1R18 HLA-F TRAF3IP3 XBP1 ADA LYST TCF7L2 SPI1 STXBP2 PHACTR1 CSF3R CD302 CORO1B RHOG CTSH MNDA ATP6V0B MAFB RHOB
TTC38 KLRG1 FCRL6 ATP1A1 YPEL5 C19orf24 ANAPC16 PABPC1 EEF2 PLP2 EDF1 SOCS3 DGCR6L RPL5 RPS25 RPL35 RPL24 C6orf48 RPL34 SELL FOXP1 TNFRSF4 BIN1 CD27 CD3G BANK1 ADK CHTF8 GNB2 HNRNPK SRSF3 GNG11 AP3S1 RAB10 SPINT2 ATP5D WDR83OS SH3BGRL BST2 TMEM179B UBE2D3 ARL6IP4 RASGRP2 COX6C TMBIM6 TRAPPC1 EFHD2 ATP5B CALM3 PDHB FYN SCP2 DHRS7 IL2RG ANXA6 SH3BP1 LYL1 IFI30 ASAH1 PLBD1 LILRB4 TMEM14C MT2A BLVRB FGL2 RGS2 FCGR2A PLAUR
AKR1C3 ZAP70 XCL1 ARID4B DAZAP2 MPC2 VAMP2 EIF3H NBEAL1 KHDRBS1 UQCR11 TRADD TMEM173 RPS10 RPL31 RPL4 RPL38 C21orf33 RPL17 PEBP1 RP11-796E2.4 RP11-706O15.1 OCIAD2 LEPROTL1 CD8B VPREB3 SMIM14 SP140 UQCRFS1 MIF PARK7 RGS18 CTSA PTPN18 H2AFJ HIGD2A FIS1 BLOC1S1 SF3B5 PLSCR1 ERP29 NEDD8 MX1 PTPRC SAP18 CAP1 PLEK MSN FKBP8 TPST2 ETHE1 TMEM9B TMEM50A CCND3 YWHAQ GUSB STX7 APOBEC3A BID ASGR1 RAB31 DHRS4L2 CSTB NUP214 KLF4 LINC00936 TMEM176B CTSZ
PRSS23 APMAP KLRC1 THYN1 SP110 TMED4 UBE2D2 PNRC1 EIF3L SURF1 SSR4 CHURC1 GBP1 RPS13 RPS20 RPL37A RPS4Y1 C1orf228 RPL36A CMPK1 KCNQ1OT1 KRT1 FAIM3 PIK3IP1 OPTN FCER2 MYCBP2 HIST1H4C ANAPC11 SRSF2 PSMB1 CLU NCOA4 GLUL TPM4 FOSB PSMB7 RNH1 GLRX TMEM205 GTF3A NDUFA13 MAF1 PTGES3 JTB LCP1 CD300A HN1 ATRAID PPP6C PTGER2 ZNF207 SIGIRR SRP9 ALOX5AP FAM110A YBX3 MS4A7 WARS TNFAIP2 CD4 C10orf54 IL1B ODF3B CNPY3 VSTM1 TBXAS1
GPR56 SAMD3 RP11-347P5.1 CCDC12 MRPL21 UXT UBXN1 ZFAS1 STAT1 MT-ND5 STAT3 DNAJA2 RPLP0 RPS29 RPL37 TOMM7 COMMD6 PPA1 LBH RGCC ITM2A PDLIM1 CYB561A3 PCNA PGAM1 HNRNPC H2AFZ CD9 RHEB RNPEP COMMD3 NDUFA11 SRSF9 PTPN6 SERPINB1 TLN1 PCBP2 ATP5G3 RCSD1 ACTR3 EID1 HCLS1 AOAH DOK2 M6PR ATXN10 CDC42EP3 NDUFS2 SH3KBP1 FUS TMED9 ASB8 SNX10 HCK NINJ1 BST1 UBE2Q1 ANXA5 PGD SMCO4 CAPG RETN G0S2
HBA1 NCR3 TTC3 STUB1 TMEM165 PNISR RPL39 HPCAL1 PCBP1 TNFAIP3 SORL1 EEF1B2 RPS26 EIF3D SRSF5 ST13 ANKRD12 GPR183 TCF7 RORA P2RX5 MBD4 NME1 TIMM13 SPCS1 GHITM ACRBP ODC1 FAM45A ETFA MCL1 ARF5 ZNF706 ATP6V1F ANXA4 HSP90AB1 UBL5 DDT COX7A2 NDUFB8 ATP5F1 CX3CR1 PTP4A2 C1orf43 TMX2 GYG1 PSMC4 RASSF5 DUSP2 SUN2 UTRN NOTCH2NL RP11-290F20.3 CPPED1 FCGR1A H2AFY MSRB1 TGFBI LGALS9 MTMR11 C5AR1
PTGDS CHST12 RBM4 C19orf70 LYPLA1 RBMX ZFP36L1 RBMS1 ICAM3 UBAC2 PTGER4 RPS21 ZFAND1 MZT2B CCNI CCDC109B SLC2A3 RCN2 SPOCK2 SH2D1A BLK IFT57 SNRNP25 MOB1A C1QBP PRDX1 MMD RBBP6 MPP1 CIR1 LAMTOR1 PPP4C GNG5 UBE2L6 LAIR1 ATP5O COX7A2L HERPUD1 CIB1 C9orf16 VPS28 MIR4435-1HG CAPZA2 GBP2 RPL7L1 IFI44 SRP72 SYTL1 GYPC TAP1 TMEM18 BTK LILRA5 DUSP6 QPCT PGLS VCAN NUDT16 SAT2 CSTA MPEG1
PTPN7 DSCR3 TNFRSF14 PCSK7 SON APEX1 USP3 NDUFB11 HSPB1 MT1X SNHG8 FBL RPL41 AIMP1 ARHGAP15 PPM1K CCR7 ETS1 FCRLA PPAPDC1B AHCY CYC1 TRMT112 SNRPB CA2 AMD1 PLEKHO1 ADIPOR1 SCAND1 EIF5 VASP POLR2L DDAH2 HNRNPA0 CRIP1 PAPOLA TMEM59 UFC1 DBI ACTN4 RAB8A VPS29 CLEC2B SMIM12 ZFAND6 RASAL3 PPP2R5C BTN3A2 INSIG1 UNC93B1 PILRA TESC PID1 CARD16 ID1 PRAM1 IFNGR2 CYBB
TIGIT BAZ1A POLR2I TBC1D10C EIF2S3 ACTR1B CHMP4A MRPS33 RCBTB2 EEF1G EBPL CUTA TNFAIP8 ARID5B AQP3 CDC25B MZB1 NAT9 MCM5 SNF8 ERH UBE2I PTCRA GRAP2 MTHFD2 FDFT1 GNAS LAMTOR5 RBX1 SEC11A PARP14 ANP32B ATP5H RTFDC1 HNRNPF ARF6 DYNLL1 ASCL2 IDH2 MKRN1 EMG1 FLOT1 PMAIP1 MAEA DDIT4 PRMT2 CUX1 SCIMP LRRC25 SLC16A3 CXCL2 CASP1 CD1D APLP2 SLC11A1
PRR5 MRPL18 ARF4 FAM65B MED30 SSU72 RP11-51J9.5 HNRNPH3 EIF3E PCNP PPP3CC FLT3LG BCL11B CD72 RBM5 FABP5 FIBP CNN2 SEC61B SPARC RSU1 SNN GRSF1 HSD17B11 REEP5 RGS19 CASP4 LMO4 ATP5A1 NDUFB10 LSM10 C11orf58 HNRNPM PSMB10 MPST CLTB MYH9 DPM1 PSTPIP1 RAB11B RAB37 TERF2IP BUB3 UBLCP1 ALOX5 LILRB2 CTSL EREG ARRB2 NCOR2 JUND CLEC7A
KLRB1 SMIM7 N4BP2L1 PPP1R2 RP11-349A22.5 RAP1A IL27RA EPB41L4A-AS1 RSL1D1 MZT2A LY9 TRAT1 GATA3 TSPAN13 TPD52 PTTG1 TMEM208 DAD1 NDUFA1 MYL9 NT5C3A TAX1BP3 ILK CHMP1B ATP5EP2 RTN4 NOP10 SKAP2 NDUFB9 UQCR10 RPL22L1 RBM39 HNRNPA3 PRDX5 TMEM140 PSMA4 NMI VPS25 LINC00152 NT5C SEPT1 HSPA5 CDIP1 HHEX C19orf38 NAGA TNFSF10 GSN C4orf48 SULT1A1
MATK TRIM22 GMPR2 NCOR1 KIAA0040 POLR1D PRMT10 IMPDH2 CCNL1 DDX24 RP1-313I6.12 PRKCQ-AS1 SIRPG HLA-DOB ITM2C MCM7 MX2 SERBP1 ABRACL GP9 R3HDM4 MFSD1 CNDP2 ERGIC3 POLR2J CHCHD10 C20orf24 EPN1 C19orf53 ATP5J2 ZC3H15 CDC42 EIF3I CSNK2B OASL MPC1 CRELD2 DNAJC1 ATF6B ARL4C WIPF1 PPM1N SYK SLC31A2 GNAI2 RASSF4 AGTRAP GNS
IL2RB GRK6 C10orf32 SP100 IFI44L UBE2K TAF1D NSA2 TSTD1 PITPNA-AS1 LEF1 GPR171 SPIB CXXC5 FEN1 TXNDC17 TUBB MDH2 F13A1 NGFRAP1 MARCH2 HIST1H2BK IDH3G SUMO3 NDUFB5 TMEM219 U2AF1 MRPL23 LSM2 YWHAZ RABAC1 NDUFB2 UBE2B MOB2 SMARCA4 FDPS TECR ARPC5L CD83 HMOX1 DYNLT1 ENTPD1 GCA ADAP2
S100B C1orf63 MESDC2 MEAF6 IGBP1 LGALS3BP ZNF331 ILF3-AS1 SBDS NUP54 MAL AC092580.4 PKIG MAP3K8 MAD2L1 TIMM8B SUB1 PSMB6 TMEM40 CD82 PARVB THOC6 MT-ND3 DBNL OAS1 CLTA ISCU ATP5I EIF1B C4orf3 ENSA RAB7A RNF149 CMC1 CTD-2035E11.3 GNG2 SEPT6 LAMP1 IFIT2 LILRA3 NAAA OSM C20orf27
CD320 PPCS POLR3K RNPS1 METTL9 CAMK2G YME1L1 PSIP1 CHI3L2 SUSD3 BLNK IL4R TYMS MRPL28 C14orf166 ATP5J TREML1 PTTG1IP CORO1C POP7 OXA1L SNX17 ATG3 NDUFS7 MORF4L1 NDUFA2 EZR SEPT9 TMEM258 FAM49B SQRDL RAB9A RAB27A PHF14 PIM1 ARL6IP1 SIGLEC10 CSF1R EPSTI1 SULF2 SCO2
SH2D2A ADAR TLE4 KRT10 MRFAP1 TMC8 DDX18 AAK1 KIAA0125 SWAP70 EZH2 SPCS3 CYCS UQCRQ ITGA2B ACTN1 H1F0 LMNA TMEM147 EIF6 CD55 AHNAK NDUFS5 ALKBH7 DECR1 PAIP2 RBM8A OSTF1 UPP1 MRPL19 RRAGC TMEM109 ADD3 MRPS6 RELT APOBEC3B OAZ2 MGST2 NAPRT1
PLEKHF1 DNAJC15 MORF4L2 EIF4B NDUFA3 ADSL NDNL2 LDLRAP1 POU2AF1 NXT2 KIAA0101 EIF3A XRCC6 CALR CMTM5 HIST1H1C SOX4 SLC39A3 C1orf86 XAF1 MARCKSL1 ZNHIT1 NHP2L1 RNF7 MRPL43 DNAJA1 AP2M1 ARPC4 MGAT1 DHX36 FAM105A MAPK1IP1L HDAC1 LSM14A KYNU GPBAR1 HSBP1 IER3 EIF4EBP1
ZNHIT3 SF3A1 ARHGEF1 CMTM7 CCT2 TTC39C AL928768.3 HELQ RRM2 ISOC2 EIF4G2 PSMD8 CLDN5 RUFY1 DAPP1 C14orf2 WAS ENY2 VAMP5 SMDT1 MRPL54 RNASEH2B VAPA IRF1 ARHGDIA MMP24-AS1 GFER CYB561D2 ANXA2R RNF167 PIK3AP1 RXRA ATP6V0D1 GMPR NAGK
PRKAR1A SRSF11 SQSTM1 FAM107B MGAT4A RP5-887A10.1 CENPN GGCT ATPIF1 AURKAIP1 ERV3-1 RIOK3 PQBP1 PFDN2 COX17 SDCBP CCDC85B VDAC1 YPEL3 C9orf78 TMEM230 TBCB ANAPC13 CXXC1 CELF1 HMOX2 C19orf66 KIAA0930 OSCAR EIF4E2 SRA1
APBB1IP TAF7 DPP7 FGFR1OP2 OXNAD1 TNFRSF13B GMNN YIF1B SPCS2 MTDH TUBA1C ZNF263 DNAJC7 ACTR2 LYSMD2 SAMHD1 EIF3M HADHA CCT6A DYNLRB1 KRTCAP2 BAX PLIN2 STX18 DTNBP1 RPA2 SLC9A3R1 HES1 LILRB1 ZYX MIR24-2
IKZF1 KMT2E FNBP1 PIM2 NUCB2 TNFRSF17 TK1 HAUS4 PSMC5 PSMA5 BBC3 DERA BNIP3L ERP44 GRB2 SDHB CSDE1 PSMG2 DRAM2 SELK PDCD6 SELT IFIT1 PGM1 BAZ2A NDUFA5 TSC22D4 HES4 TNFRSF1B NR4A1
SNAP23 TAPSAR1 MGST3 DNAJB9 RGL4 GINS2 HDGF RANBP1 H2AFV PLA2G12A PICALM NENF TWF2 LSM6 LAMTOR2 SF1 ETFB CAPZA1 PYURF VDAC2 TMBIM4 ARRDC1 DYNLL2 HSH2D DENND2D BZW1 CAMK1 CAPNS1 CECR1
BEST1 TCEA1 NUCB1 DEGS1 CD40LG ZWINT PGP DUT TOMM22 PGRMC1 APP AKR1A1 EMC7 ZFAND5 NDUFS6 TRA2B COX14 SMAP2 S1PR4 FAM96B PSMD4 IFI27 APOBEC3C TBCC DEF6 CD300E UBE2D1 MIDN
TAOK3 UBXN4 TGOLN2 XXbac-BPG299F13.17 TRABD2A BIK REEP3 TBCA NDUFC2 FHL1 MID1IP1 NDUFS3 CHMP4B CD97 NDUFV2 PSMB4 PLEKHJ1 BCAP31 LMAN2 WDR1 RSAD2 ZBP1 PBXIP1 C5orf56 CD300LF THEMIS2 GRINA
CARD8 MED10 PTPN2 PARP1 CD6 CCNA2 LMNB1 HSP90B1 LSM4 SLC40A1 HADHB BAG1 SMS TGFB1 UQCRC2 PSMD9 TMEM256 ICAM2 TAPBP NDUFB7 ATP5SL ABHD14B RSRC2 ARHGEF40 NANS PTPRE
KLF3 SEC62 EVI2A TOB1 SH3YL1 BIRC5 ECH1 PDIA6 PNMA1 TREX1 LRPAP1 CAT GLIPR1 EIF2S2 ADI1 BLOC1S2 TMED2 SUMO1 RAB5C COMMD10 RGS1 HDDC2 CXCL16 ATOX1 CARS2
LIPA CCT4 TNIP1 SVIP CAMK4 XRCC5 TUFM TPM1 PARL SHKBP1 LTA4H RILPL2 MINOS1 NAA10 WBP2 SSBP1 GADD45B CHMP2A CD38 IFRD1 SMARCE1 TPPP3 MAPKAPK3 DNTTIP1
ZNF394 KIF5B EMC10 TMEM261 SATB1 HNRNPU P4HB CCDC69 TRAPPC2L FAM32A SSR3 MRPL20 TXN2 EIF5B ANXA11 GADD45GIP1 SRI ITGB7 DDX6 LILRA2 CBR1 MYO9B
CLK3 STK4 MTIF3 BEX4 FHIT NCL DEK HTATIP2 C7orf50 PMVK UQCRC1 TOMM20 MRPL41 MMADHC CD47 ACP1 SIVA1 DNAJC19 SDF4 EMR2 TNFRSF1A UBE2R2
SNX9 POLR3GL FRG1 CYLD USP10 NHP2 UBE2L3 ARHGAP4 BNIP2 YWHAE SERPINB6 C19orf60 NDUFA7 SLTM NDUFB4 PSMA1 RALY RNF139 DERL1 MS4A4A ADRBK1 LACTB
OCIAD1 CSRNP1 ASF1A UXS1 PA2G4 C19orf10 MRPL40 MRPS23 PARVG LSMD1 IDS RNF187 HAGH NDUFA12 RPS19BP1 BSG FKBP11 PRPF31 CTD-2006K23.1 TCIRG1 CDKN1A
MPHOSPH8 SLC38A1 CISH NOL7 MYEOV2 MTHFS FAM173A ACAA1 PHPT1 ATF4 OLA1 COQ7 PSMA2 RPN2 TCF25 SKAP1 SPSB3 C1QA NAMPT CREG1
CRBN CDKN1B PASK MDH1 PSMB2 DNAJC4 HBP1 NFKBIZ NDUFB1 DGUOK MRPL52 FBXW5 SSB BANF1 POLR2G NAP1L4 EBP ZNF703 ZNF106 FUOM
TMEM243 LSM5 TNFRSF25 CCT8 MRPL51 PRPF8 TIMMDC1 VMP1 SPG21 MRPS21 MPG PNKD CELF2 HMGN3 NDUFS8 YTHDF2 GCHFR CEBPA AP2A1 FBP1
PPIG G3BP1 CCDC104 HSPE1 TXN ZNF581 ABTB1 CYTH4 IFI27L2 CAMLG NDUFA9 TINF2 RPS27L KARS BUD31 STT3B MFSD10 ALDH3B1 C11orf21 PDXK
MED4 MPHOSPH10 INTS12 PHB SNRPD1 ACAP2 ZNF511 MTPN VIMP PPP1CC EMC6 SMARCB1 NUDC SHISA5 RNF213 REXO2 UBA2 C1QB NRROS PLIN3
UGP2 THAP7 NELL2 CCT3 SNRPC RPL26L1 TPP1 MYD88 COMT NDUFAF3 MVP SET CAPN2 IMP3 ATP6AP2 RBL2 ALDH9A1 HCAR3 MANBA ATF3
DCK ID3 AKTIP EWSR1 MRPS34 TRMT1 IFNGR1 AKIRIN2 MAP2K3 ANXA7 NDUFC1 LRCH4 IK C16orf13 MAP1LC3B NOP58 ORAI1 CXCL3 MBOAT7
WTAP SLC25A45 LINC00176 HNRNPR PSMD7 FAM195A DOCK2 COMMD9 CHCHD5 HAX1 COX7B SIAH2 EIF4H CDC42SE1 RER1 CD96 SURF4 PRKCD
ORMDL1 RP11-489E7.4 CD28 CBX3 ILF2 FAM192A IFIT3 STX10 MBNL1 MRPS16 UBE2J1 HMHA1 SPAG7 FMNL1 SH3BP5 B3GAT3 SGK1
CCDC107 GPATCH4 SCGB3A1 CACYBP CCT5 CINP ERICH1 HM13 NR4A2 ZNRD1 C12orf10 JAK1 SDHC DCTN3 EMB PLGRKT RAB34
TAGAP TNRC6C OSTC ATP5G1 ARPC1A VKORC1 MRP63 MKKS SMCHD1 TRAM1 EMC4 PCMT1 PRKCH GBP5
PDCD4 RP11-291B21.2 THOC7 SHFM1 RFXANK ECHDC1 SF3B2 STRA13 TANK COPS6 APH1A CDKN2D NUDT16L1 PRKD2
TCP1 HAPLN3 PRMT1 ANP32A PCGF5 SLA RAD23A GPI STX5 RAB2A ARL5A RBCK1 ODF2L TRPV2
CCNG1 HSPD1 NDUFAB1 SNX5 SNAPIN TRAPPC6A ANAPC15 THRAP3 DSTN SF3B14 LAPTM4A C14orf1 STAMBP
NONO MATR3 NUTF2 ERCC1 RTN3 PHB2 KXD1 FAM96A CCDC115 PIN1 CSK CYB5B MLLT11
IL16 EIF5A HSD17B10 NCKAP1L PEPD RSL24D1 MRPL16 ISCA2 RHOF SRRM2 CHMP5 PSMD5-AS1 SYNE1
LUC7L3 PDCD5 POLR2E IL10RB STK38 STK17B SLIRP AAMP EAPP CFLAR SSNA1 ORMDL3 TFDP2
FNTA UBE2N PPP1R7 PFKL C9orf89 LINC00493 UQCRB CHPT1 COPS5 CCNDBP1 WASF2 CCDC167
N4BP2L2 MAGOH SEC11C UNC119 CDV3 GGNBP2 CTNNBL1 REL SNX6 COPZ1 KDELR2 GIMAP6
TPR PPHLN1 SNRPD3 ATP2B1 HEXB NSMCE1 FAM50A ITGAE SARS TMED10 COX6A1P2 PYCR2
G3BP2 VDAC3 PSMC3 RABGAP1L ZNF524 CWC15 KDELR1 USF2 FBXO7 DDOST SELPLG RNF115
ELOVL5 NUCKS1 OTUB1 NFYC VMA21 PDCD2 IRF2BP2 ITGA4 GNAI3 SDHD WSB1 PTPN4
SCAF11 AIP MRPS7 MTSS1 PLD3 MRPS18B PSMD11 NSFL1C CCM2 MRPL34 C7orf73 SYNE2
PRPF38B RBM17 PSMA3 RALB ATP6V1B2 PRRC2C HARS GRPEL1 SNHG15 TMEM160 CMTM6 TMEM87A
SLC3A2 GTF2A2 CDK2AP2 DPEP2 CHIC2 CNPY2 SEC13 SAMD9L ABT1 ZMAT2 SUCLG1 RBM38
DPY30 STRAP SRM SNAP29 OAS3 RP11-1143G9.4 FAM89B CCS RNF5 IFI16 PRDX3 THAP11
CYTH1 SNRPE UFD1L GCH1 CEBPG NECAP2 NIT2 ACO2 FAM162A TRAPPC3 MIEN1 OBFC1
CCDC59 EPC1 POMP GINM1 CBWD1 DDX17 FLYWCH2 MRPL12 CCNH C17orf62 RNASEH2C CD59
WHSC1L1 RNF126 MRPL11 RIPK2 TXNL1 PTOV1 STARD7 KTN1 PDAP1 SH3GLB1 CCND2
SF3B1 ADH5 GTF3C6 UTP6 H1FX FH TRIM38 LCP2 RNF166 AKAP13 PHACTR4
RBM23 PPM1G SDF2L1 PACSIN2 MED28 TRA2A BCL7B TXNDC12 ELOF1 GOLT1B
BBX SNRPF PSMC2 LINC01003 SRRM1 COMMD5 GLUD1 IDH3B NDUFA6 MTFP1
NAA38 WBSCR22 NDUFB3 RBM25 CKS1B ELF1 MRPL55 ATP6V1E1 CXCR3
IRF9 MAPRE1 TUBB4B DCXR DUSP22 COMMD8 POLR2F FKBP2 GALM
MAT2B METTL23 XRN2 SFPQ PTPN1 DCTN2 ARHGAP30 CAST ACD
GPBP1 SNRPA1 CARHSP1 CHMP3 SRSF6 TMEM126B DDX46 SFT2D1 TNFRSF18
CHD2 AKR1B1 C11orf48 DAP3 TEN1 NME4 COMMD7 CISD3
EGLN2 PPP1R12A NDUFS4 MRPS15 COMMD4 CCZ1 PPP1R11 NDUFB6
ARL2BP STOML2 TCEB1 EIF2A TSSC1 RPF1 AUP1 UBE2F
RBBP7 MRPL9 IMP4 MFNG CWC25 YY1 PPP2R1A PSENEN
RPAIN MFF MRPS18C PSMF1 CEPT1 RAB11A PSMD13 FAM204A
NKTR PITHD1 SEC61G PHF5A CHD9 KPNB1 PET100 SCAMP2
PNN ANAPC5 ZDHHC12 TAF9 CD81 PDCL3 ITGB1BP1 NAPA
DARS CDC123 PTBP1 NDUFA10 RFC2 GLG1 TEX264 OS9
NCBP2 SNRPG HNRNPD SNW1 ACP5 METTL5 CSNK1A1 ASNA1
AATF LYRM4 NUDT1 FDX1 UROS LSM1 MLX
RSF1 GLRX3 PAK2 ECHS1 MAGED2 GSDMD MEA1
CHCHD7 SRP19 SEPT2 MT-ND4L ZCRB1 RTF1 UBE2A
AL592183.1 C17orf89 AK2 DHPS CCDC53 NME3 EMC3
BTF3L4 SNRNP70 RPS6KB2 PMF1 APOL3 TAF12 STK10
FLI1 PSMD6 TIMM17A SUCLG2 SHARPIN PUF60 TPGS1
NR3C1 NUDT5 DCTPP1 PRKCSH PPA2 IAH1 TMEM141
ITSN2 DDX39A HMGN2 BABAM1 ZFAND2B PDLIM2 PKN1
ZC3HAV1 HSPA9 MTCH2 PABPC4 SLFN5 MGMT UBE2E3
MTERFD2 VCP MRPL47 ATG12 CCDC90B MRPS12 CCDC124
GRAMD1A SRSF1 UQCC2 NXT1 TMBIM1 BRD2 COA3
ZNF24 EIF4A3 MRPL36 PPIE TSNAX CANX RPN1
USE1 CLPP NKAP PIH1D1 SEPHS2 NAA20 RAB4A
USP16 TTC1 DAZAP1 UBE2E1 SLC44A2 NOP56 TADA3
ARFGAP2 IDI1 MRPL15 QARS ASCC2 FUNDC2 MRPS11
FRG1B MRPS26 TIMM10 GID8 BFAR MTCH1 DNAJC2
TIMM9 HNRNPUL1 DTYMK MRPL3 CCDC25 TXNL4A FKBP5
MARCH7 HINT2 TOP1 USP15 NRBP1 PTRHD1 RRP7A
LENG1 SLBP PSMG3 MRPL32 PHF3 SLC25A11 DNAJB11
CYB5A PSMD14 DLD MAPKAPK5-AS1 SYS1 COX16 YIPF3
SMIM19 FKBP3 DCPS PTP4A1 PPM1B NDUFV1 DIAPH1
LINC-PINT SDHA TFDP1 EIF1AX YAF2 ARHGAP9 HNRNPH2
OARD1 NCBP2-AS2 RP11-139H15.1 LNPEP LSM3 PFDN6
TTC14 SNRNP40 FXR1 PEX16 GPAA1
TMEM242 RAD21 EIF2B1 SRSF4 MAPK1
DDIT3 VBP1 ESD PHF11 ACAA2
SAFB2 C14orf119 SF3A3 VTI1B PANK2
CEBPZ-AS1 FAM177A1 MCTS1 AKR7A2 IFNAR2
BRD9 HPRT1 CCNC URM1 MAD2L2
PRDM2 SLC25A39 COA6 AHSA1 CCDC88C
CD84 RPA3 PHYKPL JAGN1 RAB4B
EGR1 SNRPA CAMTA1 GRHPR G6PD
GLRX5 C19orf25 ROMO1 ARFGAP3
TIAL1 CHCHD1 PFDN1 BCCIP
EXOSC8 EIF4E STX8 RABL6
LBR C11orf83 RNPEPL1 SMC4
ILF3 SREK1 TMEM248 MVD
SAP30BP ANKRD11 GGA1 MRPS28
ACTR10 TMEM138 IFT20
NUDT21 PTGES2 AKAP9
UBE2G1 LGALS8 C18orf32
LARP7 MLEC BRMS1
PPIH C5orf15 CAPN1
EBNA1BP2 FEM1B DDRGK1
VOPP1 RAD23B
GNL3 WDR33
CISD2 WDR61
SSRP1 SUGT1
PDIA4 LAGE3
SYNCRIP RBBP4
C1orf35 ESYT1
MAP7D1 APOA1BP
DNAJC9 MIF4GD
HAUS1 CFDP1
ILKAP UBE2J2
RPUSD3 MRPS5
CDKN2AIPNL SRPK2
POLD2 FAM200B
HNRNPAB C17orf49
TMEM106C NUBP2
CBX1 PRKAG1
NAA50 SURF2
MCM3 SSSCA1
CISD1 EI24
TFPT CSNK1D
UBA5 DCTD
LMAN1 PEX2
PGRMC2 PNKP
C19orf48 TMEM70
C14orf142 JMJD6
PAICS DCAF5
RCE1

Module lookup

If you want to quickly find which module a particular feature was assigned to, the featureModuleLookup function can be used. Here will will look up a marker gene for T-cells called “CD3E”:

mod <- featureModuleLookup(sce, feature = c("CD3E", "S100A8"))
mod
##   CD3E S100A8 
##     27     70

Module heatmaps

The function moduleHeatmap can be used to view the expression of features across cells for a specific module. The featureModule parameter denotes the module(s) to be displayed. Cells are ordered from those with the lowest probability of the module on the left to the highest probability on the right. Similarly, features are ordered from those with the highest probability within the module on the top to the lowest probability on the bottom.

moduleHeatmap(sce, featureModule = 27, useAssay = useAssay, altExpName = altExpName)

The parameter topCells can be used to control the number of cells included in the heatmap. By default, only the 100 cells with the lowest probabilities and the 100 cells with the highest probabilities for each selected module are included (i.e. topCells = 100 by default). To display all cells, this parameter can be set to NULL:

moduleHeatmap(sce, featureModule = 27, topCells = NULL, useAssay = useAssay, altExpName = altExpName)

Note: Multiple modules can be displayed by giving a vector of module indices to the parameter featureModule. If featureModule is not specified, then all modules will be plotted.

Module probabilities on 2-D embeddings

The function plotDimReduceModule can be used visualize the probabilities of a particular module or sets of modules on a reduced dimensional plot such as a UMAP. This can be another quick method to see how modules are expressed across various cells in 2-D space. As an example, we can look at module 70 which contained S100A8:

plotDimReduceModule(sce, modules = 70, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP")

Similarly, multiple modules can be plotting in a grid of UMAPs:

plotDimReduceModule(sce, modules = 70:78, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP")

In this grid, we can see that module 70 (which has high levels of S100A8 and S100A9) is highly expressed in cell populations 2 and 3, module 71 (which contains CD14) can be used to identify all CD14+ monocytes, module 72 (which contains CST3) is expressed across both CD14 and FCGR3A (CD16) expressing monocytes, and module 73 (which contains CD4) is expressed broadly across both monocytes and dendritic cells as well as some T-cell populations. If we were interesting in defining transcriptional programs active across all monocytes, we could examine the genes found in module 72. If we were interested in defining transcriptional programs for all CD14+ monocytes, we could examine the genes in module 71. These patterns can also be observed in the Probability Map

In the celda probability map, we saw that the unknown T-cell population 13 had high levels of module 30. We can examine both module heatmaps and module probability maps to further explore this:

moduleHeatmap(sce, featureModule = 30, useAssay = useAssay, altExpName = altExpName)

plotDimReduceModule(sce, modules = 30, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP")

Module 30 has high levels of genes associated with proliferation including HMGA1, STMN1, PCNA, HMGB2, and TUBA1B. We can therefore re-label these cells as “Proliferating T-cells”.

Identification and plotting of marker genes

In addition to examining modules, differential expression can be used to identify potential marker genes up-regulated in specific cell populations. The function findMarkerDiffExp in the singleCellTK package will find markers up-regulated in each cell population compared to all the others.

Differential expression to identify marker genes

# Normalize counts (if not performed previously)
library(scater)
sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts")

# Run differential expression analysis
sce <- findMarkerDiffExp(sce, useAssay = "logcounts", method = "wilcox", cluster = celdaClusters(sce), minMeanExpr = 0, fdrThreshold = 0.05, log2fcThreshold = 0, minClustExprPerc = 0, maxCtrlExprPerc = 1)
## Warning: 'findMarkerDiffExp' is deprecated.
## Use 'runFindMarker' instead.
## See help("Deprecated")

The function plotMarkerDiffExp can be used to plot the results in a heatmap. The topN parameter will plot the top N ranked genes for each cluster.

# Plot differentially expressed genes that pass additional thresholds 'minClustExprPerc' and 'maxCtrlExprPerc'
plotMarkerDiffExp(sce, topN = 5, log2fcThreshold = 0, rowLabel = TRUE, fdrThreshold = 0.05, minClustExprPerc = 0.6, maxCtrlExprPerc = 0.4, minMeanExpr = 0)
## Warning: 'plotMarkerDiffExp' is deprecated.
## Use 'plotFindMarkerHeatmap' instead.
## See help("Deprecated")

Other parameters such as minClustExprPerc (the minimum number of cells expressing the marker gene in the cluster) and maxCtrlExprPerc (the maximum number of cells expression the marker gene in other clusters) can be used to control how specific each marker gene is to each cell populations. Similarly, adding a log2 fold-change cutoff (e.g. 1) can select for markers that are more strongly up-regulated in a cell population.

Violin plots for marker genes

The plotCeldaViolin function can be used to examine the distribution of expression of various features across cell population clusters derived from celda. Here we can see that the gene CD79A has high expression in the B-cell cluster and HMGB2 has high expression in the proliferating T-cell population.

# Normalize counts if not performed in previous steps
library(scater)
sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts")

# Make violin plots for marker genes
plotCeldaViolin(sce, useAssay = "logcounts", features = c("CD79A", "HMGB2"))

Generating HTML reports

The celda package comes with two functions for generating comprehensive HTML reports that 1) capture the process of selecting K/L for a celda_CG model and 2) plot the results from the downstream analysis. The first report runs both recursiveSplitModule and recursiveSplitCell for selection of L and K, respectively. To recapitulate the complete analysis presented in this tutorial in the HTML report, the following command can be used:

sce <- reportCeldaCGRun(sce, sampleLabel = NULL, useAssay = useAssay, altExpName = altExpName, minCell = 3, minCount = 3, initialL = 10, maxL = 150, initialK = 3, maxK = 25, L = 80, K = 14)

All of the parameters in this function are the same that were used throughout this tutorial in the selectFeatures, recursiveSplitModule, and recursiveSplitCell functions. Note that this report does not do cell filtering, so that must be completed before running this function. The returned SCE object will have the celda_CG model with selected K and L which can be used in any of the downstream plotting functions as well as input into the second plotting report described next.

The second report takes in as input an SCE object with a fitted celda_CG model and systematically generates several plots that facilitate exploratory analysis including cell subpopulation cluster labels on 2-D embeddings, user-specified annotations on 2-D embeddings, module heatmaps, module probabilities, expression of marker genes on 2-D embeddings, and the celda probability map. The report can be generated with the following code:

reportCeldaCGPlotResults(sce, reducedDimName = "celda_UMAP", features = markers, useAssay = useAssay, altExpName = altExpName, cellAnnot = c("total", "detected", "decontX_contamination", "subsets_mito_percent"), cellAnnotLabel = "scDblFinder_doublet_call")

User-supplied annotations to plot on the 2-D embedding can be specified through the cellAnnot and cellAnnotLabel variables. Both parameters will allow for plotting of variables stored in the colData of the SCE on the 2-D embedding plot specified by reducedDimName parameter. For cellAnnot, integer and numeric variables will be plotted as as continuous variables while factors and characters will be plotted as categorical variables. For cellAnnotLabel, all variables will be coerced to a factor and the labels of the categories will be plotted on the scatter plot.

Other useful functions

Matrix factorization

The celda model factorizes the original matrix into three matrices:

1) module - The probability of each feature in each module (Psi)

2) cellPopulation - The probability of each module in each cell population (Phi)

3) sample - The probability of each cell population in each sample (Theta)

Additionally, we can calculate the probability of each module within each cell (cell). The cell matrix can essentially be used to replace PCs from PCA and is useful for downstream visualization (e.g. generating 2-D embeddings). All of these matrices can be retrieved with the factorizeMatrix function. The matrices are returned in three different versions: unnormalized counts, proportions (normalized by the total), or posterior estimates (where the Dirichlet concentration parameter is added in before normalization).

# Factorize the original counts matrix
fm <- factorizeMatrix(sce)

# Three different version of each matrix:
names(fm)
## [1] "counts"      "proportions" "posterior"
# Get normalized proportional matrices
dim(fm$proportions$cell) # Matrix of module probabilities for each cell
## [1]   80 2675
dim(fm$proportions$module) # Matrix of feature probabilities for each module
## [1] 2639   80
dim(fm$proportions$cellPopulation) # Matrix of module probabilities for each cell population
## [1] 80 14
dim(fm$proportions$sample) # Matrix of cell population probabilities in each sample
## [1] 14  1

Changing the feature display name

The parameter displayName can be used to change the labels of the rows from the rownames to a column in the rowData of the SCE object. The function is available in plotDimReduceFeature and moduleHeatmap. For example, if we did not change the rownames to Symbol_TENx in the beginning of the tutorial, the following code still could be run in moduleHeatmap to display the gene symbol even if the rownames were set to the original Ensembl IDs:

moduleHeatmap(sce, featureModule = 27, useAssay = useAssay, altExpName = altExpName, displayName = "Symbol_TENx")

Session information

sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.30.1               scuttle_1.12.0             
##  [3] kableExtra_1.4.0            knitr_1.45                 
##  [5] ggplot2_3.5.0               celda_1.18.2               
##  [7] singleCellTK_2.12.2         TENxPBMCData_1.20.0        
##  [9] HDF5Array_1.30.0            rhdf5_2.46.1               
## [11] DelayedArray_0.28.0         SparseArray_1.2.4          
## [13] S4Arrays_1.2.1              abind_1.4-5                
## [15] Matrix_1.6-5                SingleCellExperiment_1.24.0
## [17] SummarizedExperiment_1.32.0 Biobase_2.62.0             
## [19] GenomicRanges_1.54.1        GenomeInfoDb_1.38.8        
## [21] IRanges_2.36.0              S4Vectors_0.40.2           
## [23] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
## [25] matrixStats_1.2.0          
## 
## loaded via a namespace (and not attached):
##   [1] later_1.3.2                   BiocIO_1.12.0                
##   [3] bitops_1.0-7                  filelock_1.0.3               
##   [5] tibble_3.2.1                  R.oo_1.26.0                  
##   [7] graph_1.80.0                  XML_3.99-0.16.1              
##   [9] lifecycle_1.0.4               scDblFinder_1.16.0           
##  [11] doParallel_1.0.17             edgeR_4.0.16                 
##  [13] lattice_0.22-5                MASS_7.3-60.0.1              
##  [15] magrittr_2.0.3                limma_3.58.1                 
##  [17] sass_0.4.8                    rmarkdown_2.25               
##  [19] jquerylib_0.1.4               yaml_2.3.8                   
##  [21] metapod_1.10.1                httpuv_1.6.14                
##  [23] reticulate_1.34.0             cowplot_1.1.3                
##  [25] RColorBrewer_1.1-3            DBI_1.2.1                    
##  [27] zlibbioc_1.48.2               Rtsne_0.17                   
##  [29] purrr_1.0.2                   R.utils_2.12.3               
##  [31] RCurl_1.98-1.14               WriteXLS_6.5.0               
##  [33] rappdirs_0.3.3                circlize_0.4.16              
##  [35] GenomeInfoDbData_1.2.11       ggrepel_0.9.5                
##  [37] irlba_2.3.5.1                 eds_1.4.0                    
##  [39] annotate_1.80.0               dqrng_0.3.2                  
##  [41] svglite_2.1.3                 pkgdown_2.0.7                
##  [43] DelayedMatrixStats_1.24.0     codetools_0.2-19             
##  [45] DropletUtils_1.22.0           xml2_1.3.6                   
##  [47] shape_1.4.6.1                 tidyselect_1.2.0             
##  [49] farver_2.1.1                  ScaledMatrix_1.10.0          
##  [51] viridis_0.6.5                 BiocFileCache_2.10.1         
##  [53] GenomicAlignments_1.38.2      jsonlite_1.8.8               
##  [55] GetoptLong_1.0.5              BiocNeighbors_1.20.2         
##  [57] ellipsis_0.3.2                iterators_1.0.14             
##  [59] systemfonts_1.0.6             dbscan_1.1-12                
##  [61] foreach_1.5.2                 tools_4.3.3                  
##  [63] ragg_1.3.0                    Rcpp_1.0.12                  
##  [65] glue_1.7.0                    gridExtra_2.3                
##  [67] xfun_0.41                     dplyr_1.1.4                  
##  [69] withr_3.0.0                   combinat_0.0-8               
##  [71] BiocManager_1.30.22           fastmap_1.1.1                
##  [73] MCMCprecision_0.4.0           rhdf5filters_1.14.1          
##  [75] bluster_1.12.0                fansi_1.0.6                  
##  [77] digest_0.6.35                 rsvd_1.0.5                   
##  [79] R6_2.5.1                      mime_0.12                    
##  [81] textshaping_0.3.7             colorspace_2.1-0             
##  [83] Cairo_1.6-2                   RSQLite_2.3.5                
##  [85] R.methodsS3_1.8.2             utf8_1.2.4                   
##  [87] generics_0.1.3                data.table_1.15.4            
##  [89] FNN_1.1.4                     rtracklayer_1.62.0           
##  [91] httr_1.4.7                    uwot_0.1.16                  
##  [93] pkgconfig_2.0.3               gtable_0.3.4                 
##  [95] blob_1.2.4                    ComplexHeatmap_2.18.0        
##  [97] XVector_0.42.0                htmltools_0.5.7              
##  [99] clue_0.3-65                   GSEABase_1.64.0              
## [101] scales_1.3.0                  png_0.1-8                    
## [103] enrichR_3.2                   scran_1.30.2                 
## [105] rstudioapi_0.15.0             reshape2_1.4.4               
## [107] rjson_0.2.21                  curl_5.2.1                   
## [109] GlobalOptions_0.1.2           cachem_1.0.8                 
## [111] stringr_1.5.1                 BiocVersion_3.18.1           
## [113] parallel_4.3.3                vipor_0.4.7                  
## [115] AnnotationDbi_1.64.1          restfulr_0.0.15              
## [117] desc_1.4.3                    pillar_1.9.0                 
## [119] grid_4.3.3                    vctrs_0.6.5                  
## [121] promises_1.2.1                BiocSingular_1.18.0          
## [123] dbplyr_2.4.0                  beachmat_2.18.1              
## [125] xtable_1.8-4                  cluster_2.1.6                
## [127] beeswarm_0.4.0                evaluate_0.23                
## [129] magick_2.8.2                  cli_3.6.2                    
## [131] locfit_1.5-9.9                compiler_4.3.3               
## [133] Rsamtools_2.18.0              rlang_1.1.3                  
## [135] crayon_1.5.2                  labeling_0.4.3               
## [137] plyr_1.8.9                    fs_1.6.3                     
## [139] ggbeeswarm_0.7.2              stringi_1.8.3                
## [141] viridisLite_0.4.2             BiocParallel_1.36.0          
## [143] munsell_0.5.1                 Biostrings_2.70.1            
## [145] ExperimentHub_2.10.0          RcppEigen_0.3.4.0.0          
## [147] GSVAdata_1.38.0               sparseMatrixStats_1.14.0     
## [149] bit64_4.0.5                   Rhdf5lib_1.24.1              
## [151] KEGGREST_1.42.0               statmod_1.5.0                
## [153] shiny_1.8.0                   highr_0.10                   
## [155] interactiveDisplayBase_1.40.0 AnnotationHub_3.10.0         
## [157] igraph_2.0.3                  memoise_2.0.1                
## [159] bslib_0.6.1                   bit_4.0.5                    
## [161] xgboost_1.7.7.1

Site built with pkgdown 2.0.7.

================================================ FILE: docs/articles/celda_pbmc3k_files/accessible-code-block-0.0.1/empty-anchor.js ================================================ // Hide empty tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); ================================================ FILE: docs/articles/celda_pbmc3k_files/kePrint-0.0.1/kePrint.js ================================================ $(document).ready(function(){ if (typeof $('[data-toggle="tooltip"]').tooltip === 'function') { $('[data-toggle="tooltip"]').tooltip(); } if ($('[data-toggle="popover"]').popover === 'function') { $('[data-toggle="popover"]').popover(); } }); ================================================ FILE: docs/articles/celda_pbmc3k_files/lightable-0.0.1/lightable.css ================================================ /*! * lightable v0.0.1 * Copyright 2020 Hao Zhu * Licensed under MIT (https://github.com/haozhu233/kableExtra/blob/master/LICENSE) */ .lightable-minimal { border-collapse: separate; border-spacing: 16px 1px; width: 100%; margin-bottom: 10px; } .lightable-minimal td { margin-left: 5px; margin-right: 5px; } .lightable-minimal th { margin-left: 5px; margin-right: 5px; } .lightable-minimal thead tr:last-child th { border-bottom: 2px solid #00000050; empty-cells: hide; } .lightable-minimal tbody tr:first-child td { padding-top: 0.5em; } .lightable-minimal.lightable-hover tbody tr:hover { background-color: #f5f5f5; } .lightable-minimal.lightable-striped tbody tr:nth-child(even) { background-color: #f5f5f5; } .lightable-classic { border-top: 0.16em solid #111111; border-bottom: 0.16em solid #111111; width: 100%; margin-bottom: 10px; margin: 10px 5px; } .lightable-classic tfoot tr td { border: 0; } .lightable-classic tfoot tr:first-child td { border-top: 0.14em solid #111111; } .lightable-classic caption { color: #222222; } .lightable-classic td { padding-left: 5px; padding-right: 5px; color: #222222; } .lightable-classic th { padding-left: 5px; padding-right: 5px; font-weight: normal; color: #222222; } .lightable-classic thead tr:last-child th { border-bottom: 0.10em solid #111111; } .lightable-classic.lightable-hover tbody tr:hover { background-color: #F9EEC1; } .lightable-classic.lightable-striped tbody tr:nth-child(even) { background-color: #f5f5f5; } .lightable-classic-2 { border-top: 3px double #111111; border-bottom: 3px double #111111; width: 100%; margin-bottom: 10px; } .lightable-classic-2 tfoot tr td { border: 0; } .lightable-classic-2 tfoot tr:first-child td { border-top: 3px double #111111; } .lightable-classic-2 caption { color: #222222; } .lightable-classic-2 td { padding-left: 5px; padding-right: 5px; color: #222222; } .lightable-classic-2 th { padding-left: 5px; padding-right: 5px; font-weight: normal; color: #222222; } .lightable-classic-2 tbody tr:last-child td { border-bottom: 3px double #111111; } .lightable-classic-2 thead tr:last-child th { border-bottom: 1px solid #111111; } .lightable-classic-2.lightable-hover tbody tr:hover { background-color: #F9EEC1; } .lightable-classic-2.lightable-striped tbody tr:nth-child(even) { background-color: #f5f5f5; } .lightable-material { min-width: 100%; white-space: nowrap; table-layout: fixed; font-family: Roboto, sans-serif; border: 1px solid #EEE; border-collapse: collapse; margin-bottom: 10px; } .lightable-material tfoot tr td { border: 0; } .lightable-material tfoot tr:first-child td { border-top: 1px solid #EEE; } .lightable-material th { height: 56px; padding-left: 16px; padding-right: 16px; } .lightable-material td { height: 52px; padding-left: 16px; padding-right: 16px; border-top: 1px solid #eeeeee; } .lightable-material.lightable-hover tbody tr:hover { background-color: #f5f5f5; } .lightable-material.lightable-striped tbody tr:nth-child(even) { background-color: #f5f5f5; } .lightable-material.lightable-striped tbody td { border: 0; } .lightable-material.lightable-striped thead tr:last-child th { border-bottom: 1px solid #ddd; } .lightable-material-dark { min-width: 100%; white-space: nowrap; table-layout: fixed; font-family: Roboto, sans-serif; border: 1px solid #FFFFFF12; border-collapse: collapse; margin-bottom: 10px; background-color: #363640; } .lightable-material-dark tfoot tr td { border: 0; } .lightable-material-dark tfoot tr:first-child td { border-top: 1px solid #FFFFFF12; } .lightable-material-dark th { height: 56px; padding-left: 16px; padding-right: 16px; color: #FFFFFF60; } .lightable-material-dark td { height: 52px; padding-left: 16px; padding-right: 16px; color: #FFFFFF; border-top: 1px solid #FFFFFF12; } .lightable-material-dark.lightable-hover tbody tr:hover { background-color: #FFFFFF12; } .lightable-material-dark.lightable-striped tbody tr:nth-child(even) { background-color: #FFFFFF12; } .lightable-material-dark.lightable-striped tbody td { border: 0; } .lightable-material-dark.lightable-striped thead tr:last-child th { border-bottom: 1px solid #FFFFFF12; } .lightable-paper { width: 100%; margin-bottom: 10px; color: #444; } .lightable-paper tfoot tr td { border: 0; } .lightable-paper tfoot tr:first-child td { border-top: 1px solid #00000020; } .lightable-paper thead tr:last-child th { color: #666; vertical-align: bottom; border-bottom: 1px solid #00000020; line-height: 1.15em; padding: 10px 5px; } .lightable-paper td { vertical-align: middle; border-bottom: 1px solid #00000010; line-height: 1.15em; padding: 7px 5px; } .lightable-paper.lightable-hover tbody tr:hover { background-color: #F9EEC1; } .lightable-paper.lightable-striped tbody tr:nth-child(even) { background-color: #00000008; } .lightable-paper.lightable-striped tbody td { border: 0; } ================================================ FILE: docs/articles/decontX.html ================================================ Decontamination of ambient RNA in single-cell genomic data with DecontX • celda

Introduction

Droplet-based microfluidic devices have become widely used to perform single-cell RNA sequencing (scRNA-seq). However, ambient RNA present in the cell suspension can be aberrantly counted along with a cell’s native mRNA and result in cross-contamination of transcripts between different cell populations. DecontX is a Bayesian method to estimate and remove contamination in individual cells. DecontX assumes the observed expression of a cell is a mixture of counts from two multinomial distributions: (1) a distribution of native transcript counts from the cell’s actual population and (2) a distribution of contaminating transcript counts from all other cell populations captured in the assay. Overall, computational decontamination of single cell counts can aid in downstream clustering and visualization.

The package can be loaded using the library command.

library(celda)

Importing data

DecontX can take either a SingleCellExperiment object or a counts matrix as input. decontX will attempt to convert any input matrix to class dgCMatrix from package Matrix before starting the analysis.

To import datasets directly into an SCE object, the singleCellTK package has several importing functions for different preprocessing tools including CellRanger, STARsolo, BUStools, Optimus, DropEST, SEQC, and Alevin/Salmon. For example, the following code can be used as a template to read in the filtered and raw matrices for multiple samples processed with CellRanger:

library(singleCellTK)
sce <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"))

Within each sample directory, there should be subfolders called "outs/filtered_feature_bc_matrix/" or "outs/raw_feature_bc_matrix/" with files called matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz. If these files are in different subdirectories, the importCellRangerV3Sample function can be used to import data from a different directory instead.

Optionally, the “raw” or “droplet” matrix can also be easily imported by setting the dataType argument to “raw”:

sce.raw <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"), dataType = "raw")

The raw matrix can be passed to the background parameter in decontX as described below. If using Seurat, go to the Working with Seurat section for details on how to convert between SCE and Seurat objects.

Load PBMC4k data from 10X

We will utilize the 10X PBMC 4K dataset as an example in this vignette. This data can be easily retrieved from the package TENxPBMCData. Make sure the the column names are set before running decontX.

# Load PBMC data
library(TENxPBMCData)
sce <- TENxPBMCData("pbmc4k")
colnames(sce) <- paste(sce$Sample, sce$Barcode, sep = "_")
rownames(sce) <- rowData(sce)$Symbol_TENx
counts(sce) <- as(counts(sce), "dgCMatrix")

Running decontX

A SingleCellExperiment (SCE) object or a sparse matrix containing the counts for filtered cells can be passed to decontX via the x parameter. The matrix to use in an SCE object can be specified with the assayName parameter, which is set to "counts" by default. There are two major ways to run decontX: with and without the raw/droplet matrix containing empty droplets. Here is an example of running decontX without supplying the background:

sce <- decontX(sce)

In this scenario, decontX will estimate the contamination distribution for each cell cluster based on the profiles of the other cell clusters in the filtered dataset. The estimated contamination results can be found in the colData(sce)$decontX_contamination and the decontaminated counts can be accessed with decontXcounts(sce). decontX will perform heuristic clustering to quickly define major cell clusters. However if you have your own cell cluster labels, they can be specified with the z parameter. These results will be used throughout the rest of the vignette.

The raw/droplet matrix can be used to empirically estimate the distribution of ambient RNA, which is especially useful when cells that contributed to the ambient RNA are not accurately represented in the filtered count matrix containing the cells. For example, cells that were removed via flow cytometry or that were more sensitive to lysis during dissociation may have contributed to the ambient RNA but were not measured in the filtered/cell matrix. The raw/droplet matrix can be input as an SCE object or a sparse matrix using the background parameter:

sce <- decontX(sce, background = sce.raw)

Only empty droplets in the background matrix should be used to estimate the ambient RNA. If any cell ids (i.e. colnames) in the raw/droplet matrix supplied to the background parameter are also found in the filtered counts matrix (x), decontX will automatically remove them from the raw matrix. However, if the cell ids are not available for the input matrices, decontX will treat the entire background input as empty droplets. All of the outputs are the same as when running decontX without setting the background parameter.

Note: If the input object is just a matrix and not an SCE object, make sure to save the output into a variable with a different name (e.g. result <- decontX(mat)). The result object will be a list with contamination in result$contamination and the decontaminated counts in result$decontXcounts.

Plotting DecontX results

Cluster labels on UMAP

DecontX creates a UMAP which we can use to plot the cluster labels automatically identified in the analysis. Note that the clustering approach used here is designed to find “broad” cell types rather than individual cell subpopulations within a cell type.

umap <- reducedDim(sce, "decontX_UMAP")
plotDimReduceCluster(x = sce$decontX_clusters,
    dim1 = umap[, 1], dim2 = umap[, 2])

Contamination on UMAP

The percentage of contamination in each cell can be plotting on the UMAP to visualize what what clusters may have higher levels of ambient RNA.

Expression of markers on UMAP

Known marker genes can also be plotted on the UMAP to identify the cell types for each cluster. We will use CD3D and CD3E for T-cells, LYZ, S100A8, and S100A9 for monocytes, CD79A, CD79B, and MS4A1 for B-cells, GNLY for NK-cells, and PPBP for megakaryocytes.

library(scater)
sce <- logNormCounts(sce)
plotDimReduceFeature(as.matrix(logcounts(sce)),
    dim1 = umap[, 1],
    dim2 = umap[, 2],
    features = c("CD3D", "CD3E", "GNLY",
        "LYZ", "S100A8", "S100A9",
        "CD79A", "CD79B", "MS4A1"),
    exactMatch = TRUE)
## Warning in asMethod(object): sparse->dense coercion: allocating vector of size
## 1.1 GiB

Barplot of markers detected in cell clusters

The percetage of cells within a cluster that have detectable expression of marker genes can be displayed in a barplot. Markers for cell types need to be supplied in a named list. First, the detection of marker genes in the original counts assay is shown:

markers <- list(Tcell_Markers = c("CD3E", "CD3D"),
    Bcell_Markers = c("CD79A", "CD79B", "MS4A1"),
    Monocyte_Markers = c("S100A8", "S100A9", "LYZ"),
    NKcell_Markers = "GNLY")
cellTypeMappings <- list(Tcells = 2, Bcells = 5, Monocytes = 1, NKcells = 6)
plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = "counts")

We can then look to see how much decontX removed aberrant expression of marker genes in each cell type by changing the assayName to decontXcounts:

plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = "decontXcounts")

Percentages of marker genes detected in other cell types were reduced or completely removed. For example, the percentage of cells that expressed Monocyte marker genes was greatly reduced in T-cells, B-cells, and NK-cells. The original counts and decontamined counts can be plotted side-by-side by listing multiple assays in the assayName parameter. This option is only available if the data is stored in SingleCellExperiment object.

plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = c("counts", "decontXcounts"))

Some helpful hints when using plotDecontXMarkerPercentage:

  1. Cell clusters can be renamed and re-grouped using the groupCluster parameter, which also needs to be a named list. If groupCluster is used, cell clusters not included in the list will be excluded in the barplot. For example, if we wanted to group T-cells and NK-cells together, we could set cellTypeMappings <- list(NK_Tcells = c(2,6), Bcells = 5, Monocytes = 1)
  2. The level a gene that needs to be expressed to be considered detected in a cell can be adjusted using the threshold parameter.
  3. If you are not using a SingleCellExperiment, then you will need to supply the original counts matrix or the decontaminated counts matrix as the first argument to generate the barplots.

Violin plot to compare the distributions of original and decontaminated counts

Another useful way to assess the amount of decontamination is to view the expression of marker genes before and after decontX across cell types. Here we view the monocyte markers in each cell type. The violin plot shows that the markers have been removed from T-cells, B-cells, and NK-cells, but are largely unaffected in monocytes.

plotDecontXMarkerExpression(sce,
    markers = markers[["Monocyte_Markers"]],
    groupClusters = cellTypeMappings,
    ncol = 3)

Some helpful hints when using plotDecontXMarkerExpression:

  1. groupClusters works the same way as in plotDecontXMarkerPercentage.
  2. This function will plot each pair of markers and clusters (or cell type specified by groupClusters). Therefore, you may want to keep the number of markers small in each plot and call the function multiple times for different sets of marker genes.
  3. You can also plot the individual points by setting plotDots = TRUE and/or log transform the points on the fly by setting log1p = TRUE.
  4. This function can plot any assay in a SingleCellExperiment. Therefore you could also examine normalized expression of the original and decontaminated counts. For example:
library(scater)
sce <- logNormCounts(sce,
    exprs_values = "decontXcounts",
    name = "decontXlogcounts")

plotDecontXMarkerExpression(sce,
    markers = markers[["Monocyte_Markers"]],
    groupClusters = cellTypeMappings,
    ncol = 3,
    assayName = c("logcounts", "decontXlogcounts"))

Other important notes

Choosing appropriate cell clusters

The ability of DecontX to accurately identify contamination is dependent on the cell cluster labels. DecontX assumes that contamination for a cell cluster comes from combination of counts from all other clusters. The default clustering approach used by DecontX tends to select fewer clusters that represent broader cell types. For example, all T-cells tend to be clustered together rather than splitting naive and cytotoxic T-cells into separate clusters. Custom cell type labels can be suppled via the z parameter if some cells are not being clustered appropriately by the default method.

Adjusting the priors to influence contamination estimates

There are ways to force decontX to estimate more or less contamination across a dataset by manipulating the priors. The delta parameter is a numeric vector of length two. It is the concentration parameter for the Dirichlet distribution which serves as the prior for the proportions of native and contamination counts in each cell. The first element is the prior for the proportion of native counts while the second element is the prior for the proportion of contamination counts. These essentially act as pseudocounts for the native and contamination in each cell. If estimateDelta = TRUE, delta is only used to produce a random sample of proportions for an initial value of contamination in each cell. Then delta is updated in each iteration. If estimateDelta = FALSE, then delta is fixed with these values for the entire inference procedure. Fixing delta and setting a high number in the second element will force decontX to be more aggressive and estimate higher levels of contamination in each cell at the expense of potentially removing native expression. For example, in the previous PBMC example, we can see what the estimated delta was by looking in the estimates:

metadata(sce)$decontX$estimates$all_cells$delta
## [1] 9.280108 1.038000

Setting a higher value in the second element of delta and estimateDelta = FALSE will force decontX to estimate higher levels of contamination per cell:

sce.delta <- decontX(sce, delta = c(9, 20), estimateDelta = FALSE)

plot(sce$decontX_contamination, sce.delta$decontX_contamination,
     xlab = "DecontX estimated priors",
     ylab = "Setting priors to estimate higher contamination")
abline(0, 1, col = "red", lwd = 2)

Working with Seurat

If you are using the Seurat package for downstream analysis, the following code can be used to read in a matrix and convert between Seurat and SCE objects:

# Read counts from CellRanger output
library(Seurat)
counts <- Read10X("sample/outs/filtered_feature_bc_matrix/")

# Create a SingleCellExperiment object and run decontX
sce <- SingleCellExperiment(list(counts = counts))
sce <- decontX(sce)

# Create a Seurat object from a SCE with decontX results
seuratObject <- CreateSeuratObject(round(decontXcounts(sce)))

Optionally, the “raw” matrix can be also be imported and used as the background:

counts.raw <- Read10X("sample/outs/raw_feature_bc_matrix/")
sce.raw <- SingleCellExperiment(list(counts = counts.raw))
sce <- decontX(sce, background = sce.raw)

Note that the decontaminated matrix of decontX consists of floating point numbers and must be rounded to integers before adding it to a Seurat object. If you already have a Seurat object containing the counts matrix and would like to run decontX, you can retrieve the count matrix, create a SCE object, and run decontX, and then add it back to the Seurat object:

counts <- GetAssayData(object = seuratObject, slot = "counts")
sce <- SingleCellExperiment(list(counts = counts))
sce <- decontX(sce)
seuratObj[["decontXcounts"]] <- CreateAssayObject(counts = decontXcounts(sce))

Session Information

## R version 4.3.3 (2024-02-29)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.30.1               ggplot2_3.5.0              
##  [3] scuttle_1.12.0              TENxPBMCData_1.20.0        
##  [5] HDF5Array_1.30.0            rhdf5_2.46.1               
##  [7] DelayedArray_0.28.0         SparseArray_1.2.4          
##  [9] S4Arrays_1.2.1              abind_1.4-5                
## [11] celda_1.18.2                Matrix_1.6-5               
## [13] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
## [15] Biobase_2.62.0              GenomicRanges_1.54.1       
## [17] GenomeInfoDb_1.38.8         IRanges_2.36.0             
## [19] S4Vectors_0.40.2            BiocGenerics_0.48.1        
## [21] MatrixGenerics_1.14.0       matrixStats_1.2.0          
## [23] BiocStyle_2.30.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3            rstudioapi_0.15.0            
##   [3] jsonlite_1.8.8                magrittr_2.0.3               
##   [5] ggbeeswarm_0.7.2              farver_2.1.1                 
##   [7] rmarkdown_2.25                fs_1.6.3                     
##   [9] zlibbioc_1.48.2               ragg_1.3.0                   
##  [11] vctrs_0.6.5                   memoise_2.0.1                
##  [13] DelayedMatrixStats_1.24.0     RCurl_1.98-1.14              
##  [15] htmltools_0.5.7               AnnotationHub_3.10.0         
##  [17] curl_5.2.1                    BiocNeighbors_1.20.2         
##  [19] Rhdf5lib_1.24.1               sass_0.4.8                   
##  [21] bslib_0.6.1                   desc_1.4.3                   
##  [23] plyr_1.8.9                    cachem_1.0.8                 
##  [25] mime_0.12                     lifecycle_1.0.4              
##  [27] iterators_1.0.14              pkgconfig_2.0.3              
##  [29] rsvd_1.0.5                    R6_2.5.1                     
##  [31] fastmap_1.1.1                 GenomeInfoDbData_1.2.11      
##  [33] shiny_1.8.0                   digest_0.6.35                
##  [35] colorspace_2.1-0              AnnotationDbi_1.64.1         
##  [37] irlba_2.3.5.1                 ExperimentHub_2.10.0         
##  [39] textshaping_0.3.7             RSQLite_2.3.5                
##  [41] beachmat_2.18.1               labeling_0.4.3               
##  [43] filelock_1.0.3                WriteXLS_6.5.0               
##  [45] fansi_1.0.6                   httr_1.4.7                   
##  [47] compiler_4.3.3                bit64_4.0.5                  
##  [49] withr_3.0.0                   doParallel_1.0.17            
##  [51] BiocParallel_1.36.0           viridis_0.6.5                
##  [53] DBI_1.2.1                     highr_0.10                   
##  [55] rappdirs_0.3.3                rjson_0.2.21                 
##  [57] tools_4.3.3                   vipor_0.4.7                  
##  [59] beeswarm_0.4.0                interactiveDisplayBase_1.40.0
##  [61] httpuv_1.6.14                 MCMCprecision_0.4.0          
##  [63] glue_1.7.0                    dbscan_1.1-12                
##  [65] rhdf5filters_1.14.1           promises_1.2.1               
##  [67] grid_4.3.3                    Rtsne_0.17                   
##  [69] reshape2_1.4.4                generics_0.1.3               
##  [71] gtable_0.3.4                  data.table_1.15.4            
##  [73] ScaledMatrix_1.10.0           BiocSingular_1.18.0          
##  [75] utf8_1.2.4                    XVector_0.42.0               
##  [77] RcppAnnoy_0.0.22              ggrepel_0.9.5                
##  [79] BiocVersion_3.18.1            foreach_1.5.2                
##  [81] pillar_1.9.0                  stringr_1.5.1                
##  [83] later_1.3.2                   dplyr_1.1.4                  
##  [85] BiocFileCache_2.10.1          lattice_0.22-5               
##  [87] bit_4.0.5                     tidyselect_1.2.0             
##  [89] Biostrings_2.70.1             knitr_1.45                   
##  [91] gridExtra_2.3                 bookdown_0.37                
##  [93] xfun_0.41                     stringi_1.8.3                
##  [95] yaml_2.3.8                    evaluate_0.23                
##  [97] codetools_0.2-19              RcppEigen_0.3.4.0.0          
##  [99] tibble_3.2.1                  BiocManager_1.30.22          
## [101] cli_3.6.2                     uwot_0.1.16                  
## [103] xtable_1.8-4                  systemfonts_1.0.6            
## [105] munsell_0.5.1                 jquerylib_0.1.4              
## [107] enrichR_3.2                   Rcpp_1.0.12                  
## [109] dbplyr_2.4.0                  png_0.1-8                    
## [111] parallel_4.3.3                ellipsis_0.3.2               
## [113] pkgdown_2.0.7                 blob_1.2.4                   
## [115] sparseMatrixStats_1.14.0      bitops_1.0-7                 
## [117] viridisLite_0.4.2             scales_1.3.0                 
## [119] purrr_1.0.2                   crayon_1.5.2                 
## [121] combinat_0.0-8                rlang_1.1.3                  
## [123] KEGGREST_1.42.0

Site built with pkgdown 2.0.7.

================================================ FILE: docs/articles/decontX_files/accessible-code-block-0.0.1/empty-anchor.js ================================================ // Hide empty tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); ================================================ FILE: docs/articles/decontX_files/header-attrs-2.7/header-attrs.js ================================================ // Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); ================================================ FILE: docs/articles/decontX_pbmc4k.html ================================================ Decontamination of ambient RNA in single-cell genomic data with DecontX • celda

Introduction

Droplet-based microfluidic devices have become widely used to perform single-cell RNA sequencing (scRNA-seq). However, ambient RNA present in the cell suspension can be aberrantly counted along with a cell’s native mRNA and result in cross-contamination of transcripts between different cell populations. DecontX is a Bayesian method to estimate and remove contamination in individual cells. DecontX assumes the observed expression of a cell is a mixture of counts from two multinomial distributions: (1) a distribution of native transcript counts from the cell’s actual population and (2) a distribution of contaminating transcript counts from all other cell populations captured in the assay. Overall, computational decontamination of single cell counts can aid in downstream clustering and visualization.

The package can be loaded using the library command.

library(celda)

Importing data

DecontX can take either a SingleCellExperiment object or a counts matrix as input. decontX will attempt to convert any input matrix to class dgCMatrix from package Matrix before starting the analysis.

To import datasets directly into an SCE object, the singleCellTK package has several importing functions for different preprocessing tools including CellRanger, STARsolo, BUStools, Optimus, DropEST, SEQC, and Alevin/Salmon. For example, the following code can be used as a template to read in the filtered and raw matrices for multiple samples processed with CellRanger:

library(singleCellTK)
sce <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"))

Within each sample directory, there should be subfolders called "outs/filtered_feature_bc_matrix/" or "outs/raw_feature_bc_matrix/" with files called matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz. If these files are in different subdirectories, the importCellRangerV3Sample function can be used to import data from a different directory instead.

Optionally, the “raw” or “droplet” matrix can also be easily imported by setting the dataType argument to “raw”:

sce.raw <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"), dataType = "raw")

The raw matrix can be passed to the background parameter in decontX as described below. If using Seurat, go to the Working with Seurat section for details on how to convert between SCE and Seurat objects.

Load PBMC4k data from 10X

We will utilize the 10X PBMC 4K dataset as an example in this vignette. This data can be easily retrieved from the package TENxPBMCData. Make sure the the column names are set before running decontX.

# Load PBMC data
library(TENxPBMCData)
sce <- TENxPBMCData("pbmc4k")
colnames(sce) <- paste(sce$Sample, sce$Barcode, sep = "_")
rownames(sce) <- rowData(sce)$Symbol_TENx
counts(sce) <- as(counts(sce), "dgCMatrix")

Running decontX

A SingleCellExperiment (SCE) object or a sparse matrix containing the counts for filtered cells can be passed to decontX via the x parameter. The matrix to use in an SCE object can be specified with the assayName parameter, which is set to "counts" by default. There are two major ways to run decontX: with and without the raw/droplet matrix containing empty droplets. Here is an example of running decontX without supplying the background:

sce <- decontX(sce)

In this scenario, decontX will estimate the contamination distribution for each cell cluster based on the profiles of the other cell clusters in the filtered dataset. The estimated contamination results can be found in the colData(sce)$decontX_contamination and the decontaminated counts can be accessed with decontXcounts(sce). decontX will perform heuristic clustering to quickly define major cell clusters. However if you have your own cell cluster labels, they can be specified with the z parameter. These results will be used throughout the rest of the vignette.

The raw/droplet matrix can be used to empirically estimate the distribution of ambient RNA, which is especially useful when cells that contributed to the ambient RNA are not accurately represented in the filtered count matrix containing the cells. For example, cells that were removed via flow cytometry or that were more sensitive to lysis during dissociation may have contributed to the ambient RNA but were not measured in the filtered/cell matrix. The raw/droplet matrix can be input as an SCE object or a sparse matrix using the background parameter:

sce <- decontX(sce, background = sce.raw)

Only empty droplets in the background matrix should be used to estimate the ambient RNA. If any cell ids (i.e. colnames) in the raw/droplet matrix supplied to the background parameter are also found in the filtered counts matrix (x), decontX will automatically remove them from the raw matrix. However, if the cell ids are not available for the input matrices, decontX will treat the entire background input as empty droplets. All of the outputs are the same as when running decontX without setting the background parameter.

Note: If the input object is just a matrix and not an SCE object, make sure to save the output into a variable with a different name (e.g. result <- decontX(mat)). The result object will be a list with contamination in result$contamination and the decontaminated counts in result$decontXcounts.

Plotting DecontX results

Cluster labels on UMAP

DecontX creates a UMAP which we can use to plot the cluster labels automatically identified in the analysis. Note that the clustering approach used here is designed to find “broad” cell types rather than individual cell subpopulations within a cell type.

umap <- reducedDim(sce, "decontX_UMAP")
plotDimReduceCluster(x = sce$decontX_clusters,
    dim1 = umap[, 1], dim2 = umap[, 2])

Contamination on UMAP

The percentage of contamination in each cell can be plotting on the UMAP to visualize what what clusters may have higher levels of ambient RNA.

Expression of markers on UMAP

Known marker genes can also be plotted on the UMAP to identify the cell types for each cluster. We will use CD3D and CD3E for T-cells, LYZ, S100A8, and S100A9 for monocytes, CD79A, CD79B, and MS4A1 for B-cells, GNLY for NK-cells, and PPBP for megakaryocytes.

library(scater)
sce <- logNormCounts(sce)
plotDimReduceFeature(as.matrix(logcounts(sce)),
    dim1 = umap[, 1],
    dim2 = umap[, 2],
    features = c("CD3D", "CD3E", "GNLY",
        "LYZ", "S100A8", "S100A9",
        "CD79A", "CD79B", "MS4A1"),
    exactMatch = TRUE)
## Warning in asMethod(object): sparse->dense coercion: allocating vector of size
## 1.1 GiB

Barplot of markers detected in cell clusters

The percetage of cells within a cluster that have detectable expression of marker genes can be displayed in a barplot. Markers for cell types need to be supplied in a named list. First, the detection of marker genes in the original counts assay is shown:

markers <- list(Tcell_Markers = c("CD3E", "CD3D"),
    Bcell_Markers = c("CD79A", "CD79B", "MS4A1"),
    Monocyte_Markers = c("S100A8", "S100A9", "LYZ"),
    NKcell_Markers = "GNLY")
cellTypeMappings <- list(Tcells = 2, Bcells = 5, Monocytes = 1, NKcells = 6)
plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = "counts")

We can then look to see how much decontX removed aberrant expression of marker genes in each cell type by changing the assayName to decontXcounts:

plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = "decontXcounts")

Percentages of marker genes detected in other cell types were reduced or completely removed. For example, the percentage of cells that expressed Monocyte marker genes was greatly reduced in T-cells, B-cells, and NK-cells. The original counts and decontamined counts can be plotted side-by-side by listing multiple assays in the assayName parameter. This option is only available if the data is stored in SingleCellExperiment object.

plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = c("counts", "decontXcounts"))

Some helpful hints when using plotDecontXMarkerPercentage:

  1. Cell clusters can be renamed and re-grouped using the groupCluster parameter, which also needs to be a named list. If groupCluster is used, cell clusters not included in the list will be excluded in the barplot. For example, if we wanted to group T-cells and NK-cells together, we could set cellTypeMappings <- list(NK_Tcells = c(2,6), Bcells = 5, Monocytes = 1)
  2. The level a gene that needs to be expressed to be considered detected in a cell can be adjusted using the threshold parameter.
  3. If you are not using a SingleCellExperiment, then you will need to supply the original counts matrix or the decontaminated counts matrix as the first argument to generate the barplots.

Violin plot to compare the distributions of original and decontaminated counts

Another useful way to assess the amount of decontamination is to view the expression of marker genes before and after decontX across cell types. Here we view the monocyte markers in each cell type. The violin plot shows that the markers have been removed from T-cells, B-cells, and NK-cells, but are largely unaffected in monocytes.

plotDecontXMarkerExpression(sce,
    markers = markers[["Monocyte_Markers"]],
    groupClusters = cellTypeMappings,
    ncol = 3)

Some helpful hints when using plotDecontXMarkerExpression:

  1. groupClusters works the same way as in plotDecontXMarkerPercentage.
  2. This function will plot each pair of markers and clusters (or cell type specified by groupClusters). Therefore, you may want to keep the number of markers small in each plot and call the function multiple times for different sets of marker genes.
  3. You can also plot the individual points by setting plotDots = TRUE and/or log transform the points on the fly by setting log1p = TRUE.
  4. This function can plot any assay in a SingleCellExperiment. Therefore you could also examine normalized expression of the original and decontaminated counts. For example:
library(scater)
sce <- logNormCounts(sce,
    exprs_values = "decontXcounts",
    name = "decontXlogcounts")

plotDecontXMarkerExpression(sce,
    markers = markers[["Monocyte_Markers"]],
    groupClusters = cellTypeMappings,
    ncol = 3,
    assayName = c("logcounts", "decontXlogcounts"))

Other important notes

Choosing appropriate cell clusters

The ability of DecontX to accurately identify contamination is dependent on the cell cluster labels. DecontX assumes that contamination for a cell cluster comes from combination of counts from all other clusters. The default clustering approach used by DecontX tends to select fewer clusters that represent broader cell types. For example, all T-cells tend to be clustered together rather than splitting naive and cytotoxic T-cells into separate clusters. Custom cell type labels can be suppled via the z parameter if some cells are not being clustered appropriately by the default method.

Adjusting the priors to influence contamination estimates

There are ways to force decontX to estimate more or less contamination across a dataset by manipulating the priors. The delta parameter is a numeric vector of length two. It is the concentration parameter for the Dirichlet distribution which serves as the prior for the proportions of native and contamination counts in each cell. The first element is the prior for the proportion of native counts while the second element is the prior for the proportion of contamination counts. These essentially act as pseudocounts for the native and contamination in each cell. If estimateDelta = TRUE, delta is only used to produce a random sample of proportions for an initial value of contamination in each cell. Then delta is updated in each iteration. If estimateDelta = FALSE, then delta is fixed with these values for the entire inference procedure. Fixing delta and setting a high number in the second element will force decontX to be more aggressive and estimate higher levels of contamination in each cell at the expense of potentially removing native expression. For example, in the previous PBMC example, we can see what the estimated delta was by looking in the estimates:

metadata(sce)$decontX$estimates$all_cells$delta
## [1] 9.280108 1.038000

Setting a higher value in the second element of delta and estimateDelta = FALSE will force decontX to estimate higher levels of contamination per cell:

sce.delta <- decontX(sce, delta = c(9, 20), estimateDelta = FALSE)

plot(sce$decontX_contamination, sce.delta$decontX_contamination,
     xlab = "DecontX estimated priors",
     ylab = "Setting priors to estimate higher contamination")
abline(0, 1, col = "red", lwd = 2)

Working with Seurat

If you are using the Seurat package for downstream analysis, the following code can be used to read in a matrix and convert between Seurat and SCE objects:

# Read counts from CellRanger output
library(Seurat)
counts <- Read10X("sample/outs/filtered_feature_bc_matrix/")

# Create a SingleCellExperiment object and run decontX
sce <- SingleCellExperiment(list(counts = counts))
sce <- decontX(sce)

# Create a Seurat object from a SCE with decontX results
seuratObject <- CreateSeuratObject(round(decontXcounts(sce)))

Optionally, the “raw” matrix can be also be imported and used as the background:

counts.raw <- Read10X("sample/outs/raw_feature_bc_matrix/")
sce.raw <- SingleCellExperiment(list(counts = counts.raw))
sce <- decontX(sce, background = sce.raw)

Note that the decontaminated matrix of decontX consists of floating point numbers and must be rounded to integers before adding it to a Seurat object. If you already have a Seurat object containing the counts matrix and would like to run decontX, you can retrieve the count matrix, create a SCE object, and run decontX, and then add it back to the Seurat object:

counts <- GetAssayData(object = seuratObject, slot = "counts")
sce <- SingleCellExperiment(list(counts = counts))
sce <- decontX(sce)
seuratObj[["decontXcounts"]] <- CreateAssayObject(counts = decontXcounts(sce))

Session Information

## R version 4.3.3 (2024-02-29)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.30.1               ggplot2_3.5.0              
##  [3] scuttle_1.12.0              TENxPBMCData_1.20.0        
##  [5] HDF5Array_1.30.0            rhdf5_2.46.1               
##  [7] DelayedArray_0.28.0         SparseArray_1.2.4          
##  [9] S4Arrays_1.2.1              abind_1.4-5                
## [11] celda_1.18.2                Matrix_1.6-5               
## [13] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
## [15] Biobase_2.62.0              GenomicRanges_1.54.1       
## [17] GenomeInfoDb_1.38.8         IRanges_2.36.0             
## [19] S4Vectors_0.40.2            BiocGenerics_0.48.1        
## [21] MatrixGenerics_1.14.0       matrixStats_1.2.0          
## [23] BiocStyle_2.30.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3            rstudioapi_0.15.0            
##   [3] jsonlite_1.8.8                magrittr_2.0.3               
##   [5] ggbeeswarm_0.7.2              farver_2.1.1                 
##   [7] rmarkdown_2.25                fs_1.6.3                     
##   [9] zlibbioc_1.48.2               ragg_1.3.0                   
##  [11] vctrs_0.6.5                   memoise_2.0.1                
##  [13] DelayedMatrixStats_1.24.0     RCurl_1.98-1.14              
##  [15] htmltools_0.5.7               AnnotationHub_3.10.0         
##  [17] curl_5.2.1                    BiocNeighbors_1.20.2         
##  [19] Rhdf5lib_1.24.1               sass_0.4.8                   
##  [21] bslib_0.6.1                   desc_1.4.3                   
##  [23] plyr_1.8.9                    cachem_1.0.8                 
##  [25] mime_0.12                     lifecycle_1.0.4              
##  [27] iterators_1.0.14              pkgconfig_2.0.3              
##  [29] rsvd_1.0.5                    R6_2.5.1                     
##  [31] fastmap_1.1.1                 GenomeInfoDbData_1.2.11      
##  [33] shiny_1.8.0                   digest_0.6.35                
##  [35] colorspace_2.1-0              AnnotationDbi_1.64.1         
##  [37] irlba_2.3.5.1                 ExperimentHub_2.10.0         
##  [39] textshaping_0.3.7             RSQLite_2.3.5                
##  [41] beachmat_2.18.1               labeling_0.4.3               
##  [43] filelock_1.0.3                WriteXLS_6.5.0               
##  [45] fansi_1.0.6                   httr_1.4.7                   
##  [47] compiler_4.3.3                bit64_4.0.5                  
##  [49] withr_3.0.0                   doParallel_1.0.17            
##  [51] BiocParallel_1.36.0           viridis_0.6.5                
##  [53] DBI_1.2.1                     highr_0.10                   
##  [55] rappdirs_0.3.3                rjson_0.2.21                 
##  [57] tools_4.3.3                   vipor_0.4.7                  
##  [59] beeswarm_0.4.0                interactiveDisplayBase_1.40.0
##  [61] httpuv_1.6.14                 MCMCprecision_0.4.0          
##  [63] glue_1.7.0                    dbscan_1.1-12                
##  [65] rhdf5filters_1.14.1           promises_1.2.1               
##  [67] grid_4.3.3                    Rtsne_0.17                   
##  [69] reshape2_1.4.4                generics_0.1.3               
##  [71] gtable_0.3.4                  data.table_1.15.4            
##  [73] ScaledMatrix_1.10.0           BiocSingular_1.18.0          
##  [75] utf8_1.2.4                    XVector_0.42.0               
##  [77] RcppAnnoy_0.0.22              ggrepel_0.9.5                
##  [79] BiocVersion_3.18.1            foreach_1.5.2                
##  [81] pillar_1.9.0                  stringr_1.5.1                
##  [83] later_1.3.2                   dplyr_1.1.4                  
##  [85] BiocFileCache_2.10.1          lattice_0.22-5               
##  [87] bit_4.0.5                     tidyselect_1.2.0             
##  [89] Biostrings_2.70.1             knitr_1.45                   
##  [91] gridExtra_2.3                 bookdown_0.37                
##  [93] xfun_0.41                     stringi_1.8.3                
##  [95] yaml_2.3.8                    evaluate_0.23                
##  [97] codetools_0.2-19              RcppEigen_0.3.4.0.0          
##  [99] tibble_3.2.1                  BiocManager_1.30.22          
## [101] cli_3.6.2                     uwot_0.1.16                  
## [103] xtable_1.8-4                  systemfonts_1.0.6            
## [105] munsell_0.5.1                 jquerylib_0.1.4              
## [107] enrichR_3.2                   Rcpp_1.0.12                  
## [109] dbplyr_2.4.0                  png_0.1-8                    
## [111] parallel_4.3.3                ellipsis_0.3.2               
## [113] pkgdown_2.0.7                 blob_1.2.4                   
## [115] sparseMatrixStats_1.14.0      bitops_1.0-7                 
## [117] viridisLite_0.4.2             scales_1.3.0                 
## [119] purrr_1.0.2                   crayon_1.5.2                 
## [121] combinat_0.0-8                rlang_1.1.3                  
## [123] KEGGREST_1.42.0

Site built with pkgdown 2.0.7.

================================================ FILE: docs/articles/decontX_pbmc4k_files/accessible-code-block-0.0.1/empty-anchor.js ================================================ // Hide empty tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); ================================================ FILE: docs/articles/index.html ================================================ Articles • celda

Site built with pkgdown 2.0.7.

================================================ FILE: docs/articles/installation.html ================================================ • celda

Introduction

“celda” stands for “CEllular Latent Dirichlet Allocation”. It is a suite of Bayesian hierarchical models and supporting functions to perform gene and cell clustering for count data generated by single cell RNA-seq platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications. This package also includes a method called decontX which can be used to estimate and remove contamination in single cell genomic data.

Installation

To install the latest stable release of celda from Bioconductor (requires R version >= 3.6):

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("celda")

The latest stable version of celda can be installed from GitHub using devtools:

library(devtools)
install_github("campbio/celda")

The development version of celda can also be installed from GitHub using devtools:

library(devtools)
install_github("campbio/celda@devel")

NOTE For MAC OSX users, devtools::install_github() requires installation of libgit2. This can be installed via homebrew:

brew install libgit2

Troubleshooting

  • If you receive installation errors when Rcpp is being installed and compiled, try following the steps outlined here to solve the issue
  • If you are running R 4.0.0 or later version on MacOS Catalina and you see error 'wchar.h' file not found, you can try the method in this link:
  • If you are trying to install celda using Rstudio and get this error: could not find tools necessary to compile a package, you can try typing this before running the install command:
options(buildtools.check = function(action) TRUE)

Also, if you receive installation errors when Rcpp is being installed and compiled, try following the steps outlined here to solve the issue:

https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/

If you are running R 4.0.0 or later version on MacOS Catalina and you see error 'wchar.h' file not found, you can try the method in this link:

https://discourse.mc-stan.org/t/dealing-with-catalina-iii/12731/5

If you are trying to install on MacOS in an Apple Silicon computater and you see the following error:

ld: warning: directory not found for option '-L/opt/gfortran/lib/gcc/x86_64-apple-darwin20.0/12.2.0'
ld: warning: directory not found for option '-L/opt/gfortran/lib'
ld: library not found for -lgfortran
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [celda.so] Error 1
ERROR: compilation failed for package ‘celda’

You can solve this by downloading and installing the gfortran pkg located here and then running the following command:

You can solve this by downloading and installing the gfortran pkg located here and then running the following command:

sudo /opt/gfortran/bin/gfortran-update-sdk

Vignettes and examples

To build the vignettes for Celda and DecontX during installation from GitHub, use the following command:

library(devtools)
install_github("campbio/celda", build_vignettes = TRUE)

Note that installation may take an extra 5-10 minutes for building of the vignettes. The Celda and DecontX vignettes can then be accessed via the following commands:

vignette("celda")
vignette("decontX")

Site built with pkgdown 2.0.7.

================================================ FILE: docs/articles/installation_files/accessible-code-block-0.0.1/empty-anchor.js ================================================ // Hide empty tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); ================================================ FILE: docs/authors.html ================================================ Authors and Citation • celda
  • Joshua Campbell. Author, maintainer.

  • Shiyi Yang. Author.

  • Zhe Wang. Author.

  • Sean Corbett. Author.

  • Yusuke Koga. Author.

Citation

Source: DESCRIPTION

Campbell J, Yang S, Wang Z, Corbett S, Koga Y (2024). celda: CEllular Latent Dirichlet Allocation. R package version 1.18.2.

@Manual{,
  title = {celda: CEllular Latent Dirichlet Allocation},
  author = {Joshua Campbell and Shiyi Yang and Zhe Wang and Sean Corbett and Yusuke Koga},
  year = {2024},
  note = {R package version 1.18.2},
}

Site built with pkgdown 2.0.7.

================================================ FILE: docs/bootstrap-toc.css ================================================ /*! * Bootstrap Table of Contents v0.4.1 (http://afeld.github.io/bootstrap-toc/) * Copyright 2015 Aidan Feldman * Licensed under MIT (https://github.com/afeld/bootstrap-toc/blob/gh-pages/LICENSE.md) */ /* modified from https://github.com/twbs/bootstrap/blob/94b4076dd2efba9af71f0b18d4ee4b163aa9e0dd/docs/assets/css/src/docs.css#L548-L601 */ /* All levels of nav */ nav[data-toggle='toc'] .nav > li > a { display: block; padding: 4px 20px; font-size: 13px; font-weight: 500; color: #767676; } nav[data-toggle='toc'] .nav > li > a:hover, nav[data-toggle='toc'] .nav > li > a:focus { padding-left: 19px; color: #563d7c; text-decoration: none; background-color: transparent; border-left: 1px solid #563d7c; } nav[data-toggle='toc'] .nav > .active > a, nav[data-toggle='toc'] .nav > .active:hover > a, nav[data-toggle='toc'] .nav > .active:focus > a { padding-left: 18px; font-weight: bold; color: #563d7c; background-color: transparent; border-left: 2px solid #563d7c; } /* Nav: second level (shown on .active) */ nav[data-toggle='toc'] .nav .nav { display: none; /* Hide by default, but at >768px, show it */ padding-bottom: 10px; } nav[data-toggle='toc'] .nav .nav > li > a { padding-top: 1px; padding-bottom: 1px; padding-left: 30px; font-size: 12px; font-weight: normal; } nav[data-toggle='toc'] .nav .nav > li > a:hover, nav[data-toggle='toc'] .nav .nav > li > a:focus { padding-left: 29px; } nav[data-toggle='toc'] .nav .nav > .active > a, nav[data-toggle='toc'] .nav .nav > .active:hover > a, nav[data-toggle='toc'] .nav .nav > .active:focus > a { padding-left: 28px; font-weight: 500; } /* from https://github.com/twbs/bootstrap/blob/e38f066d8c203c3e032da0ff23cd2d6098ee2dd6/docs/assets/css/src/docs.css#L631-L634 */ nav[data-toggle='toc'] .nav > .active > ul { display: block; } ================================================ FILE: docs/bootstrap-toc.js ================================================ /*! * Bootstrap Table of Contents v0.4.1 (http://afeld.github.io/bootstrap-toc/) * Copyright 2015 Aidan Feldman * Licensed under MIT (https://github.com/afeld/bootstrap-toc/blob/gh-pages/LICENSE.md) */ (function() { 'use strict'; window.Toc = { helpers: { // return all matching elements in the set, or their descendants findOrFilter: function($el, selector) { // http://danielnouri.org/notes/2011/03/14/a-jquery-find-that-also-finds-the-root-element/ // http://stackoverflow.com/a/12731439/358804 var $descendants = $el.find(selector); return $el.filter(selector).add($descendants).filter(':not([data-toc-skip])'); }, generateUniqueIdBase: function(el) { var text = $(el).text(); var anchor = text.trim().toLowerCase().replace(/[^A-Za-z0-9]+/g, '-'); return anchor || el.tagName.toLowerCase(); }, generateUniqueId: function(el) { var anchorBase = this.generateUniqueIdBase(el); for (var i = 0; ; i++) { var anchor = anchorBase; if (i > 0) { // add suffix anchor += '-' + i; } // check if ID already exists if (!document.getElementById(anchor)) { return anchor; } } }, generateAnchor: function(el) { if (el.id) { return el.id; } else { var anchor = this.generateUniqueId(el); el.id = anchor; return anchor; } }, createNavList: function() { return $(''); }, createChildNavList: function($parent) { var $childList = this.createNavList(); $parent.append($childList); return $childList; }, generateNavEl: function(anchor, text) { var $a = $(''); $a.attr('href', '#' + anchor); $a.text(text); var $li = $('
  • '); $li.append($a); return $li; }, generateNavItem: function(headingEl) { var anchor = this.generateAnchor(headingEl); var $heading = $(headingEl); var text = $heading.data('toc-text') || $heading.text(); return this.generateNavEl(anchor, text); }, // Find the first heading level (`

    `, then `

    `, etc.) that has more than one element. Defaults to 1 (for `

    `). getTopLevel: function($scope) { for (var i = 1; i <= 6; i++) { var $headings = this.findOrFilter($scope, 'h' + i); if ($headings.length > 1) { return i; } } return 1; }, // returns the elements for the top level, and the next below it getHeadings: function($scope, topLevel) { var topSelector = 'h' + topLevel; var secondaryLevel = topLevel + 1; var secondarySelector = 'h' + secondaryLevel; return this.findOrFilter($scope, topSelector + ',' + secondarySelector); }, getNavLevel: function(el) { return parseInt(el.tagName.charAt(1), 10); }, populateNav: function($topContext, topLevel, $headings) { var $context = $topContext; var $prevNav; var helpers = this; $headings.each(function(i, el) { var $newNav = helpers.generateNavItem(el); var navLevel = helpers.getNavLevel(el); // determine the proper $context if (navLevel === topLevel) { // use top level $context = $topContext; } else if ($prevNav && $context === $topContext) { // create a new level of the tree and switch to it $context = helpers.createChildNavList($prevNav); } // else use the current $context $context.append($newNav); $prevNav = $newNav; }); }, parseOps: function(arg) { var opts; if (arg.jquery) { opts = { $nav: arg }; } else { opts = arg; } opts.$scope = opts.$scope || $(document.body); return opts; } }, // accepts a jQuery object, or an options object init: function(opts) { opts = this.helpers.parseOps(opts); // ensure that the data attribute is in place for styling opts.$nav.attr('data-toggle', 'toc'); var $topContext = this.helpers.createChildNavList(opts.$nav); var topLevel = this.helpers.getTopLevel(opts.$scope); var $headings = this.helpers.getHeadings(opts.$scope, topLevel); this.helpers.populateNav($topContext, topLevel, $headings); } }; $(function() { $('nav[data-toggle="toc"]').each(function(i, el) { var $nav = $(el); Toc.init($nav); }); }); })(); ================================================ FILE: docs/docsearch.css ================================================ /* Docsearch -------------------------------------------------------------- */ /* Source: https://github.com/algolia/docsearch/ License: MIT */ .algolia-autocomplete { display: block; -webkit-box-flex: 1; -ms-flex: 1; flex: 1 } .algolia-autocomplete .ds-dropdown-menu { width: 100%; min-width: none; max-width: none; padding: .75rem 0; background-color: #fff; background-clip: padding-box; border: 1px solid rgba(0, 0, 0, .1); box-shadow: 0 .5rem 1rem rgba(0, 0, 0, .175); } @media (min-width:768px) { .algolia-autocomplete .ds-dropdown-menu { width: 175% } } .algolia-autocomplete .ds-dropdown-menu::before { display: none } .algolia-autocomplete .ds-dropdown-menu [class^=ds-dataset-] { padding: 0; background-color: rgb(255,255,255); border: 0; max-height: 80vh; } .algolia-autocomplete .ds-dropdown-menu .ds-suggestions { margin-top: 0 } .algolia-autocomplete .algolia-docsearch-suggestion { padding: 0; overflow: visible } .algolia-autocomplete .algolia-docsearch-suggestion--category-header { padding: .125rem 1rem; margin-top: 0; font-size: 1.3em; font-weight: 500; color: #00008B; border-bottom: 0 } .algolia-autocomplete .algolia-docsearch-suggestion--wrapper { float: none; padding-top: 0 } .algolia-autocomplete .algolia-docsearch-suggestion--subcategory-column { float: none; width: auto; padding: 0; text-align: left } .algolia-autocomplete .algolia-docsearch-suggestion--content { float: none; width: auto; padding: 0 } .algolia-autocomplete .algolia-docsearch-suggestion--content::before { display: none } .algolia-autocomplete .ds-suggestion:not(:first-child) .algolia-docsearch-suggestion--category-header { padding-top: .75rem; margin-top: .75rem; border-top: 1px solid rgba(0, 0, 0, .1) } .algolia-autocomplete .ds-suggestion .algolia-docsearch-suggestion--subcategory-column { display: block; padding: .1rem 1rem; margin-bottom: 0.1; font-size: 1.0em; font-weight: 400 /* display: none */ } .algolia-autocomplete .algolia-docsearch-suggestion--title { display: block; padding: .25rem 1rem; margin-bottom: 0; font-size: 0.9em; font-weight: 400 } .algolia-autocomplete .algolia-docsearch-suggestion--text { padding: 0 1rem .5rem; margin-top: -.25rem; font-size: 0.8em; font-weight: 400; line-height: 1.25 } .algolia-autocomplete .algolia-docsearch-footer { width: 110px; height: 20px; z-index: 3; margin-top: 10.66667px; float: right; font-size: 0; line-height: 0; } .algolia-autocomplete .algolia-docsearch-footer--logo { background-image: url("data:image/svg+xml;utf8,"); background-repeat: no-repeat; background-position: 50%; background-size: 100%; overflow: hidden; text-indent: -9000px; width: 100%; height: 100%; display: block; transform: translate(-8px); } .algolia-autocomplete .algolia-docsearch-suggestion--highlight { color: #FF8C00; background: rgba(232, 189, 54, 0.1) } .algolia-autocomplete .algolia-docsearch-suggestion--text .algolia-docsearch-suggestion--highlight { box-shadow: inset 0 -2px 0 0 rgba(105, 105, 105, .5) } .algolia-autocomplete .ds-suggestion.ds-cursor .algolia-docsearch-suggestion--content { background-color: rgba(192, 192, 192, .15) } ================================================ FILE: docs/docsearch.js ================================================ $(function() { // register a handler to move the focus to the search bar // upon pressing shift + "/" (i.e. "?") $(document).on('keydown', function(e) { if (e.shiftKey && e.keyCode == 191) { e.preventDefault(); $("#search-input").focus(); } }); $(document).ready(function() { // do keyword highlighting /* modified from https://jsfiddle.net/julmot/bL6bb5oo/ */ var mark = function() { var referrer = document.URL ; var paramKey = "q" ; if (referrer.indexOf("?") !== -1) { var qs = referrer.substr(referrer.indexOf('?') + 1); var qs_noanchor = qs.split('#')[0]; var qsa = qs_noanchor.split('&'); var keyword = ""; for (var i = 0; i < qsa.length; i++) { var currentParam = qsa[i].split('='); if (currentParam.length !== 2) { continue; } if (currentParam[0] == paramKey) { keyword = decodeURIComponent(currentParam[1].replace(/\+/g, "%20")); } } if (keyword !== "") { $(".contents").unmark({ done: function() { $(".contents").mark(keyword); } }); } } }; mark(); }); }); /* Search term highlighting ------------------------------*/ function matchedWords(hit) { var words = []; var hierarchy = hit._highlightResult.hierarchy; // loop to fetch from lvl0, lvl1, etc. for (var idx in hierarchy) { words = words.concat(hierarchy[idx].matchedWords); } var content = hit._highlightResult.content; if (content) { words = words.concat(content.matchedWords); } // return unique words var words_uniq = [...new Set(words)]; return words_uniq; } function updateHitURL(hit) { var words = matchedWords(hit); var url = ""; if (hit.anchor) { url = hit.url_without_anchor + '?q=' + escape(words.join(" ")) + '#' + hit.anchor; } else { url = hit.url + '?q=' + escape(words.join(" ")); } return url; } ================================================ FILE: docs/index.html ================================================ CEllular Latent Dirichlet Allocation • celda

    “celda” stands for “CEllular Latent Dirichlet Allocation”. It is a suite of Bayesian hierarchical models and supporting functions to perform gene and cell clustering for count data generated by single cell RNA-seq platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications. This package also includes a method called DecontX which can be used to estimate and remove contamination in single cell genomic data.

    Installation Instructions

    To install the latest stable release of celda from Bioconductor (requires R version >= 3.6):

    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("celda")

    The latest stable version of celda can be installed from GitHub using devtools:

    library(devtools)
    install_github("campbio/celda")

    The development version of celda can also be installed from GitHub using devtools:

    library(devtools)
    install_github("campbio/celda@devel")

    NOTE For MAC OSX users, devtools::install_github() requires installation of libgit2. This can be installed via homebrew:

    brew install libgit2

    Also, if you receive installation errors when Rcpp is being installed and compiled, try following the steps outlined here to solve the issue:

    https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/

    If you are running R 4.0.0 or later version on MacOS Catalina and you see error 'wchar.h' file not found, you can try the method in this link:

    https://discourse.mc-stan.org/t/dealing-with-catalina-iii/12731/5

    If you are trying to install on MacOS in an Apple Silicon computater and you see the following error:

    ld: warning: directory not found for option '-L/opt/gfortran/lib/gcc/x86_64-apple-darwin20.0/12.2.0'
    ld: warning: directory not found for option '-L/opt/gfortran/lib'
    ld: library not found for -lgfortran
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    make: *** [celda.so] Error 1
    ERROR: compilation failed for package ‘celda’

    You can solve this by downloading and installing the gfortran pkg located here and then running the following command:

    sudo /opt/gfortran/bin/gfortran-update-sdk

    NOTE If you are trying to install celda using Rstudio and get this error: could not find tools necessary to compile a package, you can try this:

    options(buildtools.check = function(action) TRUE)

    Vignettes and examples

    To build the vignettes for Celda and DecontX during installation from GitHub, use the following command:

    library(devtools)
    install_github("campbio/celda", build_vignettes = TRUE)

    Note that installation may take an extra 5-10 minutes for building of the vignettes. The Celda and DecontX vignettes can then be accessed via the following commands:

    vignette("celda")
    vignette("decontX")

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/news/index.html ================================================ Changelog • celda
    • Updated Makevar files to new CRAN standards
    • Fixed unit test causing error
    • Update to match Bioconductor release version
    • Removed multipanelfigure as a dependency
    • Update to match Bioconductor release version
    • Bug fixes related to cluster labels stored as factors and plotting
    • Updated sparse matrix conversion to work with Matrix v1.4-2
    • Update to match Bioconductor 3.15 release version
    • Fixes to reports
    • Use smoothe splines for perplexity and RPC plots
    • Improvments to decontX vignette
    • Added ability to subsample to speed up perplexity calculations
    • Added ability to use batch parameter with the raw matrix in decontX
    • Update to match Bioconductor release version
    • Fixed bug in checking background matrix with decontX
    • Switched to using Github Actions for Continuous Integration
    • Fixed plotting bugs in celda results reports
    • Speed up final step in decontX when creating final decontaminated matrix
    • Added a NEWS.md file to track changes to the package.
    • Added new tutorials and documentation generated with pkgdown.
    • Removed warnings in plotRPC functions.
    • Added use of “displayName” to several functions that show feature names.
    • Minor bug fix when the input matrix was sparse and contained non-integer values.
    • Several improvements to plotting functions.
    • Added handling for sparse matrices
    • Added functions for creating HTML reports
    • Fixed bug in decontX plotting
    • Enable input of raw/droplet matrix into decontX to estimate ambient RNA
    • Add multiclass decision tree
    • Add Alternate headings support for plotDimReduceFeature
    • Add multiclass decision tree (MCDT) cell cluster annotation
    • Fix a bug in celdaHeatmap
    • Default seed setting to maintain reproducibility
    • Minor changes to the vignettes
    • Remove pheatmap import
    • Package celda, for bi-clustering of single-cell ’omics data.
    • Second submission to Bioconductor
    • First submission to Bioconductor

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/pkgdown.css ================================================ /* Sticky footer */ /** * Basic idea: https://philipwalton.github.io/solved-by-flexbox/demos/sticky-footer/ * Details: https://github.com/philipwalton/solved-by-flexbox/blob/master/assets/css/components/site.css * * .Site -> body > .container * .Site-content -> body > .container .row * .footer -> footer * * Key idea seems to be to ensure that .container and __all its parents__ * have height set to 100% * */ html, body { height: 100%; } body { position: relative; } body > .container { display: flex; height: 100%; flex-direction: column; } body > .container .row { flex: 1 0 auto; } footer { margin-top: 45px; padding: 35px 0 36px; border-top: 1px solid #e5e5e5; color: #666; display: flex; flex-shrink: 0; } footer p { margin-bottom: 0; } footer div { flex: 1; } footer .pkgdown { text-align: right; } footer p { margin-bottom: 0; } img.icon { float: right; } /* Ensure in-page images don't run outside their container */ .contents img { max-width: 100%; height: auto; } /* Fix bug in bootstrap (only seen in firefox) */ summary { display: list-item; } /* Typographic tweaking ---------------------------------*/ .contents .page-header { margin-top: calc(-60px + 1em); } dd { margin-left: 3em; } /* Section anchors ---------------------------------*/ a.anchor { display: none; margin-left: 5px; width: 20px; height: 20px; background-image: url(./link.svg); background-repeat: no-repeat; background-size: 20px 20px; background-position: center center; } h1:hover .anchor, h2:hover .anchor, h3:hover .anchor, h4:hover .anchor, h5:hover .anchor, h6:hover .anchor { display: inline-block; } /* Fixes for fixed navbar --------------------------*/ .contents h1, .contents h2, .contents h3, .contents h4 { padding-top: 60px; margin-top: -40px; } /* Navbar submenu --------------------------*/ .dropdown-submenu { position: relative; } .dropdown-submenu>.dropdown-menu { top: 0; left: 100%; margin-top: -6px; margin-left: -1px; border-radius: 0 6px 6px 6px; } .dropdown-submenu:hover>.dropdown-menu { display: block; } .dropdown-submenu>a:after { display: block; content: " "; float: right; width: 0; height: 0; border-color: transparent; border-style: solid; border-width: 5px 0 5px 5px; border-left-color: #cccccc; margin-top: 5px; margin-right: -10px; } .dropdown-submenu:hover>a:after { border-left-color: #ffffff; } .dropdown-submenu.pull-left { float: none; } .dropdown-submenu.pull-left>.dropdown-menu { left: -100%; margin-left: 10px; border-radius: 6px 0 6px 6px; } /* Sidebar --------------------------*/ #pkgdown-sidebar { margin-top: 30px; position: -webkit-sticky; position: sticky; top: 70px; } #pkgdown-sidebar h2 { font-size: 1.5em; margin-top: 1em; } #pkgdown-sidebar h2:first-child { margin-top: 0; } #pkgdown-sidebar .list-unstyled li { margin-bottom: 0.5em; } /* bootstrap-toc tweaks ------------------------------------------------------*/ /* All levels of nav */ nav[data-toggle='toc'] .nav > li > a { padding: 4px 20px 4px 6px; font-size: 1.5rem; font-weight: 400; color: inherit; } nav[data-toggle='toc'] .nav > li > a:hover, nav[data-toggle='toc'] .nav > li > a:focus { padding-left: 5px; color: inherit; border-left: 1px solid #878787; } nav[data-toggle='toc'] .nav > .active > a, nav[data-toggle='toc'] .nav > .active:hover > a, nav[data-toggle='toc'] .nav > .active:focus > a { padding-left: 5px; font-size: 1.5rem; font-weight: 400; color: inherit; border-left: 2px solid #878787; } /* Nav: second level (shown on .active) */ nav[data-toggle='toc'] .nav .nav { display: none; /* Hide by default, but at >768px, show it */ padding-bottom: 10px; } nav[data-toggle='toc'] .nav .nav > li > a { padding-left: 16px; font-size: 1.35rem; } nav[data-toggle='toc'] .nav .nav > li > a:hover, nav[data-toggle='toc'] .nav .nav > li > a:focus { padding-left: 15px; } nav[data-toggle='toc'] .nav .nav > .active > a, nav[data-toggle='toc'] .nav .nav > .active:hover > a, nav[data-toggle='toc'] .nav .nav > .active:focus > a { padding-left: 15px; font-weight: 500; font-size: 1.35rem; } /* orcid ------------------------------------------------------------------- */ .orcid { font-size: 16px; color: #A6CE39; /* margins are required by official ORCID trademark and display guidelines */ margin-left:4px; margin-right:4px; vertical-align: middle; } /* Reference index & topics ----------------------------------------------- */ .ref-index th {font-weight: normal;} .ref-index td {vertical-align: top; min-width: 100px} .ref-index .icon {width: 40px;} .ref-index .alias {width: 40%;} .ref-index-icons .alias {width: calc(40% - 40px);} .ref-index .title {width: 60%;} .ref-arguments th {text-align: right; padding-right: 10px;} .ref-arguments th, .ref-arguments td {vertical-align: top; min-width: 100px} .ref-arguments .name {width: 20%;} .ref-arguments .desc {width: 80%;} /* Nice scrolling for wide elements --------------------------------------- */ table { display: block; overflow: auto; } /* Syntax highlighting ---------------------------------------------------- */ pre, code, pre code { background-color: #f8f8f8; color: #333; } pre, pre code { white-space: pre-wrap; word-break: break-all; overflow-wrap: break-word; } pre { border: 1px solid #eee; } pre .img, pre .r-plt { margin: 5px 0; } pre .img img, pre .r-plt img { background-color: #fff; } code a, pre a { color: #375f84; } a.sourceLine:hover { text-decoration: none; } .fl {color: #1514b5;} .fu {color: #000000;} /* function */ .ch,.st {color: #036a07;} /* string */ .kw {color: #264D66;} /* keyword */ .co {color: #888888;} /* comment */ .error {font-weight: bolder;} .warning {font-weight: bolder;} /* Clipboard --------------------------*/ .hasCopyButton { position: relative; } .btn-copy-ex { position: absolute; right: 0; top: 0; visibility: hidden; } .hasCopyButton:hover button.btn-copy-ex { visibility: visible; } /* headroom.js ------------------------ */ .headroom { will-change: transform; transition: transform 200ms linear; } .headroom--pinned { transform: translateY(0%); } .headroom--unpinned { transform: translateY(-100%); } /* mark.js ----------------------------*/ mark { background-color: rgba(255, 255, 51, 0.5); border-bottom: 2px solid rgba(255, 153, 51, 0.3); padding: 1px; } /* vertical spacing after htmlwidgets */ .html-widget { margin-bottom: 10px; } /* fontawesome ------------------------ */ .fab { font-family: "Font Awesome 5 Brands" !important; } /* don't display links in code chunks when printing */ /* source: https://stackoverflow.com/a/10781533 */ @media print { code a:link:after, code a:visited:after { content: ""; } } /* Section anchors --------------------------------- Added in pandoc 2.11: https://github.com/jgm/pandoc-templates/commit/9904bf71 */ div.csl-bib-body { } div.csl-entry { clear: both; } .hanging-indent div.csl-entry { margin-left:2em; text-indent:-2em; } div.csl-left-margin { min-width:2em; float:left; } div.csl-right-inline { margin-left:2em; padding-left:1em; } div.csl-indent { margin-left: 2em; } ================================================ FILE: docs/pkgdown.js ================================================ /* http://gregfranko.com/blog/jquery-best-practices/ */ (function($) { $(function() { $('.navbar-fixed-top').headroom(); $('body').css('padding-top', $('.navbar').height() + 10); $(window).resize(function(){ $('body').css('padding-top', $('.navbar').height() + 10); }); $('[data-toggle="tooltip"]').tooltip(); var cur_path = paths(location.pathname); var links = $("#navbar ul li a"); var max_length = -1; var pos = -1; for (var i = 0; i < links.length; i++) { if (links[i].getAttribute("href") === "#") continue; // Ignore external links if (links[i].host !== location.host) continue; var nav_path = paths(links[i].pathname); var length = prefix_length(nav_path, cur_path); if (length > max_length) { max_length = length; pos = i; } } // Add class to parent
  • , and enclosing
  • if in dropdown if (pos >= 0) { var menu_anchor = $(links[pos]); menu_anchor.parent().addClass("active"); menu_anchor.closest("li.dropdown").addClass("active"); } }); function paths(pathname) { var pieces = pathname.split("/"); pieces.shift(); // always starts with / var end = pieces[pieces.length - 1]; if (end === "index.html" || end === "") pieces.pop(); return(pieces); } // Returns -1 if not found function prefix_length(needle, haystack) { if (needle.length > haystack.length) return(-1); // Special case for length-0 haystack, since for loop won't run if (haystack.length === 0) { return(needle.length === 0 ? 0 : -1); } for (var i = 0; i < haystack.length; i++) { if (needle[i] != haystack[i]) return(i); } return(haystack.length); } /* Clipboard --------------------------*/ function changeTooltipMessage(element, msg) { var tooltipOriginalTitle=element.getAttribute('data-original-title'); element.setAttribute('data-original-title', msg); $(element).tooltip('show'); element.setAttribute('data-original-title', tooltipOriginalTitle); } if(ClipboardJS.isSupported()) { $(document).ready(function() { var copyButton = ""; $("div.sourceCode").addClass("hasCopyButton"); // Insert copy buttons: $(copyButton).prependTo(".hasCopyButton"); // Initialize tooltips: $('.btn-copy-ex').tooltip({container: 'body'}); // Initialize clipboard: var clipboardBtnCopies = new ClipboardJS('[data-clipboard-copy]', { text: function(trigger) { return trigger.parentNode.textContent.replace(/\n#>[^\n]*/g, ""); } }); clipboardBtnCopies.on('success', function(e) { changeTooltipMessage(e.trigger, 'Copied!'); e.clearSelection(); }); clipboardBtnCopies.on('error', function() { changeTooltipMessage(e.trigger,'Press Ctrl+C or Command+C to copy'); }); }); } })(window.jQuery || window.$) ================================================ FILE: docs/pkgdown.yml ================================================ pandoc: 3.1.11.1 pkgdown: 2.0.7 pkgdown_sha: ~ articles: celda_pbmc3k: celda_pbmc3k.html decontX_pbmc4k: decontX_pbmc4k.html installation: installation.html celda: celda.html decontX: decontX.html last_built: 2024-04-02T22:54Z ================================================ FILE: docs/reference/appendCeldaList.html ================================================ Append two celdaList objects — appendCeldaList • celda

    Returns a single celdaList representing the combination of two provided celdaList objects.

    appendCeldaList(list1, list2)

    Arguments

    list1

    A celda_list object

    list2

    A celda_list object to be joined with list_1

    Value

    A celdaList object. This object contains all resList entries and runParam records from both lists.

    Examples

    data(celdaCGGridSearchRes)
    appendedList <- appendCeldaList(
      celdaCGGridSearchRes,
      celdaCGGridSearchRes
    )
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/availableModels.html ================================================ available models — availableModels • celda

    available models

    availableModels

    Format

    An object of class character of length 3.

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/bestLogLikelihood.html ================================================ Get the log-likelihood — bestLogLikelihood • celda

    Retrieves the final log-likelihood from all iterations of Gibbs sampling used to generate a celdaModel.

    bestLogLikelihood(x, altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment
    bestLogLikelihood(x, altExpName = "featureSubset")
    
    # S4 method for celdaModel
    bestLogLikelihood(x)

    Arguments

    x

    A SingleCellExperiment object returned by celda_C, celda_G, or celda_CG, or a celda model object.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    Numeric. The log-likelihood at the final step of Gibbs sampling used to generate the model.

    Examples

    data(sceCeldaCG)
    bestLogLikelihood(sceCeldaCG)
    #> [1] -1212891
    data(celdaCGMod)
    bestLogLikelihood(celdaCGMod)
    #> [1] -1215541
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celda.html ================================================ Celda models — celda • celda

    List of available Celda models with correpsonding descriptions.

    celda()

    Value

    None

    Examples

    celda()
    #> celda_C: Clusters the columns of a count matrix containing single-cell data into K subpopulations.
    #> celda_G: Clusters the rows of a count matrix containing single-cell data into L modules.
    #> celda_CG: Clusters the rows and columns of a count matrix containing single-cell data into L modules and K subpopulations, respectively.
    #> celdaGridSearch: Run Celda with different combinations of parameters and multiple chains in parallel.
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaCGGridSearchRes.html ================================================ celdaCGGridSearchRes — celdaCGGridSearchRes • celda

    Example results of old celdaGridSearch on celdaCGSim

    celdaCGGridSearchRes

    Format

    An object as returned from old celdaGridSearch()

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaCGMod.html ================================================ celdaCGmod — celdaCGMod • celda

    celda_CG model object generated from celdaCGSim using old celda_CG function.

    celdaCGMod

    Format

    A celda_CG object

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaCGSim.html ================================================ celdaCGSim — celdaCGSim • celda

    An deprecated example of simulated count matrix from the celda_CG model.

    celdaCGSim

    Format

    A list of counts and properties as returned from old simulateCells().

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaCMod.html ================================================ celdaCMod — celdaCMod • celda

    Old celda_C results generated from celdaCSim

    celdaCMod

    Format

    A celda_C object

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaCSim.html ================================================ celdaCSim — celdaCSim • celda

    An old example simulated count matrix from the celda_C model.

    celdaCSim

    Format

    A list of counts and properties as returned from old simulateCells().

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaClusters.html ================================================ Get or set the cell cluster labels from a celda SingleCellExperiment object or celda model object. — celdaClusters • celda

    Return or set the cell cluster labels determined by celda_C or celda_CG models.

    celdaClusters(x, altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment
    celdaClusters(x, altExpName = "featureSubset")
    
    # S4 method for celdaModel
    celdaClusters(x)
    
    celdaClusters(x, altExpName = "featureSubset") <- value
    
    # S4 method for SingleCellExperiment
    celdaClusters(x, altExpName = "featureSubset") <- value

    Arguments

    x

    Can be one of

    • A SingleCellExperiment object returned by celda_C, or celda_CG, with the matrix located in the useAssay assay slot. The a altExp slot with name altExpName will be used. Rows represent features and columns represent cells.

    • Celda model object.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    value

    Character vector of cell cluster labels for replacements. Works only if x is a SingleCellExperiment object.

    Value

    One of

    • Character vector if x is a SingleCellExperiment object. Contains cell cluster labels for each cell in x.

    • List if x is a celda model object. Contains cell cluster labels (for celda_C and celdaCG Models) and/or feature module labels (for celda_G and celdaCG Models).

    Examples

    data(sceCeldaCG)
    celdaClusters(sceCeldaCG)
    #>   [1] 1 2 2 2 1 1 3 1 1 2 3 2 4 3 2 1 2 4 4 1 3 5 3 2 1 3 3 2 3 3 5 3 2 5 5 3 4
    #>  [38] 4 3 2 1 2 1 1 2 3 4 2 5 3 5 1 1 3 1 3 3 1 4 5 4 4 1 3 5 2 5 2 1 3 1 2 4 1
    #>  [75] 5 2 1 3 4 4 3 5 1 1 4 4 4 1 1 3 3 1 3 1 1 4 4 3 5 3 4 3 4 4 1 3 4 4 1 3 1
    #> [112] 3 4 3 1 3 3 3 3 3 5 4 4 4 4 1 1 4 1 4 1 4 1 1 1 5 4 1 4 3 5 4 4 5 4 3 3 3
    #> [149] 1 4 4 4 1 4 1 4 3 3 5 4 1 1 4 4 3 4 1 3 2 4 4 3 1 4 1 5 1 3 4 5 1 4 4 3 4
    #> [186] 3 5 4 5 5 5 5 5 5 2 5 1 2 3 2 5 5 5 2 1 5 5 2 4 2 1 1 5 5 5 5 5 5 2 2 5 2
    #> [223] 5 1 2 5 1 5 2 5 5 5 1 2 1 5 2 5 3 5 5 2 3 5 5 1 3 2 5 5 5 2 5 4 5 5 5 5 5
    #> [260] 5 1 3 2 5 2 5 2 3 2 5 2 5 5 1 1 5 5 1 4 5 5 5 3 3 1 2 1 2 3 3 2 1 1 3 1 1
    #> [297] 1 3 1 3 3 3 2 3 3 5 5 1 1 3 3 3 1 3 3 3 3 1 1 3 3 3 1 3 5 2 1 1 1 1 1 3 1
    #> [334] 2 3 3 1 3 5 1 3 1 3 5 3 3 3 1 1 5 1 3 3 3 4 1 4 3 4 3 1 2 1 1 4 2 1 4 4 3
    #> [371] 5 1 4 5 1 3 5 3 3 1 3 5 1 4 4 4 3 3 1 3 1 5 1 3 3 5 3 1 1 1 3 1 2 1 2 4 1
    #> [408] 2 4 3 1 4 1 5 1 3 2 1 5 2 1 5 2 4 1
    #> Levels: 1 2 3 4 5
    data(celdaCGMod)
    celdaClusters(celdaCGMod)
    #> $z
    #>   [1] 2 1 1 1 2 2 3 2 2 1 3 1 4 3 1 2 1 4 4 2 3 5 3 1 2 3 3 1 3 3 5 3 1 5 5 3 4
    #>  [38] 4 3 1 2 1 2 2 1 3 4 1 5 3 5 2 2 3 2 3 3 2 4 5 4 4 2 3 5 1 5 1 2 3 2 1 4 2
    #>  [75] 5 1 2 3 4 4 3 5 2 2 4 4 4 2 2 3 3 2 3 2 2 4 4 3 5 3 4 3 4 4 2 3 4 4 2 3 2
    #> [112] 3 4 3 2 3 3 3 3 3 5 4 4 4 4 2 2 4 2 4 2 4 2 2 2 5 4 2 4 3 5 4 4 5 4 3 3 3
    #> [149] 2 4 4 4 2 4 2 4 3 3 5 4 2 2 4 4 3 4 2 3 1 4 4 3 2 4 2 5 2 3 4 5 2 4 4 3 4
    #> [186] 3 5 4 5 5 5 5 5 5 1 5 2 1 3 1 5 5 5 1 2 5 5 1 4 1 2 2 5 5 5 5 5 5 1 1 5 1
    #> [223] 5 2 1 5 2 5 1 5 5 5 2 1 2 5 1 5 3 5 5 1 3 5 5 2 3 1 5 5 5 1 5 4 5 5 5 5 5
    #> [260] 5 2 3 1 5 1 5 1 3 1 5 1 5 5 2 2 5 5 2 4 5 5 5 3 3 2 1 2 1 3 3 1 2 2 3 2 2
    #> [297] 2 3 2 3 3 3 1 3 3 5 5 2 2 3 3 3 2 3 3 3 3 2 2 3 3 3 2 3 5 1 2 2 2 2 2 3 2
    #> [334] 1 3 3 2 3 5 2 3 2 3 5 3 3 3 2 2 5 2 3 3 3 4 2 4 3 4 3 2 1 2 2 4 1 2 4 4 3
    #> [371] 5 2 4 5 2 3 5 3 3 2 3 5 2 4 4 4 3 3 2 3 2 5 2 3 3 5 3 2 2 2 3 2 1 2 1 4 2
    #> [408] 1 4 3 2 4 2 5 2 3 1 2 5 1 2 5 1 4 2
    #> 
    #> $y
    #>   [1]  7  5  1  8  1  4 10  9  2  4  7 10 10  5  3 10  2 10  9  5  9  3  5  7  6
    #>  [26]  9  9  4  8  2  9  5  4  5  9  5  5  4  4  6  5  8  8  5  1  6 10  9  7  1
    #>  [51]  4  3  7  9  6 10  9 10  7  6  8  6  9  5  4  5  9  9 10  5  4  3  8  7  9
    #>  [76]  6  2  3  7  6  3  4  9  3  9 10  4 10  7  1  1 10 10  1  6  1  6  9  5  4
    #> 
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaGMod.html ================================================ celdaGMod — celdaGMod • celda

    Old celda_G results generated from celdaGsim

    celdaGMod

    Format

    A celda_G object

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaGSim.html ================================================ celdaGSim — celdaGSim • celda

    An old example simulated count matrix from the celda_G model.

    celdaGSim

    Format

    A list of counts and properties as returned from old simulateCells()

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaGridSearch.html ================================================ Run Celda in parallel with multiple parameters — celdaGridSearch • celda

    Run Celda with different combinations of parameters and multiple chains in parallel. The variable availableModels contains the potential models that can be utilized. Different parameters to be tested should be stored in a list and passed to the argument paramsTest. Fixed parameters to be used in all models, such as sampleLabel, can be passed as a list to the argument paramsFixed. When verbose = TRUE, output from each chain will be sent to a log file but not be displayed in stdout.

    celdaGridSearch(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      model,
      paramsTest,
      paramsFixed = NULL,
      maxIter = 200,
      nchains = 3,
      cores = 1,
      bestOnly = TRUE,
      seed = 12345,
      perplexity = TRUE,
      verbose = TRUE,
      logfilePrefix = "Celda"
    )
    
    # S4 method for SingleCellExperiment
    celdaGridSearch(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      model,
      paramsTest,
      paramsFixed = NULL,
      maxIter = 200,
      nchains = 3,
      cores = 1,
      bestOnly = TRUE,
      seed = 12345,
      perplexity = TRUE,
      verbose = TRUE,
      logfilePrefix = "Celda"
    )
    
    # S4 method for matrix
    celdaGridSearch(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      model,
      paramsTest,
      paramsFixed = NULL,
      maxIter = 200,
      nchains = 3,
      cores = 1,
      bestOnly = TRUE,
      seed = 12345,
      perplexity = TRUE,
      verbose = TRUE,
      logfilePrefix = "Celda"
    )

    Arguments

    x

    A numeric matrix of counts or a SingleCellExperiment with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells.

    useAssay

    A string specifying the name of the assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    model

    Celda model. Options available in availableModels.

    paramsTest

    List. A list denoting the combinations of parameters to run in a celda model. For example, list(K = seq(5, 10), L = seq(15, 20)) will run all combinations of K from 5 to 10 and L from 15 to 20 in model celda_CG.

    paramsFixed

    List. A list denoting additional parameters to use in each celda model. Default NULL.

    maxIter

    Integer. Maximum number of iterations of sampling to perform. Default 200.

    nchains

    Integer. Number of random cluster initializations. Default 3.

    cores

    Integer. The number of cores to use for parallel estimation of chains. Default 1.

    bestOnly

    Logical. Whether to return only the chain with the highest log likelihood per combination of parameters or return all chains. Default TRUE.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. Seed values seq(seed, (seed + nchains - 1)) will be supplied to each chain in nchains. If NULL, no calls to with_seed are made.

    perplexity

    Logical. Whether to calculate perplexity for each model. If FALSE, then perplexity can be calculated later with resamplePerplexity. Default TRUE.

    verbose

    Logical. Whether to print log messages during celda chain execution. Default TRUE.

    logfilePrefix

    Character. Prefix for log files from worker threads and main process. Default "Celda".

    Value

    A SingleCellExperiment object. Function parameter settings and celda model results are stored in the

    metadata

    "celda_grid_search" slot.

    See also

    celda_G for feature clustering, celda_C for clustering of cells, and celda_CG for simultaneous clustering of features and cells. subsetCeldaList can subset the celdaList object. selectBestModel can get the best model for each combination of parameters.

    Examples

    if (FALSE) {
    data(celdaCGSim)
    ## Run various combinations of parameters with 'celdaGridSearch'
    celdaCGGridSearchRes <- celdaGridSearch(celdaCGSim$counts,
      model = "celda_CG",
      paramsTest = list(K = seq(4, 6), L = seq(9, 11)),
      paramsFixed = list(sampleLabel = celdaCGSim$sampleLabel),
      bestOnly = TRUE,
      nchains = 1,
      cores = 1)
    }
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaHeatmap.html ================================================ Plot celda Heatmap — celdaHeatmap • celda

    Render a stylable heatmap of count data based on celda clustering results.

    celdaHeatmap(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      featureIx = NULL,
      nfeatures = 25,
      ...
    )
    
    # S4 method for SingleCellExperiment
    celdaHeatmap(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      featureIx = NULL,
      nfeatures = 25,
      ...
    )

    Arguments

    sce

    A SingleCellExperiment object returned by celda_C, celda_G, or celda_CG.

    useAssay

    A string specifying which assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    featureIx

    Integer vector. Select features for display in heatmap. If NULL, no subsetting will be performed. Default NULL. Only used for sce containing celda_C model result returned by celda_C.

    nfeatures

    Integer. Maximum number of features to select for each gene module. Default 25. Only used for sce containing celda_CG or celda_G model results returned by celda_CG or celda_G.

    ...

    Additional parameters passed to plotHeatmap.

    Value

    list A list containing dendrogram information and the heatmap grob

    See also

    `celdaTsne()` for generating 2-dimensional tSNE coordinates

    Examples

    data(sceCeldaCG)
    celdaHeatmap(sceCeldaCG)
    #> TableGrob (5 x 6) "layout": 9 grobs
    #>   z     cells                 name                       grob
    #> 1 1 (2-2,3-3)             col_tree polyline[GRID.polyline.16]
    #> 2 2 (4-4,1-1)             row_tree polyline[GRID.polyline.17]
    #> 3 3 (4-4,3-3)               matrix       gTree[GRID.gTree.19]
    #> 4 4 (3-3,3-3)       col_annotation         rect[GRID.rect.20]
    #> 5 5 (3-3,4-4) col_annotation_names         text[GRID.text.21]
    #> 6 6 (4-4,2-2)       row_annotation         rect[GRID.rect.22]
    #> 7 7 (5-5,2-2) row_annotation_names         text[GRID.text.23]
    #> 8 8 (4-5,6-6)     annotationLegend       gTree[GRID.gTree.31]
    #> 9 9 (4-5,5-5)               legend       gTree[GRID.gTree.34]
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaModel.html ================================================ Get celda model from a celda SingleCellExperiment object — celdaModel • celda

    Return the celda model for sce returned by celda_C, celda_G or celda_CG.

    celdaModel(sce, altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment
    celdaModel(sce, altExpName = "featureSubset")

    Arguments

    sce

    A SingleCellExperiment object returned by celda_C, celda_G, or celda_CG.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    Character. The celda model. Can be one of "celda_C", "celda_G", or "celda_CG".

    Examples

    data(sceCeldaCG)
    celdaModel(sceCeldaCG)
    #> [1] "celda_CG"
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaModules.html ================================================ Get or set the feature module labels from a celda SingleCellExperiment object. — celdaModules • celda

    Return or set the feature module cluster labels determined by celda_G or celda_CG models.

    celdaModules(sce, altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment
    celdaModules(sce, altExpName = "featureSubset")
    
    celdaModules(sce, altExpName = "featureSubset") <- value
    
    # S4 method for SingleCellExperiment
    celdaModules(sce, altExpName = "featureSubset") <- value

    Arguments

    sce

    A SingleCellExperiment object returned by celda_G, or celda_CG, with the matrix located in the useAssay assay slot. Rows represent features and columns represent cells.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    value

    Character vector of feature module labels for replacements. Works only if x is a SingleCellExperiment object.

    Value

    Character vector. Contains feature module labels for each feature in x.

    Examples

    data(sceCeldaCG)
    celdaModules(sceCeldaCG)
    #>  [1] 7  5  1  8  1  4  10 9  2  4  7  10 10 5  3  10 2  10 9  5  9  3  5  7  6 
    #> [26] 9  4  8  2  9  5  4  5  9  5  5  4  4  6  5  8  8  5  1  6  10 9  7  1  4 
    #> [51] 3  7  9  6  9  10 7  6  8  6  9  5  4  5  9  9  10 5  3  8  7  9  6  2  3 
    #> [76] 7  6  3  4  9  3  9  10 4  10 7  1  1  10 10 1  6  6  9  5  4 
    #> Levels: 1 2 3 4 5 6 7 8 9 10
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaPerplexity-celdaList-method.html ================================================ Get perplexity for every model in a celdaList — celdaPerplexity,celdaList-method • celda

    Returns perplexity for each model in a celdaList as calculated by `perplexity().`

    # S4 method for celdaList
    celdaPerplexity(celdaList)

    Arguments

    celdaList

    An object of class celdaList.

    Value

    List. Contains one celdaModel object for each of the parameters specified in the `runParams()` of the provided celda list.

    Examples

    data(celdaCGGridSearchRes)
    celdaCGGridModelPerplexities <- celdaPerplexity(celdaCGGridSearchRes)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaPerplexity.html ================================================ Get perplexity for every model in a celdaList — celdaPerplexity • celda

    Returns perplexity for each model in a celdaList as calculated by `perplexity().`

    celdaPerplexity(celdaList)

    Arguments

    celdaList

    An object of class celdaList.

    Value

    List. Contains one celdaModel object for each of the parameters specified in the `runParams()` of the provided celda list.

    Examples

    data(celdaCGGridSearchRes)
    celdaCGGridModelPerplexities <- celdaPerplexity(celdaCGGridSearchRes)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaProbabilityMap.html ================================================ Probability map for a celda model — celdaProbabilityMap • celda

    Renders probability and relative expression heatmaps to visualize the relationship between features and cell populations (or cell populations and samples).

    celdaProbabilityMap(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      level = c("cellPopulation", "sample"),
      ncols = 100,
      col2 = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")),
      title1 = "Absolute probability",
      title2 = "Relative expression",
      showColumnNames = TRUE,
      showRowNames = TRUE,
      rowNamesgp = grid::gpar(fontsize = 8),
      colNamesgp = grid::gpar(fontsize = 12),
      clusterRows = FALSE,
      clusterColumns = FALSE,
      showHeatmapLegend = TRUE,
      heatmapLegendParam = list(title = NULL, legend_height = grid::unit(6, "cm")),
      ...
    )
    
    # S4 method for SingleCellExperiment
    celdaProbabilityMap(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      level = c("cellPopulation", "sample"),
      ncols = 100,
      col2 = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")),
      title1 = "Absolute probability",
      title2 = "Relative expression",
      showColumnNames = TRUE,
      showRowNames = TRUE,
      rowNamesgp = grid::gpar(fontsize = 8),
      colNamesgp = grid::gpar(fontsize = 12),
      clusterRows = FALSE,
      clusterColumns = FALSE,
      showHeatmapLegend = TRUE,
      heatmapLegendParam = list(title = NULL, legend_height = grid::unit(6, "cm")),
      ...
    )

    Arguments

    sce

    A SingleCellExperiment object returned by celda_C, celda_G, or celda_CG.

    useAssay

    A string specifying which assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    level

    Character. One of "cellPopulation" or "Sample". "cellPopulation" will display the absolute probabilities and relative normalized expression of each module in each cell population. level = "cellPopulation" only works for celda_CG sce objects. "sample" will display the absolute probabilities and relative normalized abundance of each cell population in each sample. Default "cellPopulation".

    ncols

    The number of colors (>1) to be in the color palette of the absolute probability heatmap.

    col2

    Passed to col argument of Heatmap. Set color boundaries and colors for the relative expression heatmap.

    title1

    Passed to column_title argument of Heatmap. Figure title for the absolute probability heatmap.

    title2

    Passed to column_title argument of Heatmap. Figure title for the relative expression heatmap.

    showColumnNames

    Passed to show_column_names argument of Heatmap. Show column names.

    showRowNames

    Passed to show_row_names argument of Heatmap. Show row names.

    rowNamesgp

    Passed to row_names_gp argument of Heatmap. Set row name font.

    colNamesgp

    Passed to column_names_gp argument of Heatmap. Set column name font.

    clusterRows

    Passed to cluster_rows argument of Heatmap. Cluster rows.

    clusterColumns

    Passed to cluster_columns argument of Heatmap. Cluster columns.

    showHeatmapLegend

    Passed to show_heatmap_legend argument of Heatmap. Show heatmap legend.

    heatmapLegendParam

    Passed to heatmap_legend_param argument of Heatmap. Heatmap legend parameters.

    ...

    Additional parameters passed to Heatmap.

    Value

    A HeatmapList object containing 2

    Heatmap-class objects

    See also

    celda_C for clustering cells. celda_CG for clustering features and cells

    Examples

    data(sceCeldaCG)
    celdaProbabilityMap(sceCeldaCG)
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaTsne.html ================================================ t-Distributed Stochastic Neighbor Embedding (t-SNE) dimension reduction for celda sce object — celdaTsne • celda

    Embeds cells in two dimensions using Rtsne based on a celda model. For celda_C sce objects, PCA on the normalized counts is used to reduce the number of features before applying t-SNE. For celda_CG and celda_G sce objects, tSNE is run on module probabilities to reduce the number of features instead of using PCA. Module probabilities are square-root transformed before applying tSNE.

    celdaTsne(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      maxCells = NULL,
      minClusterSize = 100,
      initialDims = 20,
      modules = NULL,
      perplexity = 20,
      maxIter = 2500,
      normalize = "proportion",
      scaleFactor = NULL,
      transformationFun = sqrt,
      seed = 12345
    )
    
    # S4 method for SingleCellExperiment
    celdaTsne(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      maxCells = NULL,
      minClusterSize = 100,
      initialDims = 20,
      modules = NULL,
      perplexity = 20,
      maxIter = 2500,
      normalize = "proportion",
      scaleFactor = NULL,
      transformationFun = sqrt,
      seed = 12345
    )

    Arguments

    sce

    A SingleCellExperiment object returned by celda_C, celda_G, or celda_CG.

    useAssay

    A string specifying which assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    maxCells

    Integer. Maximum number of cells to plot. Cells will be randomly subsampled if ncol(counts) > maxCells. Larger numbers of cells requires more memory. If NULL, no subsampling will be performed. Default NULL.

    minClusterSize

    Integer. Do not subsample cell clusters below this threshold. Default 100.

    initialDims

    Integer. PCA will be used to reduce the dimensionality of the dataset. The top 'initialDims' principal components will be used for tSNE. Default 20.

    modules

    Integer vector. Determines which feature modules to use for tSNE. If NULL, all modules will be used. Default NULL.

    perplexity

    Numeric. Perplexity parameter for tSNE. Default 20.

    maxIter

    Integer. Maximum number of iterations in tSNE generation. Default 2500.

    normalize

    Character. Passed to normalizeCounts in normalization step. Divides counts by the library sizes for each cell. One of 'proportion', 'cpm', 'median', or 'mean'. 'proportion' uses the total counts for each cell as the library size. 'cpm' divides the library size of each cell by one million to produce counts per million. 'median' divides the library size of each cell by the median library size across all cells. 'mean' divides the library size of each cell by the mean library size across all cells.

    scaleFactor

    Numeric. Sets the scale factor for cell-level normalization. This scale factor is multiplied to each cell after the library size of each cell had been adjusted in normalize. Default NULL which means no scale factor is applied.

    transformationFun

    Function. Applys a transformation such as 'sqrt', 'log', 'log2', 'log10', or 'log1p'. If NULL, no transformation will be applied. Occurs after applying normalization and scale factor. Default NULL.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    Value

    sce with t-SNE coordinates (columns "celda_tSNE1" & "celda_tSNE2") added to

    reducedDim(sce, "celda_tSNE").

    Examples

    data(sceCeldaCG)
    tsneRes <- celdaTsne(sceCeldaCG)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdaUmap.html ================================================ Uniform Manifold Approximation and Projection (UMAP) dimension reduction for celda sce object — celdaUmap • celda

    Embeds cells in two dimensions using umap based on a celda model. For celda_C sce objects, PCA on the normalized counts is used to reduce the number of features before applying UMAP. For celda_CG sce object, UMAP is run on module probabilities to reduce the number of features instead of using PCA. Module probabilities are square-root transformed before applying UMAP.

    celdaUmap(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      maxCells = NULL,
      minClusterSize = 100,
      modules = NULL,
      seed = 12345,
      nNeighbors = 30,
      minDist = 0.75,
      spread = 1,
      pca = TRUE,
      initialDims = 50,
      normalize = "proportion",
      scaleFactor = NULL,
      transformationFun = sqrt,
      cores = 1,
      ...
    )
    
    # S4 method for SingleCellExperiment
    celdaUmap(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      maxCells = NULL,
      minClusterSize = 100,
      modules = NULL,
      seed = 12345,
      nNeighbors = 30,
      minDist = 0.75,
      spread = 1,
      pca = TRUE,
      initialDims = 50,
      normalize = "proportion",
      scaleFactor = NULL,
      transformationFun = sqrt,
      cores = 1,
      ...
    )

    Arguments

    sce

    A SingleCellExperiment object returned by celda_C, celda_G, or celda_CG.

    useAssay

    A string specifying which assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    maxCells

    Integer. Maximum number of cells to plot. Cells will be randomly subsampled if ncol(sce) > maxCells. Larger numbers of cells requires more memory. If NULL, no subsampling will be performed. Default NULL.

    minClusterSize

    Integer. Do not subsample cell clusters below this threshold. Default 100.

    modules

    Integer vector. Determines which features modules to use for UMAP. If NULL, all modules will be used. Default NULL.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    nNeighbors

    The size of local neighborhood used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. Default 30. See umap for more information.

    minDist

    The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result on a more even dispersal of points. Default 0.75. See umap for more information.

    spread

    The effective scale of embedded points. In combination with min_dist, this determines how clustered/clumped the embedded points are. Default 1. See umap for more information.

    pca

    Logical. Whether to perform dimensionality reduction with PCA before UMAP. Only works for celda_C sce objects.

    initialDims

    Integer. Number of dimensions from PCA to use as input in UMAP. Default 50. Only works for celda_C sce objects.

    normalize

    Character. Passed to normalizeCounts in normalization step. Divides counts by the library sizes for each cell. One of 'proportion', 'cpm', 'median', or 'mean'. 'proportion' uses the total counts for each cell as the library size. 'cpm' divides the library size of each cell by one million to produce counts per million. 'median' divides the library size of each cell by the median library size across all cells. 'mean' divides the library size of each cell by the mean library size across all cells.

    scaleFactor

    Numeric. Sets the scale factor for cell-level normalization. This scale factor is multiplied to each cell after the library size of each cell had been adjusted in normalize. Default NULL which means no scale factor is applied.

    transformationFun

    Function. Applys a transformation such as 'sqrt', 'log', 'log2', 'log10', or 'log1p'. If NULL, no transformation will be applied. Occurs after applying normalization and scale factor. Default NULL.

    cores

    Number of threads to use. Default 1.

    ...

    Additional parameters to pass to umap.

    Value

    sce with UMAP coordinates (columns "celda_UMAP1" & "celda_UMAP2") added to

    reducedDim(sce, "celda_UMAP").

    Examples

    data(sceCeldaCG)
    umapRes <- celdaUmap(sceCeldaCG)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celda_C.html ================================================ Cell clustering with Celda — celda_C • celda

    Clusters the columns of a count matrix containing single-cell data into K subpopulations. The useAssay assay slot in altExpName altExp slot will be used if it exists. Otherwise, the useAssay assay slot in x will be used if x is a SingleCellExperiment object.

    celda_C(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      sampleLabel = NULL,
      K,
      alpha = 1,
      beta = 1,
      algorithm = c("EM", "Gibbs"),
      stopIter = 10,
      maxIter = 200,
      splitOnIter = 10,
      splitOnLast = TRUE,
      seed = 12345,
      nchains = 3,
      zInitialize = c("split", "random", "predefined"),
      countChecksum = NULL,
      zInit = NULL,
      logfile = NULL,
      verbose = TRUE
    )
    
    # S4 method for SingleCellExperiment
    celda_C(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      sampleLabel = NULL,
      K,
      alpha = 1,
      beta = 1,
      algorithm = c("EM", "Gibbs"),
      stopIter = 10,
      maxIter = 200,
      splitOnIter = 10,
      splitOnLast = TRUE,
      seed = 12345,
      nchains = 3,
      zInitialize = c("split", "random", "predefined"),
      countChecksum = NULL,
      zInit = NULL,
      logfile = NULL,
      verbose = TRUE
    )
    
    # S4 method for ANY
    celda_C(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      sampleLabel = NULL,
      K,
      alpha = 1,
      beta = 1,
      algorithm = c("EM", "Gibbs"),
      stopIter = 10,
      maxIter = 200,
      splitOnIter = 10,
      splitOnLast = TRUE,
      seed = 12345,
      nchains = 3,
      zInitialize = c("split", "random", "predefined"),
      countChecksum = NULL,
      zInit = NULL,
      logfile = NULL,
      verbose = TRUE
    )

    Arguments

    x

    A SingleCellExperiment with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells. Alternatively, any matrix-like object that can be coerced to a sparse matrix of class "dgCMatrix" can be directly used as input. The matrix will automatically be converted to a SingleCellExperiment object.

    useAssay

    A string specifying the name of the assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    sampleLabel

    Vector or factor. Denotes the sample label for each cell (column) in the count matrix.

    K

    Integer. Number of cell populations.

    alpha

    Numeric. Concentration parameter for Theta. Adds a pseudocount to each cell population in each sample. Default 1.

    beta

    Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature in each cell population. Default 1.

    algorithm

    String. Algorithm to use for clustering cell subpopulations. One of 'EM' or 'Gibbs'. The EM algorithm is faster, especially for larger numbers of cells. However, more chains may be required to ensure a good solution is found. If 'EM' is selected, then 'stopIter' will be automatically set to 1. Default 'EM'.

    stopIter

    Integer. Number of iterations without improvement in the log likelihood to stop inference. Default 10.

    maxIter

    Integer. Maximum number of iterations of Gibbs sampling or EM to perform. Default 200.

    splitOnIter

    Integer. On every `splitOnIter` iteration, a heuristic will be applied to determine if a cell population should be reassigned and another cell population should be split into two clusters. To disable splitting, set to -1. Default 10.

    splitOnLast

    Integer. After `stopIter` iterations have been performed without improvement, a heuristic will be applied to determine if a cell population should be reassigned and another cell population should be split into two clusters. If a split occurs, then `stopIter` will be reset. Default TRUE.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    nchains

    Integer. Number of random cluster initializations. Default 3.

    zInitialize

    Character. One of 'random', 'split', or 'predefined'. With 'random', cells are randomly assigned to a populations. With 'split', cells will be split into sqrt(K) populations and then each population will be subsequently split into another sqrt(K) populations. With 'predefined', values in `zInit` will be used to initialize `z`. Default 'split'.

    countChecksum

    Character. An MD5 checksum for the `counts` matrix. Default NULL.

    zInit

    Integer vector. Sets initial starting values of z. 'zInit' is only used when `zInitialize = 'predfined'`. Default NULL.

    logfile

    Character. Messages will be redirected to a file named `logfile`. If NULL, messages will be printed to stdout. Default NULL.

    verbose

    Logical. Whether to print log messages. Default TRUE.

    Value

    A SingleCellExperiment object. Function parameter settings are stored in the metadata

    "celda_parameters" slot. Columns celda_sample_label and celda_cell_cluster in

    colData contain sample labels and celda cell population clusters.

    See also

    celda_G for feature clustering and celda_CG for simultaneous clustering of features and cells. celdaGridSearch can be used to run multiple values of K and multiple chains in parallel.

    Examples

    data(celdaCSim)
    sce <- celda_C(celdaCSim$counts,
        K = celdaCSim$K,
        sampleLabel = celdaCSim$sampleLabel,
        nchains = 1)
    #> --------------------------------------------------
    #> Starting Celda_C: Clustering cells.
    #> --------------------------------------------------
    #> Tue Apr  2 18:54:46 2024 .. Initializing 'z' in chain 1 with 'split' 
    #> Tue Apr  2 18:54:46 2024 .... Completed iteration: 1 | logLik: -1282027.27277705
    #> Tue Apr  2 18:54:46 2024 .... Completed iteration: 2 | logLik: -1282027.27277705
    #> Tue Apr  2 18:54:46 2024 .. Finished chain 1
    #> --------------------------------------------------
    #> Completed Celda_C. Total time: 0.1405442 secs
    #> --------------------------------------------------
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celda_CG.html ================================================ Cell and feature clustering with Celda — celda_CG • celda

    Clusters the rows and columns of a count matrix containing single-cell data into L modules and K subpopulations, respectively. The useAssay assay slot in altExpName altExp slot will be used if it exists. Otherwise, the useAssay assay slot in x will be used if x is a SingleCellExperiment object.

    celda_CG(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      sampleLabel = NULL,
      K,
      L,
      alpha = 1,
      beta = 1,
      delta = 1,
      gamma = 1,
      algorithm = c("EM", "Gibbs"),
      stopIter = 10,
      maxIter = 200,
      splitOnIter = 10,
      splitOnLast = TRUE,
      seed = 12345,
      nchains = 3,
      zInitialize = c("split", "random", "predefined"),
      yInitialize = c("split", "random", "predefined"),
      countChecksum = NULL,
      zInit = NULL,
      yInit = NULL,
      logfile = NULL,
      verbose = TRUE
    )
    
    # S4 method for SingleCellExperiment
    celda_CG(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      sampleLabel = NULL,
      K,
      L,
      alpha = 1,
      beta = 1,
      delta = 1,
      gamma = 1,
      algorithm = c("EM", "Gibbs"),
      stopIter = 10,
      maxIter = 200,
      splitOnIter = 10,
      splitOnLast = TRUE,
      seed = 12345,
      nchains = 3,
      zInitialize = c("split", "random", "predefined"),
      yInitialize = c("split", "random", "predefined"),
      countChecksum = NULL,
      zInit = NULL,
      yInit = NULL,
      logfile = NULL,
      verbose = TRUE
    )
    
    # S4 method for ANY
    celda_CG(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      sampleLabel = NULL,
      K,
      L,
      alpha = 1,
      beta = 1,
      delta = 1,
      gamma = 1,
      algorithm = c("EM", "Gibbs"),
      stopIter = 10,
      maxIter = 200,
      splitOnIter = 10,
      splitOnLast = TRUE,
      seed = 12345,
      nchains = 3,
      zInitialize = c("split", "random", "predefined"),
      yInitialize = c("split", "random", "predefined"),
      countChecksum = NULL,
      zInit = NULL,
      yInit = NULL,
      logfile = NULL,
      verbose = TRUE
    )

    Arguments

    x

    A SingleCellExperiment with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells. Alternatively, any matrix-like object that can be coerced to a sparse matrix of class "dgCMatrix" can be directly used as input. The matrix will automatically be converted to a SingleCellExperiment object.

    useAssay

    A string specifying the name of the assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    sampleLabel

    Vector or factor. Denotes the sample label for each cell (column) in the count matrix.

    K

    Integer. Number of cell populations.

    L

    Integer. Number of feature modules.

    alpha

    Numeric. Concentration parameter for Theta. Adds a pseudocount to each cell population in each sample. Default 1.

    beta

    Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature module in each cell population. Default 1.

    delta

    Numeric. Concentration parameter for Psi. Adds a pseudocount to each feature in each module. Default 1.

    gamma

    Numeric. Concentration parameter for Eta. Adds a pseudocount to the number of features in each module. Default 1.

    algorithm

    String. Algorithm to use for clustering cell subpopulations. One of 'EM' or 'Gibbs'. The EM algorithm for cell clustering is faster, especially for larger numbers of cells. However, more chains may be required to ensure a good solution is found. Default 'EM'.

    stopIter

    Integer. Number of iterations without improvement in the log likelihood to stop inference. Default 10.

    maxIter

    Integer. Maximum number of iterations of Gibbs sampling to perform. Default 200.

    splitOnIter

    Integer. On every splitOnIter iteration, a heuristic will be applied to determine if a cell population or feature module should be reassigned and another cell population or feature module should be split into two clusters. To disable splitting, set to -1. Default 10.

    splitOnLast

    Integer. After stopIter iterations have been performed without improvement, a heuristic will be applied to determine if a cell population or feature module should be reassigned and another cell population or feature module should be split into two clusters. If a split occurs, then 'stopIter' will be reset. Default TRUE.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    nchains

    Integer. Number of random cluster initializations. Default 3.

    zInitialize

    Chararacter. One of 'random', 'split', or 'predefined'. With 'random', cells are randomly assigned to a populations. With 'split', cells will be split into sqrt(K) populations and then each population will be subsequently split into another sqrt(K) populations. With 'predefined', values in zInit will be used to initialize z. Default 'split'.

    yInitialize

    Character. One of 'random', 'split', or 'predefined'. With 'random', features are randomly assigned to a modules. With 'split', features will be split into sqrt(L) modules and then each module will be subsequently split into another sqrt(L) modules. With 'predefined', values in yInit will be used to initialize y. Default 'split'.

    countChecksum

    Character. An MD5 checksum for the counts matrix. Default NULL.

    zInit

    Integer vector. Sets initial starting values of z. 'zInit' is only used when `zInitialize = 'predfined'`. Default NULL.

    yInit

    Integer vector. Sets initial starting values of y. 'yInit' is only be used when `yInitialize = "predefined"`. Default NULL.

    logfile

    Character. Messages will be redirected to a file named `logfile`. If NULL, messages will be printed to stdout. Default NULL.

    verbose

    Logical. Whether to print log messages. Default TRUE.

    Value

    A SingleCellExperiment object. Function parameter settings are stored in metadata

    "celda_parameters" in altExp slot. In altExp slot, columns celda_sample_label and celda_cell_cluster in

    colData contain sample labels and celda cell population clusters. Column celda_feature_module in

    rowData contains feature modules.

    See also

    celda_G for feature clustering and celda_C for clustering cells. celdaGridSearch can be used to run multiple values of K/L and multiple chains in parallel.

    Examples

    data(celdaCGSim)
    sce <- celda_CG(celdaCGSim$counts,
        K = celdaCGSim$K,
        L = celdaCGSim$L,
        sampleLabel = celdaCGSim$sampleLabel,
        nchains = 1)
    #> --------------------------------------------------
    #> Starting Celda_CG: Clustering cells and genes.
    #> --------------------------------------------------
    #> Tue Apr  2 18:54:47 2024 .. Initializing 'z' in chain 1 with 'split' 
    #> Tue Apr  2 18:54:47 2024 .. Initializing 'y' in chain 1 with 'split' 
    #> Tue Apr  2 18:54:50 2024 .... Completed iteration: 1 | logLik: -1215542.98684529
    #> Tue Apr  2 18:54:50 2024 .... Completed iteration: 2 | logLik: -1215541.0958389
    #> Tue Apr  2 18:54:50 2024 .... Completed iteration: 3 | logLik: -1215541.0958389
    #> Tue Apr  2 18:54:50 2024 .... Completed iteration: 4 | logLik: -1215542.98684529
    #> Tue Apr  2 18:54:50 2024 .... Completed iteration: 5 | logLik: -1215541.0958389
    #> Tue Apr  2 18:54:50 2024 .... Completed iteration: 6 | logLik: -1215541.0958389
    #> Tue Apr  2 18:54:50 2024 .... Completed iteration: 7 | logLik: -1215541.0958389
    #> Tue Apr  2 18:54:50 2024 .... Completed iteration: 8 | logLik: -1215541.0958389
    #> Tue Apr  2 18:54:50 2024 .... Completed iteration: 9 | logLik: -1215541.0958389
    #> Tue Apr  2 18:54:50 2024 .... Determining if any gene clusters should be split.
    #> Tue Apr  2 18:54:51 2024 .... No additional splitting was performed.
    #> Tue Apr  2 18:54:51 2024 .... Determining if any cell clusters should be split.
    #> Tue Apr  2 18:54:51 2024 .... No additional splitting was performed.
    #> Tue Apr  2 18:54:51 2024 .... Completed iteration: 10 | logLik: -1215541.0958389
    #> Tue Apr  2 18:54:51 2024 .... Determining if any cell clusters should be split.
    #> Tue Apr  2 18:54:51 2024 .... No additional splitting was performed.
    #> Tue Apr  2 18:54:51 2024 .... Completed iteration: 11 | logLik: -1215541.0958389
    #> Tue Apr  2 18:54:51 2024 .. Finished chain 1
    #> --------------------------------------------------
    #> Completed Celda_CG. Total time: 3.852616 secs
    #> --------------------------------------------------
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celda_G.html ================================================ Feature clustering with Celda — celda_G • celda

    Clusters the rows of a count matrix containing single-cell data into L modules. The useAssay assay slot in altExpName altExp slot will be used if it exists. Otherwise, the useAssay assay slot in x will be used if x is a SingleCellExperiment object.

    celda_G(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      L,
      beta = 1,
      delta = 1,
      gamma = 1,
      stopIter = 10,
      maxIter = 200,
      splitOnIter = 10,
      splitOnLast = TRUE,
      seed = 12345,
      nchains = 3,
      yInitialize = c("split", "random", "predefined"),
      countChecksum = NULL,
      yInit = NULL,
      logfile = NULL,
      verbose = TRUE
    )
    
    # S4 method for SingleCellExperiment
    celda_G(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      L,
      beta = 1,
      delta = 1,
      gamma = 1,
      stopIter = 10,
      maxIter = 200,
      splitOnIter = 10,
      splitOnLast = TRUE,
      seed = 12345,
      nchains = 3,
      yInitialize = c("split", "random", "predefined"),
      countChecksum = NULL,
      yInit = NULL,
      logfile = NULL,
      verbose = TRUE
    )
    
    # S4 method for ANY
    celda_G(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      L,
      beta = 1,
      delta = 1,
      gamma = 1,
      stopIter = 10,
      maxIter = 200,
      splitOnIter = 10,
      splitOnLast = TRUE,
      seed = 12345,
      nchains = 3,
      yInitialize = c("split", "random", "predefined"),
      countChecksum = NULL,
      yInit = NULL,
      logfile = NULL,
      verbose = TRUE
    )

    Arguments

    x

    A SingleCellExperiment with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells. Alternatively, any matrix-like object that can be coerced to a sparse matrix of class "dgCMatrix" can be directly used as input. The matrix will automatically be converted to a SingleCellExperiment object.

    useAssay

    A string specifying the name of the assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    L

    Integer. Number of feature modules.

    beta

    Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature module in each cell. Default 1.

    delta

    Numeric. Concentration parameter for Psi. Adds a pseudocount to each feature in each module. Default 1.

    gamma

    Numeric. Concentration parameter for Eta. Adds a pseudocount to the number of features in each module. Default 1.

    stopIter

    Integer. Number of iterations without improvement in the log likelihood to stop inference. Default 10.

    maxIter

    Integer. Maximum number of iterations of Gibbs sampling to perform. Default 200.

    splitOnIter

    Integer. On every `splitOnIter` iteration, a heuristic will be applied to determine if a feature module should be reassigned and another feature module should be split into two clusters. To disable splitting, set to -1. Default 10.

    splitOnLast

    Integer. After `stopIter` iterations have been performed without improvement, a heuristic will be applied to determine if a cell population should be reassigned and another cell population should be split into two clusters. If a split occurs, then `stopIter` will be reset. Default TRUE.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    nchains

    Integer. Number of random cluster initializations. Default 3.

    yInitialize

    Chararacter. One of 'random', 'split', or 'predefined'. With 'random', features are randomly assigned to a modules. With 'split', features will be split into sqrt(L) modules and then each module will be subsequently split into another sqrt(L) modules. With 'predefined', values in `yInit` will be used to initialize `y`. Default 'split'.

    countChecksum

    Character. An MD5 checksum for the `counts` matrix. Default NULL.

    yInit

    Integer vector. Sets initial starting values of y. `yInit` can only be used when `yInitialize = 'predefined'`. Default NULL.

    logfile

    Character. Messages will be redirected to a file named logfile. If NULL, messages will be printed to stdout. Default NULL.

    verbose

    Logical. Whether to print log messages. Default TRUE.

    Value

    A SingleCellExperiment object. Function parameter settings are stored in the metadata

    "celda_parameters" slot. Column celda_feature_module in

    rowData contains feature modules.

    See also

    celda_C for cell clustering and celda_CG for simultaneous clustering of features and cells. celdaGridSearch can be used to run multiple values of L and multiple chains in parallel.

    Examples

    data(celdaGSim)
    sce <- celda_G(celdaGSim$counts, L = celdaGSim$L, nchains = 1)
    #> --------------------------------------------------
    #> Starting Celda_G: Clustering genes.
    #> --------------------------------------------------
    #> Tue Apr  2 18:54:52 2024 .. Initializing 'y' in chain 1 with 'split' 
    #> Tue Apr  2 18:54:53 2024 .... Completed iteration: 1 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:53 2024 .... Completed iteration: 2 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:53 2024 .... Completed iteration: 3 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:53 2024 .... Completed iteration: 4 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:53 2024 .... Completed iteration: 5 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:53 2024 .... Completed iteration: 6 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:53 2024 .... Completed iteration: 7 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:53 2024 .... Completed iteration: 8 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:53 2024 .... Completed iteration: 9 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:53 2024 .... Determining if any gene clusters should be split.
    #> Tue Apr  2 18:54:54 2024 .... No additional splitting was performed.
    #> Tue Apr  2 18:54:54 2024 .... Completed iteration: 10 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:54 2024 .... Completed iteration: 11 | logLik: -290669.046132139
    #> Tue Apr  2 18:54:54 2024 .. Finished chain 1
    #> --------------------------------------------------
    #> Completed Celda_G. Total time: 2.01088 secs
    #> --------------------------------------------------
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/celdatosce.html ================================================ Convert old celda model object to SCE object — celdatosce • celda

    Convert a old celda model object (celda_C, celda_G, or celda_CG object) to a SingleCellExperiment object containing celda model information in metadata slot. Counts matrix is stored in the "counts" assay slot in assays.

    celdatosce(
      celdaModel,
      counts,
      useAssay = "counts",
      altExpName = "featureSubset"
    )
    
    # S4 method for celda_C
    celdatosce(
      celdaModel,
      counts,
      useAssay = "counts",
      altExpName = "featureSubset"
    )
    
    # S4 method for celda_G
    celdatosce(
      celdaModel,
      counts,
      useAssay = "counts",
      altExpName = "featureSubset"
    )
    
    # S4 method for celda_CG
    celdatosce(
      celdaModel,
      counts,
      useAssay = "counts",
      altExpName = "featureSubset"
    )
    
    # S4 method for celdaList
    celdatosce(
      celdaModel,
      counts,
      useAssay = "counts",
      altExpName = "featureSubset"
    )

    Arguments

    celdaModel

    A celdaModel or celdaList object generated using older versions of celda.

    counts

    A numeric matrix of counts used to generate celdaModel. Dimensions and MD5 checksum will be checked by compareCountMatrix.

    useAssay

    A string specifying the name of the assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    A SingleCellExperiment object. Function parameter settings are stored in the metadata

    "celda_parameters" slot. Columns celda_sample_label and celda_cell_cluster in

    colData contain sample labels and celda cell population clusters. Column celda_feature_module in

    rowData contain feature modules.

    Examples

    data(celdaCMod, celdaCSim)
    sce <- celdatosce(celdaCMod, celdaCSim$counts)
    data(celdaGMod, celdaGSim)
    sce <- celdatosce(celdaGMod, celdaGSim$counts)
    data(celdaCGMod, celdaCGSim)
    sce <- celdatosce(celdaCGMod, celdaCGSim$counts)
    data(celdaCGGridSearchRes, celdaCGSim)
    sce <- celdatosce(celdaCGGridSearchRes, celdaCGSim$counts)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/clusterProbability.html ================================================ Get the conditional probabilities of cell in subpopulations from celda model — clusterProbability • celda

    Calculate the conditional probability of each cell belonging to each subpopulation given all other cell cluster assignments and/or each feature belonging to each module given all other feature cluster assignments in a celda model.

    clusterProbability(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      log = FALSE
    )
    
    # S4 method for SingleCellExperiment
    clusterProbability(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      log = FALSE
    )

    Arguments

    sce

    A SingleCellExperiment object returned by celda_C, celda_G, or celda_CG, with the matrix located in the useAssay assay slot. Rows represent features and columns represent cells.

    useAssay

    A string specifying which assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    log

    Logical. If FALSE, then the normalized conditional probabilities will be returned. If TRUE, then the unnormalized log probabilities will be returned. Default FALSE.

    Value

    A list containging a matrix for the conditional cell subpopulation cluster and/or feature module probabilities.

    See also

    `celda_C()` for clustering cells

    Examples

    data(sceCeldaCG)
    clusterProb <- clusterProbability(sceCeldaCG, log = TRUE)
    data(sceCeldaC)
    clusterProb <- clusterProbability(sceCeldaC)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/compareCountMatrix.html ================================================ Check count matrix consistency — compareCountMatrix • celda

    Checks if the counts matrix is the same one used to generate the celda model object by comparing dimensions and MD5 checksum.

    compareCountMatrix(counts, celdaMod, errorOnMismatch = TRUE)
    
    # S4 method for ANY,celdaModel
    compareCountMatrix(counts, celdaMod, errorOnMismatch = TRUE)
    
    # S4 method for ANY,celdaList
    compareCountMatrix(counts, celdaMod, errorOnMismatch = TRUE)

    Arguments

    counts

    Integer , Numeric, or Sparse matrix. Rows represent features and columns represent cells.

    celdaMod

    A celdaModel or celdaList object.

    errorOnMismatch

    Logical. Whether to throw an error in the event of a mismatch. Default TRUE.

    Value

    Returns TRUE if provided count matrix matches the one used in the celda object and/or errorOnMismatch = FALSE, FALSE otherwise.

    Examples

    data(celdaCGSim, celdaCGMod)
    compareCountMatrix(celdaCGSim$counts, celdaCGMod, errorOnMismatch = FALSE)
    #> [1] TRUE
    data(celdaCGSim, celdaCGGridSearchRes)
    compareCountMatrix(celdaCGSim$counts, celdaCGGridSearchRes,
        errorOnMismatch = FALSE)
    #> [1] TRUE
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/contaminationSim.html ================================================ contaminationSim — contaminationSim • celda

    A toy contamination data generated by simulateContamination

    contaminationSim

    Format

    A list

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/countChecksum-celdaList-method.html ================================================ Get the MD5 hash of the count matrix from the celdaList — countChecksum,celdaList-method • celda

    Returns the MD5 hash of the count matrix used to generate the celdaList.

    # S4 method for celdaList
    countChecksum(celdaList)

    Arguments

    celdaList

    An object of class celdaList.

    Value

    A character string of length 32 containing the MD5 digest of the count matrix.

    Examples

    data(celdaCGGridSearchRes)
    countChecksum <- countChecksum(celdaCGGridSearchRes)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/countChecksum.html ================================================ Get the MD5 hash of the count matrix from the celdaList — countChecksum • celda

    Returns the MD5 hash of the count matrix used to generate the celdaList.

    countChecksum(celdaList)

    Arguments

    celdaList

    An object of class celdaList.

    Value

    A character string of length 32 containing the MD5 digest of the count matrix.

    Examples

    data(celdaCGGridSearchRes)
    countChecksum <- countChecksum(celdaCGGridSearchRes)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/decontX.html ================================================ Contamination estimation with decontX — decontX • celda

    Identifies contamination from factors such as ambient RNA in single cell genomic datasets.

    decontX(x, ...)
    
    # S4 method for SingleCellExperiment
    decontX(
      x,
      assayName = "counts",
      z = NULL,
      batch = NULL,
      background = NULL,
      bgAssayName = NULL,
      bgBatch = NULL,
      maxIter = 500,
      delta = c(10, 10),
      estimateDelta = TRUE,
      convergence = 0.001,
      iterLogLik = 10,
      varGenes = 5000,
      dbscanEps = 1,
      seed = 12345,
      logfile = NULL,
      verbose = TRUE
    )
    
    # S4 method for ANY
    decontX(
      x,
      z = NULL,
      batch = NULL,
      background = NULL,
      bgBatch = NULL,
      maxIter = 500,
      delta = c(10, 10),
      estimateDelta = TRUE,
      convergence = 0.001,
      iterLogLik = 10,
      varGenes = 5000,
      dbscanEps = 1,
      seed = 12345,
      logfile = NULL,
      verbose = TRUE
    )

    Arguments

    x

    A numeric matrix of counts or a SingleCellExperiment with the matrix located in the assay slot under assayName. Cells in each batch will be subsetted and converted to a sparse matrix of class dgCMatrix from package Matrix before analysis. This object should only contain filtered cells after cell calling. Empty cell barcodes (low expression droplets before cell calling) are not needed to run DecontX.

    ...

    For the generic, further arguments to pass to each method.

    assayName

    Character. Name of the assay to use if x is a SingleCellExperiment.

    z

    Numeric or character vector. Cell cluster labels. If NULL, PCA will be used to reduce the dimensionality of the dataset initially, 'umap' from the 'uwot' package will be used to further reduce the dataset to 2 dimenions and the 'dbscan' function from the 'dbscan' package will be used to identify clusters of broad cell types. Default NULL.

    batch

    Numeric or character vector. Batch labels for cells. If batch labels are supplied, DecontX is run on cells from each batch separately. Cells run in different channels or assays should be considered different batches. Default NULL.

    background

    A numeric matrix of counts or a SingleCellExperiment with the matrix located in the assay slot under assayName. It should have the same data format as x except it contains the empty droplets instead of cells. When supplied, empirical distribution of transcripts from these empty droplets will be used as the contamination distribution. Default NULL.

    bgAssayName

    Character. Name of the assay to use if background is a SingleCellExperiment. Default to same as assayName.

    bgBatch

    Numeric or character vector. Batch labels for background. Its unique values should be the same as those in batch, such that each batch of cells have their corresponding batch of empty droplets as background, pointed by this parameter. Default to NULL.

    maxIter

    Integer. Maximum iterations of the EM algorithm. Default 500.

    delta

    Numeric Vector of length 2. Concentration parameters for the Dirichlet prior for the contamination in each cell. The first element is the prior for the native counts while the second element is the prior for the contamination counts. These essentially act as pseudocounts for the native and contamination in each cell. If estimateDelta = TRUE, this is only used to produce a random sample of proportions for an initial value of contamination in each cell. Then fit_dirichlet is used to update delta in each iteration. If estimateDelta = FALSE, then delta is fixed with these values for the entire inference procedure. Fixing delta and setting a high number in the second element will force decontX to be more aggressive and estimate higher levels of contamination at the expense of potentially removing native expression. Default c(10, 10).

    estimateDelta

    Boolean. Whether to update delta at each iteration.

    convergence

    Numeric. The EM algorithm will be stopped if the maximum difference in the contamination estimates between the previous and current iterations is less than this. Default 0.001.

    iterLogLik

    Integer. Calculate log likelihood every iterLogLik iteration. Default 10.

    varGenes

    Integer. The number of variable genes to use in dimensionality reduction before clustering. Variability is calcualted using modelGeneVar function from the 'scran' package. Used only when z is not provided. Default 5000.

    dbscanEps

    Numeric. The clustering resolution parameter used in 'dbscan' to estimate broad cell clusters. Used only when z is not provided. Default 1.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    logfile

    Character. Messages will be redirected to a file named `logfile`. If NULL, messages will be printed to stdout. Default NULL.

    verbose

    Logical. Whether to print log messages. Default TRUE.

    Value

    If x is a matrix-like object, a list will be returned with the following items:

    decontXcounts:

    The decontaminated matrix. Values obtained from the variational inference procedure may be non-integer. However, integer counts can be obtained by rounding, e.g. round(decontXcounts).

    contamination:

    Percentage of contamination in each cell.

    estimates:

    List of estimated parameters for each batch. If z was not supplied, then the UMAP coordinates used to generated cell cluster labels will also be stored here.

    z:

    Cell population/cluster labels used for analysis.

    runParams:

    List of arguments used in the function call.

    If x is a SingleCellExperiment, then the decontaminated counts will be stored as an assay and can be accessed with decontXcounts(x). The contamination values and cluster labels will be stored in colData(x). estimates and runParams

    will be stored in metadata(x)$decontX. The UMAPs used to generated cell cluster labels will be stored in reducedDims slot in x.

    Author

    Shiyi Yang, Yuan Yin, Joshua Campbell

    Examples

    # Generate matrix with contamination
    s <- simulateContamination(seed = 12345)
    
    library(SingleCellExperiment)
    sce <- SingleCellExperiment(list(counts = s$observedCounts))
    sce <- decontX(sce)
    #> --------------------------------------------------
    #> Starting DecontX
    #> --------------------------------------------------
    #> Tue Apr  2 18:54:58 2024 .. Analyzing all cells
    #> Tue Apr  2 18:54:58 2024 .... Converting to sparse matrix
    #> Tue Apr  2 18:54:58 2024 .... Generating UMAP and estimating cell types
    #> Tue Apr  2 18:55:02 2024 .... Estimating contamination
    #> Tue Apr  2 18:55:02 2024 ...... Completed iteration: 9 | converge: 0.0009154
    #> Tue Apr  2 18:55:02 2024 .. Calculating final decontaminated matrix
    #> --------------------------------------------------
    #> Completed DecontX. Total time: 4.075659 secs
    #> --------------------------------------------------
    
    # Plot contamination on UMAP
    plotDecontXContamination(sce)
    
    
    # Plot decontX cluster labels
    umap <- reducedDim(sce)
    plotDimReduceCluster(x = sce$decontX_clusters,
        dim1 = umap[, 1], dim2 = umap[, 2], )
    
    
    # Plot percentage of marker genes detected
    # in each cell cluster before decontamination
    s$markers
    #> $CellType_1_Markers
    #> [1] "Gene_47" "Gene_32" "Gene_86"
    #> 
    #> $CellType_2_Markers
    #> [1] "Gene_70" "Gene_33" "Gene_48"
    #> 
    #> $CellType_3_Markers
    #> [1] "Gene_74" "Gene_26" "Gene_20"
    #> 
    plotDecontXMarkerPercentage(sce, markers = s$markers, assayName = "counts")
    
    
    # Plot percentage of marker genes detected
    # in each cell cluster after contamination
    plotDecontXMarkerPercentage(sce, markers = s$markers,
                                assayName = "decontXcounts")
    
    
    # Plot percentage of marker genes detected in each cell
    # comparing original and decontaminated counts side-by-side
    plotDecontXMarkerPercentage(sce, markers = s$markers,
                                assayName = c("counts", "decontXcounts"))
    
    
    # Plot raw counts of indiviual markers genes before
    # and after decontamination
    plotDecontXMarkerExpression(sce, unlist(s$markers))
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/decontXcounts.html ================================================ Get or set decontaminated counts matrix — decontXcounts • celda

    Gets or sets the decontaminated counts matrix from a a SingleCellExperiment object.

    decontXcounts(object, ...)
    
    decontXcounts(object, ...) <- value
    
    # S4 method for SingleCellExperiment
    decontXcounts(object, ...)
    
    # S4 method for SingleCellExperiment
    decontXcounts(object, ...) <- value

    Arguments

    object

    A SingleCellExperiment object.

    ...

    For the generic, further arguments to pass to each method.

    value

    A matrix to save as an assay called decontXcounts

    Value

    If getting, the assay from object with the name decontXcounts will be returned. If setting, a SingleCellExperiment object will be returned with decontXcounts listed in the assay slot.

    See also

    assay and assay<-

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/distinctColors.html ================================================ Create a color palette — distinctColors • celda

    Generate a palette of `n` distinct colors.

    distinctColors(
      n,
      hues = c("red", "cyan", "orange", "blue", "yellow", "purple", "green", "magenta"),
      saturationRange = c(0.7, 1),
      valueRange = c(0.7, 1)
    )

    Arguments

    n

    Integer. Number of colors to generate.

    hues

    Character vector. Colors available from `colors()`. These will be used as the base colors for the clustering scheme in HSV. Different saturations and values will be generated for each hue. Default c("red", "cyan", "orange", "blue", "yellow", "purple", "green", "magenta").

    saturationRange

    Numeric vector. A vector of length 2 denoting the saturation for HSV. Values must be in [0,1]. Default: c(0.25, 1).

    valueRange

    Numeric vector. A vector of length 2 denoting the range of values for HSV. Values must be in [0,1]. Default: `c(0.5, 1)`.

    Value

    A vector of distinct colors that have been converted to HEX from HSV.

    Examples

    colorPal <- distinctColors(6) # can be used in plotting functions
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/eigenMatMultInt.html ================================================ Fast matrix multiplication for double x int — eigenMatMultInt • celda

    Fast matrix multiplication for double x int

    eigenMatMultInt(A, B)

    Arguments

    A

    a double matrix

    B

    an integer matrix

    Value

    An integer matrix representing the product of A and B

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/eigenMatMultNumeric.html ================================================ Fast matrix multiplication for double x double — eigenMatMultNumeric • celda

    Fast matrix multiplication for double x double

    eigenMatMultNumeric(A, B)

    Arguments

    A

    a double matrix

    B

    an integer matrix

    Value

    An integer matrix representing the product of A and B

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/factorizeMatrix.html ================================================ Generate factorized matrices showing each feature's influence on cell / gene clustering — factorizeMatrix • celda

    Generates factorized matrices showing the contribution of each feature in each cell population or each cell population in each sample.

    factorizeMatrix(
      x,
      celdaMod,
      useAssay = "counts",
      altExpName = "featureSubset",
      type = c("counts", "proportion", "posterior")
    )
    
    # S4 method for SingleCellExperiment,ANY
    factorizeMatrix(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      type = c("counts", "proportion", "posterior")
    )
    
    # S4 method for ANY,celda_CG
    factorizeMatrix(x, celdaMod, type = c("counts", "proportion", "posterior"))
    
    # S4 method for ANY,celda_C
    factorizeMatrix(x, celdaMod, type = c("counts", "proportion", "posterior"))
    
    # S4 method for ANY,celda_G
    factorizeMatrix(x, celdaMod, type = c("counts", "proportion", "posterior"))

    Arguments

    x

    Can be one of

    • A SingleCellExperiment object returned by celda_C, celda_G or celda_CG, with the matrix located in the useAssay assay slot in altExp(x, altExpName). Rows represent features and columns represent cells.

    • Integer counts matrix. Rows represent features and columns represent cells. This matrix should be the same as the one used to generate celdaMod.

    celdaMod

    Celda model object. Only works if x is an integer counts matrix.

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    type

    Character vector. A vector containing one or more of "counts", "proportion", or "posterior". "counts" returns the raw number of counts for each factorized matrix. "proportions" returns the normalized probabilities for each factorized matrix, which are calculated by dividing the raw counts in each factorized matrix by the total counts in each column. "posterior" returns the posterior estimates which include the addition of the Dirichlet concentration parameter (essentially as a pseudocount). Default "counts".

    Value

    For celda_CG model, A list with elements for "counts", "proportions", or "posterior" probabilities. Each element will be a list containing factorized matrices for "module", "cellPopulation", and "sample". Additionally, the contribution of each module in each individual cell will be included in the "cell" element of "counts" and "proportions" elements.

    For celda_C model, a list with elements for "counts", "proportions", or "posterior" probabilities. Each element will be a list containing factorized matrices for "module" and "sample".

    For celda_G model, a list with elements for "counts", "proportions", or "posterior" probabilities. Each element will be a list containing factorized matrices for "module" and "cell".

    Examples

    data(sceCeldaCG)
    factorizedMatrices <- factorizeMatrix(sceCeldaCG, type = "posterior")
    data(celdaCGSim, celdaCGMod)
    factorizedMatrices <- factorizeMatrix(
      celdaCGSim$counts,
      celdaCGMod,
      "posterior")
    data(celdaCSim, celdaCMod)
    factorizedMatrices <- factorizeMatrix(
      celdaCSim$counts,
      celdaCMod, "posterior"
    )
    data(celdaGSim, celdaGMod)
    factorizedMatrices <- factorizeMatrix(
      celdaGSim$counts,
      celdaGMod, "posterior"
    )
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/fastNormProp.html ================================================ Fast normalization for numeric matrix — fastNormProp • celda

    Fast normalization for numeric matrix

    fastNormProp(R_counts, R_alpha)

    Arguments

    R_counts

    An integer matrix

    R_alpha

    A double value to be added to the matrix as a pseudocount

    Value

    A numeric matrix where the columns have been normalized to proportions

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/fastNormPropLog.html ================================================ Fast normalization for numeric matrix — fastNormPropLog • celda

    Fast normalization for numeric matrix

    fastNormPropLog(R_counts, R_alpha)

    Arguments

    R_counts

    An integer matrix

    R_alpha

    A double value to be added to the matrix as a pseudocount

    Value

    A numeric matrix where the columns have been normalized to proportions

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/fastNormPropSqrt.html ================================================ Fast normalization for numeric matrix — fastNormPropSqrt • celda

    Fast normalization for numeric matrix

    fastNormPropSqrt(R_counts, R_alpha)

    Arguments

    R_counts

    An integer matrix

    R_alpha

    A double value to be added to the matrix as a pseudocount

    Value

    A numeric matrix where the columns have been normalized to proportions

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/featureModuleLookup.html ================================================ Obtain the gene module of a gene of interest — featureModuleLookup • celda

    This function will output the corresponding feature module for a specified vector of genes from a celda_CG or celda_G celdaModel. features must match the rownames of sce.

    featureModuleLookup(
      sce,
      features,
      altExpName = "featureSubset",
      exactMatch = TRUE,
      by = "rownames"
    )
    
    # S4 method for SingleCellExperiment
    featureModuleLookup(
      sce,
      features,
      altExpName = "featureSubset",
      exactMatch = TRUE,
      by = "rownames"
    )

    Arguments

    sce

    A SingleCellExperiment object returned by celda_G, or celda_CG, with the matrix located in the useAssay assay slot. Rows represent features and columns represent cells.

    features

    Character vector. Identify feature modules for the specified feature names. feature must match the rownames of sce.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    exactMatch

    Logical. Whether to look for exactMatch of the gene name within counts matrix. Default TRUE.

    by

    Character. Where to search for features in the sce object. If set to "rownames" then the features will be searched for among rownames(sce). This can also be set to one of the colnames of rowData(sce). Default "rownames".

    Value

    Numeric vector containing the module numbers for each feature. If the feature was not found, then an NA value will be returned in that position. If no features were found, then an error will be given.

    Examples

    data(sceCeldaCG)
    module <- featureModuleLookup(sce = sceCeldaCG,
        features = c("Gene_1", "Gene_XXX"))
    #> Warning: The following features were not present in 'x': Gene_XXX
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/featureModuleTable.html ================================================ Output a feature module table — featureModuleTable • celda

    Creates a table that contains the list of features in each feature module.

    featureModuleTable(
      sce,
      useAssay = "counts",
      altExpName = "featureSubset",
      displayName = NULL,
      outputFile = NULL
    )

    Arguments

    sce

    A SingleCellExperiment object returned by celda_G, or celda_CG, with the matrix located in the useAssay assay slot. Rows represent features and columns represent cells.

    useAssay

    A string specifying which assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    displayName

    Character. The column name of rowData(sce) that specifies the display names for the features. Default NULL, which displays the row names.

    outputFile

    File name for feature module table. If NULL, file will not be created. Default NULL.

    Value

    Matrix. Contains a list of features per each column (feature module)

    Examples

    data(sceCeldaCG)
    featureModuleTable(sceCeldaCG)
    #>       L1        L2        L3        L4         L5        L6        L7       
    #>  [1,] "Gene_45" "Gene_77" "Gene_72" "Gene_87"  "Gene_41" "Gene_80" "Gene_24"
    #>  [2,] "Gene_5"  "Gene_17" "Gene_15" "Gene_6"   "Gene_70" "Gene_60" "Gene_53"
    #>  [3,] "Gene_94" "Gene_30" "Gene_22" "Gene_82"  "Gene_64" "Gene_55" "Gene_74"
    #>  [4,] "Gene_91" "Gene_9"  "Gene_52" "Gene_28"  "Gene_36" "Gene_40" "Gene_59"
    #>  [5,] "Gene_3"  ""        "Gene_78" "Gene_39"  "Gene_37" "Gene_25" "Gene_89"
    #>  [6,] "Gene_50" ""        "Gene_84" "Gene_10"  "Gene_14" "Gene_62" "Gene_11"
    #>  [7,] "Gene_90" ""        "Gene_81" "Gene_51"  "Gene_99" "Gene_97" "Gene_49"
    #>  [8,] ""        ""        ""        "Gene_100" "Gene_20" "Gene_95" "Gene_1" 
    #>  [9,] ""        ""        ""        "Gene_65"  "Gene_2"  "Gene_46" "Gene_79"
    #> [10,] ""        ""        ""        "Gene_33"  "Gene_44" "Gene_76" ""       
    #> [11,] ""        ""        ""        "Gene_38"  "Gene_66" ""        ""       
    #> [12,] ""        ""        ""        ""         "Gene_32" ""        ""       
    #> [13,] ""        ""        ""        ""         "Gene_34" ""        ""       
    #> [14,] ""        ""        ""        ""         "Gene_23" ""        ""       
    #> [15,] ""        ""        ""        ""         ""        ""        ""       
    #> [16,] ""        ""        ""        ""         ""        ""        ""       
    #>       L8        L9        L10      
    #>  [1,] "Gene_4"  "Gene_85" "Gene_69"
    #>  [2,] "Gene_43" "Gene_98" "Gene_47"
    #>  [3,] "Gene_42" "Gene_35" "Gene_13"
    #>  [4,] "Gene_29" "Gene_8"  "Gene_92"
    #>  [5,] "Gene_73" "Gene_83" "Gene_88"
    #>  [6,] "Gene_61" "Gene_19" "Gene_93"
    #>  [7,] ""        "Gene_63" "Gene_12"
    #>  [8,] ""        "Gene_75" "Gene_86"
    #>  [9,] ""        "Gene_21" "Gene_58"
    #> [10,] ""        "Gene_48" "Gene_18"
    #> [11,] ""        "Gene_31" "Gene_7" 
    #> [12,] ""        "Gene_57" "Gene_16"
    #> [13,] ""        "Gene_27" ""       
    #> [14,] ""        "Gene_67" ""       
    #> [15,] ""        "Gene_68" ""       
    #> [16,] ""        "Gene_54" ""       
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/geneSetEnrich.html ================================================ Gene set enrichment — geneSetEnrich • celda

    Identify and return significantly-enriched terms for each gene module in a Celda object or a SingleCellExperiment object. Performs gene set enrichment analysis for Celda identified modules using the enrichr.

    geneSetEnrich(
      x,
      celdaModel,
      useAssay = "counts",
      altExpName = "featureSubset",
      databases,
      fdr = 0.05
    )
    
    # S4 method for SingleCellExperiment
    geneSetEnrich(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      databases,
      fdr = 0.05
    )
    
    # S4 method for matrix
    geneSetEnrich(x, celdaModel, databases, fdr = 0.05)

    Arguments

    x

    A numeric matrix of counts or a SingleCellExperiment with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells. Rownames of the matrix or SingleCellExperiment object should be gene names.

    celdaModel

    Celda object of class celda_G or celda_CG.

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    databases

    Character vector. Name of reference database. Available databases can be viewed by listEnrichrDbs.

    fdr

    False discovery rate (FDR). Numeric. Cutoff value for adjusted p-value, terms with FDR below this value are considered significantly enriched.

    Value

    List of length 'L' where each member contains the significantly enriched terms for the corresponding module.

    Author

    Ahmed Youssef, Zhe Wang

    Examples

    library(M3DExampleData)
    counts <- M3DExampleData::Mmus_example_list$data
    # subset 500 genes for fast clustering
    counts <- counts[seq(1501, 2000), ]
    # cluster genes into 10 modules for quick demo
    sce <- celda_G(x = as.matrix(counts), L = 10, verbose = FALSE)
    gse <- geneSetEnrich(sce,
      databases = c("GO_Biological_Process_2018", "GO_Molecular_Function_2018"))
    #> Error in handle_url(handle, url, ...): Must specify at least one of url or handle
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/index.html ================================================ Function reference • celda

    Primary celda functions

    Functions for clustering of cells

    celda_CG()

    Cell and feature clustering with Celda

    celda_C()

    Cell clustering with Celda

    celda_G()

    Feature clustering with Celda

    reportCeldaCGRun() reportCeldaCGPlotResults()

    Generate an HTML report for celda_CG

    selectFeatures()

    Simple feature selection by feature counts

    splitModule()

    Split celda feature module

    Visualization functions for celda results

    Functions for displaying celda resuls on 2-D embeddings, heatmaps, and violin plots

    celdaUmap()

    Uniform Manifold Approximation and Projection (UMAP) dimension reduction for celda sce object

    celdaTsne()

    t-Distributed Stochastic Neighbor Embedding (t-SNE) dimension reduction for celda sce object

    moduleHeatmap()

    Heatmap for featureModules

    celdaProbabilityMap()

    Probability map for a celda model

    plotDimReduceCluster()

    Plotting the cell labels on a dimension reduction plot

    plotDimReduceFeature()

    Plotting feature expression on a dimension reduction plot

    plotDimReduceModule()

    Plotting Celda module probability on a dimension reduction plot

    plotDimReduceGrid()

    Mapping the dimension reduction plot

    plotCeldaViolin()

    Feature Expression Violin Plot

    celdaHeatmap()

    Plot celda Heatmap

    Primary decontX functions

    Functions for estimating and displaying contamination with decontX

    decontX()

    Contamination estimation with decontX

    plotDecontXContamination()

    Plots contamination on UMAP coordinates

    plotDecontXMarkerExpression()

    Plots expression of marker genes before and after decontamination

    plotDecontXMarkerPercentage()

    Plots percentage of cells cell types expressing markers

    decontXcounts() `decontXcounts<-`()

    Get or set decontaminated counts matrix

    Functions for determining the numbers of clusters in celda

    Functions for running and comparing multiple celda models with different number of modules or cell populations

    recursiveSplitCell()

    Recursive cell splitting

    recursiveSplitModule()

    Recursive module splitting

    plotRPC()

    Visualize perplexity differences of a list of celda models

    celdaGridSearch()

    Run Celda in parallel with multiple parameters

    plotGridSearchPerplexity()

    Visualize perplexity of a list of celda models

    perplexity()

    Calculate the perplexity of a celda model

    resamplePerplexity()

    Calculate and visualize perplexity of all models in a celdaList

    selectBestModel()

    Select best chain within each combination of parameters

    resList()

    Get final celdaModels from a celda model SCE or celdaList object

    subsetCeldaList()

    Subset celda model from SCE object returned from celdaGridSearch

    appendCeldaList()

    Append two celdaList objects

    celdaPerplexity()

    Get perplexity for every model in a celdaList

    Miscellaneous celda functions

    Various functions for manipulation of celda results

    celdaClusters() `celdaClusters<-`()

    Get or set the cell cluster labels from a celda SingleCellExperiment object or celda model object.

    celdaModules() `celdaModules<-`()

    Get or set the feature module labels from a celda SingleCellExperiment object.

    recodeClusterY()

    Recode feature module labels

    recodeClusterZ()

    Recode cell cluster labels

    reorderCelda()

    Reorder cells populations and/or features modules using hierarchical clustering

    featureModuleLookup()

    Obtain the gene module of a gene of interest

    featureModuleTable()

    Output a feature module table

    celda()

    Celda models

    params()

    Get parameter values provided for celdaModel creation

    runParams()

    Get run parameters from a celda model SingleCellExperiment or celdaList object

    factorizeMatrix()

    Generate factorized matrices showing each feature's influence on cell / gene clustering

    bestLogLikelihood()

    Get the log-likelihood

    clusterProbability()

    Get the conditional probabilities of cell in subpopulations from celda model

    geneSetEnrich()

    Gene set enrichment

    plotHeatmap()

    Plots heatmap based on Celda model

    retrieveFeatureIndex()

    Retrieve row index for a set of features

    normalizeCounts()

    Normalization of count data

    distinctColors()

    Create a color palette

    matrixNames()

    Get feature, cell and sample names from a celdaModel

    logLikelihood()

    Calculate the Log-likelihood of a celda model

    logLikelihoodHistory()

    Get log-likelihood history

    topRank()

    Identify features with the highest influence on clustering.

    sampleLabel() `sampleLabel<-`()

    Get or set sample labels from a celda SingleCellExperiment object

    Simulation functions

    Functions for generating data from the generative process of each model

    simulateCells()

    Simulate count data from the celda generative models.

    simulateContamination()

    Simulate contaminated count matrix

    Data objects

    Small data objects used in examples

    sceCeldaCG

    sceCeldaCG

    sceCeldaC

    sceCeldaC

    sceCeldaG

    sceCeldaG

    sceCeldaCGGridSearch

    sceCeldaCGGridSearch

    celdaCGGridSearchRes

    celdaCGGridSearchRes

    sampleCells

    sampleCells

    contaminationSim

    contaminationSim

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/logLikelihood.html ================================================ Calculate the Log-likelihood of a celda model — logLikelihood • celda

    Calculate the log-likelihood for cell population and feature module cluster assignments on the count matrix, per celda model.

    logLikelihood(x, celdaMod, useAssay = "counts", altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment,ANY
    logLikelihood(x, useAssay = "counts", altExpName = "featureSubset")
    
    # S4 method for matrix,celda_C
    logLikelihood(x, celdaMod)
    
    # S4 method for matrix,celda_G
    logLikelihood(x, celdaMod)
    
    # S4 method for matrix,celda_CG
    logLikelihood(x, celdaMod)

    Arguments

    x

    A SingleCellExperiment object returned by celda_C, celda_G, or celda_CG, with the matrix located in the useAssay assay slot. Rows represent features and columns represent cells.

    celdaMod

    celda model object. Ignored if x is a SingleCellExperiment object.

    useAssay

    A string specifying which assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    The log-likelihood of the cluster assignment for the provided SingleCellExperiment.

    See also

    `celda_C()` for clustering cells

    Examples

    data(sceCeldaC, sceCeldaCG)
    loglikC <- logLikelihood(sceCeldaC)
    loglikCG <- logLikelihood(sceCeldaCG)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/logLikelihoodHistory.html ================================================ Get log-likelihood history — logLikelihoodHistory • celda

    Retrieves the complete log-likelihood from all iterations of Gibbs sampling used to generate a celda model.

    logLikelihoodHistory(x, altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment
    logLikelihoodHistory(x, altExpName = "featureSubset")
    
    # S4 method for celdaModel
    logLikelihoodHistory(x)

    Arguments

    x

    A SingleCellExperiment object returned by celda_C, celda_G, or celda_CG, or a celda model object.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    Numeric. The log-likelihood at each step of Gibbs sampling used to generate the model.

    Examples

    data(sceCeldaCG)
    logLikelihoodHistory(sceCeldaCG)
    #>  [1] -1212891 -1212891 -1212891 -1212891 -1212891 -1212891 -1212891 -1212891
    #>  [9] -1212891 -1212891 -1212891 -1212891
    data(celdaCGMod)
    logLikelihoodHistory(celdaCGMod)
    #>  [1] -1215541 -1215541 -1215541 -1215541 -1215541 -1215541 -1215541 -1215541
    #>  [9] -1215541 -1215541 -1215541 -1215541
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/matrixNames.html ================================================ Get feature, cell and sample names from a celdaModel — matrixNames • celda

    Retrieves the row, column, and sample names used to generate a celdaModel.

    matrixNames(celdaMod)
    
    # S4 method for celdaModel
    matrixNames(celdaMod)

    Arguments

    celdaMod

    celdaModel. Options available in `celda::availableModels`.

    Value

    List. Contains row, column, and sample character vectors corresponding to the values provided when the celdaModel was generated.

    Examples

    data(celdaCGMod)
    matrixNames(celdaCGMod)
    #> $row
    #>   [1] "Gene_1"   "Gene_2"   "Gene_3"   "Gene_4"   "Gene_5"   "Gene_6"  
    #>   [7] "Gene_7"   "Gene_8"   "Gene_9"   "Gene_10"  "Gene_11"  "Gene_12" 
    #>  [13] "Gene_13"  "Gene_14"  "Gene_15"  "Gene_16"  "Gene_17"  "Gene_18" 
    #>  [19] "Gene_19"  "Gene_20"  "Gene_21"  "Gene_22"  "Gene_23"  "Gene_24" 
    #>  [25] "Gene_25"  "Gene_26"  "Gene_27"  "Gene_28"  "Gene_29"  "Gene_30" 
    #>  [31] "Gene_31"  "Gene_32"  "Gene_33"  "Gene_34"  "Gene_35"  "Gene_36" 
    #>  [37] "Gene_37"  "Gene_38"  "Gene_39"  "Gene_40"  "Gene_41"  "Gene_42" 
    #>  [43] "Gene_43"  "Gene_44"  "Gene_45"  "Gene_46"  "Gene_47"  "Gene_48" 
    #>  [49] "Gene_49"  "Gene_50"  "Gene_51"  "Gene_52"  "Gene_53"  "Gene_54" 
    #>  [55] "Gene_55"  "Gene_56"  "Gene_57"  "Gene_58"  "Gene_59"  "Gene_60" 
    #>  [61] "Gene_61"  "Gene_62"  "Gene_63"  "Gene_64"  "Gene_65"  "Gene_66" 
    #>  [67] "Gene_67"  "Gene_68"  "Gene_69"  "Gene_70"  "Gene_71"  "Gene_72" 
    #>  [73] "Gene_73"  "Gene_74"  "Gene_75"  "Gene_76"  "Gene_77"  "Gene_78" 
    #>  [79] "Gene_79"  "Gene_80"  "Gene_81"  "Gene_82"  "Gene_83"  "Gene_84" 
    #>  [85] "Gene_85"  "Gene_86"  "Gene_87"  "Gene_88"  "Gene_89"  "Gene_90" 
    #>  [91] "Gene_91"  "Gene_92"  "Gene_93"  "Gene_94"  "Gene_95"  "Gene_96" 
    #>  [97] "Gene_97"  "Gene_98"  "Gene_99"  "Gene_100"
    #> 
    #> $column
    #>   [1] "Cell_1"   "Cell_2"   "Cell_3"   "Cell_4"   "Cell_5"   "Cell_6"  
    #>   [7] "Cell_7"   "Cell_8"   "Cell_9"   "Cell_10"  "Cell_11"  "Cell_12" 
    #>  [13] "Cell_13"  "Cell_14"  "Cell_15"  "Cell_16"  "Cell_17"  "Cell_18" 
    #>  [19] "Cell_19"  "Cell_20"  "Cell_21"  "Cell_22"  "Cell_23"  "Cell_24" 
    #>  [25] "Cell_25"  "Cell_26"  "Cell_27"  "Cell_28"  "Cell_29"  "Cell_30" 
    #>  [31] "Cell_31"  "Cell_32"  "Cell_33"  "Cell_34"  "Cell_35"  "Cell_36" 
    #>  [37] "Cell_37"  "Cell_38"  "Cell_39"  "Cell_40"  "Cell_41"  "Cell_42" 
    #>  [43] "Cell_43"  "Cell_44"  "Cell_45"  "Cell_46"  "Cell_47"  "Cell_48" 
    #>  [49] "Cell_49"  "Cell_50"  "Cell_51"  "Cell_52"  "Cell_53"  "Cell_54" 
    #>  [55] "Cell_55"  "Cell_56"  "Cell_57"  "Cell_58"  "Cell_59"  "Cell_60" 
    #>  [61] "Cell_61"  "Cell_62"  "Cell_63"  "Cell_64"  "Cell_65"  "Cell_66" 
    #>  [67] "Cell_67"  "Cell_68"  "Cell_69"  "Cell_70"  "Cell_71"  "Cell_72" 
    #>  [73] "Cell_73"  "Cell_74"  "Cell_75"  "Cell_76"  "Cell_77"  "Cell_78" 
    #>  [79] "Cell_79"  "Cell_80"  "Cell_81"  "Cell_82"  "Cell_83"  "Cell_84" 
    #>  [85] "Cell_85"  "Cell_86"  "Cell_87"  "Cell_88"  "Cell_89"  "Cell_90" 
    #>  [91] "Cell_91"  "Cell_92"  "Cell_93"  "Cell_94"  "Cell_95"  "Cell_96" 
    #>  [97] "Cell_97"  "Cell_98"  "Cell_99"  "Cell_100" "Cell_101" "Cell_102"
    #> [103] "Cell_103" "Cell_104" "Cell_105" "Cell_106" "Cell_107" "Cell_108"
    #> [109] "Cell_109" "Cell_110" "Cell_111" "Cell_112" "Cell_113" "Cell_114"
    #> [115] "Cell_115" "Cell_116" "Cell_117" "Cell_118" "Cell_119" "Cell_120"
    #> [121] "Cell_121" "Cell_122" "Cell_123" "Cell_124" "Cell_125" "Cell_126"
    #> [127] "Cell_127" "Cell_128" "Cell_129" "Cell_130" "Cell_131" "Cell_132"
    #> [133] "Cell_133" "Cell_134" "Cell_135" "Cell_136" "Cell_137" "Cell_138"
    #> [139] "Cell_139" "Cell_140" "Cell_141" "Cell_142" "Cell_143" "Cell_144"
    #> [145] "Cell_145" "Cell_146" "Cell_147" "Cell_148" "Cell_149" "Cell_150"
    #> [151] "Cell_151" "Cell_152" "Cell_153" "Cell_154" "Cell_155" "Cell_156"
    #> [157] "Cell_157" "Cell_158" "Cell_159" "Cell_160" "Cell_161" "Cell_162"
    #> [163] "Cell_163" "Cell_164" "Cell_165" "Cell_166" "Cell_167" "Cell_168"
    #> [169] "Cell_169" "Cell_170" "Cell_171" "Cell_172" "Cell_173" "Cell_174"
    #> [175] "Cell_175" "Cell_176" "Cell_177" "Cell_178" "Cell_179" "Cell_180"
    #> [181] "Cell_181" "Cell_182" "Cell_183" "Cell_184" "Cell_185" "Cell_186"
    #> [187] "Cell_187" "Cell_188" "Cell_189" "Cell_190" "Cell_191" "Cell_192"
    #> [193] "Cell_193" "Cell_194" "Cell_195" "Cell_196" "Cell_197" "Cell_198"
    #> [199] "Cell_199" "Cell_200" "Cell_201" "Cell_202" "Cell_203" "Cell_204"
    #> [205] "Cell_205" "Cell_206" "Cell_207" "Cell_208" "Cell_209" "Cell_210"
    #> [211] "Cell_211" "Cell_212" "Cell_213" "Cell_214" "Cell_215" "Cell_216"
    #> [217] "Cell_217" "Cell_218" "Cell_219" "Cell_220" "Cell_221" "Cell_222"
    #> [223] "Cell_223" "Cell_224" "Cell_225" "Cell_226" "Cell_227" "Cell_228"
    #> [229] "Cell_229" "Cell_230" "Cell_231" "Cell_232" "Cell_233" "Cell_234"
    #> [235] "Cell_235" "Cell_236" "Cell_237" "Cell_238" "Cell_239" "Cell_240"
    #> [241] "Cell_241" "Cell_242" "Cell_243" "Cell_244" "Cell_245" "Cell_246"
    #> [247] "Cell_247" "Cell_248" "Cell_249" "Cell_250" "Cell_251" "Cell_252"
    #> [253] "Cell_253" "Cell_254" "Cell_255" "Cell_256" "Cell_257" "Cell_258"
    #> [259] "Cell_259" "Cell_260" "Cell_261" "Cell_262" "Cell_263" "Cell_264"
    #> [265] "Cell_265" "Cell_266" "Cell_267" "Cell_268" "Cell_269" "Cell_270"
    #> [271] "Cell_271" "Cell_272" "Cell_273" "Cell_274" "Cell_275" "Cell_276"
    #> [277] "Cell_277" "Cell_278" "Cell_279" "Cell_280" "Cell_281" "Cell_282"
    #> [283] "Cell_283" "Cell_284" "Cell_285" "Cell_286" "Cell_287" "Cell_288"
    #> [289] "Cell_289" "Cell_290" "Cell_291" "Cell_292" "Cell_293" "Cell_294"
    #> [295] "Cell_295" "Cell_296" "Cell_297" "Cell_298" "Cell_299" "Cell_300"
    #> [301] "Cell_301" "Cell_302" "Cell_303" "Cell_304" "Cell_305" "Cell_306"
    #> [307] "Cell_307" "Cell_308" "Cell_309" "Cell_310" "Cell_311" "Cell_312"
    #> [313] "Cell_313" "Cell_314" "Cell_315" "Cell_316" "Cell_317" "Cell_318"
    #> [319] "Cell_319" "Cell_320" "Cell_321" "Cell_322" "Cell_323" "Cell_324"
    #> [325] "Cell_325" "Cell_326" "Cell_327" "Cell_328" "Cell_329" "Cell_330"
    #> [331] "Cell_331" "Cell_332" "Cell_333" "Cell_334" "Cell_335" "Cell_336"
    #> [337] "Cell_337" "Cell_338" "Cell_339" "Cell_340" "Cell_341" "Cell_342"
    #> [343] "Cell_343" "Cell_344" "Cell_345" "Cell_346" "Cell_347" "Cell_348"
    #> [349] "Cell_349" "Cell_350" "Cell_351" "Cell_352" "Cell_353" "Cell_354"
    #> [355] "Cell_355" "Cell_356" "Cell_357" "Cell_358" "Cell_359" "Cell_360"
    #> [361] "Cell_361" "Cell_362" "Cell_363" "Cell_364" "Cell_365" "Cell_366"
    #> [367] "Cell_367" "Cell_368" "Cell_369" "Cell_370" "Cell_371" "Cell_372"
    #> [373] "Cell_373" "Cell_374" "Cell_375" "Cell_376" "Cell_377" "Cell_378"
    #> [379] "Cell_379" "Cell_380" "Cell_381" "Cell_382" "Cell_383" "Cell_384"
    #> [385] "Cell_385" "Cell_386" "Cell_387" "Cell_388" "Cell_389" "Cell_390"
    #> [391] "Cell_391" "Cell_392" "Cell_393" "Cell_394" "Cell_395" "Cell_396"
    #> [397] "Cell_397" "Cell_398" "Cell_399" "Cell_400" "Cell_401" "Cell_402"
    #> [403] "Cell_403" "Cell_404" "Cell_405" "Cell_406" "Cell_407" "Cell_408"
    #> [409] "Cell_409" "Cell_410" "Cell_411" "Cell_412" "Cell_413" "Cell_414"
    #> [415] "Cell_415" "Cell_416" "Cell_417" "Cell_418" "Cell_419" "Cell_420"
    #> [421] "Cell_421" "Cell_422" "Cell_423" "Cell_424" "Cell_425"
    #> 
    #> $sample
    #> [1] "Sample_1" "Sample_2" "Sample_3" "Sample_4" "Sample_5"
    #> 
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/moduleHeatmap.html ================================================ Heatmap for featureModules — moduleHeatmap • celda

    Renders a heatmap for selected featureModule. Cells are ordered from those with the lowest probability of the module on the left to the highest probability on the right. Features are ordered from those with the highest probability in the module on the top to the lowest probability on the bottom.

    moduleHeatmap(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      modules = NULL,
      featureModule = NULL,
      col = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")),
      topCells = 100,
      topFeatures = NULL,
      normalizedCounts = NA,
      normalize = "proportion",
      transformationFun = sqrt,
      scaleRow = scale,
      showFeatureNames = TRUE,
      displayName = NULL,
      trim = c(-2, 2),
      rowFontSize = NULL,
      showHeatmapLegend = FALSE,
      showTopAnnotationLegend = FALSE,
      showTopAnnotationName = FALSE,
      topAnnotationHeight = 5,
      showModuleLabel = TRUE,
      moduleLabel = "auto",
      moduleLabelSize = NULL,
      byrow = TRUE,
      top = NA,
      unit = "mm",
      ncol = NULL,
      useRaster = TRUE,
      returnAsList = FALSE,
      ...
    )
    
    # S4 method for SingleCellExperiment
    moduleHeatmap(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      modules = NULL,
      featureModule = NULL,
      col = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")),
      topCells = 100,
      topFeatures = NULL,
      normalizedCounts = NA,
      normalize = "proportion",
      transformationFun = sqrt,
      scaleRow = scale,
      showFeatureNames = TRUE,
      displayName = NULL,
      trim = c(-2, 2),
      rowFontSize = NULL,
      showHeatmapLegend = FALSE,
      showTopAnnotationLegend = FALSE,
      showTopAnnotationName = FALSE,
      topAnnotationHeight = 5,
      showModuleLabel = TRUE,
      moduleLabel = "auto",
      moduleLabelSize = NULL,
      byrow = TRUE,
      top = NA,
      unit = "mm",
      ncol = NULL,
      useRaster = TRUE,
      returnAsList = FALSE,
      ...
    )

    Arguments

    x

    A numeric matrix of counts or a SingleCellExperiment with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells. Celda results must be present under metadata(altExp(x, altExpName)).

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    modules

    Integer Vector. The featureModule(s) to display. Multiple modules can be included in a vector. Default NULL which plots all module heatmaps.

    featureModule

    Same as modules. Either can be used to specify the modules to display.

    col

    Passed to Heatmap. Set color boundaries and colors.

    topCells

    Integer. Number of cells with the highest and lowest probabilities for each module to include in the heatmap. For example, if topCells = 50, the 50 cells with the lowest probabilities and the 50 cells with the highest probabilities for each featureModule will be included. If NULL, all cells will be plotted. Default 100.

    topFeatures

    Integer. Plot `topFeatures` features with the highest probabilities in the module heatmap for each featureModule. If NULL, plot all features in the module. Default NULL.

    normalizedCounts

    Integer matrix. Rows represent features and columns represent cells. If you have a normalized matrix result from normalizeCounts, you can pass through the result here to skip the normalization step in this function. Make sure the colnames and rownames match the object in x. This matrix should correspond to one generated from this count matrix assay(altExp(x, altExpName), i = useAssay). If NA, normalization will be carried out in the following form normalizeCounts(assay(altExp(x, altExpName), i = useAssay), normalize = "proportion", transformationFun = sqrt). Use of this parameter is particularly useful for plotting many module heatmaps, where normalizing the counts matrix repeatedly would be too time consuming. Default NA.

    normalize

    Character. Passed to normalizeCounts if normalizedCounts is NA. Divides counts by the library sizes for each cell. One of 'proportion', 'cpm', 'median', or 'mean'. 'proportion' uses the total counts for each cell as the library size. 'cpm' divides the library size of each cell by one million to produce counts per million. 'median' divides the library size of each cell by the median library size across all cells. 'mean' divides the library size of each cell by the mean library size across all cells. Default "proportion".

    transformationFun

    Function. Passed to normalizeCounts if normalizedCounts is NA. Applies a transformation such as sqrt, log, log2, log10, or log1p. If NULL, no transformation will be applied. Occurs after normalization. Default sqrt.

    scaleRow

    Function. Which function to use to scale each individual row. Set to NULL to disable. Occurs after normalization and log transformation. For example, scale will Z-score transform each row. Default scale.

    showFeatureNames

    Logical. Whether feature names should be displayed. Default TRUE.

    displayName

    Character. The column name of rowData(altExp(x, altExpName)) that specifies the display names for the features. Default NULL, which displays the row names. Only works if showFeaturenames is TRUE and x is a SingleCellExperiment object.

    trim

    Numeric vector. Vector of length two that specifies the lower and upper bounds for plotting the data. This threshold is applied after row scaling. Set to NULL to disable. Default c(-2,2).

    rowFontSize

    Numeric. Font size for feature names. If NULL, then the size will automatically be determined. Default NULL.

    showHeatmapLegend

    Passed to Heatmap. Show legend for expression levels.

    showTopAnnotationLegend

    Passed to HeatmapAnnotation. Show legend for cell annotation.

    showTopAnnotationName

    Passed to HeatmapAnnotation. Show heatmap top annotation name.

    topAnnotationHeight

    Passed to HeatmapAnnotation. Column annotation height. rowAnnotation. Show legend for module annotation.

    showModuleLabel

    Show left side module labels.

    moduleLabel

    The left side row titles for module heatmap. Must be vector of the same length as featureModule. Default "auto", which automatically pulls module labels from x.

    moduleLabelSize

    Passed to gpar. The size of text (in points).

    byrow

    Passed to matrix. logical. If FALSE (the default) the figure panel is filled by columns, otherwise the figure panel is filled by rows.

    top

    Passed to marrangeGrob. The title for each page.

    unit

    Passed to unit. Single character object defining the unit of all dimensions defined.

    ncol

    Integer. Number of columns of module heatmaps. If NULL, then this will be automatically calculated so that the number of columns and rows will be approximately the same. Default NULL.

    useRaster

    Boolean. Rasterizing will make the heatmap a single object and reduced the memory of the plot and the size of a file. If NULL, then rasterization will be automatically determined by the underlying Heatmap function. Default TRUE.

    returnAsList

    Boolean. If TRUE, then a list of plots will be returned instead of a single multi-panel figure. These plots can be displayed using the grid.draw function. Default FALSE.

    ...

    Additional parameters passed to Heatmap.

    Value

    A list object if plotting more than one module heatmaps. Otherwise a

    HeatmapList object is returned.

    Examples

    data(sceCeldaCG)
    moduleHeatmap(sceCeldaCG, displayName = "rownames")
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/nonzero.html ================================================ get row and column indices of none zero elements in the matrix — nonzero • celda

    get row and column indices of none zero elements in the matrix

    nonzero(R_counts)

    Arguments

    R_counts

    A matrix

    Value

    An integer matrix where each row is a row, column indices pair

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/normalizeCounts.html ================================================ Normalization of count data — normalizeCounts • celda

    Performs normalization, transformation, and/or scaling of a counts matrix

    normalizeCounts(
      counts,
      normalize = c("proportion", "cpm", "median", "mean"),
      scaleFactor = NULL,
      transformationFun = NULL,
      scaleFun = NULL,
      pseudocountNormalize = 0,
      pseudocountTransform = 0
    )

    Arguments

    counts

    Integer, Numeric or Sparse matrix. Rows represent features and columns represent cells.

    normalize

    Character. Divides counts by the library sizes for each cell. One of 'proportion', 'cpm', 'median', or 'mean'. 'proportion' uses the total counts for each cell as the library size. 'cpm' divides the library size of each cell by one million to produce counts per million. 'median' divides the library size of each cell by the median library size across all cells. 'mean' divides the library size of each cell by the mean library size across all cells.

    scaleFactor

    Numeric. Sets the scale factor for cell-level normalization. This scale factor is multiplied to each cell after the library size of each cell had been adjusted in normalize. Default NULL which means no scale factor is applied.

    transformationFun

    Function. Applys a transformation such as sqrt, log, log2, log10, or log1p. If NULL, no transformation will be applied. Occurs after normalization. Default NULL.

    scaleFun

    Function. Scales the rows of the normalized and transformed count matrix. For example, 'scale' can be used to z-score normalize the rows. Default NULL.

    pseudocountNormalize

    Numeric. Add a pseudocount to counts before normalization. Default 0.

    pseudocountTransform

    Numeric. Add a pseudocount to normalized counts before applying the transformation function. Adding a pseudocount can be useful before applying a log transformation. Default 0.

    Value

    Numeric Matrix. A normalized matrix.

    Examples

    data(celdaCGSim)
    normalizedCounts <- normalizeCounts(celdaCGSim$counts, "proportion",
      pseudocountNormalize = 1)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/params.html ================================================ Get parameter values provided for celdaModel creation — params • celda

    Retrieves the K/L, model priors (e.g. alpha, beta), and count matrix checksum parameters provided during the creation of the provided celdaModel.

    params(celdaMod)
    
    # S4 method for celdaModel
    params(celdaMod)

    Arguments

    celdaMod

    celdaModel. Options available in celda::availableModels.

    Value

    List. Contains the model-specific parameters for the provided celda model object depending on its class.

    Examples

    data(celdaCGMod)
    params(celdaCGMod)
    #> $K
    #> [1] 5
    #> 
    #> $L
    #> [1] 10
    #> 
    #> $alpha
    #> [1] 1
    #> 
    #> $beta
    #> [1] 1
    #> 
    #> $delta
    #> [1] 1
    #> 
    #> $gamma
    #> [1] 1
    #> 
    #> $seed
    #> [1] 12345
    #> 
    #> $countChecksum
    #> [1] "b47286aed8081daa674f796655314d67"
    #> 
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/perplexity.html ================================================ Calculate the perplexity of a celda model — perplexity • celda

    Perplexity is a statistical measure of how well a probability model can predict new data. Lower perplexity indicates a better model.

    perplexity(
      x,
      celdaMod,
      useAssay = "counts",
      altExpName = "featureSubset",
      newCounts = NULL
    )
    
    # S4 method for SingleCellExperiment,ANY
    perplexity(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      newCounts = NULL
    )
    
    # S4 method for ANY,celda_CG
    perplexity(x, celdaMod, newCounts = NULL)
    
    # S4 method for ANY,celda_C
    perplexity(x, celdaMod, newCounts = NULL)
    
    # S4 method for ANY,celda_G
    perplexity(x, celdaMod, newCounts = NULL)

    Arguments

    x

    Can be one of

    • A SingleCellExperiment object returned by celda_C, celda_G or celda_CG, with the matrix located in the useAssay assay slot. Rows represent features and columns represent cells.

    • Integer counts matrix. Rows represent features and columns represent cells. This matrix should be the same as the one used to generate celdaMod.

    celdaMod

    Celda model object. Only works if x is an integer counts matrix.

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    newCounts

    A new counts matrix used to calculate perplexity. If NULL, perplexity will be calculated for the matrix in useAssay slot in x. Default NULL.

    Value

    Numeric. The perplexity for the provided x (and

    celdaModel).

    Examples

    data(sceCeldaCG)
    perplexity <- perplexity(sceCeldaCG)
    data(celdaCGSim, celdaCGMod)
    perplexity <- perplexity(celdaCGSim$counts, celdaCGMod)
    data(celdaCSim, celdaCMod)
    perplexity <- perplexity(celdaCSim$counts, celdaCMod)
    data(celdaGSim, celdaGMod)
    perplexity <- perplexity(celdaGSim$counts, celdaGMod)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotCeldaViolin.html ================================================ Feature Expression Violin Plot — plotCeldaViolin • celda

    Outputs a violin plot for feature expression data.

    plotCeldaViolin(
      x,
      celdaMod,
      features,
      displayName = NULL,
      useAssay = "counts",
      altExpName = "featureSubset",
      exactMatch = TRUE,
      plotDots = TRUE,
      dotSize = 0.1
    )
    
    # S4 method for SingleCellExperiment
    plotCeldaViolin(
      x,
      features,
      displayName = NULL,
      useAssay = "counts",
      altExpName = "featureSubset",
      exactMatch = TRUE,
      plotDots = TRUE,
      dotSize = 0.1
    )
    
    # S4 method for ANY
    plotCeldaViolin(
      x,
      celdaMod,
      features,
      exactMatch = TRUE,
      plotDots = TRUE,
      dotSize = 0.1
    )

    Arguments

    x

    Numeric matrix or a SingleCellExperiment object with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells.

    celdaMod

    Celda object of class "celda_G" or "celda_CG". Used only if x is a matrix object.

    features

    Character vector. Uses these genes for plotting.

    displayName

    Character. The column name of rowData(x) that specifies the display names for the features. Default NULL, which displays the row names. Only works if x is a SingleCellExperiment object.

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    exactMatch

    Logical. Whether an exact match or a partial match using grep() is used to look up the feature in the rownames of the counts matrix. Default TRUE.

    plotDots

    Boolean. If TRUE, the expression of features will be plotted as points in addition to the violin curve. Default TRUE.

    dotSize

    Numeric. Size of points if plotDots = TRUE. Default 0.1.

    Value

    Violin plot for each feature, grouped by celda cluster

    Examples

    data(sceCeldaCG)
    plotCeldaViolin(x = sceCeldaCG, features = "Gene_1")
    
    data(celdaCGSim, celdaCGMod)
    plotCeldaViolin(x = celdaCGSim$counts,
       celdaMod = celdaCGMod,
       features = "Gene_1")
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotDecontXContamination.html ================================================ Plots contamination on UMAP coordinates — plotDecontXContamination • celda

    A scatter plot of the UMAP dimensions generated by DecontX with cells colored by the estimated percentation of contamation.

    plotDecontXContamination(
      x,
      batch = NULL,
      colorScale = c("blue", "green", "yellow", "orange", "red"),
      size = 1
    )

    Arguments

    x

    Either a SingleCellExperiment with decontX results stored in metadata(x)$decontX or the result from running decontX on a count matrix.

    batch

    Character. Batch of cells to plot. If NULL, then the first batch in the list will be selected. Default NULL.

    colorScale

    Character vector. Contains the color spectrum to be passed to scale_colour_gradientn from package 'ggplot2'. Default c("blue","green","yellow","orange","red").

    size

    Numeric. Size of points in the scatterplot. Default 1.

    Value

    Returns a ggplot object.

    See also

    See decontX for a full example of how to estimate and plot contamination.

    Author

    Shiyi Yang, Joshua Campbell

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotDecontXMarkerExpression.html ================================================ Plots expression of marker genes before and after decontamination — plotDecontXMarkerExpression • celda

    Generates a violin plot that shows the counts of marker genes in cells across specific clusters or cell types. Can be used to view the expression of marker genes in different cell types before and after decontamination with decontX.

    plotDecontXMarkerExpression(
      x,
      markers,
      groupClusters = NULL,
      assayName = c("counts", "decontXcounts"),
      z = NULL,
      exactMatch = TRUE,
      by = "rownames",
      log1p = FALSE,
      ncol = NULL,
      plotDots = FALSE,
      dotSize = 0.1
    )

    Arguments

    x

    Either a SingleCellExperiment or a matrix-like object of counts.

    markers

    Character Vector or List. A character vector or list of character vectors with the names of the marker genes of interest.

    groupClusters

    List. A named list that allows cell clusters labels coded in z to be regrouped and renamed on the fly. For example, list(Tcells=c(1, 2), Bcells=7) would recode clusters 1 and 2 to "Tcells" and cluster 7 to "Bcells". Note that if this is used, clusters in z not found in groupClusters will be excluded. Default NULL.

    assayName

    Character vector. Name(s) of the assay(s) to plot if x is a SingleCellExperiment. If more than one assay is listed, then side-by-side violin plots will be generated. Default c("counts", "decontXcounts").

    z

    Character, Integer, or Vector. Indicates the cluster labels for each cell. If x is a SingleCellExperiment and z = NULL, then the cluster labels from decontX will be retreived from the colData of x (i.e. colData(x)$decontX_clusters). If z is a single character or integer, then that column will be retrived from colData of x. (i.e. colData(x)[,z]). If x is a counts matrix, then z will need to be a vector the same length as the number of columns in x that indicate the cluster to which each cell belongs. Default NULL.

    exactMatch

    Boolean. Whether to only identify exact matches for the markers or to identify partial matches using grep. See retrieveFeatureIndex for more details. Default TRUE.

    by

    Character. Where to search for the markers if x is a SingleCellExperiment. See retrieveFeatureIndex for more details. If x is a matrix, then this must be set to "rownames". Default "rownames".

    log1p

    Boolean. Whether to apply the function log1p to the data before plotting. This function will add a pseudocount of 1 and then log transform the expression values. Default FALSE.

    ncol

    Integer. Number of columns to make in the plot. Default NULL.

    plotDots

    Boolean. If TRUE, the expression of features will be plotted as points in addition to the violin curve. Default FALSE.

    dotSize

    Numeric. Size of points if plotDots = TRUE. Default 0.1.

    Value

    Returns a ggplot object.

    See also

    See decontX for a full example of how to estimate and plot contamination.

    Author

    Shiyi Yang, Joshua Campbell

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotDecontXMarkerPercentage.html ================================================ Plots percentage of cells cell types expressing markers — plotDecontXMarkerPercentage • celda

    Generates a barplot that shows the percentage of cells within clusters or cell types that have detectable levels of given marker genes. Can be used to view the expression of marker genes in different cell types before and after decontamination with decontX.

    plotDecontXMarkerPercentage(
      x,
      markers,
      groupClusters = NULL,
      assayName = c("counts", "decontXcounts"),
      z = NULL,
      threshold = 1,
      exactMatch = TRUE,
      by = "rownames",
      ncol = round(sqrt(length(markers))),
      labelBars = TRUE,
      labelSize = 3
    )

    Arguments

    x

    Either a SingleCellExperiment or a matrix-like object of counts.

    markers

    List. A named list indicating the marker genes for each cell type of interest. Multiple markers can be supplied for each cell type. For example, list(Tcell_Markers=c("CD3E", "CD3D"), Bcell_Markers=c("CD79A", "CD79B", "MS4A1") would specify markers for human T-cells and B-cells. A cell will be considered "positive" for a cell type if it has a count greater than threshold for at least one of the marker genes in the list.

    groupClusters

    List. A named list that allows cell clusters labels coded in z to be regrouped and renamed on the fly. For example, list(Tcells=c(1, 2), Bcells=7) would recode clusters 1 and 2 to "Tcells" and cluster 7 to "Bcells". Note that if this is used, clusters in z not found in groupClusters will be excluded from the barplot. Default NULL.

    assayName

    Character vector. Name(s) of the assay(s) to plot if x is a SingleCellExperiment. If more than one assay is listed, then side-by-side barplots will be generated. Default c("counts", "decontXcounts").

    z

    Character, Integer, or Vector. Indicates the cluster labels for each cell. If x is a SingleCellExperiment and z = NULL, then the cluster labels from decontX will be retived from the colData of x (i.e. colData(x)$decontX_clusters). If z is a single character or integer, then that column will be retrived from colData of x. (i.e. colData(x)[,z]). If x is a counts matrix, then z will need to be a vector the same length as the number of columns in x that indicate the cluster to which each cell belongs. Default NULL.

    threshold

    Numeric. Markers greater than or equal to this value will be considered detected in a cell. Default 1.

    exactMatch

    Boolean. Whether to only identify exact matches for the markers or to identify partial matches using grep. See retrieveFeatureIndex for more details. Default TRUE.

    by

    Character. Where to search for the markers if x is a SingleCellExperiment. See retrieveFeatureIndex for more details. If x is a matrix, then this must be set to "rownames".Default "rownames".

    ncol

    Integer. Number of columns to make in the plot. Default round(sqrt(length(markers)).

    labelBars

    Boolean. Whether to display percentages above each bar Default TRUE.

    labelSize

    Numeric. Size of the percentage labels in the barplot. Default 3.

    Value

    Returns a ggplot object.

    See also

    See decontX for a full example of how to estimate and plot contamination.

    Author

    Shiyi Yang, Joshua Campbell

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotDimReduceCluster.html ================================================ Plotting the cell labels on a dimension reduction plot — plotDimReduceCluster • celda

    Create a scatterplot for each row of a normalized gene expression matrix where x and y axis are from a data dimension reduction tool. The cells are colored by "celda_cell_cluster" column in colData(altExp(x, altExpName)) if x is a SingleCellExperiment object, or x if x is a integer vector of cell cluster labels.

    plotDimReduceCluster(
      x,
      reducedDimName,
      altExpName = "featureSubset",
      dim1 = NULL,
      dim2 = NULL,
      size = 0.5,
      xlab = NULL,
      ylab = NULL,
      specificClusters = NULL,
      labelClusters = FALSE,
      groupBy = NULL,
      labelSize = 3.5
    )
    
    # S4 method for SingleCellExperiment
    plotDimReduceCluster(
      x,
      reducedDimName,
      altExpName = "featureSubset",
      dim1 = 1,
      dim2 = 2,
      size = 0.5,
      xlab = NULL,
      ylab = NULL,
      specificClusters = NULL,
      labelClusters = FALSE,
      groupBy = NULL,
      labelSize = 3.5
    )
    
    # S4 method for vector
    plotDimReduceCluster(
      x,
      dim1,
      dim2,
      size = 0.5,
      xlab = "Dimension_1",
      ylab = "Dimension_2",
      specificClusters = NULL,
      labelClusters = FALSE,
      groupBy = NULL,
      labelSize = 3.5
    )

    Arguments

    x

    Integer vector of cell cluster labels or a SingleCellExperiment object containing cluster labels for each cell in "celda_cell_cluster" column in colData(x).

    reducedDimName

    The name of the dimension reduction slot in reducedDimNames(x) if x is a SingleCellExperiment object. Ignored if both dim1 and dim2 are set.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    dim1

    Integer or numeric vector. If reducedDimName is supplied, then, this will be used as an index to determine which dimension will be plotted on the x-axis. If reducedDimName is not supplied, then this should be a vector which will be plotted on the x-axis. Default 1.

    dim2

    Integer or numeric vector. If reducedDimName is supplied, then, this will be used as an index to determine which dimension will be plotted on the y-axis. If reducedDimName is not supplied, then this should be a vector which will be plotted on the y-axis. Default 2.

    size

    Numeric. Sets size of point on plot. Default 0.5.

    xlab

    Character vector. Label for the x-axis. Default NULL.

    ylab

    Character vector. Label for the y-axis. Default NULL.

    specificClusters

    Numeric vector. Only color cells in the specified clusters. All other cells will be grey. If NULL, all clusters will be colored. Default NULL.

    labelClusters

    Logical. Whether the cluster labels are plotted. Default FALSE.

    groupBy

    Character vector. Contains sample labels for each cell. If NULL, all samples will be plotted together. Default NULL.

    labelSize

    Numeric. Sets size of label if labelClusters is TRUE. Default 3.5.

    Value

    The plot as a ggplot object

    Examples

    data(sceCeldaCG)
    sce <- celdaTsne(sceCeldaCG)
    plotDimReduceCluster(x = sce,
      reducedDimName = "celda_tSNE",
      specificClusters = c(1, 2, 3))
    
    library(SingleCellExperiment)
    data(sceCeldaCG, celdaCGMod)
    sce <- celdaTsne(sceCeldaCG)
    plotDimReduceCluster(x = celdaClusters(celdaCGMod)$z,
      dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1],
      dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2],
      specificClusters = c(1, 2, 3))
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotDimReduceFeature.html ================================================ Plotting feature expression on a dimension reduction plot — plotDimReduceFeature • celda

    Create a scatterplot for each row of a normalized gene expression matrix where x and y axis are from a data dimension reduction tool. The cells are colored by expression of the specified feature.

    plotDimReduceFeature(
      x,
      features,
      reducedDimName = NULL,
      displayName = NULL,
      dim1 = NULL,
      dim2 = NULL,
      headers = NULL,
      useAssay = "counts",
      altExpName = "featureSubset",
      normalize = FALSE,
      zscore = TRUE,
      exactMatch = TRUE,
      trim = c(-2, 2),
      limits = c(-2, 2),
      size = 0.5,
      xlab = NULL,
      ylab = NULL,
      colorLow = "blue4",
      colorMid = "grey90",
      colorHigh = "firebrick1",
      midpoint = 0,
      ncol = NULL,
      decreasing = FALSE
    )
    
    # S4 method for SingleCellExperiment
    plotDimReduceFeature(
      x,
      features,
      reducedDimName = NULL,
      displayName = NULL,
      dim1 = 1,
      dim2 = 2,
      headers = NULL,
      useAssay = "counts",
      altExpName = "featureSubset",
      normalize = FALSE,
      zscore = TRUE,
      exactMatch = TRUE,
      trim = c(-2, 2),
      limits = c(-2, 2),
      size = 0.5,
      xlab = NULL,
      ylab = NULL,
      colorLow = "blue4",
      colorMid = "grey90",
      colorHigh = "firebrick1",
      midpoint = 0,
      ncol = NULL,
      decreasing = FALSE
    )
    
    # S4 method for ANY
    plotDimReduceFeature(
      x,
      features,
      dim1,
      dim2,
      headers = NULL,
      normalize = FALSE,
      zscore = TRUE,
      exactMatch = TRUE,
      trim = c(-2, 2),
      limits = c(-2, 2),
      size = 0.5,
      xlab = "Dimension_1",
      ylab = "Dimension_2",
      colorLow = "blue4",
      colorMid = "grey90",
      colorHigh = "firebrick1",
      midpoint = 0,
      ncol = NULL,
      decreasing = FALSE
    )

    Arguments

    x

    Numeric matrix or a SingleCellExperiment object with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells.

    features

    Character vector. Features in the rownames of counts to plot.

    reducedDimName

    The name of the dimension reduction slot in reducedDimNames(x) if x is a SingleCellExperiment object. If NULL, then both dim1 and dim2 need to be set. Default NULL.

    displayName

    Character. The column name of rowData(x) that specifies the display names for the features. Default NULL, which displays the row names. Only works if x is a SingleCellExperiment object. Overwrites headers.

    dim1

    Integer or numeric vector. If reducedDimName is supplied, then, this will be used as an index to determine which dimension will be plotted on the x-axis. If reducedDimName is not supplied, then this should be a vector which will be plotted on the x-axis. Default 1.

    dim2

    Integer or numeric vector. If reducedDimName is supplied, then, this will be used as an index to determine which dimension will be plotted on the y-axis. If reducedDimName is not supplied, then this should be a vector which will be plotted on the y-axis. Default 2.

    headers

    Character vector. If NULL, the corresponding rownames are used as labels. Otherwise, these headers are used to label the features. Only works if displayName is NULL and exactMatch is FALSE.

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    normalize

    Logical. Whether to normalize the columns of `counts`. Default FALSE.

    zscore

    Logical. Whether to scale each feature to have a mean 0 and standard deviation of 1. Default TRUE.

    exactMatch

    Logical. Whether an exact match or a partial match using grep() is used to look up the feature in the rownames of the counts matrix. Default TRUE.

    trim

    Numeric vector. Vector of length two that specifies the lower and upper bounds for the data. This threshold is applied after row scaling. Set to NULL to disable. Default c(-1,1).

    limits

    Passed to scale_colour_gradient2. The range of color scale.

    size

    Numeric. Sets size of point on plot. Default 1.

    xlab

    Character vector. Label for the x-axis. If reducedDimName is used, then this will be set to the column name of the first dimension of that object. Default "Dimension_1".

    ylab

    Character vector. Label for the y-axis. If reducedDimName is used, then this will be set to the column name of the second dimension of that object. Default "Dimension_2".

    colorLow

    Character. A color available from `colors()`. The color will be used to signify the lowest values on the scale.

    colorMid

    Character. A color available from `colors()`. The color will be used to signify the midpoint on the scale.

    colorHigh

    Character. A color available from `colors()`. The color will be used to signify the highest values on the scale.

    midpoint

    Numeric. The value indicating the midpoint of the diverging color scheme. If NULL, defaults to the mean with 10 percent of values trimmed. Default 0.

    ncol

    Integer. Passed to facet_wrap. Specify the number of columns for facet wrap.

    decreasing

    logical. Specifies the order of plotting the points. If FALSE, the points will be plotted in increasing order where the points with largest values will be on top. TRUE otherwise. If NULL, no sorting is performed. Points will be plotted in their current order in x. Default FALSE.

    Value

    The plot as a ggplot object

    Examples

    data(sceCeldaCG)
    sce <- celdaTsne(sceCeldaCG)
    plotDimReduceFeature(x = sce,
      reducedDimName = "celda_tSNE",
      normalize = TRUE,
      features = c("Gene_98", "Gene_99"),
      exactMatch = TRUE)
    
    library(SingleCellExperiment)
    data(sceCeldaCG)
    sce <- celdaTsne(sceCeldaCG)
    plotDimReduceFeature(x = counts(sce),
      dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1],
      dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2],
      normalize = TRUE,
      features = c("Gene_98", "Gene_99"),
      exactMatch = TRUE)
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotDimReduceGrid.html ================================================ Mapping the dimension reduction plot — plotDimReduceGrid • celda

    Creates a scatterplot given two dimensions from a data dimension reduction tool (e.g tSNE) output.

    plotDimReduceGrid(
      x,
      reducedDimName,
      dim1 = NULL,
      dim2 = NULL,
      useAssay = "counts",
      altExpName = "featureSubset",
      size = 1,
      xlab = "Dimension_1",
      ylab = "Dimension_2",
      limits = c(-2, 2),
      colorLow = "blue4",
      colorMid = "grey90",
      colorHigh = "firebrick1",
      midpoint = 0,
      varLabel = NULL,
      ncol = NULL,
      headers = NULL,
      decreasing = FALSE
    )
    
    # S4 method for SingleCellExperiment
    plotDimReduceGrid(
      x,
      reducedDimName,
      dim1 = NULL,
      dim2 = NULL,
      useAssay = "counts",
      altExpName = "featureSubset",
      size = 1,
      xlab = "Dimension_1",
      ylab = "Dimension_2",
      limits = c(-2, 2),
      colorLow = "blue4",
      colorMid = "grey90",
      colorHigh = "firebrick1",
      midpoint = 0,
      varLabel = NULL,
      ncol = NULL,
      headers = NULL,
      decreasing = FALSE
    )
    
    # S4 method for ANY
    plotDimReduceGrid(
      x,
      dim1,
      dim2,
      size = 1,
      xlab = "Dimension_1",
      ylab = "Dimension_2",
      limits = c(-2, 2),
      colorLow = "blue4",
      colorMid = "grey90",
      colorHigh = "firebrick1",
      midpoint = 0,
      varLabel = NULL,
      ncol = NULL,
      headers = NULL,
      decreasing = FALSE
    )

    Arguments

    x

    Numeric matrix or a SingleCellExperiment object with the matrix located in the assay slot under useAssay. Each row of the matrix will be plotted as a separate facet.

    reducedDimName

    The name of the dimension reduction slot in reducedDimNames(x) if x is a SingleCellExperiment object. Ignored if both dim1 and dim2 are set.

    dim1

    Numeric vector. Second dimension from data dimension reduction output.

    dim2

    Numeric vector. Second dimension from data dimension reduction output.

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    size

    Numeric. Sets size of point on plot. Default 1.

    xlab

    Character vector. Label for the x-axis. Default 'Dimension_1'.

    ylab

    Character vector. Label for the y-axis. Default 'Dimension_2'.

    limits

    Passed to scale_colour_gradient2. The range of color scale.

    colorLow

    Character. A color available from `colors()`. The color will be used to signify the lowest values on the scale. Default "blue4".

    colorMid

    Character. A color available from `colors()`. The color will be used to signify the midpoint on the scale. Default "grey90".

    colorHigh

    Character. A color available from `colors()`. The color will be used to signify the highest values on the scale. Default "firebrick1".

    midpoint

    Numeric. The value indicating the midpoint of the diverging color scheme. If NULL, defaults to the mean with 10 percent of values trimmed. Default 0.

    varLabel

    Character vector. Title for the color legend.

    ncol

    Integer. Passed to facet_wrap. Specify the number of columns for facet wrap.

    headers

    Character vector. If `NULL`, the corresponding rownames are used as labels. Otherwise, these headers are used to label the genes.

    decreasing

    logical. Specifies the order of plotting the points. If FALSE, the points will be plotted in increasing order where the points with largest values will be on top. TRUE otherwise. If NULL, no sorting is performed. Points will be plotted in their current order in x. Default FALSE.

    Value

    The plot as a ggplot object

    Examples

    data(sceCeldaCG)
    sce <- celdaTsne(sceCeldaCG)
    plotDimReduceGrid(x = sce,
      reducedDimName = "celda_tSNE",
      xlab = "Dimension1",
      ylab = "Dimension2",
      varLabel = "tSNE")
    
    library(SingleCellExperiment)
    data(sceCeldaCG)
    sce <- celdaTsne(sceCeldaCG)
    plotDimReduceGrid(x = counts(sce),
      dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1],
      dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2],
      xlab = "Dimension1",
      ylab = "Dimension2",
      varLabel = "tSNE")
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotDimReduceModule.html ================================================ Plotting Celda module probability on a dimension reduction plot — plotDimReduceModule • celda

    Create a scatterplot for each row of a normalized gene expression matrix where x and y axis are from a data dimension reduction tool. The cells are colored by the module probability.

    plotDimReduceModule(
      x,
      reducedDimName,
      useAssay = "counts",
      altExpName = "featureSubset",
      celdaMod,
      modules = NULL,
      dim1 = NULL,
      dim2 = NULL,
      size = 0.5,
      xlab = NULL,
      ylab = NULL,
      rescale = TRUE,
      limits = c(0, 1),
      colorLow = "grey90",
      colorHigh = "firebrick1",
      ncol = NULL,
      decreasing = FALSE
    )
    
    # S4 method for SingleCellExperiment
    plotDimReduceModule(
      x,
      reducedDimName,
      useAssay = "counts",
      altExpName = "featureSubset",
      modules = NULL,
      dim1 = 1,
      dim2 = 2,
      size = 0.5,
      xlab = NULL,
      ylab = NULL,
      rescale = TRUE,
      limits = c(0, 1),
      colorLow = "grey90",
      colorHigh = "firebrick1",
      ncol = NULL,
      decreasing = FALSE
    )
    
    # S4 method for ANY
    plotDimReduceModule(
      x,
      celdaMod,
      modules = NULL,
      dim1,
      dim2,
      size = 0.5,
      xlab = "Dimension_1",
      ylab = "Dimension_2",
      rescale = TRUE,
      limits = c(0, 1),
      colorLow = "grey90",
      colorHigh = "firebrick1",
      ncol = NULL,
      decreasing = FALSE
    )

    Arguments

    x

    Numeric matrix or a SingleCellExperiment object with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells.

    reducedDimName

    The name of the dimension reduction slot in reducedDimNames(x) if x is a SingleCellExperiment object. Ignored if both dim1 and dim2 are set.

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    celdaMod

    Celda object of class "celda_G" or "celda_CG". Used only if x is a matrix object.

    modules

    Character vector. Module(s) from celda model to be plotted. e.g. c("1", "2").

    dim1

    Integer or numeric vector. If reducedDimName is supplied, then, this will be used as an index to determine which dimension will be plotted on the x-axis. If reducedDimName is not supplied, then this should be a vector which will be plotted on the x-axis. Default 1.

    dim2

    Integer or numeric vector. If reducedDimName is supplied, then, this will be used as an index to determine which dimension will be plotted on the y-axis. If reducedDimName is not supplied, then this should be a vector which will be plotted on the y-axis. Default 2.

    size

    Numeric. Sets size of point on plot. Default 0.5.

    xlab

    Character vector. Label for the x-axis. Default "Dimension_1".

    ylab

    Character vector. Label for the y-axis. Default "Dimension_2".

    rescale

    Logical. Whether rows of the matrix should be rescaled to [0, 1]. Default TRUE.

    limits

    Passed to scale_colour_gradient. The range of color scale.

    colorLow

    Character. A color available from `colors()`. The color will be used to signify the lowest values on the scale.

    colorHigh

    Character. A color available from `colors()`. The color will be used to signify the highest values on the scale.

    ncol

    Integer. Passed to facet_wrap. Specify the number of columns for facet wrap.

    decreasing

    logical. Specifies the order of plotting the points. If FALSE, the points will be plotted in increasing order where the points with largest values will be on top. TRUE otherwise. If NULL, no sorting is performed. Points will be plotted in their current order in x. Default FALSE.

    Value

    The plot as a ggplot object

    Examples

    data(sceCeldaCG)
    sce <- celdaTsne(sceCeldaCG)
    plotDimReduceModule(x = sce,
      reducedDimName = "celda_tSNE",
      modules = c("1", "2"))
    
    library(SingleCellExperiment)
    data(sceCeldaCG, celdaCGMod)
    sce <- celdaTsne(sceCeldaCG)
    plotDimReduceModule(x = counts(sce),
      dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1],
      dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2],
      celdaMod = celdaCGMod,
      modules = c("1", "2"))
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotGridSearchPerplexity.html ================================================ Visualize perplexity of a list of celda models — plotGridSearchPerplexity • celda

    Visualize perplexity of every model in a celdaList, by unique K/L combinations

    plotGridSearchPerplexity(x, altExpName = "featureSubset", sep = 5, alpha = 0.5)
    
    # S4 method for SingleCellExperiment
    plotGridSearchPerplexity(x, altExpName = "featureSubset", sep = 5, alpha = 0.5)
    
    # S4 method for celdaList
    plotGridSearchPerplexity(x, sep = 5, alpha = 0.5)

    Arguments

    x

    Can be one of

    • A SingleCellExperiment object returned from celdaGridSearch, recursiveSplitModule, or recursiveSplitCell. Must contain a list named "celda_grid_search" in metadata(x).

    • celdaList object.

    altExpName

    The name for the altExp slot to use. Default "featureSubset". Only works if x is a SingleCellExperiment object.

    sep

    Numeric. Breaks in the x axis of the resulting plot.

    alpha

    Numeric. Passed to geom_jitter. Opacity of the points. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.

    Value

    A ggplot plot object showing perplexity as a function of clustering parameters.

    Examples

    data(sceCeldaCGGridSearch)
    sce <- resamplePerplexity(sceCeldaCGGridSearch)
    plotGridSearchPerplexity(sce)
    
    data(celdaCGSim, celdaCGGridSearchRes)
    ## Run various combinations of parameters with 'celdaGridSearch'
    celdaCGGridSearchRes <- resamplePerplexity(
      celdaCGSim$counts,
      celdaCGGridSearchRes)
    plotGridSearchPerplexity(celdaCGGridSearchRes)
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotHeatmap.html ================================================ Plots heatmap based on Celda model — plotHeatmap • celda

    Renders a heatmap based on a matrix of counts where rows are features and columns are cells.

    plotHeatmap(
      counts,
      z = NULL,
      y = NULL,
      scaleRow = scale,
      trim = c(-2, 2),
      featureIx = NULL,
      cellIx = NULL,
      clusterFeature = TRUE,
      clusterCell = TRUE,
      colorScheme = c("divergent", "sequential"),
      colorSchemeSymmetric = TRUE,
      colorSchemeCenter = 0,
      col = NULL,
      annotationCell = NULL,
      annotationFeature = NULL,
      annotationColor = NULL,
      breaks = NULL,
      legend = TRUE,
      annotationLegend = TRUE,
      annotationNamesFeature = TRUE,
      annotationNamesCell = TRUE,
      showNamesFeature = FALSE,
      showNamesCell = FALSE,
      rowGroupOrder = NULL,
      colGroupOrder = NULL,
      hclustMethod = "ward.D2",
      treeheightFeature = ifelse(clusterFeature, 50, 0),
      treeheightCell = ifelse(clusterCell, 50, 0),
      silent = FALSE,
      ...
    )

    Arguments

    counts

    Numeric or sparse matrix. Normalized counts matrix where rows represent features and columns represent cells. .

    z

    Numeric vector. Denotes cell population labels.

    y

    Numeric vector. Denotes feature module labels.

    scaleRow

    Function. A function to scale each individual row. Set to NULL to disable. Occurs after normalization and log transformation. Defualt is 'scale' and thus will Z-score transform each row.

    trim

    Numeric vector. Vector of length two that specifies the lower and upper bounds for the data. This threshold is applied after row scaling. Set to NULL to disable. Default c(-2,2).

    featureIx

    Integer vector. Select features for display in heatmap. If NULL, no subsetting will be performed. Default NULL.

    cellIx

    Integer vector. Select cells for display in heatmap. If NULL, no subsetting will be performed. Default NULL.

    clusterFeature

    Logical. Determines whether rows should be clustered. Default TRUE.

    clusterCell

    Logical. Determines whether columns should be clustered. Default TRUE.

    colorScheme

    Character. One of "divergent" or "sequential". A "divergent" scheme is best for highlighting relative data (denoted by 'colorSchemeCenter') such as gene expression data that has been normalized and centered. A "sequential" scheme is best for highlighting data that are ordered low to high such as raw counts or probabilities. Default "divergent".

    colorSchemeSymmetric

    Logical. When the colorScheme is "divergent" and the data contains both positive and negative numbers, TRUE indicates that the color scheme should be symmetric from [-max(abs(data)), max(abs(data))]. For example, if the data ranges goes from -1.5 to 2, then setting this to TRUE will force the color scheme to range from -2 to 2. Default TRUE.

    colorSchemeCenter

    Numeric. Indicates the center of a "divergent" colorScheme. Default 0.

    col

    Color for the heatmap.

    annotationCell

    Data frame. Additional annotations for each cell will be shown in the column color bars. The format of the data frame should be one row for each cell and one column for each annotation. Numeric variables will be displayed as continuous color bars and factors will be displayed as discrete color bars. Default NULL.

    annotationFeature

    A data frame for the feature annotations (rows).

    annotationColor

    List. Contains color scheme for all annotations. See `?pheatmap` for more details.

    breaks

    Numeric vector. A sequence of numbers that covers the range of values in the normalized `counts`. Values in the normalized `matrix` are assigned to each bin in `breaks`. Each break is assigned to a unique color from `col`. If NULL, then breaks are calculated automatically. Default NULL.

    legend

    Logical. Determines whether legend should be drawn. Default TRUE.

    annotationLegend

    Logical. Whether legend for all annotations should be drawn. Default TRUE.

    annotationNamesFeature

    Logical. Whether the names for features should be shown. Default TRUE.

    annotationNamesCell

    Logical. Whether the names for cells should be shown. Default TRUE.

    showNamesFeature

    Logical. Specifies if feature names should be shown. Default TRUE.

    showNamesCell

    Logical. Specifies if cell names should be shown. Default FALSE.

    rowGroupOrder

    Vector. Specifies the order of feature clusters when semisupervised clustering is performed on the y labels.

    colGroupOrder

    Vector. Specifies the order of cell clusters when semisupervised clustering is performed on the z labels.

    hclustMethod

    Character. Specifies the method to use for the 'hclust' function. See `?hclust` for possible values. Default "ward.D2".

    treeheightFeature

    Numeric. Width of the feature dendrogram. Set to 0 to disable plotting of this dendrogram. Default: if clusterFeature == TRUE, then treeheightFeature = 50, else treeheightFeature = 0.

    treeheightCell

    Numeric. Height of the cell dendrogram. Set to 0 to disable plotting of this dendrogram. Default: if clusterCell == TRUE, then treeheightCell = 50, else treeheightCell = 0.

    silent

    Logical. Whether to plot the heatmap.

    ...

    Other arguments to be passed to underlying pheatmap function.

    Value

    list A list containing dendrogram information and the heatmap grob

    Examples

    data(celdaCGSim, celdaCGMod)
    plotHeatmap(celdaCGSim$counts,
      z = celdaClusters(celdaCGMod)$z, y = celdaClusters(celdaCGMod)$y
    )
    #> TableGrob (5 x 6) "layout": 9 grobs
    #>   z     cells                 name                          grob
    #> 1 1 (2-2,3-3)             col_tree polyline[GRID.polyline.12713]
    #> 2 2 (4-4,1-1)             row_tree polyline[GRID.polyline.12714]
    #> 3 3 (4-4,3-3)               matrix       gTree[GRID.gTree.12716]
    #> 4 4 (3-3,3-3)       col_annotation         rect[GRID.rect.12717]
    #> 5 5 (3-3,4-4) col_annotation_names         text[GRID.text.12718]
    #> 6 6 (4-4,2-2)       row_annotation         rect[GRID.rect.12719]
    #> 7 7 (5-5,2-2) row_annotation_names         text[GRID.text.12720]
    #> 8 8 (4-5,6-6)     annotationLegend       gTree[GRID.gTree.12728]
    #> 9 9 (4-5,5-5)               legend       gTree[GRID.gTree.12731]
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/plotRPC.html ================================================ Visualize perplexity differences of a list of celda models — plotRPC • celda

    Visualize perplexity differences of every model in a celdaList, by unique K/L combinations.

    plotRPC(x, altExpName = "featureSubset", sep = 5, alpha = 0.5)
    
    # S4 method for SingleCellExperiment
    plotRPC(x, altExpName = "featureSubset", sep = 5, alpha = 0.5)
    
    # S4 method for celdaList
    plotRPC(x, sep = 5, alpha = 0.5)

    Arguments

    x

    Can be one of

    • A SingleCellExperiment object returned from celdaGridSearch, recursiveSplitModule, or recursiveSplitCell. Must contain a list named "celda_grid_search" in metadata(x).

    • celdaList object.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    sep

    Numeric. Breaks in the x axis of the resulting plot.

    alpha

    Numeric. Passed to geom_jitter. Opacity of the points. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.

    Value

    A ggplot plot object showing perplexity differences as a function of clustering parameters.

    Examples

    data(sceCeldaCGGridSearch)
    sce <- resamplePerplexity(sceCeldaCGGridSearch)
    plotRPC(sce)
    
    data(celdaCGSim, celdaCGGridSearchRes)
    ## Run various combinations of parameters with 'celdaGridSearch'
    celdaCGGridSearchRes <- resamplePerplexity(
      celdaCGSim$counts,
      celdaCGGridSearchRes)
    plotRPC(celdaCGGridSearchRes)
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/recodeClusterY.html ================================================ Recode feature module labels — recodeClusterY • celda

    Recode feature module clusters using a mapping in the from and to arguments.

    recodeClusterY(sce, from, to, altExpName = "featureSubset")

    Arguments

    sce

    SingleCellExperiment object returned from celda_G or celda_CG. Must contain column celda_feature_module in rowData(altExp(sce, altExpName)).

    from

    Numeric vector. Unique values in the range of seq(celdaModules(sce)) that correspond to the original module labels in sce.

    to

    Numeric vector. Unique values in the range of seq(celdaModules(sce)) that correspond to the new module labels.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    @return SingleCellExperiment object with recoded feature module labels.

    Examples

    data(sceCeldaCG)
    sceReorderedY <- recodeClusterY(sceCeldaCG, c(1, 3), c(3, 1))
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/recodeClusterZ.html ================================================ Recode cell cluster labels — recodeClusterZ • celda

    Recode cell subpopulaton clusters using a mapping in the from and to arguments.

    recodeClusterZ(sce, from, to, altExpName = "featureSubset")

    Arguments

    sce

    SingleCellExperiment object returned from celda_C or celda_CG. Must contain column celda_cell_cluster in colData(altExp(sce, altExpName)).

    from

    Numeric vector. Unique values in the range of seq(max(as.integer(celdaClusters(sce, altExpName = altExpName)))) that correspond to the original cluster labels in sce.

    to

    Numeric vector. Unique values in the range of seq(max(as.integer(celdaClusters(sce, altExpName = altExpName)))) that correspond to the new cluster labels.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    SingleCellExperiment object with recoded cell cluster labels.

    Examples

    data(sceCeldaCG)
    sceReorderedZ <- recodeClusterZ(sceCeldaCG, c(1, 3), c(3, 1))
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/recursiveSplitCell.html ================================================ Recursive cell splitting — recursiveSplitCell • celda

    Uses the celda_C model to cluster cells into population for range of possible K's. The cell population labels of the previous "K-1" model are used as the initial values in the current model with K cell populations. The best split of an existing cell population is found to create the K-th cluster. This procedure is much faster than randomly initializing each model with a different K. If module labels for each feature are given in 'yInit', the celda_CG model will be used to split cell populations based on those modules instead of individual features. Module labels will also be updated during sampling and thus may end up slightly different than yInit.

    recursiveSplitCell(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      sampleLabel = NULL,
      initialK = 5,
      maxK = 25,
      tempL = NULL,
      yInit = NULL,
      alpha = 1,
      beta = 1,
      delta = 1,
      gamma = 1,
      minCell = 3,
      reorder = TRUE,
      seed = 12345,
      perplexity = TRUE,
      doResampling = FALSE,
      numResample = 5,
      logfile = NULL,
      verbose = TRUE
    )
    
    # S4 method for SingleCellExperiment
    recursiveSplitCell(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      sampleLabel = NULL,
      initialK = 5,
      maxK = 25,
      tempL = NULL,
      yInit = NULL,
      alpha = 1,
      beta = 1,
      delta = 1,
      gamma = 1,
      minCell = 3,
      reorder = TRUE,
      seed = 12345,
      perplexity = TRUE,
      doResampling = FALSE,
      numResample = 5,
      logfile = NULL,
      verbose = TRUE
    )
    
    # S4 method for matrix
    recursiveSplitCell(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      sampleLabel = NULL,
      initialK = 5,
      maxK = 25,
      tempL = NULL,
      yInit = NULL,
      alpha = 1,
      beta = 1,
      delta = 1,
      gamma = 1,
      minCell = 3,
      reorder = TRUE,
      seed = 12345,
      perplexity = TRUE,
      doResampling = FALSE,
      numResample = 5,
      logfile = NULL,
      verbose = TRUE
    )

    Arguments

    x

    A numeric matrix of counts or a SingleCellExperiment with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells.

    useAssay

    A string specifying the name of the assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    sampleLabel

    Vector or factor. Denotes the sample label for each cell (column) in the count matrix.

    initialK

    Integer. Initial number of cell populations to try. Default 5.

    maxK

    Integer. Maximum number of cell populations to try. Default 25.

    tempL

    Integer. Number of temporary modules to identify and use in cell splitting. Only used if yInit = NULL. Collapsing features to a relatively smaller number of modules will increase the speed of clustering and tend to produce better cell populations. This number should be larger than the number of true modules expected in the dataset. Default NULL.

    yInit

    Integer vector. Module labels for features. Cells will be clustered using the celda_CG model based on the modules specified in yInit rather than the counts of individual features. While the features will be initialized to the module labels in yInit, the labels will be allowed to move within each new model with a different K.

    alpha

    Numeric. Concentration parameter for Theta. Adds a pseudocount to each cell population in each sample. Default 1.

    beta

    Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature in each cell (if yInit is NULL) or to each module in each cell population (if yInit is set). Default 1.

    delta

    Numeric. Concentration parameter for Psi. Adds a pseudocount to each feature in each module. Only used if yInit is set. Default 1.

    gamma

    Numeric. Concentration parameter for Eta. Adds a pseudocount to the number of features in each module. Only used if yInit is set. Default 1.

    minCell

    Integer. Only attempt to split cell populations with at least this many cells.

    reorder

    Logical. Whether to reorder cell populations using hierarchical clustering after each model has been created. If FALSE, cell populations numbers will correspond to the split which created the cell populations (i.e. 'K15' was created at split 15, 'K16' was created at split 16, etc.). Default TRUE.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    perplexity

    Logical. Whether to calculate perplexity for each model. If FALSE, then perplexity can be calculated later with resamplePerplexity. Default TRUE.

    doResampling

    Boolean. If TRUE, then each cell in the counts matrix will be resampled according to a multinomial distribution to introduce noise before calculating perplexity. Default FALSE.

    numResample

    Integer. The number of times to resample the counts matrix for evaluating perplexity if doResampling is set to TRUE. Default 5.

    logfile

    Character. Messages will be redirected to a file named "logfile". If NULL, messages will be printed to stdout. Default NULL.

    verbose

    Logical. Whether to print log messages. Default TRUE.

    Value

    A SingleCellExperiment object. Function parameter settings and celda model results are stored in the

    metadata

    "celda_grid_search" slot. The models in the list will be of class celda_C if yInit = NULL or

    celda_CG if zInit is set.

    See also

    recursiveSplitModule for recursive splitting of feature modules.

    Examples

    data(sceCeldaCG)
    ## Create models that range from K = 3 to K = 7 by recursively splitting
    ## cell populations into two to produce \link{celda_C} cell clustering models
    sce <- recursiveSplitCell(sceCeldaCG, initialK = 3, maxK = 7)
    #> ==================================================
    #> Starting recursive cell population splitting.
    #> ==================================================
    #> Tue Apr  2 18:56:32 2024 .. Initializing with 3 populations
    #> Tue Apr  2 18:56:32 2024 .. Current cell population 4 | logLik: -1225755.01101897
    #> Tue Apr  2 18:56:32 2024 .. Current cell population 5 | logLik: -1213677.60126784
    #> Tue Apr  2 18:56:32 2024 .. Current cell population 6 | logLik: -1213903.59449854
    #> Tue Apr  2 18:56:32 2024 .. Current cell population 7 | logLik: -1214081.54311397
    #> Tue Apr  2 18:56:32 2024 .. Calculating perplexity
    #> ==================================================
    #> Completed recursive cell population splitting. Total time: 0.3127432 secs
    #> ==================================================
    
    ## Alternatively, first identify features modules using
    ## \link{recursiveSplitModule}
    moduleSplit <- recursiveSplitModule(sceCeldaCG, initialL = 3, maxL = 15)
    #> ==================================================
    #> Starting recursive module splitting.
    #> ==================================================
    #> Tue Apr  2 18:56:32 2024 .. Collapsing to 100 temporary cell populations
    #> Tue Apr  2 18:56:34 2024 .. Initializing with 3 modules
    #> Tue Apr  2 18:56:35 2024 .. Created module 4 | logLik: -1241379.90928455
    #> Tue Apr  2 18:56:35 2024 .. Created module 5 | logLik: -1235212.7977535
    #> Tue Apr  2 18:56:35 2024 .. Created module 6 | logLik: -1232789.9817561
    #> Tue Apr  2 18:56:35 2024 .. Created module 7 | logLik: -1227246.66090571
    #> Tue Apr  2 18:56:35 2024 .. Created module 8 | logLik: -1223898.757694
    #> Tue Apr  2 18:56:35 2024 .. Created module 9 | logLik: -1221848.26936098
    #> Tue Apr  2 18:56:35 2024 .. Created module 10 | logLik: -1220147.96681948
    #> Tue Apr  2 18:56:35 2024 .. Created module 11 | logLik: -1220818.37022325
    #> Tue Apr  2 18:56:35 2024 .. Created module 12 | logLik: -1221489.07685946
    #> Tue Apr  2 18:56:35 2024 .. Created module 13 | logLik: -1222032.53497571
    #> Tue Apr  2 18:56:36 2024 .. Created module 14 | logLik: -1222712.17543857
    #> Tue Apr  2 18:56:36 2024 .. Created module 15 | logLik: -1223268.97596756
    #> Tue Apr  2 18:56:36 2024 .. Calculating perplexity
    #> ==================================================
    #> Completed recursive module splitting. Total time: 3.285614 secs
    #> ==================================================
    plotGridSearchPerplexity(moduleSplit)
    
    moduleSplitSelect <- subsetCeldaList(moduleSplit, list(L = 10))
    
    ## Then use module labels for initialization in \link{recursiveSplitCell} to
    ## produce \link{celda_CG} bi-clustering models
    cellSplit <- recursiveSplitCell(sceCeldaCG,
      initialK = 3, maxK = 7, yInit = celdaModules(moduleSplitSelect))
    #> ==================================================
    #> Starting recursive cell population splitting.
    #> ==================================================
    #> Tue Apr  2 18:56:36 2024 .. Collapsing to 10 modules
    #> Tue Apr  2 18:56:36 2024 .. Initializing with 3 populations
    #> Tue Apr  2 18:56:37 2024 .. Current cell population 4 | logLik: -1225286.49558716
    #> Tue Apr  2 18:56:37 2024 .. Current cell population 5 | logLik: -1212955.15575681
    #> Tue Apr  2 18:56:37 2024 .. Current cell population 6 | logLik: -1212982.74290613
    #> Tue Apr  2 18:56:37 2024 .. Current cell population 7 | logLik: -1213005.40337891
    #> Tue Apr  2 18:56:37 2024 .. Calculating perplexity
    #> ==================================================
    #> Completed recursive cell population splitting. Total time: 1.227239 secs
    #> ==================================================
    plotGridSearchPerplexity(cellSplit)
    
    sce <- subsetCeldaList(cellSplit, list(K = 5, L = 10))
    data(celdaCGSim, celdaCSim)
    ## Create models that range from K = 3 to K = 7 by recursively splitting
    ## cell populations into two to produce \link{celda_C} cell clustering models
    sce <- recursiveSplitCell(celdaCSim$counts, initialK = 3, maxK = 7)
    #> ==================================================
    #> Starting recursive cell population splitting.
    #> ==================================================
    #> Tue Apr  2 18:56:38 2024 .. Initializing with 3 populations
    #> Tue Apr  2 18:56:38 2024 .. Current cell population 4 | logLik: -1341630.1679001
    #> Tue Apr  2 18:56:38 2024 .. Current cell population 5 | logLik: -1327506.91718317
    #> Tue Apr  2 18:56:38 2024 .. Current cell population 6 | logLik: -1315227.54586167
    #> Tue Apr  2 18:56:38 2024 .. Current cell population 7 | logLik: -1304393.65802293
    #> Tue Apr  2 18:56:38 2024 .. Calculating perplexity
    #> ==================================================
    #> Completed recursive cell population splitting. Total time: 0.2604418 secs
    #> ==================================================
    
    ## Alternatively, first identify features modules using
    ## \link{recursiveSplitModule}
    moduleSplit <- recursiveSplitModule(celdaCGSim$counts,
      initialL = 3, maxL = 15)
    #> ==================================================
    #> Starting recursive module splitting.
    #> ==================================================
    #> Tue Apr  2 18:56:39 2024 .. Collapsing to 100 temporary cell populations
    #> Tue Apr  2 18:56:40 2024 .. Initializing with 3 modules
    #> Tue Apr  2 18:56:40 2024 .. Created module 4 | logLik: -1243396.62348886
    #> Tue Apr  2 18:56:40 2024 .. Created module 5 | logLik: -1237610.11790137
    #> Tue Apr  2 18:56:40 2024 .. Created module 6 | logLik: -1232128.87013396
    #> Tue Apr  2 18:56:40 2024 .. Created module 7 | logLik: -1227611.8250329
    #> Tue Apr  2 18:56:40 2024 .. Created module 8 | logLik: -1225618.06184004
    #> Tue Apr  2 18:56:40 2024 .. Created module 9 | logLik: -1223967.77531912
    #> Tue Apr  2 18:56:41 2024 .. Created module 10 | logLik: -1222801.11395987
    #> Tue Apr  2 18:56:41 2024 .. Created module 11 | logLik: -1223402.66903597
    #> Tue Apr  2 18:56:41 2024 .. Created module 12 | logLik: -1224026.19892208
    #> Tue Apr  2 18:56:41 2024 .. Created module 13 | logLik: -1224675.63005464
    #> Tue Apr  2 18:56:41 2024 .. Created module 14 | logLik: -1225317.91966369
    #> Tue Apr  2 18:56:41 2024 .. Created module 15 | logLik: -1225971.50555157
    #> Tue Apr  2 18:56:41 2024 .. Calculating perplexity
    #> ==================================================
    #> Completed recursive module splitting. Total time: 2.560257 secs
    #> ==================================================
    plotGridSearchPerplexity(moduleSplit)
    
    moduleSplitSelect <- subsetCeldaList(moduleSplit, list(L = 10))
    
    ## Then use module labels for initialization in \link{recursiveSplitCell} to
    ## produce \link{celda_CG} bi-clustering models
    cellSplit <- recursiveSplitCell(celdaCGSim$counts,
      initialK = 3, maxK = 7, yInit = celdaModules(moduleSplitSelect))
    #> ==================================================
    #> Starting recursive cell population splitting.
    #> ==================================================
    #> Tue Apr  2 18:56:42 2024 .. Collapsing to 10 modules
    #> Tue Apr  2 18:56:42 2024 .. Initializing with 3 populations
    #> Tue Apr  2 18:56:44 2024 .. Current cell population 4 | logLik: -1227944.5458832
    #> Tue Apr  2 18:56:44 2024 .. Current cell population 5 | logLik: -1215605.08613503
    #> Tue Apr  2 18:56:44 2024 .. Current cell population 6 | logLik: -1215627.62281773
    #> Tue Apr  2 18:56:44 2024 .. Current cell population 7 | logLik: -1215651.32538066
    #> Tue Apr  2 18:56:44 2024 .. Calculating perplexity
    #> ==================================================
    #> Completed recursive cell population splitting. Total time: 2.064909 secs
    #> ==================================================
    plotGridSearchPerplexity(cellSplit)
    
    sce <- subsetCeldaList(cellSplit, list(K = 5, L = 10))
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/recursiveSplitModule.html ================================================ Recursive module splitting — recursiveSplitModule • celda

    Uses the celda_G model to cluster features into modules for a range of possible L's. The module labels of the previous "L-1" model are used as the initial values in the current model with L modules. The best split of an existing module is found to create the L-th module. This procedure is much faster than randomly initializing each model with a different L.

    recursiveSplitModule(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      initialL = 10,
      maxL = 100,
      tempK = 100,
      zInit = NULL,
      sampleLabel = NULL,
      alpha = 1,
      beta = 1,
      delta = 1,
      gamma = 1,
      minFeature = 3,
      reorder = TRUE,
      seed = 12345,
      perplexity = TRUE,
      doResampling = FALSE,
      numResample = 5,
      verbose = TRUE,
      logfile = NULL
    )
    
    # S4 method for SingleCellExperiment
    recursiveSplitModule(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      initialL = 10,
      maxL = 100,
      tempK = 100,
      zInit = NULL,
      sampleLabel = NULL,
      alpha = 1,
      beta = 1,
      delta = 1,
      gamma = 1,
      minFeature = 3,
      reorder = TRUE,
      seed = 12345,
      perplexity = TRUE,
      doResampling = FALSE,
      numResample = 5,
      verbose = TRUE,
      logfile = NULL
    )
    
    # S4 method for matrix
    recursiveSplitModule(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      initialL = 10,
      maxL = 100,
      tempK = 100,
      zInit = NULL,
      sampleLabel = NULL,
      alpha = 1,
      beta = 1,
      delta = 1,
      gamma = 1,
      minFeature = 3,
      reorder = TRUE,
      seed = 12345,
      perplexity = TRUE,
      doResampling = FALSE,
      numResample = 5,
      verbose = TRUE,
      logfile = NULL
    )

    Arguments

    x

    A numeric matrix of counts or a SingleCellExperiment with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells.

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    initialL

    Integer. Initial number of modules.

    maxL

    Integer. Maximum number of modules.

    tempK

    Integer. Number of temporary cell populations to identify and use in module splitting. Only used if zInit = NULL Collapsing cells to a relatively smaller number of cell popluations will increase the speed of module clustering and tend to produce better modules. This number should be larger than the number of true cell populations expected in the dataset. Default 100.

    zInit

    Integer vector. Collapse cells to cell populations based on labels in zInit and then perform module splitting. If NULL, no collapsing will be performed unless tempK is specified. Default NULL.

    sampleLabel

    Vector or factor. Denotes the sample label for each cell (column) in the count matrix. Default NULL.

    alpha

    Numeric. Concentration parameter for Theta. Adds a pseudocount to each cell population in each sample. Only used if zInit is set. Default 1.

    beta

    Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature module in each cell. Default 1.

    delta

    Numeric. Concentration parameter for Psi. Adds a pseudocount to each feature in each module. Default 1.

    gamma

    Numeric. Concentration parameter for Eta. Adds a pseudocount to the number of features in each module. Default 1.

    minFeature

    Integer. Only attempt to split modules with at least this many features.

    reorder

    Logical. Whether to reorder modules using hierarchical clustering after each model has been created. If FALSE, module numbers will correspond to the split which created the module (i.e. 'L15' was created at split 15, 'L16' was created at split 16, etc.). Default TRUE.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    perplexity

    Logical. Whether to calculate perplexity for each model. If FALSE, then perplexity can be calculated later with resamplePerplexity. Default TRUE.

    doResampling

    Boolean. If TRUE, then each cell in the counts matrix will be resampled according to a multinomial distribution to introduce noise before calculating perplexity. Default FALSE.

    numResample

    Integer. The number of times to resample the counts matrix for evaluating perplexity if doResampling is set to TRUE. Default 5.

    verbose

    Logical. Whether to print log messages. Default TRUE.

    logfile

    Character. Messages will be redirected to a file named "logfile". If NULL, messages will be printed to stdout. Default NULL.

    Value

    A SingleCellExperiment object. Function parameter settings and celda model results are stored in the

    metadata

    "celda_grid_search" slot. The models in the list will be of class celda_G if zInit = NULL or

    celda_CG if zInit is set.

    See also

    recursiveSplitCell for recursive splitting of cell populations.

    Examples

    data(sceCeldaCG)
    ## Create models that range from L=3 to L=20 by recursively splitting modules
    ## into two
    moduleSplit <- recursiveSplitModule(sceCeldaCG, initialL = 3, maxL = 20)
    #> ==================================================
    #> Starting recursive module splitting.
    #> ==================================================
    #> Tue Apr  2 18:56:46 2024 .. Collapsing to 100 temporary cell populations
    #> Tue Apr  2 18:56:47 2024 .. Initializing with 3 modules
    #> Tue Apr  2 18:56:47 2024 .. Created module 4 | logLik: -1241379.90928455
    #> Tue Apr  2 18:56:47 2024 .. Created module 5 | logLik: -1235212.7977535
    #> Tue Apr  2 18:56:47 2024 .. Created module 6 | logLik: -1232789.9817561
    #> Tue Apr  2 18:56:48 2024 .. Created module 7 | logLik: -1227246.66090571
    #> Tue Apr  2 18:56:48 2024 .. Created module 8 | logLik: -1223898.757694
    #> Tue Apr  2 18:56:48 2024 .. Created module 9 | logLik: -1221848.26936098
    #> Tue Apr  2 18:56:48 2024 .. Created module 10 | logLik: -1220147.96681948
    #> Tue Apr  2 18:56:48 2024 .. Created module 11 | logLik: -1220818.37022325
    #> Tue Apr  2 18:56:48 2024 .. Created module 12 | logLik: -1221489.07685946
    #> Tue Apr  2 18:56:48 2024 .. Created module 13 | logLik: -1222032.53497571
    #> Tue Apr  2 18:56:48 2024 .. Created module 14 | logLik: -1222712.17543857
    #> Tue Apr  2 18:56:48 2024 .. Created module 15 | logLik: -1223268.97596756
    #> Tue Apr  2 18:56:48 2024 .. Created module 16 | logLik: -1223841.4834406
    #> Tue Apr  2 18:56:49 2024 .. Created module 17 | logLik: -1224394.02513994
    #> Tue Apr  2 18:56:49 2024 .. Created module 18 | logLik: -1224863.41435811
    #> Tue Apr  2 18:56:49 2024 .. Created module 19 | logLik: -1225480.30453125
    #> Tue Apr  2 18:56:49 2024 .. Created module 20 | logLik: -1226156.47078695
    #> Tue Apr  2 18:56:49 2024 .. Calculating perplexity
    #> ==================================================
    #> Completed recursive module splitting. Total time: 3.40894 secs
    #> ==================================================
    
    ## Example results with perplexity
    plotGridSearchPerplexity(moduleSplit)
    
    
    ## Select model for downstream analysis
    celdaMod <- subsetCeldaList(moduleSplit, list(L = 10))
    data(celdaCGSim)
    ## Create models that range from L=3 to L=20 by recursively splitting modules
    ## into two
    moduleSplit <- recursiveSplitModule(celdaCGSim$counts,
      initialL = 3, maxL = 20)
    #> ==================================================
    #> Starting recursive module splitting.
    #> ==================================================
    #> Tue Apr  2 18:56:50 2024 .. Collapsing to 100 temporary cell populations
    #> Tue Apr  2 18:56:51 2024 .. Initializing with 3 modules
    #> Tue Apr  2 18:56:51 2024 .. Created module 4 | logLik: -1243396.62348886
    #> Tue Apr  2 18:56:51 2024 .. Created module 5 | logLik: -1237610.11790137
    #> Tue Apr  2 18:56:51 2024 .. Created module 6 | logLik: -1232128.87013396
    #> Tue Apr  2 18:56:51 2024 .. Created module 7 | logLik: -1227611.8250329
    #> Tue Apr  2 18:56:51 2024 .. Created module 8 | logLik: -1225618.06184004
    #> Tue Apr  2 18:56:52 2024 .. Created module 9 | logLik: -1223967.77531912
    #> Tue Apr  2 18:56:52 2024 .. Created module 10 | logLik: -1222801.11395987
    #> Tue Apr  2 18:56:52 2024 .. Created module 11 | logLik: -1223402.66903597
    #> Tue Apr  2 18:56:52 2024 .. Created module 12 | logLik: -1224026.19892208
    #> Tue Apr  2 18:56:52 2024 .. Created module 13 | logLik: -1224675.63005464
    #> Tue Apr  2 18:56:52 2024 .. Created module 14 | logLik: -1225317.91966369
    #> Tue Apr  2 18:56:52 2024 .. Created module 15 | logLik: -1225971.50555157
    #> Tue Apr  2 18:56:52 2024 .. Created module 16 | logLik: -1226557.7881506
    #> Tue Apr  2 18:56:52 2024 .. Created module 17 | logLik: -1227080.13473523
    #> Tue Apr  2 18:56:53 2024 .. Created module 18 | logLik: -1227603.99622355
    #> Tue Apr  2 18:56:53 2024 .. Created module 19 | logLik: -1228247.84169741
    #> Tue Apr  2 18:56:53 2024 .. Created module 20 | logLik: -1228828.70617002
    #> Tue Apr  2 18:56:53 2024 .. Calculating perplexity
    #> ==================================================
    #> Completed recursive module splitting. Total time: 3.238301 secs
    #> ==================================================
    
    ## Example results with perplexity
    plotGridSearchPerplexity(moduleSplit)
    
    
    ## Select model for downstream analysis
    celdaMod <- subsetCeldaList(moduleSplit, list(L = 10))
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/reorderCelda.html ================================================ Reorder cells populations and/or features modules using hierarchical clustering — reorderCelda • celda

    Apply hierarchical clustering to reorder the cell populations and/or feature modules and group similar ones together based on the cosine distance of the factorized matrix from factorizeMatrix.

    reorderCelda(
      x,
      celdaMod,
      useAssay = "counts",
      altExpName = "featureSubset",
      method = "complete"
    )
    
    # S4 method for SingleCellExperiment,ANY
    reorderCelda(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      method = "complete"
    )
    
    # S4 method for matrix,celda_CG
    reorderCelda(x, celdaMod, method = "complete")
    
    # S4 method for matrix,celda_C
    reorderCelda(x, celdaMod, method = "complete")
    
    # S4 method for matrix,celda_G
    reorderCelda(x, celdaMod, method = "complete")

    Arguments

    x

    Can be one of

    • A SingleCellExperiment object returned by celda_C, celda_G or celda_CG, with the matrix located in the useAssay assay slot in altExp(x, altExpName). Rows represent features and columns represent cells.

    • Integer count matrix. Rows represent features and columns represent cells. This matrix should be the same as the one used to generate celdaMod.

    celdaMod

    Celda model object. Only works if x is an integer counts matrix. Ignored if x is a SingleCellExperiment object.

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot. Default "featureSubset".

    method

    Passed to hclust. The agglomeration method to be used to be used. Default "complete".

    Value

    A SingleCellExperiment object (or Celda model object) with updated cell cluster and/or feature module labels.

    Examples

    data(sceCeldaCG)
    reordersce <- reorderCelda(sceCeldaCG)
    #> Cluster labels are converted to factors.
    #> Module labels are converted to factors.
    data(celdaCGSim, celdaCGMod)
    reorderCeldaCG <- reorderCelda(celdaCGSim$counts, celdaCGMod)
    data(celdaCSim, celdaCMod)
    reorderCeldaC <- reorderCelda(celdaCSim$counts, celdaCMod)
    data(celdaGSim, celdaGMod)
    reorderCeldaG <- reorderCelda(celdaGSim$counts, celdaGMod)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/reportceldaCG.html ================================================ Generate an HTML report for celda_CG — reportceldaCG • celda

    reportCeldaCGRun will run recursiveSplitModule and recursiveSplitCell to find the number of modules (L) and the number of cell populations (K). A final celda_CG model will be selected from recursiveSplitCell. After a celda_CG model has been fit, reportCeldaCGPlotResults can be used to create an HTML report for visualization and exploration of the celda_CG model results. Some of the plotting and feature selection functions require the installation of the Bioconductor package singleCellTK.

    reportCeldaCGRun(
      sce,
      L,
      K,
      sampleLabel = NULL,
      altExpName = "featureSubset",
      useAssay = "counts",
      initialL = 10,
      maxL = 150,
      initialK = 5,
      maxK = 50,
      minCell = 3,
      minCount = 3,
      maxFeatures = 5000,
      output_file = "CeldaCG_RunReport",
      output_sce_prefix = "celda_cg",
      output_dir = ".",
      pdf = FALSE,
      showSession = TRUE
    )
    
    reportCeldaCGPlotResults(
      sce,
      reducedDimName,
      features = NULL,
      displayName = NULL,
      altExpName = "featureSubset",
      useAssay = "counts",
      cellAnnot = NULL,
      cellAnnotLabel = NULL,
      exactMatch = TRUE,
      moduleFilePrefix = "module_features",
      output_file = "CeldaCG_ResultReport",
      output_dir = ".",
      pdf = FALSE,
      showSetup = TRUE,
      showSession = TRUE
    )

    Arguments

    sce

    A SingleCellExperiment with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells.

    L

    Integer. Final number of feature modules. See celda_CG for more information.

    K

    Integer. Final number of cell populations. See celda_CG for more information.

    sampleLabel

    Vector or factor. Denotes the sample label for each cell (column) in the count matrix.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    useAssay

    A string specifying which assay slot to use. Default "counts".

    initialL

    Integer. Minimum number of modules to try. See recursiveSplitModule for more information. Defailt 10.

    maxL

    Integer. Maximum number of modules to try. See recursiveSplitModule for more information. Default 150.

    initialK

    Integer. Initial number of cell populations to try.

    maxK

    Integer. Maximum number of cell populations to try.

    minCell

    Integer. Minimum number of cells required for feature selection. See selectFeatures for more information. Default 3.

    minCount

    Integer. Minimum number of counts required for feature selection. See selectFeatures for more information. Default 3.

    maxFeatures

    Integer. Maximum number of features to include. If the number of features after filtering for minCell and minCount are greater than maxFeature, then Seurat's VST function is used to select the top variable features. Default 5000.

    output_file

    Character. Prefix of the html file. Default "CeldaCG_ResultReport".

    output_sce_prefix

    Character. The sce object with celda_CG results will be saved to an .rds file starting with this prefix. Default celda_cg.

    output_dir

    Character. Path to save the html file. Default ..

    pdf

    Boolean. Whether to create PDF versions of each plot in addition to PNGs. Default FALSE.

    showSession

    Boolean. Whether to show the session information at the end. Default TRUE.

    reducedDimName

    Character. Name of the reduced dimensional object to be used in 2-D scatter plots throughout the report. Default celda_UMAP.

    features

    Character vector. Expression of these features will be displayed on a reduced dimensional plot defined by reducedDimName. If NULL, then no plotting of features on a reduced dimensinoal plot will be performed. Default NULL.

    displayName

    Character. The name to use for display in scatter plots and heatmaps. If NULL, then the rownames of the sce object will be used. This can also be set to the name of a column in the row data of sce or altExp(sce, altExpName). Default NULL.

    cellAnnot

    Character vector. The cell-level annotations to display on the reduced dimensional plot. These variables should be present in the column data of the sce object. Default NULL.

    cellAnnotLabel

    Character vector. Additional cell-level annotations to display on the reduced dimensional plot. Variables will be treated as categorial and labels for each group will be placed on the plot. These variables should be present in the column data of the sce object. Default NULL.

    exactMatch

    Boolean. Whether to only identify exact matches or to identify partial matches using grep. Default FALSE.

    moduleFilePrefix

    Character. The features in each module will be written to a a csv file starting with this name. If NULL, then no file will be written. Default "module_features".

    showSetup

    Boolean. Whether to show the setup code at the beginning. Default TRUE.

    Value

    .html file

    Examples

    data(sceCeldaCG)
    if (FALSE) {
    library(SingleCellExperiment)
    sceCeldaCG$sum <- colSums(counts(sceCeldaCG))
    rowData(sceCeldaCG)$rownames <- rownames(sceCeldaCG)
    sceCeldaCG <- reportCeldaCGRun(sceCeldaCG,
        initialL = 5, maxL = 20, initialK = 5,
        maxK = 20, L = 10, K = 5)
    reportCeldaCGPlotResults(sce = sceCeldaCG,
        reducedDimName = "celda_UMAP",
        features = c("Gene_1", "Gene_100"),
        displayName = "rownames",
        cellAnnot="sum")
    }
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/resList.html ================================================ Get final celdaModels from a celda model SCE or celdaList object — resList • celda

    Returns all celda models generated during a celdaGridSearch run.

    resList(x, altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment
    resList(x, altExpName = "featureSubset")
    
    # S4 method for celdaList
    resList(x)

    Arguments

    x

    An object of class SingleCellExperiment or celdaList.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    List. Contains one celdaModel object for each of the parameters specified in runParams(x).

    Examples

    data(sceCeldaCGGridSearch)
    celdaCGGridModels <- resList(sceCeldaCGGridSearch)
    data(celdaCGGridSearchRes)
    celdaCGGridModels <- resList(celdaCGGridSearchRes)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/resamplePerplexity.html ================================================ Calculate and visualize perplexity of all models in a celdaList — resamplePerplexity • celda

    Calculates the perplexity of each model's cluster assignments given the provided countMatrix, as well as resamplings of that count matrix, providing a distribution of perplexities and a better sense of the quality of a given K/L choice.

    resamplePerplexity(
      x,
      celdaList,
      useAssay = "counts",
      altExpName = "featureSubset",
      doResampling = FALSE,
      numResample = 5,
      seed = 12345
    )
    
    # S4 method for SingleCellExperiment
    resamplePerplexity(
      x,
      useAssay = "counts",
      altExpName = "featureSubset",
      doResampling = FALSE,
      numResample = 5,
      seed = 12345
    )
    
    # S4 method for ANY
    resamplePerplexity(
      x,
      celdaList,
      doResampling = FALSE,
      numResample = 5,
      seed = 12345
    )

    Arguments

    x

    A numeric matrix of counts or a SingleCellExperiment returned from celdaGridSearch with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells. Must contain "celda_grid_search" slot in metadata(x) if x is a SingleCellExperiment object.

    celdaList

    Object of class 'celdaList'. Used only if x is a matrix object.

    useAssay

    A string specifying which assay slot to use if x is a SingleCellExperiment object. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    doResampling

    Boolean. If TRUE, then each cell in the counts matrix will be resampled according to a multinomial distribution to introduce noise before calculating perplexity. Default FALSE.

    numResample

    Integer. The number of times to resample the counts matrix for evaluating perplexity if doResampling is set to TRUE. Default 5.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    Value

    A SingleCellExperiment object or

    celdaList object with a perplexity

    property, detailing the perplexity of all K/L combinations that appeared in the celdaList's models.

    Examples

    data(sceCeldaCGGridSearch)
    sce <- resamplePerplexity(sceCeldaCGGridSearch)
    plotGridSearchPerplexity(sce)
    
    data(celdaCGSim, celdaCGGridSearchRes)
    celdaCGGridSearchRes <- resamplePerplexity(
      celdaCGSim$counts,
      celdaCGGridSearchRes
    )
    plotGridSearchPerplexity(celdaCGGridSearchRes)
    
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/retrieveFeatureIndex.html ================================================ Retrieve row index for a set of features — retrieveFeatureIndex • celda

    This will return indices of features among the rownames or rowData of a data.frame, matrix, or a SummarizedExperiment object including a SingleCellExperiment. Partial matching (i.e. grepping) can be used by setting exactMatch = FALSE.

    retrieveFeatureIndex(
      features,
      x,
      by = "rownames",
      exactMatch = TRUE,
      removeNA = FALSE
    )

    Arguments

    features

    Character vector of feature names to find in the rows of x.

    x

    A data.frame, matrix, or SingleCellExperiment object to search.

    by

    Character. Where to search for features in x. If set to "rownames" then the features will be searched for among rownames(x). If x inherits from class SummarizedExperiment, then by can be one of the fields in the row annotation data.frame (i.e. one of colnames(rowData(x))).

    exactMatch

    Boolean. Whether to only identify exact matches or to identify partial matches using grep.

    removeNA

    Boolean. If set to FALSE, features not found in x will be given NA and the returned vector will be the same length as features. If set to TRUE, then the NA values will be removed from the returned vector. Default FALSE.

    Value

    A vector of row indices for the matching features in x.

    See also

    'retrieveFeatureInfo' from package 'scater' and link{regex} for how to use regular expressions when exactMatch = FALSE.

    Author

    Yusuke Koga, Joshua Campbell

    Examples

    data(celdaCGSim)
    retrieveFeatureIndex(c("Gene_1", "Gene_5"), celdaCGSim$counts)
    #> [1] 1 5
    retrieveFeatureIndex(c("Gene_1", "Gene_5"), celdaCGSim$counts,
                                                exactMatch = FALSE)
    #> Warning: Feature 'Gene_1' matched multiple items in 'rownames': Gene_1,Gene_10,Gene_11,Gene_12,Gene_13,Gene_14,Gene_15,Gene_16,Gene_17,Gene_18,Gene_19,Gene_100. Only the first match will be selected.
    #> Warning: Feature 'Gene_5' matched multiple items in 'rownames': Gene_5,Gene_50,Gene_51,Gene_52,Gene_53,Gene_54,Gene_55,Gene_56,Gene_57,Gene_58,Gene_59. Only the first match will be selected.
    #> [1] 1 5
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/runParams.html ================================================ Get run parameters from a celda model SingleCellExperiment or celdaList object — runParams • celda

    Returns details on the clustering parameters and model priors from the celdaList object when it was created.

    runParams(x, altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment
    runParams(x, altExpName = "featureSubset")
    
    # S4 method for celdaList
    runParams(x)

    Arguments

    x

    An object of class SingleCellExperiment or class celdaList.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    Data Frame. Contains details on the various K/L parameters, chain parameters, seed, and final log-likelihoods derived for each model in the provided celdaList.

    Examples

    data(sceCeldaCGGridSearch)
    runParams(sceCeldaCGGridSearch)
    #>   index chain K  L  seed logLikelihood mean_perplexity
    #> 1     1     1 4  9 12345      -1226407        46.95862
    #> 2     2     1 5  9 12345      -1214714        45.25311
    #> 3     3     1 6  9 12345      -1214754        45.25124
    #> 4     4     1 4 10 12345      -1225233        46.77100
    #> 5     5     1 5 10 12345      -1212891        44.98559
    #> 6     6     1 6 10 12345      -1212928        44.98491
    #> 7     7     1 4 11 12345      -1225255        46.77053
    #> 8     8     1 5 11 12345      -1212917        44.98514
    #> 9     9     1 6 11 12345      -1212953        44.98495
    data(celdaCGGridSearchRes)
    runParams(celdaCGGridSearchRes)
    #>   index chain K  L logLikelihood mean_perplexity
    #> 1     1     1 4  9      -1228381        47.12902
    #> 2     2     1 5  9      -1217364        45.51741
    #> 3     3     1 6  9      -1217407        45.51451
    #> 4     4     1 4 10      -1227891        47.04897
    #> 5     5     1 5 10      -1215541        45.24781
    #> 6     6     1 6 10      -1215583        45.24439
    #> 7     7     1 4 11      -1227913        47.04849
    #> 8     8     1 5 11      -1215567        45.24733
    #> 9     9     1 6 11      -1215619        45.24380
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/sampleCells.html ================================================ sampleCells — sampleCells • celda

    A matrix of simulated gene counts.

    sampleCells

    Format

    A matrix of simulated gene counts with 10 rows (genes) and 10 columns (cells).

    Details

    A toy count matrix for use with celda.

    Generated by Josh Campbell.

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/sampleLabel.html ================================================ Get or set sample labels from a celda SingleCellExperiment object — sampleLabel • celda

    Return or set the sample labels for the cells in sce.

    sampleLabel(x, altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment
    sampleLabel(x, altExpName = "featureSubset")
    
    sampleLabel(x, altExpName = "featureSubset") <- value
    
    # S4 method for SingleCellExperiment
    sampleLabel(x, altExpName = "featureSubset") <- value
    
    # S4 method for celdaModel
    sampleLabel(x)

    Arguments

    x

    Can be one of

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    value

    Character vector of sample labels for replacements. Works only is x is a SingleCellExperiment object.

    Value

    Character vector. Contains the sample labels provided at model creation, or those automatically generated by celda.

    Examples

    data(sceCeldaCG)
    sampleLabel(sceCeldaCG)
    #>   [1] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>   [9] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [17] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [25] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [33] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [41] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [49] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [57] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [65] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [73] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [81] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [89] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_2 Sample_2 Sample_2
    #>  [97] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [105] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [113] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [121] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [129] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [137] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [145] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [153] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [161] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [169] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [177] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [185] Sample_2 Sample_2 Sample_2 Sample_2 Sample_3 Sample_3 Sample_3 Sample_3
    #> [193] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [201] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [209] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [217] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [225] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [233] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [241] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [249] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [257] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [265] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [273] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [281] Sample_3 Sample_3 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [289] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [297] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [305] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [313] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [321] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [329] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [337] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [345] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_5 Sample_5 Sample_5
    #> [353] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [361] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [369] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [377] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [385] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [393] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [401] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [409] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [417] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [425] Sample_5
    #> Levels: Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
    data(celdaCGMod)
    sampleLabel(celdaCGMod)
    #>   [1] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>   [9] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [17] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [25] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [33] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [41] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [49] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [57] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [65] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [73] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [81] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_1
    #>  [89] Sample_1 Sample_1 Sample_1 Sample_1 Sample_1 Sample_2 Sample_2 Sample_2
    #>  [97] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [105] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [113] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [121] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [129] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [137] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [145] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [153] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [161] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [169] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [177] Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2 Sample_2
    #> [185] Sample_2 Sample_2 Sample_2 Sample_2 Sample_3 Sample_3 Sample_3 Sample_3
    #> [193] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [201] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [209] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [217] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [225] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [233] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [241] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [249] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [257] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [265] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [273] Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3 Sample_3
    #> [281] Sample_3 Sample_3 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [289] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [297] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [305] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [313] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [321] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [329] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [337] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_4
    #> [345] Sample_4 Sample_4 Sample_4 Sample_4 Sample_4 Sample_5 Sample_5 Sample_5
    #> [353] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [361] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [369] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [377] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [385] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [393] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [401] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [409] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [417] Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5 Sample_5
    #> [425] Sample_5
    #> Levels: Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/sceCeldaC.html ================================================ sceCeldaC — sceCeldaC • celda

    A SingleCellExperiment object containing the results of running selectFeatures and celda_C on celdaCSim.

    sceCeldaC

    Format

    A SingleCellExperiment object

    Examples

    data(celdaCSim)
    sceCeldaC <- selectFeatures(celdaCSim$counts)
    sceCeldaC <- celda_C(sceCeldaC,
        K = celdaCSim$K,
        sampleLabel = celdaCSim$sampleLabel,
        nchains = 1)
    #> --------------------------------------------------
    #> Starting Celda_C: Clustering cells.
    #> --------------------------------------------------
    #> Tue Apr  2 18:57:00 2024 .. Initializing 'z' in chain 1 with 'split' 
    #> Tue Apr  2 18:57:00 2024 .... Completed iteration: 1 | logLik: -1282027.27277705
    #> Tue Apr  2 18:57:00 2024 .... Completed iteration: 2 | logLik: -1282027.27277705
    #> Tue Apr  2 18:57:00 2024 .. Finished chain 1
    #> --------------------------------------------------
    #> Completed Celda_C. Total time: 0.08842993 secs
    #> --------------------------------------------------
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/sceCeldaCG.html ================================================ sceCeldaCG — sceCeldaCG • celda

    A SingleCellExperiment object containing the results of running selectFeatures and celda_CG on celdaCGSim.

    sceCeldaCG

    Format

    A SingleCellExperiment object

    Examples

    data(celdaCGSim)
    sceCeldaCG <- selectFeatures(celdaCGSim$counts)
    sceCeldaCG <- celda_CG(sceCeldaCG,
        K = celdaCGSim$K,
        L = celdaCGSim$L,
        sampleLabel = celdaCGSim$sampleLabel,
        nchains = 1)
    #> --------------------------------------------------
    #> Starting Celda_CG: Clustering cells and genes.
    #> --------------------------------------------------
    #> Tue Apr  2 18:57:01 2024 .. Initializing 'z' in chain 1 with 'split' 
    #> Tue Apr  2 18:57:01 2024 .. Initializing 'y' in chain 1 with 'split' 
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 1 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 2 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 3 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 4 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 5 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 6 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 7 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 8 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 9 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .... Determining if any gene clusters should be split.
    #> Tue Apr  2 18:57:04 2024 .... No additional splitting was performed.
    #> Tue Apr  2 18:57:04 2024 .... Determining if any cell clusters should be split.
    #> Tue Apr  2 18:57:04 2024 .... No additional splitting was performed.
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 10 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .... Determining if any cell clusters should be split.
    #> Tue Apr  2 18:57:04 2024 .... No additional splitting was performed.
    #> Tue Apr  2 18:57:04 2024 .... Completed iteration: 11 | logLik: -1212891.16546068
    #> Tue Apr  2 18:57:04 2024 .. Finished chain 1
    #> --------------------------------------------------
    #> Completed Celda_CG. Total time: 3.333782 secs
    #> --------------------------------------------------
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/sceCeldaCGGridSearch.html ================================================ sceCeldaCGGridSearch — sceCeldaCGGridSearch • celda

    A SingleCellExperiment object containing the results of running selectFeatures and celdaGridSearch on celdaCGSim.

    sceCeldaCGGridSearch

    Format

    A SingleCellExperiment object

    Examples

    data(celdaCGSim)
    sce <- selectFeatures(celdaCGSim$counts)
    sceCeldaCGGridSearch <- celdaGridSearch(sce,
        model = "celda_CG",
        paramsTest = list(K = seq(4, 6), L = seq(9, 11)),
        paramsFixed = list(sampleLabel = celdaCGSim$sampleLabel),
        bestOnly = TRUE,
        nchains = 1,
        cores = 1,
        verbose = FALSE)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/sceCeldaG.html ================================================ sceCeldaG — sceCeldaG • celda

    A SingleCellExperiment object containing the results of running selectFeatures and celda_G on celdaGSim.

    sceCeldaG

    Format

    A SingleCellExperiment object

    Examples

    data(celdaGSim)
    sceCeldaG <- selectFeatures(celdaGSim$counts)
    sceCeldaG <- celda_G(sceCeldaG, L = celdaGSim$L, nchains = 1)
    #> --------------------------------------------------
    #> Starting Celda_G: Clustering genes.
    #> --------------------------------------------------
    #> Tue Apr  2 18:57:47 2024 .. Initializing 'y' in chain 1 with 'split' 
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 1 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 2 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 3 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 4 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 5 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 6 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 7 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 8 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 9 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .... Determining if any gene clusters should be split.
    #> Tue Apr  2 18:57:47 2024 .... No additional splitting was performed.
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 10 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .... Completed iteration: 11 | logLik: -289209.476865707
    #> Tue Apr  2 18:57:47 2024 .. Finished chain 1
    #> --------------------------------------------------
    #> Completed Celda_G. Total time: 0.4345939 secs
    #> --------------------------------------------------
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/selectBestModel.html ================================================ Select best chain within each combination of parameters — selectBestModel • celda

    Select the chain with the best log likelihood for each combination of tested parameters from a SCE object gererated by celdaGridSearch or from a celdaList object.

    selectBestModel(x, asList = FALSE, altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment
    selectBestModel(x, asList = FALSE, altExpName = "featureSubset")
    
    # S4 method for celdaList
    selectBestModel(x, asList = FALSE)

    Arguments

    x

    Can be one of

    • A SingleCellExperiment object returned from celdaGridSearch, recursiveSplitModule, or recursiveSplitCell. Must contain a list named "celda_grid_search" in metadata(x).

    • celdaList object.

    asList

    TRUE or FALSE. Whether to return the best model as a celdaList object or not. If FALSE, return the best model as a corresponding celda model object.

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    One of

    • A new SingleCellExperiment object containing one model with the best log-likelihood for each set of parameters in metadata(x). If there is only one set of parameters, a new SingleCellExperiment object with the matching model stored in the metadata "celda_parameters" slot will be returned. Otherwise, a new SingleCellExperiment object with the subset models stored in the metadata "celda_grid_search" slot will be returned.

    • A new celdaList object containing one model with the best log-likelihood for each set of parameters. If only one set of parameters is in the celdaList, the best model will be returned directly instead of a celdaList object.

    Examples

    data(sceCeldaCGGridSearch)
    ## Returns same result as running celdaGridSearch with "bestOnly = TRUE"
    sce <- selectBestModel(sceCeldaCGGridSearch)
    data(celdaCGGridSearchRes)
    ## Returns same result as running celdaGridSearch with "bestOnly = TRUE"
    cgsBest <- selectBestModel(celdaCGGridSearchRes)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/selectFeatures.html ================================================ Simple feature selection by feature counts — selectFeatures • celda

    A simple heuristic feature selection procedure. Select features with at least minCount counts in at least minCell cells. A SingleCellExperiment object with subset features will be stored in the altExp slot with name altExpName. The name of the assay slot in altExp will be the same as useAssay.

    selectFeatures(
      x,
      minCount = 3,
      minCell = 3,
      useAssay = "counts",
      altExpName = "featureSubset"
    )
    
    # S4 method for SingleCellExperiment
    selectFeatures(
      x,
      minCount = 3,
      minCell = 3,
      useAssay = "counts",
      altExpName = "featureSubset"
    )
    
    # S4 method for matrix
    selectFeatures(
      x,
      minCount = 3,
      minCell = 3,
      useAssay = "counts",
      altExpName = "featureSubset"
    )

    Arguments

    x

    A numeric matrix of counts or a SingleCellExperiment with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells.

    minCount

    Minimum number of counts required for feature selection.

    minCell

    Minimum number of cells required for feature selection.

    useAssay

    A string specifying the name of the assay slot to use. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    A SingleCellExperiment object with a

    altExpName

    altExp slot. Function parameter settings are stored in the metadata

    "select_features" slot.

    Examples

    data(sceCeldaCG)
    sce <- selectFeatures(sceCeldaCG)
    data(celdaCGSim)
    sce <- selectFeatures(celdaCGSim$counts)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/semiPheatmap.html ================================================ A function to draw clustered heatmaps. — semiPheatmap • celda

    A function to draw clustered heatmaps where one has better control over some graphical parameters such as cell size, etc.

    The function also allows to aggregate the rows using kmeans clustering. This is advisable if number of rows is so big that R cannot handle their hierarchical clustering anymore, roughly more than 1000. Instead of showing all the rows separately one can cluster the rows in advance and show only the cluster centers. The number of clusters can be tuned with parameter kmeansK.

    semiPheatmap(
      mat,
      color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(100),
      kmeansK = NA,
      breaks = NA,
      borderColor = "grey60",
      cellWidth = NA,
      cellHeight = NA,
      scale = "none",
      clusterRows = TRUE,
      clusterCols = TRUE,
      clusteringDistanceRows = "euclidean",
      clusteringDistanceCols = "euclidean",
      clusteringMethod = "complete",
      clusteringCallback = .identity2,
      cutreeRows = NA,
      cutreeCols = NA,
      treeHeightRow = ifelse(clusterRows, 50, 0),
      treeHeightCol = ifelse(clusterCols, 50, 0),
      legend = TRUE,
      legendBreaks = NA,
      legendLabels = NA,
      annotationRow = NA,
      annotationCol = NA,
      annotation = NA,
      annotationColors = NA,
      annotationLegend = TRUE,
      annotationNamesRow = TRUE,
      annotationNamesCol = TRUE,
      dropLevels = TRUE,
      showRownames = TRUE,
      showColnames = TRUE,
      main = NA,
      fontSize = 10,
      fontSizeRow = fontSize,
      fontSizeCol = fontSize,
      displayNumbers = FALSE,
      numberFormat = "%.2f",
      numberColor = "grey30",
      fontSizeNumber = 0.8 * fontSize,
      gapsRow = NULL,
      gapsCol = NULL,
      labelsRow = NULL,
      labelsCol = NULL,
      fileName = NA,
      width = NA,
      height = NA,
      silent = FALSE,
      rowLabel,
      colLabel,
      rowGroupOrder = NULL,
      colGroupOrder = NULL,
      ...
    )

    Arguments

    mat

    numeric matrix of the values to be plotted.

    color

    vector of colors used in heatmap.

    kmeansK

    the number of kmeans clusters to make, if we want to agggregate the rows before drawing heatmap. If NA then the rows are not aggregated.

    breaks

    Numeric vector. A sequence of numbers that covers the range of values in the normalized `counts`. Values in the normalized `matrix` are assigned to each bin in `breaks`. Each break is assigned to a unique color from `col`. If NULL, then breaks are calculated automatically. Default NULL.

    borderColor

    color of cell borders on heatmap, use NA if no border should be drawn.

    cellWidth

    individual cell width in points. If left as NA, then the values depend on the size of plotting window.

    cellHeight

    individual cell height in points. If left as NA, then the values depend on the size of plotting window.

    scale

    character indicating if the values should be centered and scaled in either the row direction or the column direction, or none. Corresponding values are "row", "column" and "none".

    clusterRows

    boolean values determining if rows should be clustered or hclust object,

    clusterCols

    boolean values determining if columns should be clustered or hclust object.

    clusteringDistanceRows

    distance measure used in clustering rows. Possible values are "correlation" for Pearson correlation and all the distances supported by dist, such as "euclidean", etc. If the value is none of the above it is assumed that a distance matrix is provided.

    clusteringDistanceCols

    distance measure used in clustering columns. Possible values the same as for clusteringDistanceRows.

    clusteringMethod

    clustering method used. Accepts the same values as hclust.

    clusteringCallback

    callback function to modify the clustering. Is called with two parameters: original hclust object and the matrix used for clustering. Must return a hclust object.

    cutreeRows

    number of clusters the rows are divided into, based on the hierarchical clustering (using cutree), if rows are not clustered, the argument is ignored

    cutreeCols

    similar to cutreeRows, but for columns

    treeHeightRow

    the height of a tree for rows, if these are clustered. Default value 50 points.

    treeHeightCol

    the height of a tree for columns, if these are clustered. Default value 50 points.

    legend

    logical to determine if legend should be drawn or not.

    legendBreaks

    vector of breakpoints for the legend.

    legendLabels

    vector of labels for the legendBreaks.

    annotationRow

    data frame that specifies the annotations shown on left side of the heatmap. Each row defines the features for a specific row. The rows in the data and in the annotation are matched using corresponding row names. Note that color schemes takes into account if variable is continuous or discrete.

    annotationCol

    similar to annotationRow, but for columns.

    annotation

    deprecated parameter that currently sets the annotationCol if it is missing.

    annotationColors

    list for specifying annotationRow and annotationCol track colors manually. It is possible to define the colors for only some of the features. Check examples for details.

    annotationLegend

    boolean value showing if the legend for annotation tracks should be drawn.

    annotationNamesRow

    boolean value showing if the names for row annotation tracks should be drawn.

    annotationNamesCol

    boolean value showing if the names for column annotation tracks should be drawn.

    dropLevels

    logical to determine if unused levels are also shown in the legend.

    showRownames

    boolean specifying if column names are be shown.

    showColnames

    boolean specifying if column names are be shown.

    main

    the title of the plot

    fontSize

    base fontsize for the plot

    fontSizeRow

    fontsize for rownames (Default: fontsize)

    fontSizeCol

    fontsize for colnames (Default: fontsize)

    displayNumbers

    logical determining if the numeric values are also printed to the cells. If this is a matrix (with same dimensions as original matrix), the contents of the matrix are shown instead of original values.

    numberFormat

    format strings (C printf style) of the numbers shown in cells. For example "%.2f" shows 2 decimal places and "%.1e" shows exponential notation (see more in sprintf).

    numberColor

    color of the text

    fontSizeNumber

    fontsize of the numbers displayed in cells

    gapsRow

    vector of row indices that show shere to put gaps into heatmap. Used only if the rows are not clustered. See cutreeRow to see how to introduce gaps to clustered rows.

    gapsCol

    similar to gapsRow, but for columns.

    labelsRow

    custom labels for rows that are used instead of rownames.

    labelsCol

    similar to labelsRow, but for columns.

    fileName

    file path where to save the picture. Filetype is decided by the extension in the path. Currently following formats are supported: png, pdf, tiff, bmp, jpeg. Even if the plot does not fit into the plotting window, the file size is calculated so that the plot would fit there, unless specified otherwise.

    width

    manual option for determining the output file width in inches.

    height

    manual option for determining the output file height in inches.

    silent

    do not draw the plot (useful when using the gtable output)

    rowLabel

    row cluster labels for semi-clustering

    colLabel

    column cluster labels for semi-clustering

    rowGroupOrder

    Vector. Specifies the order of feature clusters when semisupervised clustering is performed on the y labels.

    colGroupOrder

    Vector. Specifies the order of cell clusters when semisupervised clustering is performed on the z labels.

    ...

    graphical parameters for the text used in plot. Parameters passed to grid.text, see gpar.

    Value

    Invisibly a list of components

    • treeRow the clustering of rows as hclust object

    • treeCol the clustering of columns as hclust object

    • kmeans the kmeans clustering of rows if parameter kmeansK was specified

    Author

    Raivo Kolde <rkolde@gmail.com> #@examples # Create test matrix test = matrix(rnorm(200), 20, 10) test[seq(10), seq(1, 10, 2)] = test[seq(10), seq(1, 10, 2)] + 3 test[seq(11, 20), seq(2, 10, 2)] = test[seq(11, 20), seq(2, 10, 2)] + 2 test[seq(15, 20), seq(2, 10, 2)] = test[seq(15, 20), seq(2, 10, 2)] + 4 colnames(test) = paste("Test", seq(10), sep = "") rownames(test) = paste("Gene", seq(20), sep = "")

    # Draw heatmaps pheatmap(test) pheatmap(test, kmeansK = 2) pheatmap(test, scale = "row", clusteringDistanceRows = "correlation") pheatmap(test, color = colorRampPalette(c("navy", "white", "firebrick3"))(50)) pheatmap(test, cluster_row = FALSE) pheatmap(test, legend = FALSE)

    # Show text within cells pheatmap(test, displayNumbers = TRUE) pheatmap(test, displayNumbers = TRUE, numberFormat = "%.1e") pheatmap(test, displayNumbers = matrix(ifelse(test > 5, "*", ""), nrow(test))) pheatmap(test, cluster_row = FALSE, legendBreaks = seq(-1, 4), legendLabels = c("0", "1e-4", "1e-3", "1e-2", "1e-1", "1"))

    # Fix cell sizes and save to file with correct size pheatmap(test, cellWidth = 15, cellHeight = 12, main = "Example heatmap") pheatmap(test, cellWidth = 15, cellHeight = 12, fontSize = 8, fileName = "test.pdf")

    # Generate annotations for rows and columns annotationCol = data.frame(CellType = factor(rep(c("CT1", "CT2"), 5)), Time = seq(5)) rownames(annotationCol) = paste("Test", seq(10), sep = "")

    annotationRow = data.frame(GeneClass = factor(rep(c("Path1", "Path2", "Path3"), c(10, 4, 6)))) rownames(annotationRow) = paste("Gene", seq(20), sep = "")

    # Display row and color annotations pheatmap(test, annotationCol = annotationCol) pheatmap(test, annotationCol = annotationCol, annotationLegend = FALSE) pheatmap(test, annotationCol = annotationCol, annotationRow = annotationRow)

    # Specify colors ann_colors = list(Time = c("white", "firebrick"), CellType = c(CT1 = "#1B9E77", CT2 = "#D95F02"), GeneClass = c(Path1 = "#7570B3", Path2 = "#E7298A", Path3 = "#66A61E"))

    pheatmap(test, annotationCol = annotationCol, annotationColors = ann_colors, main = "Title") pheatmap(test, annotationCol = annotationCol, annotationRow = annotationRow, annotationColors = ann_colors) pheatmap(test, annotationCol = annotationCol, annotationColors = ann_colors[2])

    # Gaps in heatmaps pheatmap(test, annotationCol = annotationCol, clusterRows = FALSE, gapsRow = c(10, 14)) pheatmap(test, annotationCol = annotationCol, clusterRows = FALSE, gapsRow = c(10, 14), cutreeCol = 2)

    # Show custom strings as row/col names labelsRow = c("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "Il10", "Il15", "Il1b")

    pheatmap(test, annotationCol = annotationCol, labelsRow = labelsRow)

    # Specifying clustering from distance matrix drows = stats::dist(test, method = "minkowski") dcols = stats::dist(t(test), method = "minkowski") pheatmap(test, clusteringDistanceRows = drows, clusteringDistanceCols = dcols)

    # Modify ordering of the clusters using clustering callback option callback = function(hc, mat) sv = svd(t(mat))$v[, 1] dend = reorder(as.dendrogram(hc), wts = sv) as.hclust(dend)

    pheatmap(test, clusteringCallback = callback)

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/simulateCells.html ================================================ Simulate count data from the celda generative models. — simulateCells • celda

    This function generates a SingleCellExperiment containing a simulated counts matrix in the "counts" assay slot, as well as various parameters used in the simulation which can be useful for running celda and are stored in metadata slot. The user must provide the desired model (one of celda_C, celda_G, celda_CG) as well as any desired tuning parameters for those model's simulation functions as detailed below.

    simulateCells(
      model = c("celda_CG", "celda_C", "celda_G"),
      S = 5,
      CRange = c(50, 100),
      NRange = c(500, 1000),
      C = 100,
      G = 100,
      K = 5,
      L = 10,
      alpha = 1,
      beta = 1,
      gamma = 5,
      delta = 1,
      seed = 12345
    )

    Arguments

    model

    Character. Options available in celda::availableModels. Can be one of "celda_CG", "celda_C", or "celda_G". Default "celda_CG".

    S

    Integer. Number of samples to simulate. Default 5. Only used if model is one of "celda_CG" or "celda_C".

    CRange

    Integer vector. A vector of length 2 that specifies the lower and upper bounds of the number of cells to be generated in each sample. Default c(50, 100). Only used if model is one of "celda_CG" or "celda_C".

    NRange

    Integer vector. A vector of length 2 that specifies the lower and upper bounds of the number of counts generated for each cell. Default c(500, 1000).

    C

    Integer. Number of cells to simulate. Default 100. Only used if model is "celda_G".

    G

    Integer. The total number of features to be simulated. Default 100.

    K

    Integer. Number of cell populations. Default 5. Only used if model is one of "celda_CG" or "celda_C".

    L

    Integer. Number of feature modules. Default 10. Only used if model is one of "celda_CG" or "celda_G".

    alpha

    Numeric. Concentration parameter for Theta. Adds a pseudocount to each cell population in each sample. Default 1. Only used if model is one of "celda_CG" or "celda_C".

    beta

    Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature module in each cell population. Default 1.

    gamma

    Numeric. Concentration parameter for Eta. Adds a pseudocount to the number of features in each module. Default 5. Only used if model is one of "celda_CG" or "celda_G".

    delta

    Numeric. Concentration parameter for Psi. Adds a pseudocount to each feature in each module. Default 1. Only used if model is one of "celda_CG" or "celda_G".

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    Value

    A SingleCellExperiment object with simulated count matrix stored in the "counts" assay slot. Function parameter settings are stored in the metadata slot. For

    "celda_CG" and "celda_C" models, columns celda_sample_label and celda_cell_cluster in

    colData contain simulated sample labels and cell population clusters. For "celda_CG" and "celda_G"

    models, column celda_feature_module in

    rowData contains simulated gene modules.

    Examples

    sce <- simulateCells()
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/simulateContamination.html ================================================ Simulate contaminated count matrix — simulateContamination • celda

    This function generates a list containing two count matrices -- one for real expression, the other one for contamination, as well as other parameters used in the simulation which can be useful for running decontamination.

    simulateContamination(
      C = 300,
      G = 100,
      K = 3,
      NRange = c(500, 1000),
      beta = 0.1,
      delta = c(1, 10),
      numMarkers = 3,
      seed = 12345
    )

    Arguments

    C

    Integer. Number of cells to be simulated. Default 300.

    G

    Integer. Number of genes to be simulated. Default 100.

    K

    Integer. Number of cell populations to be simulated. Default 3.

    NRange

    Integer vector. A vector of length 2 that specifies the lower and upper bounds of the number of counts generated for each cell. Default c(500, 1000).

    beta

    Numeric. Concentration parameter for Phi. Default 0.1.

    delta

    Numeric or Numeric vector. Concentration parameter for Theta. If input as a single numeric value, symmetric values for beta distribution are specified; if input as a vector of lenght 2, the two values will be the shape1 and shape2 paramters of the beta distribution respectively. Default c(1, 5).

    numMarkers

    Integer. Number of markers for each cell population. Default 3.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    Value

    A list containing the nativeMatirx (real expression), observedMatrix (real expression + contamination), as well as other parameters used in the simulation.

    Author

    Shiyi Yang, Yuan Yin, Joshua Campbell

    Examples

    contaminationSim <- simulateContamination(K = 3, delta = c(1, 10))
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/splitModule.html ================================================ Split celda feature module — splitModule • celda

    Manually select a celda feature module to split into 2 or more modules. Useful for splitting up modules that show divergent expression of features in multiple cell clusters.

    splitModule(
      x,
      module,
      useAssay = "counts",
      altExpName = "featureSubset",
      n = 2,
      seed = 12345
    )
    
    # S4 method for SingleCellExperiment
    splitModule(
      x,
      module,
      useAssay = "counts",
      altExpName = "featureSubset",
      n = 2,
      seed = 12345
    )

    Arguments

    x

    A SingleCellExperiment object with the matrix located in the assay slot under useAssay. Rows represent features and columns represent cells.

    module

    Integer. The module to be split.

    useAssay

    A string specifying which assay slot to use for x. Default "counts".

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    n

    Integer. How many modules should module be split into. Default 2.

    seed

    Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

    Value

    A updated SingleCellExperiment object with new feature modules stored in column celda_feature_module in

    rowData(x).

    Examples

    data(sceCeldaCG)
    # Split module 5 into 2 new modules.
    sce <- splitModule(sceCeldaCG, module = 5)
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/subsetCeldaList.html ================================================ Subset celda model from SCE object returned from celdaGridSearch — subsetCeldaList • celda

    Select a subset of models from a SingleCellExperiment object generated by celdaGridSearch that match the criteria in the argument params.

    subsetCeldaList(x, params, altExpName = "featureSubset")
    
    # S4 method for SingleCellExperiment
    subsetCeldaList(x, params, altExpName = "featureSubset")
    
    # S4 method for celdaList
    subsetCeldaList(x, params)

    Arguments

    x

    Can be one of

    • A SingleCellExperiment object returned from celdaGridSearch, recursiveSplitModule, or recursiveSplitCell. Must contain a list named "celda_grid_search" in metadata(x).

    • celdaList object.

    params

    List. List of parameters used to subset the matching celda models in list "celda_grid_search" in metadata(x).

    altExpName

    The name for the altExp slot to use. Default "featureSubset".

    Value

    One of

    • A new SingleCellExperiment object containing all models matching the provided criteria in params. If only one celda model result in the "celda_grid_search" slot in metadata(x) matches the given criteria, a new SingleCellExperiment object with the matching model stored in the metadata "celda_parameters" slot will be returned. Otherwise, a new SingleCellExperiment object with the subset models stored in the metadata "celda_grid_search" slot will be returned.

    • A new celdaList object containing all models matching the provided criteria in params. If only one item in the celdaList matches the given criteria, the matching model will be returned directly instead of a celdaList object.

    See also

    celdaGridSearch can run Celda with multiple parameters and chains in parallel. selectBestModel can get the best model for each combination of parameters.

    Examples

    data(sceCeldaCGGridSearch)
    sceK5L10 <- subsetCeldaList(sceCeldaCGGridSearch,
        params = list(K = 5, L = 10))
    data(celdaCGGridSearchRes)
    resK5L10 <- subsetCeldaList(celdaCGGridSearchRes,
        params = list(K = 5, L = 10))
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/reference/topRank.html ================================================ Identify features with the highest influence on clustering. — topRank • celda

    topRank() can quickly identify the top `n` rows for each column of a matrix. For example, this can be useful for identifying the top `n` features per cell.

    topRank(matrix, n = 25, margin = 2, threshold = 0, decreasing = TRUE)

    Arguments

    matrix

    Numeric matrix.

    n

    Integer. Maximum number of items above `threshold` returned for each ranked row or column.

    margin

    Integer. Dimension of `matrix` to rank, with 1 for rows, 2 for columns. Default 2.

    threshold

    Numeric. Only return ranked rows or columns in the matrix that are above this threshold. If NULL, then no threshold will be applied. Default 0.

    decreasing

    Logical. Specifies if the rank should be decreasing. Default TRUE.

    Value

    List. The `index` variable provides the top `n` row (feature) indices contributing the most to each column (cell). The `names` variable provides the rownames corresponding to these indexes.

    Examples

    data(sampleCells)
    topRanksPerCell <- topRank(sampleCells, n = 5)
    topFeatureNamesForCell <- topRanksPerCell$names[1]
    

    Site built with pkgdown 2.0.7.

    ================================================ FILE: docs/sitemap.xml ================================================ /404.html /CONDUCT.html /LICENSE-text.html /articles/articles/celda_pbmc3k.html /articles/articles/decontX_pbmc4k.html /articles/articles/installation.html /articles/celda.html /articles/celda_pbmc3k.html /articles/decontX.html /articles/decontX_pbmc4k.html /articles/index.html /articles/installation.html /authors.html /index.html /news/index.html /reference/appendCeldaList.html /reference/availableModels.html /reference/bestLogLikelihood.html /reference/celda.html /reference/celdaCGGridSearchRes.html /reference/celdaCGMod.html /reference/celdaCGSim.html /reference/celdaCMod.html /reference/celdaCSim.html /reference/celdaClusters.html /reference/celdaGMod.html /reference/celdaGSim.html /reference/celdaGridSearch.html /reference/celdaHeatmap.html /reference/celdaModel.html /reference/celdaModules.html /reference/celdaPerplexity-celdaList-method.html /reference/celdaPerplexity.html /reference/celdaProbabilityMap.html /reference/celdaTsne.html /reference/celdaUmap.html /reference/celda_C.html /reference/celda_CG.html /reference/celda_G.html /reference/celdatosce.html /reference/clusterProbability.html /reference/compareCountMatrix.html /reference/contaminationSim.html /reference/countChecksum-celdaList-method.html /reference/countChecksum.html /reference/decontX.html /reference/decontXcounts.html /reference/distinctColors.html /reference/eigenMatMultInt.html /reference/eigenMatMultNumeric.html /reference/factorizeMatrix.html /reference/fastNormProp.html /reference/fastNormPropLog.html /reference/fastNormPropSqrt.html /reference/featureModuleLookup.html /reference/featureModuleTable.html /reference/geneSetEnrich.html /reference/index.html /reference/logLikelihood.html /reference/logLikelihoodHistory.html /reference/matrixNames.html /reference/moduleHeatmap.html /reference/nonzero.html /reference/normalizeCounts.html /reference/params.html /reference/perplexity.html /reference/plotCeldaViolin.html /reference/plotDecontXContamination.html /reference/plotDecontXMarkerExpression.html /reference/plotDecontXMarkerPercentage.html /reference/plotDimReduceCluster.html /reference/plotDimReduceFeature.html /reference/plotDimReduceGrid.html /reference/plotDimReduceModule.html /reference/plotGridSearchPerplexity.html /reference/plotHeatmap.html /reference/plotRPC.html /reference/recodeClusterY.html /reference/recodeClusterZ.html /reference/recursiveSplitCell.html /reference/recursiveSplitModule.html /reference/reorderCelda.html /reference/reportceldaCG.html /reference/resList.html /reference/resamplePerplexity.html /reference/retrieveFeatureIndex.html /reference/runParams.html /reference/sampleCells.html /reference/sampleLabel.html /reference/sceCeldaC.html /reference/sceCeldaCG.html /reference/sceCeldaCGGridSearch.html /reference/sceCeldaG.html /reference/selectBestModel.html /reference/selectFeatures.html /reference/semiPheatmap.html /reference/simulateCells.html /reference/simulateContamination.html /reference/splitModule.html /reference/subsetCeldaList.html /reference/topRank.html ================================================ FILE: inst/rmarkdown/CeldaCG_PlotResults.Rmd ================================================ --- title: "Celda_CG Results" date: "`r format(Sys.time(), '%B %d, %Y')`" params: sce: sce altExpName: altExpName useAssay: useAssay reducedDimName: reducedDimName features: features displayName: displayName cellAnnot: cellAnnot cellAnnotLabel: cellAnnotLabel exactMatch: exactMatch moduleFileName: moduleFileName pdf: pdf showSetup: showSetup showSession: showSession output: html_document: toc: true toc_float: true number_sections: true theme: cosmo code_folding: hide --- # Celda_CG Results ## Setup The following code loads required libraries, reads in parameters, checks input variables, and sets up display options for the **celda_CG Results** report. ```{r setup, echo = showSetup} require(singleCellTK) require(celda) require(kableExtra) require(grid) require(knitr) require(ggplot2) require(gridExtra) require(SingleCellExperiment) sce <- params$sce features <- params$features altExpName <- params$altExpName useAssay <- params$useAssay reducedDimName <- params$reducedDimName displayName <- params$displayName cellAnnot <- params$cellAnnot cellAnnotLabel <- params$cellAnnotLabel exactMatch <- params$exactMatch moduleFileName <- params$moduleFileName pdf <- params$pdf showSession <- params$showSession cellAnnotFinal <- NULL plotLabels <- c() if (length(cellAnnot) > 0) { cellAnnotFinal <- c(cellAnnotFinal, cellAnnot) plotLabels <- c(plotLabels, rep(FALSE, length(cellAnnot))) } if (length(cellAnnotLabel) > 0) { cellAnnotFinal <- c(cellAnnotFinal, cellAnnotLabel) plotLabels <- c(plotLabels, rep(TRUE, length(cellAnnotLabel))) } if (!is.null(cellAnnotFinal)) { if (!all(cellAnnotFinal %in% colnames(colData(altExp(sce, e = altExpName))))) { s <- setdiff(cellAnnotFinal, colnames(colData(altExp(sce, e = altExpName)))) stop( "The following items in 'cellAnnot' or 'cellAnnotLabel' were not found ", "in the colData of the object: ", paste0(s, ",") ) } } L <- max(as.integer(celdaModules(sce))) K <- max(as.integer(celdaClusters(sce))) # Define tab levels tab3 <- "### %s {-} " tab4 <- "#### %s {-} " space <- " " markerHeight <- 9 if (!is.null(features)) { markerHeight <- max(9, 3 * (length(features) / 3)) } dev <- ifelse(isTRUE(pdf), c("png", "pdf"), c("png")) opts_chunk$set( echo = TRUE, cache = FALSE, cache.lazy = FALSE, cache.comments = FALSE, fig.align = "center", fig.keep = "all", dev = dev ) ``` ## Visualization of cells in 2-D embeddings {.tabset .tabset-fade} Reduced dimensional 2-D plots created by algorithms such as tSNE and UMAP are useful for visualizing the relationship between cells. Each point on the plot represents a single cell. Cells closer together on the plot have more similar expression profiles across all genes. The tabs below show the `r reducedDimName` dimensions colored by different variables. The *Cluster* tab colors cells by the `r K` subpopulation labels identified by celda_CG, The *Sample Labels* tab colors cells by the sample label supplied to celda_CG. If no sample label was supplied to celda_CG, then all cells will be the same color. The *Cell Annotations* tab contains colors points by other pre-specified cell-level annotations. ### Clusters ```{r celda_clusters} plotDimReduceCluster(sce, reducedDimName = reducedDimName, labelClusters = TRUE) ``` ### Sample Labels ```{r celda_samples} plotSCEDimReduceColData( altExp(sce, altExpName), reducedDimName = reducedDimName, colorBy = "celda_sample_label", labelClusters = FALSE, dotSize = 0.5 ) ``` ### Cell Annotations {.tabset .tabset-fade} ```{r celda_cellAnnot, results = "asis"} if (!is.null(cellAnnotFinal)) { for (i in seq_along(cellAnnotFinal)) { cat(sprintf(tab4, cellAnnotFinal[i])) if(isTRUE(plotLabels[i])) { conditionClass <- "factor" } else { conditionClass <- NULL } print( plotSCEDimReduceColData( altExp(sce, altExpName), colorBy = cellAnnotFinal[i], conditionClass = conditionClass, reducedDim = reducedDimName, labelClusters = plotLabels[i], dotSize = 0.5 ) ) cat(space) } } else { message( "No cell annotations are displayed. To plot cell annotations in this ", "section, you can pass the desired variables from the `colData` in the ", "SingleCellExperiment object via the 'cellAnnot' or 'cellAnnotLabel' ", "parameters." ) } ```

    ## Modules {.tabset .tabset-fade} Celda performs bi-clustering of features into modules and cells into subpopulations. Modules are groups of genes that are co-expressed across cells. The *Module Overview* tab contains the probabilities and heatmaps for each module. Module probability plots will color cells by the probability of each module on a 2-D embedding plot. Module heatmaps show the relative expression for the features in a module across the cells. Cells within the heatmap will be ordered from the lowest to highest probability of the module. The *Module Table* tab contains a complete table of all features in each module. ### Module Overview {.tabset .tabset-fade} Use the tabs to select modules ranging from 1 to `r L`. Module probabilities on the 2-D embedding are scaled to range between 0 and 1. Each column on the heatmap represents a cell and each row represents a feature. Expression values for each feature are z-scored normalized across all cells after normalization. Red represents higher relative expression and blue represents lower relative expression. All cells are shown in the heatmap on the left. Only the top 100 cells with the lowest module probability and the 100 cells with the highest module probability are shown in the heatmap on the right. The column color bar displays the population assignment for each cell. ```{r celda_module_heatmaps, results = "asis", fig.height = 15, fig.width = 10} p2 <- moduleHeatmap( sce, topCells = NULL, displayName = displayName, moduleLabel = "All cells", useRaster = TRUE, returnAsList = TRUE ) p3 <- moduleHeatmap( sce, topCells = 100, displayName = displayName, moduleLabel = "Top 100 cells", useRaster = TRUE, returnAsList = TRUE ) fig.list <- list() for (i in seq_len(L)) { p1 <- plotDimReduceModule( sce, reducedDimName = reducedDimName, useAssay = useAssay, altExpName = altExpName, modules = i ) # p2 <- moduleHeatmap( # sce, # featureModule = i, # topCells = NULL, # displayName = displayName, # moduleLabel = "All cells", # useRaster = TRUE # ) # p3 <- moduleHeatmap( # sce, # featureModule = i, # topCells = 100, # displayName = displayName, # moduleLabel = "Top 100 cells", # useRaster = TRUE # ) fig <- multi_panel_figure(rows = 2, columns = 2, figure_name = paste0("fig", i)) fig <- fill_panel(fig, p1, row = 1, column = 1:2, label = "") fig <- fill_panel(fig, p2[[i]], row = 2, column = 1, label = "") fig <- fill_panel(fig, p3[[i]], row = 2, column = 2, label = "") fig.list[[i]] <- fig } for (i in seq_len(L)) { cat(sprintf(tab4, paste0("L", i))) print(fig.list[[i]]) cat(space) } ``` ### Module Probability Summary {.tabset .tabset-fade} Celda orders modules by hierarchical clustering and so modules with similar numbers will have more somewhat more similar expression patterns across cells. Module probability plots are shown for groups of modules in each tab to allow for a quick exploration of these patterns across cells. ```{r celda_module_tabs, results = "asis", fig.height = 9, fig.width = 10} grids <- seq(0, L, by = 16) if (tail(grids, 1) != L) { grids <- c(grids, L) } for (i in seq.int(1, length(grids) - 1)) { modules <- seq(grids[i] + 1, grids[i + 1]) if (length(modules) > 1) { label <- paste0("L", modules[1], "-", "L", modules[length(modules)]) } else { label <- paste0("L", modules[1]) } cat(sprintf(tab4, label)) print( plotDimReduceModule( sce, reducedDimName = reducedDimName, useAssay = useAssay, altExpName = altExpName, ncol = 4, modules = modules ) ) cat(space) } ``` ### Module Table This section displays a table of features in each module and can be used as a quick way to lookup features of interest. The features within each module are ordered from those with highest expression at the top to those with lower expression at the bottom (same as in the heatmaps in the previous tab). ```{r celda_modules_table} table <- featureModuleTable( sce, useAssay = useAssay, altExpName = altExpName, displayName = displayName ) kable(table, style = "html", row.names = FALSE) %>% kable_styling(bootstrap_options = "striped") %>% scroll_box(width = "100%", height = "800px") if (!is.null(moduleFileName)) { featureModuleTable( sce, useAssay = useAssay, altExpName = altExpName, displayName = displayName, outputFile = moduleFileName ) } ```

    ## Expression of pre-selected markers {.tabset .tabset-fade} Relative expression profiles are shown for features that match: **`r features`**. The parameter `exactMatch` is set to ```r exactMatch```. If this parameter was set to ```FALSE```, then additional features that contain the name of any marker may also be shown. ```{r plot_markers, fig.height = markerHeight, fig.width = 12} if (!is.null(features)) { ix <- retrieveSCEIndex( inSCE = sce, IDs = features, axis = "row", by = displayName, exactMatch = exactMatch ) if (length(ix) > 0) { print( plotDimReduceFeature( sce, reducedDimName = "celda_UMAP", features = features, displayName = displayName, colorHigh = "red", colorMid = "grey", colorLow = "blue", ncol = 3 ) ) } else { message("No matching markers were found.") } } else { message("No features were supplied.") } ``` ## Overview Heatmaps {.tabset} Overview heatmaps can be used to visualize the relationship between modules and cell populations at high level.

    ### Probability Map The probability matrix on the left contains the probability of each module within each cell subpopulation. This matrix can be used to gain insights into the absolute abundance of each module within a cell subpopulation. Modules with higher probability have a higher overall expression level compared to other modules within the same cell population. The relative probability heatmap on the right is produced by taking the z-score of the module probabilities across cell subpopulations. Examining the relative abundance can be useful for finding modules that exhibit specific patterns across cell populations even if they have an overall lower absolute probability compared to other modules. ```{r celda_probability_map, echo = TRUE, eval = TRUE, fig.height = 20, fig.width = 10} print(celdaProbabilityMap(sce)) ``` ## Session Information {.unnumbered} ```{r session, eval = showSession, echo = showSession} sessionInfo() ``` ================================================ FILE: inst/rmarkdown/CeldaCG_Run.Rmd ================================================ --- title: "Celda_CG Run" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: toc: true toc_float: true number_sections: true theme: cosmo code_folding: hide params: sce: sce L: L K: K sampleLabel: sampleLabel altExpName: altExpName useAssay: useAssay initialL: initialL maxL: maxL initialK: initialK maxK: maxK minCell: minCell minCount: minCount maxFeatures: maxFeatures sceFile: sceFile showSession: showSession pdf: pdf --- # Introduction [Celda](https://github.com/campbio/celda) package (Cellular Latent Dirichlet Allocation) performs co-clustering of features into modules and cells into subpopulations using count data generated by single-cell genomic platforms. The results can then used to explore the different combination of modules that define each cell population. # Run celda_CG ## Setup The following code loads required libraries, reads in parameters, checks input variables, and sets up display options for the **Celda_CG Run** report. ```{r run_setup} require(singleCellTK) require(celda) require(knitr) require(gridExtra) require(ggplot2) sce <- params$sce L <- params$L K <- params$K sampleName <- params$sampleName altExpName <- params$altExpName useAssay <- params$useAssay initialL <- params$initialL maxL <- params$maxL initialK <- params$initialK maxK <- params$maxK minCell <- params$minCell minCount <- params$minCount sceFile <- params$sceFile maxFeatures <- params$maxFeatures pdf <- params$pdf showSession <- params$showSession # Set up Rmarkdown variables tab3 <- "### %s {-} " tab4 <- "#### %s {-} " space <- " " dev <- ifelse(isTRUE(pdf), c("png"), c("png", "pdf")) knitr::opts_chunk$set( echo = TRUE, cache = FALSE, cache.lazy = FALSE, # don't do lazy-loading for big objects cache.comments = FALSE, fig.align = "center", fig.keep = "all", dev = dev ) ``` ## Selection of features Before clustering, features that do not have at least `r minCount` counts in at least `r minCell` cells are excluded. If the number of features is more than `r maxFeatures`, then Seurat's VST function is used to select the `r maxFeatures` most variable features. While Celda can handle features with many zero counts, lowering the number of features can reduce computational time. ```{r feature_select, fig.height = 9, fig.width = 18} # Select features with minimum counts across a minimum number of cells sce <- selectFeatures( sce, minCount = minCount, minCell = minCell, altExpName = altExpName, useAssay = useAssay ) # Use Seurat to find highly variable features if the number of rows is still # more than maxFeatures varFilter <- ifelse(nrow(altExp(sce, altExpName)) > maxFeatures, TRUE, FALSE) if (varFilter) { temp.sce <- sce temp.sce <- seuratFindHVG(inSCE = temp.sce, useAssay = useAssay) o <- head( order( rowData(temp.sce)$seurat_variableFeatures_vst_varianceStandardized, decreasing = TRUE ), n = maxFeatures ) altExp(sce, altExpName) <- subsetSCERows(temp.sce, index = o, returnAsAltExp = FALSE) } ``` ## Determining the number of modules (L) Two stepwise splitting procedures are implemented in celda to help determine the number of modules (L) and the number of cell populations (K), respectively. The ```recursiveSplitModule``` function fits different celda models for a range of L values between `r initialL` to `r maxL`. The first model is fit with ```r paste0("L = ", initialL)```. Then the `celda_G` model is used to split each module into two new modules and the likelihood is re-calculated. The split that produced the best overall likelihood out of all splits is used for the next model with `L+1` modules. Perplexity can be used as a measure of "goodness of fit" for discrete Bayesian models. The perplexity is calculated for each choice of L. The Rate of Perplexity Change (RPC) can be used to determine a better "elbow" in the perplexity plot. The elbow represents a dramatic shift in the amount that the perplexity decreases with each new module and is a good place to start looking at possible choices of ```L```. Note that sometimes going beyond the elbow can be helpful if certain modules should be further split up after visual inspection. ```{r module_split, fig.height = 9, fig.width = 18} # Run recursiveModuleSplit to identify modules for different L moduleSplit <- recursiveSplitModule( sce, initialL = initialL, maxL = maxL, sampleLabel = sampleLabel, altExpName = altExpName, useAssay = useAssay ) # Plot perplexity to help choose L sep <- ifelse(maxL > 100, 10, 5) p1 <- plotGridSearchPerplexity(moduleSplit, altExpName = altExpName, sep = sep) p2 <- plotRPC(moduleSplit, altExpName = altExpName, sep = sep) grid.arrange(p1, p2, ncol = 2) ``` ## Determining the number of cell populations (K) The ```recursiveSplitCell``` function fits different celda models for a range of ```K``` values from `r initialK` to `r maxK`. The number of modules is set to ```L``` and the module labels from the ```recursiveSplitModule``` output are used for initialization. Similarly, the first model is fit with ```r paste0("K = ", initialK)```. Then the `celda_C` model is used to split each cell population into two new cell populations and the likelihood is re-calculated. The split that produced the best overall likelihood out of all splits is used for the next model with `K+1` cell populations. Perplexity and RPC are calculated as described in the previous section. The elbow can be used as a good starting point for possible choices of ```K```. Lastly, the final model is selected and the modules and cells are reordered using hierarchical clustering so that more similar modules and cell populations will have more similar values of L and K, respectively. Different choices for ```K``` are also visualized in reduced dimensional plots in the next section. ```{r cell_split} # Select number of gene modules moduleSplitSelect <- subsetCeldaList(moduleSplit, params = list(L = L)) initial.modules <- celdaModules(moduleSplitSelect) # Split cell modules cellSplit <- recursiveSplitCell( sce, initialK = initialK, maxK = maxK, yInit = initial.modules, sampleLabel = sampleLabel, altExpName = altExpName, useAssay = useAssay, reorder = FALSE) # Show results of clustering genes (perplexity plot) p3 <- plotGridSearchPerplexity(cellSplit, altExpName = altExpName, sep = 5) + theme(legend.position = "bottom") p4 <- plotRPC(cellSplit, altExpName = altExpName, sep = 5) + theme(legend.position = "bottom") grid.arrange(p3, p4, ncol = 2) # Select the final model and reorder L/K values sce <- subsetCeldaList(cellSplit, params = list(K = K)) sce <- reorderCelda(sce) ``` ## Dimensionality Reduction {.tabset .tabset-fade} Reduced dimensional 2-D plots created by algorithms such as tSNE and UMAP are useful for visualizing the relationship between cells. Each point on the plot represents a single cell. Cells closer together on the plot have more similar expression profiles across all genes. The following tSNEs and UMAPs are colored by different celda models with different choices of ```K```. This plots can be used to help select a final solution. ### tSNE {.tabset .tabset-fade} ```{r dimreduce_tsne, results = "asis"} sce <- celdaTsne(sce, useAssay = useAssay, altExpName = altExpName) tsne <- reducedDim(altExp(sce, altExpName), "celda_tSNE") for (i in seq.int(initialK, maxK)) { cat(sprintf(tab4, paste0("K = ", i))) sce.temp <- subsetCeldaList(cellSplit, params = list(K = i)) print( plotDimReduceCluster( celdaClusters(sce.temp), dim1 = tsne[, 1], dim2 = tsne[, 2], labelClusters = TRUE ) ) cat(space) } ``` ### UMAP {.tabset .tabset-fade} ```{r dimreduce_umap, results = "asis"} sce <- celdaUmap(sce, useAssay = useAssay, altExpName = altExpName) umap <- reducedDim(altExp(sce, altExpName), "celda_UMAP") for (i in seq.int(initialK, maxK)) { cat(sprintf(tab4, paste0("K = ", i))) sce.temp <- subsetCeldaList(cellSplit, params = list(K = i)) print( plotDimReduceCluster( celdaClusters(sce.temp), dim1 = umap[, 1], dim2 = umap[, 2], labelClusters = TRUE ) ) cat(space) } ``` ## Save the final model The SCE object with the final celda model will be saved to the file: ```r basename(sceFile)```. ```{r saving} if (!is.null(sceFile)) { saveRDS(sce, sceFile) } ``` ## Methods Summary Co-clustering of features into modules and cells into subpopulations was performed with the [celda](https://www.biorxiv.org/content/10.1101/2020.11.16.373274v2) package using the ```celda_CG``` model. Features that did not have at least `r minCount` counts in at least `r minCell` cells were excluded in the analysis. `r if(isTRUE(varFilter)) paste0("The top ", maxFeatures, " variable features were selected with the Seurat VST method using the singleCellTK package.")` The number of modules was identified using the ```recursiveSplitModule``` function with ```r paste0("initialL = ", initialL)``` and ```r paste0("maxL = ", maxL)```. The number of cell populations was identified using the ```recursiveSplitCell``` function with ```r paste0("initialK = ", initialK)``` and ```r paste0("maxK = ", maxK)```. Final values of ```r paste0("L = ", L)``` and ```r paste0("K = ", K)``` were chosen for final analysis based on the curve for the [Rate of Perplexity Change](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-16-S13-S8). [UMAPs](https://arxiv.org/abs/1602.00370) were generated using the ```celdaUmap``` function. [tSNEs](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf) were generated with the ```celdaTsne``` function. ## Session Information ```{r session_info, eval = showSession} sessionInfo() ``` ================================================ FILE: man/appendCeldaList.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/misc.R \name{appendCeldaList} \alias{appendCeldaList} \title{Append two celdaList objects} \usage{ appendCeldaList(list1, list2) } \arguments{ \item{list1}{A celda_list object} \item{list2}{A celda_list object to be joined with list_1} } \value{ A celdaList object. This object contains all resList entries and runParam records from both lists. } \description{ Returns a single celdaList representing the combination of two provided celdaList objects. } \examples{ data(celdaCGGridSearchRes) appendedList <- appendCeldaList( celdaCGGridSearchRes, celdaCGGridSearchRes ) } ================================================ FILE: man/availableModels.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{availableModels} \alias{availableModels} \title{available models} \format{ An object of class \code{character} of length 3. } \usage{ availableModels } \description{ available models } \keyword{datasets} ================================================ FILE: man/bestLogLikelihood.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/loglikelihood.R \name{bestLogLikelihood} \alias{bestLogLikelihood} \alias{bestLogLikelihood,SingleCellExperiment-method} \alias{bestLogLikelihood,celdaModel-method} \title{Get the log-likelihood} \usage{ bestLogLikelihood(x, altExpName = "featureSubset") \S4method{bestLogLikelihood}{SingleCellExperiment}(x, altExpName = "featureSubset") \S4method{bestLogLikelihood}{celdaModel}(x) } \arguments{ \item{x}{A \linkS4class{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}, or a celda model object.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ Numeric. The log-likelihood at the final step of Gibbs sampling used to generate the model. } \description{ Retrieves the final log-likelihood from all iterations of Gibbs sampling used to generate a celdaModel. } \examples{ data(sceCeldaCG) bestLogLikelihood(sceCeldaCG) data(celdaCGMod) bestLogLikelihood(celdaCGMod) } ================================================ FILE: man/celda.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/misc.R \name{celda} \alias{celda} \title{Celda models} \usage{ celda() } \value{ None } \description{ List of available Celda models with correpsonding descriptions. } \examples{ celda() } ================================================ FILE: man/celdaCGGridSearchRes.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{celdaCGGridSearchRes} \alias{celdaCGGridSearchRes} \title{celdaCGGridSearchRes} \format{ An object as returned from old celdaGridSearch() } \usage{ celdaCGGridSearchRes } \description{ Example results of old celdaGridSearch on celdaCGSim } \keyword{datasets} ================================================ FILE: man/celdaCGMod.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{celdaCGMod} \alias{celdaCGMod} \title{celdaCGmod} \format{ A celda_CG object } \usage{ celdaCGMod } \description{ celda_CG model object generated from \code{celdaCGSim} using old \code{celda_CG} function. } \keyword{datasets} ================================================ FILE: man/celdaCGSim.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{celdaCGSim} \alias{celdaCGSim} \title{celdaCGSim} \format{ A list of counts and properties as returned from old simulateCells(). } \usage{ celdaCGSim } \description{ An deprecated example of simulated count matrix from the celda_CG model. } \keyword{datasets} ================================================ FILE: man/celdaCMod.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{celdaCMod} \alias{celdaCMod} \title{celdaCMod} \format{ A celda_C object } \usage{ celdaCMod } \description{ Old celda_C results generated from celdaCSim } \keyword{datasets} ================================================ FILE: man/celdaCSim.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{celdaCSim} \alias{celdaCSim} \title{celdaCSim} \format{ A list of counts and properties as returned from old simulateCells(). } \usage{ celdaCSim } \description{ An old example simulated count matrix from the celda_C model. } \keyword{datasets} ================================================ FILE: man/celdaClusters.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{celdaClusters} \alias{celdaClusters} \alias{celdaClusters,SingleCellExperiment-method} \alias{celdaClusters,celdaModel-method} \alias{celdaClusters<-} \alias{celdaClusters<-,SingleCellExperiment-method} \title{Get or set the cell cluster labels from a celda \linkS4class{SingleCellExperiment} object or celda model object.} \usage{ celdaClusters(x, altExpName = "featureSubset") \S4method{celdaClusters}{SingleCellExperiment}(x, altExpName = "featureSubset") \S4method{celdaClusters}{celdaModel}(x) celdaClusters(x, altExpName = "featureSubset") <- value \S4method{celdaClusters}{SingleCellExperiment}(x, altExpName = "featureSubset") <- value } \arguments{ \item{x}{Can be one of \itemize{ \item A \linkS4class{SingleCellExperiment} object returned by \link{celda_C}, or \link{celda_CG}, with the matrix located in the \code{useAssay} assay slot. The a \link{altExp} slot with name \code{altExpName} will be used. Rows represent features and columns represent cells. \item Celda model object.}} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{value}{Character vector of cell cluster labels for replacements. Works only if \code{x} is a \linkS4class{SingleCellExperiment} object.} } \value{ One of \itemize{ \item Character vector if \code{x} is a \linkS4class{SingleCellExperiment} object. Contains cell cluster labels for each cell in x. \item List if \code{x} is a celda model object. Contains cell cluster labels (for celda_C and celdaCG Models) and/or feature module labels (for celda_G and celdaCG Models).} } \description{ Return or set the cell cluster labels determined by \link{celda_C} or \link{celda_CG} models. } \examples{ data(sceCeldaCG) celdaClusters(sceCeldaCG) data(celdaCGMod) celdaClusters(celdaCGMod) } ================================================ FILE: man/celdaGMod.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{celdaGMod} \alias{celdaGMod} \title{celdaGMod} \format{ A celda_G object } \usage{ celdaGMod } \description{ Old celda_G results generated from celdaGsim } \keyword{datasets} ================================================ FILE: man/celdaGSim.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{celdaGSim} \alias{celdaGSim} \title{celdaGSim} \format{ A list of counts and properties as returned from old simulateCells() } \usage{ celdaGSim } \description{ An old example simulated count matrix from the celda_G model. } \keyword{datasets} ================================================ FILE: man/celdaGridSearch.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celdaGridSearch.R \name{celdaGridSearch} \alias{celdaGridSearch} \alias{celdaGridSearch,SingleCellExperiment-method} \alias{celdaGridSearch,matrix-method} \title{Run Celda in parallel with multiple parameters} \usage{ celdaGridSearch( x, useAssay = "counts", altExpName = "featureSubset", model, paramsTest, paramsFixed = NULL, maxIter = 200, nchains = 3, cores = 1, bestOnly = TRUE, seed = 12345, perplexity = TRUE, verbose = TRUE, logfilePrefix = "Celda" ) \S4method{celdaGridSearch}{SingleCellExperiment}( x, useAssay = "counts", altExpName = "featureSubset", model, paramsTest, paramsFixed = NULL, maxIter = 200, nchains = 3, cores = 1, bestOnly = TRUE, seed = 12345, perplexity = TRUE, verbose = TRUE, logfilePrefix = "Celda" ) \S4method{celdaGridSearch}{matrix}( x, useAssay = "counts", altExpName = "featureSubset", model, paramsTest, paramsFixed = NULL, maxIter = 200, nchains = 3, cores = 1, bestOnly = TRUE, seed = 12345, perplexity = TRUE, verbose = TRUE, logfilePrefix = "Celda" ) } \arguments{ \item{x}{A numeric \link{matrix} of counts or a \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells.} \item{useAssay}{A string specifying the name of the \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{model}{Celda model. Options available in \link{availableModels}.} \item{paramsTest}{List. A list denoting the combinations of parameters to run in a celda model. For example, \code{list(K = seq(5, 10), L = seq(15, 20))} will run all combinations of K from 5 to 10 and L from 15 to 20 in model \link{celda_CG}.} \item{paramsFixed}{List. A list denoting additional parameters to use in each celda model. Default NULL.} \item{maxIter}{Integer. Maximum number of iterations of sampling to perform. Default 200.} \item{nchains}{Integer. Number of random cluster initializations. Default 3.} \item{cores}{Integer. The number of cores to use for parallel estimation of chains. Default 1.} \item{bestOnly}{Logical. Whether to return only the chain with the highest log likelihood per combination of parameters or return all chains. Default TRUE.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. Seed values \code{seq(seed, (seed + nchains - 1))} will be supplied to each chain in \code{nchains}. If NULL, no calls to \link[withr]{with_seed} are made.} \item{perplexity}{Logical. Whether to calculate perplexity for each model. If FALSE, then perplexity can be calculated later with \link{resamplePerplexity}. Default TRUE.} \item{verbose}{Logical. Whether to print log messages during celda chain execution. Default TRUE.} \item{logfilePrefix}{Character. Prefix for log files from worker threads and main process. Default "Celda".} } \value{ A \linkS4class{SingleCellExperiment} object. Function parameter settings and celda model results are stored in the \link{metadata} \code{"celda_grid_search"} slot. } \description{ Run Celda with different combinations of parameters and multiple chains in parallel. The variable \link{availableModels} contains the potential models that can be utilized. Different parameters to be tested should be stored in a list and passed to the argument \code{paramsTest}. Fixed parameters to be used in all models, such as \code{sampleLabel}, can be passed as a list to the argument \code{paramsFixed}. When \code{verbose = TRUE}, output from each chain will be sent to a log file but not be displayed in \code{stdout}. } \examples{ \dontrun{ data(celdaCGSim) ## Run various combinations of parameters with 'celdaGridSearch' celdaCGGridSearchRes <- celdaGridSearch(celdaCGSim$counts, model = "celda_CG", paramsTest = list(K = seq(4, 6), L = seq(9, 11)), paramsFixed = list(sampleLabel = celdaCGSim$sampleLabel), bestOnly = TRUE, nchains = 1, cores = 1) } } \seealso{ \link{celda_G} for feature clustering, \link{celda_C} for clustering of cells, and \link{celda_CG} for simultaneous clustering of features and cells. \link{subsetCeldaList} can subset the \code{celdaList} object. \link{selectBestModel} can get the best model for each combination of parameters. } ================================================ FILE: man/celdaHeatmap.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_heatmap.R \name{celdaHeatmap} \alias{celdaHeatmap} \alias{celdaHeatmap,SingleCellExperiment-method} \title{Plot celda Heatmap} \usage{ celdaHeatmap( sce, useAssay = "counts", altExpName = "featureSubset", featureIx = NULL, nfeatures = 25, ... ) \S4method{celdaHeatmap}{SingleCellExperiment}( sce, useAssay = "counts", altExpName = "featureSubset", featureIx = NULL, nfeatures = 25, ... ) } \arguments{ \item{sce}{A \link[SingleCellExperiment]{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}.} \item{useAssay}{A string specifying which \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{featureIx}{Integer vector. Select features for display in heatmap. If NULL, no subsetting will be performed. Default NULL. \strong{Only used for \code{sce} containing celda_C model result returned by \link{celda_C}.}} \item{nfeatures}{Integer. Maximum number of features to select for each gene module. Default 25. \strong{Only used for \code{sce} containing celda_CG or celda_G model results returned by \link{celda_CG} or \link{celda_G}.}} \item{...}{Additional parameters passed to \link{plotHeatmap}.} } \value{ list A list containing dendrogram information and the heatmap grob } \description{ Render a stylable heatmap of count data based on celda clustering results. } \examples{ data(sceCeldaCG) celdaHeatmap(sceCeldaCG) } \seealso{ `celdaTsne()` for generating 2-dimensional tSNE coordinates } ================================================ FILE: man/celdaModel.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{celdaModel} \alias{celdaModel} \alias{celdaModel,SingleCellExperiment-method} \title{Get celda model from a celda \link[SingleCellExperiment]{SingleCellExperiment} object} \usage{ celdaModel(sce, altExpName = "featureSubset") \S4method{celdaModel}{SingleCellExperiment}(sce, altExpName = "featureSubset") } \arguments{ \item{sce}{A \link[SingleCellExperiment]{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ Character. The celda model. Can be one of "celda_C", "celda_G", or "celda_CG". } \description{ Return the celda model for \code{sce} returned by \link{celda_C}, \link{celda_G} or \link{celda_CG}. } \examples{ data(sceCeldaCG) celdaModel(sceCeldaCG) } ================================================ FILE: man/celdaModules.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{celdaModules} \alias{celdaModules} \alias{celdaModules,SingleCellExperiment-method} \alias{celdaModules<-} \alias{celdaModules<-,SingleCellExperiment-method} \title{Get or set the feature module labels from a celda \linkS4class{SingleCellExperiment} object.} \usage{ celdaModules(sce, altExpName = "featureSubset") \S4method{celdaModules}{SingleCellExperiment}(sce, altExpName = "featureSubset") celdaModules(sce, altExpName = "featureSubset") <- value \S4method{celdaModules}{SingleCellExperiment}(sce, altExpName = "featureSubset") <- value } \arguments{ \item{sce}{A \linkS4class{SingleCellExperiment} object returned by \link{celda_G}, or \link{celda_CG}, with the matrix located in the \code{useAssay} assay slot. Rows represent features and columns represent cells.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{value}{Character vector of feature module labels for replacements. Works only if \code{x} is a \linkS4class{SingleCellExperiment} object.} } \value{ Character vector. Contains feature module labels for each feature in x. } \description{ Return or set the feature module cluster labels determined by \link{celda_G} or \link{celda_CG} models. } \examples{ data(sceCeldaCG) celdaModules(sceCeldaCG) } ================================================ FILE: man/celdaPerplexity-celdaList-method.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{celdaPerplexity,celdaList-method} \alias{celdaPerplexity,celdaList-method} \title{Get perplexity for every model in a celdaList} \usage{ \S4method{celdaPerplexity}{celdaList}(celdaList) } \arguments{ \item{celdaList}{An object of class celdaList.} } \value{ List. Contains one celdaModel object for each of the parameters specified in the `runParams()` of the provided celda list. } \description{ Returns perplexity for each model in a celdaList as calculated by `perplexity().` } \examples{ data(celdaCGGridSearchRes) celdaCGGridModelPerplexities <- celdaPerplexity(celdaCGGridSearchRes) } ================================================ FILE: man/celdaPerplexity.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{celdaPerplexity} \alias{celdaPerplexity} \title{Get perplexity for every model in a celdaList} \usage{ celdaPerplexity(celdaList) } \arguments{ \item{celdaList}{An object of class celdaList.} } \value{ List. Contains one celdaModel object for each of the parameters specified in the `runParams()` of the provided celda list. } \description{ Returns perplexity for each model in a celdaList as calculated by `perplexity().` } \examples{ data(celdaCGGridSearchRes) celdaCGGridModelPerplexities <- celdaPerplexity(celdaCGGridSearchRes) } ================================================ FILE: man/celdaProbabilityMap.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celdaProbabilityMap.R \name{celdaProbabilityMap} \alias{celdaProbabilityMap} \alias{celdaProbabilityMap,SingleCellExperiment-method} \title{Probability map for a celda model} \usage{ celdaProbabilityMap( sce, useAssay = "counts", altExpName = "featureSubset", level = c("cellPopulation", "sample"), ncols = 100, col2 = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")), title1 = "Absolute probability", title2 = "Relative expression", showColumnNames = TRUE, showRowNames = TRUE, rowNamesgp = grid::gpar(fontsize = 8), colNamesgp = grid::gpar(fontsize = 12), clusterRows = FALSE, clusterColumns = FALSE, showHeatmapLegend = TRUE, heatmapLegendParam = list(title = NULL, legend_height = grid::unit(6, "cm")), ... ) \S4method{celdaProbabilityMap}{SingleCellExperiment}( sce, useAssay = "counts", altExpName = "featureSubset", level = c("cellPopulation", "sample"), ncols = 100, col2 = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")), title1 = "Absolute probability", title2 = "Relative expression", showColumnNames = TRUE, showRowNames = TRUE, rowNamesgp = grid::gpar(fontsize = 8), colNamesgp = grid::gpar(fontsize = 12), clusterRows = FALSE, clusterColumns = FALSE, showHeatmapLegend = TRUE, heatmapLegendParam = list(title = NULL, legend_height = grid::unit(6, "cm")), ... ) } \arguments{ \item{sce}{A \link[SingleCellExperiment]{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}.} \item{useAssay}{A string specifying which \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{level}{Character. One of "cellPopulation" or "Sample". "cellPopulation" will display the absolute probabilities and relative normalized expression of each module in each cell population. \strong{\code{level = "cellPopulation"} only works for celda_CG \code{sce} objects}. "sample" will display the absolute probabilities and relative normalized abundance of each cell population in each sample. Default "cellPopulation".} \item{ncols}{The number of colors (>1) to be in the color palette of the absolute probability heatmap.} \item{col2}{Passed to \code{col} argument of \link[ComplexHeatmap]{Heatmap}. Set color boundaries and colors for the relative expression heatmap.} \item{title1}{Passed to \code{column_title} argument of \link[ComplexHeatmap]{Heatmap}. Figure title for the absolute probability heatmap.} \item{title2}{Passed to \code{column_title} argument of \link[ComplexHeatmap]{Heatmap}. Figure title for the relative expression heatmap.} \item{showColumnNames}{Passed to \code{show_column_names} argument of \link[ComplexHeatmap]{Heatmap}. Show column names.} \item{showRowNames}{Passed to \code{show_row_names} argument of \link[ComplexHeatmap]{Heatmap}. Show row names.} \item{rowNamesgp}{Passed to \code{row_names_gp} argument of \link[ComplexHeatmap]{Heatmap}. Set row name font.} \item{colNamesgp}{Passed to \code{column_names_gp} argument of \link[ComplexHeatmap]{Heatmap}. Set column name font.} \item{clusterRows}{Passed to \code{cluster_rows} argument of \link[ComplexHeatmap]{Heatmap}. Cluster rows.} \item{clusterColumns}{Passed to \code{cluster_columns} argument of \link[ComplexHeatmap]{Heatmap}. Cluster columns.} \item{showHeatmapLegend}{Passed to \code{show_heatmap_legend} argument of \link[ComplexHeatmap]{Heatmap}. Show heatmap legend.} \item{heatmapLegendParam}{Passed to \code{heatmap_legend_param} argument of \link[ComplexHeatmap]{Heatmap}. Heatmap legend parameters.} \item{...}{Additional parameters passed to \link[ComplexHeatmap]{Heatmap}.} } \value{ A \link[ComplexHeatmap]{HeatmapList} object containing 2 \link[ComplexHeatmap]{Heatmap-class} objects } \description{ Renders probability and relative expression heatmaps to visualize the relationship between features and cell populations (or cell populations and samples). } \examples{ data(sceCeldaCG) celdaProbabilityMap(sceCeldaCG) } \seealso{ \link{celda_C} for clustering cells. \link{celda_CG} for clustering features and cells } ================================================ FILE: man/celdaTsne.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celdatSNE.R \name{celdaTsne} \alias{celdaTsne} \alias{celdaTsne,SingleCellExperiment-method} \title{t-Distributed Stochastic Neighbor Embedding (t-SNE) dimension reduction for celda \code{sce} object} \usage{ celdaTsne( sce, useAssay = "counts", altExpName = "featureSubset", maxCells = NULL, minClusterSize = 100, initialDims = 20, modules = NULL, perplexity = 20, maxIter = 2500, normalize = "proportion", scaleFactor = NULL, transformationFun = sqrt, seed = 12345 ) \S4method{celdaTsne}{SingleCellExperiment}( sce, useAssay = "counts", altExpName = "featureSubset", maxCells = NULL, minClusterSize = 100, initialDims = 20, modules = NULL, perplexity = 20, maxIter = 2500, normalize = "proportion", scaleFactor = NULL, transformationFun = sqrt, seed = 12345 ) } \arguments{ \item{sce}{A \linkS4class{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}.} \item{useAssay}{A string specifying which \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{maxCells}{Integer. Maximum number of cells to plot. Cells will be randomly subsampled if \code{ncol(counts) > maxCells}. Larger numbers of cells requires more memory. If \code{NULL}, no subsampling will be performed. Default \code{NULL}.} \item{minClusterSize}{Integer. Do not subsample cell clusters below this threshold. Default 100.} \item{initialDims}{Integer. PCA will be used to reduce the dimensionality of the dataset. The top 'initialDims' principal components will be used for tSNE. Default 20.} \item{modules}{Integer vector. Determines which feature modules to use for tSNE. If \code{NULL}, all modules will be used. Default \code{NULL}.} \item{perplexity}{Numeric. Perplexity parameter for tSNE. Default 20.} \item{maxIter}{Integer. Maximum number of iterations in tSNE generation. Default 2500.} \item{normalize}{Character. Passed to \link{normalizeCounts} in normalization step. Divides counts by the library sizes for each cell. One of 'proportion', 'cpm', 'median', or 'mean'. 'proportion' uses the total counts for each cell as the library size. 'cpm' divides the library size of each cell by one million to produce counts per million. 'median' divides the library size of each cell by the median library size across all cells. 'mean' divides the library size of each cell by the mean library size across all cells.} \item{scaleFactor}{Numeric. Sets the scale factor for cell-level normalization. This scale factor is multiplied to each cell after the library size of each cell had been adjusted in \code{normalize}. Default \code{NULL} which means no scale factor is applied.} \item{transformationFun}{Function. Applys a transformation such as 'sqrt', 'log', 'log2', 'log10', or 'log1p'. If \code{NULL}, no transformation will be applied. Occurs after applying normalization and scale factor. Default \code{NULL}.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \link[withr]{with_seed} are made.} } \value{ \code{sce} with t-SNE coordinates (columns "celda_tSNE1" & "celda_tSNE2") added to \code{\link{reducedDim}(sce, "celda_tSNE")}. } \description{ Embeds cells in two dimensions using \link[Rtsne]{Rtsne} based on a celda model. For celda_C \code{sce} objects, PCA on the normalized counts is used to reduce the number of features before applying t-SNE. For celda_CG and celda_G \code{sce} objects, tSNE is run on module probabilities to reduce the number of features instead of using PCA. Module probabilities are square-root transformed before applying tSNE. } \examples{ data(sceCeldaCG) tsneRes <- celdaTsne(sceCeldaCG) } ================================================ FILE: man/celdaUmap.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celdaUMAP.R \name{celdaUmap} \alias{celdaUmap} \alias{celdaUmap,SingleCellExperiment-method} \title{Uniform Manifold Approximation and Projection (UMAP) dimension reduction for celda \code{sce} object} \usage{ celdaUmap( sce, useAssay = "counts", altExpName = "featureSubset", maxCells = NULL, minClusterSize = 100, modules = NULL, seed = 12345, nNeighbors = 30, minDist = 0.75, spread = 1, pca = TRUE, initialDims = 50, normalize = "proportion", scaleFactor = NULL, transformationFun = sqrt, cores = 1, ... ) \S4method{celdaUmap}{SingleCellExperiment}( sce, useAssay = "counts", altExpName = "featureSubset", maxCells = NULL, minClusterSize = 100, modules = NULL, seed = 12345, nNeighbors = 30, minDist = 0.75, spread = 1, pca = TRUE, initialDims = 50, normalize = "proportion", scaleFactor = NULL, transformationFun = sqrt, cores = 1, ... ) } \arguments{ \item{sce}{A \link[SingleCellExperiment]{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}.} \item{useAssay}{A string specifying which \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{maxCells}{Integer. Maximum number of cells to plot. Cells will be randomly subsampled if \code{ncol(sce) > maxCells}. Larger numbers of cells requires more memory. If NULL, no subsampling will be performed. Default NULL.} \item{minClusterSize}{Integer. Do not subsample cell clusters below this threshold. Default 100.} \item{modules}{Integer vector. Determines which features modules to use for UMAP. If NULL, all modules will be used. Default NULL.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \link[withr]{with_seed} are made.} \item{nNeighbors}{The size of local neighborhood used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. Default 30. See \link[uwot]{umap} for more information.} \item{minDist}{The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result on a more even dispersal of points. Default 0.75. See \link[uwot]{umap} for more information.} \item{spread}{The effective scale of embedded points. In combination with \code{min_dist}, this determines how clustered/clumped the embedded points are. Default 1. See \link[uwot]{umap} for more information.} \item{pca}{Logical. Whether to perform dimensionality reduction with PCA before UMAP. Only works for celda_C \code{sce} objects.} \item{initialDims}{Integer. Number of dimensions from PCA to use as input in UMAP. Default 50. Only works for celda_C \code{sce} objects.} \item{normalize}{Character. Passed to \link{normalizeCounts} in normalization step. Divides counts by the library sizes for each cell. One of 'proportion', 'cpm', 'median', or 'mean'. 'proportion' uses the total counts for each cell as the library size. 'cpm' divides the library size of each cell by one million to produce counts per million. 'median' divides the library size of each cell by the median library size across all cells. 'mean' divides the library size of each cell by the mean library size across all cells.} \item{scaleFactor}{Numeric. Sets the scale factor for cell-level normalization. This scale factor is multiplied to each cell after the library size of each cell had been adjusted in \code{normalize}. Default \code{NULL} which means no scale factor is applied.} \item{transformationFun}{Function. Applys a transformation such as 'sqrt', 'log', 'log2', 'log10', or 'log1p'. If \code{NULL}, no transformation will be applied. Occurs after applying normalization and scale factor. Default \code{NULL}.} \item{cores}{Number of threads to use. Default 1.} \item{...}{Additional parameters to pass to \link[uwot]{umap}.} } \value{ \code{sce} with UMAP coordinates (columns "celda_UMAP1" & "celda_UMAP2") added to \code{\link{reducedDim}(sce, "celda_UMAP")}. } \description{ Embeds cells in two dimensions using \link[uwot]{umap} based on a celda model. For celda_C \code{sce} objects, PCA on the normalized counts is used to reduce the number of features before applying UMAP. For celda_CG \code{sce} object, UMAP is run on module probabilities to reduce the number of features instead of using PCA. Module probabilities are square-root transformed before applying UMAP. } \examples{ data(sceCeldaCG) umapRes <- celdaUmap(sceCeldaCG) } ================================================ FILE: man/celda_C.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_C.R \name{celda_C} \alias{celda_C} \alias{celda_C,SingleCellExperiment-method} \alias{celda_C,ANY-method} \title{Cell clustering with Celda} \usage{ celda_C( x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, alpha = 1, beta = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, logfile = NULL, verbose = TRUE ) \S4method{celda_C}{SingleCellExperiment}( x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, alpha = 1, beta = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, logfile = NULL, verbose = TRUE ) \S4method{celda_C}{ANY}( x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, alpha = 1, beta = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, logfile = NULL, verbose = TRUE ) } \arguments{ \item{x}{A \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells. Alternatively, any matrix-like object that can be coerced to a sparse matrix of class "dgCMatrix" can be directly used as input. The matrix will automatically be converted to a \linkS4class{SingleCellExperiment} object.} \item{useAssay}{A string specifying the name of the \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{sampleLabel}{Vector or factor. Denotes the sample label for each cell (column) in the count matrix.} \item{K}{Integer. Number of cell populations.} \item{alpha}{Numeric. Concentration parameter for Theta. Adds a pseudocount to each cell population in each sample. Default 1.} \item{beta}{Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature in each cell population. Default 1.} \item{algorithm}{String. Algorithm to use for clustering cell subpopulations. One of 'EM' or 'Gibbs'. The EM algorithm is faster, especially for larger numbers of cells. However, more chains may be required to ensure a good solution is found. If 'EM' is selected, then 'stopIter' will be automatically set to 1. Default 'EM'.} \item{stopIter}{Integer. Number of iterations without improvement in the log likelihood to stop inference. Default 10.} \item{maxIter}{Integer. Maximum number of iterations of Gibbs sampling or EM to perform. Default 200.} \item{splitOnIter}{Integer. On every `splitOnIter` iteration, a heuristic will be applied to determine if a cell population should be reassigned and another cell population should be split into two clusters. To disable splitting, set to -1. Default 10.} \item{splitOnLast}{Integer. After `stopIter` iterations have been performed without improvement, a heuristic will be applied to determine if a cell population should be reassigned and another cell population should be split into two clusters. If a split occurs, then `stopIter` will be reset. Default TRUE.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \link[withr]{with_seed} are made.} \item{nchains}{Integer. Number of random cluster initializations. Default 3.} \item{zInitialize}{Character. One of 'random', 'split', or 'predefined'. With 'random', cells are randomly assigned to a populations. With 'split', cells will be split into sqrt(K) populations and then each population will be subsequently split into another sqrt(K) populations. With 'predefined', values in `zInit` will be used to initialize `z`. Default 'split'.} \item{countChecksum}{Character. An MD5 checksum for the `counts` matrix. Default NULL.} \item{zInit}{Integer vector. Sets initial starting values of z. 'zInit' is only used when `zInitialize = 'predfined'`. Default NULL.} \item{logfile}{Character. Messages will be redirected to a file named `logfile`. If NULL, messages will be printed to stdout. Default NULL.} \item{verbose}{Logical. Whether to print log messages. Default TRUE.} } \value{ A \link[SingleCellExperiment]{SingleCellExperiment} object. Function parameter settings are stored in the \link{metadata} \code{"celda_parameters"} slot. Columns \code{celda_sample_label} and \code{celda_cell_cluster} in \link{colData} contain sample labels and celda cell population clusters. } \description{ Clusters the columns of a count matrix containing single-cell data into K subpopulations. The \code{useAssay} \link{assay} slot in \code{altExpName} \link{altExp} slot will be used if it exists. Otherwise, the \code{useAssay} \link{assay} slot in \code{x} will be used if \code{x} is a \linkS4class{SingleCellExperiment} object. } \examples{ data(celdaCSim) sce <- celda_C(celdaCSim$counts, K = celdaCSim$K, sampleLabel = celdaCSim$sampleLabel, nchains = 1) } \seealso{ \link{celda_G} for feature clustering and \link{celda_CG} for simultaneous clustering of features and cells. \link{celdaGridSearch} can be used to run multiple values of K and multiple chains in parallel. } ================================================ FILE: man/celda_CG.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_CG.R \name{celda_CG} \alias{celda_CG} \alias{celda_CG,SingleCellExperiment-method} \alias{celda_CG,ANY-method} \title{Cell and feature clustering with Celda} \usage{ celda_CG( x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, L, alpha = 1, beta = 1, delta = 1, gamma = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), yInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, yInit = NULL, logfile = NULL, verbose = TRUE ) \S4method{celda_CG}{SingleCellExperiment}( x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, L, alpha = 1, beta = 1, delta = 1, gamma = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), yInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, yInit = NULL, logfile = NULL, verbose = TRUE ) \S4method{celda_CG}{ANY}( x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, K, L, alpha = 1, beta = 1, delta = 1, gamma = 1, algorithm = c("EM", "Gibbs"), stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, zInitialize = c("split", "random", "predefined"), yInitialize = c("split", "random", "predefined"), countChecksum = NULL, zInit = NULL, yInit = NULL, logfile = NULL, verbose = TRUE ) } \arguments{ \item{x}{A \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells. Alternatively, any matrix-like object that can be coerced to a sparse matrix of class "dgCMatrix" can be directly used as input. The matrix will automatically be converted to a \linkS4class{SingleCellExperiment} object.} \item{useAssay}{A string specifying the name of the \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{sampleLabel}{Vector or factor. Denotes the sample label for each cell (column) in the count matrix.} \item{K}{Integer. Number of cell populations.} \item{L}{Integer. Number of feature modules.} \item{alpha}{Numeric. Concentration parameter for Theta. Adds a pseudocount to each cell population in each sample. Default 1.} \item{beta}{Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature module in each cell population. Default 1.} \item{delta}{Numeric. Concentration parameter for Psi. Adds a pseudocount to each feature in each module. Default 1.} \item{gamma}{Numeric. Concentration parameter for Eta. Adds a pseudocount to the number of features in each module. Default 1.} \item{algorithm}{String. Algorithm to use for clustering cell subpopulations. One of 'EM' or 'Gibbs'. The EM algorithm for cell clustering is faster, especially for larger numbers of cells. However, more chains may be required to ensure a good solution is found. Default 'EM'.} \item{stopIter}{Integer. Number of iterations without improvement in the log likelihood to stop inference. Default 10.} \item{maxIter}{Integer. Maximum number of iterations of Gibbs sampling to perform. Default 200.} \item{splitOnIter}{Integer. On every \code{splitOnIter} iteration, a heuristic will be applied to determine if a cell population or feature module should be reassigned and another cell population or feature module should be split into two clusters. To disable splitting, set to -1. Default 10.} \item{splitOnLast}{Integer. After \code{stopIter} iterations have been performed without improvement, a heuristic will be applied to determine if a cell population or feature module should be reassigned and another cell population or feature module should be split into two clusters. If a split occurs, then 'stopIter' will be reset. Default TRUE.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \link[withr]{with_seed} are made.} \item{nchains}{Integer. Number of random cluster initializations. Default 3.} \item{zInitialize}{Chararacter. One of 'random', 'split', or 'predefined'. With 'random', cells are randomly assigned to a populations. With 'split', cells will be split into sqrt(K) populations and then each population will be subsequently split into another sqrt(K) populations. With 'predefined', values in \code{zInit} will be used to initialize \code{z}. Default 'split'.} \item{yInitialize}{Character. One of 'random', 'split', or 'predefined'. With 'random', features are randomly assigned to a modules. With 'split', features will be split into sqrt(L) modules and then each module will be subsequently split into another sqrt(L) modules. With 'predefined', values in \code{yInit} will be used to initialize \code{y}. Default 'split'.} \item{countChecksum}{Character. An MD5 checksum for the counts matrix. Default NULL.} \item{zInit}{Integer vector. Sets initial starting values of z. 'zInit' is only used when `zInitialize = 'predfined'`. Default NULL.} \item{yInit}{Integer vector. Sets initial starting values of y. 'yInit' is only be used when `yInitialize = "predefined"`. Default NULL.} \item{logfile}{Character. Messages will be redirected to a file named `logfile`. If NULL, messages will be printed to stdout. Default NULL.} \item{verbose}{Logical. Whether to print log messages. Default TRUE.} } \value{ A \linkS4class{SingleCellExperiment} object. Function parameter settings are stored in \link{metadata} \code{"celda_parameters"} in \link{altExp} slot. In \link{altExp} slot, columns \code{celda_sample_label} and \code{celda_cell_cluster} in \link{colData} contain sample labels and celda cell population clusters. Column \code{celda_feature_module} in \link{rowData} contains feature modules. } \description{ Clusters the rows and columns of a count matrix containing single-cell data into L modules and K subpopulations, respectively. The \code{useAssay} \link{assay} slot in \code{altExpName} \link{altExp} slot will be used if it exists. Otherwise, the \code{useAssay} \link{assay} slot in \code{x} will be used if \code{x} is a \linkS4class{SingleCellExperiment} object. } \examples{ data(celdaCGSim) sce <- celda_CG(celdaCGSim$counts, K = celdaCGSim$K, L = celdaCGSim$L, sampleLabel = celdaCGSim$sampleLabel, nchains = 1) } \seealso{ \link{celda_G} for feature clustering and \link{celda_C} for clustering cells. \link{celdaGridSearch} can be used to run multiple values of K/L and multiple chains in parallel. } ================================================ FILE: man/celda_G.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_G.R \name{celda_G} \alias{celda_G} \alias{celda_G,SingleCellExperiment-method} \alias{celda_G,ANY-method} \title{Feature clustering with Celda} \usage{ celda_G( x, useAssay = "counts", altExpName = "featureSubset", L, beta = 1, delta = 1, gamma = 1, stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, yInitialize = c("split", "random", "predefined"), countChecksum = NULL, yInit = NULL, logfile = NULL, verbose = TRUE ) \S4method{celda_G}{SingleCellExperiment}( x, useAssay = "counts", altExpName = "featureSubset", L, beta = 1, delta = 1, gamma = 1, stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, yInitialize = c("split", "random", "predefined"), countChecksum = NULL, yInit = NULL, logfile = NULL, verbose = TRUE ) \S4method{celda_G}{ANY}( x, useAssay = "counts", altExpName = "featureSubset", L, beta = 1, delta = 1, gamma = 1, stopIter = 10, maxIter = 200, splitOnIter = 10, splitOnLast = TRUE, seed = 12345, nchains = 3, yInitialize = c("split", "random", "predefined"), countChecksum = NULL, yInit = NULL, logfile = NULL, verbose = TRUE ) } \arguments{ \item{x}{A \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells. Alternatively, any matrix-like object that can be coerced to a sparse matrix of class "dgCMatrix" can be directly used as input. The matrix will automatically be converted to a \linkS4class{SingleCellExperiment} object.} \item{useAssay}{A string specifying the name of the \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{L}{Integer. Number of feature modules.} \item{beta}{Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature module in each cell. Default 1.} \item{delta}{Numeric. Concentration parameter for Psi. Adds a pseudocount to each feature in each module. Default 1.} \item{gamma}{Numeric. Concentration parameter for Eta. Adds a pseudocount to the number of features in each module. Default 1.} \item{stopIter}{Integer. Number of iterations without improvement in the log likelihood to stop inference. Default 10.} \item{maxIter}{Integer. Maximum number of iterations of Gibbs sampling to perform. Default 200.} \item{splitOnIter}{Integer. On every `splitOnIter` iteration, a heuristic will be applied to determine if a feature module should be reassigned and another feature module should be split into two clusters. To disable splitting, set to -1. Default 10.} \item{splitOnLast}{Integer. After `stopIter` iterations have been performed without improvement, a heuristic will be applied to determine if a cell population should be reassigned and another cell population should be split into two clusters. If a split occurs, then `stopIter` will be reset. Default TRUE.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \link[withr]{with_seed} are made.} \item{nchains}{Integer. Number of random cluster initializations. Default 3.} \item{yInitialize}{Chararacter. One of 'random', 'split', or 'predefined'. With 'random', features are randomly assigned to a modules. With 'split', features will be split into sqrt(L) modules and then each module will be subsequently split into another sqrt(L) modules. With 'predefined', values in `yInit` will be used to initialize `y`. Default 'split'.} \item{countChecksum}{Character. An MD5 checksum for the `counts` matrix. Default NULL.} \item{yInit}{Integer vector. Sets initial starting values of y. `yInit` can only be used when `yInitialize = 'predefined'`. Default NULL.} \item{logfile}{Character. Messages will be redirected to a file named \code{logfile}. If NULL, messages will be printed to stdout. Default NULL.} \item{verbose}{Logical. Whether to print log messages. Default TRUE.} } \value{ A \linkS4class{SingleCellExperiment} object. Function parameter settings are stored in the \link{metadata} \code{"celda_parameters"} slot. Column \code{celda_feature_module} in \link{rowData} contains feature modules. } \description{ Clusters the rows of a count matrix containing single-cell data into L modules. The \code{useAssay} \link{assay} slot in \code{altExpName} \link{altExp} slot will be used if it exists. Otherwise, the \code{useAssay} \link{assay} slot in \code{x} will be used if \code{x} is a \linkS4class{SingleCellExperiment} object. } \examples{ data(celdaGSim) sce <- celda_G(celdaGSim$counts, L = celdaGSim$L, nchains = 1) } \seealso{ \link{celda_C} for cell clustering and \link{celda_CG} for simultaneous clustering of features and cells. \link{celdaGridSearch} can be used to run multiple values of L and multiple chains in parallel. } ================================================ FILE: man/celdatosce.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celdatosce.R \name{celdatosce} \alias{celdatosce} \alias{celdatosce,celda_C-method} \alias{celdatosce,celda_G-method} \alias{celdatosce,celda_CG-method} \alias{celdatosce,celdaList-method} \title{Convert old celda model object to \code{SCE} object} \usage{ celdatosce( celdaModel, counts, useAssay = "counts", altExpName = "featureSubset" ) \S4method{celdatosce}{celda_C}( celdaModel, counts, useAssay = "counts", altExpName = "featureSubset" ) \S4method{celdatosce}{celda_G}( celdaModel, counts, useAssay = "counts", altExpName = "featureSubset" ) \S4method{celdatosce}{celda_CG}( celdaModel, counts, useAssay = "counts", altExpName = "featureSubset" ) \S4method{celdatosce}{celdaList}( celdaModel, counts, useAssay = "counts", altExpName = "featureSubset" ) } \arguments{ \item{celdaModel}{A \code{celdaModel} or \code{celdaList} object generated using older versions of \code{celda}.} \item{counts}{A numeric \link{matrix} of counts used to generate \code{celdaModel}. Dimensions and MD5 checksum will be checked by \link{compareCountMatrix}.} \item{useAssay}{A string specifying the name of the \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ A \linkS4class{SingleCellExperiment} object. Function parameter settings are stored in the \link{metadata} \code{"celda_parameters"} slot. Columns \code{celda_sample_label} and \code{celda_cell_cluster} in \link{colData} contain sample labels and celda cell population clusters. Column \code{celda_feature_module} in \link{rowData} contain feature modules. } \description{ Convert a old celda model object (\code{celda_C}, \code{celda_G}, or \code{celda_CG} object) to a \linkS4class{SingleCellExperiment} object containing celda model information in \code{metadata} slot. Counts matrix is stored in the \code{"counts"} assay slot in \code{assays}. } \examples{ data(celdaCMod, celdaCSim) sce <- celdatosce(celdaCMod, celdaCSim$counts) data(celdaGMod, celdaGSim) sce <- celdatosce(celdaGMod, celdaGSim$counts) data(celdaCGMod, celdaCGSim) sce <- celdatosce(celdaCGMod, celdaCGSim$counts) data(celdaCGGridSearchRes, celdaCGSim) sce <- celdatosce(celdaCGGridSearchRes, celdaCGSim$counts) } ================================================ FILE: man/clusterProbability.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/clusterProbability.R \name{clusterProbability} \alias{clusterProbability} \alias{clusterProbability,SingleCellExperiment-method} \title{Get the conditional probabilities of cell in subpopulations from celda model} \usage{ clusterProbability( sce, useAssay = "counts", altExpName = "featureSubset", log = FALSE ) \S4method{clusterProbability}{SingleCellExperiment}( sce, useAssay = "counts", altExpName = "featureSubset", log = FALSE ) } \arguments{ \item{sce}{A \linkS4class{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}, with the matrix located in the \code{useAssay} assay slot. Rows represent features and columns represent cells.} \item{useAssay}{A string specifying which \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{log}{Logical. If \code{FALSE}, then the normalized conditional probabilities will be returned. If \code{TRUE}, then the unnormalized log probabilities will be returned. Default \code{FALSE}.} } \value{ A list containging a matrix for the conditional cell subpopulation cluster and/or feature module probabilities. } \description{ Calculate the conditional probability of each cell belonging to each subpopulation given all other cell cluster assignments and/or each feature belonging to each module given all other feature cluster assignments in a celda model. } \examples{ data(sceCeldaCG) clusterProb <- clusterProbability(sceCeldaCG, log = TRUE) data(sceCeldaC) clusterProb <- clusterProbability(sceCeldaC) } \seealso{ `celda_C()` for clustering cells } ================================================ FILE: man/compareCountMatrix.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_functions.R \name{compareCountMatrix} \alias{compareCountMatrix} \alias{compareCountMatrix,ANY,celdaModel-method} \alias{compareCountMatrix,ANY,celdaList-method} \title{Check count matrix consistency} \usage{ compareCountMatrix(counts, celdaMod, errorOnMismatch = TRUE) \S4method{compareCountMatrix}{ANY,celdaModel}(counts, celdaMod, errorOnMismatch = TRUE) \S4method{compareCountMatrix}{ANY,celdaList}(counts, celdaMod, errorOnMismatch = TRUE) } \arguments{ \item{counts}{Integer , Numeric, or Sparse matrix. Rows represent features and columns represent cells.} \item{celdaMod}{A \code{celdaModel} or \code{celdaList} object.} \item{errorOnMismatch}{Logical. Whether to throw an error in the event of a mismatch. Default TRUE.} } \value{ Returns TRUE if provided count matrix matches the one used in the celda object and/or \code{errorOnMismatch = FALSE}, FALSE otherwise. } \description{ Checks if the counts matrix is the same one used to generate the celda model object by comparing dimensions and MD5 checksum. } \examples{ data(celdaCGSim, celdaCGMod) compareCountMatrix(celdaCGSim$counts, celdaCGMod, errorOnMismatch = FALSE) data(celdaCGSim, celdaCGGridSearchRes) compareCountMatrix(celdaCGSim$counts, celdaCGGridSearchRes, errorOnMismatch = FALSE) } ================================================ FILE: man/contaminationSim.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{contaminationSim} \alias{contaminationSim} \title{contaminationSim} \format{ A list } \usage{ contaminationSim } \description{ A toy contamination data generated by \link{simulateContamination} } \keyword{datasets} ================================================ FILE: man/countChecksum-celdaList-method.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{countChecksum,celdaList-method} \alias{countChecksum,celdaList-method} \title{Get the MD5 hash of the count matrix from the celdaList} \usage{ \S4method{countChecksum}{celdaList}(celdaList) } \arguments{ \item{celdaList}{An object of class celdaList.} } \value{ A character string of length 32 containing the MD5 digest of the count matrix. } \description{ Returns the MD5 hash of the count matrix used to generate the celdaList. } \examples{ data(celdaCGGridSearchRes) countChecksum <- countChecksum(celdaCGGridSearchRes) } ================================================ FILE: man/countChecksum.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{countChecksum} \alias{countChecksum} \title{Get the MD5 hash of the count matrix from the celdaList} \usage{ countChecksum(celdaList) } \arguments{ \item{celdaList}{An object of class celdaList.} } \value{ A character string of length 32 containing the MD5 digest of the count matrix. } \description{ Returns the MD5 hash of the count matrix used to generate the celdaList. } \examples{ data(celdaCGGridSearchRes) countChecksum <- countChecksum(celdaCGGridSearchRes) } ================================================ FILE: man/decontX.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/decon.R \name{decontX} \alias{decontX} \alias{decontX,SingleCellExperiment-method} \alias{decontX,ANY-method} \title{Contamination estimation with decontX} \usage{ decontX(x, ...) \S4method{decontX}{SingleCellExperiment}( x, assayName = "counts", z = NULL, batch = NULL, background = NULL, bgAssayName = NULL, bgBatch = NULL, maxIter = 500, delta = c(10, 10), estimateDelta = TRUE, convergence = 0.001, iterLogLik = 10, varGenes = 5000, dbscanEps = 1, seed = 12345, logfile = NULL, verbose = TRUE ) \S4method{decontX}{ANY}( x, z = NULL, batch = NULL, background = NULL, bgBatch = NULL, maxIter = 500, delta = c(10, 10), estimateDelta = TRUE, convergence = 0.001, iterLogLik = 10, varGenes = 5000, dbscanEps = 1, seed = 12345, logfile = NULL, verbose = TRUE ) } \arguments{ \item{x}{A numeric matrix of counts or a \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{assayName}. Cells in each batch will be subsetted and converted to a sparse matrix of class \code{dgCMatrix} from package \link{Matrix} before analysis. This object should only contain filtered cells after cell calling. Empty cell barcodes (low expression droplets before cell calling) are not needed to run DecontX.} \item{...}{For the generic, further arguments to pass to each method.} \item{assayName}{Character. Name of the assay to use if \code{x} is a \linkS4class{SingleCellExperiment}.} \item{z}{Numeric or character vector. Cell cluster labels. If NULL, PCA will be used to reduce the dimensionality of the dataset initially, '\link[uwot]{umap}' from the 'uwot' package will be used to further reduce the dataset to 2 dimenions and the '\link[dbscan]{dbscan}' function from the 'dbscan' package will be used to identify clusters of broad cell types. Default NULL.} \item{batch}{Numeric or character vector. Batch labels for cells. If batch labels are supplied, DecontX is run on cells from each batch separately. Cells run in different channels or assays should be considered different batches. Default NULL.} \item{background}{A numeric matrix of counts or a \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{assayName}. It should have the same data format as \code{x} except it contains the empty droplets instead of cells. When supplied, empirical distribution of transcripts from these empty droplets will be used as the contamination distribution. Default NULL.} \item{bgAssayName}{Character. Name of the assay to use if \code{background} is a \linkS4class{SingleCellExperiment}. Default to same as \code{assayName}.} \item{bgBatch}{Numeric or character vector. Batch labels for \code{background}. Its unique values should be the same as those in \code{batch}, such that each batch of cells have their corresponding batch of empty droplets as background, pointed by this parameter. Default to NULL.} \item{maxIter}{Integer. Maximum iterations of the EM algorithm. Default 500.} \item{delta}{Numeric Vector of length 2. Concentration parameters for the Dirichlet prior for the contamination in each cell. The first element is the prior for the native counts while the second element is the prior for the contamination counts. These essentially act as pseudocounts for the native and contamination in each cell. If \code{estimateDelta = TRUE}, this is only used to produce a random sample of proportions for an initial value of contamination in each cell. Then \code{\link[MCMCprecision]{fit_dirichlet}} is used to update \code{delta} in each iteration. If \code{estimateDelta = FALSE}, then \code{delta} is fixed with these values for the entire inference procedure. Fixing \code{delta} and setting a high number in the second element will force \code{decontX} to be more aggressive and estimate higher levels of contamination at the expense of potentially removing native expression. Default \code{c(10, 10)}.} \item{estimateDelta}{Boolean. Whether to update \code{delta} at each iteration.} \item{convergence}{Numeric. The EM algorithm will be stopped if the maximum difference in the contamination estimates between the previous and current iterations is less than this. Default 0.001.} \item{iterLogLik}{Integer. Calculate log likelihood every \code{iterLogLik} iteration. Default 10.} \item{varGenes}{Integer. The number of variable genes to use in dimensionality reduction before clustering. Variability is calcualted using \code{\link[scran]{modelGeneVar}} function from the 'scran' package. Used only when z is not provided. Default 5000.} \item{dbscanEps}{Numeric. The clustering resolution parameter used in '\link[dbscan]{dbscan}' to estimate broad cell clusters. Used only when z is not provided. Default 1.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \link[withr]{with_seed} are made.} \item{logfile}{Character. Messages will be redirected to a file named `logfile`. If NULL, messages will be printed to stdout. Default NULL.} \item{verbose}{Logical. Whether to print log messages. Default TRUE.} } \value{ If \code{x} is a matrix-like object, a list will be returned with the following items: \describe{ \item{\code{decontXcounts}:}{The decontaminated matrix. Values obtained from the variational inference procedure may be non-integer. However, integer counts can be obtained by rounding, e.g. \code{round(decontXcounts)}.} \item{\code{contamination}:}{Percentage of contamination in each cell.} \item{\code{estimates}:}{List of estimated parameters for each batch. If z was not supplied, then the UMAP coordinates used to generated cell cluster labels will also be stored here.} \item{\code{z}:}{Cell population/cluster labels used for analysis.} \item{\code{runParams}:}{List of arguments used in the function call.} } If \code{x} is a \linkS4class{SingleCellExperiment}, then the decontaminated counts will be stored as an assay and can be accessed with \code{decontXcounts(x)}. The contamination values and cluster labels will be stored in \code{colData(x)}. \code{estimates} and \code{runParams} will be stored in \code{metadata(x)$decontX}. The UMAPs used to generated cell cluster labels will be stored in \code{reducedDims} slot in \code{x}. } \description{ Identifies contamination from factors such as ambient RNA in single cell genomic datasets. } \examples{ # Generate matrix with contamination s <- simulateContamination(seed = 12345) library(SingleCellExperiment) sce <- SingleCellExperiment(list(counts = s$observedCounts)) sce <- decontX(sce) # Plot contamination on UMAP plotDecontXContamination(sce) # Plot decontX cluster labels umap <- reducedDim(sce) plotDimReduceCluster(x = sce$decontX_clusters, dim1 = umap[, 1], dim2 = umap[, 2], ) # Plot percentage of marker genes detected # in each cell cluster before decontamination s$markers plotDecontXMarkerPercentage(sce, markers = s$markers, assayName = "counts") # Plot percentage of marker genes detected # in each cell cluster after contamination plotDecontXMarkerPercentage(sce, markers = s$markers, assayName = "decontXcounts") # Plot percentage of marker genes detected in each cell # comparing original and decontaminated counts side-by-side plotDecontXMarkerPercentage(sce, markers = s$markers, assayName = c("counts", "decontXcounts")) # Plot raw counts of indiviual markers genes before # and after decontamination plotDecontXMarkerExpression(sce, unlist(s$markers)) } \author{ Shiyi Yang, Yuan Yin, Joshua Campbell } ================================================ FILE: man/decontXcounts.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/decon.R \name{decontXcounts} \alias{decontXcounts} \alias{decontXcounts<-} \alias{decontXcounts,SingleCellExperiment-method} \alias{decontXcounts<-,SingleCellExperiment-method} \title{Get or set decontaminated counts matrix} \usage{ decontXcounts(object, ...) decontXcounts(object, ...) <- value \S4method{decontXcounts}{SingleCellExperiment}(object, ...) \S4method{decontXcounts}{SingleCellExperiment}(object, ...) <- value } \arguments{ \item{object}{A \linkS4class{SingleCellExperiment} object.} \item{...}{For the generic, further arguments to pass to each method.} \item{value}{A matrix to save as an assay called \code{decontXcounts}} } \value{ If getting, the assay from \code{object} with the name \code{decontXcounts} will be returned. If setting, a \linkS4class{SingleCellExperiment} object will be returned with \code{decontXcounts} listed in the \code{assay} slot. } \description{ Gets or sets the decontaminated counts matrix from a a \linkS4class{SingleCellExperiment} object. } \seealso{ \code{\link{assay}} and \code{\link{assay<-}} } ================================================ FILE: man/distinctColors.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_functions.R \name{distinctColors} \alias{distinctColors} \title{Create a color palette} \usage{ distinctColors( n, hues = c("red", "cyan", "orange", "blue", "yellow", "purple", "green", "magenta"), saturationRange = c(0.7, 1), valueRange = c(0.7, 1) ) } \arguments{ \item{n}{Integer. Number of colors to generate.} \item{hues}{Character vector. Colors available from `colors()`. These will be used as the base colors for the clustering scheme in HSV. Different saturations and values will be generated for each hue. Default c("red", "cyan", "orange", "blue", "yellow", "purple", "green", "magenta").} \item{saturationRange}{Numeric vector. A vector of length 2 denoting the saturation for HSV. Values must be in [0,1]. Default: c(0.25, 1).} \item{valueRange}{Numeric vector. A vector of length 2 denoting the range of values for HSV. Values must be in [0,1]. Default: `c(0.5, 1)`.} } \value{ A vector of distinct colors that have been converted to HEX from HSV. } \description{ Generate a palette of `n` distinct colors. } \examples{ colorPal <- distinctColors(6) # can be used in plotting functions } ================================================ FILE: man/eigenMatMultInt.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{eigenMatMultInt} \alias{eigenMatMultInt} \title{Fast matrix multiplication for double x int} \usage{ eigenMatMultInt(A, B) } \arguments{ \item{A}{a double matrix} \item{B}{an integer matrix} } \value{ An integer matrix representing the product of A and B } \description{ Fast matrix multiplication for double x int } ================================================ FILE: man/eigenMatMultNumeric.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{eigenMatMultNumeric} \alias{eigenMatMultNumeric} \title{Fast matrix multiplication for double x double} \usage{ eigenMatMultNumeric(A, B) } \arguments{ \item{A}{a double matrix} \item{B}{an integer matrix} } \value{ An integer matrix representing the product of A and B } \description{ Fast matrix multiplication for double x double } ================================================ FILE: man/factorizeMatrix.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/factorizeMatrix.R \name{factorizeMatrix} \alias{factorizeMatrix} \alias{factorizeMatrix,SingleCellExperiment,ANY-method} \alias{factorizeMatrix,ANY,celda_CG-method} \alias{factorizeMatrix,ANY,celda_C-method} \alias{factorizeMatrix,ANY,celda_G-method} \title{Generate factorized matrices showing each feature's influence on cell / gene clustering} \usage{ factorizeMatrix( x, celdaMod, useAssay = "counts", altExpName = "featureSubset", type = c("counts", "proportion", "posterior") ) \S4method{factorizeMatrix}{SingleCellExperiment,ANY}( x, useAssay = "counts", altExpName = "featureSubset", type = c("counts", "proportion", "posterior") ) \S4method{factorizeMatrix}{ANY,celda_CG}(x, celdaMod, type = c("counts", "proportion", "posterior")) \S4method{factorizeMatrix}{ANY,celda_C}(x, celdaMod, type = c("counts", "proportion", "posterior")) \S4method{factorizeMatrix}{ANY,celda_G}(x, celdaMod, type = c("counts", "proportion", "posterior")) } \arguments{ \item{x}{Can be one of \itemize{ \item A \linkS4class{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G} or \link{celda_CG}, with the matrix located in the \code{useAssay} assay slot in \code{altExp(x, altExpName)}. Rows represent features and columns represent cells. \item Integer counts matrix. Rows represent features and columns represent cells. This matrix should be the same as the one used to generate \code{celdaMod}.}} \item{celdaMod}{Celda model object. Only works if \code{x} is an integer counts matrix.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{type}{Character vector. A vector containing one or more of "counts", "proportion", or "posterior". "counts" returns the raw number of counts for each factorized matrix. "proportions" returns the normalized probabilities for each factorized matrix, which are calculated by dividing the raw counts in each factorized matrix by the total counts in each column. "posterior" returns the posterior estimates which include the addition of the Dirichlet concentration parameter (essentially as a pseudocount). Default \code{"counts"}.} } \value{ For celda_CG model, A list with elements for "counts", "proportions", or "posterior" probabilities. Each element will be a list containing factorized matrices for "module", "cellPopulation", and "sample". Additionally, the contribution of each module in each individual cell will be included in the "cell" element of "counts" and "proportions" elements. For celda_C model, a list with elements for "counts", "proportions", or "posterior" probabilities. Each element will be a list containing factorized matrices for "module" and "sample". For celda_G model, a list with elements for "counts", "proportions", or "posterior" probabilities. Each element will be a list containing factorized matrices for "module" and "cell". } \description{ Generates factorized matrices showing the contribution of each feature in each cell population or each cell population in each sample. } \examples{ data(sceCeldaCG) factorizedMatrices <- factorizeMatrix(sceCeldaCG, type = "posterior") data(celdaCGSim, celdaCGMod) factorizedMatrices <- factorizeMatrix( celdaCGSim$counts, celdaCGMod, "posterior") data(celdaCSim, celdaCMod) factorizedMatrices <- factorizeMatrix( celdaCSim$counts, celdaCMod, "posterior" ) data(celdaGSim, celdaGMod) factorizedMatrices <- factorizeMatrix( celdaGSim$counts, celdaGMod, "posterior" ) } ================================================ FILE: man/fastNormProp.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{fastNormProp} \alias{fastNormProp} \title{Fast normalization for numeric matrix} \usage{ fastNormProp(R_counts, R_alpha) } \arguments{ \item{R_counts}{An integer matrix} \item{R_alpha}{A double value to be added to the matrix as a pseudocount} } \value{ A numeric matrix where the columns have been normalized to proportions } \description{ Fast normalization for numeric matrix } ================================================ FILE: man/fastNormPropLog.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{fastNormPropLog} \alias{fastNormPropLog} \title{Fast normalization for numeric matrix} \usage{ fastNormPropLog(R_counts, R_alpha) } \arguments{ \item{R_counts}{An integer matrix} \item{R_alpha}{A double value to be added to the matrix as a pseudocount} } \value{ A numeric matrix where the columns have been normalized to proportions } \description{ Fast normalization for numeric matrix } ================================================ FILE: man/fastNormPropSqrt.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{fastNormPropSqrt} \alias{fastNormPropSqrt} \title{Fast normalization for numeric matrix} \usage{ fastNormPropSqrt(R_counts, R_alpha) } \arguments{ \item{R_counts}{An integer matrix} \item{R_alpha}{A double value to be added to the matrix as a pseudocount} } \value{ A numeric matrix where the columns have been normalized to proportions } \description{ Fast normalization for numeric matrix } ================================================ FILE: man/featureModuleLookup.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/featureModuleLookup.R \name{featureModuleLookup} \alias{featureModuleLookup} \alias{featureModuleLookup,SingleCellExperiment-method} \title{Obtain the gene module of a gene of interest} \usage{ featureModuleLookup( sce, features, altExpName = "featureSubset", exactMatch = TRUE, by = "rownames" ) \S4method{featureModuleLookup}{SingleCellExperiment}( sce, features, altExpName = "featureSubset", exactMatch = TRUE, by = "rownames" ) } \arguments{ \item{sce}{A \linkS4class{SingleCellExperiment} object returned by \link{celda_G}, or \link{celda_CG}, with the matrix located in the \code{useAssay} assay slot. Rows represent features and columns represent cells.} \item{features}{Character vector. Identify feature modules for the specified feature names. \code{feature} must match the rownames of \code{sce}.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{exactMatch}{Logical. Whether to look for exactMatch of the gene name within counts matrix. Default \code{TRUE}.} \item{by}{Character. Where to search for \code{features} in the sce object. If set to \code{"rownames"} then the features will be searched for among rownames(sce). This can also be set to one of the \code{colnames} of rowData(sce). Default \code{"rownames"}.} } \value{ Numeric vector containing the module numbers for each feature. If the feature was not found, then an \code{NA} value will be returned in that position. If no features were found, then an error will be given. } \description{ This function will output the corresponding feature module for a specified vector of genes from a celda_CG or celda_G \code{celdaModel}. \code{features} must match the rownames of \code{sce}. } \examples{ data(sceCeldaCG) module <- featureModuleLookup(sce = sceCeldaCG, features = c("Gene_1", "Gene_XXX")) } ================================================ FILE: man/featureModuleTable.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_functions.R \name{featureModuleTable} \alias{featureModuleTable} \title{Output a feature module table} \usage{ featureModuleTable( sce, useAssay = "counts", altExpName = "featureSubset", displayName = NULL, outputFile = NULL ) } \arguments{ \item{sce}{A \linkS4class{SingleCellExperiment} object returned by \link{celda_G}, or \link{celda_CG}, with the matrix located in the \code{useAssay} assay slot. Rows represent features and columns represent cells.} \item{useAssay}{A string specifying which \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{displayName}{Character. The column name of \code{rowData(sce)} that specifies the display names for the features. Default \code{NULL}, which displays the row names.} \item{outputFile}{File name for feature module table. If NULL, file will not be created. Default NULL.} } \value{ Matrix. Contains a list of features per each column (feature module) } \description{ Creates a table that contains the list of features in each feature module. } \examples{ data(sceCeldaCG) featureModuleTable(sceCeldaCG) } ================================================ FILE: man/geneSetEnrich.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/geneSetEnrich.R \name{geneSetEnrich} \alias{geneSetEnrich} \alias{geneSetEnrich,SingleCellExperiment-method} \alias{geneSetEnrich,matrix-method} \title{Gene set enrichment} \usage{ geneSetEnrich( x, celdaModel, useAssay = "counts", altExpName = "featureSubset", databases, fdr = 0.05 ) \S4method{geneSetEnrich}{SingleCellExperiment}( x, useAssay = "counts", altExpName = "featureSubset", databases, fdr = 0.05 ) \S4method{geneSetEnrich}{matrix}(x, celdaModel, databases, fdr = 0.05) } \arguments{ \item{x}{A numeric \link{matrix} of counts or a \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells. Rownames of the matrix or \linkS4class{SingleCellExperiment} object should be gene names.} \item{celdaModel}{Celda object of class \code{celda_G} or \code{celda_CG}.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{databases}{Character vector. Name of reference database. Available databases can be viewed by \link[enrichR]{listEnrichrDbs}.} \item{fdr}{False discovery rate (FDR). Numeric. Cutoff value for adjusted p-value, terms with FDR below this value are considered significantly enriched.} } \value{ List of length 'L' where each member contains the significantly enriched terms for the corresponding module. } \description{ Identify and return significantly-enriched terms for each gene module in a Celda object or a \linkS4class{SingleCellExperiment} object. Performs gene set enrichment analysis for Celda identified modules using the \link[enrichR]{enrichr}. } \examples{ library(M3DExampleData) counts <- M3DExampleData::Mmus_example_list$data # subset 500 genes for fast clustering counts <- counts[seq(1501, 2000), ] # cluster genes into 10 modules for quick demo sce <- celda_G(x = as.matrix(counts), L = 10, verbose = FALSE) gse <- geneSetEnrich(sce, databases = c("GO_Biological_Process_2018", "GO_Molecular_Function_2018")) } \author{ Ahmed Youssef, Zhe Wang } ================================================ FILE: man/logLikelihood.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/loglikelihood.R \name{logLikelihood} \alias{logLikelihood} \alias{logLikelihood,SingleCellExperiment,ANY-method} \alias{logLikelihood,matrix,celda_C-method} \alias{logLikelihood,matrix,celda_G-method} \alias{logLikelihood,matrix,celda_CG-method} \title{Calculate the Log-likelihood of a celda model} \usage{ logLikelihood(x, celdaMod, useAssay = "counts", altExpName = "featureSubset") \S4method{logLikelihood}{SingleCellExperiment,ANY}(x, useAssay = "counts", altExpName = "featureSubset") \S4method{logLikelihood}{matrix,celda_C}(x, celdaMod) \S4method{logLikelihood}{matrix,celda_G}(x, celdaMod) \S4method{logLikelihood}{matrix,celda_CG}(x, celdaMod) } \arguments{ \item{x}{A \linkS4class{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}, with the matrix located in the \code{useAssay} assay slot. Rows represent features and columns represent cells.} \item{celdaMod}{celda model object. Ignored if \code{x} is a \linkS4class{SingleCellExperiment} object.} \item{useAssay}{A string specifying which \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ The log-likelihood of the cluster assignment for the provided \linkS4class{SingleCellExperiment}. } \description{ Calculate the log-likelihood for cell population and feature module cluster assignments on the count matrix, per celda model. } \examples{ data(sceCeldaC, sceCeldaCG) loglikC <- logLikelihood(sceCeldaC) loglikCG <- logLikelihood(sceCeldaCG) } \seealso{ `celda_C()` for clustering cells } ================================================ FILE: man/logLikelihoodHistory.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/loglikelihood.R \name{logLikelihoodHistory} \alias{logLikelihoodHistory} \alias{logLikelihoodHistory,SingleCellExperiment-method} \alias{logLikelihoodHistory,celdaModel-method} \title{Get log-likelihood history} \usage{ logLikelihoodHistory(x, altExpName = "featureSubset") \S4method{logLikelihoodHistory}{SingleCellExperiment}(x, altExpName = "featureSubset") \S4method{logLikelihoodHistory}{celdaModel}(x) } \arguments{ \item{x}{A \linkS4class{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}, or a celda model object.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ Numeric. The log-likelihood at each step of Gibbs sampling used to generate the model. } \description{ Retrieves the complete log-likelihood from all iterations of Gibbs sampling used to generate a celda model. } \examples{ data(sceCeldaCG) logLikelihoodHistory(sceCeldaCG) data(celdaCGMod) logLikelihoodHistory(celdaCGMod) } ================================================ FILE: man/matrixNames.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{matrixNames} \alias{matrixNames} \alias{matrixNames,celdaModel-method} \title{Get feature, cell and sample names from a celdaModel} \usage{ matrixNames(celdaMod) \S4method{matrixNames}{celdaModel}(celdaMod) } \arguments{ \item{celdaMod}{celdaModel. Options available in `celda::availableModels`.} } \value{ List. Contains row, column, and sample character vectors corresponding to the values provided when the celdaModel was generated. } \description{ Retrieves the row, column, and sample names used to generate a celdaModel. } \examples{ data(celdaCGMod) matrixNames(celdaCGMod) } ================================================ FILE: man/moduleHeatmap.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/moduleHeatmap.R \name{moduleHeatmap} \alias{moduleHeatmap} \alias{moduleHeatmap,SingleCellExperiment-method} \title{Heatmap for featureModules} \usage{ moduleHeatmap( x, useAssay = "counts", altExpName = "featureSubset", modules = NULL, featureModule = NULL, col = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")), topCells = 100, topFeatures = NULL, normalizedCounts = NA, normalize = "proportion", transformationFun = sqrt, scaleRow = scale, showFeatureNames = TRUE, displayName = NULL, trim = c(-2, 2), rowFontSize = NULL, showHeatmapLegend = FALSE, showTopAnnotationLegend = FALSE, showTopAnnotationName = FALSE, topAnnotationHeight = 5, showModuleLabel = TRUE, moduleLabel = "auto", moduleLabelSize = NULL, byrow = TRUE, top = NA, unit = "mm", ncol = NULL, useRaster = TRUE, returnAsList = FALSE, ... ) \S4method{moduleHeatmap}{SingleCellExperiment}( x, useAssay = "counts", altExpName = "featureSubset", modules = NULL, featureModule = NULL, col = circlize::colorRamp2(c(-2, 0, 2), c("#1E90FF", "#FFFFFF", "#CD2626")), topCells = 100, topFeatures = NULL, normalizedCounts = NA, normalize = "proportion", transformationFun = sqrt, scaleRow = scale, showFeatureNames = TRUE, displayName = NULL, trim = c(-2, 2), rowFontSize = NULL, showHeatmapLegend = FALSE, showTopAnnotationLegend = FALSE, showTopAnnotationName = FALSE, topAnnotationHeight = 5, showModuleLabel = TRUE, moduleLabel = "auto", moduleLabelSize = NULL, byrow = TRUE, top = NA, unit = "mm", ncol = NULL, useRaster = TRUE, returnAsList = FALSE, ... ) } \arguments{ \item{x}{A numeric \link{matrix} of counts or a \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells. Celda results must be present under \code{metadata(altExp(x, altExpName))}.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{modules}{Integer Vector. The featureModule(s) to display. Multiple modules can be included in a vector. Default \code{NULL} which plots all module heatmaps.} \item{featureModule}{Same as \code{modules}. Either can be used to specify the modules to display.} \item{col}{Passed to \link[ComplexHeatmap]{Heatmap}. Set color boundaries and colors.} \item{topCells}{Integer. Number of cells with the highest and lowest probabilities for each module to include in the heatmap. For example, if \code{topCells = 50}, the 50 cells with the lowest probabilities and the 50 cells with the highest probabilities for each featureModule will be included. If NULL, all cells will be plotted. Default 100.} \item{topFeatures}{Integer. Plot `topFeatures` features with the highest probabilities in the module heatmap for each featureModule. If \code{NULL}, plot all features in the module. Default \code{NULL}.} \item{normalizedCounts}{Integer matrix. Rows represent features and columns represent cells. If you have a normalized matrix result from \link{normalizeCounts}, you can pass through the result here to skip the normalization step in this function. Make sure the colnames and rownames match the object in x. This matrix should correspond to one generated from this count matrix \code{assay(altExp(x, altExpName), i = useAssay)}. If \code{NA}, normalization will be carried out in the following form \code{normalizeCounts(assay(altExp(x, altExpName), i = useAssay), normalize = "proportion", transformationFun = sqrt)}. Use of this parameter is particularly useful for plotting many module heatmaps, where normalizing the counts matrix repeatedly would be too time consuming. Default NA.} \item{normalize}{Character. Passed to \link{normalizeCounts} if \code{normalizedCounts} is \code{NA}. Divides counts by the library sizes for each cell. One of 'proportion', 'cpm', 'median', or 'mean'. 'proportion' uses the total counts for each cell as the library size. 'cpm' divides the library size of each cell by one million to produce counts per million. 'median' divides the library size of each cell by the median library size across all cells. 'mean' divides the library size of each cell by the mean library size across all cells. Default "proportion".} \item{transformationFun}{Function. Passed to \link{normalizeCounts} if \code{normalizedCounts} is \code{NA}. Applies a transformation such as \link{sqrt}, \link{log}, \link{log2}, \link{log10}, or \link{log1p}. If \code{NULL}, no transformation will be applied. Occurs after normalization. Default \link{sqrt}.} \item{scaleRow}{Function. Which function to use to scale each individual row. Set to NULL to disable. Occurs after normalization and log transformation. For example, \link{scale} will Z-score transform each row. Default \link{scale}.} \item{showFeatureNames}{Logical. Whether feature names should be displayed. Default TRUE.} \item{displayName}{Character. The column name of \code{rowData(altExp(x, altExpName))} that specifies the display names for the features. Default \code{NULL}, which displays the row names. Only works if \code{showFeaturenames} is \code{TRUE} and \code{x} is a \linkS4class{SingleCellExperiment} object.} \item{trim}{Numeric vector. Vector of length two that specifies the lower and upper bounds for plotting the data. This threshold is applied after row scaling. Set to NULL to disable. Default \code{c(-2,2)}.} \item{rowFontSize}{Numeric. Font size for feature names. If \code{NULL}, then the size will automatically be determined. Default \code{NULL}.} \item{showHeatmapLegend}{Passed to \link[ComplexHeatmap]{Heatmap}. Show legend for expression levels.} \item{showTopAnnotationLegend}{Passed to \link[ComplexHeatmap]{HeatmapAnnotation}. Show legend for cell annotation.} \item{showTopAnnotationName}{Passed to \link[ComplexHeatmap]{HeatmapAnnotation}. Show heatmap top annotation name.} \item{topAnnotationHeight}{Passed to \link[ComplexHeatmap]{HeatmapAnnotation}. Column annotation height. \link[ComplexHeatmap]{rowAnnotation}. Show legend for module annotation.} \item{showModuleLabel}{Show left side module labels.} \item{moduleLabel}{The left side row titles for module heatmap. Must be vector of the same length as \code{featureModule}. Default "auto", which automatically pulls module labels from \code{x}.} \item{moduleLabelSize}{Passed to \link{gpar}. The size of text (in points).} \item{byrow}{Passed to \link{matrix}. logical. If \code{FALSE} (the default) the figure panel is filled by columns, otherwise the figure panel is filled by rows.} \item{top}{Passed to \link[gridExtra]{marrangeGrob}. The title for each page.} \item{unit}{Passed to \link[grid]{unit}. Single character object defining the unit of all dimensions defined.} \item{ncol}{Integer. Number of columns of module heatmaps. If \code{NULL}, then this will be automatically calculated so that the number of columns and rows will be approximately the same. Default \code{NULL}.} \item{useRaster}{Boolean. Rasterizing will make the heatmap a single object and reduced the memory of the plot and the size of a file. If \code{NULL}, then rasterization will be automatically determined by the underlying \link[ComplexHeatmap]{Heatmap} function. Default \code{TRUE}.} \item{returnAsList}{Boolean. If \code{TRUE}, then a list of plots will be returned instead of a single multi-panel figure. These plots can be displayed using the \link[grid]{grid.draw} function. Default \code{FALSE}.} \item{...}{Additional parameters passed to \link[ComplexHeatmap]{Heatmap}.} } \value{ A list object if plotting more than one module heatmaps. Otherwise a \link[ComplexHeatmap]{HeatmapList} object is returned. } \description{ Renders a heatmap for selected \code{featureModule}. Cells are ordered from those with the lowest probability of the module on the left to the highest probability on the right. Features are ordered from those with the highest probability in the module on the top to the lowest probability on the bottom. } \examples{ data(sceCeldaCG) moduleHeatmap(sceCeldaCG, displayName = "rownames") } ================================================ FILE: man/nonzero.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{nonzero} \alias{nonzero} \title{get row and column indices of none zero elements in the matrix} \usage{ nonzero(R_counts) } \arguments{ \item{R_counts}{A matrix} } \value{ An integer matrix where each row is a row, column indices pair } \description{ get row and column indices of none zero elements in the matrix } ================================================ FILE: man/normalizeCounts.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_functions.R \name{normalizeCounts} \alias{normalizeCounts} \title{Normalization of count data} \usage{ normalizeCounts( counts, normalize = c("proportion", "cpm", "median", "mean"), scaleFactor = NULL, transformationFun = NULL, scaleFun = NULL, pseudocountNormalize = 0, pseudocountTransform = 0 ) } \arguments{ \item{counts}{Integer, Numeric or Sparse matrix. Rows represent features and columns represent cells.} \item{normalize}{Character. Divides counts by the library sizes for each cell. One of 'proportion', 'cpm', 'median', or 'mean'. 'proportion' uses the total counts for each cell as the library size. 'cpm' divides the library size of each cell by one million to produce counts per million. 'median' divides the library size of each cell by the median library size across all cells. 'mean' divides the library size of each cell by the mean library size across all cells.} \item{scaleFactor}{Numeric. Sets the scale factor for cell-level normalization. This scale factor is multiplied to each cell after the library size of each cell had been adjusted in \code{normalize}. Default \code{NULL} which means no scale factor is applied.} \item{transformationFun}{Function. Applys a transformation such as \link{sqrt}, \link{log}, \link{log2}, \link{log10}, or \link{log1p}. If NULL, no transformation will be applied. Occurs after normalization. Default NULL.} \item{scaleFun}{Function. Scales the rows of the normalized and transformed count matrix. For example, 'scale' can be used to z-score normalize the rows. Default NULL.} \item{pseudocountNormalize}{Numeric. Add a pseudocount to counts before normalization. Default 0.} \item{pseudocountTransform}{Numeric. Add a pseudocount to normalized counts before applying the transformation function. Adding a pseudocount can be useful before applying a log transformation. Default 0.} } \value{ Numeric Matrix. A normalized matrix. } \description{ Performs normalization, transformation, and/or scaling of a counts matrix } \examples{ data(celdaCGSim) normalizedCounts <- normalizeCounts(celdaCGSim$counts, "proportion", pseudocountNormalize = 1) } ================================================ FILE: man/params.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{params} \alias{params} \alias{params,celdaModel-method} \title{Get parameter values provided for celdaModel creation} \usage{ params(celdaMod) \S4method{params}{celdaModel}(celdaMod) } \arguments{ \item{celdaMod}{celdaModel. Options available in \code{celda::availableModels}.} } \value{ List. Contains the model-specific parameters for the provided celda model object depending on its class. } \description{ Retrieves the K/L, model priors (e.g. alpha, beta), and count matrix checksum parameters provided during the creation of the provided celdaModel. } \examples{ data(celdaCGMod) params(celdaCGMod) } ================================================ FILE: man/perplexity.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/perplexity.R \name{perplexity} \alias{perplexity} \alias{perplexity,SingleCellExperiment,ANY-method} \alias{perplexity,ANY,celda_CG-method} \alias{perplexity,ANY,celda_C-method} \alias{perplexity,ANY,celda_G-method} \title{Calculate the perplexity of a celda model} \usage{ perplexity( x, celdaMod, useAssay = "counts", altExpName = "featureSubset", newCounts = NULL ) \S4method{perplexity}{SingleCellExperiment,ANY}( x, useAssay = "counts", altExpName = "featureSubset", newCounts = NULL ) \S4method{perplexity}{ANY,celda_CG}(x, celdaMod, newCounts = NULL) \S4method{perplexity}{ANY,celda_C}(x, celdaMod, newCounts = NULL) \S4method{perplexity}{ANY,celda_G}(x, celdaMod, newCounts = NULL) } \arguments{ \item{x}{Can be one of \itemize{ \item A \linkS4class{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G} or \link{celda_CG}, with the matrix located in the \code{useAssay} assay slot. Rows represent features and columns represent cells. \item Integer counts matrix. Rows represent features and columns represent cells. This matrix should be the same as the one used to generate \code{celdaMod}.}} \item{celdaMod}{Celda model object. Only works if \code{x} is an integer counts matrix.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{newCounts}{A new counts matrix used to calculate perplexity. If NULL, perplexity will be calculated for the matrix in \code{useAssay} slot in \code{x}. Default NULL.} } \value{ Numeric. The perplexity for the provided \code{x} (and \code{celdaModel}). } \description{ Perplexity is a statistical measure of how well a probability model can predict new data. Lower perplexity indicates a better model. } \examples{ data(sceCeldaCG) perplexity <- perplexity(sceCeldaCG) data(celdaCGSim, celdaCGMod) perplexity <- perplexity(celdaCGSim$counts, celdaCGMod) data(celdaCSim, celdaCMod) perplexity <- perplexity(celdaCSim$counts, celdaCMod) data(celdaGSim, celdaGMod) perplexity <- perplexity(celdaGSim$counts, celdaGMod) } ================================================ FILE: man/plotCeldaViolin.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/plot_dr.R \name{plotCeldaViolin} \alias{plotCeldaViolin} \alias{plotCeldaViolin,SingleCellExperiment-method} \alias{plotCeldaViolin,ANY-method} \title{Feature Expression Violin Plot} \usage{ plotCeldaViolin( x, celdaMod, features, displayName = NULL, useAssay = "counts", altExpName = "featureSubset", exactMatch = TRUE, plotDots = TRUE, dotSize = 0.1 ) \S4method{plotCeldaViolin}{SingleCellExperiment}( x, features, displayName = NULL, useAssay = "counts", altExpName = "featureSubset", exactMatch = TRUE, plotDots = TRUE, dotSize = 0.1 ) \S4method{plotCeldaViolin}{ANY}( x, celdaMod, features, exactMatch = TRUE, plotDots = TRUE, dotSize = 0.1 ) } \arguments{ \item{x}{Numeric matrix or a \linkS4class{SingleCellExperiment} object with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells.} \item{celdaMod}{Celda object of class "celda_G" or "celda_CG". Used only if \code{x} is a matrix object.} \item{features}{Character vector. Uses these genes for plotting.} \item{displayName}{Character. The column name of \code{rowData(x)} that specifies the display names for the features. Default \code{NULL}, which displays the row names. Only works if \code{x} is a \linkS4class{SingleCellExperiment} object.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{exactMatch}{Logical. Whether an exact match or a partial match using \code{grep()} is used to look up the feature in the rownames of the counts matrix. Default \code{TRUE}.} \item{plotDots}{Boolean. If \code{TRUE}, the expression of features will be plotted as points in addition to the violin curve. Default \code{TRUE}.} \item{dotSize}{Numeric. Size of points if \code{plotDots = TRUE}. Default \code{0.1}.} } \value{ Violin plot for each feature, grouped by celda cluster } \description{ Outputs a violin plot for feature expression data. } \examples{ data(sceCeldaCG) plotCeldaViolin(x = sceCeldaCG, features = "Gene_1") data(celdaCGSim, celdaCGMod) plotCeldaViolin(x = celdaCGSim$counts, celdaMod = celdaCGMod, features = "Gene_1") } ================================================ FILE: man/plotDecontXContamination.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/plot_decontx.R \name{plotDecontXContamination} \alias{plotDecontXContamination} \title{Plots contamination on UMAP coordinates} \usage{ plotDecontXContamination( x, batch = NULL, colorScale = c("blue", "green", "yellow", "orange", "red"), size = 1 ) } \arguments{ \item{x}{Either a \linkS4class{SingleCellExperiment} with \code{decontX} results stored in \code{metadata(x)$decontX} or the result from running decontX on a count matrix.} \item{batch}{Character. Batch of cells to plot. If \code{NULL}, then the first batch in the list will be selected. Default \code{NULL}.} \item{colorScale}{Character vector. Contains the color spectrum to be passed to \code{scale_colour_gradientn} from package 'ggplot2'. Default c("blue","green","yellow","orange","red").} \item{size}{Numeric. Size of points in the scatterplot. Default 1.} } \value{ Returns a \code{ggplot} object. } \description{ A scatter plot of the UMAP dimensions generated by DecontX with cells colored by the estimated percentation of contamation. } \seealso{ See \code{\link{decontX}} for a full example of how to estimate and plot contamination. } \author{ Shiyi Yang, Joshua Campbell } ================================================ FILE: man/plotDecontXMarkerExpression.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/plot_decontx.R \name{plotDecontXMarkerExpression} \alias{plotDecontXMarkerExpression} \title{Plots expression of marker genes before and after decontamination} \usage{ plotDecontXMarkerExpression( x, markers, groupClusters = NULL, assayName = c("counts", "decontXcounts"), z = NULL, exactMatch = TRUE, by = "rownames", log1p = FALSE, ncol = NULL, plotDots = FALSE, dotSize = 0.1 ) } \arguments{ \item{x}{Either a \linkS4class{SingleCellExperiment} or a matrix-like object of counts.} \item{markers}{Character Vector or List. A character vector or list of character vectors with the names of the marker genes of interest.} \item{groupClusters}{List. A named list that allows cell clusters labels coded in \code{z} to be regrouped and renamed on the fly. For example, \code{list(Tcells=c(1, 2), Bcells=7)} would recode clusters 1 and 2 to "Tcells" and cluster 7 to "Bcells". Note that if this is used, clusters in \code{z} not found in \code{groupClusters} will be excluded. Default \code{NULL}.} \item{assayName}{Character vector. Name(s) of the assay(s) to plot if \code{x} is a \linkS4class{SingleCellExperiment}. If more than one assay is listed, then side-by-side violin plots will be generated. Default \code{c("counts", "decontXcounts")}.} \item{z}{Character, Integer, or Vector. Indicates the cluster labels for each cell. If \code{x} is a \linkS4class{SingleCellExperiment} and \code{z = NULL}, then the cluster labels from \code{\link{decontX}} will be retreived from the \code{colData} of \code{x} (i.e. \code{colData(x)$decontX_clusters}). If \code{z} is a single character or integer, then that column will be retrived from \code{colData} of \code{x}. (i.e. \code{colData(x)[,z]}). If \code{x} is a counts matrix, then \code{z} will need to be a vector the same length as the number of columns in \code{x} that indicate the cluster to which each cell belongs. Default \code{NULL}.} \item{exactMatch}{Boolean. Whether to only identify exact matches for the markers or to identify partial matches using \code{\link{grep}}. See \code{\link{retrieveFeatureIndex}} for more details. Default \code{TRUE}.} \item{by}{Character. Where to search for the markers if \code{x} is a \linkS4class{SingleCellExperiment}. See \code{\link{retrieveFeatureIndex}} for more details. If \code{x} is a matrix, then this must be set to \code{"rownames"}. Default \code{"rownames"}.} \item{log1p}{Boolean. Whether to apply the function \code{log1p} to the data before plotting. This function will add a pseudocount of 1 and then log transform the expression values. Default \code{FALSE}.} \item{ncol}{Integer. Number of columns to make in the plot. Default \code{NULL}.} \item{plotDots}{Boolean. If \code{TRUE}, the expression of features will be plotted as points in addition to the violin curve. Default \code{FALSE}.} \item{dotSize}{Numeric. Size of points if \code{plotDots = TRUE}. Default \code{0.1}.} } \value{ Returns a \code{ggplot} object. } \description{ Generates a violin plot that shows the counts of marker genes in cells across specific clusters or cell types. Can be used to view the expression of marker genes in different cell types before and after decontamination with \code{\link{decontX}}. } \seealso{ See \code{\link{decontX}} for a full example of how to estimate and plot contamination. } \author{ Shiyi Yang, Joshua Campbell } ================================================ FILE: man/plotDecontXMarkerPercentage.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/plot_decontx.R \name{plotDecontXMarkerPercentage} \alias{plotDecontXMarkerPercentage} \title{Plots percentage of cells cell types expressing markers} \usage{ plotDecontXMarkerPercentage( x, markers, groupClusters = NULL, assayName = c("counts", "decontXcounts"), z = NULL, threshold = 1, exactMatch = TRUE, by = "rownames", ncol = round(sqrt(length(markers))), labelBars = TRUE, labelSize = 3 ) } \arguments{ \item{x}{Either a \linkS4class{SingleCellExperiment} or a matrix-like object of counts.} \item{markers}{List. A named list indicating the marker genes for each cell type of interest. Multiple markers can be supplied for each cell type. For example, \code{list(Tcell_Markers=c("CD3E", "CD3D"), Bcell_Markers=c("CD79A", "CD79B", "MS4A1")} would specify markers for human T-cells and B-cells. A cell will be considered "positive" for a cell type if it has a count greater than \code{threshold} for at least one of the marker genes in the list.} \item{groupClusters}{List. A named list that allows cell clusters labels coded in \code{z} to be regrouped and renamed on the fly. For example, \code{list(Tcells=c(1, 2), Bcells=7)} would recode clusters 1 and 2 to "Tcells" and cluster 7 to "Bcells". Note that if this is used, clusters in \code{z} not found in \code{groupClusters} will be excluded from the barplot. Default \code{NULL}.} \item{assayName}{Character vector. Name(s) of the assay(s) to plot if \code{x} is a \linkS4class{SingleCellExperiment}. If more than one assay is listed, then side-by-side barplots will be generated. Default \code{c("counts", "decontXcounts")}.} \item{z}{Character, Integer, or Vector. Indicates the cluster labels for each cell. If \code{x} is a \linkS4class{SingleCellExperiment} and \code{z = NULL}, then the cluster labels from \code{\link{decontX}} will be retived from the \code{colData} of \code{x} (i.e. \code{colData(x)$decontX_clusters}). If \code{z} is a single character or integer, then that column will be retrived from \code{colData} of \code{x}. (i.e. \code{colData(x)[,z]}). If \code{x} is a counts matrix, then \code{z} will need to be a vector the same length as the number of columns in \code{x} that indicate the cluster to which each cell belongs. Default \code{NULL}.} \item{threshold}{Numeric. Markers greater than or equal to this value will be considered detected in a cell. Default 1.} \item{exactMatch}{Boolean. Whether to only identify exact matches for the markers or to identify partial matches using \code{\link{grep}}. See \code{\link{retrieveFeatureIndex}} for more details. Default \code{TRUE}.} \item{by}{Character. Where to search for the markers if \code{x} is a \linkS4class{SingleCellExperiment}. See \code{\link{retrieveFeatureIndex}} for more details. If \code{x} is a matrix, then this must be set to \code{"rownames"}.Default \code{"rownames"}.} \item{ncol}{Integer. Number of columns to make in the plot. Default \code{round(sqrt(length(markers))}.} \item{labelBars}{Boolean. Whether to display percentages above each bar Default \code{TRUE}.} \item{labelSize}{Numeric. Size of the percentage labels in the barplot. Default 3.} } \value{ Returns a \code{ggplot} object. } \description{ Generates a barplot that shows the percentage of cells within clusters or cell types that have detectable levels of given marker genes. Can be used to view the expression of marker genes in different cell types before and after decontamination with \code{\link{decontX}}. } \seealso{ See \code{\link{decontX}} for a full example of how to estimate and plot contamination. } \author{ Shiyi Yang, Joshua Campbell } ================================================ FILE: man/plotDimReduceCluster.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/plot_dr.R \name{plotDimReduceCluster} \alias{plotDimReduceCluster} \alias{plotDimReduceCluster,SingleCellExperiment-method} \alias{plotDimReduceCluster,vector-method} \title{Plotting the cell labels on a dimension reduction plot} \usage{ plotDimReduceCluster( x, reducedDimName, altExpName = "featureSubset", dim1 = NULL, dim2 = NULL, size = 0.5, xlab = NULL, ylab = NULL, specificClusters = NULL, labelClusters = FALSE, groupBy = NULL, labelSize = 3.5 ) \S4method{plotDimReduceCluster}{SingleCellExperiment}( x, reducedDimName, altExpName = "featureSubset", dim1 = 1, dim2 = 2, size = 0.5, xlab = NULL, ylab = NULL, specificClusters = NULL, labelClusters = FALSE, groupBy = NULL, labelSize = 3.5 ) \S4method{plotDimReduceCluster}{vector}( x, dim1, dim2, size = 0.5, xlab = "Dimension_1", ylab = "Dimension_2", specificClusters = NULL, labelClusters = FALSE, groupBy = NULL, labelSize = 3.5 ) } \arguments{ \item{x}{Integer vector of cell cluster labels or a \linkS4class{SingleCellExperiment} object containing cluster labels for each cell in \code{"celda_cell_cluster"} column in \code{colData(x)}.} \item{reducedDimName}{The name of the dimension reduction slot in \code{reducedDimNames(x)} if \code{x} is a \linkS4class{SingleCellExperiment} object. Ignored if both \code{dim1} and \code{dim2} are set.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{dim1}{Integer or numeric vector. If \code{reducedDimName} is supplied, then, this will be used as an index to determine which dimension will be plotted on the x-axis. If \code{reducedDimName} is not supplied, then this should be a vector which will be plotted on the x-axis. Default \code{1}.} \item{dim2}{Integer or numeric vector. If \code{reducedDimName} is supplied, then, this will be used as an index to determine which dimension will be plotted on the y-axis. If \code{reducedDimName} is not supplied, then this should be a vector which will be plotted on the y-axis. Default \code{2}.} \item{size}{Numeric. Sets size of point on plot. Default \code{0.5}.} \item{xlab}{Character vector. Label for the x-axis. Default \code{NULL}.} \item{ylab}{Character vector. Label for the y-axis. Default \code{NULL}.} \item{specificClusters}{Numeric vector. Only color cells in the specified clusters. All other cells will be grey. If NULL, all clusters will be colored. Default \code{NULL}.} \item{labelClusters}{Logical. Whether the cluster labels are plotted. Default FALSE.} \item{groupBy}{Character vector. Contains sample labels for each cell. If NULL, all samples will be plotted together. Default NULL.} \item{labelSize}{Numeric. Sets size of label if labelClusters is TRUE. Default 3.5.} } \value{ The plot as a ggplot object } \description{ Create a scatterplot for each row of a normalized gene expression matrix where x and y axis are from a data dimension reduction tool. The cells are colored by "celda_cell_cluster" column in \code{colData(altExp(x, altExpName))} if \code{x} is a \linkS4class{SingleCellExperiment} object, or \code{x} if \code{x} is a integer vector of cell cluster labels. } \examples{ data(sceCeldaCG) sce <- celdaTsne(sceCeldaCG) plotDimReduceCluster(x = sce, reducedDimName = "celda_tSNE", specificClusters = c(1, 2, 3)) library(SingleCellExperiment) data(sceCeldaCG, celdaCGMod) sce <- celdaTsne(sceCeldaCG) plotDimReduceCluster(x = celdaClusters(celdaCGMod)$z, dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1], dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2], specificClusters = c(1, 2, 3)) } ================================================ FILE: man/plotDimReduceFeature.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/plot_dr.R \name{plotDimReduceFeature} \alias{plotDimReduceFeature} \alias{plotDimReduceFeature,SingleCellExperiment-method} \alias{plotDimReduceFeature,ANY-method} \title{Plotting feature expression on a dimension reduction plot} \usage{ plotDimReduceFeature( x, features, reducedDimName = NULL, displayName = NULL, dim1 = NULL, dim2 = NULL, headers = NULL, useAssay = "counts", altExpName = "featureSubset", normalize = FALSE, zscore = TRUE, exactMatch = TRUE, trim = c(-2, 2), limits = c(-2, 2), size = 0.5, xlab = NULL, ylab = NULL, colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, ncol = NULL, decreasing = FALSE ) \S4method{plotDimReduceFeature}{SingleCellExperiment}( x, features, reducedDimName = NULL, displayName = NULL, dim1 = 1, dim2 = 2, headers = NULL, useAssay = "counts", altExpName = "featureSubset", normalize = FALSE, zscore = TRUE, exactMatch = TRUE, trim = c(-2, 2), limits = c(-2, 2), size = 0.5, xlab = NULL, ylab = NULL, colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, ncol = NULL, decreasing = FALSE ) \S4method{plotDimReduceFeature}{ANY}( x, features, dim1, dim2, headers = NULL, normalize = FALSE, zscore = TRUE, exactMatch = TRUE, trim = c(-2, 2), limits = c(-2, 2), size = 0.5, xlab = "Dimension_1", ylab = "Dimension_2", colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, ncol = NULL, decreasing = FALSE ) } \arguments{ \item{x}{Numeric matrix or a \linkS4class{SingleCellExperiment} object with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells.} \item{features}{Character vector. Features in the rownames of counts to plot.} \item{reducedDimName}{The name of the dimension reduction slot in \code{reducedDimNames(x)} if \code{x} is a \linkS4class{SingleCellExperiment} object. If \code{NULL}, then both \code{dim1} and \code{dim2} need to be set. Default \code{NULL}.} \item{displayName}{Character. The column name of \code{rowData(x)} that specifies the display names for the features. Default \code{NULL}, which displays the row names. Only works if \code{x} is a \linkS4class{SingleCellExperiment} object. Overwrites \code{headers}.} \item{dim1}{Integer or numeric vector. If \code{reducedDimName} is supplied, then, this will be used as an index to determine which dimension will be plotted on the x-axis. If \code{reducedDimName} is not supplied, then this should be a vector which will be plotted on the x-axis. Default \code{1}.} \item{dim2}{Integer or numeric vector. If \code{reducedDimName} is supplied, then, this will be used as an index to determine which dimension will be plotted on the y-axis. If \code{reducedDimName} is not supplied, then this should be a vector which will be plotted on the y-axis. Default \code{2}.} \item{headers}{Character vector. If \code{NULL}, the corresponding rownames are used as labels. Otherwise, these headers are used to label the features. Only works if \code{displayName} is \code{NULL} and \code{exactMatch} is \code{FALSE}.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{normalize}{Logical. Whether to normalize the columns of `counts`. Default \code{FALSE}.} \item{zscore}{Logical. Whether to scale each feature to have a mean 0 and standard deviation of 1. Default \code{TRUE}.} \item{exactMatch}{Logical. Whether an exact match or a partial match using \code{grep()} is used to look up the feature in the rownames of the counts matrix. Default TRUE.} \item{trim}{Numeric vector. Vector of length two that specifies the lower and upper bounds for the data. This threshold is applied after row scaling. Set to NULL to disable. Default \code{c(-1,1)}.} \item{limits}{Passed to \link{scale_colour_gradient2}. The range of color scale.} \item{size}{Numeric. Sets size of point on plot. Default 1.} \item{xlab}{Character vector. Label for the x-axis. If \code{reducedDimName} is used, then this will be set to the column name of the first dimension of that object. Default "Dimension_1".} \item{ylab}{Character vector. Label for the y-axis. If \code{reducedDimName} is used, then this will be set to the column name of the second dimension of that object. Default "Dimension_2".} \item{colorLow}{Character. A color available from `colors()`. The color will be used to signify the lowest values on the scale.} \item{colorMid}{Character. A color available from `colors()`. The color will be used to signify the midpoint on the scale.} \item{colorHigh}{Character. A color available from `colors()`. The color will be used to signify the highest values on the scale.} \item{midpoint}{Numeric. The value indicating the midpoint of the diverging color scheme. If \code{NULL}, defaults to the mean with 10 percent of values trimmed. Default \code{0}.} \item{ncol}{Integer. Passed to \link[ggplot2]{facet_wrap}. Specify the number of columns for facet wrap.} \item{decreasing}{logical. Specifies the order of plotting the points. If \code{FALSE}, the points will be plotted in increasing order where the points with largest values will be on top. \code{TRUE} otherwise. If \code{NULL}, no sorting is performed. Points will be plotted in their current order in \code{x}. Default \code{FALSE}.} } \value{ The plot as a ggplot object } \description{ Create a scatterplot for each row of a normalized gene expression matrix where x and y axis are from a data dimension reduction tool. The cells are colored by expression of the specified feature. } \examples{ data(sceCeldaCG) sce <- celdaTsne(sceCeldaCG) plotDimReduceFeature(x = sce, reducedDimName = "celda_tSNE", normalize = TRUE, features = c("Gene_98", "Gene_99"), exactMatch = TRUE) library(SingleCellExperiment) data(sceCeldaCG) sce <- celdaTsne(sceCeldaCG) plotDimReduceFeature(x = counts(sce), dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1], dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2], normalize = TRUE, features = c("Gene_98", "Gene_99"), exactMatch = TRUE) } ================================================ FILE: man/plotDimReduceGrid.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/plot_dr.R \name{plotDimReduceGrid} \alias{plotDimReduceGrid} \alias{plotDimReduceGrid,SingleCellExperiment-method} \alias{plotDimReduceGrid,ANY-method} \title{Mapping the dimension reduction plot} \usage{ plotDimReduceGrid( x, reducedDimName, dim1 = NULL, dim2 = NULL, useAssay = "counts", altExpName = "featureSubset", size = 1, xlab = "Dimension_1", ylab = "Dimension_2", limits = c(-2, 2), colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, varLabel = NULL, ncol = NULL, headers = NULL, decreasing = FALSE ) \S4method{plotDimReduceGrid}{SingleCellExperiment}( x, reducedDimName, dim1 = NULL, dim2 = NULL, useAssay = "counts", altExpName = "featureSubset", size = 1, xlab = "Dimension_1", ylab = "Dimension_2", limits = c(-2, 2), colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, varLabel = NULL, ncol = NULL, headers = NULL, decreasing = FALSE ) \S4method{plotDimReduceGrid}{ANY}( x, dim1, dim2, size = 1, xlab = "Dimension_1", ylab = "Dimension_2", limits = c(-2, 2), colorLow = "blue4", colorMid = "grey90", colorHigh = "firebrick1", midpoint = 0, varLabel = NULL, ncol = NULL, headers = NULL, decreasing = FALSE ) } \arguments{ \item{x}{Numeric matrix or a \linkS4class{SingleCellExperiment} object with the matrix located in the assay slot under \code{useAssay}. Each row of the matrix will be plotted as a separate facet.} \item{reducedDimName}{The name of the dimension reduction slot in \code{reducedDimNames(x)} if \code{x} is a \linkS4class{SingleCellExperiment} object. Ignored if both \code{dim1} and \code{dim2} are set.} \item{dim1}{Numeric vector. Second dimension from data dimension reduction output.} \item{dim2}{Numeric vector. Second dimension from data dimension reduction output.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{size}{Numeric. Sets size of point on plot. Default 1.} \item{xlab}{Character vector. Label for the x-axis. Default 'Dimension_1'.} \item{ylab}{Character vector. Label for the y-axis. Default 'Dimension_2'.} \item{limits}{Passed to \link{scale_colour_gradient2}. The range of color scale.} \item{colorLow}{Character. A color available from `colors()`. The color will be used to signify the lowest values on the scale. Default "blue4".} \item{colorMid}{Character. A color available from `colors()`. The color will be used to signify the midpoint on the scale. Default "grey90".} \item{colorHigh}{Character. A color available from `colors()`. The color will be used to signify the highest values on the scale. Default "firebrick1".} \item{midpoint}{Numeric. The value indicating the midpoint of the diverging color scheme. If \code{NULL}, defaults to the mean with 10 percent of values trimmed. Default \code{0}.} \item{varLabel}{Character vector. Title for the color legend.} \item{ncol}{Integer. Passed to \link[ggplot2]{facet_wrap}. Specify the number of columns for facet wrap.} \item{headers}{Character vector. If `NULL`, the corresponding rownames are used as labels. Otherwise, these headers are used to label the genes.} \item{decreasing}{logical. Specifies the order of plotting the points. If \code{FALSE}, the points will be plotted in increasing order where the points with largest values will be on top. \code{TRUE} otherwise. If \code{NULL}, no sorting is performed. Points will be plotted in their current order in \code{x}. Default \code{FALSE}.} } \value{ The plot as a ggplot object } \description{ Creates a scatterplot given two dimensions from a data dimension reduction tool (e.g tSNE) output. } \examples{ data(sceCeldaCG) sce <- celdaTsne(sceCeldaCG) plotDimReduceGrid(x = sce, reducedDimName = "celda_tSNE", xlab = "Dimension1", ylab = "Dimension2", varLabel = "tSNE") library(SingleCellExperiment) data(sceCeldaCG) sce <- celdaTsne(sceCeldaCG) plotDimReduceGrid(x = counts(sce), dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1], dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2], xlab = "Dimension1", ylab = "Dimension2", varLabel = "tSNE") } ================================================ FILE: man/plotDimReduceModule.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/plot_dr.R \name{plotDimReduceModule} \alias{plotDimReduceModule} \alias{plotDimReduceModule,SingleCellExperiment-method} \alias{plotDimReduceModule,ANY-method} \title{Plotting Celda module probability on a dimension reduction plot} \usage{ plotDimReduceModule( x, reducedDimName, useAssay = "counts", altExpName = "featureSubset", celdaMod, modules = NULL, dim1 = NULL, dim2 = NULL, size = 0.5, xlab = NULL, ylab = NULL, rescale = TRUE, limits = c(0, 1), colorLow = "grey90", colorHigh = "firebrick1", ncol = NULL, decreasing = FALSE ) \S4method{plotDimReduceModule}{SingleCellExperiment}( x, reducedDimName, useAssay = "counts", altExpName = "featureSubset", modules = NULL, dim1 = 1, dim2 = 2, size = 0.5, xlab = NULL, ylab = NULL, rescale = TRUE, limits = c(0, 1), colorLow = "grey90", colorHigh = "firebrick1", ncol = NULL, decreasing = FALSE ) \S4method{plotDimReduceModule}{ANY}( x, celdaMod, modules = NULL, dim1, dim2, size = 0.5, xlab = "Dimension_1", ylab = "Dimension_2", rescale = TRUE, limits = c(0, 1), colorLow = "grey90", colorHigh = "firebrick1", ncol = NULL, decreasing = FALSE ) } \arguments{ \item{x}{Numeric matrix or a \linkS4class{SingleCellExperiment} object with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells.} \item{reducedDimName}{The name of the dimension reduction slot in \code{reducedDimNames(x)} if \code{x} is a \linkS4class{SingleCellExperiment} object. Ignored if both \code{dim1} and \code{dim2} are set.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{celdaMod}{Celda object of class "celda_G" or "celda_CG". Used only if \code{x} is a matrix object.} \item{modules}{Character vector. Module(s) from celda model to be plotted. e.g. c("1", "2").} \item{dim1}{Integer or numeric vector. If \code{reducedDimName} is supplied, then, this will be used as an index to determine which dimension will be plotted on the x-axis. If \code{reducedDimName} is not supplied, then this should be a vector which will be plotted on the x-axis. Default \code{1}.} \item{dim2}{Integer or numeric vector. If \code{reducedDimName} is supplied, then, this will be used as an index to determine which dimension will be plotted on the y-axis. If \code{reducedDimName} is not supplied, then this should be a vector which will be plotted on the y-axis. Default \code{2}.} \item{size}{Numeric. Sets size of point on plot. Default 0.5.} \item{xlab}{Character vector. Label for the x-axis. Default "Dimension_1".} \item{ylab}{Character vector. Label for the y-axis. Default "Dimension_2".} \item{rescale}{Logical. Whether rows of the matrix should be rescaled to [0, 1]. Default TRUE.} \item{limits}{Passed to \link{scale_colour_gradient}. The range of color scale.} \item{colorLow}{Character. A color available from `colors()`. The color will be used to signify the lowest values on the scale.} \item{colorHigh}{Character. A color available from `colors()`. The color will be used to signify the highest values on the scale.} \item{ncol}{Integer. Passed to \link[ggplot2]{facet_wrap}. Specify the number of columns for facet wrap.} \item{decreasing}{logical. Specifies the order of plotting the points. If \code{FALSE}, the points will be plotted in increasing order where the points with largest values will be on top. \code{TRUE} otherwise. If \code{NULL}, no sorting is performed. Points will be plotted in their current order in \code{x}. Default \code{FALSE}.} } \value{ The plot as a ggplot object } \description{ Create a scatterplot for each row of a normalized gene expression matrix where x and y axis are from a data dimension reduction tool. The cells are colored by the module probability. } \examples{ data(sceCeldaCG) sce <- celdaTsne(sceCeldaCG) plotDimReduceModule(x = sce, reducedDimName = "celda_tSNE", modules = c("1", "2")) library(SingleCellExperiment) data(sceCeldaCG, celdaCGMod) sce <- celdaTsne(sceCeldaCG) plotDimReduceModule(x = counts(sce), dim1 = reducedDim(altExp(sce), "celda_tSNE")[, 1], dim2 = reducedDim(altExp(sce), "celda_tSNE")[, 2], celdaMod = celdaCGMod, modules = c("1", "2")) } ================================================ FILE: man/plotGridSearchPerplexity.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/perplexity.R \name{plotGridSearchPerplexity} \alias{plotGridSearchPerplexity} \alias{plotGridSearchPerplexity,SingleCellExperiment-method} \alias{plotGridSearchPerplexity,celdaList-method} \title{Visualize perplexity of a list of celda models} \usage{ plotGridSearchPerplexity(x, altExpName = "featureSubset", sep = 5, alpha = 0.5) \S4method{plotGridSearchPerplexity}{SingleCellExperiment}(x, altExpName = "featureSubset", sep = 5, alpha = 0.5) \S4method{plotGridSearchPerplexity}{celdaList}(x, sep = 5, alpha = 0.5) } \arguments{ \item{x}{Can be one of \itemize{ \item A \linkS4class{SingleCellExperiment} object returned from \code{celdaGridSearch}, \code{recursiveSplitModule}, or \code{recursiveSplitCell}. Must contain a list named \code{"celda_grid_search"} in \code{metadata(x)}. \item celdaList object.}} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset". Only works if \code{x} is a \linkS4class{SingleCellExperiment} object.} \item{sep}{Numeric. Breaks in the x axis of the resulting plot.} \item{alpha}{Numeric. Passed to \link{geom_jitter}. Opacity of the points. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.} } \value{ A ggplot plot object showing perplexity as a function of clustering parameters. } \description{ Visualize perplexity of every model in a celdaList, by unique K/L combinations } \examples{ data(sceCeldaCGGridSearch) sce <- resamplePerplexity(sceCeldaCGGridSearch) plotGridSearchPerplexity(sce) data(celdaCGSim, celdaCGGridSearchRes) ## Run various combinations of parameters with 'celdaGridSearch' celdaCGGridSearchRes <- resamplePerplexity( celdaCGSim$counts, celdaCGGridSearchRes) plotGridSearchPerplexity(celdaCGGridSearchRes) } ================================================ FILE: man/plotHeatmap.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/plotHeatmap.R \name{plotHeatmap} \alias{plotHeatmap} \title{Plots heatmap based on Celda model} \usage{ plotHeatmap( counts, z = NULL, y = NULL, scaleRow = scale, trim = c(-2, 2), featureIx = NULL, cellIx = NULL, clusterFeature = TRUE, clusterCell = TRUE, colorScheme = c("divergent", "sequential"), colorSchemeSymmetric = TRUE, colorSchemeCenter = 0, col = NULL, annotationCell = NULL, annotationFeature = NULL, annotationColor = NULL, breaks = NULL, legend = TRUE, annotationLegend = TRUE, annotationNamesFeature = TRUE, annotationNamesCell = TRUE, showNamesFeature = FALSE, showNamesCell = FALSE, rowGroupOrder = NULL, colGroupOrder = NULL, hclustMethod = "ward.D2", treeheightFeature = ifelse(clusterFeature, 50, 0), treeheightCell = ifelse(clusterCell, 50, 0), silent = FALSE, ... ) } \arguments{ \item{counts}{Numeric or sparse matrix. Normalized counts matrix where rows represent features and columns represent cells. .} \item{z}{Numeric vector. Denotes cell population labels.} \item{y}{Numeric vector. Denotes feature module labels.} \item{scaleRow}{Function. A function to scale each individual row. Set to NULL to disable. Occurs after normalization and log transformation. Defualt is 'scale' and thus will Z-score transform each row.} \item{trim}{Numeric vector. Vector of length two that specifies the lower and upper bounds for the data. This threshold is applied after row scaling. Set to NULL to disable. Default c(-2,2).} \item{featureIx}{Integer vector. Select features for display in heatmap. If NULL, no subsetting will be performed. Default NULL.} \item{cellIx}{Integer vector. Select cells for display in heatmap. If NULL, no subsetting will be performed. Default NULL.} \item{clusterFeature}{Logical. Determines whether rows should be clustered. Default TRUE.} \item{clusterCell}{Logical. Determines whether columns should be clustered. Default TRUE.} \item{colorScheme}{Character. One of "divergent" or "sequential". A "divergent" scheme is best for highlighting relative data (denoted by 'colorSchemeCenter') such as gene expression data that has been normalized and centered. A "sequential" scheme is best for highlighting data that are ordered low to high such as raw counts or probabilities. Default "divergent".} \item{colorSchemeSymmetric}{Logical. When the colorScheme is "divergent" and the data contains both positive and negative numbers, TRUE indicates that the color scheme should be symmetric from \code{[-max(abs(data)), max(abs(data))]}. For example, if the data ranges goes from -1.5 to 2, then setting this to TRUE will force the color scheme to range from -2 to 2. Default TRUE.} \item{colorSchemeCenter}{Numeric. Indicates the center of a "divergent" colorScheme. Default 0.} \item{col}{Color for the heatmap.} \item{annotationCell}{Data frame. Additional annotations for each cell will be shown in the column color bars. The format of the data frame should be one row for each cell and one column for each annotation. Numeric variables will be displayed as continuous color bars and factors will be displayed as discrete color bars. Default NULL.} \item{annotationFeature}{A data frame for the feature annotations (rows).} \item{annotationColor}{List. Contains color scheme for all annotations. See `?pheatmap` for more details.} \item{breaks}{Numeric vector. A sequence of numbers that covers the range of values in the normalized `counts`. Values in the normalized `matrix` are assigned to each bin in `breaks`. Each break is assigned to a unique color from `col`. If NULL, then breaks are calculated automatically. Default NULL.} \item{legend}{Logical. Determines whether legend should be drawn. Default TRUE.} \item{annotationLegend}{Logical. Whether legend for all annotations should be drawn. Default TRUE.} \item{annotationNamesFeature}{Logical. Whether the names for features should be shown. Default TRUE.} \item{annotationNamesCell}{Logical. Whether the names for cells should be shown. Default TRUE.} \item{showNamesFeature}{Logical. Specifies if feature names should be shown. Default TRUE.} \item{showNamesCell}{Logical. Specifies if cell names should be shown. Default FALSE.} \item{rowGroupOrder}{Vector. Specifies the order of feature clusters when semisupervised clustering is performed on the \code{y} labels.} \item{colGroupOrder}{Vector. Specifies the order of cell clusters when semisupervised clustering is performed on the \code{z} labels.} \item{hclustMethod}{Character. Specifies the method to use for the 'hclust' function. See `?hclust` for possible values. Default "ward.D2".} \item{treeheightFeature}{Numeric. Width of the feature dendrogram. Set to 0 to disable plotting of this dendrogram. Default: if clusterFeature == TRUE, then treeheightFeature = 50, else treeheightFeature = 0.} \item{treeheightCell}{Numeric. Height of the cell dendrogram. Set to 0 to disable plotting of this dendrogram. Default: if clusterCell == TRUE, then treeheightCell = 50, else treeheightCell = 0.} \item{silent}{Logical. Whether to plot the heatmap.} \item{...}{Other arguments to be passed to underlying pheatmap function.} } \value{ list A list containing dendrogram information and the heatmap grob } \description{ Renders a heatmap based on a matrix of counts where rows are features and columns are cells. } \examples{ data(celdaCGSim, celdaCGMod) plotHeatmap(celdaCGSim$counts, z = celdaClusters(celdaCGMod)$z, y = celdaClusters(celdaCGMod)$y ) } ================================================ FILE: man/plotRPC.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/perplexity.R \name{plotRPC} \alias{plotRPC} \alias{plotRPC,SingleCellExperiment-method} \alias{plotRPC,celdaList-method} \title{Visualize perplexity differences of a list of celda models} \usage{ plotRPC(x, altExpName = "featureSubset", sep = 5, alpha = 0.5) \S4method{plotRPC}{SingleCellExperiment}(x, altExpName = "featureSubset", sep = 5, alpha = 0.5) \S4method{plotRPC}{celdaList}(x, sep = 5, alpha = 0.5) } \arguments{ \item{x}{Can be one of \itemize{ \item A \linkS4class{SingleCellExperiment} object returned from \code{celdaGridSearch}, \code{recursiveSplitModule}, or \code{recursiveSplitCell}. Must contain a list named \code{"celda_grid_search"} in \code{metadata(x)}. \item celdaList object.}} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{sep}{Numeric. Breaks in the x axis of the resulting plot.} \item{alpha}{Numeric. Passed to \link{geom_jitter}. Opacity of the points. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.} } \value{ A ggplot plot object showing perplexity differences as a function of clustering parameters. } \description{ Visualize perplexity differences of every model in a celdaList, by unique K/L combinations. } \examples{ data(sceCeldaCGGridSearch) sce <- resamplePerplexity(sceCeldaCGGridSearch) plotRPC(sce) data(celdaCGSim, celdaCGGridSearchRes) ## Run various combinations of parameters with 'celdaGridSearch' celdaCGGridSearchRes <- resamplePerplexity( celdaCGSim$counts, celdaCGGridSearchRes) plotRPC(celdaCGGridSearchRes) } ================================================ FILE: man/recodeClusterY.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_functions.R \name{recodeClusterY} \alias{recodeClusterY} \title{Recode feature module labels} \usage{ recodeClusterY(sce, from, to, altExpName = "featureSubset") } \arguments{ \item{sce}{\linkS4class{SingleCellExperiment} object returned from \link{celda_G} or \link{celda_CG}. Must contain column \code{celda_feature_module} in \code{\link{rowData}(altExp(sce, altExpName))}.} \item{from}{Numeric vector. Unique values in the range of \code{seq(celdaModules(sce))} that correspond to the original module labels in \code{sce}.} \item{to}{Numeric vector. Unique values in the range of \code{seq(celdaModules(sce))} that correspond to the new module labels.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ @return \linkS4class{SingleCellExperiment} object with recoded feature module labels. } \description{ Recode feature module clusters using a mapping in the \code{from} and \code{to} arguments. } \examples{ data(sceCeldaCG) sceReorderedY <- recodeClusterY(sceCeldaCG, c(1, 3), c(3, 1)) } ================================================ FILE: man/recodeClusterZ.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_functions.R \name{recodeClusterZ} \alias{recodeClusterZ} \title{Recode cell cluster labels} \usage{ recodeClusterZ(sce, from, to, altExpName = "featureSubset") } \arguments{ \item{sce}{\linkS4class{SingleCellExperiment} object returned from \link{celda_C} or \link{celda_CG}. Must contain column \code{celda_cell_cluster} in \code{\link{colData}(altExp(sce, altExpName))}.} \item{from}{Numeric vector. Unique values in the range of \code{seq(max(as.integer(celdaClusters(sce, altExpName = altExpName))))} that correspond to the original cluster labels in \code{sce}.} \item{to}{Numeric vector. Unique values in the range of \code{seq(max(as.integer(celdaClusters(sce, altExpName = altExpName))))} that correspond to the new cluster labels.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ \linkS4class{SingleCellExperiment} object with recoded cell cluster labels. } \description{ Recode cell subpopulaton clusters using a mapping in the \code{from} and \code{to} arguments. } \examples{ data(sceCeldaCG) sceReorderedZ <- recodeClusterZ(sceCeldaCG, c(1, 3), c(3, 1)) } ================================================ FILE: man/recursiveSplitCell.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/recursiveSplit.R \name{recursiveSplitCell} \alias{recursiveSplitCell} \alias{recursiveSplitCell,SingleCellExperiment-method} \alias{recursiveSplitCell,matrix-method} \title{Recursive cell splitting} \usage{ recursiveSplitCell( x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, initialK = 5, maxK = 25, tempL = NULL, yInit = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minCell = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, logfile = NULL, verbose = TRUE ) \S4method{recursiveSplitCell}{SingleCellExperiment}( x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, initialK = 5, maxK = 25, tempL = NULL, yInit = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minCell = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, logfile = NULL, verbose = TRUE ) \S4method{recursiveSplitCell}{matrix}( x, useAssay = "counts", altExpName = "featureSubset", sampleLabel = NULL, initialK = 5, maxK = 25, tempL = NULL, yInit = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minCell = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, logfile = NULL, verbose = TRUE ) } \arguments{ \item{x}{A numeric \link{matrix} of counts or a \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells.} \item{useAssay}{A string specifying the name of the \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{sampleLabel}{Vector or factor. Denotes the sample label for each cell (column) in the count matrix.} \item{initialK}{Integer. Initial number of cell populations to try. Default \code{5}.} \item{maxK}{Integer. Maximum number of cell populations to try. Default \code{25}.} \item{tempL}{Integer. Number of temporary modules to identify and use in cell splitting. Only used if \code{yInit = NULL}. Collapsing features to a relatively smaller number of modules will increase the speed of clustering and tend to produce better cell populations. This number should be larger than the number of true modules expected in the dataset. Default \code{NULL.}} \item{yInit}{Integer vector. Module labels for features. Cells will be clustered using the \link{celda_CG} model based on the modules specified in \code{yInit} rather than the counts of individual features. While the features will be initialized to the module labels in \code{yInit}, the labels will be allowed to move within each new model with a different K.} \item{alpha}{Numeric. Concentration parameter for Theta. Adds a pseudocount to each cell population in each sample. Default \code{1}.} \item{beta}{Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature in each cell (if \code{yInit} is NULL) or to each module in each cell population (if \code{yInit} is set). Default \code{1}.} \item{delta}{Numeric. Concentration parameter for Psi. Adds a pseudocount to each feature in each module. Only used if \code{yInit} is set. Default 1.} \item{gamma}{Numeric. Concentration parameter for Eta. Adds a pseudocount to the number of features in each module. Only used if \code{yInit} is set. Default 1.} \item{minCell}{Integer. Only attempt to split cell populations with at least this many cells.} \item{reorder}{Logical. Whether to reorder cell populations using hierarchical clustering after each model has been created. If FALSE, cell populations numbers will correspond to the split which created the cell populations (i.e. 'K15' was created at split 15, 'K16' was created at split 16, etc.). Default TRUE.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \link[withr]{with_seed} are made.} \item{perplexity}{Logical. Whether to calculate perplexity for each model. If FALSE, then perplexity can be calculated later with \link{resamplePerplexity}. Default TRUE.} \item{doResampling}{Boolean. If \code{TRUE}, then each cell in the counts matrix will be resampled according to a multinomial distribution to introduce noise before calculating perplexity. Default \code{FALSE}.} \item{numResample}{Integer. The number of times to resample the counts matrix for evaluating perplexity if \code{doResampling} is set to \code{TRUE}. Default \code{5}.} \item{logfile}{Character. Messages will be redirected to a file named "logfile". If NULL, messages will be printed to stdout. Default NULL.} \item{verbose}{Logical. Whether to print log messages. Default TRUE.} } \value{ A \linkS4class{SingleCellExperiment} object. Function parameter settings and celda model results are stored in the \link{metadata} \code{"celda_grid_search"} slot. The models in the list will be of class \code{celda_C} if \code{yInit = NULL} or \code{celda_CG} if \code{zInit} is set. } \description{ Uses the \link{celda_C} model to cluster cells into population for range of possible K's. The cell population labels of the previous "K-1" model are used as the initial values in the current model with K cell populations. The best split of an existing cell population is found to create the K-th cluster. This procedure is much faster than randomly initializing each model with a different K. If module labels for each feature are given in 'yInit', the \link{celda_CG} model will be used to split cell populations based on those modules instead of individual features. Module labels will also be updated during sampling and thus may end up slightly different than \code{yInit}. } \examples{ data(sceCeldaCG) ## Create models that range from K = 3 to K = 7 by recursively splitting ## cell populations into two to produce \link{celda_C} cell clustering models sce <- recursiveSplitCell(sceCeldaCG, initialK = 3, maxK = 7) ## Alternatively, first identify features modules using ## \link{recursiveSplitModule} moduleSplit <- recursiveSplitModule(sceCeldaCG, initialL = 3, maxL = 15) plotGridSearchPerplexity(moduleSplit) moduleSplitSelect <- subsetCeldaList(moduleSplit, list(L = 10)) ## Then use module labels for initialization in \link{recursiveSplitCell} to ## produce \link{celda_CG} bi-clustering models cellSplit <- recursiveSplitCell(sceCeldaCG, initialK = 3, maxK = 7, yInit = celdaModules(moduleSplitSelect)) plotGridSearchPerplexity(cellSplit) sce <- subsetCeldaList(cellSplit, list(K = 5, L = 10)) data(celdaCGSim, celdaCSim) ## Create models that range from K = 3 to K = 7 by recursively splitting ## cell populations into two to produce \link{celda_C} cell clustering models sce <- recursiveSplitCell(celdaCSim$counts, initialK = 3, maxK = 7) ## Alternatively, first identify features modules using ## \link{recursiveSplitModule} moduleSplit <- recursiveSplitModule(celdaCGSim$counts, initialL = 3, maxL = 15) plotGridSearchPerplexity(moduleSplit) moduleSplitSelect <- subsetCeldaList(moduleSplit, list(L = 10)) ## Then use module labels for initialization in \link{recursiveSplitCell} to ## produce \link{celda_CG} bi-clustering models cellSplit <- recursiveSplitCell(celdaCGSim$counts, initialK = 3, maxK = 7, yInit = celdaModules(moduleSplitSelect)) plotGridSearchPerplexity(cellSplit) sce <- subsetCeldaList(cellSplit, list(K = 5, L = 10)) } \seealso{ \link{recursiveSplitModule} for recursive splitting of feature modules. } ================================================ FILE: man/recursiveSplitModule.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/recursiveSplit.R \name{recursiveSplitModule} \alias{recursiveSplitModule} \alias{recursiveSplitModule,SingleCellExperiment-method} \alias{recursiveSplitModule,matrix-method} \title{Recursive module splitting} \usage{ recursiveSplitModule( x, useAssay = "counts", altExpName = "featureSubset", initialL = 10, maxL = 100, tempK = 100, zInit = NULL, sampleLabel = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minFeature = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, verbose = TRUE, logfile = NULL ) \S4method{recursiveSplitModule}{SingleCellExperiment}( x, useAssay = "counts", altExpName = "featureSubset", initialL = 10, maxL = 100, tempK = 100, zInit = NULL, sampleLabel = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minFeature = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, verbose = TRUE, logfile = NULL ) \S4method{recursiveSplitModule}{matrix}( x, useAssay = "counts", altExpName = "featureSubset", initialL = 10, maxL = 100, tempK = 100, zInit = NULL, sampleLabel = NULL, alpha = 1, beta = 1, delta = 1, gamma = 1, minFeature = 3, reorder = TRUE, seed = 12345, perplexity = TRUE, doResampling = FALSE, numResample = 5, verbose = TRUE, logfile = NULL ) } \arguments{ \item{x}{A numeric \link{matrix} of counts or a \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \link[SingleCellExperiment]{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{initialL}{Integer. Initial number of modules.} \item{maxL}{Integer. Maximum number of modules.} \item{tempK}{Integer. Number of temporary cell populations to identify and use in module splitting. Only used if \code{zInit = NULL} Collapsing cells to a relatively smaller number of cell popluations will increase the speed of module clustering and tend to produce better modules. This number should be larger than the number of true cell populations expected in the dataset. Default \code{100}.} \item{zInit}{Integer vector. Collapse cells to cell populations based on labels in \code{zInit} and then perform module splitting. If NULL, no collapsing will be performed unless \code{tempK} is specified. Default \code{NULL}.} \item{sampleLabel}{Vector or factor. Denotes the sample label for each cell (column) in the count matrix. Default \code{NULL}.} \item{alpha}{Numeric. Concentration parameter for Theta. Adds a pseudocount to each cell population in each sample. Only used if \code{zInit} is set. Default \code{1}.} \item{beta}{Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature module in each cell. Default 1.} \item{delta}{Numeric. Concentration parameter for Psi. Adds a pseudocount to each feature in each module. Default 1.} \item{gamma}{Numeric. Concentration parameter for Eta. Adds a pseudocount to the number of features in each module. Default 1.} \item{minFeature}{Integer. Only attempt to split modules with at least this many features.} \item{reorder}{Logical. Whether to reorder modules using hierarchical clustering after each model has been created. If FALSE, module numbers will correspond to the split which created the module (i.e. 'L15' was created at split 15, 'L16' was created at split 16, etc.). Default TRUE.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \link[withr]{with_seed} are made.} \item{perplexity}{Logical. Whether to calculate perplexity for each model. If FALSE, then perplexity can be calculated later with \link{resamplePerplexity}. Default \code{TRUE}.} \item{doResampling}{Boolean. If \code{TRUE}, then each cell in the counts matrix will be resampled according to a multinomial distribution to introduce noise before calculating perplexity. Default \code{FALSE}.} \item{numResample}{Integer. The number of times to resample the counts matrix for evaluating perplexity if \code{doResampling} is set to \code{TRUE}. Default \code{5}.} \item{verbose}{Logical. Whether to print log messages. Default TRUE.} \item{logfile}{Character. Messages will be redirected to a file named "logfile". If NULL, messages will be printed to stdout. Default NULL.} } \value{ A \linkS4class{SingleCellExperiment} object. Function parameter settings and celda model results are stored in the \link{metadata} \code{"celda_grid_search"} slot. The models in the list will be of class \link{celda_G} if \code{zInit = NULL} or \link{celda_CG} if \code{zInit} is set. } \description{ Uses the \link{celda_G} model to cluster features into modules for a range of possible L's. The module labels of the previous "L-1" model are used as the initial values in the current model with L modules. The best split of an existing module is found to create the L-th module. This procedure is much faster than randomly initializing each model with a different L. } \examples{ data(sceCeldaCG) ## Create models that range from L=3 to L=20 by recursively splitting modules ## into two moduleSplit <- recursiveSplitModule(sceCeldaCG, initialL = 3, maxL = 20) ## Example results with perplexity plotGridSearchPerplexity(moduleSplit) ## Select model for downstream analysis celdaMod <- subsetCeldaList(moduleSplit, list(L = 10)) data(celdaCGSim) ## Create models that range from L=3 to L=20 by recursively splitting modules ## into two moduleSplit <- recursiveSplitModule(celdaCGSim$counts, initialL = 3, maxL = 20) ## Example results with perplexity plotGridSearchPerplexity(moduleSplit) ## Select model for downstream analysis celdaMod <- subsetCeldaList(moduleSplit, list(L = 10)) } \seealso{ \code{recursiveSplitCell} for recursive splitting of cell populations. } ================================================ FILE: man/reorderCelda.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/reorderCelda.R \name{reorderCelda} \alias{reorderCelda} \alias{reorderCelda,SingleCellExperiment,ANY-method} \alias{reorderCelda,matrix,celda_CG-method} \alias{reorderCelda,matrix,celda_C-method} \alias{reorderCelda,matrix,celda_G-method} \title{Reorder cells populations and/or features modules using hierarchical clustering} \usage{ reorderCelda( x, celdaMod, useAssay = "counts", altExpName = "featureSubset", method = "complete" ) \S4method{reorderCelda}{SingleCellExperiment,ANY}( x, useAssay = "counts", altExpName = "featureSubset", method = "complete" ) \S4method{reorderCelda}{matrix,celda_CG}(x, celdaMod, method = "complete") \S4method{reorderCelda}{matrix,celda_C}(x, celdaMod, method = "complete") \S4method{reorderCelda}{matrix,celda_G}(x, celdaMod, method = "complete") } \arguments{ \item{x}{Can be one of \itemize{ \item A \linkS4class{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G} or \link{celda_CG}, with the matrix located in the \code{useAssay} assay slot in \code{altExp(x, altExpName)}. Rows represent features and columns represent cells. \item Integer count matrix. Rows represent features and columns represent cells. This matrix should be the same as the one used to generate \code{celdaMod}.}} \item{celdaMod}{Celda model object. Only works if \code{x} is an integer counts matrix. Ignored if \code{x} is a \linkS4class{SingleCellExperiment} object.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot. Default "featureSubset".} \item{method}{Passed to \link{hclust}. The agglomeration method to be used to be used. Default "complete".} } \value{ A \linkS4class{SingleCellExperiment} object (or Celda model object) with updated cell cluster and/or feature module labels. } \description{ Apply hierarchical clustering to reorder the cell populations and/or feature modules and group similar ones together based on the cosine distance of the factorized matrix from \link{factorizeMatrix}. } \examples{ data(sceCeldaCG) reordersce <- reorderCelda(sceCeldaCG) data(celdaCGSim, celdaCGMod) reorderCeldaCG <- reorderCelda(celdaCGSim$counts, celdaCGMod) data(celdaCSim, celdaCMod) reorderCeldaC <- reorderCelda(celdaCSim$counts, celdaCMod) data(celdaGSim, celdaGMod) reorderCeldaG <- reorderCelda(celdaGSim$counts, celdaGMod) } ================================================ FILE: man/reportceldaCG.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/reports.R \name{reportceldaCG} \alias{reportceldaCG} \alias{reportCeldaCGRun} \alias{reportCeldaCGPlotResults} \title{Generate an HTML report for celda_CG} \usage{ reportCeldaCGRun( sce, L, K, sampleLabel = NULL, altExpName = "featureSubset", useAssay = "counts", initialL = 10, maxL = 150, initialK = 5, maxK = 50, minCell = 3, minCount = 3, maxFeatures = 5000, output_file = "CeldaCG_RunReport", output_sce_prefix = "celda_cg", output_dir = ".", pdf = FALSE, showSession = TRUE ) reportCeldaCGPlotResults( sce, reducedDimName, features = NULL, displayName = NULL, altExpName = "featureSubset", useAssay = "counts", cellAnnot = NULL, cellAnnotLabel = NULL, exactMatch = TRUE, moduleFilePrefix = "module_features", output_file = "CeldaCG_ResultReport", output_dir = ".", pdf = FALSE, showSetup = TRUE, showSession = TRUE ) } \arguments{ \item{sce}{A \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells.} \item{L}{Integer. Final number of feature modules. See \code{celda_CG} for more information.} \item{K}{Integer. Final number of cell populations. See \code{celda_CG} for more information.} \item{sampleLabel}{Vector or factor. Denotes the sample label for each cell (column) in the count matrix.} \item{altExpName}{The name for the \link{altExp} slot to use. Default \code{"featureSubset"}.} \item{useAssay}{A string specifying which \link{assay} slot to use. Default \code{"counts"}.} \item{initialL}{Integer. Minimum number of modules to try. See \link{recursiveSplitModule} for more information. Defailt \code{10}.} \item{maxL}{Integer. Maximum number of modules to try. See \link{recursiveSplitModule} for more information. Default \code{150}.} \item{initialK}{Integer. Initial number of cell populations to try.} \item{maxK}{Integer. Maximum number of cell populations to try.} \item{minCell}{Integer. Minimum number of cells required for feature selection. See \link{selectFeatures} for more information. Default \code{3}.} \item{minCount}{Integer. Minimum number of counts required for feature selection. See \link{selectFeatures} for more information. Default \code{3}.} \item{maxFeatures}{Integer. Maximum number of features to include. If the number of features after filtering for \code{minCell} and \code{minCount} are greater than \code{maxFeature}, then Seurat's VST function is used to select the top variable features. Default \code{5000}.} \item{output_file}{Character. Prefix of the html file. Default \code{"CeldaCG_ResultReport"}.} \item{output_sce_prefix}{Character. The \code{sce} object with \code{celda_CG} results will be saved to an \code{.rds} file starting with this prefix. Default \code{celda_cg}.} \item{output_dir}{Character. Path to save the html file. Default \code{.}.} \item{pdf}{Boolean. Whether to create PDF versions of each plot in addition to PNGs. Default \code{FALSE}.} \item{showSession}{Boolean. Whether to show the session information at the end. Default \code{TRUE}.} \item{reducedDimName}{Character. Name of the reduced dimensional object to be used in 2-D scatter plots throughout the report. Default \code{celda_UMAP}.} \item{features}{Character vector. Expression of these features will be displayed on a reduced dimensional plot defined by \code{reducedDimName}. If \code{NULL}, then no plotting of features on a reduced dimensinoal plot will be performed. Default \code{NULL}.} \item{displayName}{Character. The name to use for display in scatter plots and heatmaps. If \code{NULL}, then the rownames of the \code{sce} object will be used. This can also be set to the name of a column in the row data of \code{sce} or \code{altExp(sce, altExpName)}. Default \code{NULL}.} \item{cellAnnot}{Character vector. The cell-level annotations to display on the reduced dimensional plot. These variables should be present in the column data of the \code{sce} object. Default \code{NULL}.} \item{cellAnnotLabel}{Character vector. Additional cell-level annotations to display on the reduced dimensional plot. Variables will be treated as categorial and labels for each group will be placed on the plot. These variables should be present in the column data of the \code{sce} object. Default \code{NULL}.} \item{exactMatch}{Boolean. Whether to only identify exact matches or to identify partial matches using \code{\link{grep}}. Default \code{FALSE}.} \item{moduleFilePrefix}{Character. The features in each module will be written to a a csv file starting with this name. If \code{NULL}, then no file will be written. Default \code{"module_features"}.} \item{showSetup}{Boolean. Whether to show the setup code at the beginning. Default \code{TRUE}.} } \value{ .html file } \description{ \code{reportCeldaCGRun} will run \link{recursiveSplitModule} and \link{recursiveSplitCell} to find the number of modules (\code{L}) and the number of cell populations (\code{K}). A final \link{celda_CG} model will be selected from \link{recursiveSplitCell}. After a \link{celda_CG} model has been fit, \code{reportCeldaCGPlotResults} can be used to create an HTML report for visualization and exploration of the \link{celda_CG} model results. Some of the plotting and feature selection functions require the installation of the Bioconductor package \code{singleCellTK}. } \examples{ data(sceCeldaCG) \dontrun{ library(SingleCellExperiment) sceCeldaCG$sum <- colSums(counts(sceCeldaCG)) rowData(sceCeldaCG)$rownames <- rownames(sceCeldaCG) sceCeldaCG <- reportCeldaCGRun(sceCeldaCG, initialL = 5, maxL = 20, initialK = 5, maxK = 20, L = 10, K = 5) reportCeldaCGPlotResults(sce = sceCeldaCG, reducedDimName = "celda_UMAP", features = c("Gene_1", "Gene_100"), displayName = "rownames", cellAnnot="sum") } } ================================================ FILE: man/resList.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{resList} \alias{resList} \alias{resList,SingleCellExperiment-method} \alias{resList,celdaList-method} \title{Get final celdaModels from a celda model \code{SCE} or celdaList object} \usage{ resList(x, altExpName = "featureSubset") \S4method{resList}{SingleCellExperiment}(x, altExpName = "featureSubset") \S4method{resList}{celdaList}(x) } \arguments{ \item{x}{An object of class \linkS4class{SingleCellExperiment} or \code{celdaList}.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ List. Contains one celdaModel object for each of the parameters specified in \code{runParams(x)}. } \description{ Returns all celda models generated during a \link{celdaGridSearch} run. } \examples{ data(sceCeldaCGGridSearch) celdaCGGridModels <- resList(sceCeldaCGGridSearch) data(celdaCGGridSearchRes) celdaCGGridModels <- resList(celdaCGGridSearchRes) } ================================================ FILE: man/resamplePerplexity.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/perplexity.R \name{resamplePerplexity} \alias{resamplePerplexity} \alias{resamplePerplexity,SingleCellExperiment-method} \alias{resamplePerplexity,ANY-method} \title{Calculate and visualize perplexity of all models in a celdaList} \usage{ resamplePerplexity( x, celdaList, useAssay = "counts", altExpName = "featureSubset", doResampling = FALSE, numResample = 5, seed = 12345 ) \S4method{resamplePerplexity}{SingleCellExperiment}( x, useAssay = "counts", altExpName = "featureSubset", doResampling = FALSE, numResample = 5, seed = 12345 ) \S4method{resamplePerplexity}{ANY}( x, celdaList, doResampling = FALSE, numResample = 5, seed = 12345 ) } \arguments{ \item{x}{A numeric \link{matrix} of counts or a \linkS4class{SingleCellExperiment} returned from \link{celdaGridSearch} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells. Must contain "celda_grid_search" slot in \code{metadata(x)} if \code{x} is a \linkS4class{SingleCellExperiment} object.} \item{celdaList}{Object of class 'celdaList'. Used only if \code{x} is a matrix object.} \item{useAssay}{A string specifying which \link{assay} slot to use if \code{x} is a \linkS4class{SingleCellExperiment} object. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{doResampling}{Boolean. If \code{TRUE}, then each cell in the counts matrix will be resampled according to a multinomial distribution to introduce noise before calculating perplexity. Default \code{FALSE}.} \item{numResample}{Integer. The number of times to resample the counts matrix for evaluating perplexity if \code{doResampling} is set to \code{TRUE}. Default \code{5}.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of \code{12345} is used. If \code{NULL}, no calls to \link[withr]{with_seed} are made.} } \value{ A \linkS4class{SingleCellExperiment} object or \code{celdaList} object with a \code{perplexity} property, detailing the perplexity of all K/L combinations that appeared in the celdaList's models. } \description{ Calculates the perplexity of each model's cluster assignments given the provided countMatrix, as well as resamplings of that count matrix, providing a distribution of perplexities and a better sense of the quality of a given K/L choice. } \examples{ data(sceCeldaCGGridSearch) sce <- resamplePerplexity(sceCeldaCGGridSearch) plotGridSearchPerplexity(sce) data(celdaCGSim, celdaCGGridSearchRes) celdaCGGridSearchRes <- resamplePerplexity( celdaCGSim$counts, celdaCGGridSearchRes ) plotGridSearchPerplexity(celdaCGGridSearchRes) } ================================================ FILE: man/retrieveFeatureIndex.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celda_functions.R \name{retrieveFeatureIndex} \alias{retrieveFeatureIndex} \title{Retrieve row index for a set of features} \usage{ retrieveFeatureIndex( features, x, by = "rownames", exactMatch = TRUE, removeNA = FALSE ) } \arguments{ \item{features}{Character vector of feature names to find in the rows of \code{x}.} \item{x}{A data.frame, matrix, or \linkS4class{SingleCellExperiment} object to search.} \item{by}{Character. Where to search for features in \code{x}. If set to \code{"rownames"} then the features will be searched for among \code{rownames(x)}. If \code{x} inherits from class \linkS4class{SummarizedExperiment}, then \code{by} can be one of the fields in the row annotation data.frame (i.e. one of \code{colnames(rowData(x))}).} \item{exactMatch}{Boolean. Whether to only identify exact matches or to identify partial matches using \code{\link{grep}}.} \item{removeNA}{Boolean. If set to \code{FALSE}, features not found in \code{x} will be given \code{NA} and the returned vector will be the same length as \code{features}. If set to \code{TRUE}, then the \code{NA} values will be removed from the returned vector. Default \code{FALSE}.} } \value{ A vector of row indices for the matching features in \code{x}. } \description{ This will return indices of features among the rownames or rowData of a data.frame, matrix, or a \linkS4class{SummarizedExperiment} object including a \linkS4class{SingleCellExperiment}. Partial matching (i.e. grepping) can be used by setting \code{exactMatch = FALSE}. } \examples{ data(celdaCGSim) retrieveFeatureIndex(c("Gene_1", "Gene_5"), celdaCGSim$counts) retrieveFeatureIndex(c("Gene_1", "Gene_5"), celdaCGSim$counts, exactMatch = FALSE) } \seealso{ '\link[scater]{retrieveFeatureInfo}' from package \code{'scater'} and \code{link{regex}} for how to use regular expressions when \code{exactMatch = FALSE}. } \author{ Yusuke Koga, Joshua Campbell } ================================================ FILE: man/runParams.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{runParams} \alias{runParams} \alias{runParams,SingleCellExperiment-method} \alias{runParams,celdaList-method} \title{Get run parameters from a celda model \code{SingleCellExperiment} or \code{celdaList} object} \usage{ runParams(x, altExpName = "featureSubset") \S4method{runParams}{SingleCellExperiment}(x, altExpName = "featureSubset") \S4method{runParams}{celdaList}(x) } \arguments{ \item{x}{An object of class \linkS4class{SingleCellExperiment} or class \code{celdaList}.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ Data Frame. Contains details on the various K/L parameters, chain parameters, seed, and final log-likelihoods derived for each model in the provided celdaList. } \description{ Returns details on the clustering parameters and model priors from the celdaList object when it was created. } \examples{ data(sceCeldaCGGridSearch) runParams(sceCeldaCGGridSearch) data(celdaCGGridSearchRes) runParams(celdaCGGridSearchRes) } ================================================ FILE: man/sampleCells.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{sampleCells} \alias{sampleCells} \title{sampleCells} \format{ A matrix of simulated gene counts with 10 rows (genes) and 10 columns (cells). } \source{ \url{http://github.com/campbio/celda} } \usage{ sampleCells } \description{ A matrix of simulated gene counts. } \details{ A toy count matrix for use with celda. Generated by Josh Campbell. } \keyword{datasets} ================================================ FILE: man/sampleLabel.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/accessors.R \name{sampleLabel} \alias{sampleLabel} \alias{sampleLabel,SingleCellExperiment-method} \alias{sampleLabel<-} \alias{sampleLabel<-,SingleCellExperiment-method} \alias{sampleLabel,celdaModel-method} \title{Get or set sample labels from a celda \linkS4class{SingleCellExperiment} object} \usage{ sampleLabel(x, altExpName = "featureSubset") \S4method{sampleLabel}{SingleCellExperiment}(x, altExpName = "featureSubset") sampleLabel(x, altExpName = "featureSubset") <- value \S4method{sampleLabel}{SingleCellExperiment}(x, altExpName = "featureSubset") <- value \S4method{sampleLabel}{celdaModel}(x) } \arguments{ \item{x}{Can be one of \itemize{ \item A \linkS4class{SingleCellExperiment} object returned by \link{celda_C}, \link{celda_G}, or \link{celda_CG}, with the matrix located in the \code{useAssay} assay slot. Rows represent features and columns represent cells. \item A celda model object.}} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} \item{value}{Character vector of sample labels for replacements. Works only is \code{x} is a \linkS4class{SingleCellExperiment} object.} } \value{ Character vector. Contains the sample labels provided at model creation, or those automatically generated by celda. } \description{ Return or set the sample labels for the cells in \code{sce}. } \examples{ data(sceCeldaCG) sampleLabel(sceCeldaCG) data(celdaCGMod) sampleLabel(celdaCGMod) } ================================================ FILE: man/sceCeldaC.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{sceCeldaC} \alias{sceCeldaC} \title{sceCeldaC} \format{ A \linkS4class{SingleCellExperiment} object } \usage{ sceCeldaC } \description{ A \linkS4class{SingleCellExperiment} object containing the results of running \link{selectFeatures} and \link{celda_C} on \link{celdaCSim}. } \examples{ data(celdaCSim) sceCeldaC <- selectFeatures(celdaCSim$counts) sceCeldaC <- celda_C(sceCeldaC, K = celdaCSim$K, sampleLabel = celdaCSim$sampleLabel, nchains = 1) } \keyword{datasets} ================================================ FILE: man/sceCeldaCG.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{sceCeldaCG} \alias{sceCeldaCG} \title{sceCeldaCG} \format{ A \linkS4class{SingleCellExperiment} object } \usage{ sceCeldaCG } \description{ A \linkS4class{SingleCellExperiment} object containing the results of running \link{selectFeatures} and \link{celda_CG} on \link{celdaCGSim}. } \examples{ data(celdaCGSim) sceCeldaCG <- selectFeatures(celdaCGSim$counts) sceCeldaCG <- celda_CG(sceCeldaCG, K = celdaCGSim$K, L = celdaCGSim$L, sampleLabel = celdaCGSim$sampleLabel, nchains = 1) } \keyword{datasets} ================================================ FILE: man/sceCeldaCGGridSearch.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{sceCeldaCGGridSearch} \alias{sceCeldaCGGridSearch} \title{sceCeldaCGGridSearch} \format{ A \linkS4class{SingleCellExperiment} object } \usage{ sceCeldaCGGridSearch } \description{ A \linkS4class{SingleCellExperiment} object containing the results of running \link{selectFeatures} and \link{celdaGridSearch} on \link{celdaCGSim}. } \examples{ data(celdaCGSim) sce <- selectFeatures(celdaCGSim$counts) sceCeldaCGGridSearch <- celdaGridSearch(sce, model = "celda_CG", paramsTest = list(K = seq(4, 6), L = seq(9, 11)), paramsFixed = list(sampleLabel = celdaCGSim$sampleLabel), bestOnly = TRUE, nchains = 1, cores = 1, verbose = FALSE) } \keyword{datasets} ================================================ FILE: man/sceCeldaG.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{sceCeldaG} \alias{sceCeldaG} \title{sceCeldaG} \format{ A \linkS4class{SingleCellExperiment} object } \usage{ sceCeldaG } \description{ A \linkS4class{SingleCellExperiment} object containing the results of running \link{selectFeatures} and \link{celda_G} on \link{celdaGSim}. } \examples{ data(celdaGSim) sceCeldaG <- selectFeatures(celdaGSim$counts) sceCeldaG <- celda_G(sceCeldaG, L = celdaGSim$L, nchains = 1) } \keyword{datasets} ================================================ FILE: man/selectBestModel.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celdaGridSearch.R \name{selectBestModel} \alias{selectBestModel} \alias{selectBestModel,SingleCellExperiment-method} \alias{selectBestModel,celdaList-method} \title{Select best chain within each combination of parameters} \usage{ selectBestModel(x, asList = FALSE, altExpName = "featureSubset") \S4method{selectBestModel}{SingleCellExperiment}(x, asList = FALSE, altExpName = "featureSubset") \S4method{selectBestModel}{celdaList}(x, asList = FALSE) } \arguments{ \item{x}{Can be one of \itemize{ \item A \linkS4class{SingleCellExperiment} object returned from \code{celdaGridSearch}, \code{recursiveSplitModule}, or \code{recursiveSplitCell}. Must contain a list named \code{"celda_grid_search"} in \code{metadata(x)}. \item celdaList object.}} \item{asList}{\code{TRUE} or \code{FALSE}. Whether to return the best model as a \code{celdaList} object or not. If \code{FALSE}, return the best model as a corresponding celda model object.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ One of \itemize{ \item A new \linkS4class{SingleCellExperiment} object containing one model with the best log-likelihood for each set of parameters in \code{metadata(x)}. If there is only one set of parameters, a new \linkS4class{SingleCellExperiment} object with the matching model stored in the \link{metadata} \code{"celda_parameters"} slot will be returned. Otherwise, a new \linkS4class{SingleCellExperiment} object with the subset models stored in the \link{metadata} \code{"celda_grid_search"} slot will be returned. \item A new \code{celdaList} object containing one model with the best log-likelihood for each set of parameters. If only one set of parameters is in the \code{celdaList}, the best model will be returned directly instead of a \code{celdaList} object.} } \description{ Select the chain with the best log likelihood for each combination of tested parameters from a \code{SCE} object gererated by \link{celdaGridSearch} or from a \code{celdaList} object. } \examples{ data(sceCeldaCGGridSearch) ## Returns same result as running celdaGridSearch with "bestOnly = TRUE" sce <- selectBestModel(sceCeldaCGGridSearch) data(celdaCGGridSearchRes) ## Returns same result as running celdaGridSearch with "bestOnly = TRUE" cgsBest <- selectBestModel(celdaCGGridSearchRes) } \seealso{ \link{celdaGridSearch} \link{subsetCeldaList} } ================================================ FILE: man/selectFeatures.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/selectFeatures.R \name{selectFeatures} \alias{selectFeatures} \alias{selectFeatures,SingleCellExperiment-method} \alias{selectFeatures,matrix-method} \title{Simple feature selection by feature counts} \usage{ selectFeatures( x, minCount = 3, minCell = 3, useAssay = "counts", altExpName = "featureSubset" ) \S4method{selectFeatures}{SingleCellExperiment}( x, minCount = 3, minCell = 3, useAssay = "counts", altExpName = "featureSubset" ) \S4method{selectFeatures}{matrix}( x, minCount = 3, minCell = 3, useAssay = "counts", altExpName = "featureSubset" ) } \arguments{ \item{x}{A numeric \link{matrix} of counts or a \linkS4class{SingleCellExperiment} with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells.} \item{minCount}{Minimum number of counts required for feature selection.} \item{minCell}{Minimum number of cells required for feature selection.} \item{useAssay}{A string specifying the name of the \link{assay} slot to use. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ A \linkS4class{SingleCellExperiment} object with a \code{altExpName} \link{altExp} slot. Function parameter settings are stored in the \link{metadata} \code{"select_features"} slot. } \description{ A simple heuristic feature selection procedure. Select features with at least \code{minCount} counts in at least \code{minCell} cells. A \linkS4class{SingleCellExperiment} object with subset features will be stored in the \link{altExp} slot with name \code{altExpName}. The name of the \code{assay} slot in \link{altExp} will be the same as \code{useAssay}. } \examples{ data(sceCeldaCG) sce <- selectFeatures(sceCeldaCG) data(celdaCGSim) sce <- selectFeatures(celdaCGSim$counts) } ================================================ FILE: man/semiPheatmap.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/semi_pheatmap.R \name{semiPheatmap} \alias{semiPheatmap} \title{A function to draw clustered heatmaps.} \usage{ semiPheatmap( mat, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(100), kmeansK = NA, breaks = NA, borderColor = "grey60", cellWidth = NA, cellHeight = NA, scale = "none", clusterRows = TRUE, clusterCols = TRUE, clusteringDistanceRows = "euclidean", clusteringDistanceCols = "euclidean", clusteringMethod = "complete", clusteringCallback = .identity2, cutreeRows = NA, cutreeCols = NA, treeHeightRow = ifelse(clusterRows, 50, 0), treeHeightCol = ifelse(clusterCols, 50, 0), legend = TRUE, legendBreaks = NA, legendLabels = NA, annotationRow = NA, annotationCol = NA, annotation = NA, annotationColors = NA, annotationLegend = TRUE, annotationNamesRow = TRUE, annotationNamesCol = TRUE, dropLevels = TRUE, showRownames = TRUE, showColnames = TRUE, main = NA, fontSize = 10, fontSizeRow = fontSize, fontSizeCol = fontSize, displayNumbers = FALSE, numberFormat = "\%.2f", numberColor = "grey30", fontSizeNumber = 0.8 * fontSize, gapsRow = NULL, gapsCol = NULL, labelsRow = NULL, labelsCol = NULL, fileName = NA, width = NA, height = NA, silent = FALSE, rowLabel, colLabel, rowGroupOrder = NULL, colGroupOrder = NULL, ... ) } \arguments{ \item{mat}{numeric matrix of the values to be plotted.} \item{color}{vector of colors used in heatmap.} \item{kmeansK}{the number of kmeans clusters to make, if we want to agggregate the rows before drawing heatmap. If NA then the rows are not aggregated.} \item{breaks}{Numeric vector. A sequence of numbers that covers the range of values in the normalized `counts`. Values in the normalized `matrix` are assigned to each bin in `breaks`. Each break is assigned to a unique color from `col`. If NULL, then breaks are calculated automatically. Default NULL.} \item{borderColor}{color of cell borders on heatmap, use NA if no border should be drawn.} \item{cellWidth}{individual cell width in points. If left as NA, then the values depend on the size of plotting window.} \item{cellHeight}{individual cell height in points. If left as NA, then the values depend on the size of plotting window.} \item{scale}{character indicating if the values should be centered and scaled in either the row direction or the column direction, or none. Corresponding values are \code{"row"}, \code{"column"} and \code{"none"}.} \item{clusterRows}{boolean values determining if rows should be clustered or \code{hclust} object,} \item{clusterCols}{boolean values determining if columns should be clustered or \code{hclust} object.} \item{clusteringDistanceRows}{distance measure used in clustering rows. Possible values are \code{"correlation"} for Pearson correlation and all the distances supported by \code{\link{dist}}, such as \code{"euclidean"}, etc. If the value is none of the above it is assumed that a distance matrix is provided.} \item{clusteringDistanceCols}{distance measure used in clustering columns. Possible values the same as for clusteringDistanceRows.} \item{clusteringMethod}{clustering method used. Accepts the same values as \code{\link{hclust}}.} \item{clusteringCallback}{callback function to modify the clustering. Is called with two parameters: original \code{hclust} object and the matrix used for clustering. Must return a \code{hclust} object.} \item{cutreeRows}{number of clusters the rows are divided into, based on the hierarchical clustering (using cutree), if rows are not clustered, the argument is ignored} \item{cutreeCols}{similar to \code{cutreeRows}, but for columns} \item{treeHeightRow}{the height of a tree for rows, if these are clustered. Default value 50 points.} \item{treeHeightCol}{the height of a tree for columns, if these are clustered. Default value 50 points.} \item{legend}{logical to determine if legend should be drawn or not.} \item{legendBreaks}{vector of breakpoints for the legend.} \item{legendLabels}{vector of labels for the \code{legendBreaks}.} \item{annotationRow}{data frame that specifies the annotations shown on left side of the heatmap. Each row defines the features for a specific row. The rows in the data and in the annotation are matched using corresponding row names. Note that color schemes takes into account if variable is continuous or discrete.} \item{annotationCol}{similar to annotationRow, but for columns.} \item{annotation}{deprecated parameter that currently sets the annotationCol if it is missing.} \item{annotationColors}{list for specifying annotationRow and annotationCol track colors manually. It is possible to define the colors for only some of the features. Check examples for details.} \item{annotationLegend}{boolean value showing if the legend for annotation tracks should be drawn.} \item{annotationNamesRow}{boolean value showing if the names for row annotation tracks should be drawn.} \item{annotationNamesCol}{boolean value showing if the names for column annotation tracks should be drawn.} \item{dropLevels}{logical to determine if unused levels are also shown in the legend.} \item{showRownames}{boolean specifying if column names are be shown.} \item{showColnames}{boolean specifying if column names are be shown.} \item{main}{the title of the plot} \item{fontSize}{base fontsize for the plot} \item{fontSizeRow}{fontsize for rownames (Default: fontsize)} \item{fontSizeCol}{fontsize for colnames (Default: fontsize)} \item{displayNumbers}{logical determining if the numeric values are also printed to the cells. If this is a matrix (with same dimensions as original matrix), the contents of the matrix are shown instead of original values.} \item{numberFormat}{format strings (C printf style) of the numbers shown in cells. For example "\code{\%.2f}" shows 2 decimal places and "\code{\%.1e}" shows exponential notation (see more in \code{\link{sprintf}}).} \item{numberColor}{color of the text} \item{fontSizeNumber}{fontsize of the numbers displayed in cells} \item{gapsRow}{vector of row indices that show shere to put gaps into heatmap. Used only if the rows are not clustered. See \code{cutreeRow} to see how to introduce gaps to clustered rows.} \item{gapsCol}{similar to gapsRow, but for columns.} \item{labelsRow}{custom labels for rows that are used instead of rownames.} \item{labelsCol}{similar to labelsRow, but for columns.} \item{fileName}{file path where to save the picture. Filetype is decided by the extension in the path. Currently following formats are supported: png, pdf, tiff, bmp, jpeg. Even if the plot does not fit into the plotting window, the file size is calculated so that the plot would fit there, unless specified otherwise.} \item{width}{manual option for determining the output file width in inches.} \item{height}{manual option for determining the output file height in inches.} \item{silent}{do not draw the plot (useful when using the gtable output)} \item{rowLabel}{row cluster labels for semi-clustering} \item{colLabel}{column cluster labels for semi-clustering} \item{rowGroupOrder}{Vector. Specifies the order of feature clusters when semisupervised clustering is performed on the \code{y} labels.} \item{colGroupOrder}{Vector. Specifies the order of cell clusters when semisupervised clustering is performed on the \code{z} labels.} \item{\dots}{graphical parameters for the text used in plot. Parameters passed to \code{\link{grid.text}}, see \code{\link{gpar}}.} } \value{ Invisibly a list of components \itemize{ \item \code{treeRow} the clustering of rows as \code{\link{hclust}} object \item \code{treeCol} the clustering of columns as \code{\link{hclust}} object \item \code{kmeans} the kmeans clustering of rows if parameter \code{kmeansK} was specified } } \description{ A function to draw clustered heatmaps where one has better control over some graphical parameters such as cell size, etc. The function also allows to aggregate the rows using kmeans clustering. This is advisable if number of rows is so big that R cannot handle their hierarchical clustering anymore, roughly more than 1000. Instead of showing all the rows separately one can cluster the rows in advance and show only the cluster centers. The number of clusters can be tuned with parameter kmeansK. } \author{ Raivo Kolde #@examples # Create test matrix test = matrix(rnorm(200), 20, 10) test[seq(10), seq(1, 10, 2)] = test[seq(10), seq(1, 10, 2)] + 3 test[seq(11, 20), seq(2, 10, 2)] = test[seq(11, 20), seq(2, 10, 2)] + 2 test[seq(15, 20), seq(2, 10, 2)] = test[seq(15, 20), seq(2, 10, 2)] + 4 colnames(test) = paste("Test", seq(10), sep = "") rownames(test) = paste("Gene", seq(20), sep = "") # Draw heatmaps pheatmap(test) pheatmap(test, kmeansK = 2) pheatmap(test, scale = "row", clusteringDistanceRows = "correlation") pheatmap(test, color = colorRampPalette(c("navy", "white", "firebrick3"))(50)) pheatmap(test, cluster_row = FALSE) pheatmap(test, legend = FALSE) # Show text within cells pheatmap(test, displayNumbers = TRUE) pheatmap(test, displayNumbers = TRUE, numberFormat = "\%.1e") pheatmap(test, displayNumbers = matrix(ifelse(test > 5, "*", ""), nrow(test))) pheatmap(test, cluster_row = FALSE, legendBreaks = seq(-1, 4), legendLabels = c("0", "1e-4", "1e-3", "1e-2", "1e-1", "1")) # Fix cell sizes and save to file with correct size pheatmap(test, cellWidth = 15, cellHeight = 12, main = "Example heatmap") pheatmap(test, cellWidth = 15, cellHeight = 12, fontSize = 8, fileName = "test.pdf") # Generate annotations for rows and columns annotationCol = data.frame(CellType = factor(rep(c("CT1", "CT2"), 5)), Time = seq(5)) rownames(annotationCol) = paste("Test", seq(10), sep = "") annotationRow = data.frame(GeneClass = factor(rep(c("Path1", "Path2", "Path3"), c(10, 4, 6)))) rownames(annotationRow) = paste("Gene", seq(20), sep = "") # Display row and color annotations pheatmap(test, annotationCol = annotationCol) pheatmap(test, annotationCol = annotationCol, annotationLegend = FALSE) pheatmap(test, annotationCol = annotationCol, annotationRow = annotationRow) # Specify colors ann_colors = list(Time = c("white", "firebrick"), CellType = c(CT1 = "#1B9E77", CT2 = "#D95F02"), GeneClass = c(Path1 = "#7570B3", Path2 = "#E7298A", Path3 = "#66A61E")) pheatmap(test, annotationCol = annotationCol, annotationColors = ann_colors, main = "Title") pheatmap(test, annotationCol = annotationCol, annotationRow = annotationRow, annotationColors = ann_colors) pheatmap(test, annotationCol = annotationCol, annotationColors = ann_colors[2]) # Gaps in heatmaps pheatmap(test, annotationCol = annotationCol, clusterRows = FALSE, gapsRow = c(10, 14)) pheatmap(test, annotationCol = annotationCol, clusterRows = FALSE, gapsRow = c(10, 14), cutreeCol = 2) # Show custom strings as row/col names labelsRow = c("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "Il10", "Il15", "Il1b") pheatmap(test, annotationCol = annotationCol, labelsRow = labelsRow) # Specifying clustering from distance matrix drows = stats::dist(test, method = "minkowski") dcols = stats::dist(t(test), method = "minkowski") pheatmap(test, clusteringDistanceRows = drows, clusteringDistanceCols = dcols) # Modify ordering of the clusters using clustering callback option callback = function(hc, mat) { sv = svd(t(mat))$v[, 1] dend = reorder(as.dendrogram(hc), wts = sv) as.hclust(dend) } pheatmap(test, clusteringCallback = callback) } ================================================ FILE: man/simulateCells.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/simulateCells.R \name{simulateCells} \alias{simulateCells} \title{Simulate count data from the celda generative models.} \usage{ simulateCells( model = c("celda_CG", "celda_C", "celda_G"), S = 5, CRange = c(50, 100), NRange = c(500, 1000), C = 100, G = 100, K = 5, L = 10, alpha = 1, beta = 1, gamma = 5, delta = 1, seed = 12345 ) } \arguments{ \item{model}{Character. Options available in \code{celda::availableModels}. Can be one of \code{"celda_CG"}, \code{"celda_C"}, or \code{"celda_G"}. Default \code{"celda_CG"}.} \item{S}{Integer. Number of samples to simulate. Default 5. Only used if \code{model} is one of \code{"celda_CG"} or \code{"celda_C"}.} \item{CRange}{Integer vector. A vector of length 2 that specifies the lower and upper bounds of the number of cells to be generated in each sample. Default c(50, 100). Only used if \code{model} is one of \code{"celda_CG"} or \code{"celda_C"}.} \item{NRange}{Integer vector. A vector of length 2 that specifies the lower and upper bounds of the number of counts generated for each cell. Default c(500, 1000).} \item{C}{Integer. Number of cells to simulate. Default 100. Only used if \code{model} is \code{"celda_G"}.} \item{G}{Integer. The total number of features to be simulated. Default 100.} \item{K}{Integer. Number of cell populations. Default 5. Only used if \code{model} is one of \code{"celda_CG"} or \code{"celda_C"}.} \item{L}{Integer. Number of feature modules. Default 10. Only used if \code{model} is one of \code{"celda_CG"} or \code{"celda_G"}.} \item{alpha}{Numeric. Concentration parameter for Theta. Adds a pseudocount to each cell population in each sample. Default 1. Only used if \code{model} is one of \code{"celda_CG"} or \code{"celda_C"}.} \item{beta}{Numeric. Concentration parameter for Phi. Adds a pseudocount to each feature module in each cell population. Default 1.} \item{gamma}{Numeric. Concentration parameter for Eta. Adds a pseudocount to the number of features in each module. Default 5. Only used if \code{model} is one of \code{"celda_CG"} or \code{"celda_G"}.} \item{delta}{Numeric. Concentration parameter for Psi. Adds a pseudocount to each feature in each module. Default 1. Only used if \code{model} is one of \code{"celda_CG"} or \code{"celda_G"}.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \link[withr]{with_seed} are made.} } \value{ A \link[SingleCellExperiment]{SingleCellExperiment} object with simulated count matrix stored in the "counts" assay slot. Function parameter settings are stored in the \link{metadata} slot. For \code{"celda_CG"} and \code{"celda_C"} models, columns \code{celda_sample_label} and \code{celda_cell_cluster} in \link{colData} contain simulated sample labels and cell population clusters. For \code{"celda_CG"} and \code{"celda_G"} models, column \code{celda_feature_module} in \link{rowData} contains simulated gene modules. } \description{ This function generates a \linkS4class{SingleCellExperiment} containing a simulated counts matrix in the \code{"counts"} assay slot, as well as various parameters used in the simulation which can be useful for running celda and are stored in \code{metadata} slot. The user must provide the desired model (one of celda_C, celda_G, celda_CG) as well as any desired tuning parameters for those model's simulation functions as detailed below. } \examples{ sce <- simulateCells() } ================================================ FILE: man/simulateContamination.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/decon.R \name{simulateContamination} \alias{simulateContamination} \title{Simulate contaminated count matrix} \usage{ simulateContamination( C = 300, G = 100, K = 3, NRange = c(500, 1000), beta = 0.1, delta = c(1, 10), numMarkers = 3, seed = 12345 ) } \arguments{ \item{C}{Integer. Number of cells to be simulated. Default \code{300}.} \item{G}{Integer. Number of genes to be simulated. Default \code{100}.} \item{K}{Integer. Number of cell populations to be simulated. Default \code{3}.} \item{NRange}{Integer vector. A vector of length 2 that specifies the lower and upper bounds of the number of counts generated for each cell. Default \code{c(500, 1000)}.} \item{beta}{Numeric. Concentration parameter for Phi. Default \code{0.1}.} \item{delta}{Numeric or Numeric vector. Concentration parameter for Theta. If input as a single numeric value, symmetric values for beta distribution are specified; if input as a vector of lenght 2, the two values will be the shape1 and shape2 paramters of the beta distribution respectively. Default \code{c(1, 5)}.} \item{numMarkers}{Integer. Number of markers for each cell population. Default \code{3}.} \item{seed}{Integer. Passed to \code{\link[withr]{with_seed}}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \code{\link[withr]{with_seed}} are made.} } \value{ A list containing the \code{nativeMatirx} (real expression), \code{observedMatrix} (real expression + contamination), as well as other parameters used in the simulation. } \description{ This function generates a list containing two count matrices -- one for real expression, the other one for contamination, as well as other parameters used in the simulation which can be useful for running decontamination. } \examples{ contaminationSim <- simulateContamination(K = 3, delta = c(1, 10)) } \author{ Shiyi Yang, Yuan Yin, Joshua Campbell } ================================================ FILE: man/splitModule.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/splitModule.R \name{splitModule} \alias{splitModule} \alias{splitModule,SingleCellExperiment-method} \title{Split celda feature module} \usage{ splitModule( x, module, useAssay = "counts", altExpName = "featureSubset", n = 2, seed = 12345 ) \S4method{splitModule}{SingleCellExperiment}( x, module, useAssay = "counts", altExpName = "featureSubset", n = 2, seed = 12345 ) } \arguments{ \item{x}{A \linkS4class{SingleCellExperiment} object with the matrix located in the assay slot under \code{useAssay}. Rows represent features and columns represent cells.} \item{module}{Integer. The module to be split.} \item{useAssay}{A string specifying which \link{assay} slot to use for \code{x}. Default "counts".} \item{altExpName}{The name for the \link{altExp} slot to use. Default \code{"featureSubset"}.} \item{n}{Integer. How many modules should \code{module} be split into. Default \code{2}.} \item{seed}{Integer. Passed to \link[withr]{with_seed}. For reproducibility, a default value of 12345 is used. If NULL, no calls to \link[withr]{with_seed} are made.} } \value{ A updated \linkS4class{SingleCellExperiment} object with new feature modules stored in column \code{celda_feature_module} in \code{\link{rowData}(x)}. } \description{ Manually select a celda feature module to split into 2 or more modules. Useful for splitting up modules that show divergent expression of features in multiple cell clusters. } \examples{ data(sceCeldaCG) # Split module 5 into 2 new modules. sce <- splitModule(sceCeldaCG, module = 5) } ================================================ FILE: man/subsetCeldaList.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/celdaGridSearch.R \name{subsetCeldaList} \alias{subsetCeldaList} \alias{subsetCeldaList,SingleCellExperiment-method} \alias{subsetCeldaList,celdaList-method} \title{Subset celda model from SCE object returned from \code{celdaGridSearch}} \usage{ subsetCeldaList(x, params, altExpName = "featureSubset") \S4method{subsetCeldaList}{SingleCellExperiment}(x, params, altExpName = "featureSubset") \S4method{subsetCeldaList}{celdaList}(x, params) } \arguments{ \item{x}{Can be one of \itemize{ \item A \linkS4class{SingleCellExperiment} object returned from \code{celdaGridSearch}, \code{recursiveSplitModule}, or \code{recursiveSplitCell}. Must contain a list named \code{"celda_grid_search"} in \code{metadata(x)}. \item celdaList object.}} \item{params}{List. List of parameters used to subset the matching celda models in list \code{"celda_grid_search"} in \code{metadata(x)}.} \item{altExpName}{The name for the \link{altExp} slot to use. Default "featureSubset".} } \value{ One of \itemize{ \item A new \linkS4class{SingleCellExperiment} object containing all models matching the provided criteria in \code{params}. If only one celda model result in the \code{"celda_grid_search"} slot in \code{metadata(x)} matches the given criteria, a new \linkS4class{SingleCellExperiment} object with the matching model stored in the \link{metadata} \code{"celda_parameters"} slot will be returned. Otherwise, a new \linkS4class{SingleCellExperiment} object with the subset models stored in the \link{metadata} \code{"celda_grid_search"} slot will be returned. \item A new \code{celdaList} object containing all models matching the provided criteria in \code{params}. If only one item in the \code{celdaList} matches the given criteria, the matching model will be returned directly instead of a \code{celdaList} object.} } \description{ Select a subset of models from a \linkS4class{SingleCellExperiment} object generated by \link{celdaGridSearch} that match the criteria in the argument \code{params}. } \examples{ data(sceCeldaCGGridSearch) sceK5L10 <- subsetCeldaList(sceCeldaCGGridSearch, params = list(K = 5, L = 10)) data(celdaCGGridSearchRes) resK5L10 <- subsetCeldaList(celdaCGGridSearchRes, params = list(K = 5, L = 10)) } \seealso{ \link{celdaGridSearch} can run Celda with multiple parameters and chains in parallel. \link{selectBestModel} can get the best model for each combination of parameters. } ================================================ FILE: man/topRank.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/topRank.R \name{topRank} \alias{topRank} \title{Identify features with the highest influence on clustering.} \usage{ topRank(matrix, n = 25, margin = 2, threshold = 0, decreasing = TRUE) } \arguments{ \item{matrix}{Numeric matrix.} \item{n}{Integer. Maximum number of items above `threshold` returned for each ranked row or column.} \item{margin}{Integer. Dimension of `matrix` to rank, with 1 for rows, 2 for columns. Default 2.} \item{threshold}{Numeric. Only return ranked rows or columns in the matrix that are above this threshold. If NULL, then no threshold will be applied. Default 0.} \item{decreasing}{Logical. Specifies if the rank should be decreasing. Default TRUE.} } \value{ List. The `index` variable provides the top `n` row (feature) indices contributing the most to each column (cell). The `names` variable provides the rownames corresponding to these indexes. } \description{ topRank() can quickly identify the top `n` rows for each column of a matrix. For example, this can be useful for identifying the top `n` features per cell. } \examples{ data(sampleCells) topRanksPerCell <- topRank(sampleCells, n = 5) topFeatureNamesForCell <- topRanksPerCell$names[1] } ================================================ FILE: src/DecontX.cpp ================================================ #include #include // [[Rcpp::depends(RcppEigen)]] using namespace Rcpp; // [[Rcpp::export]] Rcpp::List decontXEM(const Eigen::MappedSparseMatrix &counts, const NumericVector &counts_colsums, const NumericVector &theta, const bool &estimate_eta, const NumericMatrix &eta, const NumericMatrix &phi, const IntegerVector &z, const bool &estimate_delta, const NumericVector &delta, const double &pseudocount) { // Perform error checking if (counts.cols() != theta.size()) { stop("Length of 'theta' must be equal to the number of columns in 'counts'."); } if (counts.cols() != z.size()) { stop("Length of 'z' must be equal to the number of columns in 'counts'."); } if (counts.cols() != counts_colsums.size()) { stop("Length of 'counts_colsums' must be equal to the number of columns in 'counts'."); } if (counts.rows() != phi.nrow()) { stop("The number of rows in 'phi' must be equal to the number of rows in 'counts'."); } if (counts.rows() != eta.nrow()) { stop("The number of rows in 'eta' must be equal to the number of rows in 'counts'."); } if (phi.ncol() != eta.ncol()) { stop("The number of columns in 'eta' must be equal to the number of columns in 'phi'."); } if (min(z) < 1 || max(z) > eta.ncol()) { stop("The entries in 'z' need to be between 1 and the number of columns in eta and phi."); } if (delta.size() != 2 || sum(delta < 0) > 0) { stop("'delta' must be a numeric vector of length 2 with positive integers."); } // Declare variables and functions NumericVector new_theta(theta.size()); NumericVector native_total(theta.size()); NumericMatrix new_phi(phi.nrow(), phi.ncol()); // Instantiate new_eta but only update it when estimate_eta is TRUE NumericMatrix new_eta(eta.nrow(), eta.ncol()); // Obtaining 'fit_dirichlet' function from MCMCprecision package Environment pkg = Environment::namespace_env("MCMCprecision"); Function f = pkg["fit_dirichlet"]; int i; int j; int k; int nr = phi.nrow(); double x; double pcontamin; double pnative; double normp; double px; for (int j = 0; j < counts.cols(); ++j) { for (Eigen::MappedSparseMatrix::InnerIterator i_(counts, j); i_; ++i_) { i = i_.index(); x = i_.value(); k = z[j] - 1; // Calculate variational probabilities // Removing the log/exp speeds it up and produces the same result since // there are only 2 probabilities being multiplied //pnative = log(phi(i,k) + pseudocount) + log(theta(j) + pseudocount); //pcontamin = log(eta(i,k) + pseudocount) + log(1 - theta(j) + pseudocount); pnative = (phi[nr * k + i] + pseudocount) * (theta[j] + pseudocount); pcontamin = (eta[nr * k + i] + pseudocount) * (1 - theta[j] + pseudocount); // Normalize probabilities and add to proper components //normp = exp(pnative) / (exp(pcontamin) + exp(pnative)); normp = pnative / (pcontamin + pnative); px = normp * x; new_phi(i, k) += px; native_total(j) += px; } } // Calculate Eta using Weights from Phi if (estimate_eta == TRUE) { NumericVector phi_rowsum = rowSums(new_phi); for (i = 0; i < new_eta.ncol(); i++) { for (j = 0; j < new_eta.nrow(); j++) { new_eta(j, i) = phi_rowsum[j] - new_phi(j, i); } } } // Normalize Phi NumericVector phi_colsum = colSums(new_phi); for (i = 0; i < new_phi.ncol(); i++) { new_phi(_, i) = new_phi(_, i) / phi_colsum[i]; } // Normalize Eta if (estimate_eta == TRUE) { NumericVector eta_colsum = colSums(new_eta); for (i = 0; i < new_phi.ncol(); i++) { new_eta(_, i) = new_eta(_, i) / eta_colsum[i]; } } // Update Theta NumericVector contamination_prop = (counts_colsums - native_total) / counts_colsums; NumericVector native_prop = 1 - contamination_prop; NumericMatrix theta_raw = cbind(native_prop, contamination_prop); NumericVector new_delta = delta; if (estimate_delta == TRUE) { Rcpp::List result = f(Named("x", theta_raw)); new_delta = result["alpha"]; } // Estimate new theta new_theta = (native_total + new_delta[0]) / (counts_colsums + sum(new_delta)); // If not to estimate new_eta, just return input eta if (estimate_eta == FALSE) { new_eta = eta; } return Rcpp::List::create(Rcpp::Named("phi") = new_phi, Rcpp::Named("eta") = new_eta, Rcpp::Named("theta") = new_theta, Rcpp::Named("delta") = new_delta, Rcpp::Named("contamination") = contamination_prop); } // [[Rcpp::export]] double decontXLogLik(const Eigen::MappedSparseMatrix &counts, const NumericVector &theta, const NumericMatrix &eta, const NumericMatrix &phi, const IntegerVector &z, const double &pseudocount) { // Perform error checking if (counts.cols() != theta.size()) { stop("Length of 'theta' must be equal to the number of columns in 'counts'."); } if (counts.cols() != z.size()) { stop("Length of 'z' must be equal to the number of columns in 'counts'."); } if (counts.rows() != phi.nrow()) { stop("The number of rows in 'phi' must be equal to the number of rows in 'counts'."); } if (counts.rows() != eta.nrow()) { stop("The number of rows in 'eta' must be equal to the number of rows in 'counts'."); } if (phi.ncol() != eta.ncol()) { stop("The number of columns in 'eta' must be equal to the number of columns in 'phi'."); } if (min(z) < 1 || max(z) > eta.ncol()) { stop("The entries in 'z' need to be between 1 and the number of columns in eta and phi."); } // Declare variables and functions double loglik = 0; int i; int k; double x; int nr = phi.nrow(); // Original R code: // ll <- sum(Matrix::t(counts) * log(theta * t(phi)[z, ] + // (1 - theta) * t(eta)[z, ] + 1e-20)) for (int j = 0; j < counts.cols(); ++j) { for (Eigen::MappedSparseMatrix::InnerIterator i_(counts, j); i_; ++i_) { i = i_.index(); x = i_.value(); k = z[j] - 1; loglik += x * log((phi[nr * k + i] * theta[j]) + (eta[nr * k + i] * (1 - theta[j])) + pseudocount); } } return loglik; } // [[Rcpp::export]] Rcpp::List decontXInitialize(const Eigen::MappedSparseMatrix &counts, const NumericVector &theta, const IntegerVector &z, const double &pseudocount) { // Perform error checking if (counts.cols() != theta.size()) { stop("Length of 'theta' must be equal to the number of columns in 'counts'."); } if (counts.cols() != z.size()) { stop("Length of 'z' must be equal to the number of columns in 'counts'."); } // Declare variables and functions NumericMatrix new_phi(counts.rows(), max(z)); NumericMatrix new_eta(counts.rows(), max(z)); std::fill(new_phi.begin(), new_phi.end(), pseudocount); std::fill(new_eta.begin(), new_eta.end(), pseudocount); int k; int i; double x; for (int j = 0; j < counts.cols(); ++j) { for (Eigen::MappedSparseMatrix::InnerIterator i_(counts, j); i_; ++i_) { i = i_.index(); x = i_.value(); k = z[j] - 1; new_phi(i, k) += x * theta(j); } } // Calculate Eta using Weights from Phi NumericVector phi_rowsum = rowSums(new_phi); int j; for (i = 0; i < new_eta.ncol(); i++) { for (j = 0; j < new_eta.nrow(); j++) { new_eta(j, i) = phi_rowsum[j] - new_phi(j, i); } } // Normalize Phi and Eta NumericVector phi_colsum = colSums(new_phi); NumericVector eta_colsum = colSums(new_eta); for (i = 0; i < new_phi.ncol(); i++) { new_phi(_, i) = new_phi(_, i) / phi_colsum[i]; new_eta(_, i) = new_eta(_, i) / eta_colsum[i]; } return Rcpp::List::create(Rcpp::Named("phi") = new_phi, Rcpp::Named("eta") = new_eta); } // [[Rcpp::export]] Eigen::SparseMatrix calculateNativeMatrix(const Eigen::MappedSparseMatrix &counts, const NumericVector &theta, const NumericMatrix &eta, const NumericMatrix &phi, const IntegerVector &z, const double &pseudocount) { // Perform error checking if (counts.cols() != theta.size()) { stop("Length of 'theta' must be equal to the number of columns in 'counts'."); } if (counts.cols() != z.size()) { stop("Length of 'z' must be equal to the number of columns in 'counts'."); } if (counts.rows() != phi.nrow()) { stop("The number of rows in 'phi' must be equal to the number of rows in 'counts'."); } if (counts.rows() != eta.nrow()) { stop("The number of rows in 'eta' must be equal to the number of rows in 'counts'."); } if (phi.ncol() != eta.ncol()) { stop("The number of columns in 'eta' must be equal to the number of columns in 'phi'."); } if (min(z) < 1 || max(z) > eta.ncol()) { stop("The entries in 'z' need to be between 1 and the number of columns in eta and phi."); } Eigen::SparseMatrix native_matrix = counts; int i; int k; double x; double pcontamin; double pnative; double normp; for (int j = 0; j < counts.cols(); ++j) { for (Eigen::MappedSparseMatrix::InnerIterator i_(counts, j); i_; ++i_) { i = i_.index(); x = i_.value(); k = z[j] - 1; // Calculate variational probabilities pnative = log(phi(i, k) + pseudocount) + log(theta(j) + pseudocount); pcontamin = log(eta(i, k) + pseudocount) + log(1 - theta(j) + pseudocount); // Normalize probabilities and add to proper components normp = exp(pnative) / (exp(pcontamin) + exp(pnative)); native_matrix.coeffRef(i, j) *= normp; } } return native_matrix; } ================================================ FILE: src/Makevars ================================================ ## With R 3.1.0 or later, you can uncomment the following line to tell R to ## enable compilation with C++11 (where available) ## ## Also, OpenMP support in Armadillo prefers C++11 support. However, for wider ## availability of the package we do not yet enforce this here. It is however ## recommended for client packages to set it. ## ## And with R 3.4.0, and RcppArmadillo 0.7.960.*, we turn C++11 on as OpenMP ## support within Armadillo prefers / requires it PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) ================================================ FILE: src/Makevars.win ================================================ ## With R 3.1.0 or later, you can uncomment the following line to tell R to ## enable compilation with C++11 (where available) ## ## Also, OpenMP support in Armadillo prefers C++11 support. However, for wider ## availability of the package we do not yet enforce this here. It is however ## recommended for client packages to set it. ## ## And with R 3.4.0, and RcppArmadillo 0.7.960.*, we turn C++11 on as OpenMP ## support within Armadillo prefers / requires it PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) ================================================ FILE: src/RcppExports.cpp ================================================ // Generated by using Rcpp::compileAttributes() -> do not edit by hand // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #include #include using namespace Rcpp; #ifdef RCPP_USE_GLOBAL_ROSTREAM Rcpp::Rostream& Rcpp::Rcout = Rcpp::Rcpp_cout_get(); Rcpp::Rostream& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get(); #endif // decontXEM Rcpp::List decontXEM(const Eigen::MappedSparseMatrix& counts, const NumericVector& counts_colsums, const NumericVector& theta, const bool& estimate_eta, const NumericMatrix& eta, const NumericMatrix& phi, const IntegerVector& z, const bool& estimate_delta, const NumericVector& delta, const double& pseudocount); RcppExport SEXP _celda_decontXEM(SEXP countsSEXP, SEXP counts_colsumsSEXP, SEXP thetaSEXP, SEXP estimate_etaSEXP, SEXP etaSEXP, SEXP phiSEXP, SEXP zSEXP, SEXP estimate_deltaSEXP, SEXP deltaSEXP, SEXP pseudocountSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Eigen::MappedSparseMatrix& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type counts_colsums(counts_colsumsSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type theta(thetaSEXP); Rcpp::traits::input_parameter< const bool& >::type estimate_eta(estimate_etaSEXP); Rcpp::traits::input_parameter< const NumericMatrix& >::type eta(etaSEXP); Rcpp::traits::input_parameter< const NumericMatrix& >::type phi(phiSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type z(zSEXP); Rcpp::traits::input_parameter< const bool& >::type estimate_delta(estimate_deltaSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type delta(deltaSEXP); Rcpp::traits::input_parameter< const double& >::type pseudocount(pseudocountSEXP); rcpp_result_gen = Rcpp::wrap(decontXEM(counts, counts_colsums, theta, estimate_eta, eta, phi, z, estimate_delta, delta, pseudocount)); return rcpp_result_gen; END_RCPP } // decontXLogLik double decontXLogLik(const Eigen::MappedSparseMatrix& counts, const NumericVector& theta, const NumericMatrix& eta, const NumericMatrix& phi, const IntegerVector& z, const double& pseudocount); RcppExport SEXP _celda_decontXLogLik(SEXP countsSEXP, SEXP thetaSEXP, SEXP etaSEXP, SEXP phiSEXP, SEXP zSEXP, SEXP pseudocountSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Eigen::MappedSparseMatrix& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type theta(thetaSEXP); Rcpp::traits::input_parameter< const NumericMatrix& >::type eta(etaSEXP); Rcpp::traits::input_parameter< const NumericMatrix& >::type phi(phiSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type z(zSEXP); Rcpp::traits::input_parameter< const double& >::type pseudocount(pseudocountSEXP); rcpp_result_gen = Rcpp::wrap(decontXLogLik(counts, theta, eta, phi, z, pseudocount)); return rcpp_result_gen; END_RCPP } // decontXInitialize Rcpp::List decontXInitialize(const Eigen::MappedSparseMatrix& counts, const NumericVector& theta, const IntegerVector& z, const double& pseudocount); RcppExport SEXP _celda_decontXInitialize(SEXP countsSEXP, SEXP thetaSEXP, SEXP zSEXP, SEXP pseudocountSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Eigen::MappedSparseMatrix& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type theta(thetaSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type z(zSEXP); Rcpp::traits::input_parameter< const double& >::type pseudocount(pseudocountSEXP); rcpp_result_gen = Rcpp::wrap(decontXInitialize(counts, theta, z, pseudocount)); return rcpp_result_gen; END_RCPP } // calculateNativeMatrix Eigen::SparseMatrix calculateNativeMatrix(const Eigen::MappedSparseMatrix& counts, const NumericVector& theta, const NumericMatrix& eta, const NumericMatrix& phi, const IntegerVector& z, const double& pseudocount); RcppExport SEXP _celda_calculateNativeMatrix(SEXP countsSEXP, SEXP thetaSEXP, SEXP etaSEXP, SEXP phiSEXP, SEXP zSEXP, SEXP pseudocountSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Eigen::MappedSparseMatrix& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type theta(thetaSEXP); Rcpp::traits::input_parameter< const NumericMatrix& >::type eta(etaSEXP); Rcpp::traits::input_parameter< const NumericMatrix& >::type phi(phiSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type z(zSEXP); Rcpp::traits::input_parameter< const double& >::type pseudocount(pseudocountSEXP); rcpp_result_gen = Rcpp::wrap(calculateNativeMatrix(counts, theta, eta, phi, z, pseudocount)); return rcpp_result_gen; END_RCPP } // cG_calcGibbsProbY_Simple NumericVector cG_calcGibbsProbY_Simple(const IntegerMatrix counts, IntegerVector nGbyTS, IntegerMatrix nTSbyC, IntegerVector nbyTS, IntegerVector nbyG, const IntegerVector y, const int L, const int index, const double gamma, const double beta, const double delta); RcppExport SEXP _celda_cG_calcGibbsProbY_Simple(SEXP countsSEXP, SEXP nGbyTSSEXP, SEXP nTSbyCSEXP, SEXP nbyTSSEXP, SEXP nbyGSEXP, SEXP ySEXP, SEXP LSEXP, SEXP indexSEXP, SEXP gammaSEXP, SEXP betaSEXP, SEXP deltaSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const IntegerMatrix >::type counts(countsSEXP); Rcpp::traits::input_parameter< IntegerVector >::type nGbyTS(nGbyTSSEXP); Rcpp::traits::input_parameter< IntegerMatrix >::type nTSbyC(nTSbyCSEXP); Rcpp::traits::input_parameter< IntegerVector >::type nbyTS(nbyTSSEXP); Rcpp::traits::input_parameter< IntegerVector >::type nbyG(nbyGSEXP); Rcpp::traits::input_parameter< const IntegerVector >::type y(ySEXP); Rcpp::traits::input_parameter< const int >::type L(LSEXP); Rcpp::traits::input_parameter< const int >::type index(indexSEXP); Rcpp::traits::input_parameter< const double >::type gamma(gammaSEXP); Rcpp::traits::input_parameter< const double >::type beta(betaSEXP); Rcpp::traits::input_parameter< const double >::type delta(deltaSEXP); rcpp_result_gen = Rcpp::wrap(cG_calcGibbsProbY_Simple(counts, nGbyTS, nTSbyC, nbyTS, nbyG, y, L, index, gamma, beta, delta)); return rcpp_result_gen; END_RCPP } // cG_CalcGibbsProbY_ori NumericVector cG_CalcGibbsProbY_ori(const int index, const IntegerMatrix& counts, const IntegerMatrix& nTSbyC, const IntegerVector& nbyTS, const IntegerVector& nGbyTS, const IntegerVector& nbyG, const IntegerVector& y, const int L, const int nG, const NumericVector& lg_beta, const NumericVector& lg_gamma, const NumericVector& lg_delta, const double delta); RcppExport SEXP _celda_cG_CalcGibbsProbY_ori(SEXP indexSEXP, SEXP countsSEXP, SEXP nTSbyCSEXP, SEXP nbyTSSEXP, SEXP nGbyTSSEXP, SEXP nbyGSEXP, SEXP ySEXP, SEXP LSEXP, SEXP nGSEXP, SEXP lg_betaSEXP, SEXP lg_gammaSEXP, SEXP lg_deltaSEXP, SEXP deltaSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const int >::type index(indexSEXP); Rcpp::traits::input_parameter< const IntegerMatrix& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const IntegerMatrix& >::type nTSbyC(nTSbyCSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type nbyTS(nbyTSSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type nGbyTS(nGbyTSSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type nbyG(nbyGSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type y(ySEXP); Rcpp::traits::input_parameter< const int >::type L(LSEXP); Rcpp::traits::input_parameter< const int >::type nG(nGSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type lg_beta(lg_betaSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type lg_gamma(lg_gammaSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type lg_delta(lg_deltaSEXP); Rcpp::traits::input_parameter< const double >::type delta(deltaSEXP); rcpp_result_gen = Rcpp::wrap(cG_CalcGibbsProbY_ori(index, counts, nTSbyC, nbyTS, nGbyTS, nbyG, y, L, nG, lg_beta, lg_gamma, lg_delta, delta)); return rcpp_result_gen; END_RCPP } // cG_CalcGibbsProbY_fastRow NumericVector cG_CalcGibbsProbY_fastRow(const int index, const IntegerMatrix& counts, const IntegerMatrix& nTSbyC, const IntegerVector& nbyTS, const IntegerVector& nGbyTS, const IntegerVector& nbyG, const IntegerVector& y, const int L, const int nG, const NumericVector& lg_beta, const NumericVector& lg_gamma, const NumericVector& lg_delta, const double delta); RcppExport SEXP _celda_cG_CalcGibbsProbY_fastRow(SEXP indexSEXP, SEXP countsSEXP, SEXP nTSbyCSEXP, SEXP nbyTSSEXP, SEXP nGbyTSSEXP, SEXP nbyGSEXP, SEXP ySEXP, SEXP LSEXP, SEXP nGSEXP, SEXP lg_betaSEXP, SEXP lg_gammaSEXP, SEXP lg_deltaSEXP, SEXP deltaSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const int >::type index(indexSEXP); Rcpp::traits::input_parameter< const IntegerMatrix& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const IntegerMatrix& >::type nTSbyC(nTSbyCSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type nbyTS(nbyTSSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type nGbyTS(nGbyTSSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type nbyG(nbyGSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type y(ySEXP); Rcpp::traits::input_parameter< const int >::type L(LSEXP); Rcpp::traits::input_parameter< const int >::type nG(nGSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type lg_beta(lg_betaSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type lg_gamma(lg_gammaSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type lg_delta(lg_deltaSEXP); Rcpp::traits::input_parameter< const double >::type delta(deltaSEXP); rcpp_result_gen = Rcpp::wrap(cG_CalcGibbsProbY_fastRow(index, counts, nTSbyC, nbyTS, nGbyTS, nbyG, y, L, nG, lg_beta, lg_gamma, lg_delta, delta)); return rcpp_result_gen; END_RCPP } // cG_CalcGibbsProbY NumericVector cG_CalcGibbsProbY(const int index, const NumericVector& counts, const NumericMatrix& nTSbyC, const NumericVector& nbyTS, const IntegerVector& nGbyTS, const NumericVector& nbyG, const IntegerVector& y, const int L, const int nG, const NumericVector& lg_beta, const NumericVector& lg_gamma, const NumericVector& lg_delta, const double delta); RcppExport SEXP _celda_cG_CalcGibbsProbY(SEXP indexSEXP, SEXP countsSEXP, SEXP nTSbyCSEXP, SEXP nbyTSSEXP, SEXP nGbyTSSEXP, SEXP nbyGSEXP, SEXP ySEXP, SEXP LSEXP, SEXP nGSEXP, SEXP lg_betaSEXP, SEXP lg_gammaSEXP, SEXP lg_deltaSEXP, SEXP deltaSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const int >::type index(indexSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const NumericMatrix& >::type nTSbyC(nTSbyCSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type nbyTS(nbyTSSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type nGbyTS(nGbyTSSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type nbyG(nbyGSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type y(ySEXP); Rcpp::traits::input_parameter< const int >::type L(LSEXP); Rcpp::traits::input_parameter< const int >::type nG(nGSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type lg_beta(lg_betaSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type lg_gamma(lg_gammaSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type lg_delta(lg_deltaSEXP); Rcpp::traits::input_parameter< const double >::type delta(deltaSEXP); rcpp_result_gen = Rcpp::wrap(cG_CalcGibbsProbY(index, counts, nTSbyC, nbyTS, nGbyTS, nbyG, y, L, nG, lg_beta, lg_gamma, lg_delta, delta)); return rcpp_result_gen; END_RCPP } // eigenMatMultInt SEXP eigenMatMultInt(const Eigen::Map A, const Eigen::Map< Eigen::MatrixXi> B); RcppExport SEXP _celda_eigenMatMultInt(SEXP ASEXP, SEXP BSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Eigen::Map >::type A(ASEXP); Rcpp::traits::input_parameter< const Eigen::Map< Eigen::MatrixXi> >::type B(BSEXP); rcpp_result_gen = Rcpp::wrap(eigenMatMultInt(A, B)); return rcpp_result_gen; END_RCPP } // eigenMatMultNumeric SEXP eigenMatMultNumeric(const Eigen::Map A, const Eigen::Map< Eigen::MatrixXd> B); RcppExport SEXP _celda_eigenMatMultNumeric(SEXP ASEXP, SEXP BSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Eigen::Map >::type A(ASEXP); Rcpp::traits::input_parameter< const Eigen::Map< Eigen::MatrixXd> >::type B(BSEXP); rcpp_result_gen = Rcpp::wrap(eigenMatMultNumeric(A, B)); return rcpp_result_gen; END_RCPP } // fastNormProp SEXP fastNormProp(NumericMatrix R_counts, double R_alpha); RcppExport SEXP _celda_fastNormProp(SEXP R_countsSEXP, SEXP R_alphaSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type R_counts(R_countsSEXP); Rcpp::traits::input_parameter< double >::type R_alpha(R_alphaSEXP); rcpp_result_gen = Rcpp::wrap(fastNormProp(R_counts, R_alpha)); return rcpp_result_gen; END_RCPP } // fastNormPropLog SEXP fastNormPropLog(NumericMatrix R_counts, double R_alpha); RcppExport SEXP _celda_fastNormPropLog(SEXP R_countsSEXP, SEXP R_alphaSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type R_counts(R_countsSEXP); Rcpp::traits::input_parameter< double >::type R_alpha(R_alphaSEXP); rcpp_result_gen = Rcpp::wrap(fastNormPropLog(R_counts, R_alpha)); return rcpp_result_gen; END_RCPP } // fastNormPropSqrt SEXP fastNormPropSqrt(NumericMatrix R_counts, double R_alpha); RcppExport SEXP _celda_fastNormPropSqrt(SEXP R_countsSEXP, SEXP R_alphaSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type R_counts(R_countsSEXP); Rcpp::traits::input_parameter< double >::type R_alpha(R_alphaSEXP); rcpp_result_gen = Rcpp::wrap(fastNormPropSqrt(R_counts, R_alpha)); return rcpp_result_gen; END_RCPP } // nonzero SEXP nonzero(NumericMatrix R_counts); RcppExport SEXP _celda_nonzero(SEXP R_countsSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type R_counts(R_countsSEXP); rcpp_result_gen = Rcpp::wrap(nonzero(R_counts)); return rcpp_result_gen; END_RCPP } // colSumByGroupSparse Rcpp::NumericMatrix colSumByGroupSparse(const Eigen::MappedSparseMatrix& counts, const IntegerVector& group, const int& K); RcppExport SEXP _celda_colSumByGroupSparse(SEXP countsSEXP, SEXP groupSEXP, SEXP KSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Eigen::MappedSparseMatrix& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type group(groupSEXP); Rcpp::traits::input_parameter< const int& >::type K(KSEXP); rcpp_result_gen = Rcpp::wrap(colSumByGroupSparse(counts, group, K)); return rcpp_result_gen; END_RCPP } // rowSumByGroupSparse Rcpp::NumericMatrix rowSumByGroupSparse(const Eigen::MappedSparseMatrix& counts, const IntegerVector& group, const int& L); RcppExport SEXP _celda_rowSumByGroupSparse(SEXP countsSEXP, SEXP groupSEXP, SEXP LSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Eigen::MappedSparseMatrix& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type group(groupSEXP); Rcpp::traits::input_parameter< const int& >::type L(LSEXP); rcpp_result_gen = Rcpp::wrap(rowSumByGroupSparse(counts, group, L)); return rcpp_result_gen; END_RCPP } // colSumByGroupChangeSparse Rcpp::NumericMatrix colSumByGroupChangeSparse(const Eigen::MappedSparseMatrix& counts, const NumericMatrix& px, const IntegerVector& group, const IntegerVector& pgroup, const int& K); RcppExport SEXP _celda_colSumByGroupChangeSparse(SEXP countsSEXP, SEXP pxSEXP, SEXP groupSEXP, SEXP pgroupSEXP, SEXP KSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Eigen::MappedSparseMatrix& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const NumericMatrix& >::type px(pxSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type group(groupSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type pgroup(pgroupSEXP); Rcpp::traits::input_parameter< const int& >::type K(KSEXP); rcpp_result_gen = Rcpp::wrap(colSumByGroupChangeSparse(counts, px, group, pgroup, K)); return rcpp_result_gen; END_RCPP } // rowSumByGroupChangeSparse Rcpp::NumericMatrix rowSumByGroupChangeSparse(const Eigen::MappedSparseMatrix& counts, const NumericMatrix& px, const IntegerVector& group, const IntegerVector& pgroup, const int& L); RcppExport SEXP _celda_rowSumByGroupChangeSparse(SEXP countsSEXP, SEXP pxSEXP, SEXP groupSEXP, SEXP pgroupSEXP, SEXP LSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Eigen::MappedSparseMatrix& >::type counts(countsSEXP); Rcpp::traits::input_parameter< const NumericMatrix& >::type px(pxSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type group(groupSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type pgroup(pgroupSEXP); Rcpp::traits::input_parameter< const int& >::type L(LSEXP); rcpp_result_gen = Rcpp::wrap(rowSumByGroupChangeSparse(counts, px, group, pgroup, L)); return rcpp_result_gen; END_RCPP } RcppExport SEXP _colSumByGroup(SEXP, SEXP); RcppExport SEXP _colSumByGroup_numeric(SEXP, SEXP); RcppExport SEXP _colSumByGroupChange(SEXP, SEXP, SEXP, SEXP); RcppExport SEXP _colSumByGroupChange_numeric(SEXP, SEXP, SEXP, SEXP); RcppExport SEXP _perplexityG(SEXP, SEXP, SEXP, SEXP); RcppExport SEXP _rowSumByGroup(SEXP, SEXP); RcppExport SEXP _rowSumByGroup_numeric(SEXP, SEXP); RcppExport SEXP _rowSumByGroupChange(SEXP, SEXP, SEXP, SEXP); RcppExport SEXP _rowSumByGroupChange_numeric(SEXP, SEXP, SEXP, SEXP); static const R_CallMethodDef CallEntries[] = { {"_celda_decontXEM", (DL_FUNC) &_celda_decontXEM, 10}, {"_celda_decontXLogLik", (DL_FUNC) &_celda_decontXLogLik, 6}, {"_celda_decontXInitialize", (DL_FUNC) &_celda_decontXInitialize, 4}, {"_celda_calculateNativeMatrix", (DL_FUNC) &_celda_calculateNativeMatrix, 6}, {"_celda_cG_calcGibbsProbY_Simple", (DL_FUNC) &_celda_cG_calcGibbsProbY_Simple, 11}, {"_celda_cG_CalcGibbsProbY_ori", (DL_FUNC) &_celda_cG_CalcGibbsProbY_ori, 13}, {"_celda_cG_CalcGibbsProbY_fastRow", (DL_FUNC) &_celda_cG_CalcGibbsProbY_fastRow, 13}, {"_celda_cG_CalcGibbsProbY", (DL_FUNC) &_celda_cG_CalcGibbsProbY, 13}, {"_celda_eigenMatMultInt", (DL_FUNC) &_celda_eigenMatMultInt, 2}, {"_celda_eigenMatMultNumeric", (DL_FUNC) &_celda_eigenMatMultNumeric, 2}, {"_celda_fastNormProp", (DL_FUNC) &_celda_fastNormProp, 2}, {"_celda_fastNormPropLog", (DL_FUNC) &_celda_fastNormPropLog, 2}, {"_celda_fastNormPropSqrt", (DL_FUNC) &_celda_fastNormPropSqrt, 2}, {"_celda_nonzero", (DL_FUNC) &_celda_nonzero, 1}, {"_celda_colSumByGroupSparse", (DL_FUNC) &_celda_colSumByGroupSparse, 3}, {"_celda_rowSumByGroupSparse", (DL_FUNC) &_celda_rowSumByGroupSparse, 3}, {"_celda_colSumByGroupChangeSparse", (DL_FUNC) &_celda_colSumByGroupChangeSparse, 5}, {"_celda_rowSumByGroupChangeSparse", (DL_FUNC) &_celda_rowSumByGroupChangeSparse, 5}, {"_colSumByGroup", (DL_FUNC) &_colSumByGroup, 2}, {"_colSumByGroup_numeric", (DL_FUNC) &_colSumByGroup_numeric, 2}, {"_colSumByGroupChange", (DL_FUNC) &_colSumByGroupChange, 4}, {"_colSumByGroupChange_numeric", (DL_FUNC) &_colSumByGroupChange_numeric, 4}, {"_perplexityG", (DL_FUNC) &_perplexityG, 4}, {"_rowSumByGroup", (DL_FUNC) &_rowSumByGroup, 2}, {"_rowSumByGroup_numeric", (DL_FUNC) &_rowSumByGroup_numeric, 2}, {"_rowSumByGroupChange", (DL_FUNC) &_rowSumByGroupChange, 4}, {"_rowSumByGroupChange_numeric", (DL_FUNC) &_rowSumByGroupChange_numeric, 4}, {NULL, NULL, 0} }; RcppExport void R_init_celda(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } ================================================ FILE: src/cG_calcGibbsProbY.cpp ================================================ #include using namespace Rcpp; // [[Rcpp::depends(Rcpp)]] //Contains a version that is more simple to implement and understand as it matches the equations more directly // However it is many times slower. It is useful to use as a sanity check when modifying the faster function. // [[Rcpp::export]] NumericVector cG_calcGibbsProbY_Simple(const IntegerMatrix counts, IntegerVector nGbyTS, IntegerMatrix nTSbyC, IntegerVector nbyTS, IntegerVector nbyG, const IntegerVector y, const int L, const int index, const double gamma, const double beta, const double delta) { int index0 = index - 1; int current_y = y[index0] - 1; NumericVector probs(L); nGbyTS[current_y] -= 1; nbyTS[current_y] -= nbyG[index0]; nTSbyC(current_y,_) = nTSbyC(current_y,_) - counts(index0,_); for (int i = 0; i < L; i++) { nGbyTS[i] += 1; nbyTS[i] += nbyG[index0]; nTSbyC(i,_) = nTSbyC(i,_) + counts(index0,_); probs[i] += sum(lgamma(nGbyTS + gamma)); probs[i] += sum(lgamma(nTSbyC + beta)); probs[i] += sum(lgamma(nGbyTS * delta)); probs[i] -= sum(lgamma(nbyTS + (nGbyTS * delta))); nGbyTS[i] -= 1; nbyTS[i] -= nbyG[index0]; nTSbyC(i,_) = nTSbyC(i,_) - counts(index0,_); } nGbyTS[current_y] += 1; nbyTS[current_y] += nbyG[index0]; nTSbyC(current_y,_) = nTSbyC(current_y,_) + counts(index0,_); return(probs); } // [[Rcpp::export]] NumericVector cG_CalcGibbsProbY_ori(const int index, const IntegerMatrix& counts, const IntegerMatrix& nTSbyC, const IntegerVector& nbyTS, const IntegerVector& nGbyTS, const IntegerVector& nbyG, const IntegerVector& y, const int L, const int nG, const NumericVector& lg_beta, const NumericVector& lg_gamma, const NumericVector& lg_delta, const double delta) { int index0 = index - 1; int current_y = y[index0] - 1; int i, j, k; NumericVector probs(L); NumericVector nTSbyC_prob1(L); NumericVector nTSbyC_prob2(L); // Calculate probabilities related to the "n.TS.by.C" part of equation one time up front // The first vector represents when the current feature is added to that module // The second vector represents when the current feature is NOT added to that module for (int col = 0; col < counts.ncol(); col++) { j = col * L + current_y; // Index for the current module in the n.TS.by.C matrix k = col * nG + index0; // Index for the current feature in counts matrix for (int row = 0; row < L; row++) { if (row == current_y) { nTSbyC_prob1[row] += lg_beta[nTSbyC[j] - counts[k]]; nTSbyC_prob2[row] += lg_beta[nTSbyC[j]]; } else { nTSbyC_prob1[row] += lg_beta[nTSbyC[col * L + row]]; nTSbyC_prob2[row] += lg_beta[nTSbyC[col * L + row] + counts[k]]; } } } // Calculate the probabilities for each module // If statements determine whether to add or subtract counts from each probability for (i = 0; i < L; i++) { for(j = 0; j < L; j++) { if((i == j) & (i != current_y)) { probs[i] += lg_gamma[nGbyTS[j] + 1]; probs[i] += nTSbyC_prob2[j]; probs[i] += lg_delta[nGbyTS[j] + 1]; probs[i] -= lgamma(nbyTS[j] + nbyG[index0] + ((nGbyTS[j] + 1) * delta)); } else if ((j == current_y) & (i != current_y)) { probs[i] += lg_gamma[nGbyTS[j] - 1];; probs[i] += nTSbyC_prob1[j]; probs[i] += lg_delta[nGbyTS[j] - 1]; probs[i] -= lgamma(nbyTS[j] - nbyG[index0] + ((nGbyTS[j] - 1) * delta)); } else { probs[i] += lg_gamma[nGbyTS[j]];; probs[i] += lg_delta[nGbyTS[j]]; probs[i] -= lgamma(nbyTS[j] + (nGbyTS[j] * delta)); if(j == current_y) { probs[i] += nTSbyC_prob2[j]; } else { probs[i] += nTSbyC_prob1[j]; } } } } return(probs); } // [[Rcpp::export]] NumericVector cG_CalcGibbsProbY_fastRow(const int index, const IntegerMatrix& counts, const IntegerMatrix& nTSbyC, const IntegerVector& nbyTS, const IntegerVector& nGbyTS, const IntegerVector& nbyG, const IntegerVector& y, const int L, const int nG, const NumericVector& lg_beta, const NumericVector& lg_gamma, const NumericVector& lg_delta, const double delta) { int index0 = index - 1; int current_y = y[index0] - 1; int i; NumericVector probs(L); // Calculate probabilities related to the "n.TS.by.C" part of equation one time up front // The first case of if statement represents when the current feature is already added to that module // The second case represents when the current feature is NOT YET added to that module for (i = 0; i < L; i++) { if (i == current_y ) { for (int col = 0; col < counts.ncol(); col++) { probs[i] += lg_beta[nTSbyC(i, col)]; probs[i] -= lg_beta[nTSbyC(i, col) - counts(index0, col)]; } } else { for (int col = 0; col < counts.ncol(); col++) { probs[i] += lg_beta[nTSbyC(i, col) + counts(index0, col)]; probs[i] -= lg_beta[nTSbyC(i, col)]; } } } // Calculate the probabilities for each module // If statements determine whether to add or subtract counts from each probability for (i = 0; i < L; i++) { if (i == current_y) { probs[i] += lg_gamma[nGbyTS[i]]; probs[i] -= lg_gamma[nGbyTS[i] - 1]; probs[i] += lg_delta[nGbyTS[i]]; probs[i] -= lg_delta[nGbyTS[i] - 1]; probs[i] += lgamma(nbyTS[i] - nbyG[index0] + (nGbyTS[i] - 1) * delta); probs[i] -= lgamma(nbyTS[i] + nGbyTS[i] * delta); } else { probs[i] += lg_gamma[nGbyTS[i] + 1]; probs[i] -= lg_gamma[nGbyTS[i]]; probs[i] += lg_delta[nGbyTS[i] + 1]; probs[i] -= lg_delta[nGbyTS[i]]; probs[i] += lgamma(nbyTS[i] + nGbyTS[i] * delta); probs[i] -= lgamma(nbyTS[i] + nbyG[index0] + (nGbyTS[i] + 1) * delta); } } return(probs); } // [[Rcpp::export]] NumericVector cG_CalcGibbsProbY(const int index, const NumericVector& counts, const NumericMatrix& nTSbyC, const NumericVector& nbyTS, const IntegerVector& nGbyTS, const NumericVector& nbyG, const IntegerVector& y, const int L, const int nG, const NumericVector& lg_beta, const NumericVector& lg_gamma, const NumericVector& lg_delta, const double delta) { int index0 = index - 1; int current_y = y[index0] - 1; //int current_y = y; int i; int j,k; NumericVector probs(L); // Calculate probabilities related to the "n.TS.by.C" part of equation one time up front // The first case of if statement represents when the current feature is already added to that module // The second case represents when the current feature is NOT YET added to that module for (int col = 0; col < counts.length(); col++) { //k = col * nG + index0; // Index for the current feature in counts matrix, used when the whole matrix was passed to this function rather than just the row k = col; // Index for the current feature in counts matrix for (i = 0; i < L; i++) { j = col * L + i; // Index for the current module in the n.TS.by.C matrix if (i == current_y) { probs[i] += lg_beta[nTSbyC[j]]; probs[i] -= lg_beta[nTSbyC[j] - counts[k]]; } else { probs[i] += lg_beta[nTSbyC[j] + counts[k]]; probs[i] -= lg_beta[nTSbyC[j]]; } } } // Calculate the probabilities for each module // If statements determine whether to add or subtract counts from each probability for (i = 0; i < L; i++) { if (i == current_y) { probs[i] += lg_gamma[nGbyTS[i]]; probs[i] -= lg_gamma[nGbyTS[i] - 1]; probs[i] += lg_delta[nGbyTS[i]]; probs[i] -= lg_delta[nGbyTS[i] - 1]; probs[i] += lgamma(nbyTS[i] - nbyG[index0] + (nGbyTS[i] - 1) * delta); probs[i] -= lgamma(nbyTS[i] + nGbyTS[i] * delta); } else { probs[i] += lg_gamma[nGbyTS[i] + 1]; probs[i] -= lg_gamma[nGbyTS[i]]; probs[i] += lg_delta[nGbyTS[i] + 1]; probs[i] -= lg_delta[nGbyTS[i]]; probs[i] += lgamma(nbyTS[i] + nGbyTS[i] * delta); probs[i] -= lgamma(nbyTS[i] + nbyG[index0] + (nGbyTS[i] + 1) * delta); } } return(probs); } ================================================ FILE: src/eigenMatMultInt.cpp ================================================ // [[Rcpp::depends(RcppEigen)]] #include //' Fast matrix multiplication for double x int //' //' @param A a double matrix //' @param B an integer matrix //' @return An integer matrix representing the product of A and B // [[Rcpp::export]] SEXP eigenMatMultInt(const Eigen::Map A, const Eigen::Map< Eigen::MatrixXi> B){ Eigen::MatrixXd C = A.transpose() * B.cast(); return Rcpp::wrap(C); } //' Fast matrix multiplication for double x double //' //' @param A a double matrix //' @param B an integer matrix //' @return An integer matrix representing the product of A and B // [[Rcpp::export]] SEXP eigenMatMultNumeric(const Eigen::Map A, const Eigen::Map< Eigen::MatrixXd> B){ Eigen::MatrixXd C = A.transpose() * B; return Rcpp::wrap(C); } ================================================ FILE: src/matrixNorm.cpp ================================================ // [[Rcpp::depends(Rcpp)]] #include using namespace Rcpp ; //' Fast normalization for numeric matrix //' //' @param R_counts An integer matrix //' @param R_alpha A double value to be added to the matrix as a pseudocount //' @return A numeric matrix where the columns have been normalized to proportions // [[Rcpp::export]] SEXP fastNormProp(NumericMatrix R_counts, double R_alpha) { // Get colSums and instantiate new matrix NumericVector cs = colSums(R_counts); NumericMatrix res = NumericMatrix(R_counts.nrow(), R_counts.ncol()); // Normalize cell counts to proportions after adding pseudocount double alpha_tot = R_counts.nrow() * R_alpha; for (int i = 0; i < R_counts.ncol(); ++i) { if (cs[i] + alpha_tot == 0) { stop("Division by 0. Make sure colSums of counts does not contain 0 after rounding counts to integers."); } res(_,i) = (R_counts(_,i) + R_alpha) / (cs[i] + alpha_tot); } return Rcpp::wrap(res); } //' Fast normalization for numeric matrix //' //' @param R_counts An integer matrix //' @param R_alpha A double value to be added to the matrix as a pseudocount //' @return A numeric matrix where the columns have been normalized to proportions // [[Rcpp::export]] SEXP fastNormPropLog(NumericMatrix R_counts, double R_alpha) { // Get colSums and instantiate new matrix NumericVector cs = colSums(R_counts); NumericMatrix res = NumericMatrix(R_counts.nrow(), R_counts.ncol()); // Normalize cell counts to proportions after adding pseudocount double alpha_tot = R_counts.nrow() * R_alpha; for (int i = 0; i < R_counts.ncol(); ++i) { if (cs[i] + alpha_tot == 0) { stop("Division by 0. Make sure colSums of counts does not contain 0 after rounding counts to integers."); } res(_,i) = log((R_counts(_,i) + R_alpha) / (cs[i] + alpha_tot)); } return Rcpp::wrap(res); } //' Fast normalization for numeric matrix //' //' @param R_counts An integer matrix //' @param R_alpha A double value to be added to the matrix as a pseudocount //' @return A numeric matrix where the columns have been normalized to proportions // [[Rcpp::export]] SEXP fastNormPropSqrt(NumericMatrix R_counts, double R_alpha) { // Get colSums and instantiate new matrix NumericVector cs = colSums(R_counts); NumericMatrix res = NumericMatrix(R_counts.nrow(), R_counts.ncol()); // Normalize cell counts to proportions after adding pseudocount double alpha_tot = R_counts.nrow() * R_alpha; for (int i = 0; i < R_counts.ncol(); ++i) { if (cs[i] + alpha_tot == 0) { stop("Division by 0. Make sure colSums of counts does not contain 0 after rounding counts to integers."); } res(_,i) = sqrt((R_counts(_,i) + R_alpha) / (cs[i] + alpha_tot)); } return Rcpp::wrap(res); } //' get row and column indices of none zero elements in the matrix //' //' @param R_counts A matrix //' @return An integer matrix where each row is a row, column indices pair // [[Rcpp::export]] SEXP nonzero(NumericMatrix R_counts) { IntegerVector row(1); IntegerVector col(1); NumericVector val(1); int nR = R_counts.nrow(); int nC = R_counts.ncol(); double x; for (int c = 0; c < nC; c++) { for (int r = 0; r < nR; r++) { x = R_counts[c * nR + r]; if (x != 0) { row.push_back(r + 1); col.push_back(c + 1); val.push_back(x); } } } row.erase(0); col.erase(0); val.erase(0); List res; res["row"] = row; res["col"] = col; res["val"] = val; return(res); } ================================================ FILE: src/matrixSums.c ================================================ #include #include #include SEXP _rowSumByGroup(SEXP R_x, SEXP R_group) { int i, j; int nr = nrows(R_x); int nc = ncols(R_x); int *x = INTEGER(R_x); // If the grouping variable is not a factor, throw an error if (!isFactor(R_group)) { error("The grouping argument must be a factor"); } int *group = INTEGER(R_group); int nl = nlevels(R_group); // If the sizes of the grouping variable and matrix do not match, throw an error if (LENGTH(R_group) != nr) { error("The length of the grouping argument must match the number of rows in the matrix"); } for (i = 0; i < nr; i++) { if(group[i] == NA_INTEGER) { error("Labels in group and pgroup must not be NA."); } } // Allocate a variable for the return matrix SEXP R_ans; PROTECT(R_ans = allocMatrix(INTSXP, nl, nc)); // Set a pointer to the return matrix and initialize the memory int *ans = INTEGER(R_ans); Memzero(ans, nl * nc); // Sum the totals for each element of the 'group' variable // Note: columns are iterated over before rows because the compiler appears to store expressions like // 'j * nr' in a temporary variable (as they do not change within inner loop); // swapping the order of the outer and inner loops slows down the code ~10X for (j = 0; j < nc; j++) { for (i = 0; i < nr; i++) { ans[j * nl + group[i] - 1] += x[j * nr + i]; } } UNPROTECT(1); return(R_ans); } SEXP _colSumByGroup(SEXP R_x, SEXP R_group) { int i, j; int nr = nrows(R_x); int nc = ncols(R_x); int *x = INTEGER(R_x); // If the grouping variable is not a factor, throw an error if (!isFactor(R_group)) { error("The grouping argument must be a factor"); } int *group = INTEGER(R_group); int nl = nlevels(R_group); // If the sizes of the grouping variable and matrix do not match, throw an error if (LENGTH(R_group) != nc) { error("The length of the grouping argument must match the number of columns in the matrix"); } for (i = 0; i < nc; i++) { if(group[i] == NA_INTEGER) { error("Labels in group and pgroup must not be NA."); } } // Allocate a variable for the return matrix SEXP R_ans; PROTECT(R_ans = allocMatrix(INTSXP, nr, nl)); // Set a pointer to the return matrix and initialize the memory int *ans = INTEGER(R_ans); Memzero(ans, nr * nl); // Sum the totals for each element of the 'group' variable // Note: columns are iterated over before rows because the compiler appears to store expressions like // 'j * nr' in a temporary variable (as they do not change within inner loop); // swapping the order of the outer and inner loops slows down the code ~10X for (j = 0; j < nc; j++) { for (i = 0; i < nr; i++) { ans[(group[j] - 1) * nr + i] += x[j * nr + i]; } } UNPROTECT(1); return(R_ans); } SEXP _rowSumByGroupChange(SEXP R_x, SEXP R_px, SEXP R_group, SEXP R_pgroup) { int i, j; int nr = nrows(R_x); int nc = ncols(R_x); int *x = INTEGER(R_x); int *px = INTEGER(R_px); int *group = INTEGER(R_group); int *pgroup = INTEGER(R_pgroup); // If the grouping variable is not a factor, throw an error if (!isFactor(R_group) || !isFactor(R_pgroup)) { error("The grouping arguments must be factors"); } int nl = nlevels(R_group); int nlp = nlevels(R_pgroup); if(nl != nlp || nl != nrows(R_px)) { error("group and pgroup must have the same number of levels equal to row number of px"); } if(nc != ncols(R_px)) { error("x and the previously summed matrix, px, must have the same number of columns."); } if(length(R_group) != length(R_pgroup) || length(R_group) != nr) { error("group label and previous group label must be the same length as the number of rows in x."); } for (i = 0; i < nr; i++) { if(group[i] == NA_INTEGER || pgroup[i] == NA_INTEGER) { error("Labels in group and pgroup must not be NA."); } } int g_ix; int pg_ix; for (i = 0; i < nr; i++) { if(pgroup[i] != group[i]) { for (j = 0; j < nc; j++) { pg_ix = (j * nl) + (pgroup[i] - 1); g_ix = (j * nl) + (group[i] - 1); px[pg_ix] -= x[j * nr + i]; px[g_ix] += x[j * nr + i]; } } } return(R_px); } SEXP _colSumByGroupChange(SEXP R_x, SEXP R_px, SEXP R_group, SEXP R_pgroup) { int i, j; int nr = nrows(R_x); int nc = ncols(R_x); int *x = INTEGER(R_x); int *px = INTEGER(R_px); int *group = INTEGER(R_group); int *pgroup = INTEGER(R_pgroup); // If the grouping variable is not a factor, throw an error if (!isFactor(R_group) || !isFactor(R_pgroup)) { error("The grouping arguments must be factors"); } int nl = nlevels(R_group); int nlp = nlevels(R_pgroup); if(nl != nlp || nl != ncols(R_px)) { error("group and pgroup must have the same number of levels equal to column number of px"); } if(nr != nrows(R_px)) { error("x and the previously summed matrix, pxc must have the same number of rows"); } if(length(R_group) != length(R_pgroup) || length(R_group) != nc) { error("group label and previous group label must be the same length as the number of columns in x."); } for (i = 0; i < nc; i++) { if(group[i] == NA_INTEGER || pgroup[i] == NA_INTEGER) { error("Labels in group and pgroup must not be NA."); } } // Sum the totals for each element of the 'group' variable, // But only where the group label is different than the previous label // Note: columns are iterated over before rows because the compiler appears to store expressions like // 'j * nr' in a temporary variable (as they do not change within inner loop); // swapping the order of the outer and inner loops slows down the code ~10X for (j = 0; j < nc; j++) { if(group[j] != pgroup[j]) { for (i = 0; i < nr; i++) { px[(group[j]-1) * nr + i] += x[j * nr + i]; px[(pgroup[j]-1) * nr + i] -= x[j * nr + i]; } } } return(R_px); } SEXP _rowSumByGroup_numeric(SEXP R_x, SEXP R_group) { int i, j; int nr = nrows(R_x); int nc = ncols(R_x); double *x = REAL(R_x); // If the grouping variable is not a factor, throw an error if (!isFactor(R_group)) { error("The grouping argument must be a factor"); } int *group = INTEGER(R_group); int nl = nlevels(R_group); // If the sizes of the grouping variable and matrix do not match, throw an error if (LENGTH(R_group) != nr) { error("The length of the grouping argument must match the number of rows in the matrix"); } // Allocate a variable for the return matrix SEXP R_ans; PROTECT(R_ans = allocMatrix(REALSXP, nl, nc)); // Set a pointer to the return matrix and initialize the memory double *ans = REAL(R_ans); Memzero(ans, nl * nc); // Sum the totals for each element of the 'group' variable // Note: columns are iterated over before rows because the compiler appears to store expressions like // 'j * nr' in a temporary variable (as they do not change within inner loop); // swapping the order of the outer and inner loops slows down the code ~10X for (j = 0; j < nc; j++) { for (i = 0; i < nr; i++) { ans[j * nl + group[i] - 1] += x[j * nr + i]; } } UNPROTECT(1); return(R_ans); } SEXP _colSumByGroup_numeric(SEXP R_x, SEXP R_group) { int i, j; int nr = nrows(R_x); int nc = ncols(R_x); double *x = REAL(R_x); // If the grouping variable is not a factor, throw an error if (!isFactor(R_group)) { error("The grouping argument must be a factor"); } int *group = INTEGER(R_group); int nl = nlevels(R_group); // If the sizes of the grouping variable and matrix do not match, throw an error if (LENGTH(R_group) != nc) { error("The length of the grouping argument must match the number of columns in the matrix"); } // Allocate a variable for the return matrix SEXP R_ans; PROTECT(R_ans = allocMatrix(REALSXP, nr, nl)); // Set a pointer to the return matrix and initialize the memory double *ans = REAL(R_ans); Memzero(ans, nr * nl); // Sum the totals for each element of the 'group' variable // Note: columns are iterated over before rows because the compiler appears to store expressions like // 'j * nr' in a temporary variable (as they do not change within inner loop); // swapping the order of the outer and inner loops slows down the code ~10X for (j = 0; j < nc; j++) { for (i = 0; i < nr; i++) { ans[(group[j] - 1) * nr + i] += x[j * nr + i]; } } UNPROTECT(1); return(R_ans); } SEXP _rowSumByGroupChange_numeric(SEXP R_x, SEXP R_px, SEXP R_group, SEXP R_pgroup) { int i, j; int nr = nrows(R_x); int nc = ncols(R_x); double *x = REAL(R_x); double *px = REAL(R_px); int *group = INTEGER(R_group); int *pgroup = INTEGER(R_pgroup); // If the grouping variable is not a factor, throw an error if (!isFactor(R_group) || !isFactor(R_pgroup)) { error("The grouping arguments must be factors"); } int nl = nlevels(R_group); int nlp = nlevels(R_pgroup); if(nl != nlp || nl != nrows(R_px)) { error("group and pgroup must have the same number of levels equal to row number of px"); } if(nc != ncols(R_px)) { error("x and the previously summed matrix, px, must have the same number of columns."); } if(length(R_group) != length(R_pgroup) || length(R_group) != nr) { error("group label and previous group label must be the same length as the number of rows in x."); } for (i = 0; i < nr; i++) { if(group[i] == NA_INTEGER || pgroup[i] == NA_INTEGER) { error("Labels in group and pgroup must not be NA."); } } int g_ix; int pg_ix; for (i = 0; i < nr; i++) { if(pgroup[i] != group[i]) { for (j = 0; j < nc; j++) { pg_ix = (j * nl) + (pgroup[i] - 1); g_ix = (j * nl) + (group[i] - 1); px[pg_ix] -= x[j * nr + i]; px[g_ix] += x[j * nr + i]; } } } return(R_px); } SEXP _colSumByGroupChange_numeric(SEXP R_x, SEXP R_px, SEXP R_group, SEXP R_pgroup) { int i, j; int nr = nrows(R_x); int nc = ncols(R_x); double *x = REAL(R_x); double *px = REAL(R_px); int *group = INTEGER(R_group); int *pgroup = INTEGER(R_pgroup); // If the grouping variable is not a factor, throw an error if (!isFactor(R_group) || !isFactor(R_pgroup)) { error("The grouping arguments must be factors"); } int nl = nlevels(R_group); int nlp = nlevels(R_pgroup); if(nl != nlp || nl != ncols(R_px)) { error("group and pgroup must have the same number of levels equal to column number of px"); } if(nr != nrows(R_px)) { error("x and the previously summed matrix, pxc must have the same number of rows"); } if(length(R_group) != length(R_pgroup) || length(R_group) != nc) { error("group label and previous group label must be the same length as the number of columns in x."); } for (i = 0; i < nc; i++) { if(group[i] == NA_INTEGER || pgroup[i] == NA_INTEGER) { error("Labels in group and pgroup must not be NA."); } } // Sum the totals for each element of the 'group' variable, // But only where the group label is different than the previous label // Note: columns are iterated over before rows because the compiler appears to store expressions like // 'j * nr' in a temporary variable (as they do not change within inner loop); // swapping the order of the outer and inner loops slows down the code ~10X for (j = 0; j < nc; j++) { if(group[j] != pgroup[j]) { for (i = 0; i < nr; i++) { px[(group[j]-1) * nr + i] += x[j * nr + i]; px[(pgroup[j]-1) * nr + i] -= x[j * nr + i]; } } } return(R_px); } ================================================ FILE: src/matrixSumsSparse.cpp ================================================ #include #include // [[Rcpp::depends(RcppEigen)]] using namespace Rcpp; // Function to sum the counts for all cells that belong to a population together //Performs the same function as .colSumByGroup but for sparse matrices // [[Rcpp::export]] Rcpp::NumericMatrix colSumByGroupSparse( const Eigen::MappedSparseMatrix &counts, const IntegerVector &group, const int &K) { // Perform error checking if (counts.cols() != group.size()) { stop("Length of 'group' must be equal to the number of columns in 'counts'."); } if (min(group) < 1 || max(group) > K) { stop("The entries in 'group' need to be between 1 and 'K'"); } if(K > counts.cols()) { stop("'K' cannot be bigger than the number of columns in 'counts'."); } NumericMatrix x(counts.rows(), K); for (int j = 0; j < counts.cols(); ++j) { for (Eigen::MappedSparseMatrix::InnerIterator i_(counts, j); i_; ++i_) { x(i_.index(), group[j] - 1) += i_.value(); } } return x; } // Function to sum the counts for all features that belong to a module together //Performs the same function as .rowSumByGroup but for sparse matrices // [[Rcpp::export]] Rcpp::NumericMatrix rowSumByGroupSparse( const Eigen::MappedSparseMatrix &counts, const IntegerVector &group, const int &L) { // Perform error checking if (counts.rows() != group.size()) { stop("Length of 'group' must be equal to the number of rows in 'counts'."); } if (min(group) < 1 || max(group) > L) { stop("The entries in 'group' need to be between 1 and 'L'."); } if(L > counts.rows()) { stop("'L' cannot be bigger than the number of rows in 'counts'."); } NumericMatrix x(L, counts.cols()); for (int j = 0; j < counts.cols(); ++j) { for (Eigen::MappedSparseMatrix::InnerIterator i_(counts, j); i_; ++i_) { x(group[i_.index()] - 1, j) += i_.value(); } } return x; } // Function to change the counts for a cells previously assigned to one group // that have now been assigned to another group. Performs the same function // as .colSumByGroupChange but for sparse matrices // [[Rcpp::export]] Rcpp::NumericMatrix colSumByGroupChangeSparse( const Eigen::MappedSparseMatrix &counts, const NumericMatrix &px, const IntegerVector &group, const IntegerVector &pgroup, const int &K) { // Perform error checking if (counts.cols() != group.size()) { stop("Length of 'group' must be equal to the number of columns in 'counts'."); } if(group.size() != pgroup.size()) { stop("Length of 'group' must equal 'pgroup'."); } if (min(group) < 1 || max(group) > K) { stop("The entries in 'group' need to be between 1 and 'K'."); } if (min(pgroup) < 1 || max(pgroup) > K) { stop("The entries in 'pgroup' need to be between 1 and 'K'."); } if(px.rows() != counts.rows()) { stop("'px' and 'counts' must have the same number of rows."); } if(K > counts.cols()) { stop("'K' cannot be bigger than the number of columns in 'counts'."); } NumericMatrix x = px; for (int j = 0; j < counts.cols(); ++j) { if(group[j] != pgroup[j]) { for (Eigen::MappedSparseMatrix::InnerIterator i_(counts, j); i_; ++i_) { x(i_.index(), group[j] - 1) += i_.value(); x(i_.index(), pgroup[j] - 1) -= i_.value(); } } } return x; } // Function to change the counts for a features previously assigned to one group // that have now been assigned to another group. Performs the same function // as .rowSumByGroupChange but for sparse matrices // [[Rcpp::export]] Rcpp::NumericMatrix rowSumByGroupChangeSparse( const Eigen::MappedSparseMatrix &counts, const NumericMatrix &px, const IntegerVector &group, const IntegerVector &pgroup, const int &L) { // Perform error checking if (counts.rows() != group.size()) { stop("Length of 'group' must be equal to the number of rows in 'counts'."); } if(group.size() != pgroup.size()) { stop("Length of 'group' must equal 'pgroup'."); } if (min(group) < 1 || max(group) > L) { stop("The entries in 'group' need to be between 1 and 'L'."); } if (min(pgroup) < 1 || max(pgroup) > L) { stop("The entries in 'pgroup' need to be between 1 and 'L'."); } if(px.cols() != counts.cols()) { stop("'px' and 'counts' must have the same number of rows."); } if(L > counts.rows()) { stop("'L' cannot be bigger than the number of rows in 'counts'."); } NumericMatrix x = px; for (int j = 0; j < counts.cols(); ++j) { for (Eigen::MappedSparseMatrix::InnerIterator i_(counts, j); i_; ++i_) { if(group[i_.index()] != pgroup[i_.index()]) { x(group[i_.index()] - 1, j) += i_.value(); x(pgroup[i_.index()] - 1, j) -= i_.value(); } } } return x; } ================================================ FILE: src/perplexity.c ================================================ #include #include #include SEXP _perplexityG(SEXP R_x, SEXP R_phi, SEXP R_psi, SEXP R_group) { int i, j; int nr = nrows(R_x); int nc = ncols(R_x); int nl = nlevels(R_group); // If the grouping variable is not a factor, throw an error if (!isFactor(R_group)) { error("The grouping argument must be a factor"); } // If the length of the grouping variable and matrix do not match, throw an error if (LENGTH(R_group) != nr) { error("The length of the grouping argument must match the number of rows in the matrix."); } if (ncols(R_phi) != nc) { error("The R_phi and R_x must have the same number of colums."); } if (nrows(R_phi) != nl) { error("R_phi must have the same number of rows as the number of levels in R_group."); } if (nrows(R_psi) != nr) { error("The R_psi and R_x must have the same number of rows."); } if (ncols(R_psi) != nl) { error("R_phi must have the same number of columns as the number of levels in R_group."); } // Create pointers int *group = INTEGER(R_group); double *phi = REAL(R_phi); double *psi = REAL(R_psi); int *x = INTEGER(R_x); // Make sure values are not NA and within the range of the number of rows for (i = 0; i < nr; i++) { if(group[i] == NA_INTEGER || group[i] < 0 || group[i] > nr) { error("Labels in group and pgroup must not be NA and must less than or equal to the number of rows in the matrix."); } } // Allocate a variable for the return matrix double ans = 0; // Multiply the probabilties, log transform, and multiply against the counts to derive log(p(x)) for (j = 0; j < nc; j++) { for (i = 0; i < nr; i++) { ans += x[j * nr + i] * log(phi[j * nl + (group[i]-1)] * psi[nr * (group[i]-1) + i]); } } SEXP R_ans = PROTECT(allocVector(REALSXP, 1)); REAL(R_ans)[0] = ans; UNPROTECT(1); return(R_ans); } ================================================ FILE: tests/testthat/test-celda-functions.R ================================================ # library(celda) # context("Testing miscellaneous celda functions") # # data(celdaCGGridSearchRes) # celdaCGSim <- simulateCells("celda_CG", K = 5, L = 10) # modelCG <- celda_CG( # counts = celdaCGSim$counts, # sampleLabel = celdaCGSim$sampleLabel, # K = celdaCGSim$K, # L = celdaCGSim$L, # algorithm = "EM", # verbose = FALSE, # nchains = 1) # # factorized <- factorizeMatrix(celdaMod = modelCG, counts = celdaCGSim$counts) # # test_that(desc = "Testing compareCountMatrix with numeric matrix input", { # # Case from GitHub issue #137 # counts <- celdaCGSim$counts # storage.mode(counts) <- "numeric" # expect_true(compareCountMatrix(counts, modelCG, errorOnMismatch = TRUE)) # }) # # test_that(desc = "Testing appendCeldaList", { # expect_error( # appendCeldaList(celdaCGGridSearchRes, matrix(0)), # "Both parameters to appendCeldaList must be of class celdaList." # ) # modifiedEgList <- celdaCGGridSearchRes # modifiedEgList@countChecksum <- "abcd12345" # expect_warning( # appendCeldaList(celdaCGGridSearchRes, modifiedEgList), # paste0("Provided lists have different countChecksums and may have", # " been generated from different count matrices. Using checksum", # " from first list...")) # expect_equal(length(celdaCGGridSearchRes@resList) * 2, # length(appendCeldaList(celdaCGGridSearchRes, # celdaCGGridSearchRes)@resList)) # }) # miscellaneous fxns # functions used internally test_that(desc = "Invoking error from distinctColors function", { expect_error(distinctColors(n = 3, hues = "xx"), paste0("Only color names listed in the 'color' function can be", " used in 'hues'")) }) test_that(desc = "Invoking error from sample labels function", { expect_error(.processSampleLabels("Sample_1", 5), paste0("'sampleLabel' must be the same length as the number", " of columns in the 'counts' matrix.")) }) test_that(desc = "Invoking error from .logMessages function", { expect_error(.logMessages(date(), logfile = 5)) }) test_that(desc = paste0("miscellaneous distance fxns that are not directly", " used within celda, but will be tested"), { x <- data.frame(x = seq(2, 4), y = seq(1, 3)) expect_equal(class(.hellingerDist(x)), "dist") expect_equal(class(.spearmanDist(x)), "dist") }) ================================================ FILE: tests/testthat/test-celda_C.R ================================================ # celda_C library(celda) library(SingleCellExperiment) library(Matrix) context("Testing celda_C") sceceldaCSim <- simulateCells("celda_C", K = 10) counts(sceceldaCSim) <- as(counts(sceceldaCSim), "dgCMatrix") scesf <- selectFeatures(sceceldaCSim) K <- S4Vectors::metadata(sceceldaCSim)$celda_simulateCellscelda_C$K counts <- SummarizedExperiment::assay(scesf, "counts") sce <- celda_C(scesf, sampleLabel = sampleLabel(scesf), K = K, algorithm = "EM", verbose = FALSE, nchains = 1) factorized <- factorizeMatrix(sce) # celda_C test_that(desc = "Testing simulation and celda_C model", { #expect_equal(typeof(counts), "integer") expect_true(all(sweep(factorized$counts$sample, 2, colSums(factorized$counts$sample), "/") == factorized$proportions$sample)) expect_true(ncol(factorized$proportions$module) == K) expect_true(all(is.numeric(logLikelihoodHistory(sce)))) expect_equal(max(logLikelihoodHistory(sce)), bestLogLikelihood(sce)) }) # clusterProbability test_that(desc = "Testing clusterProbability with celda_C", { expect_true(all(round(rowSums(clusterProbability(sce)[[1]])) == 1)) }) # celdaGridSearch and perplexity calculations test_that(desc = "Testing celdaGridSearch with celda_C", { celdaCResList <- celdaGridSearch(scesf, model = "celda_C", nchains = 2, paramsTest = list(K = c(5, 6)), paramsFixed = list(sampleLabel = sampleLabel(scesf)), maxIter = 2, verbose = FALSE, bestOnly = FALSE, perplexity = FALSE) expect_error(celdaGridSearch(scesf, model = "celda_C", paramsTest = list(K = c(4, 5), M = c(3, 4)), paramsFixed = list(sampleLabel = sampleLabel(scesf)), bestOnly = FALSE), paste0("The following elements in 'paramsTest' are not arguments of", " 'celda_C': M")) expect_error(celdaGridSearch(scesf, model = "celda_C", nchains = 1, maxIter = 1, paramsTest = list(K = c(4, 5), sampleLabel = "Sample"), paramsFixed = list(sampleLabel = sampleLabel(scesf))), paste0("Setting parameters such as 'z.init', 'y.init', and", " 'sampleLabel' in 'paramsTest' is not currently supported.")) expect_error(celdaGridSearch(scesf, model = "celda_C", nchains = 1, maxIter = 1, paramsTest = list(), paramsFixed = list(sampleLabel = sampleLabel(scesf))), paste0("The following arguments are not in 'paramsTest' or", " 'paramsFixed' but are required for 'celda_C': K")) expect_error(celdaGridSearch(scesf, model = "celda_C", nchains = 1, maxIter = 1, paramsTest = list(K = c(9, 10)), paramsFixed = list(sampleLabel = sampleLabel(scesf), xxx = "xxx")), paste0("The following elements in 'paramsFixed' are not arguments", " of 'celda_C': xxx")) expect_true(class(S4Vectors::metadata( altExp(celdaCResList))$celda_grid_search)[1] == "celdaList") expect_equal(names(runParams(celdaCResList)), c("index", "chain", "K", "seed", "logLikelihood")) expect_error(plotGridSearchPerplexity(celdaCResList)) celdaCResList <- resamplePerplexity(celdaCResList, numResample = 2) expect_equal(is.null(altExp( celdaCResList)@metadata$celda_grid_search@perplexity), FALSE) expect_true(is(celdaCResList, "SingleCellExperiment")) expect_error(resamplePerplexity(celdaCResList, numResample = "2"), "The 'numResample' parameter needs to be an integer greater than 0.") plotObj <- plotGridSearchPerplexity(celdaCResList) expect_is(plotObj, "ggplot") celdaCResIndex1 <- subsetCeldaList(celdaCResList, params = list(index = 1)) expect_true(all("celda_grid_search" %in% names(altExp(celdaCResList)@metadata) && "celda_parameters" %in% names(altExp(celdaCResIndex1)@metadata))) expect_error(subsetCeldaList(celdaCResList, params = list(K = 11))) expect_error(subsetCeldaList(celdaCResList, params = list(K = 5, M = 10))) celdaCResK5 <- subsetCeldaList(celdaCResList, params = list(K = 5)) sce2 <- selectBestModel(celdaCResK5) res <- perplexity(sce) res2 <- perplexity(sce, newCounts = as.matrix(counts + 1)) expect_error(res <- perplexity(sce, newCounts = counts[-1, ])) }) # # logLikelihood # test_that(desc = "Testing logLikelihood.celda_C", { # expect_lt(logLikelihood(sce), 0) # fakeZ <- celdaCSim$z # fakeZ[1] <- celdaCSim$K + 1 # expect_error(logLikelihood(model = "celda_C", # z = fakeZ, # counts = celdaCSim$counts, # K = celdaCSim$K, # alpha = 1, # beta = 1, # sampleLabel = sampleLabel(sceceldaCSim)), # "An entry in z contains a value greater than the provided K.") # }) # # # Gibbs sampling # test_that(desc = "Testing celda_C with Gibbs sampling", { # res <- celda_C(sceceldaCSim, # sampleLabel = sampleLabel(sceceldaCSim), # K = celdaCSim$K, # algorithm = "Gibbs", # maxIter = 5, # nchain = 1) # expect_is(res, "celda_C") # }) # # # normalizeCounts # test_that(desc = paste0("Making sure normalizeCounts doesn't change", # " dimensions of counts matrix"), { # normCounts <- normalizeCounts(celdaCSim$counts) # expect_equal(dim(normCounts), dim(celdaCSim$counts)) # expect_equal(rownames(normCounts), rownames(celdaCSim$counts)) # expect_equal(colnames(normCounts), colnames(celdaCSim$counts)) # expect_error(normalizeCounts(celdaCSim$counts, # transformationFun = "scale"), # "'transformationFun' needs to be of class 'function'") # expect_error(normalizeCounts(celdaCSim$counts, scaleFun = "scale"), # "'scaleFun' needs to be of class 'function'") # }) # # # recodeClusterZ # test_that(desc = "Testing recodeClusterZ with celda_C", { # expect_error(recodeClusterY(celdaMod = modelC, # from = c(1, 2, 3, 4, 5), # to = c(5, 4, 3, 2, 1))) # expect_error(recodeClusterZ(celdaMod = modelC, # from = NULL, # to = "")) # expect_error(recodeClusterZ(celdaMod = modelC, # from = c(1, 2, 3, 4, 5), # to = c(1, 2, 3, 4, 6))) # expect_error(recodeClusterZ(celdaMod = modelC, # from = c(1, 2, 3, 4, 6), # to = c(1, 2, 3, 4, 5))) # newRecoded <- recodeClusterZ(celdaMod = modelC, # from = c(1, 2, 3, 4, 5), # to = c(5, 4, 3, 2, 1)) # expect_equal(modelC@clusters$z == 1, newRecoded@clusters$z == 5) # }) # # # compareCountMatrix # # test_that(desc = "Testing CompareCountMatrix with celda_C", { # # expect_true(compareCountMatrix(counts = celdaCSim$counts, # # celdaMod = modelC)) # # # # lessCells <- celdaCSim$counts[, seq(100)] # # expect_error(compareCountMatrix(counts = lessCells, celdaMod = modelC), # # paste0("The provided celda object was generated from a counts", # # " matrix with a different number of cells than the one provided.")) # # # # countsMatrixError <- matrix(data = 1, # # nrow = nrow(celdaCSim$counts), # # ncol = ncol(celdaCSim$counts)) # # expect_false(compareCountMatrix(counts = countsMatrixError, # # celdaMod = modelC, # # errorOnMismatch = FALSE)) # # expect_error(compareCountMatrix(counts = countsMatrixError, # # celdaMod = modelC, # # errorOnMismatch = TRUE)) # # }) # # # topRank # test_that(desc = "Checking topRank to see if it runs without errors", { # topRank <- topRank(matrix = factorized$proportions$module, # threshold = NULL) # expect_equal(names(topRank), c("index", "names")) # topRank <- topRank(matrix = factorized$proportions$module, n = 1000) # }) # # # plotHeatmap # test_that(desc = "Testing plotHeatmap with celda_C", { # expect_error(plotHeatmap(counts = celdaCSim$counts, z = modelC@params$K), # "Length of z must match number of columns in counts matrix") # expect_error(plotHeatmap( # counts = celdaCSim$counts, # z = modelC@clusters$z, # scaleRow = modelC), # "'scaleRow' needs to be of class 'function'") # expect_error(plotHeatmap(counts = celdaCSim$counts, # z = modelC@clusters$z, # trim = 3), # paste0("'trim' should be a 2 element vector specifying the lower and", # " upper boundaries")) # }) # # # plotHeatmap with annotationCell # # test_that(desc = "Testing plotHeatmap with celda_C, including annotations", # # { # # annot <- as.data.frame(c(rep(x = 1, # # times = ncol(celdaCSim$counts) - 100), # # rep(x = 2, 100))) # # # # rownames(annot) <- colnames(celdaCSim$counts) # # expect_equal(names(plotHeatmap( # # celdaMod = modelC, # # counts = celdaCSim$counts, # # annotationCell = annot, # # z = modelC@clusters$z)), # # c("tree_row", "tree_col", "gtable")) # # # # rownames(annot) <- NULL # # expect_equal(names(plotHeatmap(celdaMod = modelC, # # counts = celdaCSim$counts, # # annotation.feature = as.matrix(annot), # # z = modelC@clusters$z)), # # c("tree_row", "tree_col", "gtable")) # # # # rownames(annot) <- rev(colnames(celdaCSim$counts)) # # expect_error(plotHeatmap(celdaMod = modelC, # # counts = celdaCSim$counts, # # annotationCell = annot, # # z = modelC@clusters$z), # # paste0("Row names of 'annotationCell' are different than the", # # " column names of 'counts'") # # ) # # }) # # # celdaHeatmap # # test_that(desc = "Testing celdaHeatmap with celda_C", { # # expect_equal(names(celdaHeatmap(celdaMod = modelC, # # counts = celdaCSim$counts)), # # c("tree_row", "tree_col", "gtable")) # # }) # # # celdaProbabilityMap # # test_that(desc = "Testing celdaProbabiltyMap.celda_C for sample level", { # # plotObj <- celdaProbabilityMap(counts = celdaCSim$counts, # # celdaMod = modelC, # # level = "sample") # # expect_true(!is.null(plotObj)) # # # # ## Without a sample label # # modelC <- celda_C(celdaCSim$counts, # # sampleLabel = NULL, # # K = celdaCSim$K, # # maxIter = 5, # # nchain = 1) # # plotObj <- celdaProbabilityMap(counts = celdaCSim$counts, # # celdaMod = modelC, # # level = "sample") # # expect_true(!is.null(plotObj)) # # }) # # test_that(desc = paste0("Testing celdaTsne with celda_C when model class is", # "changed, should error"), { # modelX <- modelC # class(modelX) <- "celda_X" # expect_error(celdaTsne(counts = celdaCSim$counts, # celdaMod = modelX, # maxCells = length(modelC@clusters$z), # minClusterSize = 10), # "unable to find") # }) # # test_that(desc = "Testing celdaTsne with celda_C including all cells", { # tsne <- celdaTsne(counts = celdaCSim$counts, # celdaMod = modelC, # maxCells = length(modelC@clusters$z), # minClusterSize = 10) # plotObj <- plotDimReduceCluster(tsne[, 1], # tsne[, 2], # modelC@clusters$z, # labelClusters = TRUE) # expect_true(ncol(tsne) == 2 & nrow(tsne) == length(modelC@clusters$z)) # expect_true(!is.null(plotObj)) # }) # # test_that(desc = paste0("Testing celdaTsne with celda_C including", # " a subset of cells"), { # expect_success(expect_error(tsne <- celdaTsne(counts = celdaCSim$counts, # celdaMod = modelC, # maxCells = 50, # minClusterSize = 50))) # tsne <- celdaTsne(counts = celdaCSim$counts, # celdaMod = modelC, # maxCells = 100, # minClusterSize = 10) # plotObj <- plotDimReduceCluster(tsne[, 1], tsne[, 2], modelC@clusters$z) # expect_true(ncol(tsne) == 2 & # nrow(tsne) == length(modelC@clusters$z) && # sum(!is.na(tsne[, 1])) == 100) # expect_true(!is.null(plotObj)) # }) # # test_that(desc = paste0("Testing celdaUmap with celda_C when model class is", # " changed, should error"), { # modelX <- modelC # class(modelX) <- "celda_X" # expect_error(celdaUmap(counts = celdaCSim$counts, # celdaMod = modelX, # maxCells = length(modelC@clusters$z), # minClusterSize = 10), # "unable to find") # }) # # # test_that(desc = "Testing celdaUmap with celda_C including all cells", { # # umap <- celdaUmap(counts = celdaCSim$counts, # # celdaMod = modelC, # # maxCells = length(modelC@clusters$z), # # minClusterSize = 10) # # plotObj <- plotDimReduceCluster(umap[, 1], umap[, 2], modelC@clusters$z) # # expect_true(ncol(umap) == 2 & # # nrow(umap) == length(modelC@clusters$z)) # # expect_true(!is.null(plotObj)) # # }) # # test_that(desc = paste0("Testing celdaUmap with celda_C including", # " a subset of cells"), { # expect_success(expect_error(umap <- celdaUmap( # counts = celdaCSim$counts, # celdaMod = modelC, # maxCells = 50, # minClusterSize = 50))) # umap <- celdaUmap(counts = celdaCSim$counts, # celdaMod = modelC, # maxCells = 100, # minClusterSize = 10) # plotObj <- plotDimReduceCluster(umap[, 1], umap[, 2], modelC@clusters$z) # expect_true(ncol(umap) == 2 & # nrow(umap) == length(modelC@clusters$z) && # sum(!is.na(umap[, 1])) == 100) # expect_true(!is.null(plotObj)) # }) # # # featureModuleLookup # test_that(desc = "Testing featureModuleLookup with celda_C", { # expect_error(featureModuleLookup(celdaCSim$counts, modelC, "test_feat")) # }) # # # .cCSplitZ # test_that(desc = "Testing error checking for .cCSplitZ", { # r <- simulateCells("celda_C", # S = 1, # CRange = c(50, 100), # K = 2) # dc <- .cCDecomposeCounts(r$counts, r$sampleLabel, r$z, r$K) # res <- .cCSplitZ(r$counts, # dc$mCPByS, # dc$nGByCP, # dc$nCP, # s = as.integer(r$sampleLabel), # z = r$z, # K = r$K, # nS = dc$nS, # nG = dc$nG, # alpha = 1, # beta = 1, # zProb = NULL, # minCell = 1000) # expect_true(grepl("Cluster sizes too small", res$message)) # }) # # test_that(desc = "Testing perplexity.celda_C", { # expect_true(is.numeric(perplexity(celdaCSim$counts, modelC))) # # class(modelC) <- c("celda_CG") # expect_error(perplexity.celda_C(celdaCSim$counts, modelC), # "could not find function \"perplexity.celda_C\"") # }) ================================================ FILE: tests/testthat/test-celda_CG.R ================================================ # celda_CG library(celda) context("Testing celda_CG") K <- 5 L <- 10 celdaCGSim <- simulateCells("celda_CG", K = K, L = L) counts(celdaCGSim) <- as(counts(celdaCGSim), "dgCMatrix") celdaCGSim <- selectFeatures(celdaCGSim, 0, 0) modelCG <- celda_CG(celdaCGSim, sampleLabel = celdaCGSim$celda_sample_label, K = K, L = L, zInitialize = "random", yInitialize = "random", nchains = 1, algorithm = "EM", verbose = FALSE) factorized <- factorizeMatrix(modelCG) # celda_CG test_that(desc = "Testing simulation and celda_CG model", { expect_true(all(sweep(factorized$counts$cell, 2, colSums(counts(altExp(celdaCGSim))), "/") == factorized$proportions$cell)) expect_equal(K, ncol(factorized$proportions$cellPopulation)) expect_equal(L, nrow(factorized$proportions$cellPopulation)) }) # Cluster probabilities test_that(desc = "Testing clusterProbability with celda_CG", { clustProb <- clusterProbability(modelCG) expect_true(all(round(rowSums(clustProb$zProbability), 10) == 1) & nrow(clustProb$zProbability) == ncol(modelCG)) expect_true(all(round(rowSums(clustProb$yProbability), 10) == 1) & nrow(clustProb$yProbability) == nrow(modelCG)) clustProb <- clusterProbability(modelCG, log = TRUE) }) test_that(desc = paste0("Testing simulateCells celda_CG error checking with", " low gamma"), { expect_warning(simulateCells(model = "celda_CG", gamma = 0.1)) }) test_that(desc = paste0("Testing simulateCells celda_CG, make sure all genes", " expressed"), { simCellsLow <- simulateCells(model = "celda_CG", G = 1000, C = 300, CRange = c(1, 100), NRange = c(1, 100)) expect_true(all(rowSums(counts(simCellsLow)) > 0)) }) # Ensure logLikelihood calculates the expected values test_that(desc = "Testing logLikelihood functions for celda_CG", { expect_lt(logLikelihood(modelCG), 0) fakeZ <- as.integer(celdaClusters(modelCG)) fakeZ[1] <- K + 1 expect_error(.logLikelihoodcelda_CG( K = K, L = L, y = as.integer(celdaModules(modelCG)), z = fakeZ, delta = 1, gamma = 1, beta = 1, alpha = 1, s = modelCG$celda_sample_label, counts = counts(altExp(modelCG))), paste0("Assigned value of cell cluster greater than the total ", "number of cell clusters!")) fakeY <- as.integer(celdaModules(modelCG)) fakeY[1] <- L + 1 expect_error(.logLikelihoodcelda_CG( y = fakeY, z = as.integer(celdaClusters(modelCG)), delta = 1, gamma = 1, beta = 1, alpha = 1, K = K, L = L, s = modelCG$celda_sample_label, counts = counts(altExp(modelCG))), paste0("Assigned value of feature module greater than the total ", "number of feature modules!")) }) # normalizeCounts test_that(desc = "Testing normalizeCounts with celda_CG", { normCounts <- normalizeCounts(counts(celdaCGSim)) expect_equal(dim(normCounts), dim(counts(celdaCGSim))) expect_equal(rownames(normCounts), rownames(counts(celdaCGSim))) expect_equal(colnames(normCounts), colnames(counts(celdaCGSim))) expect_error(normalizeCounts(counts(celdaCGSim), transformationFun = "scale"), "'transformationFun' needs to be of class 'function'") expect_error(normalizeCounts(counts(celdaCGSim), scaleFun = "scale"), "'scaleFun' needs to be of class 'function'") }) # recodeClusterY test_that(desc = "Testing recodeClusterY with celda_CG", { expect_error(recodeClusterY(modelCG, from = NULL, to = "")) expect_error(recodeClusterY(modelCG, from = c(1, 2, 3, 4, 5), to = c(1, 2, 4, 3, 6))) expect_error(recodeClusterY(modelCG, from = c(1, 2, 3, 4, 6), to = c(1, 2, 4, 3, 5))) newRecoded <- recodeClusterY(modelCG, from = c(1, 2, 3, 4, 5), to = c(3, 2, 1, 4, 5)) expect_equal(celdaModules(modelCG) == 1, celdaModules(newRecoded) == 3) }) # recodeClusterZ test_that(desc = "Testing recodeClusterZ with celda_CG", { expect_error(recodeClusterZ(modelCG, from = NULL, to = "")) expect_error(recodeClusterZ(modelCG, from = c(1, 2, 3, 4, 5), to = c(1, 2, 3, 4, 6))) expect_error(recodeClusterZ(modelCG, from = c(1, 2, 3, 4, 6), to = c(1, 2, 3, 4, 5))) newRecoded <- recodeClusterZ(modelCG, from = c(1, 2, 3, 4, 5), to = c(5, 4, 3, 2, 1)) expect_equal(celdaClusters(modelCG) == 1, celdaClusters(newRecoded) == 5) }) # topRank test_that(desc = "Testing topRank with celda_CG", { topRank <- topRank(matrix = factorized$proportions$module, n = 1000, threshold = NULL) expect_equal(names(topRank), c("index", "names")) }) # celdaHeatmap test_that(desc = "Testing celdaHeatmap with celda_CG", { plt <- celdaHeatmap(modelCG) expect_equal(class(plt), c("gtable", "gTree", "grob", "gDesc")) }) # moduleHeatmap test_that(desc = "Testing moduleHeatmap with celda_CG", { plt <- moduleHeatmap(modelCG, featureModule = c(2, 3), topCells = 10, topFeatures = 10) expect_is(plt, "list") }) # celdaProbabiltyMap test_that(desc = "Testing celdaProbabiltyMap", { plotObj <- celdaProbabilityMap(modelCG) plotObj <- celdaProbabilityMap(modelCG, level = "cellPopulation") expect_true(!is.null(plotObj)) }) test_that(desc = "Testing celdaUmap and celdaTsne with celda_CG", { modelCG <- celdaUmap(modelCG, maxCells = 100, minClusterSize = 10) modelCG <- celdaTsne(modelCG, maxCells = 100, minClusterSize = 10) plotObj <- plotDimReduceCluster(modelCG, "celda_UMAP") expect_true(!is.null(plotObj)) }) # featureModuleLookup test_that(desc = "Testing featureModuleLookup with celda_CG", { res <- featureModuleLookup(modelCG, "Gene_1") expect_true(res == celdaModules(modelCG)[1]) expect_error(featureModuleLookup(modelCG, "XXXXXXX")) }) test_that(desc = "Testing perplexity of celda_CG", { expect_true(is.numeric(perplexity(modelCG))) }) test_that(desc = "Testing featureModuleTable", { table <- featureModuleTable(modelCG, outputFile = NULL) expect_equal(ncol(table), 10) }) test_that(desc = "Testing plotCeldaViolin with celda_CG", { violin <- plotCeldaViolin(modelCG, features = "Gene_1") expect_is(violin, "ggplot") }) ================================================ FILE: tests/testthat/test-celda_G.R ================================================ # # celda_G # library(celda) # context("Testing celda_G") # # celdaGSim <- simulateCells("celda_G", L = 5, G = 100) # modelG <- celda_G(counts = celdaGSim$counts, # L = celdaGSim$L, # verbose = FALSE) # factorized <- factorizeMatrix(counts = celdaGSim$counts, celdaMod = modelG) # # test_that(desc = "Testing celda_G model with numeric input matrix", { # # Github Issue #347 # numericCounts <- celdaGSim$counts # storage.mode(numericCounts) <- "numeric" # expect_true(is(celda_G(counts = numericCounts, # L = celdaGSim$L, # maxIter = 1, # verbose = FALSE), # "celda_G")) # }) # # test_that(desc = "Testing clusterProbability with celda_G", { # expect_true(ncol(clusterProbability(celdaGSim$counts, # modelG)$yProbability) == celdaGSim$L) # }) # # test_that(desc = paste0("Testing simulateCells.celda_G error checking with", # " low gamma"), { # expect_error(simulateCells(model = "celda_G", gamma = 0.000001)) # }) # # test_that(desc = paste0("Testing simulateCells.celda_G, make sure all genes", # " expressed"), { # simCellsLow <- simulateCells(model = "celda_G", # G = 1000, # C = 300, # NRange = c(1, 10)) # expect_true(all(rowSums(simCellsLow$counts) > 0)) # }) # # # test_that(desc = "Testing LogLikelihood functions", { # # expect_true(all(is.numeric(logLikelihoodHistory(celdaMod = modelG)))) # # expect_equal(max(logLikelihoodHistory(celdaMod = modelG)), # # bestLogLikelihood(modelG)) # # }) # # # test_that(desc = "Testing celdaGridSearch with celda_G", { # # celdaGRes <- celdaGridSearch(counts = celdaGSim$counts, # # model = "celda_G", # # nchains = 2, # # paramsTest = list(L = c(5, 10)), # # maxIter = 1, # # verbose = FALSE, # # bestOnly = FALSE, # # perplexity = FALSE) # # # # expect_error(celdaGridSearch(counts = celdaGSim$counts, # # model = "celda_G", # # paramsTest = list(L = 10, M = 4), # # bestOnly = FALSE), # # paste0("The following elements in 'paramsTest' are not", # # " arguments of 'celda_G': M")) # # # # expect_error(celdaGridSearch(counts = celdaGSim$counts, # # model = "celda_G", # # nchains = 1, # # paramsTest = list(L = c(4, 5), yInit = 10)), # # paste0("Setting parameters such as 'z.init', 'yInit', and", # # " 'sample.label' in 'paramsTest' is not currently supported.")) # # # # expect_error(celdaGridSearch(counts = celdaGSim$counts, # # model = "celda_G", # # nchains = 1, # # paramsTest = list()), # # paste0("The following arguments are not in 'paramsTest' or", # # " 'paramsFixed' but are required for 'celda_G': L")) # # # # expect_error(celdaGridSearch(counts = celdaGSim$counts, # # model = "celda_G", # # nchains = 1, # # paramsTest = list(L = 10), # # paramsFixed = list(xxx = "xxx")), # # paste0("The following elements in 'paramsFixed' are not arguments", # # " of 'celda_G': xxx")) # # # # expect_warning(celdaGridSearch(counts = celdaGSim$counts, # # model = "celda_G", # # paramsTest = list(L = c(5, 6), nchains = 2)), # # paste0("Parameter 'nchains' should not be used within the", # # " paramsTest list")) # # # # expect_true(is(celdaGRes, "celdaList")) # # expect_error(plotGridSearchPerplexity(celdaGRes)) # # expect_equal(names(runParams(celdaGRes)), # # c("index", "chain", "L", "log_likelihood")) # # # # celdaGRes <- resamplePerplexity(celdaGSim$counts, celdaGRes, # # numResample = 2) # # expect_equal(is.null(celdaGRes@perplexity), FALSE) # # expect_is(celdaGRes, "celdaList") # # expect_error(resamplePerplexity(celdaGSim$counts, # # celdaGRes, numResample = "2")) # # expect_error(resamplePerplexity(celdaGSim$counts, # # "celdaGRes", numResample = 2)) # # # # plotObj <- plotGridSearchPerplexity(celdaGRes) # # expect_is(plotObj, "ggplot") # # # # celdaCRes <- celdaGridSearch(counts = celdaGSim$counts, # # model = "celda_C", # # nchains = 2, # # paramsTest = list(K = c(5, 10)), # # maxIter = 1, # # verbose = FALSE, # # bestOnly = TRUE) # # expect_error(plotGridSearchPerplexity.celda_G(celdaCRes)) # # # # celdaGResIndex1 <- subsetCeldaList(celdaGRes, params = list(index = 1)) # # expect_true(all(is(celdaGResIndex1, "celda_G") && # # !is(celdaGResIndex1, "celdaList"))) # # # # expect_error(subsetCeldaList(celdaGRes, params = list(L = 11))) # # expect_error(subsetCeldaList(celdaGRes, params = list(L = 5, M = 10))) # # # # celdaGResL5 <- subsetCeldaList(celdaGRes, params = list(L = 5)) # # modelG <- selectBestModel(celdaGResL5) # # res <- perplexity(celdaGSim$counts, modelG) # # res2 <- perplexity(celdaGSim$counts, # # modelG, newCounts = celdaGSim$counts + 1) # # # # expect_error(res <- perplexity(celdaGSim$counts, modelG, # # newCounts = celdaGSim$counts[-1, ])) # # }) # # # logLikelihood # test_that(desc = "Testing logLikelihood.celda_G", { # expect_lt(logLikelihood(model = "celda_G", # counts = celdaGSim$counts, # y = celdaGSim$y, # L = celdaGSim$L, # delta = 1, # gamma = 1, # beta = 1), # 0) # # fakeY <- celdaGSim$y # fakeY[1] <- celdaGSim$L + 1 # expect_error(logLikelihood(model = "celda_G", # counts = celdaGSim$counts, # y = fakeY, # L = celdaGSim$L, # delta = 1, # gamma = 1, # beta = 1), # "An entry in y contains a value greater than the provided L.") # }) # # # normalizeCounts # test_that(desc = paste0("Making sure normalizeCounts doesn't change", # " dimensions of counts matrix"), { # normCounts <- normalizeCounts(celdaGSim$counts) # expect_equal(dim(normCounts), dim(celdaGSim$counts)) # expect_equal(rownames(normCounts), rownames(celdaGSim$counts)) # expect_equal(colnames(normCounts), colnames(celdaGSim$counts)) # expect_error(normalizeCounts(celdaGSim$counts, # transformationFun = "scale"), # "'transformationFun' needs to be of class 'function'") # expect_error(normalizeCounts(celdaGSim$counts, scaleFun = "scale"), # "'scaleFun' needs to be of class 'function'") # }) # # # recodeClusterY # test_that(desc = "Testing recodeClusterY with celda_G", { # expect_error(recodeClusterZ(celdaMod = modelG, # from = c(1, 2, 3, 4, 5), # to = c(5, 4, 3, 2, 1))) # expect_error(recodeClusterY(celdaMod = modelG, # from = NULL, # to = "")) # expect_error(recodeClusterY(celdaMod = modelG, # from = c(1, 2, 3, 4, 5), # to = c(1, 2, 3, 4, 6))) # expect_error(recodeClusterY(celdaMod = modelG, # from = c(1, 2, 3, 4, 6), # to = c(1, 2, 3, 4, 5))) # newRecoded <- recodeClusterY(celdaMod = modelG, # from = c(1, 2, 3, 4, 5), # to = c(5, 4, 3, 2, 1)) # expect_equal(modelG@clusters$y == 1, newRecoded@clusters$y == 5) # }) # # # compareCountMatrix # test_that(desc = "Testing CompareCountMatrix with celda_G", { # expect_true(compareCountMatrix(counts = celdaGSim$counts, # celdaMod = modelG)) # lessFeatures <- celdaGSim$counts[1:50, ] # expect_error(compareCountMatrix(counts = lessFeatures, celdaMod = modelG), # paste0("The provided celda object was generated from a counts matrix", # " with a different number of features than the one provided.")) # # countsMatrixError <- matrix(data = 1, # nrow = nrow(celdaGSim$counts), # ncol = ncol(celdaGSim$counts)) # expect_false(compareCountMatrix(counts = countsMatrixError, # celdaMod = modelG, # errorOnMismatch = FALSE)) # expect_error(compareCountMatrix(counts = countsMatrixError, # celdaMod = modelG, # errorOnMismatch = TRUE)) # }) # # # topRank # test_that(desc = "Testing topRank function with celda_G", { # topRank <- topRank(matrix = factorized$proportions$module, # n = 1000, # threshold = NULL) # expect_equal(names(topRank), # c("index", "names")) # expect_equal(names(topRank(matrix = factorized$proportions$module)), # c("index", "names")) # }) # # # plotHeatmap # test_that(desc = "Testing plotHeatmap with celda_G", { # expect_error(plotHeatmap(counts = celdaGSim$counts, y = modelG@params$L), # "Length of y must match number of rows in counts matrix") # expect_error(plotHeatmap(counts = celdaGSim$counts, # y = modelG@clusters$y, # scaleRow = "scale"), # "'scaleRow' needs to be of class 'function'") # expect_error(plotHeatmap(counts = celdaGSim$counts, # y = modelG@clusters$y, # trim = 3), # paste0("'trim' should be a 2 element vector specifying the lower", # " and upper boundaries")) # }) # # # test_that(desc = "Testing plotHeatmap with celda_G, including annotations", # # { # # annot <- as.data.frame(c(rep(x = 1, # # times = nrow(celdaGSim$counts) - 100), rep(x = 2, 100))) # # rownames(annot) <- rownames(celdaGSim$counts) # # colnames(annot) <- "label" # # # # expect_equal(names(plotHeatmap(celdaMod = modelG, # # counts = celdaGSim$counts, # # annotationFeature = annot, # # y = modelG@clusters$y)), # # c("treeRow", "treeCol", "gtable")) # # # # rownames(annot) <- NULL # # expect_equal(names(plotHeatmap(celdaMod = modelG, # # counts = celdaGSim$counts, # # annotationFeature = as.matrix(annot), # # y = modelG@clusters$y)), # # c("treeRow", "treeCol", "gtable")) # # # # rownames(annot) <- rev(rownames(celdaGSim$counts)) # # expect_error(plotHeatmap(celdaMod = modelG, # # counts = celdaGSim$counts, # # annotationFeature = annot, # # y = modelG@clusters$y), # # paste0("Row names of 'annotationFeature' are different than", # # " the row names of 'counts'")) # # }) # # # celdaHeatmap # test_that(desc = "Testing celdaHeatmap with celda_G", { # expect_equal(names(celdaHeatmap(celdaMod = modelG, # counts = celdaGSim$counts)), # c("treeRow", "treeCol", "gtable")) # }) # # # moduleHeatmap # test_that(desc = "Testing moduleHeatmap with celda_G", { # expect_equal(names(moduleHeatmap(celdaGSim$counts, # celdaMod = modelG, # topCells = 300, # featureModule = c(1, 2))), # c("treeRow", "treeCol", "gtable")) # expect_equal(names(moduleHeatmap(celdaGSim$counts, # celdaMod = modelG, # topFeatures = 15, # topCells = 15, # normalizedCounts = NA)), # c("treeRow", "treeCol", "gtable")) # # expect_error(moduleHeatmap("counts", celdaMod = modelG), # "'counts' should be a numeric count matrix") # expect_error(moduleHeatmap(celdaGSim$counts, celdaMod = "modelG"), # "'celdaMod' should be an object of class celda_G or celda_CG") # }) # # # plotDimReduceModule # test_that(desc = "Testing plotDimReduceModule with celda_G", { # celdaTsne <- celdaTsne(counts = celdaGSim$counts, # maxIter = 50, # celdaMod = modelG) # expect_equal(names(plotDimReduceModule( # dim1 = celdaTsne[, 1], # dim2 = celdaTsne[, 2], # counts = celdaGSim$counts, # celdaMod = modelG)), # c("data", # "layers", # "scales", # "mapping", # "theme", # "coordinates", # "facet", # "plot_env", # "labels")) # expect_equal(names(plotDimReduceModule(dim1 = celdaTsne[, 1], # dim2 = celdaTsne[, 2], # counts = celdaGSim$counts, # celdaMod = modelG, # modules = c(1, 2), # rescale = FALSE)), # c("data", # "layers", # "scales", # "mapping", # "theme", # "coordinates", # "facet", # "plot_env", # "labels")) # expect_error(plotDimReduceModule(dim1 = celdaTsne[, 1], # dim2 = celdaTsne[, 2], # counts = celdaGSim$counts, # celdaMod = modelG, # modules = c(11, 12))) # }) # # # celdaTsne # test_that(desc = paste0("Testing celdaTsne with celda_G when model class is", # " changed, should error"), { # modelX <- modelG # class(modelX) <- "celda_X" # expect_error(celdaTsne(counts = celdaGSim$counts, celdaMod = modelX), # "unable to find") # }) # # # test_that(desc = "Testing celdaTsne with celda_C including all cells", { # # tsne <- celdaTsne(counts = celdaGSim$counts, # # celdaMod = modelG, # # maxCells = ncol(celdaGSim$counts)) # # plotObj <- plotDimReduceCluster(tsne[, 1], # # tsne[, 2], rep(1, ncol(celdaGSim$counts))) # # expect_true(ncol(tsne) == 2 & nrow(tsne) == ncol(celdaGSim$counts)) # # expect_true(!is.null(plotObj)) # # # # tsne <- celdaTsne(counts = celdaGSim$counts, # # celdaMod = modelG, # # maxCells = ncol(celdaGSim$counts), # # modules = c(1, 2)) # # expect_error(tsne <- celdaTsne(counts = celdaGSim$counts, # # celdaMod = modelG, # # maxCells = ncol(celdaGSim$counts), # # modules = seq(1000, 1005))) # # }) # # test_that(desc = paste0("Testing celdaTsne with celda_G including", # " a subset of cells"), { # tsne <- celdaTsne(counts = celdaGSim$counts, # celdaMod = modelG, # maxCells = 100) # plotObj <- plotDimReduceCluster(tsne[, 1], # tsne[, 2], rep(1, ncol(celdaGSim$counts))) # expect_true(ncol(tsne) == 2 & nrow(tsne) == ncol(celdaGSim$counts) && # sum(!is.na(tsne[, 1])) == 100) # expect_true(!is.null(plotObj)) # }) # # # celdaUmap # test_that(desc = paste0("Testing celdaUmap with celda_G when model class is", # " changed, should error"), { # modelX <- modelG # class(modelX) <- "celda_X" # expect_error(celdaUmap(counts = celdaGSim$counts, celdaMod = modelX), # "unable to find") # }) # # test_that(desc = "Testing celdaUmap with celda_C including all cells", { # umap <- celdaUmap(counts = celdaGSim$counts, # celdaMod = modelG, # maxCells = ncol(celdaGSim$counts)) # plotObj <- plotDimReduceCluster(umap[, 1], # umap[, 2], rep(1, ncol(celdaGSim$counts))) # expect_true(ncol(umap) == 2 & nrow(umap) == ncol(celdaGSim$counts)) # expect_true(!is.null(plotObj)) # # umap <- celdaUmap(counts = celdaGSim$counts, # celdaMod = modelG, # maxCells = ncol(celdaGSim$counts), # modules = c(1, 2)) # expect_error(umap <- celdaUmap(counts = celdaGSim$counts, # celdaMod = modelG, # maxCells = ncol(celdaGSim$counts), # modules = seq(1000, 1005))) # }) # # test_that(desc = paste0("Testing celdaUmap with celda_G including", # " a subset of cells"), { # umap <- celdaUmap(counts = celdaGSim$counts, # celdaMod = modelG, # maxCells = 100) # plotObj <- plotDimReduceCluster(umap[, 1], # umap[, 2], rep(1, ncol(celdaGSim$counts))) # expect_true(ncol(umap) == 2 & nrow(umap) == ncol(celdaGSim$counts) && # sum(!is.na(umap[, 1])) == 100) # expect_true(!is.null(plotObj)) # }) # # # featureModuleLookup # test_that(desc = "Testing featureModuleLookup with celda_G", { # res <- featureModuleLookup(celdaGSim$counts, modelG, "Gene_1") # expect_true(res == modelG@clusters$y[1]) # res <- featureModuleLookup(celdaGSim$counts, # modelG, "Gene_2", exactMatch = FALSE) # expect_true(length(res) == 11) # res <- featureModuleLookup(celdaGSim$counts, modelG, "XXXXXXX") # expect_true(grepl("No feature", res)) # }) # # # .cGSplitY # test_that(desc = "Testing error checking for .cGSplitY", { # r <- simulateCells("celda_G", # C = 100, # G = 100, # L = 2) # dc <- .cGDecomposeCounts(r$counts, r$y, r$L) # res <- .cGSplitY(r$counts, # r$y, # dc$nTSByC, # dc$nByTS, # dc$nByG, # dc$nGByTS, # dc$nM, # dc$nG, # r$L, # beta = 1, # delta = 1, # gamma = 1, # yProb = NULL, # minFeature = 1000) # expect_true(grepl("Cluster sizes too small", res$message)) # }) # # test_that(desc = "Testing perplexity.celda_G", { # expect_true(is.numeric(perplexity(celdaGSim$counts, modelG))) # class(modelG) <- c("celda_C") # expect_error(perplexity.celda_G(celdaGSim$counts, modelG), # "could not find function \"perplexity.celda_G\"") # }) ================================================ FILE: tests/testthat/test-decon.R ================================================ library(celda) context("Testing DecontX functions") deconSim <- simulateContamination(K = 10, delta = c(1, 5)) modelDecontXoneBatch <- decontX(deconSim$observedCounts, z = deconSim$z, maxIter = 2) deconSim2 <- simulateContamination(K = 10, delta = c(1, 5)) batchDecontX <- decontX(cbind(deconSim$observedCounts, deconSim2$observedCounts), z = c(deconSim$z, deconSim2$z), batch = rep(seq(2), each = ncol(deconSim$observedCounts)), maxIter = 2) test_that(desc = "Testing simulateContamination", { expect_equivalent(object = colSums(deconSim$observedCounts), expected = deconSim$NByC) expect_equal(object = dim(deconSim$phi), expected = dim(deconSim$eta)) expect_equal(typeof(deconSim$observedCounts), "integer") expect_warning(simulateContamination(K = 101, C = 10)) expect_error(simulateContamination(K = 3, G = 2, numMarkers = 10)) }) ## DecontX test_that(desc = "Testing DecontX on counts matrix", { s <- simulateContamination() res <- decontX(s$observedCounts) p <- plotDecontXMarkerPercentage(s$observedCounts, z = res$z, markers = s$markers) p <- plotDecontXMarkerPercentage(res$decontXcounts, z = res$z, markers = s$markers) p <- plotDecontXMarkerExpression(s$observedCounts, s$markers[[1]], z = s$z) p <- plotDecontXContamination(res) # test with background input b <- s$observedCounts[, 1:5] colnames(b) <- paste(colnames(b), "_", sep = "") res <- decontX(s$observedCounts, background = b) }) test_that(desc = "Testing DecontX on SCE", { s <- simulateContamination() sce <- SingleCellExperiment::SingleCellExperiment( list(counts = s$observedCounts)) sce <- decontX(sce) p <- plotDecontXContamination(sce) p <- plotDecontXMarkerPercentage(sce, z = s$z, markers = s$markers, assayName = "decontXcounts") p <- plotDecontXMarkerExpression(sce, s$markers[[1]]) newz <- paste0("X", s$z) sce$newz2 <- newz p <- plotDecontXMarkerPercentage(sce, z = "newz2", markers = s$markers, assayName = "decontXcounts") sce <- decontX(sce, estimateDelta = FALSE) # test with background input bg <- sce[, 1:5] colnames(bg) <- paste(colnames(bg), "_", sep = "") sce <- decontX(sce, background = bg) }) ## .decontXoneBatch test_that(desc = "Testing .decontXoneBatch", { expect_error(decontX(x = deconSim$observedCounts, z = deconSim$z, delta = c(1, -1))) expect_error(decontX(x = deconSim$observedCounts, z = deconSim$z, delta = c(1, 1, 1))) expect_error(decontX(x = deconSim$observedCounts, z = c(deconSim$z, 1)), paste0("'z' must be of the same length as the number of cells in the", " 'counts' matrix.")) expect_error(.decontXoneBatch(counts = deconSim$observedCounts, z = rep(1, ncol( deconSim$observedCounts))), "No need to decontaminate when only one cluster is in the dataset.") countsNA <- deconSim$observedCounts countsNA[1, 1] <- NA expect_error(.decontXoneBatch(counts = countsNA, z = deconSim$z), "Missing value in 'counts' matrix.") }) ================================================ FILE: tests/testthat/test-intialize_cluster.R ================================================ # test_that(desc = "Test 'random' initialization for all models", { # simRes <- simulateCells(model = "celda_CG") # modelCG <- celda_CG(simRes$counts, # simRes$sampleLabel, # K = simRes$K, # L = simRes$L, # zInitialize = "random", # yInitialize = "random", # maxIter = 5, # splitOnLast = FALSE, # splitOnIter = -1) # expect_true(is(modelCG, "celda_CG")) # modelG <- celda_G(simRes$counts, # L = simRes$L, # yInitialize = "random", # maxIter = 5, # splitOnLast = FALSE, # splitOnIter = -1) # expect_true(is(modelG, "celda_G")) # modelC <- celda_C(simRes$counts, # simRes$sampleLabel, # K = simRes$K, # zInitialize = "random", # maxIter = 5, # splitOnLast = FALSE, # splitOnIter = -1) # expect_true(is(modelC, "celda_C")) # }) # # test_that(desc = "Testing .initializeCluster for random initialization", { # ## Completely random # z <- .initializeCluster(10, 100) # expect_true(length(z) == 100 & length(unique(z) == 10)) # expect_error(z <- .initializeCluster(100, 10)) # # ## With all values initialized # initZ <- rep(seq(10), each = 10) # z <- .initializeCluster(10, 100, initial = initZ) # expect_true(all(initZ == z)) # expect_error(z <- .initializeCluster(10, 100, initial = initZ[seq(99)])) # expect_error(z <- .initializeCluster(11, 100, initial = initZ)) # expect_error(z <- .initializeCluster(10, 99, initial = initZ)) # # ## With only a few values initialized # fixedZ <- rep(NA, 100) # fixedZ[seq(10)] <- 1 # z <- .initializeCluster(10, 100, fixed = fixedZ) # expect_true(all(z[seq(10)] == 1) & # length(z) == 100 & length(unique(z)) == 10) # expect_error(z <- .initializeCluster(10, 100, fixed = fixedZ[seq(99)])) # fixedZ[seq(10)] <- 11 # expect_error(z <- .initializeCluster(10, 100, fixed = fixedZ)) # }) ================================================ FILE: tests/testthat/test-matrixSums.R ================================================ library(celda) context("Testing error checking in C-level matrix sum functions") ## Test internal error checking mat <- matrix(seq(5), ncol = 10, nrow = 10) label1 <- rep(seq(2), each = 5) label2 <- as.factor(seq(100)) label3 <- as.factor(label1) label4 <- label3 label4[seq(2)] <- 2 label5 <- as.factor(rep(seq(5), each = 2)) test_that(desc = "Testing .rowSumByGroup", { expect_error(.Call("_rowSumByGroup", mat, label1)) expect_error(.Call("_rowSumByGroup", mat, label2)) res <- .Call("_rowSumByGroup", mat, label3) expect_true(all(res == rowsum(mat, label3))) res <- .rowSumByGroup(mat, label3, 2) expect_true(all(res == rowsum(mat, label3))) }) test_that(desc = "Testing .rowSumByGroupChange", { res <- rowsum(mat, label3) expect_error(.Call("_rowSumByGroupChange", mat, res, label4, label1)) expect_error(.Call("_rowSumByGroupChange", mat, res, label4, label2)) expect_error(.Call("_rowSumByGroupChange", mat[-1, ], res, label4, label3)) expect_error(.Call("_rowSumByGroupChange", mat[, -1], res, label4, label3)) expect_error(.Call("_rowSumByGroupChange", mat, res, label4, label5)) res2 <- .Call("_rowSumByGroupChange", mat, res, label4, label3) expect_true(all(res2 == rowsum(mat, label4))) res <- rowsum(mat, label3) res2 <- .rowSumByGroupChange(mat, res, label4, label3, 2) expect_true(all(res2 == rowsum(mat, label4))) }) test_that(desc = "Testing .colSumByGroup", { expect_error(.Call("_colSumByGroup", mat, label1)) expect_error(.Call("_colSumByGroup", mat, label2)) res <- .Call("_colSumByGroup", mat, label3) expect_true(all(res == t(rowsum(t(mat), label3)))) res <- .colSumByGroup(mat, label3, 2) expect_true(all(res == t(rowsum(t(mat), label3)))) }) test_that(desc = "Testing .colSumByGroupChange", { res <- t(rowsum(t(mat), label3)) expect_error(.Call("_colSumByGroupChange", mat, res, label4, label1)) expect_error(.Call("_colSumByGroupChange", mat, res, label4, label2)) expect_error(.Call("_colSumByGroupChange", mat[, -1], res, label4, label3)) expect_error(.Call("_colSumByGroupChange", mat, res, label4, label5)) expect_error(.Call("_colSumByGroupChange", mat[-1, ], res, label4, label3)) res2 <- .Call("_colSumByGroupChange", mat, res, label4, label3) expect_true(all(res2 == t(rowsum(t(mat), label4)))) res <- t(rowsum(t(mat), label3)) res2 <- .colSumByGroupChange(mat, res, label4, label3, 2) expect_true(all(res2 == t(rowsum(t(mat), label4)))) }) storage.mode(mat) <- "numeric" test_that(desc = "Testing .rowSumByGroupNumeric", { expect_error(.Call("_rowSumByGroup_numeric", mat, label1)) expect_error(.Call("_rowSumByGroup_numeric", mat, label2)) res <- .Call("_rowSumByGroup_numeric", mat, label3) expect_true(all(res == rowsum(mat, label3))) res <- .rowSumByGroupNumeric(mat, label3, 2) expect_true(all(res == rowsum(mat, label3))) }) test_that(desc = "Testing .colSumByGroupNumeric", { expect_error(.Call("_colSumByGroup_numeric", mat, label1)) expect_error(.Call("_colSumByGroup_numeric", mat, label2)) res <- .Call("_colSumByGroup_numeric", mat, label3) expect_true(all(res == t(rowsum(t(mat), label3)))) res <- .colSumByGroupNumeric(mat, label3, 2) expect_true(all(res == t(rowsum(t(mat), label3)))) }) # test_that(desc = "Testing fastNormProp Rcpp funtion",{ # res <- fastNormProp(mat, 0) # res2 <- prop.table(mat, 2) # expect_true(all(res == res2)) # }) # # test_that(desc = "Testing fastNormPropLog Rcpp funtion",{ # res <- fastNormPropLog(mat, 0) # res2 <- log(prop.table(mat, 2)) # expect_true(all(res == res2)) # }) # # test_that(desc = "Testing fastNormPropSqrt Rcpp funtion",{ # res <- fastNormPropSqrt(mat, 0) # res2 <- sqrt(prop.table(mat, 2)) # expect_true(all(res == res2)) # }) ================================================ FILE: tests/testthat/test-with_seed.R ================================================ # with_seed library(celda) context("Testing seed setting behavior") set.seed(123) celdaCGSim1 <- simulateCells("celda_CG", K = 5, L = 10, seed = 1234) celdaCGSim2 <- simulateCells("celda_CG", K = 5, L = 10, seed = NULL) celdaCGSim3 <- simulateCells("celda_CG", K = 5, L = 10, seed = 123) celdaCGSim4 <- simulateCells("celda_CG", K = 5, L = 10, seed = 1234) celdaCGSim5 <- simulateCells("celda_CG", K = 5, L = 10, seed = 1234) celdaCGSim6 <- simulateCells("celda_CG", K = 5, L = 10, seed = 12345) celdaCGSim7 <- simulateCells("celda_CG", K = 5, L = 10, seed = NULL) set.seed(123) celdaCGSim8 <- simulateCells("celda_CG", K = 5, L = 10, seed = NULL) test_that(desc = "Testing seed setting behavior in count matrix simulation", { expect_equal(celdaCGSim1, celdaCGSim4) expect_equal(celdaCGSim1, celdaCGSim5) #expect_equal(celdaCGSim2, celdaCGSim3) expect_equal(celdaCGSim2, celdaCGSim8) # expect_false(isTRUE(all.equal(celdaCGSim1, celdaCGSim2))) # expect_false(isTRUE(all.equal(celdaCGSim1, celdaCGSim3))) # expect_false(isTRUE(all.equal(celdaCGSim1, celdaCGSim6))) # expect_false(isTRUE(all.equal(celdaCGSim2, celdaCGSim7))) }) ================================================ FILE: tests/testthat.R ================================================ Sys.unsetenv("R_TESTS") library(testthat) test_check("celda") ================================================ FILE: vignettes/articles/celda_pbmc3k.Rmd ================================================ --- title: "Celda - Analysis of PBMC3K" date: "Compiled `r format(Sys.time(), '%B %d, %Y')`" author: "Joshua Campbell, Zhe Wang" --- # Introduction Celda is a Bayesian hierarchical model that can perform bi-clustering of features into modules and observations into subpopulations. In this tutorial, we will apply Celda to a real-world single-cell RNA sequencing (scRNA-seq) dataset of 2,700 Peripheral Blood Mononuclear Cells (PBMCs) collected from a healthy donor. This dataset (PBMC3K) is available from 10X Genomics and can be found on the [10X website](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k){target="_blank"}. The `celda` package uses the [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html){target="_blank"} (SCE) object for management of expression matrices, feature/cell annotation data, and metadata. All of the functions have an SCE object as the first input parameter. The functions operate on a matrix stored in the `assay` slot of the SCE object. The parameter `useAssay` can be used to specify which matrix to use (the default is `"counts"`). Matrices can be of class `matrix` or `dgCMatrix` from the [Matrix](https://cran.r-project.org/web/packages/Matrix/index.html) package. While the primary clustering is performed with functions from the celda package, the [singleCellTK](https://bioconductor.org/packages/release/bioc/html/singleCellTK.html) package is used for some other tasks such as importing data, quality control, and marker identification with differential expression. # Importing data The PBMC3K data can be easily loaded via the Bioconductor package [TENxPBMCData](https://bioconductor.org/packages/release/data/experiment/html/TENxPBMCData.html){target="_blank"}. `TENxPBMCData` is an experiment package that provides resources for various PBMC datasets generated by 10X Genomics. When using this package, the column names of returned SCE object are `NULL` by default. For this example, we paste together the name of the sample with the cell barcode to generate column names for the SCE object. Additionally, the count matrix within `sce` object is converted from a `DelayedMatrix` object to a sparse matrix `dgCMatrix` object. ```{r get_pbmc, message = FALSE} library(TENxPBMCData) sce <- TENxPBMCData("pbmc3k") colnames(sce) <- paste0("pbmc3k_", colData(sce)$Sequence) counts(sce) <- as(counts(sce), "dgCMatrix") ``` If you have the [singleCellTK](https://bioconductor.org/packages/release/bioc/html/singleCellTK.html){target="_blank"} package installed, then this dataset can be imported and converted with a single command: ```{r get_pbmc_sctk, eval=FALSE} library(singleCellTK) sce <- importExampleData("pbmc3k") ``` To get your own data into a `SingleCellExperiment` object, the `singleCellTK` package has several importing functions for different preprocessing tools including CellRanger, STARsolo, BUStools, Optimus, DropEST, SEQC, and Alevin/Salmon. For example, the following code can be used as a template to read in multiple samples processed with CellRanger: ```{r cellranger_import, eval=FALSE} library(singleCellTK) sce <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/")) ``` **Note:** As a reminder, you can view the assays, column annotation, and row annotation stored in the SCE with the commands `assays(sce)`, `colData(sce)`, and `rowData(sce)`, respectively. Finally, we set the `rownames` of the SCE to the gene symbol: ```{r set_symbol} rownames(sce) <- rowData(sce)$Symbol_TENx ``` # Quality Control Quality control and filtering of cells is often needed before down-stream analyses such as dimensionality reduction and clustering. Typical filtering procedures include exclusion of poor quality cells with low numbers of counts/UMIs, estimation and removal of ambient RNA, and identification of potential doublet/multiplets. Many tools and packages are available to perform these operations and users are free to apply their tool(s) of choice as the celda clustering functions will work with any matrix stored in an SCE object. The celda package does contain a Bayesian method called [decontX](https://doi.org/10.1186/s13059-020-1950-6){target="_blank"} to estimate and remove transcript contamination in individual cells in a scRNA-seq dataset. To perform QC, we suggest using the `runCellQC` function in `singleCellTK` package. This is a wrapper for several methods for calculation of QC metrics, doublet detection, and estimation of ambient RNA (including decontX). Below is a quick example of how to perform standard QC before applying celda. If you have another preferred approach or your data has already been QC'ed, you can move to [Feature selection](#featureselection) section. For this tutorial, we will only run one doublet detection algorithm and one decontamination algorithms. For a full list of algorithms that this function runs by default, see `?runCellQC`. We will also quantify the percentage of mitochondrial genes in each cell as this is often used as a measure of cell viability. ```{r sctk_qc, message=FALSE, warning=FALSE, error=FALSE} library(singleCellTK) # Get list of mitochondrial genes mito.genes <- grep("^MT-", rownames(sce), value = TRUE) # Run QC sce <- runCellQC(sce, sample = NULL, algorithms = c("QCMetrics", "scDblFinder", "decontX"), geneSetList = list(mito=mito.genes), geneSetListLocation = "rownames") ``` **Note:** If you have cells from multiple samples stored in the SCE object, make sure to supply the `sample` parameter as the QC tools need to be applied to cells from each sample individually. Individual sets of QC metrics can be plotted with specific functions. For example to plot distributions of total numbers of UMIs derived from `runPerCellQC`, doublet scores from `runScDblFinder`, and contamination scores from `runDecontX` (all of which were run by the `runCellQC` function), the following plotting functions can be used: ```{r sctk_qc_plot, fig.height = 8} plotRunPerCellQCResults(sce) plotScDblFinderResults(sce, reducedDimName = "decontX_UMAP") plotDecontXResults(sce, reducedDimName = "decontX_UMAP") ``` An comprehensive HTML report can be generated to visualize and explore the QC metrics in greater detail: ```{r sctk_qc_report, eval = FALSE} reportCellQC(sce) ``` After examining the distributions of various QC metrics, poor quality cells will need to be removed. Typically, thresholds for QC metrics should exclude cells that are outliers of the distribution (i.e. long tails in the violin or density plots). Cells can be removed using the `subsetSCECols` function. Metrics stored in the `colData` of the SCE object can be filtered using the `colData` parameter. Here we will limit to cells with at least 600 counts and 300 genes detected: ```{r sctk_qc_subset} # Filter SCE sce <- subsetSCECols(sce, colData = c("total > 600", "detected > 300")) # See number of cells after filtering ncol(sce) ``` Other common metrics to filter on include `subsets_mito_percent` for removal of cells with high mitochondrial percentage, `decontX_contamination` for removal of cells with higher levels of contamination from ambient RNA, `scDblFinder_class` to remove doublets (or calls from any of the other doublet detection algorithms). See the `singleCellTK` [documentation](sctk.camplab.net) For more information on performing comprehensive QC and filtering. # Feature selection {#featureselection} In general, removing features with low numbers of counts across all cells is recommended to reduce computational run time. A simple selection can be performed by removing features with a minimum number of counts in a minimum number of cells using the `selectFeatures` function: ```{r select_features, message = FALSE} # Select features with at least 3 counts in at least 3 cells library(celda) useAssay <- "counts" altExpName <- "featureSubset" sce <- selectFeatures(sce, minCount = 3, minCell = 3, useAssay = useAssay, altExpName = altExpName) # See number of features after filtering nrow(altExp(sce, altExpName)) ``` The `useAssay` parameter is used to denote which assay/matrix within the SCE to use for filtering. The default raw counts matrix is traditionally stored in the `"counts"` assay. If `decontX` was previously run during QC, then the decontaminated counts can be used by setting this parameter to `"decontXcounts"`. We will save this parameter in a variable called `useAssay` which will be used as input in several downstream functions. **Note:** The subsetted matrix is stored in the "alternative experiment" slot (`altExp`) within the SCE. This allows for a matrix with a different number of rows to be stored within the same SCE object (rather than creating two SCE objects). The celda functions described in the next several sections operate on a matrix stored in the `altExp` slot. The default name given to the alternative experiment and used in all downstream celda functions is `"featureSubset"`. If the `altExpName` parameter is changed here, then it will need to be supplied to downstream plotting functions as well. The list of alternative experiments in an SCE can be view with `altExpNames(sce)`. If you have already have an SCE with selected features or do not want to perform feature selection, then you need to set the alternative experiment directly with a command like `altExp(sce, "featureSubset") <- assay(sce, "counts")`. In the future, this will be updated to be more simple by utilizing the `ExperimentSubset` package. If the number of features is still relatively large (e.g. >5000), an alternative approach is to select highly variable features that can be used in the downstream clustering. The advantage of this approach is that it can greatly speed up celda and can improve with module detection among highly variable features with overall lower expression. The disadvantage of this approach is that features that do not fall into the highly variable group will not be clustered into modules. The celda package does not include methods for selection of highly variable genes (HVGs). However, the `singleCellTK` provides wrappers for methods used in [Seurat](https://doi.org/10.1016/j.cell.2019.05.031){target="_blank"} and [Scran](https://bioconductor.org/packages/release/bioc/html/scran.html). We recommend keeping at least 2,000-5,000 HVGs for clustering. Here is some example code of how to select the top 5,000 most variable genes and store it back in the SCE as an `altExp`: ```{r feature_selection, eval = FALSE, message = FALSE} library(singleCellTK) sce <- seuratFindHVG(sce, useAssay = useAssay, hvgMethod = "vst") g <- getTopHVG(sce, method = "vst", n = 5000) altExp(sce, altExpName) <- sce[g, ] ``` For the rest of the analysis with the PBMC3K data, we will use the first approach where features with at least 3 counts in 3 cells were included. # Analysis with Celda ## Bi-clustering with known numbers of clusters As mentioned earlier, celda is discrete Bayesian model that is able to simultaneously bi-cluster features into modules and cells into cell clusters. The primary bi-clustering model can be accessed with the function `celda_CG`. This function operates on a matrix stored as an alternative experiment in the `altExp` slot. If you did not perform feature selection as recommended in the previous section and your matrix of interest is not currently located in an `altExp` slot, the following code can be used to copy a matrix in the main assay slot to the `altExp` slot: ```{r, altExp_create, eval = FALSE} useAssay <- "counts" altExpName <- "featureSubset" altExp(sce, altExpName) <- assay(sce, useAssay)`. ``` The two major adjustable parameters in this model are `L`, the number of modules, and `K`, the number of cell populations. The following code bi-clusters the PBMC3K dataset into 100 modules and 15 cell populations: ```{r celda_cg_example, eval = FALSE} sce <- celda_CG(sce, L = 100, K = 15, useAssay = useAssay, altExpName = altExpName) ``` However, in most cases, the number of feature modules (`L`) and the number of cell clusters (`K`) are not known beforehand. In the next sections, we outline procedures that can be used suggest reasonable choices for these parameters. If the data is clustered with the code above by supplying K and L directly to the `celda_CG` function, then you can skip the next section and proceed to [Creating 2-D embeddings](#embed). ## Finding the number of modules In order to help choose a reasonable solutions for L and K, celda provides step-wise splitting procedures along with measurements of perplexity to suggest reasonable choices for `L` and `K`. First, the function `recursiveSplitModule` can be used to cluster features into modules for a range of `L`. Within each step, the best split of an existing module into 2 new modules is chosen to create the L-th module. The module labels of the previous model with $L-1$ modules are used as the initial starting values in the next model with $L$ modules. Note that the initialization step may take longer with larger numbers of cells in the dataset and the splitting procedure will take longer with larger numbers features in the dataset. Celda models with a L range between `initialL = 10` and `maxL = 150` are tested in the example below. ```{r module_split, message = FALSE} moduleSplit <- recursiveSplitModule(sce, useAssay = useAssay, altExpName = altExpName, initialL = 10, maxL = 150) ``` Perplexity has been commonly used in the topic models to measure how well a probabilistic model predicts observed samples ([Blei et al., 2003](https://dl.acm.org/doi/10.5555/944919.944937){target="_blank"}). Here, we use perplexity to evaluate the performance of individual models by calculating the probability of observing expression counts given an estimated Celda model. Rather than performing cross-validation which is computationally expensive, a series of test sets are created by sampling the counts from each cell according to a multinomial distribution defined by dividing the counts for each gene in the cell by the total number of counts for that cell. Perplexity is then calculated on each test set and can be visualized using function `plotGridSearchPerplexity`. A lower perplexity indicates a better model fit. ```{r module_split_perplexity, message = FALSE, warning = FALSE} plotGridSearchPerplexity(moduleSplit, altExpName = altExpName, sep = 10) ``` The perplexity alone often does not show a clear elbow or "leveling off". However, the rate of perplexity change (RPC) can be more informative to determine when adding new modules does not add much additional information [Zhao et al., 2015](https://doi.org/10.1186/1471-2105-16-S13-S8){target="_blank"}). An RPC closer to zero indicates that the addition of new modules or cell clusters is not substantially decreasing the perplexity. The RPC of models can be visualized using function `plotRPC`: ```{r module_split_rpc, message = FALSE, warning = FALSE} plotRPC(moduleSplit, altExpName = altExpName) ``` In this case, we will choose an `L` of 80 as the RPC curve tends to level off at this point: ```{r setL} L <- 80 ``` | **Note:** Perplexity and RPC are meant to be guides to give a sense of a possible starting point for `L`. However, they may not always give a clear "leveling off" depending of the complexity and quality of the dataset. **Do not give up if the choice of L is unclear or imperfect!** If the `L` to choose is unclear from these, then you can set a somewhat high number (e.g. 75) and move to the next step of selecting `K`. Later on, manual review of modules using functions such as `moduleHeatmap` can give a sense of whether individual modules should be further split up by selecting higher `L`. For example, you can start exploring the cell populations and modules with `L = 75`. If some modules need to be further split, you can then try `L = 100`, `L = 125`, and so on. ## Finding the number of cell subpopulations Now we extract the Celda model of `L = `$L$ with function `subsetCeldaList` and run `recursiveSplitCell` to fit models with a range of `K` between 3 and 25: ```{r cell_split, message = FALSE} temp <- subsetCeldaList(moduleSplit, list(L = L)) sce <- recursiveSplitCell(sce, useAssay = useAssay, altExpName = altExpName, initialK = 3, maxK = 25, yInit = celdaModules(temp)) ``` The perplexities and RPC of models can be visualized using the same functions `plotGridSearchPerplexity` and `plotRPC`. ```{r cell_split_perplexity, warning = FALSE} plotGridSearchPerplexity(sce) plotRPC(sce, , altExpName = altExpName) ``` The perplexity continues to decrease with larger values of `K`. The RPC generally levels off between 13 and 16 and we choose the model with `K = 14` for downstream analysis. The follow code selects the final `celda_CG` model with `L = 80` and `K = 14`: ```{r setK} K <- 14 sce <- subsetCeldaList(sce, list(L = L, K = K)) ``` **Note:** Similar to choosing `L`, you can guess an initial value of `K` based off of the perplexity and RPC plots and then move to the downstream exploratory analyses described in the next several sections. After reviewing the cell clusters on 2-D embeddings and module heatmaps, you may have to come back to tweak the choice of `K` until you have something that captures the cellular heterogeneity within the data without "over-clustering" cells into too many subpopulations. This may be an iterative procedure of going back-and-forth between choices of `K` and plotting the results. So do not let imperfect perplexity/PRC plots prevent you from moving on to the rest of the analysis. Often times, using an initial guess for `K` will allow you to move on in the analysis to get a sense of the major sources of biological heterogeneity present in the data. # Exploring cell populations ## Creating 2-D embeddings {#embed} After selecting a celda model with specific values of `L` and `K`, we can then perform additional exploratory and downstream analyses to understand the biology of the transcriptional modules and cell populations. We can start by generating a dimension reduction plot with the Uniform Manifold Approximation and Projection (UMAP) method to visualize the relationships between the cells in a 2-D embedding. This can be done with function `celdaUmap`. ```{r celda_umap, message = FALSE} sce <- celdaUmap(sce, useAssay = useAssay, altExpName = altExpName) ``` Alternatively, a t-distributed stochastic neighbor embedding (t-SNE) can be generated using function `celdaTsne`. The UMAP and t-SNE plots generated by `celdaUmap` and `celdaTsne` are computed based on the module probabilities (analogous to using PCs from PCA). The calculated dimension reduction coordinates for the cells are stored under the `reducedDim` slot of the `altExp` slot in the original SCE object. The follow command lists the names of the dimensionality reductions that can be used in downstream plotting functions in the next few sections: ```{r reduced_dim_list} reducedDimNames(altExp(sce, altExpName)) ``` ## Plotting cell population cluster labels The function `plotDimReduceCluster` can be used to plot the cluster labels for cell populations identified by celda on the UMAP: ```{r cell_clusters} plotDimReduceCluster(sce, reducedDimName = "celda_UMAP", labelClusters = TRUE) ``` ## Plotting expression of specific features Usually, biological features of some cell populations are known *a priori* and can be identified with known marker genes. The expression of selected marker genes can be plotted on the UMAP with the function `plotDimReduceFeature`. ```{r cell_marker_umap} markers <- c("CD3D", "IL7R", "CD4", "CD8B", "CD19", "FCGR3A", "CD14", "FCER1A", "PF4") plotDimReduceFeature(x = sce, features = markers, reducedDimName = "celda_UMAP", useAssay = useAssay, altExpName = altExpName, normalize = TRUE) ``` The parameter `displayName` can be used to switch between IDs stored in the `rownames` of the SCE and columns of the `rowData` of the SCE. If the assay denoted by `useAssay` is a raw counts matrix, then setting `normalize = TRUE` is recommended (otherwise the z-score of the raw counts will be plotted). When set to `TRUE`, each count will be normalized by dividing by the total number of counts in each cell. An alternative approach is to perform normalization with another method and then point to the normalized assay with the `useAssay` parameter. For example, normalization can be performed with the scater package: ```{r scran_norm, message = FALSE, eval = FALSE} library(scater) sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts") plotDimReduceFeature(x = sce, features = markers, reducedDimName = "celda_UMAP", useAssay = "logcounts", altExpName = altExpName, normalize = FALSE) ``` This second approach may be faster if plotting a lot of marker genes or if the dataset is relatively large. ## Plotting cell subpopulations with labels Once we identify of various cell subpopulations using the known marker genes, these custom labels can be added on the UMAP colored by cluster: ```{r cell_identities, message = FALSE, warning = FALSE} g <- plotDimReduceCluster(sce, reducedDimName = "celda_UMAP", altExpName = altExpName, labelClusters = TRUE) labels <- c("1: Megakaryocytes", "2: CD14+ Monocytes 1", "3: CD14+ Monocytes 2", "4: FCGR3A (CD16+) Monocytes", "5: CD14+ Monocytes 3", "6: CD8+ Cytotoxic T-cells", "7: CD4+ T-cells", "8: CD8+ Cytotoxic T-cells", "9: B-cells", "10: Naive CD8+ T-cells", "11: Naive CD4+ T-cells", "12: NK-cells", "13: Unknown T-cells", "14: Dendritic cells") library(ggplot2) g <- g + scale_color_manual(labels = labels, values = distinctColors(length(labels))) print(g) ``` # Exploring relationship between modules and cell populations {#probmap} Celda has the ability to identify modules of co-expressed features and quantify the probability of these modules in each cell population. An overview of the relationships between modules and cell subpopulations can be explored with the function `celdaProbabilityMap`. The "Absolute probability" heatmap on the left shows the proportion of counts in each module for each cell population. The "Absolute probability" map gives insights into the absolute abundance of a module within a given cell subpopulation. The absolute heatmap can be used to explore which modules are higher than other modules **within a cell population**. The "Relative expression" map shows the standardized (z-scored) module probabilities across cell subpopulations. The relative heatmap can be used to explore which modules are relatively higher than other modules **across cell populations**. ```{r celdaProbabilityMap, fig.height = 16, fig.width = 10} celdaProbabilityMap(sce, useAssay = useAssay, altExpName = altExpName) ``` In this plot, we can see a variety of patterns. Modules 15 - 20 are highly expressed across most cell populations indicating that they may contain housekeeping genes (e.g. ribosomal). Other modules are specific to a cell population or groups of cell populations. For example, module 35 is only on in population 1 while module 70 is expressed across populations 2, 3, and to some degree in population 5. The unknown T-cell population 13 has highly specific levels of modules 30. In the next section, we can look at the genes in these modules to gain insights into the biological properties of each of these cell populations. # Exploring feature modules The primary advantage of celda over other tools is that it can cluster features that are co-expressed across cells into modules. These modules are often more biologically coherent than features correlated with principal components from PCA. Below are several ways in which modules can be explored and visualized. ## Table of features in each module The function `featureModuleTable` can be used to get the names of all features in each module into a `data.frame`. ```{r module_table} # Save to a data.frame ta <- featureModuleTable(sce, useAssay = useAssay, altExpName = altExpName) dim(ta) head(ta[,"L70"]) ``` The parameter `displayName` can be used to switch between IDs stored in the `rownames` of the SCE and columns of the `rowData` of the SCE. The the `outputFile` parameter is set, the table will be saved to a tab-delimited text file instead of to a `data.frame`: ```{r module_table_file, eval = FALSE} # Save to file called "modules.txt" featureModuleTable(sce, useAssay = useAssay, altExpName = altExpName, outputFile = "modules.txt") ``` The modules for this model are shown below: ```{r module_table_display, echo = FALSE} library(knitr) library(kableExtra) table <- featureModuleTable(sce,useAssay = "counts",altExpName = "featureSubset") kb <- kable(table, style = 'html', row.names = FALSE) kb <- kable_styling(kb, bootstrap_options = c("striped", "condensed")) kb <- scroll_box(kb, width = "100%", height = "550px") kb ``` ## Module lookup If you want to quickly find which module a particular feature was assigned to, the `featureModuleLookup` function can be used. Here will will look up a marker gene for T-cells called "CD3E": ```{r feature_lookup} mod <- featureModuleLookup(sce, feature = c("CD3E", "S100A8")) mod ``` ## Module heatmaps The function `moduleHeatmap` can be used to view the expression of features across cells for a specific module. The `featureModule` parameter denotes the module(s) to be displayed. Cells are ordered from those with the lowest probability of the module on the left to the highest probability on the right. Similarly, features are ordered from those with the highest probability within the module on the top to the lowest probability on the bottom. ```{r module_heatmap} moduleHeatmap(sce, featureModule = 27, useAssay = useAssay, altExpName = altExpName) ``` The parameter `topCells` can be used to control the number of cells included in the heatmap. By default, only the 100 cells with the lowest probabilities and the 100 cells with the highest probabilities for each selected module are included (i.e. `topCells = 100` by default). To display all cells, this parameter can be set to `NULL`: ```{r module_heatmap_topcells} moduleHeatmap(sce, featureModule = 27, topCells = NULL, useAssay = useAssay, altExpName = altExpName) ``` **Note:** Multiple modules can be displayed by giving a vector of module indices to the parameter `featureModule`. If `featureModule` is not specified, then all modules will be plotted. ## Module probabilities on 2-D embeddings The function `plotDimReduceModule` can be used visualize the probabilities of a particular module or sets of modules on a reduced dimensional plot such as a UMAP. This can be another quick method to see how modules are expressed across various cells in 2-D space. As an example, we can look at module `r as.numeric(as.character(mod["S100A8"]))` which contained S100A8: ```{r module_umaps} plotDimReduceModule(sce, modules = 70, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP") ``` Similarly, multiple modules can be plotting in a grid of UMAPs: ```{r module_umaps_grid} plotDimReduceModule(sce, modules = 70:78, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP") ``` In this grid, we can see that module 70 (which has high levels of S100A8 and S100A9) is highly expressed in cell populations 2 and 3, module 71 (which contains CD14) can be used to identify all CD14+ monocytes, module 72 (which contains CST3) is expressed across both CD14 and FCGR3A (CD16) expressing monocytes, and module 73 (which contains CD4) is expressed broadly across both monocytes and dendritic cells as well as some T-cell populations. If we were interesting in defining transcriptional programs active across all monocytes, we could examine the genes found in module 72. If we were interested in defining transcriptional programs for all CD14+ monocytes, we could examine the genes in module 71. These patterns can also be observed in the [Probability Map](#probmap) In the celda probability map, we saw that the unknown T-cell population 13 had high levels of module 30. We can examine both module heatmaps and module probability maps to further explore this: ```{r module_Tcell_prolif} moduleHeatmap(sce, featureModule = 30, useAssay = useAssay, altExpName = altExpName) plotDimReduceModule(sce, modules = 30, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP") ``` Module 30 has high levels of genes associated with proliferation including HMGA1, STMN1, PCNA, HMGB2, and TUBA1B. We can therefore re-label these cells as "Proliferating T-cells". # Identification and plotting of marker genes In addition to examining modules, differential expression can be used to identify potential marker genes up-regulated in specific cell populations. The function `findMarkerDiffExp` in the `singleCellTK` package will find markers up-regulated in each cell population compared to all the others. ## Differential expression to identify marker genes ```{r diffExp, message = FALSE} # Normalize counts (if not performed previously) library(scater) sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts") # Run differential expression analysis sce <- findMarkerDiffExp(sce, useAssay = "logcounts", method = "wilcox", cluster = celdaClusters(sce), minMeanExpr = 0, fdrThreshold = 0.05, log2fcThreshold = 0, minClustExprPerc = 0, maxCtrlExprPerc = 1) ``` The function `plotMarkerDiffExp` can be used to plot the results in a heatmap. The `topN` parameter will plot the top N ranked genes for each cluster. ```{r diffExp_plot, message = FALSE, fig.height = 10} # Plot differentially expressed genes that pass additional thresholds 'minClustExprPerc' and 'maxCtrlExprPerc' plotMarkerDiffExp(sce, topN = 5, log2fcThreshold = 0, rowLabel = TRUE, fdrThreshold = 0.05, minClustExprPerc = 0.6, maxCtrlExprPerc = 0.4, minMeanExpr = 0) ``` Other parameters such as `minClustExprPerc` (the minimum number of cells expressing the marker gene in the cluster) and `maxCtrlExprPerc` (the maximum number of cells expression the marker gene in other clusters) can be used to control how specific each marker gene is to each cell populations. Similarly, adding a log2 fold-change cutoff (e.g. 1) can select for markers that are more strongly up-regulated in a cell population. ## Violin plots for marker genes The `plotCeldaViolin` function can be used to examine the distribution of expression of various features across cell population clusters derived from celda. Here we can see that the gene CD79A has high expression in the B-cell cluster and HMGB2 has high expression in the proliferating T-cell population. ```{r violin} # Normalize counts if not performed in previous steps library(scater) sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts") # Make violin plots for marker genes plotCeldaViolin(sce, useAssay = "logcounts", features = c("CD79A", "HMGB2")) ``` # Generating HTML reports The celda package comes with two functions for generating comprehensive HTML reports that 1) capture the process of selecting K/L for a `celda_CG` model and 2) plot the results from the downstream analysis. The first report runs both `recursiveSplitModule` and `recursiveSplitCell` for selection of `L` and `K`, respectively. To recapitulate the complete analysis presented in this tutorial in the HTML report, the following command can be used: ```{r report_model, eval = FALSE} sce <- reportCeldaCGRun(sce, sampleLabel = NULL, useAssay = useAssay, altExpName = altExpName, minCell = 3, minCount = 3, initialL = 10, maxL = 150, initialK = 3, maxK = 25, L = 80, K = 14) ``` All of the parameters in this function are the same that were used throughout this tutorial in the `selectFeatures`, `recursiveSplitModule`, and `recursiveSplitCell` functions. Note that this report does **not** do cell filtering, so that must be completed before running this function. The returned SCE object will have the `celda_CG` model with selected `K` and `L` which can be used in any of the downstream plotting functions as well as input into the second plotting report described next. The second report takes in as input an SCE object with a fitted `celda_CG` model and systematically generates several plots that facilitate exploratory analysis including cell subpopulation cluster labels on 2-D embeddings, user-specified annotations on 2-D embeddings, module heatmaps, module probabilities, expression of marker genes on 2-D embeddings, and the celda probability map. The report can be generated with the following code: ```{r report_results, eval = FALSE} reportCeldaCGPlotResults(sce, reducedDimName = "celda_UMAP", features = markers, useAssay = useAssay, altExpName = altExpName, cellAnnot = c("total", "detected", "decontX_contamination", "subsets_mito_percent"), cellAnnotLabel = "scDblFinder_doublet_call") ``` User-supplied annotations to plot on the 2-D embedding can be specified through the `cellAnnot` and `cellAnnotLabel` variables. Both parameters will allow for plotting of variables stored in the colData of the SCE on the 2-D embedding plot specified by `reducedDimName` parameter. For `cellAnnot`, integer and numeric variables will be plotted as as continuous variables while factors and characters will be plotted as categorical variables. For `cellAnnotLabel`, all variables will be coerced to a factor and the labels of the categories will be plotted on the scatter plot. # Other useful functions ## Matrix factorization The celda model factorizes the original matrix into three matrices: **1) module -** The probability of each feature in each module (Psi) **2) cellPopulation -** The probability of each module in each cell population (Phi) **3) sample -** The probability of each cell population in each sample (Theta) Additionally, we can calculate the probability of each module within each cell (cell). The cell matrix can essentially be used to replace PCs from PCA and is useful for downstream visualization (e.g. generating 2-D embeddings). All of these matrices can be retrieved with the `factorizeMatrix` function. The matrices are returned in three different versions: unnormalized counts, proportions (normalized by the total), or posterior estimates (where the Dirichlet concentration parameter is added in before normalization). ```{r factorize_matrix, message = FALSE} # Factorize the original counts matrix fm <- factorizeMatrix(sce) # Three different version of each matrix: names(fm) # Get normalized proportional matrices dim(fm$proportions$cell) # Matrix of module probabilities for each cell dim(fm$proportions$module) # Matrix of feature probabilities for each module dim(fm$proportions$cellPopulation) # Matrix of module probabilities for each cell population dim(fm$proportions$sample) # Matrix of cell population probabilities in each sample ``` ## Changing the feature display name The parameter `displayName` can be used to change the labels of the rows from the `rownames` to a column in the `rowData` of the SCE object. The function is available in `plotDimReduceFeature` and `moduleHeatmap`. For example, if we did not change the `rownames` to `Symbol_TENx` in the beginning of the tutorial, the following code still could be run in `moduleHeatmap` to display the gene symbol even if the `rownames` were set to the original Ensembl IDs: ```{r module_heatmap_with_display} moduleHeatmap(sce, featureModule = 27, useAssay = useAssay, altExpName = altExpName, displayName = "Symbol_TENx") ``` # Session information
    sessionInfo() ```{r sessioninfo, echo = FALSE} sessionInfo() ```
    ================================================ FILE: vignettes/articles/decontX_pbmc4k.Rmd ================================================ --- title: "Decontamination of ambient RNA in single-cell genomic data with DecontX" author: - name: Shiyi (Iris) Yang affiliation: &id Boston University School of Medicine - name: Zhe Wang affiliation: *id - name: Yuan Yin affiliation: *id - name: Joshua Campbell affiliation: *id email: camp@bu.edu date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{Estimate and remove cross-contamination from ambient RNA in single-cell data with DecontX} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, dev = "png") ``` # Introduction Droplet-based microfluidic devices have become widely used to perform single-cell RNA sequencing (scRNA-seq). However, ambient RNA present in the cell suspension can be aberrantly counted along with a cell’s native mRNA and result in cross-contamination of transcripts between different cell populations. DecontX is a Bayesian method to estimate and remove contamination in individual cells. DecontX assumes the observed expression of a cell is a mixture of counts from two multinomial distributions: (1) a distribution of native transcript counts from the cell’s actual population and (2) a distribution of contaminating transcript counts from all other cell populations captured in the assay. Overall, computational decontamination of single cell counts can aid in downstream clustering and visualization. The package can be loaded using the `library` command. ```{r load, eval=TRUE, message=FALSE} library(celda) ``` # Importing data DecontX can take either a [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) object or a counts matrix as input. `decontX` will attempt to convert any input matrix to class `dgCMatrix` from package [Matrix](https://cran.r-project.org/web/packages/Matrix/index.html) before starting the analysis. To import datasets directly into an SCE object, the [singleCellTK](https://bioconductor.org/packages/release/bioc/html/singleCellTK.html) package has several importing functions for different preprocessing tools including CellRanger, STARsolo, BUStools, Optimus, DropEST, SEQC, and Alevin/Salmon. For example, the following code can be used as a template to read in the filtered and raw matrices for multiple samples processed with CellRanger: ```{r sce_import, eval = FALSE} library(singleCellTK) sce <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/")) ``` Within each sample directory, there should be subfolders called `"outs/filtered_feature_bc_matrix/"` or `"outs/raw_feature_bc_matrix/"` with files called `matrix.mtx.gz`, `features.tsv.gz` and `barcodes.tsv.gz`. If these files are in different subdirectories, the `importCellRangerV3Sample` function can be used to import data from a different directory instead. Optionally, the "raw" or "droplet" matrix can also be easily imported by setting the `dataType` argument to "raw": ```{r sce_import_raw, eval = FALSE} sce.raw <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"), dataType = "raw") ``` The raw matrix can be passed to the `background` parameter in `decontX` as described below. If using Seurat, go to the [Working with Seurat](#seurat) section for details on how to convert between SCE and Seurat objects. # Load PBMC4k data from 10X We will utilize the 10X PBMC 4K dataset as an example in this vignette. This data can be easily retrieved from the package [TENxPBMCData](http://bioconductor.org/packages/release/data/experiment/html/TENxPBMCData.html). Make sure the the column names are set before running decontX. ```{r load_10X, eval=TRUE, message=FALSE} # Load PBMC data library(TENxPBMCData) sce <- TENxPBMCData("pbmc4k") colnames(sce) <- paste(sce$Sample, sce$Barcode, sep = "_") rownames(sce) <- rowData(sce)$Symbol_TENx counts(sce) <- as(counts(sce), "dgCMatrix") ``` # Running decontX A SingleCellExperiment (SCE) object or a sparse matrix containing the counts for filtered cells can be passed to decontX via the `x` parameter. The matrix to use in an SCE object can be specified with the `assayName` parameter, which is set to `"counts"` by default. There are two major ways to run decontX: with and without the raw/droplet matrix containing empty droplets. Here is an example of running decontX without supplying the background: ```{r decontX, eval=TRUE, message=FALSE} sce <- decontX(sce) ``` In this scenario, `decontX` will estimate the contamination distribution for each cell cluster based on the profiles of the other cell clusters in the filtered dataset. The estimated contamination results can be found in the `colData(sce)$decontX_contamination` and the decontaminated counts can be accessed with `decontXcounts(sce)`. `decontX` will perform heuristic clustering to quickly define major cell clusters. However if you have your own cell cluster labels, they can be specified with the `z` parameter. These results will be used throughout the rest of the vignette. The raw/droplet matrix can be used to empirically estimate the distribution of ambient RNA, which is especially useful when cells that contributed to the ambient RNA are not accurately represented in the filtered count matrix containing the cells. For example, cells that were removed via flow cytometry or that were more sensitive to lysis during dissociation may have contributed to the ambient RNA but were not measured in the filtered/cell matrix. The raw/droplet matrix can be input as an SCE object or a sparse matrix using the `background` parameter: ```{r decontX_background, eval=FALSE, message=FALSE} sce <- decontX(sce, background = sce.raw) ``` Only empty droplets in the background matrix should be used to estimate the ambient RNA. If any cell ids (i.e. `colnames`) in the raw/droplet matrix supplied to the `background` parameter are also found in the filtered counts matrix (`x`), decontX will automatically remove them from the raw matrix. However, if the cell ids are not available for the input matrices, decontX will treat the entire `background` input as empty droplets. All of the outputs are the same as when running decontX without setting the `background` parameter. > Note: If the input object is just a matrix and not an SCE object, make sure to save the output into a variable with a different name (e.g. `result <- decontX(mat)`). The result object will be a list with contamination in `result$contamination` and the decontaminated counts in `result$decontXcounts`. # Plotting DecontX results ## Cluster labels on UMAP DecontX creates a UMAP which we can use to plot the cluster labels automatically identified in the analysis. Note that the clustering approach used here is designed to find "broad" cell types rather than individual cell subpopulations within a cell type. ```{r UMAP_Clusters} umap <- reducedDim(sce, "decontX_UMAP") plotDimReduceCluster(x = sce$decontX_clusters, dim1 = umap[, 1], dim2 = umap[, 2]) ``` ## Contamination on UMAP The percentage of contamination in each cell can be plotting on the UMAP to visualize what what clusters may have higher levels of ambient RNA. ```{r plot_decon} plotDecontXContamination(sce) ``` ## Expression of markers on UMAP Known marker genes can also be plotted on the UMAP to identify the cell types for each cluster. We will use CD3D and CD3E for T-cells, LYZ, S100A8, and S100A9 for monocytes, CD79A, CD79B, and MS4A1 for B-cells, GNLY for NK-cells, and PPBP for megakaryocytes. ```{r plot_feature, message=FALSE} library(scater) sce <- logNormCounts(sce) plotDimReduceFeature(as.matrix(logcounts(sce)), dim1 = umap[, 1], dim2 = umap[, 2], features = c("CD3D", "CD3E", "GNLY", "LYZ", "S100A8", "S100A9", "CD79A", "CD79B", "MS4A1"), exactMatch = TRUE) ``` ## Barplot of markers detected in cell clusters The percetage of cells within a cluster that have detectable expression of marker genes can be displayed in a barplot. Markers for cell types need to be supplied in a named list. First, the detection of marker genes in the original `counts` assay is shown: ```{r barplotCounts} markers <- list(Tcell_Markers = c("CD3E", "CD3D"), Bcell_Markers = c("CD79A", "CD79B", "MS4A1"), Monocyte_Markers = c("S100A8", "S100A9", "LYZ"), NKcell_Markers = "GNLY") cellTypeMappings <- list(Tcells = 2, Bcells = 5, Monocytes = 1, NKcells = 6) plotDecontXMarkerPercentage(sce, markers = markers, groupClusters = cellTypeMappings, assayName = "counts") ``` We can then look to see how much decontX removed aberrant expression of marker genes in each cell type by changing the `assayName` to `decontXcounts`: ```{r barplotDecontCounts} plotDecontXMarkerPercentage(sce, markers = markers, groupClusters = cellTypeMappings, assayName = "decontXcounts") ``` Percentages of marker genes detected in other cell types were reduced or completely removed. For example, the percentage of cells that expressed Monocyte marker genes was greatly reduced in T-cells, B-cells, and NK-cells. The original counts and decontamined counts can be plotted side-by-side by listing multiple assays in the `assayName` parameter. This option is only available if the data is stored in `SingleCellExperiment` object. ```{r barplotBoth} plotDecontXMarkerPercentage(sce, markers = markers, groupClusters = cellTypeMappings, assayName = c("counts", "decontXcounts")) ``` Some helpful hints when using `plotDecontXMarkerPercentage`: 1. Cell clusters can be renamed and re-grouped using the `groupCluster` parameter, which also needs to be a named list. If `groupCluster` is used, cell clusters not included in the list will be excluded in the barplot. For example, if we wanted to group T-cells and NK-cells together, we could set `cellTypeMappings <- list(NK_Tcells = c(2,6), Bcells = 5, Monocytes = 1)` 2. The level a gene that needs to be expressed to be considered detected in a cell can be adjusted using the `threshold` parameter. 3. If you are not using a `SingleCellExperiment`, then you will need to supply the original counts matrix or the decontaminated counts matrix as the first argument to generate the barplots. ## Violin plot to compare the distributions of original and decontaminated counts Another useful way to assess the amount of decontamination is to view the expression of marker genes before and after `decontX` across cell types. Here we view the monocyte markers in each cell type. The violin plot shows that the markers have been removed from T-cells, B-cells, and NK-cells, but are largely unaffected in monocytes. ```{r plotDecontXMarkerExpression} plotDecontXMarkerExpression(sce, markers = markers[["Monocyte_Markers"]], groupClusters = cellTypeMappings, ncol = 3) ``` Some helpful hints when using `plotDecontXMarkerExpression`: 1. `groupClusters` works the same way as in `plotDecontXMarkerPercentage`. 2. This function will plot each pair of markers and clusters (or cell type specified by `groupClusters`). Therefore, you may want to keep the number of markers small in each plot and call the function multiple times for different sets of marker genes. 3. You can also plot the individual points by setting `plotDots = TRUE` and/or log transform the points on the fly by setting `log1p = TRUE`. 4. This function can plot any assay in a `SingleCellExperiment`. Therefore you could also examine normalized expression of the original and decontaminated counts. For example: ```{r plot_norm_counts, eval = TRUE} library(scater) sce <- logNormCounts(sce, exprs_values = "decontXcounts", name = "decontXlogcounts") plotDecontXMarkerExpression(sce, markers = markers[["Monocyte_Markers"]], groupClusters = cellTypeMappings, ncol = 3, assayName = c("logcounts", "decontXlogcounts")) ``` # Other important notes ## Choosing appropriate cell clusters The ability of DecontX to accurately identify contamination is dependent on the cell cluster labels. DecontX assumes that contamination for a cell cluster comes from combination of counts from all other clusters. The default clustering approach used by DecontX tends to select fewer clusters that represent broader cell types. For example, all T-cells tend to be clustered together rather than splitting naive and cytotoxic T-cells into separate clusters. Custom cell type labels can be suppled via the `z` parameter if some cells are not being clustered appropriately by the default method. ## Adjusting the priors to influence contamination estimates There are ways to force `decontX` to estimate more or less contamination across a dataset by manipulating the priors. The `delta` parameter is a numeric vector of length two. It is the concentration parameter for the Dirichlet distribution which serves as the prior for the proportions of native and contamination counts in each cell. The first element is the prior for the proportion of native counts while the second element is the prior for the proportion of contamination counts. These essentially act as pseudocounts for the native and contamination in each cell. If `estimateDelta = TRUE`, `delta` is only used to produce a random sample of proportions for an initial value of contamination in each cell. Then `delta` is updated in each iteration. If `estimateDelta = FALSE`, then `delta` is fixed with these values for the entire inference procedure. Fixing `delta` and setting a high number in the second element will force `decontX` to be more aggressive and estimate higher levels of contamination in each cell at the expense of potentially removing native expression. For example, in the previous PBMC example, we can see what the estimated `delta` was by looking in the estimates: ```{r findDelta} metadata(sce)$decontX$estimates$all_cells$delta ``` Setting a higher value in the second element of delta and `estimateDelta = FALSE` will force `decontX` to estimate higher levels of contamination per cell: ```{r newDecontX, eval=TRUE, message=FALSE} sce.delta <- decontX(sce, delta = c(9, 20), estimateDelta = FALSE) plot(sce$decontX_contamination, sce.delta$decontX_contamination, xlab = "DecontX estimated priors", ylab = "Setting priors to estimate higher contamination") abline(0, 1, col = "red", lwd = 2) ``` ## Working with Seurat {#seurat} If you are using the [Seurat](https://cran.r-project.org/web/packages/Seurat/index.html) package for downstream analysis, the following code can be used to read in a matrix and convert between Seurat and SCE objects: ```{r seurat_create, eval = FALSE} # Read counts from CellRanger output library(Seurat) counts <- Read10X("sample/outs/filtered_feature_bc_matrix/") # Create a SingleCellExperiment object and run decontX sce <- SingleCellExperiment(list(counts = counts)) sce <- decontX(sce) # Create a Seurat object from a SCE with decontX results seuratObject <- CreateSeuratObject(round(decontXcounts(sce))) ``` Optionally, the "raw" matrix can be also be imported and used as the background: ```{r seurat_raw, eval = FALSE} counts.raw <- Read10X("sample/outs/raw_feature_bc_matrix/") sce.raw <- SingleCellExperiment(list(counts = counts.raw)) sce <- decontX(sce, background = sce.raw) ``` Note that the decontaminated matrix of decontX consists of floating point numbers and must be rounded to integers before adding it to a Seurat object. If you already have a Seurat object containing the counts matrix and would like to run decontX, you can retrieve the count matrix, create a SCE object, and run decontX, and then add it back to the Seurat object: ```{r seurat_create2, eval = FALSE} counts <- GetAssayData(object = seuratObject, slot = "counts") sce <- SingleCellExperiment(list(counts = counts)) sce <- decontX(sce) seuratObj[["decontXcounts"]] <- CreateAssayObject(counts = decontXcounts(sce)) ``` # Session Information ```{r} sessionInfo() ``` ================================================ FILE: vignettes/articles/installation.Rmd ================================================ # Introduction "celda" stands for "**CE**llular **L**atent **D**irichlet **A**llocation". It is a suite of Bayesian hierarchical models and supporting functions to perform gene and cell clustering for count data generated by single cell RNA-seq platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications. This package also includes a method called **decontX** which can be used to estimate and remove contamination in single cell genomic data. # Installation To install the latest stable release of **celda** from [Bioconductor](http://bioconductor.org/packages/celda/) (requires R version >= 3.6): ```{r bioc, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("celda") ``` The latest stable version of **celda** can be installed from GitHub using `devtools`: ```{r current_version, eval = FALSE} library(devtools) install_github("campbio/celda") ``` The development version of **celda** can also be installed from GitHub using `devtools`: ```{r devel_version, eval = FALSE} library(devtools) install_github("campbio/celda@devel") ``` **NOTE** For MAC OSX users, `devtools::install_github()` requires installation of **libgit2.** This can be installed via homebrew: ``` brew install libgit2 ``` # Troubleshooting * If you receive installation errors when Rcpp is being installed and compiled, try following the steps outlined [here](https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/) to solve the issue * If you are running R 4.0.0 or later version on MacOS Catalina and you see error `'wchar.h' file not found`, you can try the method in [this](https://discourse.mc-stan.org/t/dealing-with-catalina-iii/12731/5) link: * If you are trying to install **celda** using Rstudio and get this error: `could not find tools necessary to compile a package`, you can try typing this before running the install command: ``` options(buildtools.check = function(action) TRUE) ``` Also, if you receive installation errors when Rcpp is being installed and compiled, try following the steps outlined here to solve the issue: https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/ If you are running R 4.0.0 or later version on MacOS Catalina and you see error `'wchar.h' file not found`, you can try the method in this link: https://discourse.mc-stan.org/t/dealing-with-catalina-iii/12731/5 If you are trying to install on MacOS in an Apple Silicon computater and you see the following error: ``` ld: warning: directory not found for option '-L/opt/gfortran/lib/gcc/x86_64-apple-darwin20.0/12.2.0' ld: warning: directory not found for option '-L/opt/gfortran/lib' ld: library not found for -lgfortran clang: error: linker command failed with exit code 1 (use -v to see invocation) make: *** [celda.so] Error 1 ERROR: compilation failed for package ‘celda’ ``` You can solve this by downloading and installing the gfortran pkg located [here](https://mac.r-project.org/tools/gfortran-12.2-universal.pkg) and then running the following command: You can solve this by downloading and installing the gfortran pkg located [here](https://mac.r-project.org/tools/gfortran-12.2-universal.pkg) and then running the following command: ``` sudo /opt/gfortran/bin/gfortran-update-sdk ``` # Vignettes and examples To build the vignettes for Celda and DecontX during installation from GitHub, use the following command: ``` library(devtools) install_github("campbio/celda", build_vignettes = TRUE) ``` Note that installation may take an extra 5-10 minutes for building of the vignettes. The Celda and DecontX vignettes can then be accessed via the following commands: ``` vignette("celda") vignette("decontX") ``` ================================================ FILE: vignettes/celda.Rmd ================================================ --- title: "Analysis of single-cell genomic data with celda" author: - name: Joshua Campbell affiliation: Boston University School of Medicine email: camp@bu.edu - name: Zhe Wang affiliation: Boston University School of Medicine - name: Shiyi Yang affiliation: Boston University School of Medicine - name: Sean Corbett affiliation: Boston University School of Medicine - name: Yusuke Koga affiliation: Boston University School of Medicine date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{Analysis of single-cell genomic data with celda} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, dev = "png") ``` # Introduction **CE**llular **L**atent **D**irichlet **A**llocation (celda) is a collection of Bayesian hierarchical models to perform feature and cell bi-clustering for count data generated by single-cell platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications and has shown good performance with sparse data. celda simultaneously clusters features (i.e. gene expression) into modules based on co-expression patterns across cells and cells into subpopulations based on the probabilities of the feature modules within each cell. Starting from Bioconductor release 3.12 (`celda` version 1.6.0), `celda` makes use of `r BiocStyle::Biocpkg("SingleCellExperiment")` (SCE) objects for storing data and results. In this vignette we will demonstrate how to use celda to perform cell and feature clustering with a simple, small simulated dataset. This vignette does not include upstream importing of data, quality control, or filtering. To see a more complete analysis of larger real-world datasets, visit [camplab.net/celda](https://www.camplab.net/celda/) for additional vignettes. # Installation celda can be installed from Bioconductor: ```{r install, eval= FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("celda") ``` To load the package, type the following: ```{r library, message = FALSE} library(celda) ``` A complete list of help files are accessible using the help command with the `package` option. ```{r help, eval = FALSE} help(package = celda) ``` To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.
    # Generation of a simulated single cell dataset celda will take a matrix of counts where each row is a feature, each column is a cell, and each entry in the matrix is the number of counts of each feature in each cell. To illustrate the utility of celda, we will apply it to a simulated dataset. In the function `simulateCells`, the **K** parameter designates the number of cell clusters, the **L** parameter determines the number of feature modules, the **S** parameter determines the number of samples in the simulated dataset, the **G** parameter determines the number of features to be simulated, and **CRange** specifies the lower and upper bounds of the number of cells to be generated in each sample. To simulate a dataset of 5 samples with 5 cell populations, 10 feature modules, 200 features, and between 30 to 50 cells per sample using `celda_CG` model: ```{r simulate} simsce <- simulateCells("celda_CG", S = 5, K = 5, L = 10, G = 200, CRange = c(30, 50)) ``` The `counts` assay slot in `simsce` contains the counts matrix. The dimensions of counts matrix: ```{r assay, message = FALSE} library(SingleCellExperiment) dim(counts(simsce)) ``` Columns `celda_sample_label` and `celda_cell_cluster` in `colData(simsce)` contain sample labels and celda cell population cluster labels. Here are the numbers of cells in each subpopulation and in each sample: ```{r cell_numbers} table(colData(simsce)$celda_cell_cluster) table(colData(simsce)$celda_sample_label) ``` Column `celda_feature_module` in `rowData(simsce)` contains feature module labels. Here is the number of features in each feature module: ```{r module_numbers} table(rowData(simsce)$celda_feature_module) ``` # Feature selection A simple heuristic feature selection is performed to reduce the size of features used for clustering. To speed up the process, only features with at least 3 counts in at least 3 cells are included in downstream clustering for this data. A subset `SingleCellExperiment` object with filtered features is stored in `altExp(simsce, "featureSubset")` slot by default. ```{r, warning = FALSE, message = FALSE} simsce <- selectFeatures(simsce) ``` If the number of features is still too large, then a smaller subset of features can be obtained by selecting the top number of most variable genes. For an example code, see the PBMC3K tutorial in the online celda [documentation](https://www.camplab.net/celda). # Performing bi-clustering with celda There are currently three models within *celda* package: `celda_C` will cluster cells, `celda_G` will cluster features, and `celda_CG` will simultaneously cluster cells and features. Within the functions the `K` parameter will be the number of cell populations to be estimated, while the `L` parameter will be the number of feature modules to be estimated in the output model. ```{r celda_cg, warning = FALSE, message = FALSE} sce <- celda_CG(x = simsce, K = 5, L = 10, verbose = FALSE, nchains = 1) ``` Here is a comparison between the true cluster labels and the estimated cluster labels. ```{r accuracy} table(celdaClusters(sce), celdaClusters(simsce)) table(celdaModules(sce), celdaModules(simsce)) ``` # Visualization ## Plotting cell populations on 2D-embeddings celda contains its own wrapper function for tSNE and UMAP called `celdaTsne` and `celdaUmap`, respectively. Both of these functions can be used to embed cells into 2-dimensions. The output can be used in the downstream plotting functions `plotDimReduceCluster`, `plotDimReduceModule`, and `plotDimReduceFeature` to show cell population clusters, module probabilities, and expression of individual features, respectively. ```{r umap} sce <- celdaUmap(sce) ``` ```{r plot_umap, eval = TRUE, fig.width = 7, fig.height = 7} plotDimReduceCluster(x = sce, reducedDimName = "celda_UMAP") plotDimReduceModule(x = sce, reducedDimName = "celda_UMAP", rescale = TRUE) plotDimReduceFeature(x = sce, reducedDimName = "celda_UMAP", normalize = TRUE, features = "Gene_1") ``` ## Creating an expression heatmap The clustering results can be viewed with a heatmap of the normalized counts using the function `celdaHeatmap`. The top `nfeatures` in each module will be selected according to the factorized module probability matrix. ```{r celda_heatmap, eval = TRUE, fig.width = 7, fig.height = 7} plot(celdaHeatmap(sce = sce, nfeatures = 10)) ``` ## Displaying relationships between modules and cell populations The relationships between feature modules and cell populations can be visualized with `celdaProbabilityMap`. The absolute probabilities of each feature module in each cellular subpopulation is shown on the left. The normalized and z-scored expression of each module in each cell population is shown on the right. ```{r propmap, eval = TRUE, fig.width = 7, fig.height = 7} celdaProbabilityMap(sce) ``` ## Examining co-expression with module heatmaps `moduleHeatmap` creates a heatmap using only the features from a specific feature module. Cells are ordered from those with the lowest probability of the module to the highest. If more than one module is used, then cells will be ordered by the probabilities of the first module. ```{r module_heatmap, eval = TRUE, fig.width = 7, fig.height = 7} moduleHeatmap(sce, featureModule = c(1,2), topCells = 100) ``` # Identifying reasonable numbers of feature modules and cell subpopulations In the previous example, the best `K` (the number of cell clusters) and `L` (the number of feature modules) was already known. However, the optimal `K` and `L` for each new dataset will likely not be known beforehand and multiple choices of `K` and `L` may need to be tried and compared. celda offers two sets of functions to determine the optimum `K` and `L`, `recursiveSplitModule`/`recursiveSplitCell`, and `celdaGridSearch`. ## Using recursive splitting Functions `recursiveSplitModule` and `recursiveSplitCell` offer a fast method to generate a celda model with optimum `K` and `L`. First, `recursiveSplitModule` is used to determine the optimal `L`. `recursiveSplitModule` first splits features into however many modules are specified in `initialL`. The module labels are then recursively split in a way that would generate the highest log-likelihood, all the way up to `maxL`. ```{r, message = FALSE} moduleSplit <- recursiveSplitModule(simsce, initialL = 2, maxL = 15) ``` Perplexity is a statistical measure of how well a probability model can predict new data. Lower perplexity indicates a better model. The perplexity of each model can be visualized with `plotGridSearchPerplexity`. In general, visual inspection of the plot can be used to select the optimal number of modules (`L`) or cell populations (`K`) by identifying the "elbow" - where the rate of decrease in the perplexity starts to drop off. ```{r} plotGridSearchPerplexity(moduleSplit) ``` In this example, the perplexity for `L` stops decreasing at L = 10, thus L = 10 would be a good choice. Sometimes the perplexity alone does not show a clear elbow or "leveling off". However, the rate of perplexity change (RPC) can be more informative to determine when adding new modules does not add much additional information [Zhao et al., 2015](https://doi.org/10.1186/1471-2105-16-S13-S8){target="_blank"}). An RPC closer to zero indicates that the addition of new modules or cell clusters is not substantially decreasing the perplexity. The RPC of models can be visualized using function `plotRPC`: ```{r module_split_rpc, message = FALSE, warning = FALSE} plotRPC(moduleSplit) ``` Once you have identified the optimal `L` (in this case, L is selected to be 10), the module labels are used for initialization in `recursiveSplitCell`. Similarly to `recursiveSplitModule`, cells are initially split into a small number of subpopulations, and the subpopulations are recursively split up. ```{r module_split_select, message = FALSE} moduleSplitSelect <- subsetCeldaList(moduleSplit, params = list(L = 10)) cellSplit <- recursiveSplitCell(moduleSplitSelect, initialK = 3, maxK = 12, yInit = celdaModules(moduleSplitSelect)) ``` ```{r rpc_cell} plotGridSearchPerplexity(cellSplit) plotRPC(cellSplit) ``` In this plot, the perplexity for K stops decreasing at K = 5, with a final K/L combination of K = 5, L = 10. Generally, this method can be used to pick a reasonable `L` and a potential range of `K`. However, manual review of specific selections of `K` is often required to ensure results are biologically coherent. Once users have chosen the K/L parameters for further analysis, the `subsetCeldaList` function can be used to subset the celda list *SCE* object to a single model *SCE* object. ```{r subset_celda, eval = TRUE} sce <- subsetCeldaList(cellSplit, params = list(K = 5, L = 10)) ``` ## Using a grid search Alternativley to recursive splitting, celda is able to run multiple combinations of K and L with multiple chains in parallel via the `celdaGridSearch` function. ```{r grid_search, eval = TRUE, message = FALSE} cgs <- celdaGridSearch(simsce, paramsTest = list(K = seq(4, 6), L = seq(9, 11)), cores = 1, model = "celda_CG", nchains = 2, maxIter = 100, verbose = FALSE, bestOnly = TRUE) ``` Setting `verbose` to `TRUE` will print the output of each model to a text file. These results can be visualized with `plotGridSearchPerplexity`. The major goal is to pick the lowest `K` and `L` combination with relatively good perplexity. In general, visual inspection of the plot can be used to select the number of modules (`L`) or cell populations (`K`) where the rate of decrease in the perplexity starts to drop off. `bestOnly = TRUE` indicates that only the chain with the best log likelihood will be returned for each K/L combination. ```{r plot_grid_search, eval = TRUE, fig.width = 8, fig.height = 8, warning = FALSE, message = FALSE} plotGridSearchPerplexity(cgs) ``` In this example, the perplexity for `L` stops decreasing at L = 10 for the majority of `K` values. For the line corresponding to L = 10, the perplexity stops decreasing at K = 5. Thus L = 10 and K = 5 would be a good choice. Again, manual review of specific selections of K is often be required to ensure results are biologically coherent. Once users have chosen the K/L parameters for further analysis, the `subsetCeldaList` function can be used to subset the celda list *SCE* object to a single model *SCE* object. ```{r subset_grid_search, eval = TRUE} sce <- subsetCeldaList(cgs, params = list(K = 5, L = 10)) ``` If the "bestOnly" parameter is set to FALSE in the `celdaGridSearch`, then the `selectBestModel` function can be used to select the chains with the lowest log likelihoods within each combination of parameters. Alternatively, users can select a specific chain by specifying the index within the `subsetCeldaList` function. ```{r best_only_cgs, eval = FALSE, message=FALSE} cgs <- celdaGridSearch(simsce, paramsTest = list(K = seq(4, 6), L = seq(9, 11)), cores = 1, model = "celda_CG", nchains = 2, maxIter = 100, verbose = FALSE, bestOnly = FALSE) cgs <- resamplePerplexity(cgs, celdaList = cgs, resample = 2) cgsK5L10 <- subsetCeldaList(cgs, params = list(K = 5, L = 10)) sce <- selectBestModel(cgsK5L10) ``` # Miscellaneous utility functions celda also contains several utility functions for the users' convenience. ## Finding the modules for feature with featureModuleLookup `featureModuleLookup` can be used to look up the module a specific feature was clustered to. ```{r module_lookup} featureModuleLookup(sce, feature = c("Gene_99")) ``` ## Reordering cluster labels with recodeClusterZ, recodeClusterY `recodeClusterZ` and `recodeClusterY` allows the user to recode the cell and feature cluster labels, respectively. ```{r recode_clusters} sceZRecoded <- recodeClusterZ(sce, from = c(1, 2, 3, 4, 5), to = c(2, 1, 3, 4, 5)) ``` The model prior to reordering cell labels compared to after reordering cell labels: ```{r recode_clusters_show} table(celdaClusters(sce), celdaClusters(sceZRecoded)) ``` # Session Information ```{r session_info} sessionInfo() ``` ================================================ FILE: vignettes/decontX.Rmd ================================================ --- title: "Decontamination of ambient RNA in single-cell genomic data with DecontX" author: - name: Shiyi (Iris) Yang affiliation: &id Boston University School of Medicine - name: Zhe Wang affiliation: *id - name: Yuan Yin affiliation: *id - name: Joshua Campbell affiliation: *id email: camp@bu.edu date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{Estimate and remove cross-contamination from ambient RNA in single-cell data with DecontX} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, dev = "png") ``` # Introduction Droplet-based microfluidic devices have become widely used to perform single-cell RNA sequencing (scRNA-seq). However, ambient RNA present in the cell suspension can be aberrantly counted along with a cell’s native mRNA and result in cross-contamination of transcripts between different cell populations. DecontX is a Bayesian method to estimate and remove contamination in individual cells. DecontX assumes the observed expression of a cell is a mixture of counts from two multinomial distributions: (1) a distribution of native transcript counts from the cell’s actual population and (2) a distribution of contaminating transcript counts from all other cell populations captured in the assay. Overall, computational decontamination of single cell counts can aid in downstream clustering and visualization. The package can be loaded using the `library` command. ```{r load, eval=TRUE, message=FALSE} library(celda) ``` # Importing data DecontX can take either a [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) object or a counts matrix as input. `decontX` will attempt to convert any input matrix to class `dgCMatrix` from package [Matrix](https://cran.r-project.org/web/packages/Matrix/index.html) before starting the analysis. To import datasets directly into an SCE object, the [singleCellTK](https://bioconductor.org/packages/release/bioc/html/singleCellTK.html) package has several importing functions for different preprocessing tools including CellRanger, STARsolo, BUStools, Optimus, DropEST, SEQC, and Alevin/Salmon. For example, the following code can be used as a template to read in the filtered and raw matrices for multiple samples processed with CellRanger: ```{r sce_import, eval = FALSE} library(singleCellTK) sce <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/")) ``` Within each sample directory, there should be subfolders called `"outs/filtered_feature_bc_matrix/"` or `"outs/raw_feature_bc_matrix/"` with files called `matrix.mtx.gz`, `features.tsv.gz` and `barcodes.tsv.gz`. If these files are in different subdirectories, the `importCellRangerV3Sample` function can be used to import data from a different directory instead. Optionally, the "raw" or "droplet" matrix can also be easily imported by setting the `dataType` argument to "raw": ```{r sce_import_raw, eval = FALSE} sce.raw <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"), dataType = "raw") ``` The raw matrix can be passed to the `background` parameter in `decontX` as described below. If using Seurat, go to the [Working with Seurat](#seurat) section for details on how to convert between SCE and Seurat objects. # Load PBMC4k data from 10X We will utilize the 10X PBMC 4K dataset as an example in this vignette. This data can be easily retrieved from the package [TENxPBMCData](http://bioconductor.org/packages/release/data/experiment/html/TENxPBMCData.html). Make sure the the column names are set before running decontX. ```{r load_10X, eval=TRUE, message=FALSE} # Load PBMC data library(TENxPBMCData) sce <- TENxPBMCData("pbmc4k") colnames(sce) <- paste(sce$Sample, sce$Barcode, sep = "_") rownames(sce) <- rowData(sce)$Symbol_TENx counts(sce) <- as(counts(sce), "dgCMatrix") ``` # Running decontX A SingleCellExperiment (SCE) object or a sparse matrix containing the counts for filtered cells can be passed to decontX via the `x` parameter. The matrix to use in an SCE object can be specified with the `assayName` parameter, which is set to `"counts"` by default. There are two major ways to run decontX: with and without the raw/droplet matrix containing empty droplets. Here is an example of running decontX without supplying the background: ```{r decontX, eval=TRUE, message=FALSE} sce <- decontX(sce) ``` In this scenario, `decontX` will estimate the contamination distribution for each cell cluster based on the profiles of the other cell clusters in the filtered dataset. The estimated contamination results can be found in the `colData(sce)$decontX_contamination` and the decontaminated counts can be accessed with `decontXcounts(sce)`. `decontX` will perform heuristic clustering to quickly define major cell clusters. However if you have your own cell cluster labels, they can be specified with the `z` parameter. These results will be used throughout the rest of the vignette. The raw/droplet matrix can be used to empirically estimate the distribution of ambient RNA, which is especially useful when cells that contributed to the ambient RNA are not accurately represented in the filtered count matrix containing the cells. For example, cells that were removed via flow cytometry or that were more sensitive to lysis during dissociation may have contributed to the ambient RNA but were not measured in the filtered/cell matrix. The raw/droplet matrix can be input as an SCE object or a sparse matrix using the `background` parameter: ```{r decontX_background, eval=FALSE, message=FALSE} sce <- decontX(sce, background = sce.raw) ``` Only empty droplets in the background matrix should be used to estimate the ambient RNA. If any cell ids (i.e. `colnames`) in the raw/droplet matrix supplied to the `background` parameter are also found in the filtered counts matrix (`x`), decontX will automatically remove them from the raw matrix. However, if the cell ids are not available for the input matrices, decontX will treat the entire `background` input as empty droplets. All of the outputs are the same as when running decontX without setting the `background` parameter. > Note: If the input object is just a matrix and not an SCE object, make sure to save the output into a variable with a different name (e.g. `result <- decontX(mat)`). The result object will be a list with contamination in `result$contamination` and the decontaminated counts in `result$decontXcounts`. # Plotting DecontX results ## Cluster labels on UMAP DecontX creates a UMAP which we can use to plot the cluster labels automatically identified in the analysis. Note that the clustering approach used here is designed to find "broad" cell types rather than individual cell subpopulations within a cell type. ```{r UMAP_Clusters} umap <- reducedDim(sce, "decontX_UMAP") plotDimReduceCluster(x = sce$decontX_clusters, dim1 = umap[, 1], dim2 = umap[, 2]) ``` ## Contamination on UMAP The percentage of contamination in each cell can be plotting on the UMAP to visualize what what clusters may have higher levels of ambient RNA. ```{r plot_decon} plotDecontXContamination(sce) ``` ## Expression of markers on UMAP Known marker genes can also be plotted on the UMAP to identify the cell types for each cluster. We will use CD3D and CD3E for T-cells, LYZ, S100A8, and S100A9 for monocytes, CD79A, CD79B, and MS4A1 for B-cells, GNLY for NK-cells, and PPBP for megakaryocytes. ```{r plot_feature, message=FALSE} library(scater) sce <- logNormCounts(sce) plotDimReduceFeature(as.matrix(logcounts(sce)), dim1 = umap[, 1], dim2 = umap[, 2], features = c("CD3D", "CD3E", "GNLY", "LYZ", "S100A8", "S100A9", "CD79A", "CD79B", "MS4A1"), exactMatch = TRUE) ``` ## Barplot of markers detected in cell clusters The percetage of cells within a cluster that have detectable expression of marker genes can be displayed in a barplot. Markers for cell types need to be supplied in a named list. First, the detection of marker genes in the original `counts` assay is shown: ```{r barplotCounts} markers <- list(Tcell_Markers = c("CD3E", "CD3D"), Bcell_Markers = c("CD79A", "CD79B", "MS4A1"), Monocyte_Markers = c("S100A8", "S100A9", "LYZ"), NKcell_Markers = "GNLY") cellTypeMappings <- list(Tcells = 2, Bcells = 5, Monocytes = 1, NKcells = 6) plotDecontXMarkerPercentage(sce, markers = markers, groupClusters = cellTypeMappings, assayName = "counts") ``` We can then look to see how much decontX removed aberrant expression of marker genes in each cell type by changing the `assayName` to `decontXcounts`: ```{r barplotDecontCounts} plotDecontXMarkerPercentage(sce, markers = markers, groupClusters = cellTypeMappings, assayName = "decontXcounts") ``` Percentages of marker genes detected in other cell types were reduced or completely removed. For example, the percentage of cells that expressed Monocyte marker genes was greatly reduced in T-cells, B-cells, and NK-cells. The original counts and decontamined counts can be plotted side-by-side by listing multiple assays in the `assayName` parameter. This option is only available if the data is stored in `SingleCellExperiment` object. ```{r barplotBoth} plotDecontXMarkerPercentage(sce, markers = markers, groupClusters = cellTypeMappings, assayName = c("counts", "decontXcounts")) ``` Some helpful hints when using `plotDecontXMarkerPercentage`: 1. Cell clusters can be renamed and re-grouped using the `groupCluster` parameter, which also needs to be a named list. If `groupCluster` is used, cell clusters not included in the list will be excluded in the barplot. For example, if we wanted to group T-cells and NK-cells together, we could set `cellTypeMappings <- list(NK_Tcells = c(2,6), Bcells = 5, Monocytes = 1)` 2. The level a gene that needs to be expressed to be considered detected in a cell can be adjusted using the `threshold` parameter. 3. If you are not using a `SingleCellExperiment`, then you will need to supply the original counts matrix or the decontaminated counts matrix as the first argument to generate the barplots. ## Violin plot to compare the distributions of original and decontaminated counts Another useful way to assess the amount of decontamination is to view the expression of marker genes before and after `decontX` across cell types. Here we view the monocyte markers in each cell type. The violin plot shows that the markers have been removed from T-cells, B-cells, and NK-cells, but are largely unaffected in monocytes. ```{r plotDecontXMarkerExpression} plotDecontXMarkerExpression(sce, markers = markers[["Monocyte_Markers"]], groupClusters = cellTypeMappings, ncol = 3) ``` Some helpful hints when using `plotDecontXMarkerExpression`: 1. `groupClusters` works the same way as in `plotDecontXMarkerPercentage`. 2. This function will plot each pair of markers and clusters (or cell type specified by `groupClusters`). Therefore, you may want to keep the number of markers small in each plot and call the function multiple times for different sets of marker genes. 3. You can also plot the individual points by setting `plotDots = TRUE` and/or log transform the points on the fly by setting `log1p = TRUE`. 4. This function can plot any assay in a `SingleCellExperiment`. Therefore you could also examine normalized expression of the original and decontaminated counts. For example: ```{r plot_norm_counts, eval = TRUE} library(scater) sce <- logNormCounts(sce, exprs_values = "decontXcounts", name = "decontXlogcounts") plotDecontXMarkerExpression(sce, markers = markers[["Monocyte_Markers"]], groupClusters = cellTypeMappings, ncol = 3, assayName = c("logcounts", "decontXlogcounts")) ``` # Other important notes ## Choosing appropriate cell clusters The ability of DecontX to accurately identify contamination is dependent on the cell cluster labels. DecontX assumes that contamination for a cell cluster comes from combination of counts from all other clusters. The default clustering approach used by DecontX tends to select fewer clusters that represent broader cell types. For example, all T-cells tend to be clustered together rather than splitting naive and cytotoxic T-cells into separate clusters. Custom cell type labels can be suppled via the `z` parameter if some cells are not being clustered appropriately by the default method. ## Adjusting the priors to influence contamination estimates There are ways to force `decontX` to estimate more or less contamination across a dataset by manipulating the priors. The `delta` parameter is a numeric vector of length two. It is the concentration parameter for the Dirichlet distribution which serves as the prior for the proportions of native and contamination counts in each cell. The first element is the prior for the proportion of native counts while the second element is the prior for the proportion of contamination counts. These essentially act as pseudocounts for the native and contamination in each cell. If `estimateDelta = TRUE`, `delta` is only used to produce a random sample of proportions for an initial value of contamination in each cell. Then `delta` is updated in each iteration. If `estimateDelta = FALSE`, then `delta` is fixed with these values for the entire inference procedure. Fixing `delta` and setting a high number in the second element will force `decontX` to be more aggressive and estimate higher levels of contamination in each cell at the expense of potentially removing native expression. For example, in the previous PBMC example, we can see what the estimated `delta` was by looking in the estimates: ```{r findDelta} metadata(sce)$decontX$estimates$all_cells$delta ``` Setting a higher value in the second element of delta and `estimateDelta = FALSE` will force `decontX` to estimate higher levels of contamination per cell: ```{r newDecontX, eval=TRUE, message=FALSE} sce.delta <- decontX(sce, delta = c(9, 20), estimateDelta = FALSE) plot(sce$decontX_contamination, sce.delta$decontX_contamination, xlab = "DecontX estimated priors", ylab = "Setting priors to estimate higher contamination") abline(0, 1, col = "red", lwd = 2) ``` ## Working with Seurat {#seurat} If you are using the [Seurat](https://cran.r-project.org/web/packages/Seurat/index.html) package for downstream analysis, the following code can be used to read in a matrix and convert between Seurat and SCE objects: ```{r seurat_create, eval = FALSE} # Read counts from CellRanger output library(Seurat) counts <- Read10X("sample/outs/filtered_feature_bc_matrix/") # Create a SingleCellExperiment object and run decontX sce <- SingleCellExperiment(list(counts = counts)) sce <- decontX(sce) # Create a Seurat object from a SCE with decontX results seuratObject <- CreateSeuratObject(round(decontXcounts(sce))) ``` Optionally, the "raw" matrix can be also be imported and used as the background: ```{r seurat_raw, eval = FALSE} counts.raw <- Read10X("sample/outs/raw_feature_bc_matrix/") sce.raw <- SingleCellExperiment(list(counts = counts.raw)) sce <- decontX(sce, background = sce.raw) ``` Note that the decontaminated matrix of decontX consists of floating point numbers and must be rounded to integers before adding it to a Seurat object. If you already have a Seurat object containing the counts matrix and would like to run decontX, you can retrieve the count matrix, create a SCE object, and run decontX, and then add it back to the Seurat object: ```{r seurat_create2, eval = FALSE} counts <- GetAssayData(object = seuratObject, slot = "counts") sce <- SingleCellExperiment(list(counts = counts)) sce <- decontX(sce) seuratObj[["decontXcounts"]] <- CreateAssayObject(counts = decontXcounts(sce)) ``` # Session Information ```{r} sessionInfo() ```