Compiled: May 18, 2020
This vignette demonstrates the import of alevin quantified counts into Seurat. Commands and parameters are based off of the alevin tutorial. If you use alevin in your work, please cite:
Alevin efficiently estimates accurate gene abundances from dscRNA-seq data
Avi Srivastava, Laraib Malik, Tom Smith, Ian Sudbery & Rob Patro
Genome Biology, 2019.
Prerequisites to install:
This vignette demonstrates how to run ALRA on Seurat objects. ALRA aims to recover missing values in scRNA-seq data through imputation. If you use ALRA, please cite:
Zero-preserving imputation of scRNA-seq data using low-rank approximation
George C. Linderman, Jun Zhao, Yuval Kluger
biorxiv, 2018.
Prerequisites to install:
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
library(dplyr)
To learn more about this dataset, type ?pbmc3k
InstallData("pbmc3k")
data("pbmc3k")
# Initial processing and visualization
pbmc3k <- SCTransform(pbmc3k) %>% RunPCA() %>% RunUMAP(dims = 1:30)
# run ALRA, creates alra assay of imputed values
pbmc3k <- RunALRA(pbmc3k)
# visualize original and imputed values
pbmc3k <- NormalizeData(pbmc3k, assay = "RNA")
features.plot <- c("CD3D", "MS4A1", "CD8A", "GZMK", "NCAM1", "FCGR3A")
DefaultAssay(pbmc3k) <- "RNA"
plot1 <- FeaturePlot(pbmc3k, features.plot, ncol = 2)
DefaultAssay(pbmc3k) <- "alra"
plot2 <- FeaturePlot(pbmc3k, features.plot, ncol = 2, cols = c("lightgrey", "red"))
CombinePlots(list(plot1, plot2), ncol = 1)
Find markers based on the BANKSY clusters and visualize them. Here, we
find differentially expressed genes between the CA1 and CA3 regions.
``` r
# Find markers
DefaultAssay(ss.hippo) <- 'Spatial'
markers <- FindMarkers(ss.hippo, ident.1 = 4, ident.2 = 9, only.pos = FALSE,
logfc.threshold = 1, min.pct = 0.5)
markers <- markers[markers$p_val_adj < 0.01,]
markers
```
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## SNAP25 1.127235e-46 -1.260312 0.658 0.823 2.622400e-42
## CHGB 9.840001e-44 -1.985343 0.439 0.697 2.289178e-39
## STMN2 1.281230e-24 -1.430138 0.335 0.574 2.980653e-20
## SYN2 3.272800e-23 -1.609355 0.332 0.564 7.613842e-19
## ATP2B1 1.545647e-22 1.251540 0.639 0.474 3.595793e-18
## CPLX2 4.619232e-21 -1.220110 0.289 0.522 1.074618e-16
## PRKCB 1.276453e-18 1.394809 0.552 0.341 2.969539e-14
## PCP4 2.006224e-18 -1.269671 0.379 0.578 4.667279e-14
## TUBB2A 1.330787e-16 -1.054176 0.450 0.629 3.095942e-12
## DDN 1.784378e-14 1.401976 0.592 0.396 4.151176e-10
## SNCA 7.596526e-12 -1.022314 0.397 0.544 1.767256e-07
``` r
genes <- c('ATP2B1', 'CHGB')
SpatialFeaturePlot(ss.hippo, features = genes, pt.size.factor = 3,
stroke = NA, alpha = 0.5, max.cutoff = 'q95')
```
## Running BANKSY with locations provided explicitly
One can also call `RunBanksy` on a Seurat object created from counts by
providing the location of cell centroids or spots explicitly. In this
case, the locations must be stored as metadata. Here, we use a mouse
hippocampus VeraFISH dataset provided with the *Banksy* package.
``` r
data(hippocampus)
head(hippocampus$expression[,1:5])
```
## cell_1276 cell_8890 cell_691 cell_396 cell_9818
## Sparcl1 45 0 11 22 0
## Slc1a2 17 0 6 5 0
## Map 10 0 12 16 0
## Sqstm1 26 0 0 2 0
## Atp1a2 0 0 4 3 0
## Tnc 0 0 0 0 0
``` r
head(hippocampus$locations)
```
## sdimx sdimy
## cell_1276 -13372.899 15776.37
## cell_8890 8941.101 15866.37
## cell_691 -14882.899 15896.37
## cell_396 -15492.899 15835.37
## cell_9818 11308.101 15846.37
## cell_11310 14894.101 15810.37
Construct the Seurat object by storing the locations of cell centroids
as metadata. We keep cells whose total count lies between the 5th and
98th percentiles:
``` r
# Create manually
vf.hippo <- CreateSeuratObject(counts = hippocampus$expression,
meta.data = hippocampus$locations)
vf.hippo <- subset(vf.hippo,
nCount_RNA > quantile(vf.hippo$nCount_RNA, 0.05) &
nCount_RNA < quantile(vf.hippo$nCount_RNA, 0.98))
```
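The percentile filter can be illustrated on a plain numeric vector. This is a minimal base R sketch; the `total_counts` values are made up and simply stand in for `nCount_RNA`.

``` r
# Toy per-cell total counts (hypothetical values standing in for nCount_RNA)
total_counts <- c(3, 10, 25, 40, 55, 70, 85, 100, 150, 900)

# Bounds at the 5th and 98th percentiles
lower <- quantile(total_counts, 0.05)
upper <- quantile(total_counts, 0.98)

# Keep cells strictly between the two bounds
keep <- total_counts > lower & total_counts < upper
kept_counts <- total_counts[keep]
kept_counts
```

Both comparisons are strict, so cells sitting exactly on a bound are dropped along with the tails.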
Next, we normalize the data by library size and scale the data:
``` r
# Normalize
vf.hippo <- NormalizeData(vf.hippo, scale.factor = 100, normalization.method = 'RC')
vf.hippo <- ScaleData(vf.hippo)
```
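As a rough illustration of what relative-counts (`'RC'`) normalization with `scale.factor = 100` amounts to, the base R sketch below divides each cell's counts by that cell's library size and multiplies by the scale factor. The toy matrix is hypothetical, and the sketch does not reproduce Seurat's internals.

``` r
# Toy count matrix: genes in rows, cells in columns (hypothetical values)
counts <- matrix(c(45, 17, 10,
                    0,  5,  5,
                   11,  6, 12),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2", "cell3")))

scale_factor <- 100

# Divide each column (cell) by its library size, then multiply by the scale factor
lib_size <- colSums(counts)
rc <- sweep(counts, 2, lib_size, "/") * scale_factor

# Every cell now sums to the scale factor
colSums(rc)
```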
Now, run BANKSY. Here, we provide the column names of the x and y
spatial coordinates as stored in the metadata to `dimx` and `dimy`
respectively:
``` r
# Run BANKSY
vf.hippo <- RunBanksy(vf.hippo, lambda = 0.2, dimx = 'sdimx', dimy = 'sdimy',
assay = 'RNA', slot = 'data', features = 'all', k_geom = 10)
```
Note that the `RunBanksy` function sets the default assay to `BANKSY`
(determined by the `assay_name` argument) and fills the `scale.data`
slot. Users should not call `ScaleData` on the `BANKSY` assay as this
negates the effects of `lambda`.
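To see why rescaling would undo the weighting, consider the toy base R sketch below. It assumes, for illustration only, that `lambda` enters as constant weights `sqrt(1 - lambda)` and `sqrt(lambda)` on a gene's own expression and on its neighborhood average; the actual weighting inside `RunBanksy` may differ. Any positive constant weight on a feature disappears once that feature is z-scored again.

``` r
set.seed(1)
lambda <- 0.2

# Hypothetical feature rows: a gene's own expression and its neighborhood average
own <- rnorm(50)
nbr <- rnorm(50)

# Illustrative weighting (assumed form): lambda sets the balance between the rows
weighted <- rbind(own = sqrt(1 - lambda) * own,
                  nbr = sqrt(lambda) * nbr)

# Row-wise z-scoring, which is what rescaling each feature amounts to
zscore_rows <- function(m) t(scale(t(m)))

rescaled_weighted   <- zscore_rows(weighted)
rescaled_unweighted <- zscore_rows(rbind(own = own, nbr = nbr))

# The weights have vanished: both results agree
all.equal(rescaled_weighted, rescaled_unweighted, check.attributes = FALSE)
```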
Run PCA on the BANKSY matrix:
``` r
# PCA
vf.hippo <- RunPCA(vf.hippo, assay = 'BANKSY', features = rownames(vf.hippo), npcs = 20)
```
Find BANKSY clusters:
``` r
# Cluster
vf.hippo <- FindNeighbors(vf.hippo, dims = 1:20)
vf.hippo <- FindClusters(vf.hippo, resolution = 0.5)
```
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 10205
## Number of edges: 446178
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9099
## Number of communities: 15
## Elapsed time: 1 seconds
Visualise BANKSY clusters in spatial dimensions:
``` r
# Viz
FeatureScatter(vf.hippo, 'sdimx', 'sdimy', cols = mypal, pt.size = 0.75)
```
``` r
FeatureScatter(vf.hippo, 'sdimx', 'sdimy', cols = mypal, pt.size = 0.1) + facet_wrap(~ colors)
```
Find markers and visualise them. Here, we do so for a cluster defined by
a thin layer of cells expressing Gfap. We also write a simple function
`genePlot` that plots marker genes in spatial dimensions.
``` r
# Find markers
DefaultAssay(vf.hippo) <- 'RNA'
markers <- FindMarkers(vf.hippo, ident.1 = 6, only.pos = TRUE)
genePlot <- function(object, dimx, dimy, gene, assay = 'RNA',
slot = 'scale.data', q.low = 0.01, q.high = 0.99,
col.low='blue', col.high='red') {
val <- GetAssayData(object, assay=assay, slot=slot)[gene,]
val.low <- quantile(val, q.low)
val.high <- quantile(val, q.high)
val[val < val.low] <- val.low
val[val > val.high] <- val.high
pdf <- data.frame(x=object[[dimx]], y=object[[dimy]], gene=val)
colnames(pdf) <- c('sdimx','sdimy', 'gene')
ggplot(pdf, aes(x=sdimx,y=sdimy,color=gene)) + geom_point(size = 1) +
theme_minimal() + theme(legend.title = element_blank()) +
scale_color_gradient2(low = col.low, high = col.high) +
ggtitle(gene)
}
genePlot(vf.hippo, 'sdimx', 'sdimy', 'Gfap')
```
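The quantile clipping inside `genePlot` keeps a few extreme values from dominating the colour scale. The same step can be sketched in isolation; `clip_quantiles` is a hypothetical helper name used only for this illustration.

``` r
# Hypothetical helper: clamp values to the [q.low, q.high] quantile range
clip_quantiles <- function(val, q.low = 0.01, q.high = 0.99) {
  bounds <- quantile(val, c(q.low, q.high))
  pmin(pmax(val, bounds[1]), bounds[2])
}

# Toy expression values with one low and one high outlier
x <- c(-50, 1:9, 200)
clipped <- clip_quantiles(x, q.low = 0.1, q.high = 0.9)
range(clipped)
```

Values inside the two bounds are left unchanged, so only the outliers are pulled in.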
## Multi-sample analysis
This section demonstrates multi-sample analysis. Such an
approach is appropriate when analysing multiple spatial omics datasets
with non-contiguous spatial coordinates, and when large batch effects
are not present.
Here, we use a mouse hippocampus VeraFISH dataset provided with the
*Banksy* package.
``` r
data(hippocampus)
head(hippocampus$expression[,1:5])
```
## cell_1276 cell_8890 cell_691 cell_396 cell_9818
## Sparcl1 45 0 11 22 0
## Slc1a2 17 0 6 5 0
## Map 10 0 12 16 0
## Sqstm1 26 0 0 2 0
## Atp1a2 0 0 4 3 0
## Tnc 0 0 0 0 0
``` r
head(hippocampus$locations)
```
## sdimx sdimy
## cell_1276 -13372.899 15776.37
## cell_8890 8941.101 15866.37
## cell_691 -14882.899 15896.37
## cell_396 -15492.899 15835.37
## cell_9818 11308.101 15846.37
## cell_11310 14894.101 15810.37
For demonstration purposes, we create three separate datasets by
splitting the data.
``` r
# Number of groups
n_groups = 3
group_names = paste0('group', seq(n_groups))
group_size = 1000
starts = seq(1, by=group_size, length.out=n_groups)
ends = starts + group_size - 1
# List of Seurat objects
seu_list = lapply(seq(n_groups), function(i) {
idx = seq(starts[i], ends[i])
seu = CreateSeuratObject(
counts = hippocampus$expression[,idx],
meta.data = data.frame(scale(hippocampus$locations[idx,], scale = FALSE))
)
# Set original identity of cell
seu$orig.ident = group_names[i]
seu
})
seu_list
```
## [[1]]
## An object of class Seurat
## 120 features across 1000 samples within 1 assay
## Active assay: RNA (120 features, 0 variable features)
## 1 layer present: counts
##
## [[2]]
## An object of class Seurat
## 120 features across 1000 samples within 1 assay
## Active assay: RNA (120 features, 0 variable features)
## 1 layer present: counts
##
## [[3]]
## An object of class Seurat
## 120 features across 1000 samples within 1 assay
## Active assay: RNA (120 features, 0 variable features)
## 1 layer present: counts
Perform normalisation for each dataset.
``` r
seu_list = lapply(seu_list, NormalizeData,
scale.factor = 100, normalization.method = 'RC')
```
Merge the datasets. Note that the spatial coordinates overlap.
``` r
# Merge
seu = Reduce(merge, seu_list)
seu = JoinLayers(seu) # run this for Seurat v5 objects
# Plot spatial coordinates colored by group
plot(FetchData(seu, c('sdimx', 'sdimy')), col = factor(seu$orig.ident))
```
Now run BANKSY. For multi-sample analysis, the argument `group` must be
provided, which specifies the name of the metadata column that gives the
assignment of each cell or spot to its original Seurat object. Here, we
use `orig.ident`. Internally, providing the `group` argument tells the
function to compute neighborhood matrices based on locations staggered
by `group`, ensuring that cells from different spatial datasets do not
overlap. The staggered locations are stored in the metadata for sanity
checking. The `split.scale` argument allows for within-group scaling,
accounting for minor differences in datasets.
``` r
# Grouping variable
head(seu@meta.data)
```
## orig.ident nCount_RNA nFeature_RNA sdimx sdimy
## cell_1276 group1 266 51 -11933.19 1366.934
## cell_8890 group1 13 3 10380.81 1456.934
## cell_691 group1 132 36 -13443.19 1486.934
## cell_396 group1 95 27 -14053.19 1425.934
## cell_9818 group1 10 5 12747.81 1436.934
## cell_11310 group1 15 5 16333.81 1400.934
``` r
table(seu$orig.ident)
```
##
## group1 group2 group3
## 1000 1000 1000
``` r
# Run BANKSY
seu = RunBanksy(seu, lambda = 0.2, assay = 'RNA', slot = 'data',
dimx = 'sdimx', dimy = 'sdimy', features = 'all',
group = 'orig.ident', split.scale = TRUE, k_geom = 15)
# Staggered locations added to metadata
head(seu@meta.data)
```
## orig.ident nCount_RNA nFeature_RNA sdimx sdimy
## cell_1276 group1 266 51 -11933.19 1366.934
## cell_8890 group1 13 3 10380.81 1456.934
## cell_691 group1 132 36 -13443.19 1486.934
## cell_396 group1 95 27 -14053.19 1425.934
## cell_9818 group1 10 5 12747.81 1436.934
## cell_11310 group1 15 5 16333.81 1400.934
## staggered_sdimx staggered_sdimy
## cell_1276 3728.686 1366.934
## cell_8890 26042.686 1456.934
## cell_691 2218.686 1486.934
## cell_396 1608.686 1425.934
## cell_9818 28409.686 1436.934
## cell_11310 31995.686 1400.934
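The idea of staggering can be sketched in base R. The helper below is hypothetical and is not the implementation used by `RunBanksy`; it only shows one way to shift each group along the x axis so that groups no longer overlap, with an arbitrary `gap` between them.

``` r
# Toy locations for two groups whose x ranges overlap (hypothetical values)
locs <- data.frame(
  sdimx = c(-10, 0, 10, -5, 5, 15),
  sdimy = c(1, 2, 3, 1, 2, 3),
  group = rep(c("group1", "group2"), each = 3)
)

# Hypothetical helper: place groups side by side along x
stagger_x <- function(x, group, gap = 1) {
  out <- numeric(length(x))
  offset <- 0
  for (g in unique(group)) {
    idx <- group == g
    shifted <- x[idx] - min(x[idx]) + offset  # start this group at the current offset
    out[idx] <- shifted
    offset <- max(shifted) + gap              # the next group starts past this one
  }
  out
}

locs$staggered_sdimx <- stagger_x(locs$sdimx, locs$group)
locs
```

Distances within a group are unchanged by the shift, so nearest neighbors computed on the staggered coordinates stay within each group as long as the groups are separated.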
The rest of the workflow follows as before:
``` r
seu = RunPCA(seu, assay = 'BANKSY', features = rownames(seu), npcs = 30)
seu = RunUMAP(seu, dims = 1:30)
seu = FindNeighbors(seu, dims = 1:30)
seu = FindClusters(seu, resolution = 1)
```
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3000
## Number of edges: 171757
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8094
## Number of communities: 12
## Elapsed time: 0 seconds
Visualise clusters:
``` r
mypal <- kelly()[-1]
DimPlot(seu, pt.size = 0.25, label = TRUE, label.size = 3, cols = mypal)
```
``` r
FeatureScatter(seu, 'staggered_sdimx', 'staggered_sdimy', pt.size = 0.75, cols = mypal)
```
## Spatial data integration with Harmony
BANKSY can be used with Harmony for integrating multiple spatial omics
datasets in the presence of strong batch effects.
Download the data.
``` r
library(spatialLIBD)
library(ExperimentHub)
library(harmony)
ehub <- ExperimentHub::ExperimentHub()
spe <- spatialLIBD::fetch_data(type = "spe", eh = ehub)
imgData(spe) <- NULL
assay(spe, "logcounts") <- NULL
reducedDims(spe) <- NULL
rowData(spe) <- NULL
colData(spe) <- DataFrame(
sample_id = spe$sample_id,
clust_annotation = factor(
addNA(spe$layer_guess_reordered_short),
exclude = NULL, labels = seq(8)
),
in_tissue = spe$in_tissue,
row.names = colnames(spe)
)
invisible(gc())
# Subset to first sample of each subject
sample_names <- c("151507", "151669", "151673")
spe_list <- lapply(sample_names, function(x) spe[, spe$sample_id == x])
rm(spe)
invisible(gc())
```
Normalise the data and compute highly variable features.
``` r
# Convert to Seurat and Normalize data
seu_list <- lapply(spe_list, function(x) {
x <- as.Seurat(x, data = NULL)
NormalizeData(x, scale.factor = 3000, normalization.method = 'RC')
})
# Compute HVGs for each dataset and take the union
hvgs <- lapply(seu_list, function(x) {
VariableFeatures(FindVariableFeatures(x, nfeatures = 2000))
})
hvgs <- Reduce(union, hvgs)
# Subset to HVGs
seu_list <- lapply(seu_list, function(x) x[hvgs,])
seu <- Reduce(merge, seu_list)
locs <- do.call(rbind.data.frame, lapply(spe_list, spatialCoords))
seu@meta.data <- cbind(seu@meta.data, locs)
seu
```
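Taking the union with `Reduce(union, ...)` works on any list of character vectors. A minimal sketch with made-up gene names:

``` r
# Hypothetical per-sample highly variable genes
hvg_per_sample <- list(
  sample1 = c("GeneA", "GeneB", "GeneC"),
  sample2 = c("GeneB", "GeneD"),
  sample3 = c("GeneC", "GeneE")
)

# Fold union() over the list: each gene appears once, in order of first appearance
hvg_union <- Reduce(union, hvg_per_sample)
hvg_union
```

Because it is a union, the combined set can contain more genes than the `nfeatures` value requested for any single sample.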
Run BANKSY. When analysing multiple samples, the argument `group` must
be provided, which specifies the name of the metadata column that gives
the assignment of each cell or spot to its original Seurat object. Here,
we use `sample_id`. Internally, providing the `group` argument tells the
function to compute neighborhood matrices based on locations staggered
by `group`, ensuring that cells from different spatial datasets do not
overlap. The staggered locations are stored in the metadata for sanity
checking. Within-group scaling has little effect in the presence of
strong batch effects, so we set `split.scale=FALSE` for efficiency.
``` r
# Grouping variable
head(seu@meta.data)
table(seu$sample_id)
sdimx <- 'pxl_col_in_fullres'
sdimy <- 'pxl_row_in_fullres'
# Run BANKSY
seu <- RunBanksy(seu, lambda = 0.2, assay = 'originalexp', slot = 'data',
dimx = sdimx, dimy = sdimy, features = 'all',
group = 'sample_id', split.scale = FALSE, k_geom = 6)
```
Compute a spatially-aware embedding with PCA on the BANKSY matrix, and
run Harmony on this embedding.
``` r
seu <- RunPCA(seu, assay = 'BANKSY', features = rownames(seu), npcs = 10)
seu <- RunHarmony(seu, group.by.vars='sample_id')
```
The rest of the workflow follows as before:
``` r
seu <- RunUMAP(seu, dims = 1:10, reduction = 'harmony')
seu <- FindNeighbors(seu, dims = 1:10, reduction = 'harmony')
seu <- FindClusters(seu, resolution = 0.4)
```
Visualise clusters:
``` r
DimPlot(seu, pt.size = 0.25, label = TRUE, label.size = 3, cols = mypal)
FeatureScatter(seu, 'staggered_sdimx', 'staggered_sdimy', cols = mypal, pt.size = 0.75)
```
## Getting help
For more information, visit
This vignette demonstrates how to launch a UCSC Cell Browser instance populated with data from a Seurat object. If you use the Cell Browser, please cite:
UCSC Single Cell Browser
Maximilian Haeussler, Nikolay Markov, Brian Raney, and Lucas Seninge
Documentation: https://cellbrowser.readthedocs.io
Prerequisites to install:
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
InstallData("pbmc3k")
pbmc3k <- LoadData("pbmc3k", type = "pbmc3k.final")
ExportToCellbrowser(pbmc3k, dir = "out", cb.dir = "cb_out", port = 8080, reductions = "umap")
# Remember to stop your cell browser instance when done
StopCellbrowser()
Compiled: April 20, 2020
This vignette demonstrates the use of the CoGAPS package on Seurat objects.
Decomposing cell identity for transfer learning across cellular measurements, platforms, tissues, and species
Genevieve L. Stein-O’Brien, Brian S. Clark, Thomas Sherman, Cristina Zibetti, Qiwen Hu, Rachel Sealfon, Sheng Liu, Jiang Qian, Carlo Colantuoni, Seth Blackshaw, Loyal A. Goff, Elana J. Fertig
Cell Systems, 2019.
doi: 10.1016/j.cels.2019.04.004
Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/CoGAPS.html
Prerequisites to install:
library(Seurat)
library(SeuratWrappers)
library(SeuratData)
library(CoGAPS)
We suggest using a high number of iterations when running CoGAPS so that the algorithm can converge. Once the system has converged, the results are fairly robust. This example used 50,000 iterations, and the runtime was roughly five hours for each run (three patterns and ten patterns). We used Amazon Web Services (AWS), a cloud computing service, to run CoGAPS. An example to run locally is featured later on.
AWS was used to run the below section of CoGAPS to look for three patterns
To learn more about this dataset, type ?pbmc3k
InstallData("pbmc3k")
data("pbmc3k.final")
params <- CogapsParams(singleCell = TRUE, sparseOptimization = TRUE, seed = 123, nIterations = 50000,
nPatterns = 3, distributed = "genome-wide")
params <- setDistributedParams(params, nSets = 5)
pbmc3k.final <- RunCoGAPS(pbmc3k.final, temp.file = TRUE, params = params)
The two major lineages of blood cells are categorized as either myeloid or lymphoid. This specialization requires transcriptional diversification during lineage commitment. There are specific genes related to each of these lineages. In our data, CoGAPS identifies distinct patterns that segregate cells by immune lineage as shown below.
VlnPlot(pbmc3k.final, features = "CoGAPS_3")
VlnPlot(pbmc3k.final, features = "CoGAPS_1")
DimPlot(pbmc3k.final, reduction = "CoGAPS", pt.size = 0.5, dims = c(1, 3))
AWS was used to run the below section of CoGAPS to look for ten patterns.
To learn more about this dataset, type ?pbmc3k
InstallData("pbmc3k")
data("pbmc3k.final")
params <- CogapsParams(singleCell = TRUE, sparseOptimization = TRUE, seed = 123, nIterations = 50000,
nPatterns = 10, distributed = "genome-wide")
params <- setDistributedParams(params, nSets = 5)
pbmc3k.final <- RunCoGAPS(object = pbmc3k.final, temp.file = TRUE, params = params)
Both the myeloid and lymphoid lineages give rise to many different cell types critical to the immune system. CoGAPS is able to discern cell-type-specific patterns, such as those shown below for DC (CoGAPS_3) and B (CoGAPS_4) cells. Importantly, CoGAPS is also able to identify phenotypic subtypes within a population of cells, such as FCGR3A+ Monocytes (CoGAPS_6).
VlnPlot(pbmc3k.final, features = "CoGAPS_3")
VlnPlot(pbmc3k.final, features = "CoGAPS_4")
VlnPlot(pbmc3k.final, features = "CoGAPS_6")
DimPlot(pbmc3k.final, reduction = "CoGAPS", pt.size = 0.5, dims = c(3, 4))
DimPlot(pbmc3k.final, reduction = "CoGAPS", pt.size = 0.5, dims = c(3, 6))
DimPlot(pbmc3k.final, reduction = "CoGAPS", pt.size = 0.5, dims = c(4, 6))
For example purposes, we will run locally using 5,000 iterations. Note: Results may differ because of compiler dependence. The Boost random number processor was used for this example.
To learn more about this dataset, type ?pbmc3k
InstallData("pbmc3k")
data("pbmc3k.final")
pbmc3k.final <- RunCoGAPS(object = pbmc3k.final, nPatterns = 3, nIterations = 5000, outputFrequency = 1000,
sparseOptimization = TRUE, nThreads = 1, distributed = "genome-wide", singleCell = TRUE, seed = 891)
DimPlot(pbmc3k.final, reduction = "CoGAPS", pt.size = 0.5, dims = c(3, 2))
VlnPlot(pbmc3k.final, features = "CoGAPS_2")
VlnPlot(pbmc3k.final, features = "CoGAPS_3")
In addition to providing the data, the user can also specify an uncertainty measurement: the standard deviation of each entry in the data matrix. By default, CoGAPS assumes that the standard deviation matrix is 10% of the data matrix. This is a reasonable heuristic, but for specific types of data you may be able to provide better information. An uncertainty matrix can be specified using the uncertainty argument when running CoGAPS.
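An uncertainty matrix has the same dimensions as the data matrix. The base R sketch below builds one from a toy matrix using the 10% heuristic described above. The lower bound `min_sd` is an assumption added for this illustration, so that zero entries do not receive zero uncertainty; it is not taken from the CoGAPS defaults.

``` r
# Toy non-negative data matrix (hypothetical values standing in for expression data)
datMat <- matrix(c(0, 2, 10,
                   5, 0, 40), nrow = 2, byrow = TRUE)

# Standard deviation taken as 10% of each entry, mirroring the heuristic
datMat.uncertainty <- 0.1 * datMat

# Assumption for this sketch: floor the uncertainty at a small positive value
min_sd <- 0.1
datMat.uncertainty <- pmax(datMat.uncertainty, min_sd)
datMat.uncertainty
```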
pbmc3k.final <- RunCoGAPS(pbmc3k.final, uncertainty = datMat.uncertainty, nPatterns = 10, nIterations = 100,
outputFrequency = 100, sparseOptimization = TRUE, nThreads = 1, singleCell = TRUE, distributed = "genome-wide")
Non-negative matrix factorization algorithms typically require long computation times, and CoGAPS is no exception. The simplest way to run CoGAPS in parallel is to provide the nThreads argument. This allows the underlying algorithm to run on multiple threads and has no effect on the mathematics of the algorithm. For more information on running CoGAPS in parallel, see the CoGAPS vignette.
pbmc3k.final <- RunCoGAPS(pbmc3k.final, nPatterns = 10, nIterations = 100, outputFrequency = 100,
sparseOptimization = TRUE, nThreads = 3, singleCell = TRUE, distributed = "genome-wide")
Visit the following resources to learn more about CoGAPS and running CoGAPS outside of the Seurat environment:
This vignette demonstrates the use of the Conos package in Seurat. Commands and parameters are based off of the Conos tutorial. If you use Conos in your work, please cite:
Joint analysis of heterogeneous single-cell RNA-seq dataset collections
Nikolas Barkas, Viktor Petukhov, Daria Nikolaeva, Yaroslav Lozinsky, Samuel Demharter, Konstantin Khodosevich, Peter V. Kharchenko
Nature Methods, 2019.
Prerequisites to install:
library(conos)
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
To learn more about this dataset, type ?pbmcsca
InstallData("pbmcsca")
data("pbmcsca")
pbmcsca.panel <- SplitObject(pbmcsca, split.by = "Method")
for (i in 1:length(pbmcsca.panel)) {
pbmcsca.panel[[i]] <- NormalizeData(pbmcsca.panel[[i]]) %>% FindVariableFeatures() %>% ScaleData() %>%
RunPCA(verbose = FALSE)
}
pbmcsca.con <- Conos$new(pbmcsca.panel)
pbmcsca.con$buildGraph(k = 15, k.self = 5, space = "PCA", ncomps = 30, n.odgenes = 2000, matching.method = "mNN",
metric = "angular", score.component.variance = TRUE, verbose = TRUE)
pbmcsca.con$findCommunities()
pbmcsca.con$embedGraph()
pbmcsca <- as.Seurat(pbmcsca.con)
DimPlot(pbmcsca, reduction = "largeVis", group.by = c("Method", "ident", "CellType"), ncol = 3)
To learn more about this dataset, type ?ifnb
InstallData("ifnb")
data("ifnb")
ifnb.panel <- SplitObject(ifnb, split.by = "stim")
for (i in 1:length(ifnb.panel)) {
ifnb.panel[[i]] <- NormalizeData(ifnb.panel[[i]]) %>% FindVariableFeatures() %>% ScaleData() %>%
RunPCA(verbose = FALSE)
}
ifnb.con <- Conos$new(ifnb.panel)
ifnb.con$buildGraph(k = 15, k.self = 5, space = "PCA", ncomps = 30, n.odgenes = 2000, matching.method = "mNN",
metric = "angular", score.component.variance = TRUE, verbose = TRUE)
ifnb.con$findCommunities()
ifnb.con$embedGraph()
ifnb <- as.Seurat(ifnb.con)
DimPlot(ifnb, reduction = "largeVis", group.by = c("stim", "ident", "seurat_annotations"), ncol = 3)
To learn more about this dataset, type ?panc8
InstallData("panc8")
data("panc8")
panc8.panel <- SplitObject(panc8, split.by = "replicate")
for (i in 1:length(panc8.panel)) {
panc8.panel[[i]] <- NormalizeData(panc8.panel[[i]]) %>% FindVariableFeatures() %>% ScaleData() %>%
RunPCA(verbose = FALSE)
}
panc8.con <- Conos$new(panc8.panel)
panc8.con$buildGraph(k = 15, k.self = 5, space = "PCA", ncomps = 30, n.odgenes = 2000, matching.method = "mNN",
metric = "angular", score.component.variance = TRUE, verbose = TRUE)
panc8.con$findCommunities()
panc8.con$embedGraph()
panc8 <- as.Seurat(panc8.con)
DimPlot(panc8, reduction = "largeVis", group.by = c("replicate", "ident", "celltype"), ncol = 3)
This vignette demonstrates how to run fastMNN on Seurat objects. Parameters and commands are based off of the fastMNN help page. If you use fastMNN, please cite:
Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors
Laleh Haghverdi, Aaron T L Lun, Michael D Morgan & John C Marioni
Nature Biotechnology, 2018
doi: 10.1038/nbt.4091
Bioconductor: https://bioconductor.org/packages/release/bioc/html/batchelor.html
Prerequisites to install:
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
To learn more about this dataset, type ?pbmcsca
InstallData("pbmcsca")
data("pbmcsca")
pbmcsca <- NormalizeData(pbmcsca)
pbmcsca <- FindVariableFeatures(pbmcsca)
pbmcsca <- RunFastMNN(object.list = SplitObject(pbmcsca, split.by = "Method"))
pbmcsca <- RunUMAP(pbmcsca, reduction = "mnn", dims = 1:30)
pbmcsca <- FindNeighbors(pbmcsca, reduction = "mnn", dims = 1:30)
pbmcsca <- FindClusters(pbmcsca)
DimPlot(pbmcsca, group.by = c("Method", "ident", "CellType"), ncol = 3)
To learn more about this dataset, type ?ifnb
InstallData("ifnb")
data("ifnb")
ifnb <- NormalizeData(ifnb)
ifnb <- FindVariableFeatures(ifnb)
ifnb <- RunFastMNN(object.list = SplitObject(ifnb, split.by = "stim"))
ifnb <- RunUMAP(ifnb, reduction = "mnn", dims = 1:30)
ifnb <- FindNeighbors(ifnb, reduction = "mnn", dims = 1:30)
ifnb <- FindClusters(ifnb)
DimPlot(ifnb, group.by = c("stim", "ident", "seurat_annotations"), ncol = 3)
To learn more about this dataset, type ?panc8
InstallData("panc8")
data("panc8")
panc8 <- NormalizeData(panc8)
panc8 <- FindVariableFeatures(panc8)
panc8 <- RunFastMNN(object.list = SplitObject(panc8, split.by = "replicate")[c("celseq", "celseq2",
"fluidigmc1", "smartseq2")])
panc8 <- RunUMAP(panc8, reduction = "mnn", dims = 1:30)
panc8 <- FindNeighbors(panc8, reduction = "mnn", dims = 1:30)
panc8 <- FindClusters(panc8)
DimPlot(panc8, group.by = c("replicate", "ident", "celltype"), ncol = 3)
Compiled: July 15, 2020
This vignette demonstrates how to run GLM-PCA, which implements a generalized version of PCA for non-normally distributed data, on a Seurat object. If you use this, please cite:
Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model
F. William Townes, Stephanie C. Hicks, Martin J. Aryee & Rafael A. Irizarry
Genome Biology, 2019
doi: https://doi.org/10.1186/s13059-019-1861-6
GitHub: https://github.com/willtownes/glmpca
CRAN: https://cran.r-project.org/web/packages/glmpca/index.html
Prerequisites to install:
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
library(glmpca)
library(scry)
To learn more about this dataset, type ?pbmc3k
InstallData("pbmc3k")
data("pbmc3k")
# Initial processing to select variable features
m <- GetAssayData(pbmc3k, slot = "counts", assay = "RNA")
devs <- scry::devianceFeatureSelection(m)
dev_ranked_genes <- rownames(pbmc3k)[order(devs, decreasing = TRUE)]
topdev <- head(dev_ranked_genes, 2000)
# run GLM-PCA on Seurat object.
# Uses Poisson model by default
# Note that data in the counts slot is used
# We choose 10 dimensions for computational efficiency
ndims <- 10
pbmc3k <- RunGLMPCA(pbmc3k, features = topdev, L = ndims)
pbmc3k <- FindNeighbors(pbmc3k, reduction = 'glmpca', dims = 1:ndims, verbose = FALSE)
pbmc3k <- FindClusters(pbmc3k, verbose = FALSE)
pbmc3k <- RunUMAP(pbmc3k, reduction = 'glmpca', dims = 1:ndims, verbose = FALSE)
# visualize markers
features.plot <- c('CD3D', 'MS4A1', 'CD8A', 'GZMK', 'GZMB', 'FCGR3A')
DimPlot(pbmc3k)
Do the learned clusters overlap with the original annotation?
with(pbmc3k[[]], table(seurat_annotations, seurat_clusters))
## seurat_clusters
## seurat_annotations 0 1 2 3 4 5 6 7 8
## Naive CD4 T 168 484 0 3 42 0 0 0 0
## Memory CD4 T 405 45 0 0 30 0 0 0 3
## CD14+ Mono 0 0 469 0 0 8 0 3 0
## B 0 0 0 344 0 0 0 0 0
## CD8 T 7 0 0 0 254 0 9 0 1
## FCGR3A+ Mono 0 0 12 0 0 150 0 0 0
## NK 0 0 0 0 8 0 147 0 0
## DC 0 0 2 2 0 0 1 27 0
## Platelet 0 0 1 0 0 0 0 0 13
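A cross-tabulation like the one above can be summarised further. The sketch below uses made-up labels for eight cells to show how `table` and `prop.table` give, for each annotated cell type, the share of cells in each cluster and the dominant cluster.

``` r
# Hypothetical annotation and cluster labels for eight cells
annotation <- c("B", "B", "B", "NK", "NK", "CD8 T", "CD8 T", "CD8 T")
cluster    <- c("3", "3", "3", "6", "6", "4", "4", "6")

tab <- table(annotation, cluster)

# Fraction of each annotated cell type that falls in each cluster (rows sum to 1)
row_frac <- prop.table(tab, margin = 1)

# Dominant cluster for each annotated cell type
dominant <- colnames(tab)[apply(tab, 1, which.max)]
names(dominant) <- rownames(tab)
dominant
```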
pbmc3k <- NormalizeData(pbmc3k, verbose = FALSE)
FeaturePlot(pbmc3k, features.plot, ncol = 2)
This vignette demonstrates the use of the Harmony package in Seurat. Commands and parameters are based off of the Harmony use page. If you use Harmony in your work, please cite:
Fast, sensitive, and flexible integration of single cell data with Harmony
Ilya Korsunsky, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-Ru Loh, Soumya Raychaudhuri
bioRxiv, 2019
doi: 10.1101/461954v2
Prerequisites to install:
Note that SeuratWrappers is not necessary, as the wrapper functions were generously provided by the Harmony authors, and are included when installing Harmony.
library(harmony)
library(Seurat)
library(SeuratData)
To learn more about this dataset, type ?pbmcsca
InstallData("pbmcsca")
data("pbmcsca")
pbmcsca <- NormalizeData(pbmcsca) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA(verbose = FALSE)
pbmcsca <- RunHarmony(pbmcsca, group.by.vars = "Method")
pbmcsca <- RunUMAP(pbmcsca, reduction = "harmony", dims = 1:30)
pbmcsca <- FindNeighbors(pbmcsca, reduction = "harmony", dims = 1:30) %>% FindClusters()
DimPlot(pbmcsca, group.by = c("Method", "ident", "CellType"), ncol = 3)
To learn more about this dataset, type ?ifnb
InstallData("ifnb")
data("ifnb")
ifnb <- NormalizeData(ifnb) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA(verbose = FALSE)
ifnb <- RunHarmony(ifnb, group.by.vars = "stim")
ifnb <- RunUMAP(ifnb, reduction = "harmony", dims = 1:30)
ifnb <- FindNeighbors(ifnb, reduction = "harmony", dims = 1:30) %>% FindClusters()
DimPlot(ifnb, group.by = c("stim", "ident", "seurat_annotations"), ncol = 3)
To learn more about this dataset, type ?panc8
InstallData("panc8")
data("panc8")
panc8 <- NormalizeData(panc8) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA(verbose = FALSE)
panc8 <- RunHarmony(panc8, group.by.vars = "replicate")
panc8 <- RunUMAP(panc8, reduction = "harmony", dims = 1:30)
panc8 <- FindNeighbors(panc8, reduction = "harmony", dims = 1:30) %>% FindClusters()
DimPlot(panc8, group.by = c("replicate", "ident", "celltype"), ncol = 3)
NOTE: Please update your liger version to 0.5.0 or above before following this tutorial.
This vignette demonstrates how to run LIGER on Seurat objects. Parameters and commands are based on the LIGER tutorial. If you use LIGER, please cite:
Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity
Joshua Welch, Velina Kozareva, Ashley Ferreira, Charles Vanderburg, Carly Martin, Evan Z. Macosko
Cell, 2019.
Prerequisites to install:
library(rliger)
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
In order to replicate LIGER’s multi-dataset functionality, we will use the split.by parameter to preprocess the Seurat object on subsets of the data belonging to each dataset separately. Also, as LIGER does not center data when scaling, we will skip that step as well.
RunQuantileNorm produces joint clusters, but users can also optionally perform Louvain community detection (FindNeighbors and FindClusters) on the integrated latent space from iNMF.
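The idea of scaling within each dataset without centering can be sketched in base R. This is an illustration on a toy matrix and not the computation `ScaleData` performs internally; here base R's `scale(center = FALSE)` divides each gene by its root mean square within a dataset. Skipping the centering step keeps the values non-negative, which a non-negative factorization such as iNMF needs.

``` r
set.seed(42)

# Toy normalized expression: 4 genes x 6 cells, with cells drawn from two datasets
expr <- matrix(rexp(24), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6)))
dataset <- rep(c("A", "B"), each = 3)

# Scale each gene within each dataset, without centering
scaled <- expr
for (d in unique(dataset)) {
  cols <- dataset == d
  scaled[, cols] <- t(scale(t(expr[, cols]), center = FALSE))
}

# Values stay non-negative because nothing was subtracted
all(scaled >= 0)
```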
To learn more about this dataset, type ?pbmcsca
InstallData("pbmcsca")
data("pbmcsca")
# Please update your `liger` version to 0.5.0 or above before following this tutorial
pbmcsca <- NormalizeData(pbmcsca)
pbmcsca <- FindVariableFeatures(pbmcsca)
pbmcsca <- ScaleData(pbmcsca, split.by = "Method", do.center = FALSE)
pbmcsca <- RunOptimizeALS(pbmcsca, k = 20, lambda = 5, split.by = "Method")
pbmcsca <- RunQuantileNorm(pbmcsca, split.by = "Method")
# You can optionally perform Louvain clustering (`FindNeighbors` and `FindClusters`) after
# `RunQuantileNorm` according to your needs
pbmcsca <- FindNeighbors(pbmcsca, reduction = "iNMF", dims = 1:20)
pbmcsca <- FindClusters(pbmcsca, resolution = 0.3)
# Dimensional reduction and plotting
pbmcsca <- RunUMAP(pbmcsca, dims = 1:ncol(pbmcsca[["iNMF"]]), reduction = "iNMF")
DimPlot(pbmcsca, group.by = c("Method", "ident", "CellType"), ncol = 3)
To learn more about this dataset, type ?ifnb
InstallData("ifnb")
data("ifnb")
# Please update your `liger` version to 0.5.0 or above before following this tutorial.
ifnb <- NormalizeData(ifnb)
ifnb <- FindVariableFeatures(ifnb)
ifnb <- ScaleData(ifnb, split.by = "stim", do.center = FALSE)
ifnb <- RunOptimizeALS(ifnb, k = 20, lambda = 5, split.by = "stim")
ifnb <- RunQuantileNorm(ifnb, split.by = "stim")
# You can optionally perform Louvain clustering (`FindNeighbors` and `FindClusters`) after
# `RunQuantileNorm` according to your needs
ifnb <- FindNeighbors(ifnb, reduction = "iNMF", dims = 1:20)
ifnb <- FindClusters(ifnb, resolution = 0.55)
# Dimensional reduction and plotting
ifnb <- RunUMAP(ifnb, dims = 1:ncol(ifnb[["iNMF"]]), reduction = "iNMF")
DimPlot(ifnb, group.by = c("stim", "ident", "seurat_annotations"), ncol = 3)
To learn more about this dataset, type ?panc8
InstallData("panc8")
data("panc8")
# Please update your `liger` version to 0.5.0 or above before following this tutorial.
panc8 <- NormalizeData(panc8)
panc8 <- FindVariableFeatures(panc8)
panc8 <- ScaleData(panc8, split.by = "replicate", do.center = FALSE)
panc8 <- RunOptimizeALS(panc8, k = 20, lambda = 5, split.by = "replicate")
panc8 <- RunQuantileNorm(panc8, split.by = "replicate")
# You can optionally perform Louvain clustering (`FindNeighbors` and `FindClusters`) after
# `RunQuantileNorm` according to your needs
panc8 <- FindNeighbors(panc8, reduction = "iNMF", dims = 1:20)
panc8 <- FindClusters(panc8, resolution = 0.4)
# Dimensional reduction and plotting
panc8 <- RunUMAP(panc8, dims = 1:ncol(panc8[["iNMF"]]), reduction = "iNMF")
DimPlot(panc8, group.by = c("replicate", "ident", "celltype"), ncol = 3)
This vignette demonstrates the use of the miQC package in Seurat. It is based on the miQC vignette. If you use miQC in your work, please cite:
miQC: An adaptive probabilistic framework for quality control of single-cell RNA-sequencing data
Ariel A. Hippen, Matias M. Falco, Lukas M. Weber, Erdogan Pekcan Erkan, Kaiyang Zhang, Jennifer Anne Doherty, Anna Vähärautio, Casey S. Greene, Stephanie C. Hicks
bioRxiv, 2021
Prerequisites to install:
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
library(flexmix)
This vignette provides a basic example of how to run miQC, which allows users to perform cell-wise filtering of single-cell RNA-seq data for quality control. Single-cell RNA-seq data is very sensitive to tissue quality and choice of experimental workflow; it’s critical to ensure compromised cells and failed cell libraries are removed. A high proportion of reads mapping to mitochondrial DNA is one sign of a damaged cell, so most analyses will remove cells with mtRNA over a certain threshold, but those thresholds can be arbitrary and/or detrimentally stringent, especially for archived tumor tissues. miQC jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to identify the low-quality cells in a given dataset.
To demonstrate how to run miQC on a single-cell RNA-seq dataset, we’ll use the pbmc3k dataset from the SeuratData package.
InstallData("pbmc3k")
data("pbmc3k")
pbmc3k
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
miQC requires two QC metrics for each single-cell dataset: (1) the number of unique genes detected per cell and (2) the percent mitochondrial reads. The number of unique genes detected per cell is typically calculated and stored automatically as metadata (nFeature_RNA) upon creation of a Seurat object with CreateSeuratObject.
In order to calculate the percent mitochondrial reads in a cell we can use PercentageFeatureSet. Human mitochondrial genes start with MT- (and mt- for murine genes). For other IDs, we recommend using a biomaRt query to map to chromosomal location and identify all mitochondrial genes. We add this as metadata here to the Seurat object as "percent.mt".
pbmc3k[["percent.mt"]] <- PercentageFeatureSet(object = pbmc3k, pattern = "^MT-")
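As a rough illustration of what this percentage means, the same quantity can be computed by hand in base R on a toy count matrix. The matrix and its gene and cell names are hypothetical; the sketch only mirrors the idea of dividing the counts of pattern-matched genes by each cell's total counts.

```r
# Toy count matrix: 3 genes x 3 cells (hypothetical values)
counts <- matrix(c( 5,  0,  5,
                   10, 10,  0,
                   85, 90, 15),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("MT-CO1", "MT-ND1", "ACTB"),
                                 c("cellA", "cellB", "cellC")))

# Rows whose names start with "MT-" are treated as mitochondrial genes
mt_rows <- grep("^MT-", rownames(counts))

# Percentage of each cell's counts that come from mitochondrial genes
percent_mt <- 100 * colSums(counts[mt_rows, , drop = FALSE]) / colSums(counts)
percent_mt
```

For murine data the pattern would be `"^mt-"` instead, as noted above.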
We can visually inspect the "percent.mt" and "nFeature_RNA" values in the pbmc3k dataset.
FeatureScatter(pbmc3k, feature1 = "nFeature_RNA", feature2 = "percent.mt")
We can see that most cells have a fairly low proportion of mitochondrial reads, given that the graph is much denser at the bottom. We likely have many cells that are intact and biologically meaningful. There are also a few cells that have almost half of their reads mapping to mitochondrial genes, which are likely broken or otherwise compromised and we will want to exclude from our downstream analysis. However, it’s not clear what boundaries to draw to separate the two groups of cells. With that in mind, we’ll generate a linear mixture model using the RunMiQC function. The linear mixture model will be stored in the misc slot of the Seurat object as "flexmix_model".
pbmc3k <- RunMiQC(pbmc3k, percent.mt = "percent.mt", nFeature_RNA = "nFeature_RNA", posterior.cutoff = 0.75,
model.slot = "flexmix_model")
This function is a wrapper for flexmix, which fits a mixture model on our data and returns the parameters of the two lines that best fit the data, as well as the posterior probability of each cell being derived from each distribution.
We can look at the parameters and posterior values directly with the following functions:
flexmix::parameters(Misc(pbmc3k, "flexmix_model"))
## Comp.1 Comp.2
## coef.(Intercept) 2.004939e+00 7.141952783
## coef.nFeature_RNA 3.222184e-05 -0.004138082
## sigma 7.409008e-01 2.121678523
head(flexmix::posterior(Misc(pbmc3k, "flexmix_model")))
## [,1] [,2]
## [1,] 0.9287557 0.07124429
## [2,] 0.7600390 0.23996098
## [3,] 0.9195142 0.08048576
## [4,] 0.9710883 0.02891168
## [5,] 0.9873697 0.01263027
## [6,] 0.9782177 0.02178231
Or we can visualize the model results using the PlotMiQC function, where "miQC.probability" represents the posterior probability of the cell belonging to the compromised condition:
PlotMiQC(pbmc3k, color.by = "miQC.probability") + ggplot2::scale_color_gradient(low = "grey", high = "purple")
As expected, the cells at the very top of the graph are almost certainly compromised, most likely to have been derived from the distribution with fewer unique genes and higher baseline mitochondrial expression.
We can use these posterior probabilities to choose which cells to keep, and visualize the consequences of this filtering with the PlotMiQC function. Recall when running "RunMiQC" we set the "posterior.cutoff" to be 0.75.
PlotMiQC(pbmc3k, color.by = "miQC.keep")
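To make the link between the fitted lines, the posterior probability, and the cutoff concrete, here is a minimal base R sketch. It reuses the parameters printed above, but several pieces are assumptions: the equal mixing weights in `prior` (flexmix estimates these from the data), the choice of Comp.2 as the compromised component (it has the higher intercept), the two example cells, and the "discard" label. It also ignores any additional safeguards RunMiQC may apply.

```r
# Line parameters copied from the flexmix output shown above
params <- data.frame(
  intercept = c(2.004939, 7.141952783),
  slope     = c(3.222184e-05, -0.004138082),
  sigma     = c(0.7409008, 2.121678523),
  row.names = c("Comp.1", "Comp.2")
)
prior <- c(0.5, 0.5)  # ASSUMPTION: equal mixing weights, not the fitted ones

# Posterior probability that a cell belongs to Comp.2 (assumed compromised)
posterior_compromised <- function(n_feature, percent_mt) {
  lik <- sapply(seq_len(nrow(params)), function(k) {
    dnorm(percent_mt,
          mean = params$intercept[k] + params$slope[k] * n_feature,
          sd = params$sigma[k])
  })
  lik <- matrix(lik, ncol = nrow(params))   # one row per cell
  weighted <- sweep(lik, 2, prior, `*`)     # prior times likelihood
  weighted[, 2] / rowSums(weighted)
}

# Two hypothetical cells: one healthy-looking, one with high percent.mt
cells <- data.frame(n_feature = c(800, 300), percent_mt = c(2, 20))
cells$probability <- posterior_compromised(cells$n_feature, cells$percent_mt)

# Apply the same 0.75 cutoff used in the RunMiQC call
cells$keep <- ifelse(cells$probability > 0.75, "discard", "keep")
cells
```

Raising the cutoff makes the filter more lenient, because a cell must look compromised with higher confidence before it is removed.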
To actually perform the filtering and remove the indicated cells from our Seurat object, we can subset the Seurat object as follows:
pbmc3k_filtered <- subset(pbmc3k, miQC.keep == "keep")
pbmc3k_filtered
## An object of class Seurat
## 13714 features across 2593 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
In most cases, a linear mixture model will be satisfactory as well as simplest, but RunMiQC also supports some non-linear mixture models: currently polynomials and b-splines. A user should only need to change the model.type parameter when making the model, and all visualization and filtering functions will work the same as with a linear model.
pbmc3k <- RunMiQC(pbmc3k, percent.mt = "percent.mt", nFeature_RNA = "nFeature_RNA", posterior.cutoff = 0.75,
model.slot = "flexmix_model", model.type = "spline")
PlotMiQC(pbmc3k, color.by = "miQC.keep")
Also, RunMiQC defaults to removing any cell with 75% or greater posterior probability of being compromised, but if we want to be more or less stringent, we can alter the posterior.cutoff parameter, like so:
pbmc3k <- RunMiQC(pbmc3k, percent.mt = "percent.mt", nFeature_RNA = "nFeature_RNA", posterior.cutoff = 0.9,
model.slot = "flexmix_model")
PlotMiQC(pbmc3k, color.by = "miQC.keep")
Note that when performing miQC multiple times on different samples for the same experiment, it’s recommended to select the same posterior.cutoff for all, to give consistency in addition to the flexibility of sample-specific models.
The miQC model is based on the assumption that there are a non-trivial number of compromised cells in the dataset, which is not true in all datasets. We recommend using FeatureScatter on a dataset before running miQC to see if the two-distribution model is appropriate. Look for the distinctive triangular shape where cells have a wide variety of mitochondrial percentages at lower gene counts and taper off to lower mitochondrial percentage at higher gene counts.
As an example of a dataset where there is not a significant number of compromised cells, so that the two-distribution assumption is not met, we simulate an extreme case using the "pbmc3k" dataset here.
set.seed(2021)
pbmc3k_extreme <- pbmc3k
simulated_percent_mt <- rnorm(mean = 2.5, sd = 0.2, n = ncol(pbmc3k_extreme))
pbmc3k_extreme$percent.mt <- ifelse(pbmc3k_extreme$nFeature_RNA > 400, simulated_percent_mt, pbmc3k_extreme$percent.mt)
simulated_percent_mt_2 <- runif(min = 0, max = 60, n = ncol(pbmc3k_extreme))
pbmc3k_extreme$percent.mt <- ifelse(pbmc3k_extreme$nFeature_RNA < 400, simulated_percent_mt_2, pbmc3k_extreme$percent.mt)
FeatureScatter(pbmc3k_extreme, feature1 = "nFeature_RNA", feature2 = "percent.mt")
The RunMiQC function will throw a warning if only one distribution is found. In these cases, we recommend using other filtering methods, such as a cutoff on mitochondrial percentage or percentile, by setting the "backup.option" parameter to one of c("percentile", "percent", "pass", "halt").
pbmc3k_extreme <- RunMiQC(pbmc3k_extreme, percent.mt = "percent.mt", nFeature_RNA = "nFeature_RNA",
posterior.cutoff = 0.9, model.slot = "flexmix_model", backup.option = "percentile", backup.percentile = 0.95)
## Warning in RunMiQC(pbmc3k_extreme, percent.mt = "percent.mt", nFeature_RNA =
## "nFeature_RNA", : flexmix returned only 1 cluster
## defaulting to backup.percentile for filtering
## Warning: Adding a command log without an assay associated with it
FeatureScatter(pbmc3k_extreme, feature1 = "nFeature_RNA", feature2 = "percent.mt", group.by = "miQC.keep")
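The percentile fallback amounts to keeping cells whose mitochondrial percentage does not exceed a chosen quantile. The base R sketch below shows that idea on made-up values; the vector `percent_mt`, the labels, and the exact comparison rule are assumptions and may not match the wrapper's internals.

```r
# Hypothetical mitochondrial percentages: 95 ordinary cells and 5 extreme ones
percent_mt <- c(seq(1, 5, length.out = 95), 20, 30, 40, 50, 60)

# Cells above the 95th percentile are flagged for removal
cutoff <- quantile(percent_mt, probs = 0.95)
keep <- ifelse(percent_mt <= cutoff, "keep", "discard")

table(keep)
```

Unlike the mixture model, this rule always removes a fixed fraction of cells, whether or not they are actually compromised.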
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pbmc3k.SeuratData_3.1.4 flexmix_2.3-17 lattice_0.20-41 SeuratWrappers_0.3.0
## [5] SeuratData_0.2.1 SeuratObject_4.0.1 Seurat_4.0.1
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.15 colorspace_2.0-2 deldir_0.2-10 modeltools_0.2-23 ellipsis_0.3.2
## [6] ggridges_0.5.3 rprojroot_2.0.2 spatstat.data_2.1-0 farver_2.1.0 leiden_0.3.7
## [11] listenv_0.8.0 remotes_2.3.0 ggrepel_0.9.1 fansi_0.5.0 R.methodsS3_1.8.1
## [16] codetools_0.2-18 splines_4.0.4 knitr_1.33 polyclip_1.10-0 jsonlite_1.7.2
## [21] ica_1.0-2 cluster_2.1.0 R.oo_1.24.0 png_0.1-7 uwot_0.1.10
## [26] shiny_1.6.0 sctransform_0.3.2 spatstat.sparse_2.0-0 BiocManager_1.30.15 compiler_4.0.4
## [31] httr_1.4.2 Matrix_1.3-3 fastmap_1.1.0 lazyeval_0.2.2 cli_3.0.1
## [36] later_1.2.0 formatR_1.9 htmltools_0.5.1.1 prettyunits_1.1.1 tools_4.0.4
## [41] rsvd_1.0.5 igraph_1.2.6 gtable_0.3.0 glue_1.4.2 RANN_2.6.1
## [46] reshape2_1.4.4 dplyr_1.0.6 rappdirs_0.3.3 Rcpp_1.0.6 scattermore_0.7
## [51] jquerylib_0.1.4 vctrs_0.3.8 nlme_3.1-152 lmtest_0.9-38 xfun_0.23
## [56] stringr_1.4.0 globals_0.14.0 ps_1.6.0 mime_0.10 miniUI_0.1.1.1
## [61] lifecycle_1.0.0 irlba_2.3.3 goftest_1.2-2 future_1.21.0 MASS_7.3-53
## [66] zoo_1.8-9 scales_1.1.1 spatstat.core_2.1-2 promises_1.2.0.1 spatstat.utils_2.1-0
## [71] parallel_4.0.4 RColorBrewer_1.1-2 yaml_2.2.1 curl_4.3.1 reticulate_1.20
## [76] pbapply_1.4-3 gridExtra_2.3 ggplot2_3.3.5 sass_0.4.0 rpart_4.1-15
## [81] stringi_1.6.2 highr_0.9 pkgbuild_1.2.0 rlang_0.4.11 pkgconfig_2.0.3
## [86] matrixStats_0.59.0 evaluate_0.14 tensor_1.5 ROCR_1.0-11 purrr_0.3.4
## [91] labeling_0.4.2 patchwork_1.1.1 htmlwidgets_1.5.3 cowplot_1.1.1 processx_3.5.2
## [96] tidyselect_1.1.1 parallelly_1.25.0 RcppAnnoy_0.0.18 plyr_1.8.6 magrittr_2.0.1
## [101] R6_2.5.0 generics_0.1.0 mgcv_1.8-33 pillar_1.6.1 withr_2.4.2
## [106] fitdistrplus_1.1-3 nnet_7.3-15 abind_1.4-5 survival_3.2-7 tibble_3.1.2
## [111] future.apply_1.7.0 crayon_1.4.1 KernSmooth_2.23-18 utf8_1.2.1 spatstat.geom_2.1-0
## [116] plotly_4.9.3 rmarkdown_2.8 grid_4.0.4 data.table_1.14.0 callr_3.7.0
## [121] digest_0.6.27 xtable_1.8-4 tidyr_1.1.3 httpuv_1.6.1 R.utils_2.10.1
## [126] stats4_4.0.4 munsell_0.5.0 viridisLite_0.4.0 bslib_0.2.5.1
This vignette demonstrates how to run trajectory inference and pseudotime calculations with Monocle 3 on Seurat objects. If you use Monocle 3, please cite:
The single-cell transcriptional landscape of mammalian organogenesis
Junyue Cao, Malte Spielmann, Xiaojie Qiu, Xingfan Huang, Daniel M. Ibrahim, Andrew J. Hill, Fan Zhang, Stefan Mundlos, Lena Christiansen, Frank J. Steemers, Cole Trapnell & Jay Shendure
Prerequisites to install:
library(monocle3)
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
library(ggplot2)
library(patchwork)
library(magrittr)
InstallData("hcabm40k")
data("hcabm40k")
hcabm40k <- SplitObject(hcabm40k, split.by = "orig.ident")
for (i in seq_along(hcabm40k)) {
hcabm40k[[i]] <- NormalizeData(hcabm40k[[i]]) %>% FindVariableFeatures()
}
features <- SelectIntegrationFeatures(hcabm40k)
for (i in seq_along(along.with = hcabm40k)) {
hcabm40k[[i]] <- ScaleData(hcabm40k[[i]], features = features) %>% RunPCA(features = features)
}
anchors <- FindIntegrationAnchors(hcabm40k, reference = c(1, 2), reduction = "rpca", dims = 1:30)
integrated <- IntegrateData(anchors, dims = 1:30)
integrated <- ScaleData(integrated)
integrated <- RunPCA(integrated)
integrated <- RunUMAP(integrated, dims = 1:30, reduction.name = "UMAP")
integrated <- FindNeighbors(integrated, dims = 1:30)
integrated <- FindClusters(integrated)
DimPlot(integrated, group.by = c("orig.ident", "ident"))
cds <- as.cell_data_set(integrated)
cds <- cluster_cells(cds)
p1 <- plot_cells(cds, show_trajectory_graph = FALSE)
p2 <- plot_cells(cds, color_cells_by = "partition", show_trajectory_graph = FALSE)
wrap_plots(p1, p2)
integrated.sub <- subset(as.Seurat(cds), monocle3_partitions == 1)
cds <- as.cell_data_set(integrated.sub)
cds <- learn_graph(cds)
plot_cells(cds, label_groups_by_cluster = FALSE, label_leaves = FALSE, label_branch_points = FALSE)
max.avp <- which.max(unlist(FetchData(integrated.sub, "AVP")))
max.avp <- colnames(integrated.sub)[max.avp]
cds <- order_cells(cds, root_cells = max.avp)
plot_cells(cds, color_cells_by = "pseudotime", label_cell_groups = FALSE, label_leaves = FALSE,
label_branch_points = FALSE)
# Set the assay back as 'integrated'
integrated.sub <- as.Seurat(cds, assay = "integrated")
FeaturePlot(integrated.sub, "monocle3_pseudotime")
This vignette demonstrates how to run Nebulosa on a Seurat object. If you use this, please cite:
Nebulosa recovers single cell gene expression signals by kernel density estimation
Jose Alquicira-Hernandez and Joseph E. Powell
(Under review), 2020.
doi: 10.18129
Due to the sparsity observed in single-cell data (e.g. RNA-seq, ATAC-seq), the visualization of cell features (e.g. gene, peak) is frequently affected and unclear, especially when it is overlaid with clustering to annotate cell types. Nebulosa is an R package to visualize data from single cells based on kernel density estimation. It aims to recover the signal from dropped-out features by incorporating the similarity between cells allowing a “convolution” of the cell features.
For this vignette, let’s use Nebulosa with the Seurat package. First, we’ll do a brief/standard data processing.
library("Nebulosa")
library("Seurat")
library("BiocFileCache")
Let’s download a dataset of 3k PBMCs (available from 10X Genomics). This same dataset is commonly used in Seurat vignettes. The code below will download, store, and uncompress the data in a temporary directory.
bfc <- BiocFileCache(ask = FALSE)
data_file <- bfcrpath(bfc, file.path("https://s3-us-west-2.amazonaws.com/10x.files/samples/cell",
"pbmc3k", "pbmc3k_filtered_gene_bc_matrices.tar.gz"))
untar(data_file, exdir = tempdir())
Then, we can read the gene expression matrix using the Read10X from Seurat
data <- Read10X(data.dir = file.path(tempdir(), "filtered_gene_bc_matrices", "hg19"))
Let’s create a Seurat object with features being expressed in at least 3 cells and cells expressing at least 200 genes.
pbmc <- CreateSeuratObject(counts = data, project = "pbmc3k", min.cells = 3, min.features = 200)
Remove outlier cells based on the number of genes being expressed in each cell (below 2500 genes) and expression of mitochondrial genes (below 5%).
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA < 2500 & percent.mt < 5)
Let’s use SCTransform to stabilize the variance of the data by regressing out the effect of the sequencing depth from each cell.
pbmc <- SCTransform(pbmc, verbose = FALSE)
Once the data is normalized and scaled, we can run a Principal Component Analysis (PCA) first to reduce the dimensions of our data from 26286 features to 50 principal components. To visualize the principal components, we can run a Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) using the first 30 principal components to obtain a two-dimensional space.
pbmc <- RunPCA(pbmc)
pbmc <- RunUMAP(pbmc, dims = 1:30)
To assess cell similarity, let’s cluster the data by constructing a Shared Nearest Neighbor (SNN) Graph using the first 30 principal components and applying the Louvain algorithm.
pbmc <- FindNeighbors(pbmc, dims = 1:30)
pbmc <- FindClusters(pbmc)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 113368
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8272
## Number of communities: 13
## Elapsed time: 0 seconds
The main function from Nebulosa is plot_density. For usability, it resembles the FeaturePlot function from Seurat.
Let’s plot the kernel density estimate for CD4 as follows
plot_density(pbmc, "CD4")
For comparison, let’s also plot a standard scatterplot using Seurat
FeaturePlot(pbmc, "CD4")
FeaturePlot(pbmc, "CD4", order = TRUE)
By smoothing the data, Nebulosa allows a better visualization of the global expression of CD4 in myeloid and CD4+ T cells. Notice that the “random” expression of CD4 in other areas of the plot is removed, as the expression of this gene is not supported by many cells in those areas. Furthermore, CD4+ cells appear to show a considerable dropout rate.
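The smoothing idea can be illustrated with a small expression-weighted Gaussian kernel estimate in base R. This is a simplified sketch, not Nebulosa's implementation: the coordinates, expression values, bandwidth, and the helper `weighted_density` are all hypothetical, and the result is left unnormalized up to a constant.

```r
# Hypothetical 2D embedding: cells 1-4 form one cluster, cells 5-8 another
coords <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1),
                c(10, 10), c(10, 11), c(11, 10), c(11, 11))

# The gene is detected in only two cells of the first cluster (dropout elsewhere)
expression <- c(3, 0, 2, 0, 0, 0, 0, 0)

# Expression-weighted Gaussian kernel density evaluated at every cell
weighted_density <- function(coords, w, bandwidth = 1) {
  sq_dist <- as.matrix(dist(coords))^2
  kernel <- exp(-sq_dist / (2 * bandwidth^2))
  as.vector(kernel %*% w) / sum(w)
}

dens <- weighted_density(coords, expression)
round(dens, 3)
```

Cells 2 and 4 have zero counts but still receive a high density because their neighbors express the gene, while the distant cluster stays near zero.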
Let’s plot the expression of CD4 with Nebulosa next to the clustering results
DimPlot(pbmc, label = TRUE, repel = TRUE)
We can now easily identify that clusters 0 and 2 correspond to CD4+ T cells if we plot CD3D too.
plot_density(pbmc, "CD3D")
Characterizing cell populations usually relies on more than a single marker. Nebulosa allows the visualization of the joint density from multiple features in a single plot.
Users familiar with PBMC datasets may know that CD8+ CCR7+ cells usually cluster next to CD4+ CCR7+ cells and separate from the rest of the CD8+ cells. Let’s aim to identify Naive CD8+ T cells. To do so, we can just add another gene to the vector containing the features to visualize.
p3 <- plot_density(pbmc, c("CD8A", "CCR7"))
p3 + plot_layout(ncol = 1)
Nebulosa can return a joint density plot by multiplying the densities from all query genes by using the joint = TRUE parameter:
p4 <- plot_density(pbmc, c("CD8A", "CCR7"), joint = TRUE)
p4 + plot_layout(ncol = 1)
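Multiplying per-gene densities highlights only the regions where both genes are supported. The base R sketch below shows this on toy data; the coordinates, the expression vectors `cd8a` and `ccr7`, and the helper `weighted_density` are hypothetical and only approximate what the package does internally.

```r
# Hypothetical embedding with three groups of two cells each
coords <- rbind(c(0, 0), c(0, 1),     # group A: CD8A only
                c(10, 0), c(10, 1),   # group B: both genes (with dropout)
                c(20, 0), c(20, 1))   # group C: CCR7 only
cd8a <- c(2, 1, 3, 0, 0, 0)
ccr7 <- c(0, 0, 0, 2, 1, 2)

# Expression-weighted Gaussian kernel density evaluated at every cell
weighted_density <- function(coords, w, bandwidth = 1) {
  sq_dist <- as.matrix(dist(coords))^2
  as.vector(exp(-sq_dist / (2 * bandwidth^2)) %*% w) / sum(w)
}

# Joint density: element-wise product of the single-gene densities
joint <- weighted_density(coords, cd8a) * weighted_density(coords, ccr7)
which(joint > 0.01)
```

Neither cell in group B has counts for both genes, yet the product still singles out that group because each gene is supported by a close neighbor.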
When compared to the clustering results, we can easily identify that Naive CD8+ T cells correspond to cluster 8.
Nebulosa returns the density estimates for each gene along with the joint density across all provided genes. By setting combine = FALSE, we can obtain a list of ggplot objects where the last plot corresponds to the joint density estimate.
p_list <- plot_density(pbmc, c("CD8A", "CCR7"), joint = TRUE, combine = FALSE)
p_list[[length(p_list)]]
Likewise, the identification of Naive CD4+ T cells becomes straightforward by combining CD4 and CCR7:
p4 <- plot_density(pbmc, c("CD4", "CCR7"), joint = TRUE)
p4 + plot_layout(ncol = 1)
Notice that these cells are mainly constrained to cluster 0:
p4[[3]]/DimPlot(pbmc, label = TRUE, repel = TRUE)
In summary, Nebulosa can be useful to recover the signal from dropped-out genes and improve their visualization in a two-dimensional space. We recommend using Nebulosa particularly for dropped-out genes. For fairly well-expressed genes, the direct visualization of the gene expression may be preferable. We encourage users to use Nebulosa along with the core visualization methods from the Seurat and Bioconductor environments as well as other visualization methods to draw more informed conclusions about their data.
This vignette demonstrates how to run PaCMAP, a dimensionality reduction method that can be used for providing robust and trustworthy visualization, on a Seurat object. If you use our work, please cite both papers:
Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization
Yingfan Wang, Haiyang Huang, Cynthia Rudin & Yaron Shaposhnik
Journal of Machine Learning Research, 2021
doi: https://doi.org/10.48550/arXiv.2012.04456
Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization
Haiyang Huang, Yingfan Wang, Cynthia Rudin and Edward P. Browne
Communications biology, 2022
Prerequisites to install:
In addition to R packages, PaCMAP relies on Python to deliver high performance. To streamline installation and make environment management easier, we strongly recommend using Anaconda (https://www.anaconda.com/download) or Miniconda (https://docs.anaconda.com/miniconda/miniconda-install/) to manage Python environments. Below, we provide step-by-step instructions for installing PaCMAP once one of these tools is installed.
Create a conda environment with PaCMAP installed:
conda create -n "pacmap" python=3.12 # Install in the environment called "pacmap"
conda activate pacmap
conda install -y conda-forge::pacmap
To run PaCMAP, you need to connect your R console to the corresponding conda environment. If your Conda/Miniconda installation is located in a non-default directory, you may need to set the conda argument to /path/to/your/conda. This ensures the correct environment is used.
reticulate::use_condaenv(condaenv = "pacmap", conda = "auto")
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
To learn more about this dataset, type ?pbmc3k
InstallData("pbmc3k")
pbmc3k.final <- LoadData("pbmc3k",type="pbmc3k.final")
# Initial processing to select variable features
pbmc3k.final <- UpdateSeuratObject(pbmc3k.final)
pbmc3k.final <- FindVariableFeatures(pbmc3k.final)
# run PaCMAP on Seurat object.
pbmc3k.final <- RunPaCMAP(object = pbmc3k.final, features=VariableFeatures(pbmc3k.final))
## Applied PCA, the dimensionality becomes 100
## PaCMAP(n_neighbors=10, n_MN=5, n_FP=20, distance=euclidean, lr=1.0, n_iters=(100, 100, 450), apply_pca=True, opt_method='adam', verbose=True, intermediate=False, seed=11)
## Finding pairs
## Found nearest neighbor
## Calculated sigma
## Found scaled dist
## Pairs sampled successfully.
## ((26380, 2), (13190, 2), (52760, 2))
## Initial Loss: 32494.857421875
## Iteration: 10, Loss: 25802.580078
## Iteration: 20, Loss: 21603.363281
## Iteration: 30, Loss: 19970.650391
## Iteration: 40, Loss: 18992.988281
## Iteration: 50, Loss: 18181.544922
## Iteration: 60, Loss: 17354.800781
## Iteration: 70, Loss: 16440.773438
## Iteration: 80, Loss: 15367.431641
## Iteration: 90, Loss: 14006.279297
## Iteration: 100, Loss: 11969.539062
## Iteration: 110, Loss: 14622.074219
## Iteration: 120, Loss: 14481.925781
## Iteration: 130, Loss: 14432.553711
## Iteration: 140, Loss: 14414.109375
## Iteration: 150, Loss: 14406.267578
## Iteration: 160, Loss: 14402.332031
## Iteration: 170, Loss: 14400.175781
## Iteration: 180, Loss: 14398.969727
## Iteration: 190, Loss: 14398.415039
## Iteration: 200, Loss: 14398.177734
## Iteration: 210, Loss: 7290.769531
## Iteration: 220, Loss: 7165.277344
## Iteration: 230, Loss: 7109.806641
## Iteration: 240, Loss: 7076.959961
## Iteration: 250, Loss: 7059.577148
## Iteration: 260, Loss: 7048.008301
## Iteration: 270, Loss: 7038.852539
## Iteration: 280, Loss: 7031.291504
## Iteration: 290, Loss: 7024.563477
## Iteration: 300, Loss: 7018.940430
## Iteration: 310, Loss: 7013.954102
## Iteration: 320, Loss: 7009.539062
## Iteration: 330, Loss: 7005.522949
## Iteration: 340, Loss: 7001.719727
## Iteration: 350, Loss: 6998.311523
## Iteration: 360, Loss: 6995.219727
## Iteration: 370, Loss: 6992.364258
## Iteration: 380, Loss: 6989.705566
## Iteration: 390, Loss: 6987.210449
## Iteration: 400, Loss: 6984.850586
## Iteration: 410, Loss: 6982.660156
## Iteration: 420, Loss: 6980.610840
## Iteration: 430, Loss: 6978.653320
## Iteration: 440, Loss: 6976.790039
## Iteration: 450, Loss: 6975.051758
## Iteration: 460, Loss: 6973.373535
## Iteration: 470, Loss: 6971.781250
## Iteration: 480, Loss: 6970.295410
## Iteration: 490, Loss: 6968.885254
## Iteration: 500, Loss: 6967.526367
## Iteration: 510, Loss: 6966.229492
## Iteration: 520, Loss: 6965.000977
## Iteration: 530, Loss: 6963.808105
## Iteration: 540, Loss: 6962.668945
## Iteration: 550, Loss: 6961.575195
## Iteration: 560, Loss: 6960.505371
## Iteration: 570, Loss: 6959.466309
## Iteration: 580, Loss: 6958.451172
## Iteration: 590, Loss: 6957.499023
## Iteration: 600, Loss: 6956.579102
## Iteration: 610, Loss: 6955.684570
## Iteration: 620, Loss: 6954.833984
## Iteration: 630, Loss: 6954.013672
## Iteration: 640, Loss: 6953.199219
## Iteration: 650, Loss: 6952.405762
## Elapsed time: 1.35s
# visualize markers
features.plot <- c('CD3D', 'MS4A1', 'CD8A', 'GZMK', 'GZMB', 'FCGR3A')
DimPlot(object=pbmc3k.final,reduction="pacmap")
pbmc3k.final <- NormalizeData(pbmc3k.final, verbose = FALSE)
FeaturePlot(pbmc3k.final, features.plot, ncol = 2, reduction="pacmap")
You can also specify dims of your original dataset for running PaCMAP
# run PaCMAP on Seurat object.
pbmc3k.final <- RunPaCMAP(object = pbmc3k.final, dims=2:5)
## X is normalized
## PaCMAP(n_neighbors=10, n_MN=5, n_FP=20, distance=euclidean, lr=1.0, n_iters=(100, 100, 450), apply_pca=True, opt_method='adam', verbose=True, intermediate=False, seed=11)
## Finding pairs
## Found nearest neighbor
## Calculated sigma
## Found scaled dist
## Pairs sampled successfully.
## ((26380, 2), (13190, 2), (52760, 2))
## Initial Loss: 32494.857421875
## Iteration: 10, Loss: 25271.613281
## Iteration: 20, Loss: 21621.359375
## Iteration: 30, Loss: 19508.974609
## Iteration: 40, Loss: 18132.957031
## Iteration: 50, Loss: 17056.433594
## Iteration: 60, Loss: 16103.467773
## Iteration: 70, Loss: 15062.871094
## Iteration: 80, Loss: 13834.863281
## Iteration: 90, Loss: 12237.625000
## Iteration: 100, Loss: 9772.316406
## Iteration: 110, Loss: 12138.644531
## Iteration: 120, Loss: 12073.764648
## Iteration: 130, Loss: 12035.579102
## Iteration: 140, Loss: 12024.872070
## Iteration: 150, Loss: 12020.273438
## Iteration: 160, Loss: 12017.477539
## Iteration: 170, Loss: 12016.250000
## Iteration: 180, Loss: 12015.735352
## Iteration: 190, Loss: 12015.256836
## Iteration: 200, Loss: 12014.985352
## Iteration: 210, Loss: 5415.976562
## Iteration: 220, Loss: 5314.317383
## Iteration: 230, Loss: 5279.950195
## Iteration: 240, Loss: 5264.435059
## Iteration: 250, Loss: 5255.370605
## Iteration: 260, Loss: 5248.233398
## Iteration: 270, Loss: 5243.097168
## Iteration: 280, Loss: 5239.238281
## Iteration: 290, Loss: 5236.099609
## Iteration: 300, Loss: 5233.452637
## Iteration: 310, Loss: 5231.186523
## Iteration: 320, Loss: 5229.175781
## Iteration: 330, Loss: 5227.339355
## Iteration: 340, Loss: 5225.470703
## Iteration: 350, Loss: 5223.921875
## Iteration: 360, Loss: 5222.616699
## Iteration: 370, Loss: 5221.441895
## Iteration: 380, Loss: 5220.406250
## Iteration: 390, Loss: 5219.469727
## Iteration: 400, Loss: 5218.597168
## Iteration: 410, Loss: 5217.724609
## Iteration: 420, Loss: 5216.890625
## Iteration: 430, Loss: 5216.057617
## Iteration: 440, Loss: 5215.260742
## Iteration: 450, Loss: 5214.538574
## Iteration: 460, Loss: 5213.900391
## Iteration: 470, Loss: 5213.275879
## Iteration: 480, Loss: 5212.714844
## Iteration: 490, Loss: 5212.177734
## Iteration: 500, Loss: 5211.688965
## Iteration: 510, Loss: 5211.221680
## Iteration: 520, Loss: 5210.794434
## Iteration: 530, Loss: 5210.372559
## Iteration: 540, Loss: 5209.990723
## Iteration: 550, Loss: 5209.612305
## Iteration: 560, Loss: 5209.245117
## Iteration: 570, Loss: 5208.909180
## Iteration: 580, Loss: 5208.585938
## Iteration: 590, Loss: 5208.267578
## Iteration: 600, Loss: 5207.946777
## Iteration: 610, Loss: 5207.659668
## Iteration: 620, Loss: 5207.363281
## Iteration: 630, Loss: 5207.092285
## Iteration: 640, Loss: 5206.832520
## Iteration: 650, Loss: 5206.579102
## Elapsed time: 1.53s
# visualize markers
features.plot <- c('CD3D', 'MS4A1', 'CD8A', 'GZMK', 'GZMB', 'FCGR3A')
DimPlot(object=pbmc3k.final,reduction="pacmap")
Compiled: October 07, 2020
This vignette demonstrates the use of the Presto package in Seurat. Commands and parameters are based off of the Presto tutorial. If you use Presto in your work, please cite:
Presto scales Wilcoxon and auROC analyses to millions of observations
Ilya Korsunsky, Aparna Nathan, Nghia Millard, Soumya Raychaudhuri
bioRxiv, 2019.
Pre-print: https://www.biorxiv.org/content/10.1101/653253v1.full.pdf
Prerequisites to install:
To learn more about this dataset, type ?pbmc3k
InstallData("pbmc3k")
data("pbmc3k")
pbmc3k <- NormalizeData(pbmc3k)
Idents(pbmc3k) <- "seurat_annotations"
diffexp.B.Mono <- RunPresto(pbmc3k, "CD14+ Mono", "B")
head(diffexp.B.Mono, 10)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## CD79A 1.660326e-143 -2.989854 0.042 0.936 2.276972e-139
## TYROBP 3.516407e-138 3.512505 0.994 0.102 4.822401e-134
## S100A9 7.003189e-137 4.293303 0.996 0.134 9.604174e-133
## CST3 1.498348e-135 3.344758 0.992 0.174 2.054834e-131
## S100A4 8.872946e-135 2.854897 1.000 0.360 1.216836e-130
## LYZ 2.720838e-134 3.788514 1.000 0.422 3.731357e-130
## S100A8 3.115452e-133 4.039777 0.975 0.076 4.272530e-129
## CD79B 8.317731e-133 -2.667534 0.083 0.916 1.140694e-128
## S100A6 5.156920e-132 2.541609 0.996 0.352 7.072201e-128
## LGALS1 1.427548e-131 3.002493 0.979 0.131 1.957739e-127
## p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
## CD79A.3 0.000000e+00 2.933865 0.936 0.044 0.000000e+00 B CD79A
## MS4A1.3 0.000000e+00 2.290577 0.855 0.055 0.000000e+00 B MS4A1
## LINC00926.1 2.998236e-274 1.956493 0.564 0.010 4.111781e-270 B LINC00926
## CD79B.3 1.126919e-273 2.381160 0.916 0.144 1.545457e-269 B CD79B
## TCL1A.3 1.962618e-272 2.463556 0.622 0.023 2.691534e-268 B TCL1A
## HLA-DQA1.2 3.017803e-267 2.104207 0.890 0.119 4.138616e-263 B HLA-DQA1
## VPREB3 2.131575e-238 1.667466 0.488 0.008 2.923242e-234 B VPREB3
## HLA-DQB1.2 2.076231e-230 2.112052 0.863 0.148 2.847343e-226 B HLA-DQB1
## CD74.2 1.000691e-184 2.010688 1.000 0.819 1.372347e-180 B CD74
## HLA-DRA.3 1.813356e-184 1.914531 1.000 0.492 2.486837e-180 B HLA-DRA
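The two statistics named in the Presto citation, the Wilcoxon rank-sum test and the auROC, are closely related: the auROC equals the Wilcoxon statistic divided by the number of cross-group pairs. A minimal base R sketch with made-up expression values (the vectors `group_1` and `group_2` are hypothetical, and this uses `wilcox.test` from the stats package rather than Presto):

```r
# Hypothetical expression of one gene in two groups of cells
group_1 <- c(5, 6, 7)
group_2 <- c(1, 2, 6.5)

# Wilcoxon rank-sum test; W counts the pairs where group_1 exceeds group_2
test <- wilcox.test(group_1, group_2)

# auROC: probability that a random group_1 cell outranks a random group_2 cell
auc <- unname(test$statistic) / (length(group_1) * length(group_2))
auc
```

An auROC near 1 or 0 marks a gene that separates the two groups well, while a value near 0.5 indicates no separation.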
================================================
FILE: docs/presto.md
================================================
Fast Differential Expression with Presto
================
Compiled: October 07, 2020
This vignette demonstrates the use of the Presto package in Seurat.
Commands and parameters are based off of the [Presto
tutorial](http://htmlpreview.github.io/?https://github.com/immunogenomics/presto/blob/master/docs/getting-started.html).
If you use Presto in your work, please cite:
> *Presto scales Wilcoxon and auROC analyses to millions of
> observations*
>
> Ilya Korsunsky, Aparna Nathan, Nghia Millard, Soumya Raychaudhuri
>
> bioRxiv, 2019.
>
> Pre-print: <https://www.biorxiv.org/content/10.1101/653253v1.full.pdf>

This vignette demonstrates how to run schex on Seurat objects, which aims to provide better plots. If you use schex, please cite:
Single cell transcriptomics reveals spatial and temporal dynamics of gene expression in the developing mouse spinal cord
Delile, Julien, Teresa Rayon, Manuela Melchionda, Amelia Edwards, James Briscoe, and Andreas Sagner.
doi: 10.1242/dev.173807
Reduced dimension plotting is one of the essential tools for the analysis of single cell data. However, as the number of cells/nuclei in these plots increases, the usefulness of these plots decreases. Many cells are plotted on top of each other, obscuring information even when taking advantage of transparency settings. This package provides binning strategies of cells/nuclei into hexagon cells. Plotting summarized information of all cells/nuclei in their respective hexagon cells presents information without obstructions. The package works seamlessly with the two most common object classes for the storage of single cell data: SingleCellExperiment from the SingleCellExperiment package and Seurat from the Seurat package. In this vignette I will be presenting the use of schex for Seurat objects.
Prerequisites to install that are not available via install.packages:
library(Seurat)
library(SeuratData)
library(ggplot2)
library(ggrepel)
library(dplyr)
theme_set(theme_classic())
library(schex)
In order to demonstrate the capabilities of the schex package, I will use a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10x Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. You can download the data from the Seurat website.
InstallData("pbmc3k")
pbmc <- pbmc3k
In the next section, I will perform some simple quality control steps outlined in the Seurat vignette. I will then calculate various dimension reductions and cluster the data, as also outlined in the vignette.
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- pbmc %>% subset(subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) %>%
NormalizeData() %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA(verbose = FALSE) %>% RunUMAP(dims = 1:10) %>%
FindNeighbors(dims = 1:10) %>% FindClusters(resolution = 0.5, verbose = FALSE)
At this stage in the workflow we usually would like to plot aspects of our data in one of the reduced dimension representations. Instead of plotting this in an ordinary fashion, I will demonstrate how schex can provide a better way of plotting this.
First, I will calculate the hexagon cell representation for each cell for a specified dimension reduction representation. I use nbins=40, which divides the x range into 40 bins. Note that this is a parameter you may want to adjust depending on the number of cells/nuclei in your dataset. Generally, for more cells/nuclei, nbins should be increased.
pbmc <- make_hexbin(pbmc, nbins = 40, dimension_reduction = "UMAP")
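To make the role of nbins more concrete, here is a minimal Python sketch of the classic two-lattice hexagonal binning idea, in which the x range is divided into nbins bins and each point is assigned to the nearest hexagon centre. This is an illustration only: `hexbin_counts` is a hypothetical helper, not part of schex, and it assumes a square aspect ratio and non-constant coordinates.

```python
import math
from collections import Counter


def hexbin_counts(xs, ys, nbins):
    """Count points per hexagon cell (hypothetical helper, not the schex API).

    The x range is split into `nbins` bins. Each point is assigned to the
    nearer of two candidate centres taken from two offset rectangular
    lattices, which together tile the plane with hexagons.
    """
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    x_scale = nbins / (xmax - xmin)
    y_scale = nbins / ((ymax - ymin) * math.sqrt(3))
    counts = Counter()
    for x, y in zip(xs, ys):
        sx = (x - xmin) * x_scale
        sy = (y - ymin) * y_scale
        # Candidate centre on the first lattice (integer coordinates).
        i1, j1 = round(sx), round(sy)
        # Candidate centre on the second lattice (shifted by one half).
        i2, j2 = math.floor(sx), math.floor(sy)
        d1 = (sx - i1) ** 2 + 3 * (sy - j1) ** 2
        d2 = (sx - i2 - 0.5) ** 2 + 3 * (sy - j2 - 0.5) ** 2
        key = (0, i1, j1) if d1 < d2 else (1, i2, j2)
        counts[key] += 1
    return counts


# Toy "embedding": a regular 20 x 20 grid of points.
grid_x = [i for i in range(20) for _ in range(20)]
grid_y = [j for _ in range(20) for j in range(20)]
for n in (5, 40):
    cells = hexbin_counts(grid_x, grid_y, nbins=n)
    print(n, "bins ->", len(cells), "occupied hexagons")
```

A larger nbins gives more, smaller hexagons with fewer points in each, which is why the value usually needs to grow with the number of cells/nuclei.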
First I plot how many cells are in each hexagon cell. This should be relatively even, otherwise change the nbins parameter in the previous calculation.
plot_hexbin_density(pbmc)
Next I colour the hexagon cells by some meta information, such as the median total count or cluster membership in each hexagon cell.
plot_hexbin_meta(pbmc, col = "nCount_RNA", action = "median")
plot_hexbin_meta(pbmc, col = "RNA_snn_res.0.5", action = "majority")
For convenience there is also a function that allows the calculation of label positions for factor variables. These can be overlaid using the package ggrepel.
label_df <- make_hexbin_label(pbmc, col = "RNA_snn_res.0.5")
pp <- plot_hexbin_meta(pbmc, col = "RNA_snn_res.0.5", action = "majority")
pp + ggrepel::geom_label_repel(data = label_df, aes(x = x, y = y, label = label), colour = "black",
label.size = NA, fill = NA)
Finally, I will visualize the gene expression of the CD19 gene in the hexagon cell representation.
gene_id <- "CD19"
plot_hexbin_gene(pbmc, type = "logcounts", gene = gene_id, action = "mean", xlab = "UMAP1", ylab = "UMAP2",
title = paste0("Mean of ", gene_id))
This vignette demonstrates analysing RNA Velocity quantifications stored in a Seurat object using scVelo. If you use scVelo in your work, please cite:
Generalizing RNA velocity to transient cell states through dynamical modeling
Volker Bergen, Marius Lange, Stefan Peidli, F. Alexander Wolf & Fabian J. Theis
doi: 10.1101/820936
Website: https://scvelo.readthedocs.io/
Prerequisites to install:
library(Seurat)
library(SeuratDisk)
library(SeuratWrappers)
# If you don't have velocyto's example mouse bone marrow dataset, download with the CURL command
# curl::curl_download(url = 'http://pklab.med.harvard.edu/velocyto/mouseBM/SCG71.loom', destfile
# = '~/Downloads/SCG71.loom')
ldat <- ReadVelocity(file = "~/Downloads/SCG71.loom")
bm <- as.Seurat(x = ldat)
bm[["RNA"]] <- bm[["spliced"]]
bm <- SCTransform(bm)
bm <- RunPCA(bm)
bm <- RunUMAP(bm, dims = 1:20)
bm <- FindNeighbors(bm, dims = 1:20)
bm <- FindClusters(bm)
DefaultAssay(bm) <- "RNA"
SaveH5Seurat(bm, filename = "mouseBM.h5Seurat")
Convert("mouseBM.h5Seurat", dest = "h5ad")
# In Python
import scvelo as scv
adata = scv.read("mouseBM.h5ad")
adata
## AnnData object with n_obs × n_vars = 6667 × 24421
## obs: 'orig.ident', 'nCount_spliced', 'nFeature_spliced', 'nCount_unspliced', 'nFeature_unspliced', 'nCount_ambiguous', 'nFeature_ambiguous', 'nCount_RNA', 'nFeature_RNA', 'nCount_SCT', 'nFeature_SCT', 'SCT_snn_res.0.8', 'seurat_clusters'
## var: 'features', 'ambiguous_features', 'spliced_features', 'unspliced_features'
## obsm: 'X_umap'
## layers: 'ambiguous', 'spliced', 'unspliced'
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
scv.tl.velocity(adata)
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding_stream(adata, basis="umap", color="seurat_clusters")
scv.pl.velocity_embedding(adata, basis="umap", color="seurat_clusters", arrow_length=3, arrow_size=2, dpi=120)
scv.tl.recover_dynamics(adata)
scv.tl.latent_time(adata)
scv.pl.scatter(adata, color="latent_time", color_map="gnuplot")
top_genes = adata.var["fit_likelihood"].sort_values(ascending=False).index[:300]
scv.pl.heatmap(adata, var_names=top_genes, sortby="latent_time", col_color="seurat_clusters", n_convolve=100)
================================================
FILE: docs/tricycle.Rmd
================================================
---
title: "Running estimate_cycle_position from tricycle on Seurat Objects"
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
output:
github_document:
html_preview: true
toc: true
toc_depth: 3
fig_width: 16
html_document:
df_print: kable
theme: united
fig_height: 5
fig_width: 16
out_height: 4
---
This vignette demonstrates the use of the `estimate_cycle_position` function from the tricycle package on Seurat objects.
> *Universal prediction of cell cycle position using transfer learning*
>
> Shijie C. Zheng, Genevieve Stein-O’Brien, Jonathan J. Augustin, Jared Slosberg, Giovanni A. Carosso, Briana Winer, Gloria Shin, Hans T. Bjornsson, Loyal A. Goff, Kasper D. Hansen
>
> bioRxiv, 2021.
>
> doi: [10.1101/2021.04.06.438463](https://doi.org/10.1101/2021.04.06.438463)
>
> Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/tricycle.html
```{r setup, include=FALSE}
knitr::opts_chunk$set(
tidy = TRUE,
tidy.opts = list(width.cutoff = 95),
message = FALSE,
warning = FALSE
)
```
Prerequisites to install:
* [Seurat](https://satijalab.org/seurat/install)
* [SeuratWrappers](https://github.com/satijalab/seurat-wrappers)
* [tricycle](https://www.bioconductor.org/packages/release/bioc/html/tricycle.html)
```{r install.deps, include = FALSE}
# SeuratWrappers:::CheckPackage(package = 'tricycle', repository = 'bioconductor')
if (!require(tricycle)) remotes::install_github(repo = 'hansenlab/tricycle')
```
```{r packages}
library(Seurat)
library(SeuratWrappers)
library(tricycle)
```
## Introduction
The Bioconductor package [tricycle](https://www.bioconductor.org/packages/release/bioc/html/tricycle.html)
infers cell cycle position for a single-cell RNA-seq dataset. Here, we show the
implementation of the **main** function of tricycle, estimate_cycle_position, on
Seurat objects. More information can be found at [tricycle](https://www.bioconductor.org/packages/release/bioc/html/tricycle.html).
## Loading example data and making a Seurat object
```{r seuratobj, eval = TRUE, echo = TRUE}
data(neurosphere_example, package = "tricycle")
neurosphere_example <- as.Seurat(neurosphere_example)
neurosphere_example
```
Note that after converting the SingleCellExperiment object to a Seurat object,
the original "logcounts" assay is saved as a slot named "data" in the default Seurat Assay.
## Inferring the cell cycle position
The `Runtricycle()` function in the SeuratWrappers package first projects the data
into the cell cycle embeddings using the internal reference in the tricycle package,
and then estimates the cell cycle position. The estimated cell cycle position is
bounded between 0 and 2pi. Note that we strive to get high resolution of cell cycle
state, and we think the continuous position is more appropriate when describing
the cell cycle. However, to help users understand the position variable, we also
note that users can approximately relate 0.5pi to the start of S stage, pi to
the start of G2M stage, 1.5pi to the middle of M stage, and 1.75pi-0.25pi
to G1/G0 stage.
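As a rough illustration of these anchor points, the following Python sketch maps a continuous position to a coarse label. The function name `approximate_stage` and the exact cut-offs are assumptions made for illustration and are not part of tricycle or SeuratWrappers. The interval 0.25pi-0.5pi is not described in the text, so it is left unassigned here.

```python
import math


def approximate_stage(theta):
    """Map a cell cycle position (radians) to a coarse, approximate label.

    Hypothetical helper: the boundaries follow the approximate anchors quoted
    in the text (0.5pi start of S, pi start of G2M, 1.75pi-0.25pi G1/G0).
    """
    frac = (theta % (2 * math.pi)) / math.pi  # position in units of pi, in [0, 2)
    if frac >= 1.75 or frac < 0.25:
        return "G1/G0"
    if frac < 0.5:
        return "unassigned"  # interval not described by the anchors above
    if frac < 1.0:
        return "S"
    return "G2M"  # includes the middle of M stage around 1.5pi


for position in (0.1, 0.75 * math.pi, 1.5 * math.pi, 1.9 * math.pi):
    print(round(position, 3), approximate_stage(position))
```

Such labels are only a reading aid. The continuous position remains the primary output.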
```{r run, eval = TRUE, echo = TRUE}
neurosphere_example <- Runtricycle(object = neurosphere_example, slot = "data",
reduction.name = "tricycleEmbedding",
reduction.key = "tricycleEmbedding_",
gname = NULL,
gname.type = "ENSEMBL",
species = "mouse")
```
## Visualizing the results
We can extract the cell cycle embedding and make a scatter plot of the embeddings
colored by the position inference. We also extract the expression level of the gene
Top2a for assessing the performance, described below.
```{r extract, eval = TRUE, echo = TRUE}
plot.df <- FetchData(object = neurosphere_example, vars = c("tricycleEmbedding_1", "tricycleEmbedding_2", "tricyclePosition", "ENSMUSG00000020914"))
names(plot.df)[4] <- "Top2a"
```
Let us plot the cell cycle embedding. You could also plot other embeddings,
such as t-SNE or UMAP, with points colored by the cell cycle position.
```{r plotemb, eval = TRUE, echo = TRUE, fig.width = 10, fig.height = 7}
library(ggplot2)
library(cowplot)
p <- tricycle:::.plot_emb_circle_scale(emb.m = plot.df[, 1:2],
color.value = plot.df$tricyclePosition,
color_by = "tricyclePosition",
point.size = 3.5, point.alpha = 0.9
)
legend <- circle_scale_legend(text.size = 5, alpha = 0.9)
plot_grid(p, legend, ncol = 2, rel_widths = c(1, 0.4))
```
## Assessing performance
We have two ways of (quickly) assessing whether triCycle works. They are
1. Look at the projection of the data into the cell cycle embedding.
2. Look at the expression of key genes as a function of cell cycle position.
Plotting the projection of the data into the cell cycle embedding is shown
above. Our observation is that more deeply sequenced data show a clearer
ellipsoid pattern with an empty interior. As sequencing depth decreases, the
radius of the ellipsoid decreases until the empty interior disappears. So the
absence of an interior does not mean the method does not work.
It is more important to inspect a couple of genes as a function of cell cycle
position. We tend to use Top2a which is highly expressed and therefore
"plottable" in every dataset. Other candidates are for example Smc2. To plot
this data, we provide a convenient function `fit_periodic_loess()` to fit a
loess line between the cyclic variable $\theta$ and other response variables.
This fitting is done by making `theta.v` 3 periods
`(c(theta.v - 2 * pi, theta.v, theta.v + 2 * pi))` and repeating `y` 3 times.
Only the fitted values corresponding to original `theta.v` will be returned.
In this example, we show how the expression of the cell cycle marker gene
*Top2a* changes along $\theta$.
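The three-period trick described above can be sketched in a few lines of Python. Note the substitution: this sketch uses a simple Gaussian kernel smoother in place of loess, so it illustrates only the periodic extension, not the fit that `fit_periodic_loess()` performs. The name `fit_periodic_smoother` and the `bandwidth` argument are hypothetical.

```python
import math


def fit_periodic_smoother(theta, y, bandwidth=0.5):
    """Smooth y against a cyclic variable theta by tripling the data.

    Hypothetical helper: a Gaussian kernel smoother stands in for loess.
    The data are copied one period to the left and one to the right, the
    smoother is fitted on all three copies, and only the fitted values at
    the original theta are returned.
    """
    two_pi = 2 * math.pi
    theta3 = [t - two_pi for t in theta] + list(theta) + [t + two_pi for t in theta]
    y3 = list(y) * 3
    fitted = []
    for t0 in theta:
        weights = [math.exp(-0.5 * ((t - t0) / bandwidth) ** 2) for t in theta3]
        fitted.append(sum(w * v for w, v in zip(weights, y3)) / sum(weights))
    return fitted


# Toy example: a periodic signal sampled at 50 evenly spaced positions.
positions = [2 * math.pi * k / 50 for k in range(50)]
signal = [math.cos(t) for t in positions]
smoothed = fit_periodic_smoother(positions, signal)
print(len(smoothed), "fitted values; first and last:", smoothed[0], smoothed[-1])
```

Because the neighbours of a point near 0 include points near 2pi, the fitted curve joins up smoothly across the boundary instead of flattening at the edges.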
```{r loess, message = TRUE}
fit.l <- fit_periodic_loess(neurosphere_example$tricyclePosition,
plot.df$Top2a,
plot = TRUE,
x_lab = "Cell cycle position \u03b8", y_lab = "log2(Top2a)",
fig.title = paste0("Expression of Top2a along \u03b8 (n=",
ncol(neurosphere_example), ")"))
names(fit.l)
fit.l$fig + theme_bw(base_size = 14)
```
For Top2a we expect peak expression around $\pi$.
## Plot out the kernel density
Another useful function is *plot_ccposition_den*, which computes the kernel density
of $\theta$ conditioned on a phenotype using a von Mises distribution. The output
figures are provided in two flavors, polar coordinates and Cartesian
coordinates. This can be useful when comparing different cell types,
treatments, or just stages. (Because we use a very small dataset here as an
example, we set the bandwidth, i.e. the concentration parameter of the von Mises
distribution, to 10 to get a smooth line.)
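For readers unfamiliar with circular kernel density estimation, here is a minimal Python sketch of a von Mises kernel density with concentration 10. It is an assumption-laden illustration rather than the code behind *plot_ccposition_den*: the helper names `bessel_i0` and `von_mises_kde` are hypothetical, and the normalising Bessel function is computed from its power series to keep the example dependency-free.

```python
import math


def bessel_i0(x, terms=50):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total, term = 0.0, 1.0
    for k in range(terms):
        if k > 0:
            term *= (x / 2) ** 2 / k ** 2
        total += term
    return total


def von_mises_kde(theta, grid, kappa=10.0):
    """Average of von Mises kernels centred on each observation (hypothetical helper).

    A larger concentration `kappa` gives narrower kernels, so it plays the
    role of an inverse bandwidth.
    """
    norm = 2 * math.pi * bessel_i0(kappa) * len(theta)
    return [sum(math.exp(kappa * math.cos(g - t)) for t in theta) / norm for g in grid]


# Toy example: positions clustered around pi, evaluated on a 360-point grid.
observed = [math.pi - 0.2, math.pi, math.pi + 0.2]
angles = [2 * math.pi * k / 360 for k in range(360)]
density = von_mises_kde(observed, angles, kappa=10.0)
print("grid angle with highest density:", angles[density.index(max(density))])
```

Computing one such density per phenotype level and overlaying the curves gives the kind of comparison the paragraph above describes.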
```{r density, message = TRUE}
plot_ccposition_den(neurosphere_example$tricyclePosition,
neurosphere_example$sample, 'sample',
bw = 10, fig.title = "Kernel density of \u03b8") +
theme_bw(base_size = 14)
plot_ccposition_den(neurosphere_example$tricyclePosition,
neurosphere_example$sample, 'sample', type = "circular",
bw = 10, fig.title = "Kernel density of \u03b8") +
theme_bw(base_size = 14)
```
## Resources for tricycle
More information about constructing your own reference, other usages and
running tricycle outside of the Seurat environment can be found
at [tricycle](https://www.bioconductor.org/packages/release/bioc/html/tricycle.html).
================================================
FILE: docs/tricycle.html
================================================
This vignette demonstrates the use of the estimate_cycle_position function from the tricycle package on Seurat objects.
Universal prediction of cell cycle position using transfer learning
Shijie C. Zheng, Genevieve Stein-O’Brien, Jonathan J. Augustin, Jared Slosberg, Giovanni A. Carosso, Briana Winer, Gloria Shin, Hans T. Bjornsson, Loyal A. Goff, Kasper D. Hansen
bioRxiv, 2021.
doi: 10.1101/2021.04.06.438463
Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/tricycle.html
Prerequisites to install:
library(Seurat)
library(SeuratWrappers)
library(tricycle)
The Bioconductor package tricycle infers cell cycle position for a single-cell RNA-seq dataset. Here, we show the implementation of the main function of tricycle, estimate_cycle_position, on Seurat objects. More information can be found at tricycle.
data(neurosphere_example, package = "tricycle")
neurosphere_example <- as.Seurat(neurosphere_example)
neurosphere_example
## An object of class Seurat
## 1500 features across 400 samples within 1 assay
## Active assay: RNA (1500 features, 0 variable features)
Note that after converting the SingleCellExperiment object to a Seurat object, the original “logcounts” assay is saved as a slot named “data” in the default Seurat Assay.
The Runtricycle() function in the SeuratWrappers package first projects the data into the cell cycle embeddings using the internal reference in the tricycle package, and then estimates the cell cycle position. The estimated cell cycle position is bounded between 0 and 2pi. Note that we strive to get high resolution of cell cycle state, and we think the continuous position is more appropriate when describing the cell cycle. However, to help users understand the position variable, we also note that users can approximately relate 0.5pi to the start of S stage, pi to the start of G2M stage, 1.5pi to the middle of M stage, and 1.75pi-0.25pi to G1/G0 stage.
neurosphere_example <- Runtricycle(object = neurosphere_example, slot = "data", reduction.name = "tricycleEmbedding",
reduction.key = "tricycleEmbedding_", gname = NULL, gname.type = "ENSEMBL", species = "mouse")
We can extract the cell cycle embedding and make a scatter plot of the embeddings colored by the position inference. We also extract the expression level of the gene Top2a for assessing the performance, described below.
plot.df <- FetchData(object = neurosphere_example, vars = c("tricycleEmbedding_1", "tricycleEmbedding_2",
"tricyclePosition", "ENSMUSG00000020914"))
names(plot.df)[4] <- "Top2a"
Let us plot the cell cycle embedding. You could also plot other embeddings, such as t-SNE or UMAP, with points colored by the cell cycle position.
library(ggplot2)
library(cowplot)
p <- tricycle:::.plot_emb_circle_scale(emb.m = plot.df[, 1:2], color.value = plot.df$tricyclePosition,
color_by = "tricyclePosition", point.size = 3.5, point.alpha = 0.9)
legend <- circle_scale_legend(text.size = 5, alpha = 0.9)
plot_grid(p, legend, ncol = 2, rel_widths = c(1, 0.4))
We have two ways of (quickly) assessing whether triCycle works. They are
1. Look at the projection of the data into the cell cycle embedding.
2. Look at the expression of key genes as a function of cell cycle position.
Plotting the projection of the data into the cell cycle embedding is shown above. Our observation is that more deeply sequenced data show a clearer ellipsoid pattern with an empty interior. As sequencing depth decreases, the radius of the ellipsoid decreases until the empty interior disappears. So the absence of an interior does not mean the method does not work.
It is more important to inspect a couple of genes as a function of cell cycle position. We tend to use Top2a which is highly expressed and therefore “plottable” in every dataset. Other candidates are for example Smc2. To plot this data, we provide a convenient function fit_periodic_loess() to fit a loess line between the cyclic variable \(\theta\) and other response variables. This fitting is done by making theta.v 3 periods (c(theta.v - 2 * pi, theta.v, theta.v + 2 * pi)) and repeating y 3 times. Only the fitted values corresponding to original theta.v will be returned. In this example, we show how the expression of the cell cycle marker gene Top2a changes along \(\theta\).
fit.l <- fit_periodic_loess(neurosphere_example$tricyclePosition, plot.df$Top2a, plot = TRUE, x_lab = "Cell cycle position θ",
y_lab = "log2(Top2a)", fig.title = paste0("Expression of Top2a along θ (n=", ncol(neurosphere_example),
")"))
names(fit.l)
## [1] "fitted" "residual" "pred.df" "loess.o" "rsquared" "fig"
fit.l$fig + theme_bw(base_size = 14)
For Top2a we expect peak expression around \(\pi\).
Another useful function is plot_ccposition_den, which computes the kernel density of \(\theta\) conditioned on a phenotype using a von Mises distribution. The output figures are provided in two flavors, polar coordinates and Cartesian coordinates. This can be useful when comparing different cell types, treatments, or just stages. (Because we use a very small dataset here as an example, we set the bandwidth, i.e. the concentration parameter of the von Mises distribution, to 10 to get a smooth line.)
plot_ccposition_den(neurosphere_example$tricyclePosition, neurosphere_example$sample, "sample",
bw = 10, fig.title = "Kernel density of θ") + theme_bw(base_size = 14)
plot_ccposition_den(neurosphere_example$tricyclePosition, neurosphere_example$sample, "sample",
type = "circular", bw = 10, fig.title = "Kernel density of θ") + theme_bw(base_size = 14)
More information about constructing your own reference, other usages and running tricycle outside of the Seurat environment can be found at tricycle.
This vignette demonstrates analysing RNA Velocity quantifications stored in a Seurat object. Parameters are based off of the RNA Velocity tutorial. If you use velocyto in your work, please cite:
RNA velocity of single cells
Gioele La Manno, Ruslan Soldatov, Amit Zeisel, Emelie Braun, Hannah Hochgerner, Viktor Petukhov, Katja Lidschreiber, Maria E. Kastriti, Peter Lönnerberg, Alessandro Furlan, Jean Fan, Lars E. Borm, Zehua Liu, David van Bruggen, Jimin Guo, Xiaoling He, Roger Barker, Erik Sundström, Gonçalo Castelo-Branco, Patrick Cramer, Igor Adameyko, Sten Linnarsson & Peter V. Kharchenko
doi: 10.1038/s41586-018-0414-6
Website: https://velocyto.org
Prerequisites to install:
library(Seurat)
library(velocyto.R)
library(SeuratWrappers)
# If you don't have velocyto's example mouse bone marrow dataset, download with the CURL command
# curl::curl_download(url = 'http://pklab.med.harvard.edu/velocyto/mouseBM/SCG71.loom', destfile
# = '~/Downloads/SCG71.loom')
ldat <- ReadVelocity(file = "~/Downloads/SCG71.loom")
bm <- as.Seurat(x = ldat)
bm <- SCTransform(object = bm, assay = "spliced")
bm <- RunPCA(object = bm, verbose = FALSE)
bm <- FindNeighbors(object = bm, dims = 1:20)
bm <- FindClusters(object = bm)
bm <- RunUMAP(object = bm, dims = 1:20)
bm <- RunVelocity(object = bm, deltaT = 1, kCells = 25, fit.quantile = 0.02)
ident.colors <- (scales::hue_pal())(n = length(x = levels(x = bm)))
names(x = ident.colors) <- levels(x = bm)
cell.colors <- ident.colors[Idents(object = bm)]
names(x = cell.colors) <- colnames(x = bm)
show.velocity.on.embedding.cor(emb = Embeddings(object = bm, reduction = "umap"), vel = Tool(object = bm,
slot = "RunVelocity"), n = 200, scale = "sqrt", cell.colors = ac(x = cell.colors, alpha = 0.5),
cex = 0.8, arrow.scale = 3, show.grid.flow = TRUE, min.grid.cell.mass = 0.5, grid.n = 40, arrow.lwd = 1,
do.par = FALSE, cell.border.alpha = 0.1)