Getting started with an ScPCA dataset
This section provides information on next steps you might take after downloading a dataset from the ScPCA portal.
Quantified single-cell or single-nuclei gene expression data is provided as either SingleCellExperiment
objects (.rds
files) or AnnData
objects (.h5ad
files).
A full description of the contents of the SingleCellExperiment
and AnnData
objects can be found in the single cell gene expression file contents section.
There are three objects available for each library: an unfiltered object (unfiltered.rds
or unfiltered_rna.h5ad
), a filtered object (filtered.rds
or filtered_rna.h5ad
), and a processed object (processed.rds
or processed_rna.h5ad
).
The unfiltered object contains the gene expression data for all droplets, regardless of the presence of a cell or not.
The filtered object contains the gene expression data for only droplets that are likely to contain cells, removing any probable empty droplets.
See the section on filtering cells for more information on how we remove potential empty droplets.
The processed objects are further filtered to remove any low quality cells and contain both the raw and normalized gene expression data for the identified cells.
Additionally, these objects contain dimensionality reduction results (principal component analysis and UMAP) and clustering assignments.
See the description of the processed gene expression data for more information on the processed
objects.
We recommend starting with the processed objects.
For working with the processed
SingleCellExperiment
object found in theprocessed.rds
file see Importing ScPCA data fromSingleCellExperiment
objects into R below.For working with the processed
AnnData
objects found in theprocessed_rna.h5ad
file see Importing ScPCA data fromAnnData
objects into Python below.
Note: We also have a separate GitHub repository that contains workflows for some common analysis performed on single-cell RNA-sequencing data.
These workflows are designed to apply the same analysis (e.g., clustering) across multiple samples in parallel and is currently only for use with SingleCellExperiment
objects.
These workflows and more resources for processing single-cell and single-nuclei datasets can be found in the scpca-downstream-analyses
repository.
Importing ScPCA data from SingleCellExperiment
objects into R
The first step in analyzing the provided gene expression data stored in the SingleCellExperiment
objects will be to import the data into R.
To work with SingleCellExperiment
objects in R, we need to ensure that we have the SingleCellExperiment
package installed and loaded.
The following commands can be used to import the RDS file into R and save the SingleCellExperiment
object:
# if SingleCellExperiment is not installed, install the package
# otherwise the installation step can be skipped
if (!("SingleCellExperiment" %in% installed.packages())) {
BiocManager::install("SingleCellExperiment")
}
library(SingleCellExperiment)
# read in the RDS file, including the path to the file's location
processed_sce <- readRDS("SCPCS000000/SCPCL000000_processed.rds")
More resources for learning about SingleCellExperiment
objects:
SingleCellExperiment
objects Vignette on working withSingleCellExperiment
objectsOrchestrating Single Cell Analysis chapter on the
SingleCellExperiment
class
Importing ScPCA data from AnnData
objects into Python
The first step in analyzing the provided gene expression data stored in the AnnData
objects will be to import the data into python.
To work with AnnData
objects in python, we need to ensure that we have the AnnData
package installed and loaded.
The following commands can be used to import the H5AD file into python and save the AnnData
object:
# load anndata module
import anndata
# read in the H5AD file, including the path to the file's location
processed_adata = anndata.read_h5ad("SCPCS000000/SCPCL000000_processed_rna.h5ad")
More resources for learning about AnnData
objects:
Working with processed objects
Gene expression data
The processed objects contain both the raw and normalized gene expression data.
The following commands can be used to access the expression data found in the processed SingleCellExperiment
objects:
# the raw counts matrix stored in the processed object
counts(processed_sce)
# log normalized counts matrix stored in the processed object
logcounts(processed_sce)
The following commands can be used to access the expression data found in the processed AnnData
objects:
# the raw counts matrix stored in the processed object
processed_adata.raw.X
# log normalized counts matrix stored in the processed object
processed_adata.X
Here we provide more resources on understanding normalization in single-cell RNA-seq analysis:
Hemberg lab scRNA-seq course section on normalization methods
Stegle et al. (2015) Computational and analytical challenges in single-cell transcriptomics. Includes a discussion of normalization and technical variance in scRNA-seq.
Quality control data
The processed SingleCellExperiment
objects already have undergone additional quality control steps to remove low quality cells.
Low quality cells include those with a higher percentage of reads from mitochondrial genes (i.e., those that are damaged or dying) and those with a lower number of total reads and unique genes identified (i.e., those with inefficient reverse transcription or PCR amplification).
All processed objects include miQC
results found in the colData() of the SingleCellExperiment object or the .obs slot of the AnnData object.
miQC
jointly models the proportion of mitochondrial reads and the number of unique genes detected in each cell to calculate the probability of a cell being compromised (i.e., dead or damaged).
High-quality cells are those with a low probability of being being compromised (< 0.75) or sufficiently low mitochondrial content.
The probability compromised for each cell as calculated by miQC
can be found in the prob_compromised
column of the colData
of SingleCellExperiment
objects.
# probability compromised for each cell
processed_sce$prob_compromised
The probability compromised for each cell as calculated by miQC
can be found in the prob_compromised
column of the .obs
slot of AnnData
objects.
# probability compromised for each cell
processed_adata.obs["prob_compromised"]
Additionally we include a column, scpca_filter
, that labels cells as either Keep
or Remove
based on having both a prob_compromised
< 0.75 and number of unique genes identified > 200.
All cells included in the processed object will have Keep
in the scpca_filter
column.
If you prefer to work with the object prior to removal of any low-quality cells, please use the filtered object, which contains all cells that were not discarded as empty droplets.
Dimensionality reduction
Dimensionality reduction is commonly used as a precursor to plotting, clustering, and other downstream analysis.
The processed objects contain results from performing principal component analysis (PCA), a technique that identifies new axes that capture the largest amount of variation in the data, and uniform manifold approximation and projection(UMAP), which may be better for visualization. Use caution when interpreting UMAP results, as location, spacing, and density of clusters can be dependent on parameter choices and random effects and do not always accurately represent the relationships among cells.
Dimensionality reduction results can be accessed in the SingleCellExperiment
objects using the following commands:
# principal component analysis results
reducedDim(processed_sce, "PCA")
# UMAP results
reducedDim(processed_sce, "UMAP")
Dimensionality reduction results can be accessed in the AnnData
objects using the following command:
# principal component analysis results
processed_adata.obsm["X_pca"]
# UMAP results
processed_adata.obsm["X_umap"]
See below for more resources on dimensionality reduction:
Highly variable genes
In the processed objects, principal components were calculated from a set of highly variable genes identified for a given library. We encourage you to visit the Feature selection chapter in Orchestrating Single Cell Analysis to read more about modeling gene variance and selecting the highly variable genes.
The list of highly variable genes used for this calculation, in order from highest to lowest variation, is stored in the metadata
of the SingleCellExperiment
object and the uns
slot of the AnnData
object.
This list can be accessed using the following command in the SingleCellExperiment
objects:
# list of highly variable genes
metadata(processed_sce)$highly_variable_genes
This list can be accessed using the following command in the AnnData
objects:
# list of highly variable genes
processed_adata.uns["highly_variable_genes"]
Additionally, the AnnData
objects contain a column in the .var
slot, "highly_variable"
, indicating whether or not a gene is found in the list of highly variable genes.
Clustering
Cluster assignments obtained from Graph-based clustering is also available in the processed objects. By default, clustering is performed using the Louvain algorithm with 20 nearest neighbors and Jaccard weighting.
To access the cluster assignments in the SingleCellExperiment
object, use the following command:
# cluster assignment for each cell
processed_sce$cluster
To access the cluster assignments in the AnnData
object, use the following command:
# cluster assignment for each cell
processed_adata.obs["cluster"]
See these resources for more information on clustering:
Cell type annotation
Processed objects may contain cell type annotations and associated metadata from one or more of the following sources.
Submitter-provided annotations (note that these are only present for a subset of libraries).
Annotations determined by
SingleR
, an automated reference-based method (Looney et al. 2019).Annotations determined by
CellAssign
, an automated marker-gene based method (Zhang et al. 2019).
If at least one method was used for cell type annotation, a supplemental cell type report will be provided with the download. This report evaluates cell annotations results as follows:
It provides diagnostic plots to assess the quality of cell type annotations.
If multiple annotations are present, the report compares different annotations to one another. Strong agreement between different annotation methods is a qualitative indicator of robustness.
To determine which methods were used for cell type annotations, use the following command on the processed SingleCellExperiment
object.
If this returns NULL
, then no cell type annotation was performed for the given library.
# show vector of available celltypes
# values will be one or more of: `submitter`, `singler`, `cellassign`
metadata(processed_sce)$celltype_methods
Or, use the following command on the processed AnnData
object.
If this returns an empty list, then no cell type annotation was performed for the given library.
# show list of available celltypes
# values will be one or more of: `submitter`, `singler`, `cellassign`
processed_adata.uns["celltype_methods"]
Below we provide instructions on how to access annotations from each cell type annotation method used, if present.
Note that additional information about SingleR
and CellAssign
annotation, including their respective reference source and versions, is also available from the processed SingleCellExperiment
object’s metadata and from the processed AnnData
object’s uns
slot, as described in the experiment metadata table.
Submitter-provided annotations
To access submitter-provided annotations, if available, in the SingleCellExperiment
, use the following command:
# submitter-provided annotations for each cell
processed_sce$submitter_celltype_annotation
To access submitter-provided annotations in the AnnData
object, use the following command:
# submitter-provided annotations for each cell
processed_adata.obs["submitter_celltype_annotation"]
Cells that submitters did not annotate are labeled with Submitter-excluded
.
Note that submitter-provided annotations are also present in unfiltered and filtered objects and can be accessed using the same approach shown here for processed objects.
SingleR
annotations
SingleR
annotation uses a reference dataset from the celldex
package (Aran et al. (2019)).
To access automated SingleR
annotations as cell type names and/or ontology terms in the process SingleCellExperiment
object, use the following command(s):
# SingleR annotatins as cell type names
processed_sce$singler_celltype_annotation
# Or, SingleR annotatins as cell type ontology terms
processed_sce$singler_celltype_ontology
To access automated SingleR
annotations as cell type names and/or ontology terms in the processed AnnData
object, use the following command(s):
# SingleR annotatins as cell type names
processed_adata.obs["singler_celltype_annotation"]
# Or, SingleR annotatins as cell type ontology terms
processed_adata.obs["singler_celltype_ontology"]
Cells that SingleR
could not confidently annotate are labeled with NA
.
You can also access the full object returned by SingleR
from the SingleCellExperiment
’s metadata with the following command (note that this information is not provided in the AnnData
object):
# SingleR full result
metadata(processed_sce)$singler_results
CellAssign
annotations
CellAssign
annotation uses a reference set of marker genes from the PanglaoDB
database (Oscar Franzén et al. (2019)), as compiled by the Data Lab for a given tissue group.
To access automated CellAssign
annotations in the SingleCellExperiment
, use the following command:
# CellAssign annotations for each cell
processed_sce$cellassign_celltype_annotation
To access automated CellAssign
annotations in the AnnData
object, use the following command:
# CellAssign annotations for each cell
processed_adata.obs["cellassign_celltype_annotation"]
Cells that CellAssign
could not confidently annotate are labeled with "other"
.
You can also access the full predictions matrix returned by CellAssign
from the SingleCellExperiment
’s metadata with the following command:
# CellAssign full predictions matrix full result
metadata(processed_sce)$cellassign_predictions
Additional cell type resources
See these resources for more information on automated cell type annotation:
Working with a merged ScPCA object
Merged ScPCA objects contain all information found in processed
objects for all individual libraries that comprise a given ScPCA project.
For more information on how these objects were prepared, see the section on merged object preparation.
Please be aware that data in merged objects has not been integrated/batch-corrected.
To work with a merged object, you will first have to read it in.
You can read in a SingleCellExperiment
merged object with the following R code:
library(SingleCellExperiment)
merged_sce <- readRDS("SCPCP000000_merged.rds")
You can read in an AnnData
merged object with the following python code:
import anndata
adata_merged_object = anndata.read_h5ad("SCPCP000000_merged_rna.h5ad")
Analyses with merged objects
Merged object can facilitate joint analysis on multiple samples at the same time. Specifically, merged objects are useful for comparing gene-level metrics among different samples.
Some examples of analyses you can perform with a merged object include the following:
Differential expression analysis
Gene-set enrichment analysis
By contrast, it is recommended to control for batch effects when performing analyses that compare cell-level metrics among different samples. In order to do this, an integrated/batch-corrected object should be created. These types of analyses include, for example, differential cell abundance or differential cell composition analysis.
The merged object provided in the ScPCA Portal has not been integrated, so you will need to integrate your samples of interest to remove technical batch effects before proceeding. Note that the merged object does not contain cell clusters, which you may wish to re-compute after removing batch effects.
We recommend consulting with the following resources before performing integration:
Subsetting the merged object
You may wish to only work with a subset of libraries present in the merged object.
To subset a SingleCellExperiment
merged object to a given set of libraries, use the following R code:
# Define vector of library IDs of interest
libraries <- c("SCPCL00000X", "SCPCL00000Y", "SCPCL00000Z")
# Create a subsetted merged object
subsetted_merged_sce <- merged_sce[,merged_sce$library_id %in% libraries]
To subset an AnnData
merged object to a given set of libraries, use the following R code:
# Define list of library IDs of interest
libraries = ["SCPCL00000X", "SCPCL00000Y", "SCPCL00000Z"]
# Create a subsetted merged object
subsetted_adata_merged_object = adata_merged_object[adata_merged_object.obs["library_id"].isin(libraries)]
The merged object additionally contains metadata such as information about sample diagnosis, subdiagnosis, or tissue location that may be useful for subsetting.
A full set of merged object contents which can support subsetting is available in <TODO: the forthcoming section forthcoming in merged_objects.md
about file contents>.
As one example, to subset a SingleCellExperiment
merged object to a given diagnosis, use the following R code:
# Define diagnosis of interest
diagnosis <- "example diagnosis"
# Create a subsetted merged object
subsetted_merged_sce <- merged_sce[,merged_sce$diagnosis == diagnosis]
To subset an AnnData
merged object to a given diagnosis, use the following python code:
# Define diagnosis of interest
diagnosis = "example diagnosis"
# Create a subsetted merged object
subsetted_adata_merged_object = adata_merged_object[adata_merged_object.obs["diagnosis"] == diagnosis]
What if I want to use Seurat?
The files available for download that contain SingleCellExperiment
objects can also be converted into Seurat objects.
You can find the code needed to convert the SingleCellExperiment
objects to Seurat objects in the FAQ section.
After converting the object to a Seurat object, the same steps outlined above (quality control, filtering, normalization, dimensionality reduction) can be followed but using functions available as part of the Seurat package.
Here are some resources that can be used to get you started working with Seurat objects:
Getting started tutorial in Seurat including quality control, normalization, and dimensionality reduction.
Converting Seurat objects to and from a
SingleCellExperiment
Harvard Chan Bioinformatics Core course on single-cell RNA-seq analysis with Seurat
Special considerations for CITE-seq experiments
If the dataset you downloaded contains samples with ADT data from a CITE-seq experiment, the raw and normalized ADT expression matrices will be provided.
For SingleCellExperiment
objects with ADT data, the ADT expression matrices will be stored as an altExp
named "adt"
in the same object containing the RNA expression data.
For AnnData
objects with ADT data, the ADT expression matrices will be provided in separate files corresponding to the same three stages of data processing: an unfiltered object (_unfiltered_adt.h5ad
), a filtered object (_filtered_adt.h5ad
), and a processed object (_processed_adt.h5ad
).
These files will only contain ADT expression data and not RNA expression data.
We recommend working with the processed objects as they contain both the raw and normalized ADT expression matrices along with additional quality control metrics for the CITE-seq experiment.
To access the ADT expression matrices in SingleCellExperiment
objects, use the following commands:
# View the ADT alternative experiment
# Note that the altExp name is optional, as this is the only altExp present
altExp(processed_sce, "adt")
# the raw ADT counts matrix
counts(altExp(processed_sce))
# the log normalized ADT counts matrix
logcounts(altExp(processed_sce))
To access the ADT matrices in AnnData
objects you will first need to read in the _adt.h5ad
file, and then you can access the raw.X
and X
matrices as shown below:
import anndata
# read in the H5AD file with ADT data, including the path to the file's location
adt_adata = anndata.read_h5ad("SCPCS000000/SCPCL000000_processed_adt.h5ad")
# the raw ADT counts matrix
adt_adata.raw.X
# the log normalized ADT counts matrix
adt_adata.X
Be aware that the processed objects have been filtered to remove low-quality cells based on RNA expression but have not been filtered based on ADT counts.
Filtering cells based on ADT quality control
The adt_scpca_filter
column indicates which cells should be removed before proceeding with downstream analyses of the ADT data, as determined by DropletUtils::CleanTagCounts()
.
This process identified cells with high levels of ambient contamination and/or high levels of negative control ADTs (if available).
Cells are labeled either as "Keep"
(cells to retain) or "Remove"
(cells to filter out).
To filter cells based on this column in the SingleCellExperiment
objects, use the following command:
# Filter cells based on ADT QC statistics
processed_sce[, which(processed_sce$adt_scpca_filter == "Keep")]
To filter cells based on this column in the AnnData
objects, use the following command:
# Filter cells based on ADT QC statistics
adt_adata[adt_adata.obs["adt_scpca_filter"] == "Keep"]
Note that the normalized ADT expression matrix only contains values for cells labeled as "Keep"
in the adt_scpca_filter
column.
Any cells labeled "Remove"
have NA
values in the normalized expression matrix (see Processed ADT Data for more details).
Alternatively, you can also filter cells out based on your own criteria.
For SingleCellExperiment
objects, quality-control statistics calculated by DropletUtils::CleanTagCounts()
are provided in the colData
slot of the altExp
(colData(altExp(filtered_sce))
) as described in Additional SingleCellExperiment components for CITE-seq libraries (with ADT tags).
For AnnData
objects, these same quality-control statistics are provided in the obs
slot of the AnnData
object as described in Additional AnnData components for CITE-seq libraries (with ADT tags).
We recommend filtering out these low-quality cells before proceeding with downstream analyses.
Here are some additional resources that can be used for working with ADT counts from CITE-seq experiments:
Special considerations for multiplexed samples
If the dataset that you have downloaded contains samples that were multiplexed (i.e. cells from multiple samples have been combined together into one library), then the steps to take after downloading will be a little different than for libraries that only contain cells corresponding to a single sample. Here, multiplexed samples refer to samples that have been combined together into one library using cell hashing and then sequenced together. This means that a single library contains cells or nuclei that correspond to multiple samples. Each sample has been tagged with a hashtag oligo (HTO) prior to mixing, and that HTO can be used to identify which cells belong to which sample within a multiplexed library. The libraries available for download on the portal have not been separated by sample (i.e. demultiplexed), and therefore contain data from multiple samples.
Note that multiplexed sample libraries are only available as SingleCellExperiment
objects, and are not currently available as AnnData
objects.
If you prefer to work with AnnData
objects, we recommend using the zellkonverter
package to convert the SingleCellExperiment
object to an H5AD file containing an AnnData
object.
Libraries containing multiplexed samples can be initially processed using the same workflow described above including removal of low quality cells, normalization, and dimensionality reduction.
Demultiplexing can then be used to identify the sample that each cell is from.
Demultiplexing has already been performed using both Seurat::HTODemux
and DropletUtils::hashedDrops
.
For samples where corresponding bulk RNA-sequencing data is available, genetic demultiplexing was also conducted.
The associated demultiplexing results were summarized and are available in the colData
of the processed SingleCellExperiment
object.
The hashedDrops_sampleid
, HTODemux_sampleid
, and vireo_sampleid
columns in the colData
report the sample called for each cell by the specified demultiplexing method.
If a confident call was not made for a cell by the demultiplexing method, the column will have a value of NA
.
For more information on how to access the full demultiplexing results, see this description of demultiplexing results.
If desired, the sample calls identified from demultiplexing can be used to separate the SingleCellExperiment
object by sample for downstream analysis.
# view the summary of the sample calls for genetic demultiplexing
# genetic demultiplexing results are stored in the vireo_sampleid column found in the colData
table(multiplexed_sce$vireo_sampleid)
# identify cells that are assigned to sample A (use `which` to remove NA)
sampleA_cells <- which(multiplexed_sce$vireo_sampleid == "sampleA")
# create a new sce that only contains cells from sample A
sampleA_sce <- multiplexed_sce[, sampleA_cells]
Here are some additional resources that can be used for working with multiplexed samples (or those with cell hashing):