Getting started with an ScPCA dataset

This section provides information on next steps you might take after downloading a dataset from the ScPCA portal.

Importing ScPCA data into R

Quantified single-cell or single-nuclei gene expression data is provided as an RDS file as described in the single cell gene expression file contents section. There are two RDS files that will be available for each library, a filtered.rds and an unfiltered.rds file. The filtered.rds files will contain the gene expression data for only droplets that are likely to contain cells, removing any potential empty droplets. The unfiltered.rds file will contain the gene expression data for all droplets, regardless of the presence of a cell or not. See the section on filtering cells for more information on how we remove potential empty droplets. In most scenarios, we recommend starting with the filtered.rds file.

The first step in analyzing the provided gene expression data will be to import the data into R. As a reminder, each RDS file contains a SingleCellExperiment object. Refer to single-cell gene expression file contents for a more detailed description on the contents of the included SingleCellExperiment object.

To work with SingleCellExperiment objects in R, we need to ensure that we have the SingleCellExperiment package installed and loaded.

The following commands can be used to import the RDS file into R and save the SingleCellExperiment object:

# if SingleCellExperiment is not installed, install the package
# otherwise the installation step can be skipped
if (!("SingleCellExperiment" %in% installed.packages())) {
  BiocManager::install("SingleCellExperiment")
}

library(SingleCellExperiment)
# read in the RDS file, including the path to the file's location
sce <- readRDS("SCPCS000000/SCPCL000000_filtered.rds")

More resources for learning about SingleCellExperiment objects:

Quality control

After we have imported the RDS file into R and loaded the SingleCellExperiment object, we can begin working with the data. Before we perform any downstream steps, we recommend removing low quality cells from the dataset. Low quality cells include those with a higher percentage of reads from mitochondrial genes (i.e., those that are damaged or dying) and those with a lower number of total reads and unique genes identified (i.e., those with inefficient reverse transcription or PCR amplification).

All filtered.rds objects include miQC results found in the colData() of the SingleCellExperiment object. miQC jointly models the proportion of mitochondrial reads and the number of unique genes detected in each cell to calculate the probability of a cell being compromised (i.e., dead or damaged). High-quality cells are those with a low probability of being being compromised (< 0.75) or sufficiently low mitochondrial content. All cells that are identified as low-quality cells will have FALSE in the miQC_pass column of the colData() and can be removed prior to downstream analyses.

The following command can be used to remove the low-quality cells:

# filter the `SingleCellExperiment`
# the `$` notation denotes access to the colData slot of the `SingleCellExperiment` object
filtered_sce <- sce[, which(sce$miQC_pass)]

For more information on using miQC for filtering cells, see the following resources:

You can also directly filter cells based on the number of unique genes, total reads, and fraction of mitochondrial reads. We used the function scuttle::addPerCellQCMetrics() to calculate these metrics and have included them in the colData of the filtered SingleCellExperiment objects.

The following columns are added by scuttle::addPerCellQCMetrics() and can be found in the colData:

Column name

Contents

sum

UMI count for RNA-seq data

detected

Number of genes detected (gene count > 0 )

subsets_mito_sum

UMI count of mitochondrial genes

subsets_mito_detected

Number of mitochondrial genes detected

subsets_mito_percent

Percent of all UMI counts assigned to mitochondrial genes

total

Total UMI count for RNA-seq data and any alternative experiments (i.e., CITE-seq)

These metrics can be used to directly filter the SingleCellExperiment object based on informed thresholds. If you are planning to filter low quality cells using such thresholds, we encourage you to read more about the various metrics and plot the distribution of each metric before deciding on which cells to exclude. The Quality Control chapter in Orchestrating Single Cell Analysis provides a nice guide to checking diagnostic plots and then choosing cutoffs.

Normalization

The provided data contains unnormalized raw counts. We recommend using the scran and scater packages to add normalized counts to the SingleCellExperiment object.

Follow these steps to perform normalization using scran and scater:

# Cluster similar cells
qclust <- scran::quickCluster(filtered_sce)

# Compute sum factors for each cell cluster grouping.
filtered_sce <- scran::computeSumFactors(filtered_sce, clusters = qclust)

# Normalize and log transform.
normalized_sce <- scater::logNormCounts(filtered_sce)

Here we provide more resources on understanding normalization in single-cell RNA-seq analysis:

Dimensionality Reduction

Dimensionality reduction is commonly used as a precursor to plotting, clustering, and other downstream analysis.

It is common to start with performing principal component analysis (PCA), a technique that identifies new axes that capture the largest amount of variation in the data. The PCA results can be calculated and stored in the SingleCellExperiment object using the following command:

# calculate a PCA matrix using the top 500 most highly variable genes (default)
normalized_sce <- scater::runPCA(normalized_sce, ntop = 500)

Here we are calculating PCA by using the default of the top 500 most highly variable genes as input, however this is not always the optimal choice. We encourage you to visit the Feature selection chapter in Orchestrating Single Cell Analysis to read more about modeling gene variance and selecting the highly variable genes.

PCA is commonly used for initial dimensionality reduction. However, we can use more advanced techniques, like UMAP (Uniform Manifold Approximation and Projection), that may be better for visualization. Use caution when interpreting UMAP results, as location, spacing, and density of clusters can be dependent on parameter choices and random effects and do not always accurately represent the relationships among cells.

UMAP can also be quite slow for a large dataset, so we can use the previous PCA results as input to speed up the analysis.

# Run UMAP using already stored PCA results
normalized_sce <- scater::runUMAP(normalized_sce, dimred = "PCA")

See below for more resources on dimensionality reduction:

What if I want to use Seurat?

The files available for download contain SingleCellExperiment objects. If desired, these can be converted into Seurat objects. You can find the code needed to convert the SingleCellExperiment objects to Seurat objects in the FAQ section.

After converting the object to a Seurat object, the same steps outlined above (quality control, filtering, normalization, dimensionality reduction) can be followed but using functions available as part of the Seurat package.

Here are some resources that can be used to get you started working with Seurat objects:

What if I want to use scanpy?

There are a variety of ways to convert the counts data into a python-compatible format. We have found that one of the more efficient ways is to first convert the counts data into 10x format using DropletUtils::write10xCounts() and then read the files into Python using the scanpy package.

You can find the code needed to perform these steps outlined in the FAQ section on using Python.

After creating an AnnData object in scanpy, the same steps outlined above (quality control, filtering, normalization, dimensionality reduction) can be followed but using functions available as part of the scanpy package.

Here are some resources that can be used to get you started working with AnnData objects:

Special considerations for multiplexed samples

If the dataset that you have downloaded contains samples that were multiplexed (i.e. cells from mutliple samples have been combined together into one library), then the steps to take after downloading will be a little different than for libraries that only contain cells corresponding to a single sample. Here, multiplexed samples refer to samples that have been combined together into one library using cell hashing and then sequenced together. This means that a single library contains cells or nuclei that correspond to multiple samples. Each sample has been tagged with a hashtag oligo (HTO) prior to mixing, and that HTO can be used to identify which cells belong to which sample within a multiplexed library. The libraries available for download on the portal have not been separated by sample (i.e. demultiplexed), and therefore contain data from multiple samples.

Libraries containing multiplexed samples can be initially processed using the same workflow described above including removal of low quality cells, normalization, and dimensionality reduction. Demultiplexing can then be used to identify the sample that each cell is from. Demultiplexing has already been performed using both Seurat::HTODemux and DropletUtils::hashedDrops. For samples where corresponding bulk RNA-sequencing data is available, genetic demultiplexing was also conducted. The results from demultiplexing using these methods have been summarized and are present in the colData of the SingleCellExperiment object in the _filtered.rds file only. The hashedDrops_sampleid, HTODemux_sampleid, and vireo_sampleid columns in the colData report the sample called for each cell by the specified demultiplexing method. If a confident call was not made for a cell by the demultiplexing method, the column will have a value of NA. For more information on how to access the full demultiplexing results, see this description of demultiplexing results.

If desired, the sample calls identified from demultiplexing can be used to separate the SingleCellExperiment object by sample for downstream analysis.

# view the summary of the sample calls for genetic demultiplexing
# genetic demultiplexing results are stored in the vireo_sampleid column found in the colData
table(multiplexed_sce$vireo_sampleid)

# identify cells that are assigned to sample A (use `which` to remove NA)
sampleA_cells <- which(multiplexed_sce$vireo_sampleid == "sampleA")

# create a new sce that only contains cells from sample A
sampleA_sce <- multiplexed_sce[, sampleA_cells]

Here are some additional resources that can be used for working with multiplexed samples (or those with cell hashing):