# Getting started with an ScPCA dataset

This section provides information on next steps you might take after downloading a dataset from the ScPCA portal.

We also have a separate GitHub repository that contains workflows for some common analysis performed on single-cell RNA-sequencing data. These workflows are designed to apply the same analysis (e.g., clustering) across multiple samples in parallel. These workflows and more resources for processing single-cell and single-nuclei datasets can be found in the scpca-downstream-analyses repository.

## Importing ScPCA data into R

Quantified single-cell or single-nuclei gene expression data is provided as an RDS file as described in the single cell gene expression file contents section. There are three RDS files that are available for each library: an unfiltered.rds, a filtered.rds, and a processed.rds file. The unfiltered.rds file contains the gene expression data for all droplets, regardless of the presence of a cell or not. The filtered.rds files contains the gene expression data for only droplets that are likely to contain cells, removing any probable empty droplets. See the section on filtering cells for more information on how we remove potential empty droplets.

The processed.rds files are further filtered to remove any low quality cells and contain both the raw and normalized gene expression data for the identified cells. See the description of the processed gene expression data for more information on the processed objects.

In most scenarios, we recommend starting with the processed.rds file.

The first step in analyzing the provided gene expression data will be to import the data into R. As a reminder, each RDS file contains a SingleCellExperiment object. Refer to single-cell gene expression file contents for a more detailed description on the contents of the included SingleCellExperiment object.

To work with SingleCellExperiment objects in R, we need to ensure that we have the SingleCellExperiment package installed and loaded.

The following commands can be used to import the RDS file into R and save the SingleCellExperiment object:

# if SingleCellExperiment is not installed, install the package
# otherwise the installation step can be skipped
if (!("SingleCellExperiment" %in% installed.packages())) {
BiocManager::install("SingleCellExperiment")
}

library(SingleCellExperiment)
# read in the RDS file, including the path to the file's location


More resources for learning about SingleCellExperiment objects:

## Working with the processed SingleCellExperiment objects

The SingleCellExperiment objects stored in the processed.rds files have undergone additional quality control steps to remove low quality cells. In addition, normalized expression counts and dimensionality reduction (principal component analysis and UMAP) have been calculated.

The following commands can be used to access the raw and normalized count matrices:

# the raw counts matrix stored in the processed object
raw_counts <- counts(processed_sce)

# log normalized counts matrix stored in the processed object
normalized_counts <- logcounts(processed_sce)


Dimensionality reduction results can be accessed using the following command:

# extract principal component results
pca_results <- reducedDim(processed_sce, "PCA")

# extract UMAP results
umap_results <- reducedDim(processed_sce, "UMAP")


Principal components were calculated from a set of highly variable genes identified for a given library. The list of highly variable genes used for this calculation, in order from highest to lowest variation, is stored in the metadata of the SingleCellExperiment object.

This list can be accessed using the following command:

filtered_sce <- sce[, which(sce$miQC_pass)]  For more information on using miQC for filtering cells, see the following resources: You can also directly filter cells based on the number of unique genes, total reads, and fraction of mitochondrial reads. We used the function scuttle::addPerCellQCMetrics() to calculate these metrics and have included them in the colData of the filtered SingleCellExperiment objects. The following columns are added by scuttle::addPerCellQCMetrics() and can be found in the colData: Column name Contents sum UMI count for RNA-seq data detected Number of genes detected (gene count > 0 ) subsets_mito_sum UMI count of mitochondrial genes subsets_mito_detected Number of mitochondrial genes detected subsets_mito_percent Percent of all UMI counts assigned to mitochondrial genes total Total UMI count for RNA-seq data and any alternative experiments (i.e., CITE-seq) These metrics can be used to directly filter the SingleCellExperiment object based on informed thresholds. If you are planning to filter low quality cells using such thresholds, we encourage you to read more about the various metrics and plot the distribution of each metric before deciding on which cells to exclude. The Quality Control chapter in Orchestrating Single Cell Analysis provides a nice guide to checking diagnostic plots and then choosing cutoffs. ### Normalization The provided data contains unnormalized raw counts. We recommend using the scran and scater packages to add normalized counts to the SingleCellExperiment object. Follow these steps to perform normalization using scran and scater: # Cluster similar cells qclust <- scran::quickCluster(filtered_sce) # Compute sum factors for each cell cluster grouping. filtered_sce <- scran::computeSumFactors(filtered_sce, clusters = qclust) # Normalize and log transform. normalized_sce <- scater::logNormCounts(filtered_sce)  Here we provide more resources on understanding normalization in single-cell RNA-seq analysis: ### Dimensionality Reduction Dimensionality reduction is commonly used as a precursor to plotting, clustering, and other downstream analysis. It is common to start with performing principal component analysis (PCA), a technique that identifies new axes that capture the largest amount of variation in the data. The PCA results can be calculated and stored in the SingleCellExperiment object using the following command: # calculate a PCA matrix using the top 500 most highly variable genes (default) normalized_sce <- scater::runPCA(normalized_sce, ntop = 500)  Here we are calculating PCA by using the default of the top 500 most highly variable genes as input, however this is not always the optimal choice. We encourage you to visit the Feature selection chapter in Orchestrating Single Cell Analysis to read more about modeling gene variance and selecting the highly variable genes. PCA is commonly used for initial dimensionality reduction. However, we can use more advanced techniques, like UMAP (Uniform Manifold Approximation and Projection), that may be better for visualization. Use caution when interpreting UMAP results, as location, spacing, and density of clusters can be dependent on parameter choices and random effects and do not always accurately represent the relationships among cells. UMAP can also be quite slow for a large dataset, so we can use the previous PCA results as input to speed up the analysis. # Run UMAP using already stored PCA results normalized_sce <- scater::runUMAP(normalized_sce, dimred = "PCA")  See below for more resources on dimensionality reduction: ## What if I want to use Seurat? The files available for download contain SingleCellExperiment objects. If desired, these can be converted into Seurat objects. You can find the code needed to convert the SingleCellExperiment objects to Seurat objects in the FAQ section. After converting the object to a Seurat object, the same steps outlined above (quality control, filtering, normalization, dimensionality reduction) can be followed but using functions available as part of the Seurat package. Here are some resources that can be used to get you started working with Seurat objects: ## What if I want to use scanpy? There are a variety of ways to convert the counts data into a python-compatible format. We have found that one of the more efficient ways is to first convert the counts data into 10x format using DropletUtils::write10xCounts() and then read the files into Python using the scanpy package. You can find the code needed to perform these steps outlined in the FAQ section on using Python. After creating an AnnData object in scanpy, the same steps outlined above (quality control, filtering, normalization, dimensionality reduction) can be followed but using functions available as part of the scanpy package. Here are some resources that can be used to get you started working with AnnData objects: ## Special considerations for multiplexed samples If the dataset that you have downloaded contains samples that were multiplexed (i.e. cells from multiple samples have been combined together into one library), then the steps to take after downloading will be a little different than for libraries that only contain cells corresponding to a single sample. Here, multiplexed samples refer to samples that have been combined together into one library using cell hashing and then sequenced together. This means that a single library contains cells or nuclei that correspond to multiple samples. Each sample has been tagged with a hashtag oligo (HTO) prior to mixing, and that HTO can be used to identify which cells belong to which sample within a multiplexed library. The libraries available for download on the portal have not been separated by sample (i.e. demultiplexed), and therefore contain data from multiple samples. Libraries containing multiplexed samples can be initially processed using the same workflow described above including removal of low quality cells, normalization, and dimensionality reduction. Demultiplexing can then be used to identify the sample that each cell is from. Demultiplexing has already been performed using both Seurat::HTODemux and DropletUtils::hashedDrops. For samples where corresponding bulk RNA-sequencing data is available, genetic demultiplexing was also conducted. The results from demultiplexing using these methods have been summarized and are present in the colData of the SingleCellExperiment object in the _filtered.rds file only. The hashedDrops_sampleid, HTODemux_sampleid, and vireo_sampleid columns in the colData report the sample called for each cell by the specified demultiplexing method. If a confident call was not made for a cell by the demultiplexing method, the column will have a value of NA. For more information on how to access the full demultiplexing results, see this description of demultiplexing results. If desired, the sample calls identified from demultiplexing can be used to separate the SingleCellExperiment object by sample for downstream analysis. # view the summary of the sample calls for genetic demultiplexing # genetic demultiplexing results are stored in the vireo_sampleid column found in the colData table(multiplexed_sce$vireo_sampleid)

# identify cells that are assigned to sample A (use which to remove NA)
sampleA_cells <- which(multiplexed_sce\$vireo_sampleid == "sampleA")

# create a new sce that only contains cells from sample A
sampleA_sce <- multiplexed_sce[, sampleA_cells]


Here are some additional resources that can be used for working with multiplexed samples (or those with cell hashing):