Frequently Asked Questions

Why did we use Alevin-fry for processing?

We aimed to process all of the data in the portal such that it is comparable to widely used pipelines, namely Cell Ranger from 10x Genomics. In our own benchmarking, we found that Alevin-fry (He et al. 2022) produces very similar results to Cell Ranger, while allowing faster, more memory-efficient processing of single-cell and single-nuclei RNA-sequencing data. In the configuration that we are using (“selective alignment” mapping to a human transcriptome that includes introns), Alevin-fry uses approximately 12-16 GB of memory per sample and completes mapping and quantification in less than an hour. By contrast, Cell Ranger uses up to 25-30 GB of memory per sample and takes anywhere from 2 to 8 hours to align and quantify one sample. Samples quantified with both Alevin-fry and Cell Ranger showed similar distributions of mapped UMI counts per cell and genes detected per cell.

We also compared the mean gene expression reported for each gene by both methods and observed a high correlation with a Pearson R correlation coefficient of 0.98.
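As a rough illustration of this kind of comparison, the sketch below computes the mean expression of each gene under two quantification methods and then the Pearson correlation between the two sets of means. It is not the benchmarking code itself: the gene names and counts are made-up toy values, and the helper names `mean_per_gene` and `pearson_r` are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_per_gene(counts):
    """Mean expression per gene; `counts` maps gene -> list of per-cell counts."""
    return {gene: sum(values) / len(values) for gene, values in counts.items()}

# Toy counts (hypothetical values, not ScPCA data): gene -> counts in 4 cells
alevin_fry_counts = {
    "GENE_A": [0, 2, 1, 3],
    "GENE_B": [10, 12, 9, 11],
    "GENE_C": [5, 4, 6, 5],
}
cellranger_counts = {
    "GENE_A": [0, 2, 2, 3],
    "GENE_B": [11, 12, 9, 10],
    "GENE_C": [5, 5, 6, 4],
}

fry_means = mean_per_gene(alevin_fry_counts)
cr_means = mean_per_gene(cellranger_counts)

# Only compare genes reported by both methods, in a fixed order
shared_genes = sorted(set(fry_means) & set(cr_means))
r = pearson_r(
    [fry_means[g] for g in shared_genes],
    [cr_means[g] for g in shared_genes],
)
print(f"Pearson R across {len(shared_genes)} genes: {r:.3f}")
```

With real data, the same calculation would run over every gene in the reference rather than three toy genes.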

Recent reports from others support our findings. He et al. (2022) demonstrated that Alevin-fry can process single-cell and single-nuclei data more quickly and efficiently than other available methods, while also decreasing the false positive rate of gene detection that is commonly seen in methods that use transcriptome alignment. You et al. (2021) and Tian et al. (2019) have also noted that different pre-processing workflows for single-cell RNA-sequencing analysis tend to produce compatible results downstream.

How do I use the provided RDS files in R?

If you would like to work with the gene expression data in R, you will need to choose the option for downloading the data as a SingleCellExperiment object. This download includes RDS files that can be directly read into R.

Note: You will need to install and load the SingleCellExperiment package from Bioconductor to work with the provided files.

To read in the RDS files you can use the readRDS command in base R.

library(SingleCellExperiment)
scpca_sample <- readRDS("SCPCL000000_processed.rds")

A full description of the contents of the SingleCellExperiment object can be found in the section on Components of a SingleCellExperiment object. For more information on working with the RDS files, see Getting started with an ScPCA dataset.

How do I use the provided H5AD files in Python?

If you would like to work with the gene expression data in Python, you will need to choose the option for downloading the data as an AnnData object. This download includes H5AD files that can be directly read into Python.

Note: You will need to install the AnnData package to work with the provided files.

To read in the H5AD files you can use the read_h5ad function from the AnnData package.

import anndata
# the path is passed positionally; read_h5ad names this argument `filename`, not `file`
scpca_sample = anndata.read_h5ad("SCPCL000000_processed_rna.h5ad")

A full description of the contents of the AnnData object can be found in the section on Components of an AnnData object. For more information on working with the H5AD files, see Getting started with an ScPCA dataset.

Which samples can I download as AnnData objects?

Most samples in the ScPCA Portal are available for download as both SingleCellExperiment objects (.rds files) and AnnData objects (.h5ad files). There are two types of samples where AnnData objects are not available:

  • Spatial transcriptomics data from a given sample

    • As described in the spatial transcriptomics processing section, no post-processing is performed on these libraries after running Space Ranger. Therefore, we only provide the output from running Space Ranger and do not store data in either R or Python objects.

  • Samples that are part of multiplexed libraries

    • Although the ScPCA pipeline reports demultiplexing results, it does not definitively separate samples due to the potential for disagreement among methods.

    • Resolving such disagreements requires examination of the HTO data, which cannot be stored in the same AnnData object. Therefore, we do not currently provide any multiplexed libraries as AnnData objects.

    • In addition, providing multiplexed data in this form is not compliant with the standards for CZI’s CELLxGENE, which we have tried to match as closely as possible.

What if I want to use MuData instead of AnnData objects?

MuData objects are Python objects built on top of AnnData objects that are specifically used to store multimodal data. Currently, we provide RNA counts and ADT counts, if present, as separate AnnData objects in their own H5AD files, as described in the file contents documentation. However, these objects can be combined into a MuData object if desired for a multimodal analysis.

Note: You will need to install the MuData package to generate and work with MuData objects.

import anndata
import mudata

# Read individual AnnData files (the path is passed positionally)
rna_object = anndata.read_h5ad("SCPCL000000_processed_rna.h5ad")
adt_object = anndata.read_h5ad("SCPCL000000_processed_adt.h5ad")

# Combine into a MuData object, using keys "RNA" and "ADT" to distinguish modalities
mdata_object = mudata.MuData({"RNA": rna_object, "ADT": adt_object})

For more information on working with AnnData objects, see Getting started with an ScPCA dataset.

What is the difference between samples and libraries?

A sample ID, labeled as scpca_sample_id and indicated by the prefix SCPCS, represents a unique tissue that was collected from a participant.

The library ID, labeled as scpca_library_id and indicated by the prefix SCPCL, represents a single set of cells from a tissue sample, or a particular combination of samples in the case of multiplexed libraries. For single-cell or single-nuclei experiments, this will be the result of emulsion and droplet generation using the 10x Genomics workflow, potentially including RNA-seq, ADT (i.e., from CITE-seq), and cell hashing sequencing libraries. Multiplexed libraries will have more than one sample ID corresponding to each library ID.

In most cases, each sample will only have one corresponding single-cell or single-nuclei library, and may also have an associated bulk RNA-seq library. However, in some cases multiple libraries were created by separate droplet generation and sequencing from the same sample, resulting in more than one single-cell or single-nuclei library ID being associated with the same sample ID.
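The relationships described above can be sketched with a small lookup table. This is an illustration only: the IDs are made up, and the helper names `multiplexed_libraries` and `libraries_per_sample` are hypothetical rather than part of any ScPCA tool.

```python
from collections import defaultdict

# Hypothetical library-to-sample assignments (IDs invented for illustration)
library_to_samples = {
    "SCPCL000001": ["SCPCS000001"],                 # one sample, one library
    "SCPCL000002": ["SCPCS000002"],                 # the same sample captured in
    "SCPCL000003": ["SCPCS000002"],                 # two separate libraries
    "SCPCL000004": ["SCPCS000003", "SCPCS000004"],  # multiplexed library
}

def multiplexed_libraries(mapping):
    """Libraries that contain more than one sample."""
    return sorted(lib for lib, samples in mapping.items() if len(samples) > 1)

def libraries_per_sample(mapping):
    """Invert the mapping: sample ID -> sorted list of library IDs."""
    result = defaultdict(list)
    for library_id, sample_ids in mapping.items():
        for sample_id in sample_ids:
            result[sample_id].append(library_id)
    return {sample_id: sorted(libs) for sample_id, libs in result.items()}

print(multiplexed_libraries(library_to_samples))
print(libraries_per_sample(library_to_samples)["SCPCS000002"])
```

In this toy table, one library ID maps to two sample IDs (the multiplexed case), and one sample ID maps to two library IDs (the repeated droplet generation case).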

What is a participant ID?

The participant_id is a unique ID provided by the submitter to indicate the participant from which a collection of samples was obtained. For example, one participant may have a sample collected both at initial diagnosis and at relapse. This would result in two different sample IDs, but the same participant ID. However, for most participants, only a single sample was collected and submitted for sequencing.

What is a multiplexed sample?

Multiplexed samples refer to samples that have been combined together into a single library using cell hashing (Stoeckius et al. 2018) or a related technology and then sequenced together. This means that a single library contains cells or nuclei that correspond to multiple samples. Each sample has been tagged with a hashtag oligo (HTO) prior to mixing, and that HTO can be used to identify which cells or nuclei belong to which sample within a multiplexed library. The libraries available for download on the portal have not been separated by sample (i.e. demultiplexed), and therefore contain data from multiple samples. For more information on working with multiplexed samples, see the special considerations for multiplexed samples section in getting started with an ScPCA dataset.

Why are demultiplexed samples not available?

Downloading a multiplexed sample on the portal will result in obtaining the gene expression files corresponding to the library containing the chosen multiplexed sample and any other samples that were multiplexed with that chosen sample. This means that users will receive the gene expression data for all samples that were combined into a given library and will have to separate any cells corresponding to the sample of interest before proceeding with downstream analysis.

We have applied multiple demultiplexing methods to multiplexed libraries and noticed that these demultiplexing methods can vary both in calls and confidence levels assigned. Here we have performed some exploratory analysis comparing demultiplexing methods within a single multiplexed library. Because of the inconsistency across demultiplexing methods, the choice of demultiplexing method is left to the user. Rather than separating out each sample, the sample calls and any associated statistics regarding sample calls for multiple demultiplexing methods can be found in the _filtered.rds file for each multiplexed library. See the demultiplexing results section for instructions on how to access the demultiplexing results in the SingleCellExperiment objects for multiplexed libraries. We also include the hashtag oligo (HTO) counts matrix to allow demultiplexing using other available methods.
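One simple way to see how much two demultiplexing methods disagree is to compute the fraction of cells that receive the same sample call from both. The sketch below assumes the calls have already been extracted into plain dictionaries keyed by cell barcode, with `None` marking an unassigned cell. The barcodes, sample IDs, and the names `method_a_calls` and `method_b_calls` are hypothetical, and no particular demultiplexing method is implied.

```python
def agreement_rate(calls_a, calls_b):
    """Fraction of cells with identical calls, among cells called by both methods.

    Returns None when no cell was assigned a sample by both methods.
    """
    shared = [
        barcode
        for barcode in calls_a
        if calls_a[barcode] is not None and calls_b.get(barcode) is not None
    ]
    if not shared:
        return None
    matches = sum(calls_a[barcode] == calls_b[barcode] for barcode in shared)
    return matches / len(shared)

# Hypothetical per-cell sample calls from two methods (None = unassigned)
method_a_calls = {
    "AAAC": "SCPCS000001",
    "AAAG": "SCPCS000002",
    "AAAT": None,
    "AACA": "SCPCS000001",
}
method_b_calls = {
    "AAAC": "SCPCS000001",
    "AAAG": "SCPCS000001",
    "AAAT": "SCPCS000002",
    "AACA": "SCPCS000001",
}

rate = agreement_rate(method_a_calls, method_b_calls)
print(f"Agreement among cells called by both methods: {rate:.2f}")
```

Cells left unassigned by either method are excluded from the denominator here; counting them as disagreements instead is an equally defensible choice and would lower the rate.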

What are estimated demux cell counts?

Estimated demux cell counts are provided for multiplexed libraries and refer to the estimated cell counts for each sample that is present in the library. In order to provide an estimate of the number of cells or nuclei that are present in a given sample before download, we use the estimated number of cells per sample identified by one of the tested demultiplexing methods. However, these estimated demux cell counts should only be considered a guide; we encourage users to investigate the data on their own and make their own decisions on the best demultiplexing method to use for their research purposes.

Note that not all cells in a library are included in the estimated demux cell count, as some cells may not have been assigned to a sample. Estimated demux cell counts are only reported for multiplexed samples and are not reported for single-cell or single-nuclei samples that are not multiplexed. For more about demultiplexing, see the section on processing multiplexed libraries.
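To make the relationship between per-cell calls and estimated demux cell counts concrete, here is a minimal sketch that tallies cells per sample from one method's calls and leaves out unassigned cells. The barcodes and sample IDs are made up, `None` is an assumed marker for an unassigned cell, and `estimated_demux_cell_counts` is a hypothetical helper rather than the Portal's implementation.

```python
from collections import Counter

def estimated_demux_cell_counts(sample_calls):
    """Count cells per sample, skipping cells with no sample assignment (None)."""
    return dict(Counter(call for call in sample_calls.values() if call is not None))

# Hypothetical calls for one multiplexed library: barcode -> sample ID or None
sample_calls = {
    "AAAC": "SCPCS000001",
    "AAAG": "SCPCS000002",
    "AAAT": None,  # not assigned to any sample
    "AACA": "SCPCS000001",
}

counts = estimated_demux_cell_counts(sample_calls)
n_unassigned = len(sample_calls) - sum(counts.values())
print(counts, "unassigned:", n_unassigned)
```

Because unassigned cells are skipped, the per-sample counts can sum to less than the total number of cells in the library.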

What genes are included in the reference transcriptome?

The reference transcriptome index that was used for alignment was constructed by extracting both spliced cDNA and intronic regions from the primary genome assembly GRCh38, Ensembl database version 104 (see the code used to generate the reference transcriptome). The resulting reference transcriptome index contains 60,319 genes. In addition to protein-coding genes, this list of genes includes pseudogenes and non-coding RNA. The gene expression data files available for download report all possible genes present in the reference transcriptome, even if not detected in a given library.
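Because every gene in the reference is reported, the number of rows in a counts matrix is not the number of genes detected in that library. The sketch below, using made-up gene names and counts and a hypothetical helper `genes_detected`, distinguishes the two by keeping only genes with at least one non-zero count.

```python
def genes_detected(counts_by_gene):
    """Genes with at least one non-zero count across all cells."""
    return sorted(
        gene for gene, counts in counts_by_gene.items() if any(c > 0 for c in counts)
    )

# Toy counts (hypothetical): gene -> counts across 3 cells
counts_by_gene = {
    "GENE_1": [0, 3, 1],
    "GENE_2": [0, 0, 0],  # reported in the file but never detected
    "GENE_3": [2, 0, 0],
}

detected = genes_detected(counts_by_gene)
print(f"{len(detected)} of {len(counts_by_gene)} reported genes were detected")
```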

Where can I see the code for generating QC reports?

A QC report for every processed library is included with all downloads, generated from the unfiltered and filtered single-cell gene expression files. You can find the function for generating a QC report and the QC report template documents in the package we developed for working with processed ScPCA data, scpcaTools.

Which libraries include cell type annotations?

Most single-cell and single-nuclei RNA-seq libraries available on the portal will have cell type annotations included in the processed SingleCellExperiment or AnnData object. For more information on where to find the cell type annotations, refer to the sections describing SingleCellExperiment file contents and/or AnnData file contents. If cell type annotation was performed, a supplemental cell type report (SCPCL000000_celltype-report.html) will be included in the download.

Cell type annotation is not performed on samples derived from cell lines. This means processed objects will not include cell type annotations, and the download will not include a cell type report.

Which libraries include CNV inferences?

As with cell type annotation, most single-cell and single-nuclei RNA-seq libraries available on the portal will have CNV inferences in the processed SingleCellExperiment or AnnData object. For more information on where to find these results, refer to sections describing SingleCellExperiment cell metrics and SingleCellExperiment metadata, and/or AnnData cell metrics and AnnData metadata.

There are several circumstances when CNV results are not available:

  • CNV inference is not performed on libraries which do not have enough cells to include in a normal reference, as described in the CNV inference processing documentation

  • CNV inference is not performed on libraries derived from cell line or non-cancerous samples

  • If inferCNV experienced a failure while running, there will not be any associated results in the processed objects

Where can I find the inferCNV heatmap?

For libraries that underwent CNV inference, the inferCNV heatmap depicting expression across genomic regions is embedded in the final QC report. You can directly copy the figure from the QC report file for use in other contexts.

What if I want to use Seurat instead of Bioconductor?

The RDS files available for download contain SingleCellExperiment objects. If desired, these can be converted into Seurat objects.

You will need to install and load the Seurat package to work with Seurat objects.

For libraries that only contain RNA-seq data (i.e., do not have an ADT library found in the altExp of the SingleCellExperiment object), you can use the following commands:

library(Seurat)
library(SingleCellExperiment)

# read in RDS file
sce <- readRDS("SCPCL000000_filtered.rds")

# create seurat object from the SCE counts matrix
seurat_object <- CreateSeuratObject(counts = counts(sce),
                                    assay = "RNA",
                                    project = "SCPCL000000")

The above code will only retain the information found in the original counts matrix from the SingleCellExperiment. Optionally, if you would like to keep the included cell- and gene-associated metadata during conversion to the Seurat object, you can perform the additional steps below:

# convert colData and rowData to data.frame for use in the Seurat object
cell_metadata <- as.data.frame(colData(sce))
row_metadata <- as.data.frame(rowData(sce))

# add cell metadata (colData) from SingleCellExperiment to Seurat
seurat_object@meta.data <- cell_metadata

# add row metadata (rowData) from SingleCellExperiment to Seurat
seurat_object[["RNA"]]@meta.features <- row_metadata

# add metadata from SingleCellExperiment to Seurat
seurat_object@misc <- metadata(sce)

For SingleCellExperiment objects from libraries with both RNA-seq and ADT data, you can use the following additional commands to add a second assay containing the ADT counts and associated feature data:

# create assay object in Seurat from ADT counts found in altExp(SingleCellExperiment)
adt_assay <- CreateAssayObject(counts = counts(altExp(sce)))

# optional: add row metadata (rowData) from altExp to assay
adt_row_metadata <- as.data.frame(rowData(altExp(sce)))
adt_assay@meta.features <- adt_row_metadata

# add altExp from SingleCellExperiment as second assay to Seurat
seurat_object[["ADT"]] <- adt_assay

When should I download a project as a merged object?

When you download all data for an ScPCA project, you will be presented with two options. You can either download the project such that the data for each sample is stored in separate files, or you can download a single file that contains a merged object with data from all samples in the project. This merged object contains combined data from all samples (and therefore all libraries), including expression count matrices and associated metadata. The samples have simply been merged into a single file; they have not been integrated or batch-corrected.

You may prefer to download this merged object instead of individual sample files to facilitate downstream analyses that consider multiple samples at once, such as differential expression analysis, integrating multiple samples, or jointly clustering multiple samples.

Please refer to the section about getting started with a merged object for more details on working with these objects.

Which projects can I download as merged objects?

Most projects in the ScPCA Portal are available for download as a merged object. There are three types of projects for which merged objects are not available:

  • Projects composed of spatial transcriptomics libraries

    • As described in the spatial transcriptomics processing section, no post-processing is performed on these libraries after running Space Ranger. Therefore, merging samples into a single object is beyond the scope of the ScPCA pipeline.

  • Projects containing multiplexed libraries

    • Although the ScPCA pipeline reports demultiplexing results, it does not actually perform demultiplexing. As there is no guarantee that a unique HTO was used for each sample in a given project, it would not necessarily be possible to determine which HTO corresponds to which sample in a merged object.

  • Projects containing more than 100 samples

    • The more samples that are included in a merged object, the larger the object, and the more difficult it will be to work with that object in R or Python. Because of this, we do not provide merged objects for projects with more than 100 samples.
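The three exclusions above amount to a short checklist. The sketch below encodes them as a function for illustration only: it is not how the Portal decides availability, and the function name and its parameters (`n_samples`, `is_spatial`, `has_multiplexed`) are hypothetical.

```python
MAX_MERGED_SAMPLES = 100  # sample limit stated in the list above

def merged_unavailable_reasons(n_samples, is_spatial=False, has_multiplexed=False):
    """Return the reasons a merged object would not be offered (empty if none)."""
    reasons = []
    if is_spatial:
        reasons.append("spatial transcriptomics")
    if has_multiplexed:
        reasons.append("multiplexed libraries")
    if n_samples > MAX_MERGED_SAMPLES:
        reasons.append("more than 100 samples")
    return reasons

# A 40-sample project with no spatial or multiplexed libraries qualifies
print(merged_unavailable_reasons(40))
# A 120-sample project containing multiplexed libraries does not
print(merged_unavailable_reasons(120, has_multiplexed=True))
```

A project with exactly 100 samples is still within the limit, since the exclusion applies to projects with more than 100 samples.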

Why can’t I merge a subset of samples from a project?

Merged project downloads are not available for a subset of samples in a project. Merged objects will always contain all samples in the given project (see which projects do not have merged objects).

If you would like to work with a merged object that only contains a subset of project samples, we recommend downloading the merged object and subsetting it directly to the samples of your choosing. See Subsetting the Merged Object for instructions on how to subset SingleCellExperiment and AnnData merged objects.

Why doesn’t my existing code work on a new download from the Portal?

Although we try to maintain backward compatibility, new features added to the ScPCA Portal may result in downloads that are no longer compatible with code written with older downloads from the ScPCA Portal in mind. Please see our CHANGELOG for a summary of changes that impact downloads from the Portal.

I previously downloaded a sample that is no longer on the Portal. Why can’t I find it?

If a sample you downloaded previously is no longer available, a submitter has requested that it be removed. We process and release all data provided to us on the Portal, without filtering out samples based on quality. Instead, we provide a QC report with each download so that users may assess library quality themselves. However, we respect submitters’ requests to remove samples that they have deemed to be of low quality based on their own analyses.

Can I download data from the Portal programmatically?

We provide an R package, ScPCAr, to facilitate programmatic access to the ScPCA Portal. This package allows you to search for and download data from the ScPCA Portal directly within R. Please see the package documentation for more details about installation and usage. Source code for the package can be found on GitHub.

Why can’t I change the data format in My Dataset?

When creating a custom dataset for download (My Dataset), all single-cell sample or project data included must be of the same data format, either SingleCellExperiment for use in R or AnnData for use in Python. We currently do not support including both data formats at once in My Dataset. Once a sample or project of a given data format has been added to My Dataset, all subsequent single-cell or single-nuclei data added will automatically be in that same format.

Therefore, if you wish to download single-cell or single-nuclei expression data in both SingleCellExperiment and AnnData data formats, you will need to create and download separate My Datasets, one at a time, for each format.

Why did project options change when I appended samples to My Dataset?

If you would like to include all samples from a dataset you have previously created, you can append these samples to your current My Dataset. In some cases, however, certain project-level options may change when you append additional samples from a project that is already present in My Dataset.

Specifically, we apply these rules when you append to My Dataset:

  • If you selected to include bulk RNA-seq expression in the download either in the previous dataset or the current My Dataset, bulk expression will remain included in the download.

  • If you selected the merged project option both in the previous dataset and the current My Dataset, the merge option will remain selected. Otherwise, if only one dataset had this option selected, the merge option will no longer be applied.

You are always welcome to edit these options in My Dataset to your liking after appending the additional samples.
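The two rules above behave like a logical OR for bulk RNA-seq and a logical AND for the merged option. The sketch below shows that reading with a hypothetical `ProjectOptions` record and `combine_on_append` helper; neither name comes from the Portal.

```python
from typing import NamedTuple

class ProjectOptions(NamedTuple):
    """Hypothetical record of the project-level options for one project."""
    include_bulk: bool
    merged: bool

def combine_on_append(previous, current):
    """Combine options when a previous dataset is appended to My Dataset."""
    return ProjectOptions(
        # Bulk stays included if either dataset had it selected
        include_bulk=previous.include_bulk or current.include_bulk,
        # Merge stays selected only if both datasets had it selected
        merged=previous.merged and current.merged,
    )

previous = ProjectOptions(include_bulk=True, merged=True)
current = ProjectOptions(include_bulk=False, merged=False)
print(combine_on_append(previous, current))
```

In this example the combined options keep bulk RNA-seq but drop the merged option, because only one of the two datasets had merging selected.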

Why are some values different after I regenerate My Dataset?

The Portal only offers data processed using a single version of the AlexsLemonade/scpca-nf workflow for each sample at any given time. If any new features or updates are made to the workflow, all data currently on the Portal will be re-processed. Because of this, when you regenerate a previously created version of My Dataset, the values in your downloaded files may be slightly different compared to a previous download. Any major changes made to data on the Portal are described on the CHANGELOG page.

You can learn more about the specific version of the data you have as follows:

  • The file name of each downloaded zip file, and the enclosed README.md, will include the date it was downloaded from the Portal

  • The metadata included in your downloaded files will contain information about the AlexsLemonade/scpca-nf workflow version that was used to process the data

    • For example, the column workflow_version in the metadata file included in your download (single-cell_metadata.tsv, bulk_metadata.tsv, and/or spatial_metadata.tsv) provides the AlexsLemonade/scpca-nf workflow version used to process the sample, and the column processed_date provides the date the sample was processed through the workflow. See the metadata documentation for additional information.

  • For more information about a given AlexsLemonade/scpca-nf release, refer to the releases page on GitHub
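As a minimal sketch of checking these columns in Python, the code below reads a tab-separated metadata table and reports workflow_version and processed_date for each row. The rows are invented placeholders, and using scpca_library_id as the identifying column is an assumption about the file layout; `workflow_info` is a hypothetical helper.

```python
import csv
import io

def workflow_info(tsv_handle, id_column="scpca_library_id"):
    """Map each row's ID to its (workflow_version, processed_date) pair."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return {
        row[id_column]: (row["workflow_version"], row["processed_date"])
        for row in reader
    }

# Placeholder metadata standing in for a downloaded TSV (values are made up)
example_metadata = io.StringIO(
    "scpca_library_id\tworkflow_version\tprocessed_date\n"
    "SCPCL000000\tv0.0.0\t2000-01-01\n"
    "SCPCL000001\tv0.0.0\t2000-01-02\n"
)

info = workflow_info(example_metadata)
for library_id, (version, processed_date) in info.items():
    print(library_id, version, processed_date)
```

To run this against a real download, replace the in-memory example with an open file handle, for instance `open("single-cell_metadata.tsv", newline="")`.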