Single-cell gene expression file contents

Single-cell or single-nuclei gene expression data (unfiltered, filtered, or processed) is provided for use with R as an RDS file containing a SingleCellExperiment object. This object contains the expression data, cell and gene metrics, associated metadata, and, in the case of multimodal data like ADTs from CITE-seq experiments, data from additional cell-based assays.

We highly encourage you to familiarize yourself with the general object structure and functions available as part of the SingleCellExperiment package from Bioconductor. Below we present some details about the specific contents of the objects we provide.

To begin, you will need to load the SingleCellExperiment package and read the RDS file:

library(SingleCellExperiment)
sce <- readRDS("SCPCL000000_processed.rds")

Components of a `SingleCellExperiment` object

Expression counts

The counts assay of the SingleCellExperiment object for single-cell and single-nuclei experiments (for all provided file types) contains the primary RNA-seq expression data as integer counts. The counts here include reads aligned to both spliced and unspliced cDNA (see the section on Post Alevin-fry processing). The data is stored as a sparse matrix, and each column represents a cell or droplet, each row a gene. Column names are cell barcode sequences and row names are Ensembl gene IDs. The counts assay can be accessed with the following R code:

count_matrix <- counts(sce)

Additionally, the spliced assay contains a counts matrix that includes reads from spliced cDNA only.
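As a rough illustration of how the two assays relate, the sketch below computes the fraction of each cell's total counts that comes from spliced reads. The helper name `spliced_fraction()` is hypothetical and not part of any package; it assumes the `counts` and `spliced` assays are both present and have matching dimensions.

```r
library(SingleCellExperiment)

# Hypothetical helper (not part of ScPCA or Bioconductor): for each cell,
# the fraction of total counts that comes from spliced reads only.
spliced_fraction <- function(sce) {
  stopifnot(all(c("counts", "spliced") %in% assayNames(sce)))
  # Matrix::colSums() handles both sparse and dense matrices
  total_per_cell <- Matrix::colSums(counts(sce))
  spliced_per_cell <- Matrix::colSums(assay(sce, "spliced"))
  # Cells or droplets with zero total counts yield NaN
  spliced_per_cell / total_per_cell
}
```

With the object loaded above, `summary(spliced_fraction(sce))` would give a quick overview of the distribution across cells.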

Cell metrics

Cell metrics calculated from the RNA-seq expression data are stored as a DataFrame in the colData slot, with the cell barcodes as the names of the rows.

cell_metrics <- colData(sce)

The following per-cell columns are included, calculated using the scuttle::addPerCellQCMetrics() function.

Column name	Contents
`sum`	UMI count for RNA-seq data
`detected`	Number of genes detected (gene count > 0)
`subsets_mito_sum`	UMI count of mitochondrial genes
`subsets_mito_detected`	Number of mitochondrial genes detected
`subsets_mito_percent`	Percent of all UMI counts assigned to mitochondrial genes
`total`	Total UMI count for RNA-seq data and any alternative experiments (e.g., ADT data from CITE-seq)
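The relationship between these columns can be checked with a little arithmetic. The sketch below recomputes the mitochondrial percentage from `subsets_mito_sum` and `sum`, assuming the percentage is taken relative to `sum` as the table describes. The helper name `recompute_mito_percent()` is hypothetical.

```r
library(SingleCellExperiment)

# Hypothetical helper: recompute the mitochondrial percentage from the
# per-cell UMI sums, assuming it is expressed relative to `sum`.
recompute_mito_percent <- function(sce) {
  cell_metrics <- colData(sce)
  stopifnot(all(c("sum", "subsets_mito_sum") %in% colnames(cell_metrics)))
  100 * cell_metrics$subsets_mito_sum / cell_metrics$sum
}
```

If the assumption holds, comparing the result with `colData(sce)$subsets_mito_percent` using `all.equal()` should show agreement.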

The following additional per-cell data columns are included in both the filtered and processed objects. These columns include metrics calculated by miQC, a package that jointly models proportion of reads belonging to mitochondrial genes and number of unique genes detected to predict low-quality cells. We also include the filtering results used for the creation of the processed objects. See the description of the processed gene expression data for more information on filtering performed to create the processed objects.

Column name	Contents
`prob_compromised`	Probability that a cell is compromised (i.e., dead or damaged), as calculated by `miQC`
`miQC_pass`	Indicates whether the cell passed the default miQC filtering. `TRUE` is assigned to cells with a low probability of being compromised (`prob_compromised` < 0.75) or sufficiently low mitochondrial content
`scpca_filter`	Labels cells as either `Keep` or `Remove` based on filtering criteria (`prob_compromised` < 0.75 and number of unique genes detected > 200)
`adt_scpca_filter`	If CITE-seq was performed, labels cells as either `Keep` or `Remove` based on ADT filtering criteria (`discard = TRUE` as determined by `DropletUtils::cleanTagCounts()`)
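One way to use these labels is to subset an object to the cells marked `Keep`. The following is a minimal sketch rather than the pipeline's own filtering code; the helper name `keep_scpca_cells()` is hypothetical, and it assumes the `scpca_filter` column is present.

```r
library(SingleCellExperiment)

# Hypothetical helper: retain only cells labeled "Keep" in `scpca_filter`.
keep_scpca_cells <- function(sce) {
  filter_labels <- colData(sce)$scpca_filter
  stopifnot(!is.null(filter_labels))
  # which() drops any NA labels instead of propagating them into the subset
  sce[, which(filter_labels == "Keep")]
}
```

Running `table(colData(sce)$scpca_filter)` beforehand shows how many cells carry each label.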

Gene information and metrics

Gene information and metrics calculated from the RNA-seq expression data are stored as a DataFrame in the rowData slot, with the Ensembl ID as the names of the rows.

gene_info <- rowData(sce)

The following columns are included for all genes. Metrics were calculated using the scuttle::addPerFeatureQCMetrics() function.

Column name	Contents
`gene_symbol`	HUGO gene symbol, if defined
`mean`	Mean count across all cells/droplets
`detected`	Percent of cells in which the gene was detected (gene count > 0)
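Because row names are Ensembl IDs, a common first step is translating gene symbols into the corresponding IDs. The sketch below shows one way to do this with the `gene_symbol` column; the helper name `find_ensembl_ids()` is hypothetical.

```r
library(SingleCellExperiment)

# Hypothetical helper: look up Ensembl IDs (row names) for a set of gene
# symbols. Genes without a defined symbol are never matched.
find_ensembl_ids <- function(sce, symbols) {
  gene_info <- rowData(sce)
  hits <- which(gene_info$gene_symbol %in% symbols)
  # Return a character vector of Ensembl IDs named by gene symbol
  setNames(rownames(gene_info)[hits], gene_info$gene_symbol[hits])
}
```

The returned IDs can then be used to index the count matrix, for example `counts(sce)[find_ensembl_ids(sce, "MYCN"), ]`.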

Experiment metadata

Metadata associated with data processing is included in the metadata slot as a list.

expt_metadata <- metadata(sce)

Item name	Contents
`salmon_version`	Version of `salmon` used for initial mapping
`reference_index`	Transcriptome reference file used for mapping
`total_reads`	Total number of reads processed by `salmon`
`mapped_reads`	Number of reads successfully mapped
`mapping_tool`	Pipeline used for mapping and quantification (`alevin-fry` for all current data in ScPCA)
`alevinfry_version`	Version of `alevin-fry` used for mapping and quantification
`af_permit_type`	`alevin-fry generate-permit-list` method used for filtering cell barcodes
`af_resolution`	`alevin-fry quant` resolution mode used
`usa_mode`	Boolean indicating whether quantification was done using `alevin-fry` USA mode
`af_num_cells`	Number of cells reported by `alevin-fry`
`tech_version`	A string indicating the technology and version used for the single-cell library, such as 10Xv2, 10Xv3, or 10Xv3.1
`transcript_type`	Transcripts included in gene counts: `spliced` for single-cell samples and `unspliced` for single-nuclei
`miQC_model`	The model object that `miQC` fit to the data and was used to calculate `prob_compromised`. Only present for `filtered` objects
`filtering_method`	The method used for cell filtering. One of `emptyDrops`, `emptyDropsCellRanger`, or `UMI cutoff`. Only present for `filtered` objects
`umi_cutoff`	The minimum UMI count per cell used as a threshold for removing empty droplets. Only present for `filtered` objects where the `filtering_method` is `UMI cutoff`
`prob_compromised_cutoff`	The minimum cutoff for the probability of a cell being compromised, as calculated by `miQC`. Only present for `filtered` objects
`scpca_filter_method`	Method used by the Data Lab to filter low quality cells prior to normalization. Either `miQC` or `Minimum_gene_cutoff`
`adt_scpca_filter_method`	If CITE-seq was performed, the method used by the Data Lab to identify cells to be filtered prior to normalization, based on ADT counts. Either `cleanTagCounts with isotype controls` or `cleanTagCounts without isotype controls`. If filtering failed (i.e., `DropletUtils::cleanTagCounts()` could not reliably determine which cells to filter), the value will be `No filter`
`min_gene_cutoff`	The minimum cutoff for the number of unique genes detected per cell. Only present for `filtered` objects
`normalization`	The method used for normalization of raw RNA counts. Either `deconvolution`, described in Lun, Bach, and Marioni (2016), or `log-normalization`. Only present for `processed` objects
`adt_normalization`	If CITE-seq was performed, the method used for normalization of raw ADT counts. Either `median-based` or `log-normalization`, as explained in the processed ADT data section. Only present for `processed` objects
`highly_variable_genes`	A list of highly variable genes used for dimensionality reduction, determined using `scran::modelGeneVar` and `scran::getTopHVGs`. Only present for `processed` objects
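Individual items can be pulled out of the metadata list by name. As a small example, the sketch below derives the fraction of reads that mapped from the `mapped_reads` and `total_reads` items; the helper name `mapping_rate()` is hypothetical.

```r
library(SingleCellExperiment)

# Hypothetical helper: fraction of processed reads that were mapped,
# computed from two items in the metadata list.
mapping_rate <- function(sce) {
  expt_metadata <- metadata(sce)
  stopifnot(all(c("mapped_reads", "total_reads") %in% names(expt_metadata)))
  expt_metadata$mapped_reads / expt_metadata$total_reads
}
```

Calling `names(metadata(sce))` lists which items are available for a given object, since several are only present in filtered or processed files.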

Dimensionality reduction results

Only the RDS file containing the processed SingleCellExperiment object (_processed.rds) has dimensionality reduction results: its reducedDim slot contains both principal component analysis (PCA) and UMAP results. For all other files, the reducedDim slot will be empty as no dimensionality reduction was performed.

PCA results were calculated using scater::runPCA(), using only highly variable genes. The highly variable genes were selected using scran::modelGeneVar() and scran::getTopHVGs(), and the list is stored in the SingleCellExperiment object in metadata(sce)$highly_variable_genes. The following command can be used to access the PCA results:

reducedDim(sce, "PCA")

UMAP results were calculated using scater::runUMAP(), with the PCA results as input rather than the full gene expression matrix. The following command can be used to access the UMAP results:

reducedDim(sce, "UMAP")
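For plotting, it is often convenient to have the UMAP coordinates in a data frame alongside the cell barcodes. The sketch below is one way to build such a table, assuming the UMAP matrix has at least two columns; the helper name `umap_data_frame()` and the output column names are hypothetical.

```r
library(SingleCellExperiment)

# Hypothetical helper: collect UMAP coordinates into a data frame with one
# row per cell. Only processed objects are expected to contain "UMAP".
umap_data_frame <- function(sce) {
  stopifnot("UMAP" %in% reducedDimNames(sce))
  coords <- reducedDim(sce, "UMAP")
  data.frame(
    barcode = colnames(sce),
    UMAP1 = unname(coords[, 1]),
    UMAP2 = unname(coords[, 2])
  )
}
```

The result can be passed to any plotting function, for example `with(umap_data_frame(sce), plot(UMAP1, UMAP2))`.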

Additional SingleCellExperiment components for CITE-seq libraries (with ADT tags)

ADT data from CITE-seq experiments, when present, is included within the SingleCellExperiment as an “Alternative Experiment” named "adt", which can be accessed with the following command:

altExp(sce, "adt")

Within this, the main expression matrix is again found in the counts assay and the normalized expression matrix is found in the logcounts assay. For each assay, each column corresponds to a cell or droplet (in the same order as the parent SingleCellExperiment) and each row corresponds to an antibody-derived tag (ADT). Column names are again cell barcode sequences and row names are the antibody targets for each ADT.

Note that only cells labeled “Keep” in the colData(sce)$adt_scpca_filter column (as described above) have normalized expression values in the logcounts assay; all other cells are assigned NA values. However, as described in the processed ADT data section, normalization may fail under certain circumstances, in which case the alternative experiment will not contain a logcounts assay.
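Both caveats can be handled in code before working with normalized ADT values. The sketch below returns the normalized ADT matrix for cells labeled `Keep`, or `NULL` when no `logcounts` assay is present. The helper name `adt_logcounts_kept()` is hypothetical.

```r
library(SingleCellExperiment)

# Hypothetical helper: normalized ADT values restricted to cells that passed
# ADT filtering. Returns NULL if normalization failed (no logcounts assay).
adt_logcounts_kept <- function(sce) {
  stopifnot("adt" %in% altExpNames(sce))
  adt <- altExp(sce, "adt")
  if (!"logcounts" %in% assayNames(adt)) {
    return(NULL)
  }
  # The filter labels live in the colData of the parent object
  keep <- which(colData(sce)$adt_scpca_filter == "Keep")
  logcounts(adt)[, keep, drop = FALSE]
}
```

Checking the return value with `is.null()` makes it possible to skip ADT analyses for libraries where normalization was not possible.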

The following additional per-cell data columns for the ADT data can be found in the main colData data frame (accessed with colData(sce) as above).

Column name	Contents
`altexps_adt_sum`	UMI count for CITE-seq ADTs
`altexps_adt_detected`	Number of ADTs detected per cell (ADT count > 0)
`altexps_adt_percent`	Percent of `total` UMI count from ADT reads

In addition, the following QC statistics from DropletUtils::cleanTagCounts() can be found in the colData of the "adt" alternative experiment, accessed with colData(altExp(sce, "adt")).

Column name	Contents
`zero.ambient`	Indicates whether the cell has zero ambient contamination
`sum.controls`	The sum of counts for all control features. Only present if negative/isotype control ADTs are present
`high.controls`	Indicates whether the cell has unusually high total control counts. Only present if negative/isotype control ADTs are present
`ambient.scale`	The relative amount of ambient contamination. Only present if negative/isotype control ADTs are not present
`high.ambient`	Indicates whether the cell has unusually high contamination. Only present if negative/isotype control ADTs are not present
`discard`	Indicates whether the cell should be discarded based on QC statistics

Metrics for each of the ADTs assayed can be found as a DataFrame stored as rowData within the alternative experiment:

adt_info <- rowData(altExp(sce, "adt"))

This data frame contains the following columns with statistics for each ADT:

Column name	Contents
`mean`	Mean ADT count across all cells/droplets
`detected`	Percent of cells in which the ADT was detected (ADT count > 0)
`target_type`	Whether each ADT is a target (`target`), negative/isotype control (`neg_control`), or positive control (`pos_control`). If this information was not provided, all ADTs will have been considered targets and will be labeled as `target`

Finally, additional metadata for ADT processing can be found in the metadata slot of the alternative experiment. This metadata slot has the same contents as the parent experiment metadata, along with one additional field, ambient_profile, which holds a list representing the ambient concentrations of each ADT.

adt_metadata <- metadata(altExp(sce, "adt"))

Additional SingleCellExperiment components for multiplexed libraries

Multiplexed libraries will contain a number of additional components and fields.

Hashtag oligo (HTO) quantification for each cell is included within the SingleCellExperiment as an “Alternative Experiment” named "cellhash", which can be accessed with the following command:

altExp(sce, "cellhash")

Within this, the main data matrix is again found in the counts assay, with each column corresponding to a cell or droplet (in the same order as the parent SingleCellExperiment) and each row corresponding to a hashtag oligo (HTO). Column names are again cell barcode sequences and row names are the HTO IDs for all assayed HTOs.

The following additional per-cell data columns for the cellhash data can be found in the main colData data frame (accessed with colData(sce) as above).

Column name	Contents
`altexps_cellhash_sum`	UMI count for cellhash HTOs
`altexps_cellhash_detected`	Number of HTOs detected per cell (HTO count > 0)
`altexps_cellhash_percent`	Percent of `total` UMI count from HTO reads

Metrics for each of the HTOs assayed can be found as a DataFrame stored as rowData within the alternative experiment:

hto_info <- rowData(altExp(sce, "cellhash"))

This data frame contains the following columns with statistics for each HTO:

Column name	Contents
`mean`	Mean HTO count across all cells/droplets
`detected`	Percent of cells in which the HTO was detected (HTO count > 0)
`sample_id`	Sample ID for this library that corresponds to the HTO (only present in `_filtered.rds` files)

Note that in the unfiltered SingleCellExperiment objects, this may include hashtag oligos that do not correspond to any included sample, but were part of the reference set used for mapping.

Demultiplexing results

Demultiplexing results are included only in the _filtered.rds files. The demultiplexing methods applied for these files are as described in the multiplex data processing section.

Demultiplexing analysis adds the following additional fields to the colData(sce) data frame:

Column name	Contents
`hashedDrops_sampleid`	Most likely sample as called by `DropletUtils::hashedDrops`
`HTODemux_sampleid`	Most likely sample as called by `Seurat::HTODemux`
`vireo_sampleid`	Most likely sample as called by `vireo` (genetic demultiplexing)
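Since several methods assign each cell to a sample, it can be useful to see how often they agree. The sketch below cross-tabulates the `hashedDrops_sampleid` and `vireo_sampleid` calls, keeping cells that a method left unassigned (`NA`). The helper name `compare_demux_calls()` is hypothetical.

```r
library(SingleCellExperiment)

# Hypothetical helper: cross-tabulate sample calls from two demultiplexing
# methods. Rows are hashedDrops calls, columns are vireo calls.
compare_demux_calls <- function(sce) {
  cell_metrics <- colData(sce)
  stopifnot(
    all(c("hashedDrops_sampleid", "vireo_sampleid") %in% colnames(cell_metrics))
  )
  table(
    hashedDrops = cell_metrics$hashedDrops_sampleid,
    vireo = cell_metrics$vireo_sampleid,
    useNA = "ifany"
  )
}
```

Cells on the diagonal of the resulting table received the same call from both methods; the same pattern applies to `HTODemux_sampleid`.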

Additional demultiplexing statistics

Each demultiplexing method generates additional statistics specific to the method that you may wish to access, including probabilities, alternative calls, and potential doublet information.

For methods that rely on the HTO data, these statistics are found in the colData(altExp(sce, "cellhash")) data frame; DropletUtils::hashedDrops() statistics have the prefix hashedDrops_ and Seurat::HTODemux() statistics have the prefix HTODemux_.

Genetic demultiplexing statistics are found in the main colData(sce) data frame, with the prefix vireo_.
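Because the statistics for each method share a prefix, the relevant columns can be found by name. The sketch below uses only base R and works on any data frame-like object with column names; the helper name `demux_stat_columns()` is hypothetical.

```r
# Hypothetical helper: names of all columns that start with a given prefix.
# Works with a base data.frame or the DataFrame returned by colData().
demux_stat_columns <- function(cell_data, prefix) {
  column_names <- colnames(cell_data)
  column_names[startsWith(column_names, prefix)]
}
```

For example, `demux_stat_columns(colData(sce), "vireo_")` would list the genetic demultiplexing columns, and passing `colData(altExp(sce, "cellhash"))` with the prefix `"hashedDrops_"` would list the HTO-based ones.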