Single-cell gene expression file contents
Single-cell or single-nuclei gene expression data (unfiltered, filtered, or processed) is provided in two formats:
- As an RDS file containing a SingleCellExperiment object for use in R.
- An H5AD file containing an AnnData object for use in Python.
These objects contain the expression data, cell and gene metrics, associated metadata, and, in the case of multimodal data like ADTs from CITE-seq experiments, data from additional cell-based assays.
For SingleCellExperiment objects, the ADT data will be included as an alternative experiment in the same object containing the primary RNA data.
For AnnData objects, the ADT data will be available as a separate object stored in a separate file.
Note that multiplexed sample libraries are only available as SingleCellExperiment objects, and are not currently available as AnnData objects.
Below we present some details about the specific contents of the objects we provide.
Components of a SingleCellExperiment object
Before getting started, we highly encourage you to familiarize yourself with the general SingleCellExperiment object structure and functions available as part of the SingleCellExperiment package from Bioconductor.
To begin, you will need to load the SingleCellExperiment package and read the RDS file:
library(SingleCellExperiment)
sce <- readRDS("SCPCL000000_processed.rds")
SingleCellExperiment expression counts
The counts and logcounts assays of the SingleCellExperiment object for single-cell and single-nuclei experiments contain the main RNA-seq expression data.
The counts assay contains the primary raw counts represented as integers, and the logcounts assay contains normalized counts as described in the data post-processing section.
The counts assay includes reads aligned to both spliced and unspliced cDNA (see the section on Post Alevin-fry processing).
Each assay is stored as a sparse matrix, where each column represents a cell or droplet, and each row represents a gene.
The counts and logcounts assays can be accessed with the following R code:
counts(sce) # counts matrix
logcounts(sce) # logcounts matrix
Column names are cell barcode sequences, and row names are Ensembl gene IDs. These names can be accessed with the following R code:
colnames(sce) # matrix column names
rownames(sce) # matrix row names
There is also a spliced assay which contains the counts matrix with only reads from spliced cDNA.
SingleCellExperiment cell metrics
Cell metrics calculated from the RNA-seq expression data are stored as a DataFrame in the colData slot, with the cell barcodes as the names of the rows.
colData(sce) # cell metrics
The following per-cell data columns are included, calculated using the scuttle::addPerCellQCMetrics() function.
| Column name | Contents |
|---|---|
|  | UMI count for RNA-seq data |
|  | Number of genes detected (gene count > 0) |
|  | UMI count of mitochondrial genes |
|  | Number of mitochondrial genes detected |
|  | Percent of all UMI counts assigned to mitochondrial genes |
|  | Total UMI count for RNA-seq data and any alternative experiments (i.e., ADT data from CITE-seq) |
The following additional per-cell data columns are included in both the filtered and processed objects.
These columns include metrics calculated by miQC, a package that jointly models proportion of reads belonging to mitochondrial genes and number of unique genes detected to predict low-quality cells, as well as scDblFinder, a package to predict doublets.
We also include the filtering results used for the creation of the processed objects.
See the description of the processed gene expression data for more information on filtering performed to create the processed objects.
Column name |
Contents |
|---|---|
|
Probability that a cell is compromised (i.e., dead or damaged), as calculated by |
|
Indicates whether the cell passed the default miQC filtering. |
|
Labels cells as either |
|
If CITE-seq was performed, labels cells as either |
|
The |
|
The |
The processed object contains additional colData column(s):
- A column with graph-based clustering assignments will be present. Note that these clusters were calculated with default parameters and were not evaluated, as described in the section on processed gene expression data.
- If cell type annotation was performed, columns containing annotation results will be present, as described in the cell type annotation processing section.
- If CNV inference was performed, columns containing these results will be present, as described in the CNV inference processing section.
Column name |
Contents |
|---|---|
|
Cell cluster identity identified by graph-based clustering |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
The assigned consensus cell type annotation as determined by finding the latest common ancestor among |
|
The assigned consensus cell type ontology ID as determined by finding the latest common ancestor among |
|
Whether the cell would be considered part of the reference normal cells for |
|
If CNV inference was performed, the total number of CNV events in the cell calculated by |
For some libraries, cell types were annotated either by submitters or through the OpenScPCA project as described in the cell type annotation processing section.
In this case, these additional columns will be present in all three objects (unfiltered, filtered, and processed) containing the associated cell type annotations:
Column name |
Contents |
|---|---|
|
If available, cell type annotations obtained from the group that submitted the original data. Cells that the submitter did not annotate are labeled as |
|
If available, cell type annotations obtained from the OpenScPCA project as determined by analysis performed in the |
|
If available, the Cell Ontology identifier associated with the |
SingleCellExperiment gene information and metrics
Gene information and metrics calculated from the RNA-seq expression data are stored as a DataFrame in the rowData slot, with the Ensembl ID as the names of the rows.
rowData(sce) # gene metrics
The following columns are included for all genes.
Metrics were calculated using the scuttle::addPerFeatureQCMetrics() function.
| Column name | Contents |
|---|---|
|  | HUGO gene symbol, if defined |
|  | Ensembl gene ID |
|  | Mean count across all cells/droplets |
|  | Percent of cells in which the gene was detected (gene count > 0) |
SingleCellExperiment experiment metadata
Metadata associated with data processing is included in the metadata slot as a list.
metadata(sce) # experiment metadata
Item name |
Contents |
|---|---|
|
Version of |
|
Transcriptome reference file used for mapping |
|
Total number of reads processed by |
|
Number of reads successfully mapped |
|
Pipeline used for mapping and quantification ( |
|
Version of |
|
|
|
|
|
Boolean indicating whether quantification was done using |
|
Number of cells reported by |
|
A string indicating the technology and version used for the single-cell library, such as 10Xv2, 10Xv3, or 10Xv3.1 |
|
A string indicating the Experimental Factor Ontology term ID associated with the |
|
|
|
Types of counts matrices included in the object. |
|
Data frame containing metadata for each sample included in the library (see the |
|
A string indicating the type of sample, with one of the following values: |
|
The model object that |
|
The method used for cell filtering. One of |
|
The minimum UMI count per cell used as a threshold for removing empty droplets. Only present for |
|
The minimum cutoff for the probability of a cell being compromised, as calculated by |
|
Method used by the Data Lab to filter low quality cells prior to normalization. Either |
|
If CITE-seq was performed, the method used by the Data Lab to identify cells to be filtered prior to normalization, based on ADT counts. Either |
|
The minimum cutoff for the number of unique genes detected per cell used to filter cells. Only present for |
|
The method used for normalization of raw RNA counts. Either |
|
If CITE-seq was performed, the method used for normalization of raw ADT counts. Either |
|
A vector of highly variable genes used for dimensionality reduction, determined using |
|
The algorithm used to perform graph-based clustering of cells. Only present for |
|
The weighting approach used during graph-based clustering. Only present for |
|
The nearest neighbor parameter value used for the graph-based clustering. Only present for |
|
If cell type annotation was performed, a vector of the methods used for annotation. May include |
|
If cell type annotations from the OpenScPCA project are available, the original module name from the |
|
If cell type annotations from the OpenScPCA project are available, the version of the |
|
If cell type annotations from the OpenScPCA project are available, the release date for the input ScPCA data used when assigning annotations |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If consensus cell types are present, a vector with the names of the automated methods used to generate consensus cell type annotations |
|
Vector of consensus cell type labels which would be specified for the normal reference cells in |
|
The total number of normal reference cells included in the |
|
The broad diagnosis group used to determine which cell types should be considered as part of the normal cell reference for |
|
String indicating the status of running |
|
A list containing the contents of the |
|
The |
SingleCellExperiment sample metadata
Relevant sample metadata is available as a data frame stored in the metadata(sce)$sample_metadata slot of the SingleCellExperiment object.
Each row in the data frame will correspond to a sample present in the library.
The following columns are included in the sample metadata data frame for all libraries.
Column name |
Contents |
|---|---|
|
Sample ID in the form |
|
Library ID in the form |
|
Unique ID corresponding to the donor from which the sample was obtained |
|
Original sample identifier from submitter |
|
Submitter name/ID |
|
Age provided by submitter |
|
Whether age is the age at diagnosis ( |
|
Sex of patient that the sample was obtained from |
|
Tumor type |
|
Subcategory of diagnosis or mutation status (if applicable) |
|
Where in the body the tumor sample was located |
|
At what stage of disease the sample was obtained, either diagnosis or recurrence |
|
The organism the sample was obtained from (e.g., |
|
Whether the sample is a patient-derived xenograft |
|
Whether the sample was derived from a cell line |
|
|
|
|
|
NCBI taxonomy term for organism, e.g. |
|
For Homo sapiens, a |
|
|
|
|
For some libraries, the sample metadata may also include additional metadata specific to the disease type and experimental design of the project. Examples of this include treatment or outcome.
SingleCellExperiment dimensionality reduction results
In the RDS file containing the processed SingleCellExperiment object only (_processed.rds), the reducedDim slot of the object contains both principal component analysis (PCA) and UMAP results.
For all other objects, the reducedDim slot will be empty as no dimensionality reduction was performed.
PCA results were calculated using scater::runPCA(), using only highly variable genes.
The highly variable genes were selected using scran::modelGeneVar() and scran::getTopHVGs(), and are stored in the SingleCellExperiment object in metadata(sce)$highly_variable_genes.
The following command can be used to access the PCA results:
reducedDim(sce, "PCA")
UMAP results were calculated using scater::runUMAP(), with the PCA results as input rather than the full gene expression matrix.
The following command can be used to access the UMAP results:
reducedDim(sce, "UMAP")
Additional SingleCellExperiment components for multiplexed libraries
Multiplexed libraries will contain a number of additional components and fields.
Hashtag oligo (HTO) quantification for each cell is included within the SingleCellExperiment as an “Alternative Experiment” named "cellhash", which can be accessed with the following command:
altExp(sce, "cellhash") # hto experiment
Within this, the main data matrix is again found in the counts assay, with each column corresponding to a cell or droplet (in the same order as the parent SingleCellExperiment) and each row corresponding to a hashtag oligo (HTO).
Column names are again cell barcode sequences, and row names are the HTO IDs for all assayed HTOs.
The following additional per-cell data columns for the cellhash data can be found in the main colData data frame (accessed with colData(sce) as above).
| Column name | Contents |
|---|---|
|  | UMI count for cellhash HTOs |
|  | Number of HTOs detected per cell (HTO count > 0) |
|  | Percent of |
Metrics for each of the HTOs assayed can be found as a DataFrame stored as rowData within the alternative experiment:
rowData(altExp(sce, "cellhash")) # hto metrics
This data frame contains the following columns with statistics for each HTO:
| Column name | Contents |
|---|---|
|  | Mean HTO count across all cells/droplets |
|  | Percent of cells in which the HTO was detected (HTO count > 0) |
|  | Sample ID for this library that corresponds to the HTO. Only present in |
Note that in the unfiltered SingleCellExperiment objects, this may include hashtag oligos that do not correspond to any included sample, but were part of the reference set used for mapping.
Demultiplexing results
Demultiplexing results are included only in the filtered and processed objects.
The demultiplexing methods applied for these objects are listed in metadata(sce)$demux_methods and are described in the multiplex data processing section.
Demultiplexing analysis adds the following additional fields to the colData(sce) data frame:
Column name |
Contents |
|---|---|
|
Most likely sample as called by |
|
Most likely sample as called by |
|
Most likely sample as called by |
Additional demultiplexing statistics
Each demultiplexing method generates additional statistics specific to the method that you may wish to access, including probabilities, alternative calls, and potential doublet information.
For methods that rely on the HTO data, these statistics are found in the colData(altExp(sce, "cellhash")) data frame;
DropletUtils::hashedDrops() statistics have the prefix hashedDrops_ and Seurat::HTODemux() statistics have the prefix HTODemux.
Genetic demultiplexing statistics are found in the main colData(sce) data frame, with the prefix vireo_.
Components of an AnnData object
Before getting started, we highly encourage you to familiarize yourself with the general AnnData object structure and functions available as part of the AnnData package.
For the most part, the AnnData objects that we provide are formatted to match the expected data format for CELLxGENE following schema version 3.0.0.
To begin, you will need to load the AnnData package and read the H5AD file:
import anndata
adata_object = anndata.read_h5ad("SCPCL000000_processed_rna.h5ad")
AnnData expression counts
The data matrix, X, of the AnnData object for single-cell and single-nuclei experiments contains the primary RNA-seq expression data as integer counts in both the unfiltered (_unfiltered_rna.h5ad) and filtered (_filtered_rna.h5ad) objects.
The data is stored as a sparse matrix, where each row represents a cell or droplet, and each column represents a gene.
The X matrix can be accessed with the following Python code:
adata_object.X # raw count matrix
Cell barcode sequences are stored in obs_names (the row names of X), and Ensembl gene IDs are stored in var_names (the column names of X). These names can be accessed with the following Python code:
adata_object.obs_names # cell barcodes (matrix row names)
adata_object.var_names # Ensembl gene IDs (matrix column names)
In processed objects only (_processed_rna.h5ad), the data matrix X contains the normalized data, while the primary data can be found in raw.X.
The counts in the processed object can be accessed with the following Python code:
adata_object.raw.X # raw count matrix
adata_object.X # normalized count matrix
AnnData cell metrics
Cell metrics calculated from the RNA-seq expression data are stored as a pandas.DataFrame in the .obs slot, with the cell barcodes as the names of the rows.
adata_object.obs # cell metrics
All of the per-cell data columns included in the colData of the SingleCellExperiment objects are present in the .obs slot of the AnnData object.
To see a full description of the included columns, see the section on cell metrics in Components of a SingleCellExperiment object.
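Because .obs is a pandas.DataFrame, per-cell metrics can be used to build boolean masks for subsetting cells. The sketch below uses a toy data frame as a stand-in for adata_object.obs; the column names umi_count and genes_detected and the thresholds are hypothetical placeholders, so substitute the actual column names described in the cell metrics section.

```python
import pandas as pd

# Toy stand-in for adata_object.obs; column names are hypothetical placeholders
toy_obs = pd.DataFrame(
    {
        "umi_count": [5200, 310, 8900],
        "genes_detected": [2100, 150, 3300],
    },
    index=["cell_1", "cell_2", "cell_3"],
)

# Boolean mask of cells meeting both (arbitrary) thresholds
keep = (toy_obs["umi_count"] >= 500) & (toy_obs["genes_detected"] >= 200)
filtered_obs = toy_obs[keep]
print(filtered_obs.index.tolist())

# The same mask can subset a full AnnData object:
# adata_object[keep.to_numpy(), :]
```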
The AnnData object also includes the following additional cell-level metadata columns:
Column name |
Contents |
|---|---|
|
Sample ID in the form |
|
Library ID in the form |
|
Project ID in the form |
|
Unique ID corresponding to the donor from which the sample was obtained |
|
Original sample identifier from submitter |
|
Submitter name/ID |
|
Age provided by submitter |
|
Whether age is the age at diagnosis ( |
|
Sex of patient that the sample was obtained from |
|
Tumor type |
|
Subcategory of diagnosis or mutation status (if applicable) |
|
Where in the body the tumor sample was located |
|
At what stage of disease the sample was obtained, either diagnosis or recurrence |
|
The organism the sample was obtained from (e.g., |
|
Whether the sample is a patient-derived xenograft |
|
Whether the sample was derived from a cell line |
|
|
|
|
|
NCBI taxonomy term for organism, e.g. |
|
For Homo sapiens, a |
|
|
|
|
|
A string indicating the Experimental Factor Ontology term id associated with the technology and version used for the single-cell library, such as 10Xv2, 10Xv3, or 10Xv3.1 |
|
|
|
Set to |
AnnData gene information and metrics
Gene information and metrics calculated from the RNA-seq expression data are stored as a pandas.DataFrame in the .var slot, with the Ensembl ID as the names of the rows.
adata_object.var # gene metrics
All of the per-gene data columns included in the rowData of the SingleCellExperiment objects are present in the .var slot of the AnnData object.
To see a full description of the included columns, see the section on gene metrics in Components of a SingleCellExperiment object.
The AnnData object also includes the following additional gene-level metadata columns:
Column name |
Contents |
|---|---|
|
Boolean indicating if the gene or feature is filtered out in the normalized matrix but is present in the raw matrix |
|
Boolean indicating if the gene or feature is found in the highly variable gene list determined using |
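Boolean columns in .var can be used directly to select genes. The following sketch uses a toy data frame in place of adata_object.var; the column name is_hvg is a hypothetical placeholder for the highly variable gene indicator column, not the actual column name.

```python
import pandas as pd

# Toy stand-in for adata_object.var; "is_hvg" is a hypothetical column name
toy_var = pd.DataFrame(
    {"is_hvg": [True, False, True]},
    index=["gene_1", "gene_2", "gene_3"],
)

# Gene IDs flagged by the boolean column
hvg_ids = toy_var.index[toy_var["is_hvg"].to_numpy()].tolist()
print(hvg_ids)

# The same mask can subset the genes of a full AnnData object:
# adata_object[:, toy_var["is_hvg"].to_numpy()]
```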
AnnData experiment metadata
Metadata associated with data processing is included in the .uns slot as a dictionary.
adata_object.uns # experiment metadata
All of the object metadata included in SingleCellExperiment objects are present in the .uns slot of the AnnData object.
To see a full description of the included items, see the section on experiment metadata in Components of a SingleCellExperiment object.
There are two exceptions to this:
- The AnnData object does not contain the sample_metadata item in the .uns slot. Instead, the contents of the sample_metadata data frame are stored in the cell-level metadata (.obs).
- The AnnData object does not contain any metadata fields whose type could not be automatically converted to a Python object. This includes any list type fields present in the SingleCellExperiment metadata.
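Since the sample metadata is repeated for every cell in .obs, a per-sample table can be recovered by selecting the sample-level columns and dropping duplicate rows. This is a sketch on a toy data frame standing in for adata_object.obs; the column names sample_id, diagnosis, and umi_count and all values are hypothetical placeholders.

```python
import pandas as pd

# Toy stand-in for adata_object.obs; column names and values are hypothetical
toy_obs = pd.DataFrame(
    {
        "sample_id": ["sample_A", "sample_A", "sample_A"],
        "diagnosis": ["example diagnosis"] * 3,
        "umi_count": [5200, 310, 8900],  # per-cell metric, varies by cell
    },
    index=["cell_1", "cell_2", "cell_3"],
)

# Keep only the sample-level columns, then collapse to one row per sample
sample_columns = ["sample_id", "diagnosis"]
sample_table = toy_obs[sample_columns].drop_duplicates().reset_index(drop=True)
print(sample_table)
```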
The AnnData object also includes the following additional items in the .uns slot:
Item name |
Contents |
|---|---|
|
CZI schema version used for |
|
A dictionary object containing the parameters and variance weights associated with the PCA matrix found in |
AnnData dimensionality reduction results
The H5AD file containing the processed AnnData object (_processed_rna.h5ad) contains an .obsm slot with both principal component analysis (X_pca) and UMAP (X_umap) results, each stored as a numpy.ndarray.
For all other H5AD files, the .obsm slot will be empty as no dimensionality reduction was performed.
For information on how PCA and UMAP results were calculated see the section on processed gene expression data.
The following commands can be used to access the PCA and UMAP results:
adata_object.obsm["X_pca"] # pca results
adata_object.obsm["X_umap"] # umap results
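The arrays in .obsm have one row per cell, in the same order as obs_names, so they can be labeled with cell barcodes for plotting or for joining with .obs. The sketch below uses toy values in place of adata_object.obsm["X_umap"] and adata_object.obs_names; the coordinates, barcodes, and the column labels UMAP1 and UMAP2 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for adata_object.obsm["X_umap"] and adata_object.obs_names
toy_umap = np.array([[0.5, -1.2], [2.1, 0.3], [-0.7, 1.8]])
toy_barcodes = pd.Index(["cell_1", "cell_2", "cell_3"])

# Label each row with its cell barcode; column labels are arbitrary
umap_df = pd.DataFrame(toy_umap, index=toy_barcodes, columns=["UMAP1", "UMAP2"])
print(umap_df)

# With a real object, umap_df.join(adata_object.obs) would attach cell metrics
```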