Merged objects
Each merged object combines information from all samples in a given ScPCA project. Whereas each individual object, as described on the Single-cell gene expression file contents page, contains quantified gene expression results for a single library, a merged object contains the results for all libraries and samples in the project: quantified gene expression data, cell and gene metrics, and associated metadata. See the section on merged object processing for more information on how these objects were prepared.
Merged objects are provided in two formats:
As an RDS file containing a SingleCellExperiment object for use in R.
As an H5AD file containing an AnnData object for use in Python.
Below we present some details about the specific contents of the objects we provide.
Components of a SingleCellExperiment merged object
To begin, you will need to load the SingleCellExperiment package and read the RDS file:
library(SingleCellExperiment)
merged_sce <- readRDS("SCPCP000000_merged.rds")
SingleCellExperiment expression counts
Merged SingleCellExperiment objects contain two main assays, counts and logcounts, each containing RNA-seq expression data for all libraries in the given ScPCA project combined into a single matrix.
The counts assay contains the primary raw counts represented as integers, and the logcounts assay contains normalized counts as described in the data post-processing section.
The counts assay includes reads aligned to both spliced and unspliced cDNA (see the section on Post Alevin-fry processing).
Each assay is stored as a sparse matrix, where each column represents a cell or droplet, and each row represents a gene.
The counts and logcounts assays can be accessed with the following R code:
counts(merged_sce) # combined counts matrix
logcounts(merged_sce) # combined logcounts matrix
Column names are cell barcode sequences prefixed with the originating library ID, e.g. SCPCL000000-{barcode}, and row names are Ensembl gene IDs.
These names can be accessed with the following R code:
colnames(merged_sce) # matrix column names
rownames(merged_sce) # matrix row names
There is also a spliced assay which contains the counts matrix with only reads from spliced cDNA.
SingleCellExperiment cell metrics
Cell metrics calculated from the RNA-seq expression data are stored as a DataFrame in the colData slot, where row names are the cell barcode prefixed with the originating library ID, e.g. SCPCL000000-{barcode}.
For projects that do not contain multiplexed libraries, this DataFrame also contains sample metadata.
Read more about the included sample metadata in the Sample metadata section.
colData(merged_sce) # cell metrics
The following per-cell data columns are included for each cell.
Columns representing quality control statistics were calculated using the scuttle::addPerCellQCMetrics() function.
Column name |
Contents |
|---|---|
|
The cell barcode |
|
Library ID in the form |
|
Unique ID for each cell in the format |
|
UMI count for RNA-seq data |
|
Number of genes detected (gene count > 0 ) |
|
UMI count of mitochondrial genes |
|
Number of mitochondrial genes detected |
|
Percent of all UMI counts assigned to mitochondrial genes |
|
Total UMI count for RNA-seq data and any alternative experiments (i.e., ADT data from CITE-seq) |
|
Probability that a cell is compromised (i.e., dead or damaged), as calculated by |
|
Indicates whether the cell passed the default miQC filtering. |
|
Labels cells as either |
|
If CITE-seq was performed, labels cells as either |
|
The |
|
The |
|
If available, cell type annotations obtained from the group that submitted the original data. Cells that the submitter did not annotate are labeled as |
|
If available, cell type annotations obtained from the OpenScPCA project as determined by analysis performed in the |
|
If available, the Cell Ontology identifier associated with the |
Unlike for individual SCE objects, cluster assignments are not included in the colData.
Further, if cell type annotation was performed on at least one library included in the merged object, there will be additional colData columns with these annotation results, as described in the cell type annotation processing section.
Column name |
Contents |
|---|---|
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
The assigned consensus cell type annotation as determined by finding the latest common ancestor among |
|
The assigned consensus cell type ontology ID as determined by finding the latest common ancestor among |
|
Whether the cell would be considered part of the reference normal cells for |
|
If CNV inference was performed, the total number of CNV events in the cell calculated by |
SingleCellExperiment gene information and metrics
Gene information and metrics calculated from the RNA-seq expression data are stored as a DataFrame in the rowData slot, with the Ensembl ID as the names of the rows.
rowData(merged_sce) # gene metrics
The following columns are included for all genes.
The columns mean and detected will appear for each library ID included in the merged object, named as shown in the table below.
However, there will only be a single gene_symbol and gene_ids column, as this information equally pertains to all libraries.
Metrics were calculated for each library using the scuttle::addPerFeatureQCMetrics function.
Column name |
Contents |
|---|---|
|
Ensembl gene ID |
|
HUGO gene symbol, if defined |
|
Mean count across all cells/droplets for library |
|
Percent of cells in which the gene was detected (gene count > 0 ) for library |
SingleCellExperiment experiment metadata
metadata(merged_sce) # experiment metadata
Item name |
Contents |
|---|---|
|
A vector of library IDs which are included in the merged object, each in the form |
|
A vector of sample IDs which are included in the merged object, each in the form |
|
A list of the library metadata for each library. Each list is named with the appropriate library ID and contains the library metadata fields for the given library as they would appear in an individual library object. See the table below for a full description of its contents |
|
A data frame of additional sample metadata, as described in the |
|
A vector of highly variable genes used for performing dimensionality reduction on the merged object, determined using |
To access the library_metadata field for a specific library, use the following code:
# Access individual library metadata for SCPCL000000
metadata(merged_sce)$library_metadata$SCPCL000000
Each such list will contain the following fields:
Item name |
Contents |
|---|---|
|
Sample ID in the form |
|
Library ID in the form |
|
Project ID in the form |
|
Version of |
|
Transcriptome reference file used for mapping |
|
Total number of reads processed by |
|
Number of reads successfully mapped |
|
Pipeline used for mapping and quantification ( |
|
Version of |
|
|
|
|
|
Boolean indicating whether quantification was done using |
|
Number of cells reported by |
|
A string indicating the technology and version used for the single-cell library, such as 10Xv2, 10Xv3, or 10Xv3.1 |
|
A string indicating the Experimental Factor Ontology term ID associated with the |
|
|
|
Transcripts included in gene counts: |
|
A string indicating the type of sample, with one of the following values: |
|
The method used for cell filtering. One of |
|
The minimum UMI count per cell used as a threshold for removing empty droplets. Only present for objects where the |
|
The minimum cutoff for the probability of a cell being compromised, as calculated by |
|
Method used by the Data Lab to filter low quality cells prior to normalization. Either |
|
If CITE-seq was performed, the method used by the Data Lab to identify cells to be filtered prior to normalization, based on ADT counts. Either |
|
The minimum cutoff for the number of unique genes detected per cell |
|
The method used for normalization of raw RNA counts. Either |
|
If CITE-seq was performed, the method used for normalization of raw ADT counts. Either |
|
A list of highly variable genes used for dimensionality reduction, determined using |
|
If cell type annotation was performed, a vector of the methods used for annotation. May include |
|
If cell type annotations from the OpenScPCA project are available, the original module name from the |
|
If cell type annotations from the OpenScPCA project are available, the version of the |
|
If cell type annotations from the OpenScPCA project are available, the release date for the input ScPCA data used when assigning annotations |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If consensus cell types are present, a vector with the names of the automated methods used to generate consensus cell type annotations |
|
Vector of consensus cell type labels which would be specified for the normal reference cells in |
|
The total number of normal reference cells included in the |
|
The broad diagnosis group used to determine which cell types should be considered as part of the normal cell reference for |
|
String indicating the status of running |
|
A list containing the contents of the |
|
The |
Unlike for individual SingleCellExperiment objects, cluster algorithm parameters are not included in these metadata lists because clusters themselves are not included in the merged object.
SingleCellExperiment sample metadata
Sample metadata describing each sample included in the merged object is stored in one of two locations, depending on the sample type:
If the project does not contain multiplexed libraries, this information can be found as additional columns in the colData slot’s DataFrame, along with cell metrics.
colData(merged_sce) # sample metadata only for projects without multiplexing
If the project contains multiplexed libraries, this information is stored in the metadata slot in the sample_metadata field as a data.frame.
Similar to merged objects with multiplexed libraries, all individual library objects will contain this sample metadata in the SingleCellExperiment object’s metadata slot.
metadata(merged_sce)$sample_metadata # sample metadata only for projects with multiplexed samples
Column name |
Contents |
|---|---|
|
Sample ID in the form |
|
Project ID in the form |
|
A string indicating the technology and version used for the sample’s single-cell library, such as 10Xv2, 10Xv3, or 10Xv3.1 |
|
A string indicating the Experimental Factor Ontology term ID associated with the |
|
|
|
Any additional modalities associated with the library, represented as alternative experiment names such as |
|
Unique ID corresponding to the donor from which the sample was obtained |
|
Original sample identifier from submitter |
|
Submitter name/ID |
|
Age provided by submitter |
|
Whether age is the age at diagnosis ( |
|
Sex of patient that the sample was obtained from |
|
Tumor type |
|
Subcategory of diagnosis or mutation status (if applicable) |
|
Where in the body the tumor sample was located |
|
At what stage of disease the sample was obtained, either diagnosis or recurrence |
|
The organism the sample was obtained from (e.g., |
|
Whether the sample is a patient-derived xenograft |
|
Whether the sample was derived from a cell line |
|
|
|
|
|
NCBI taxonomy term for organism, e.g. |
|
For Homo sapiens, a |
|
|
|
|
Some merged objects may have additional sample metadata columns specific to the given ScPCA project’s disease type and experimental design, such as treatment or outcome.
SingleCellExperiment dimensionality reduction results
The reducedDim slot of the merged object will contain both principal component analysis (PCA) and UMAP results.
PCA results were calculated using batchelor::multiBatchPCA, specifying libraries as batches to ensure that each library in the merged object was equally weighted, and specifying a list of highly variable genes.
The highly variable genes were selected in a library-aware manner with scran::modelGeneVar and scran::getTopHVGs.
The vector of highly variable genes is stored in the SingleCellExperiment object in metadata(merged_sce)$merged_highly_variable_genes.
The following command can be used to access the PCA results:
reducedDim(merged_sce, "PCA")
UMAP results were calculated using scater::runUMAP(), with the batch-aware PCA results as input rather than the full gene expression matrix.
The following command can be used to access the UMAP results:
reducedDim(merged_sce, "UMAP")
Additional SingleCellExperiment components for multiplexed libraries
Merged objects are not available for any projects that contain multiplexed libraries. This is because there is no guarantee that a unique HTO was used for each sample in a given project, so it would not necessarily be possible to determine which HTO corresponds to which sample in a merged object.
Components of an AnnData merged object
Before getting started, we highly encourage you to familiarize yourself with the general AnnData object structure and functions available as part of the AnnData package.
For the most part, the AnnData objects that we provide are formatted to match the expected data format for CELLxGENE following schema version 3.0.0.
To begin, you will need to load the AnnData package and read the H5AD file:
import anndata
merged_adata_object = anndata.read_h5ad("SCPCP000000_merged_rna.h5ad")
AnnData expression counts
Merged AnnData objects contain two data matrices, each containing RNA-seq expression data for all libraries in the given ScPCA project combined into a single matrix.
The data matrix raw.X of the merged AnnData object contains the RNA-seq expression data as primary integer counts, and the data matrix X contains the RNA-seq expression data as normalized counts.
The data is stored as a sparse matrix. Following the AnnData convention, each row represents a cell or droplet, and each column represents a gene.
The raw.X and X matrices can be accessed with the following Python code:
merged_adata_object.raw.X # raw count matrix
merged_adata_object.X # normalized count matrix
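One quick way to tell the two matrices apart is to check whether the stored values are whole numbers. The sketch below is only an illustration: it uses small toy SciPy sparse matrices in place of raw.X and X, and the helper name `holds_integer_counts` is hypothetical rather than part of any package.

```python
import numpy as np
from scipy import sparse

# Toy sparse matrices standing in for raw.X (counts) and X (normalized values)
raw_counts = sparse.csr_matrix(np.array([[0, 3, 1], [2, 0, 0]], dtype=np.float32))
normalized = sparse.csr_matrix(np.array([[0.0, 1.58, 0.69], [1.1, 0.0, 0.0]]))


def holds_integer_counts(matrix) -> bool:
    """Return True if every stored value is a non-negative whole number."""
    values = matrix.data  # only the explicitly stored (non-zero) entries
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


print(holds_integer_counts(raw_counts))
print(holds_integer_counts(normalized))
```

With a loaded object, the same helper could be applied to `merged_adata_object.raw.X`, assuming that matrix is held as a SciPy sparse matrix.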
Row names are cell barcode sequences prefixed with the originating library ID, e.g. SCPCL000000-{barcode}, and column names are Ensembl gene IDs.
These names can be accessed with the following Python code:
merged_adata_object.obs_names # matrix row names (cells)
merged_adata_object.var_names # matrix column names (genes)
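Because each cell name carries its library ID as a prefix, the two parts can be separated again when needed. The sketch below is a minimal illustration that assumes the SCPCL000000-{barcode} pattern and that library IDs never contain a hyphen. The helper name `split_cell_id` and the example barcodes are hypothetical.

```python
from collections import Counter


def split_cell_id(cell_id: str) -> tuple[str, str]:
    """Split a merged-object cell name into (library_id, barcode).

    Assumes the name looks like "SCPCL000000-ACGT..." and that the
    library ID itself contains no hyphen, so we split on the first one.
    """
    library_id, sep, barcode = cell_id.partition("-")
    if not sep or not library_id or not barcode:
        raise ValueError(f"unexpected cell name format: {cell_id!r}")
    return library_id, barcode


# Hypothetical cell names standing in for merged_adata_object.obs_names
example_names = [
    "SCPCL000001-AAACCCAAGAAACACT",
    "SCPCL000001-AAACCCAAGAAACCAT",
    "SCPCL000002-AAACCCAAGAAACACT",
]

# Count how many cells each library contributed
cells_per_library = Counter(split_cell_id(name)[0] for name in example_names)
print(cells_per_library)
```

Note that the same barcode sequence appears in two of the toy libraries; the library ID prefix is what keeps the cell names unique within the merged object.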
AnnData cell metrics
Cell metrics calculated from the RNA-seq expression data, which were calculated separately for each library, are stored as a pandas.DataFrame in the .obs slot.
The slot’s row names are cell barcode sequences prefixed with the originating library ID, e.g. SCPCL000000-{barcode}.
merged_adata_object.obs # cell metrics and metadata
All of the per-cell data columns included in the colData of the SingleCellExperiment merged objects are present in the .obs slot of the AnnData object.
To see a full description of the included columns, see the sections on cell metrics and sample metadata in Components of a SingleCellExperiment merged object.
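Since the .obs slot is a pandas.DataFrame covering every library, per-cell metrics can be summarized library by library. The sketch below uses a toy DataFrame in place of `merged_adata_object.obs`. The column names `sum` and `detected` are assumptions based on the usual output of scuttle's per-cell QC metrics, so check the column names in your own object before relying on them.

```python
import pandas as pd

# Toy stand-in for merged_adata_object.obs; the index follows the
# documented "{library ID}-{barcode}" pattern. The "sum" and "detected"
# column names are assumptions based on scuttle's QC metric output.
obs = pd.DataFrame(
    {"sum": [1200, 800, 1500, 300], "detected": [600, 450, 700, 150]},
    index=[
        "SCPCL000001-AAACCCAAGAAACACT",
        "SCPCL000001-AAACCCAAGAAACCAT",
        "SCPCL000002-AAACCCAAGAAACACT",
        "SCPCL000002-AAACCCAAGAAACTGT",
    ],
)

# Recover the library ID from the prefix of each row name
library = obs.index.str.split("-", n=1).str[0]

# Summarize the per-cell metrics within each library
per_library = obs.groupby(library).median()
print(per_library)
```

Grouping on the row-name prefix avoids depending on any particular library ID column being present.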
AnnData gene information and metrics
Gene information and metrics from the RNA-seq expression data, which were calculated separately for each library, are stored as a pandas.DataFrame in the .var slot, with the Ensembl ID as the names of the rows.
merged_adata_object.var # gene metrics
All of the per-gene data columns included in the rowData of the SingleCellExperiment objects are present in the .var slot of the AnnData object.
Note that the SingleCellExperiment columns named SCPCL000000-mean and SCPCL000000-detected are instead named SCPCL000000.mean and SCPCL000000.detected, respectively, in the merged AnnData object.
To see a full description of the included columns, see the section on gene metrics in Components of a SingleCellExperiment merged object.
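The naming difference between the two formats can be handled programmatically when moving between them. The following is a rough sketch, assuming only the `-mean`/`-detected` to `.mean`/`.detected` renaming described above. Both helper names are hypothetical.

```python
def sce_to_anndata_column(sce_name: str) -> str:
    """Convert a per-library rowData column name to its .var equivalent.

    Only names ending in "-mean" or "-detected" are changed; shared
    columns are returned unchanged.
    """
    for metric in ("mean", "detected"):
        suffix = f"-{metric}"
        if sce_name.endswith(suffix):
            return f"{sce_name[: -len(suffix)]}.{metric}"
    return sce_name


def parse_anndata_column(var_name: str):
    """Return (library_id, metric) for a per-library .var column, else None."""
    library_id, sep, metric = var_name.rpartition(".")
    if sep and library_id and metric in ("mean", "detected"):
        return library_id, metric
    return None


sce_columns = ["gene_ids", "gene_symbol", "SCPCL000000-mean", "SCPCL000000-detected"]
anndata_columns = [sce_to_anndata_column(name) for name in sce_columns]
print(anndata_columns)
```

`parse_anndata_column` could then be applied to each name in `merged_adata_object.var.columns` to find which columns belong to which library.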
AnnData experiment metadata
A partial set of the metadata associated with data processing is included in the .uns slot of the AnnData object, which behaves like a dictionary.
merged_adata_object.uns # experiment metadata
The following items are available in the .uns slot:
Item name |
Contents |
|---|---|
|
A list of library IDs which are included in the merged object, each in the form |
|
A list of sample IDs which are included in the merged object, each in the form |
|
A list of highly variable genes used for performing dimensionality reduction on the merged object, determined using |
Additional experiment metadata is available in the metadata TSV file included in the ScPCA Portal download folder.
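As one possible use of these items, the stored highly variable genes can be turned into a mask over the genes in the object. The sketch below uses a plain dictionary and list as stand-ins for `merged_adata_object.uns` and `merged_adata_object.var_names`. The key name `merged_highly_variable_genes` is an assumption borrowed from the SingleCellExperiment metadata field of the same name, so confirm it against the keys in your own object.

```python
# Toy stand-ins for merged_adata_object.uns and merged_adata_object.var_names.
# The "merged_highly_variable_genes" key is an assumption borrowed from the
# SingleCellExperiment metadata field of the same name.
uns = {"merged_highly_variable_genes": ["ENSG00000000003", "ENSG00000000419"]}
var_names = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"]

# A set makes membership checks fast for long gene lists
hvg_set = set(uns["merged_highly_variable_genes"])

# True for genes that were used for dimensionality reduction
hvg_mask = [gene in hvg_set for gene in var_names]
print(hvg_mask)
```

A mask like this, converted to a NumPy boolean array, could be used to index the gene axis of the object, for example `merged_adata_object[:, mask]`.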
AnnData dimensionality reduction results
The merged AnnData object contains a slot .obsm with both principal component analysis (X_pca) and UMAP (X_umap) results.
For information on how PCA and UMAP results were calculated see the section on processed gene expression data.
The following command can be used to access the PCA and UMAP results:
merged_adata_object.obsm["X_pca"] # pca results
merged_adata_object.obsm["X_umap"] # umap results