Merged objects
Each merged object contains combined information for all individual samples in the given ScPCA project. Whereas each individual object, as described on the Single-cell gene expression file contents page, contains quantified gene expression results for a single library, a merged object combines the results for all libraries and samples in the project: quantified gene expression data, cell and gene metrics, and associated metadata. See the section on merged object processing for more information on how these objects were prepared.
Merged objects are provided in two formats:
As an RDS file containing a SingleCellExperiment object for use in R.
As an H5AD file containing an AnnData object for use in Python.
Below we present some details about the specific contents of the objects we provide.
Components of a SingleCellExperiment merged object
To begin, you will need to load the SingleCellExperiment package and read the RDS file:
library(SingleCellExperiment)
merged_sce <- readRDS("SCPCP000000_merged.rds")
SingleCellExperiment expression counts
Merged SingleCellExperiment objects contain two main assays, counts and logcounts, each containing RNA-seq expression data for all libraries in the given ScPCA project combined into a single matrix.
The counts assay contains the primary raw counts represented as integers, and the logcounts assay contains normalized counts as described in the data post-processing section.
The counts assay includes reads aligned to both spliced and unspliced cDNA (see the section on Post Alevin-fry processing).
Each assay is stored as a sparse matrix, where each column represents a cell or droplet, and each row represents a gene.
The counts and logcounts assays can be accessed with the following R code:
counts(merged_sce) # combined counts matrix
logcounts(merged_sce) # combined logcounts matrix
Column names are cell barcode sequences prefixed with the originating library ID, e.g. SCPCL000000-{barcode}, and row names are Ensembl gene IDs.
These names can be accessed with the following R code:
colnames(merged_sce) # matrix column names
rownames(merged_sce) # matrix row names
There is also a spliced assay which contains the counts matrix with only reads from spliced cDNA.
SingleCellExperiment cell metrics
Cell metrics calculated from the RNA-seq expression data are stored as a DataFrame in the colData slot, where row names are the cell barcode prefixed with the originating library ID, e.g. SCPCL000000-{barcode}.
For all projects that do not contain multiplexed libraries, this DataFrame also contains additional sample metadata.
Read more about the included sample metadata in the Sample metadata section.
colData(merged_sce) # cell metrics
The following per-cell data columns are included for each cell.
Columns representing quality control statistics were calculated using the scuttle::addPerCellQCMetrics() function.
Column name |
Contents |
---|---|
|
The cell barcode |
|
Library ID in the form |
|
Unique ID for each cell in the format |
|
UMI count for RNA-seq data |
|
Number of genes detected (gene count > 0 ) |
|
UMI count of mitochondrial genes |
|
Number of mitochondrial genes detected |
|
Percent of all UMI counts assigned to mitochondrial genes |
|
Total UMI count for RNA-seq data and any alternative experiments (i.e., ADT data from CITE-seq) |
|
Probability that a cell is compromised (i.e., dead or damaged), as calculated by |
|
Indicates whether the cell passed the default miQC filtering. |
|
Labels cells as either |
|
If CITE-seq was performed, labels cells as either |
Unlike for individual SCE objects, cluster assignments are not included in the colData.
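The quality control columns described in the table above follow simple definitions, which the following minimal Python sketch illustrates for a single cell. It is not the scuttle implementation: the function name, the dictionary keys, and the gene identifiers are hypothetical and chosen only for illustration, and the set of mitochondrial genes is assumed to be supplied by the caller.

```python
def per_cell_qc(gene_counts, mito_genes):
    """Summarize one cell's UMI counts.

    gene_counts: dict mapping gene ID -> UMI count for a single cell.
    mito_genes: set of gene IDs to treat as mitochondrial (assumed input).
    The returned keys are descriptive labels, not the object's column names.
    """
    total_umi = sum(gene_counts.values())
    genes_detected = sum(1 for count in gene_counts.values() if count > 0)

    # Restrict to the mitochondrial subset before summarizing it.
    mito_counts = [c for gene, c in gene_counts.items() if gene in mito_genes]
    mito_umi = sum(mito_counts)
    mito_detected = sum(1 for c in mito_counts if c > 0)

    # Percent of all UMI counts assigned to mitochondrial genes.
    mito_percent = 100 * mito_umi / total_umi if total_umi > 0 else float("nan")

    return {
        "total_umi": total_umi,
        "genes_detected": genes_detected,
        "mito_umi": mito_umi,
        "mito_detected": mito_detected,
        "mito_percent": mito_percent,
    }


# Placeholder gene IDs; real objects use Ensembl gene IDs.
example_cell = {"GENE_A": 6, "GENE_B": 0, "MT_GENE_1": 3, "MT_GENE_2": 1}
example_qc = per_cell_qc(example_cell, mito_genes={"MT_GENE_1", "MT_GENE_2"})
print(example_qc)
```

In this toy cell, 4 of the 10 UMIs come from the two mitochondrial placeholders, so the mitochondrial percentage is 40.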
SingleCellExperiment gene information and metrics
Gene information and metrics calculated from the RNA-seq expression data are stored as a DataFrame in the rowData slot, with the Ensembl ID as the names of the rows.
rowData(merged_sce) # gene metrics
The following columns are included for all genes.
The columns mean and detected will appear for each library ID included in the merged object, named as shown in the table below.
However, there will only be a single gene_symbol and gene_ids column, as this information equally pertains to all libraries.
Metrics were calculated for each library using the scuttle::addPerFeatureQCMetrics function.
Column name | Contents |
---|---|
gene_ids | Ensembl gene ID |
gene_symbol | HUGO gene symbol, if defined |
SCPCL000000-mean | Mean count across all cells/droplets for library SCPCL000000 |
SCPCL000000-detected | Percent of cells in which the gene was detected (gene count > 0) for library SCPCL000000 |
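As a rough illustration of how the per-library gene columns relate to the underlying counts, the sketch below computes the mean count and the percent of cells with a non-zero count for one gene, separately for each library. This is a simplified stand-in written in plain Python, not the scuttle code; the function names and the example library IDs and counts are hypothetical.

```python
def per_gene_metrics(counts_per_cell):
    """Return (mean count, percent of cells with count > 0) for one gene.

    counts_per_cell: list with one count per cell/droplet in a single library.
    """
    n_cells = len(counts_per_cell)
    if n_cells == 0:
        raise ValueError("a library must contain at least one cell")
    mean_count = sum(counts_per_cell) / n_cells
    detected_percent = 100 * sum(1 for c in counts_per_cell if c > 0) / n_cells
    return mean_count, detected_percent


def merged_gene_columns(counts_by_library):
    """Build per-library columns for one gene, keyed as {library_id}-mean
    and {library_id}-detected."""
    columns = {}
    for library_id, counts in counts_by_library.items():
        mean_count, detected_percent = per_gene_metrics(counts)
        columns[f"{library_id}-mean"] = mean_count
        columns[f"{library_id}-detected"] = detected_percent
    return columns


# Made-up counts for one gene in two placeholder libraries.
example_columns = merged_gene_columns(
    {"SCPCL000001": [0, 2, 4, 0], "SCPCL000002": [1, 1]}
)
print(example_columns)
```

Because the metrics are computed per library, a gene can have quite different values in each pair of columns within the same merged object.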
SingleCellExperiment experiment metadata
metadata(merged_sce) # experiment metadata
Item name |
Contents |
---|---|
|
A vector of library IDs which are included in the merged object, each in the form |
|
A vector of sample IDs which are included in the merged object, each in the form |
|
A list of the library metadata for each library. Each list is named with the appropriate library ID and contains the library metadata fields for the given library as they would appear in an individual library object. See the table below for a full description of its contents |
|
A data frame of additional sample metadata, as described in the |
|
A vector of highly variable genes used for performing dimensionality reduction on the merged object, determined using |
To access the library_metadata field for a specific library, use the following code:
# Access individual library metadata for SCPCL000000
metadata(merged_sce)$library_metadata$SCPCL000000
Each such list will contain the following fields:
Item name |
Contents |
---|---|
|
Sample ID in the form |
|
Library ID in the form |
|
Project ID in the form |
|
Version of |
|
Transcriptome reference file used for mapping |
|
Total number of reads processed by |
|
Number of reads successfully mapped |
|
Pipeline used for mapping and quantification ( |
|
Version of |
|
|
|
|
|
Boolean indicating whether quantification was done using |
|
Number of cells reported by |
|
A string indicating the technology and version used for the single-cell library, such as 10Xv2, 10Xv3, or 10Xv3.1 |
|
A string indicating the Experimental Factor Ontology term ID associated with the |
|
|
|
Transcripts included in gene counts: |
|
A string indicating the type of sample, with one of the following values: |
|
The method used for cell filtering. One of |
|
The minimum UMI count per cell used as a threshold for removing empty droplets. Only present for objects where the |
|
The minimum cutoff for the probability of a cell being compromised, as calculated by |
|
Method used by the Data Lab to filter low quality cells prior to normalization. Either |
|
If CITE-seq was performed, the method used by the Data Lab to identify cells to be filtered prior to normalization, based on ADT counts. Either |
|
The minimum cutoff for the number of unique genes detected per cell |
|
The method used for normalization of raw RNA counts. Either |
|
If CITE-seq was performed, the method used for normalization of raw ADT counts. Either |
|
A list of highly variable genes used for dimensionality reduction, determined using |
|
If cell type annotation was performed, a vector of the methods used for annotation. May include |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
Unlike for individual SingleCellExperiment objects, cluster algorithm parameters are not included in these metadata lists because clusters themselves are not included in the merged object.
SingleCellExperiment sample metadata
Sample metadata describing each sample included in the merged object is stored in one of two locations, depending on the sample type:
If the project does not contain multiplexed libraries, this information can be found as additional columns in the colData slot’s DataFrame, along with cell metrics.
colData(merged_sce) # sample metadata only for projects without multiplexing
If the project contains multiplexed libraries, this information is stored in the metadata slot in the sample_metadata field as a data.frame.
Similar to merged objects with multiplexed libraries, all individual library objects will contain this sample metadata in the SingleCellExperiment object’s metadata slot.
metadata(merged_sce)$sample_metadata # sample metadata only for projects with multiplexed samples
Column name |
Contents |
---|---|
|
Sample ID in the form |
|
Project ID in the form |
|
A string indicating the technology and version used for the sample’s single-cell library, such as 10Xv2, 10Xv3, or 10Xv3.1 |
|
A string indicating the Experimental Factor Ontology term ID associated with the |
|
|
|
Any additional modalities associated with the library, represented as alternative experiment names such as |
|
Unique ID corresponding to the donor from which the sample was obtained |
|
Original sample identifier from submitter |
|
Submitter name/ID |
|
Age at time sample was obtained |
|
Sex of patient that the sample was obtained from |
|
Tumor type |
|
Subcategory of diagnosis or mutation status (if applicable) |
|
Where in the body the tumor sample was located |
|
At what stage of disease the sample was obtained, either diagnosis or recurrence |
|
The organism the sample was obtained from (e.g., |
|
Whether the sample is a patient-derived xenograft |
|
Whether the sample was derived from a cell line |
|
|
|
|
|
NCBI taxonomy term for organism, e.g. |
|
For Homo sapiens, a |
|
|
|
|
|
If available, cell type annotations obtained from the group that submitted the original data. Cells that the submitter did not annotate are labeled as |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
|
If cell typing with |
Some merged objects may have some additional sample metadata columns specific to the given ScPCA project’s disease type and experimental design. Examples of this include treatment or outcome.
SingleCellExperiment dimensionality reduction results
The reducedDim slot of the merged object will contain both principal component analysis (PCA) and UMAP results.
PCA results were calculated using batchelor::multiBatchPCA, specifying libraries as batches to ensure that each library in the merged object was equally weighted, and specifying a list of highly variable genes.
The highly variable genes were selected in a library-aware manner with scran::modelGeneVar and scran::getTopHVGs.
The vector of highly variable genes is stored in the SingleCellExperiment object in metadata(merged_sce)$merged_highly_variable_genes.
The following command can be used to access the PCA results:
reducedDim(merged_sce, "PCA")
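To give some intuition for what weighting each library equally means, the following Python sketch compares a pooled mean with a library-balanced mean for a single gene. It is a conceptual illustration only and does not reproduce the batchelor implementation: it covers just the centering step, the function names are hypothetical, and the library IDs and values are made up.

```python
def pooled_mean(batches):
    """Mean over all cells, so libraries with more cells dominate."""
    values = [v for batch in batches.values() for v in batch]
    return sum(values) / len(values)


def batch_balanced_weights(batches):
    """Per-cell weight for each library such that every library carries the
    same total weight (1 / number of libraries), whatever its cell count."""
    n_batches = len(batches)
    return {lib: 1.0 / (n_batches * len(vals)) for lib, vals in batches.items()}


def batch_balanced_mean(batches):
    """Weighted mean in which each library contributes equally.
    This equals the average of the per-library means."""
    weights = batch_balanced_weights(batches)
    return sum(weights[lib] * v for lib, vals in batches.items() for v in vals)


# One gene's normalized expression in two placeholder libraries of unequal size.
example_batches = {
    "SCPCL000001": [1.0, 1.0, 1.0, 1.0],  # 4 cells
    "SCPCL000002": [5.0],                 # 1 cell
}
example_pooled = pooled_mean(example_batches)
example_balanced = batch_balanced_mean(example_batches)
print(example_pooled, example_balanced)
```

Here the pooled mean is pulled toward the larger library, while the balanced mean sits halfway between the two per-library means, which is the sense in which a small library is not overwhelmed by a large one.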
UMAP results were calculated using scater::runUMAP(), with the batch-aware PCA results as input rather than the full gene expression matrix.
The following command can be used to access the UMAP results:
reducedDim(merged_sce, "UMAP")
Additional SingleCellExperiment components for multiplexed libraries
Merged objects are not available for any projects that contain multiplexed libraries. This is because there is no guarantee that a unique HTO was used for each sample in a given project, so it would not necessarily be possible to determine which HTO corresponds to which sample in a merged object.
Components of an AnnData merged object
Before getting started, we highly encourage you to familiarize yourself with the general AnnData object structure and functions available as part of the AnnData package.
For the most part, the AnnData objects that we provide are formatted to match the expected data format for CELLxGENE following schema version 3.0.0.
To begin, you will need to load the AnnData package and read the H5AD file:
import anndata
merged_adata_object = anndata.read_h5ad("SCPCP000000_merged_rna.h5ad")
AnnData expression counts
Merged AnnData objects contain two data matrices, each containing RNA-seq expression data for all libraries in the given ScPCA project combined into a single matrix.
The data matrix raw.X of the merged AnnData object contains the RNA-seq expression data as primary integer counts, and the data matrix X contains the RNA-seq expression data as normalized counts.
The data is stored as a sparse matrix following the AnnData convention, where each row represents a cell or droplet, and each column represents a gene.
The raw.X and X matrices can be accessed with the following Python code:
merged_adata_object.raw.X # raw count matrix
merged_adata_object.X # normalized count matrix
Row names are cell barcode sequences prefixed with the originating library ID, e.g. SCPCL000000-{barcode}, and column names are Ensembl gene IDs.
These names can be accessed with the following Python code:
merged_adata_object.obs_names # matrix row names (cells)
merged_adata_object.var_names # matrix column names (genes)
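Because every cell name carries its library ID as a prefix, the originating library can be recovered from the name alone. The sketch below shows one way to do this with plain Python strings; the helper names are hypothetical, the example IDs and barcodes are made up, and it assumes the library ID itself never contains a hyphen, as in the SCPCL000000-{barcode} pattern.

```python
from collections import Counter


def split_cell_id(cell_id):
    """Split 'SCPCL000000-{barcode}' into (library_id, barcode).

    Splits at the first hyphen only, so any hyphen inside the barcode
    portion is left untouched.
    """
    library_id, sep, barcode = cell_id.partition("-")
    if not sep or not library_id or not barcode:
        raise ValueError(f"unexpected cell ID format: {cell_id!r}")
    return library_id, barcode


def cells_per_library(cell_ids):
    """Count how many cells in the merged object come from each library."""
    return Counter(split_cell_id(cell_id)[0] for cell_id in cell_ids)


# Made-up cell names following the documented pattern.
example_ids = [
    "SCPCL000001-AAACCTGAGAAACCAT",
    "SCPCL000001-AAACCTGAGAAACCGC",
    "SCPCL000002-AAACCTGAGAAACCAT",
]
example_counts = cells_per_library(example_ids)
print(example_counts)
```

With a loaded object, the same helper could be called as cells_per_library(merged_adata_object.obs_names), since iterating over obs_names yields the cell names as strings.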
AnnData cell metrics
Cell metrics calculated from the RNA-seq expression data, which were calculated separately for each library, are stored as a pandas.DataFrame in the .obs slot.
The slot’s row names are cell barcode sequences prefixed with the originating library ID, e.g. SCPCL000000-{barcode}.
merged_adata_object.obs # cell metrics and metadata
All of the per-cell data columns included in the colData of the SingleCellExperiment merged objects are present in the .obs slot of the AnnData object.
To see a full description of the included columns, see the sections on cell metrics and sample metadata in Components of a SingleCellExperiment merged object.
AnnData gene information and metrics
Gene information and metrics from the RNA-seq expression data, which were calculated separately for each library, are stored as a pandas.DataFrame in the .var slot, with the Ensembl ID as the names of the rows.
merged_adata_object.var # gene metrics
All of the per-gene data columns included in the rowData of the SingleCellExperiment objects are present in the .var slot of the AnnData object.
Note that the SingleCellExperiment columns named SCPCL000000-mean and SCPCL000000-detected are instead named SCPCL000000.mean and SCPCL000000.detected, respectively, in the merged AnnData object.
To see a full description of the included columns, see the section on gene metrics in Components of a SingleCellExperiment merged object.
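When moving between the two formats, it can be handy to translate the per-library column names. The helper below is a small sketch of that translation; its name is hypothetical, and it assumes that only the hyphen before the mean and detected suffixes changes to a dot, which is what the two examples above show.

```python
# Suffixes of the per-library gene metric columns named in the text above.
LIBRARY_SUFFIXES = ("mean", "detected")


def sce_to_anndata_column(name):
    """Translate '{library_id}-mean' style names to '{library_id}.mean'.

    Names without one of the known suffixes are returned unchanged.
    """
    for suffix in LIBRARY_SUFFIXES:
        if name.endswith("-" + suffix):
            library_id = name[: -(len(suffix) + 1)]
            return f"{library_id}.{suffix}"
    return name


example_names = ["gene_symbol", "SCPCL000000-mean", "SCPCL000000-detected"]
example_translated = [sce_to_anndata_column(name) for name in example_names]
print(example_translated)
```

With a loaded object, a translated name could then be used to select a column, for example merged_adata_object.var[sce_to_anndata_column("SCPCL000000-detected")].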
AnnData experiment metadata
A partial set of the metadata associated with data processing is included in the .uns slot of the AnnData object as a dictionary.
merged_adata_object.uns # experiment metadata
The following items are available in the .uns slot:
Item name |
Contents |
---|---|
|
A list of library IDs which are included in the merged object, each in the form |
|
A list of sample IDs which are included in the merged object, each in the form |
|
A list of highly variable genes used for performing dimensionality reduction on the merged object, determined using |
Additional experiment metadata is available in the metadata TSV file included in the ScPCA Portal download folder.
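One way to combine the two sources is to filter the metadata TSV down to the libraries present in a merged object. The sketch below does this with the Python standard library. The function name is hypothetical, the column name scpca_library_id and the seq_unit column in the example are assumptions about the TSV layout rather than something stated here, and the library IDs are passed in as a plain list.

```python
import csv
import io


def rows_for_libraries(tsv_file, library_ids, id_column="scpca_library_id"):
    """Return metadata rows whose library ID is in library_ids.

    tsv_file: an open text file (or file-like object) with a header row.
    id_column: name of the library ID column; this default is an assumption.
    """
    wanted = set(library_ids)
    reader = csv.DictReader(tsv_file, delimiter="\t")
    if reader.fieldnames is None or id_column not in reader.fieldnames:
        raise KeyError(f"column {id_column!r} not found in metadata file")
    return [row for row in reader if row[id_column] in wanted]


# Small in-memory stand-in for the metadata TSV (made-up contents).
example_tsv = (
    "scpca_library_id\tseq_unit\n"
    "SCPCL000001\tcell\n"
    "SCPCL000002\tnucleus\n"
    "SCPCL000003\tcell\n"
)
example_rows = rows_for_libraries(
    io.StringIO(example_tsv), ["SCPCL000001", "SCPCL000003"]
)
print(example_rows)
```

For a real download, the io.StringIO object would be replaced by the opened metadata file, and the list of library IDs could be taken from the corresponding item in the .uns slot.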
AnnData dimensionality reduction results
The merged AnnData object contains a slot .obsm with both principal component analysis (X_PCA) and UMAP (X_UMAP) results.
For information on how PCA and UMAP results were calculated see the section on processed gene expression data.
The following command can be used to access the PCA and UMAP results:
merged_adata_object.obsm["X_PCA"] # pca results
merged_adata_object.obsm["X_UMAP"] # umap results