Processing information

Single-cell and single-nuclei RNA-seq

Mapping and quantification using alevin-fry

We used salmon and alevin-fry to generate gene by cell counts matrices for all single-cell and single-nuclei samples. In brief, we utilized selective alignment to the splici index for all single-cell and single-nuclei samples.

Reference transcriptome index

For all samples, we aligned FASTQ files to a reference transcriptome index referred to as the splici index. The splici index is built using transcripts from both spliced cDNA and intronic regions. Inclusion of intronic regions in the index used for alignment allowed us to capture both reads from mature, spliced cDNA and nascent, unspliced cDNA. Alignment of RNA-seq data to an index containing intronic regions has been shown to reduce spuriously detected genes (He et al. (2022), Kaminow et al. 2021). In our hands, we have found that use of the splici index led to a more comparable distribution of unique genes found per cell to Cell Ranger than did use of an index obtained from spliced cDNA transcripts only.

Selective alignment

We mapped reads to the transcriptome index using salmon with the default “selective alignment” strategy. Briefly, selective alignment uses a mapping score validated approach to identify maximal exact matches between reads and the provided index. For all samples, we used selective alignment to the splici index.

More detailed descriptions of the mapping strategy invoked by salmon in conjunction with alevin-fry can be found in Srivastava et al. (2020) and He et al. (2022).

Alevin-fry parameters

After mapping FASTQ files using selective alignment to the splici index, we continued with the alevin-fry pipeline using the following parameters:

  1. During the generate-permit-list step of alevin-fry, we used the --unfiltered-pl option, which returns any cell with at least 1 read found in a reference barcode list. For our reference barcode list, we used a list of all possible cell barcodes from 10x Genomics.

  2. We chose to use the cr-like-em resolution strategy for feature quantification and UMI de-duplication. Similar to the way Cell Ranger performs feature quantification, the cr-like-em resolution strategy assigns all UMIs that align to a single gene to that gene. In contrast to Cell Ranger, cr-like-em keeps multi-mapped reads and invokes an extra step to assign these multi-mapped reads to a UMI.

  3. With initial mapping to the splici index, alevin-fry quantification resulted in separate counts for spliced and unspliced transcripts, and an ambiguous count for reads compatible with either spliced or unspliced transcripts.

Post alevin-fry processing

Combining counts from spliced cDNA and intronic regions

For single-cell and single-nuclei samples, the reads from spliced cDNA and intronic regions are combined by gene to produce a gene by cell counts matrix. After combining read counts, values are rounded to integer values.

Filtering cells

In addition to an unfiltered counts matrix, we provide a matrix filtered to only cell barcodes from droplets that are likely to include true cells. To do this we used DropletUtils::emptyDropsCellRanger(), a function that estimates the profile of cells containing ambient RNA and tests the likelihood of all other droplets as differing from the ambient profile (Lun et al. 2019). This function more closely mimics the filtering performed in Cell Ranger than does its predecessor DropletUtils::emptyDrops(). We consider droplets with an FDR less than or equal to 0.01 to be cell-containing droplets. Only cells that pass this FDR threshold are included in the filtered counts matrix.

For some libraries, DropletUtils::emptyDropsCellRanger() may fail due to low numbers of droplets with reads or other violations of its assumptions. For these libraries, only droplets containing at least 100 UMI are included in the filtered counts matrix.

We additionally used scDblFinder to predict whether cells present in this filtered object are singlets or doublets. We provide the results from this analysis in the filtered and processed objects, but we do not perform any filtering based on these results.

Processed gene expression data

In addition to the raw gene expression data, we also provide a processed SingleCellExperiment object with further filtering applied, a normalized counts matrix, and results from dimensionality reduction.

Prior to normalization, low-quality cells are removed from the gene by cell counts matrix. To identify low-quality cells, we use miQC, a package that jointly models proportion of reads belonging to mitochondrial genes and number of unique genes detected. Cells with a high likelihood of being compromised (greater than 0.75) and cells that do not pass a minimum number of unique genes detected threshold of 200 are removed from the counts matrix present in the processed SingleCellExperiment object. In certain circumstances, miQC modeling may fail; in these cases, only cells which do not pass the threshold of at least 200 unique genes are removed.

Log-normalized counts are calculated using the deconvolution method presented in Lun, Bach, and Marioni (2016). Specifically, scran::quickCluster() was used to derive cell clusters on which to calculate sum factors with scran::computeSumFactors(), which are in turn used during normalization with scuttle::logNormCounts(). If this deconvolution-based approach failed for any reason, only scuttle::logNormCounts() was used for normalization.

Next, scran::modelGeneVar() was used to model gene variance from the log-normalized counts and scran::getTopHVGs() was used to select the top 2000 high-variance genes. The log-normalized counts are used to model variance of each gene prior to selecting the top 2000 highly variable genes (HVGs). These HVGs are then used as input to principal component analysis, and the top 50 principal components are selected. Finally, these principal components are used to calculate the UMAP (Uniform Manifold Approximation and Projection) embeddings.

Graph-based clustering is also performed, using the Louvain algorithm with 20 nearest neighbors and Jaccard weighting.

Cell type annotation

We perform cell type annotation with three complementary methods, where possible, and assign a single consensus cell type annotation based on agreement among these methods:

For SingleR annotation, we identify an appropriate reference dataset from the celldex package and train the classification model to use ontology IDs for annotation. Cells which SingleR cannot confidently assign are labeled as NA.

For CellAssign annotation, we identify an appropriate set of marker genes for the given tissue type from the PanglaoDB database. We combine these marker genes with all “immune cell” PanglaoDB marker genes to create a full marker gene list for for cell type annotation. During annotation, we additionally include an "other" cell type that does not express any of these marker genes. As a consequence, cells which CellAssign cannot confidently annotate from the full marker gene list are labeled as "other".

Please be aware that all cell type annotation reference datasets are derived from normal (not tumor) tissue. In addition, CellAssign annotation is only performed if there are at least 30 cells present in the processed object.

For SCimilarity annotation, we use the foundation model described in Heimberg et al. 2025 that contains 7.3 million cells from various normal and diseased tissues to annotate all samples. Each cell is annotated with the cell type label of the most similar cell in the SCimilarity model.

Some cells may be labeled as “Unclassified cell” if they were not annotated with a given automated method. These cells were not present in earlier ScPCA data versions on which cell typing was originally performed and are therefore not labeled.

Additionally, annotations from SingleR, CellAssign, and SCimilarity are used to assign an ontology-aware consensus cell type label.

Consensus cell types are assigned if two out of the three cell type methods share an latest common ancestor (LCA) that meets the following criteria, otherwise no consensus cell type is assigned:

  1. The terms share at least 1 LCA that either has fewer than 170 descendants or is one of neuron, epithelial cell, columnar/cuboidal epithelial cell or endo-epithelial cell.

  2. If more than 1 LCA is shared between two terms, then the LCA with the fewest descendants is kept and all others are discarded.

  3. If the LCA has fewer than 170 descendants and is one of the following non-specific LCA terms, no consensus cell type is assigned: bone cell, lining cell, blood cell, progenitor cell, supporting cell, biogenic amine secreting cell, protein secreting cell, extracellular matrix secreting cell, serotonin secreting cell, peptide hormone secreting cell, exocrine cell, sensory receptor cell, or interstitial cell.

If more than one LCA is identified as a possible consensus cell type, meaning there is agreement among all three methods, the LCA with the fewest descendants is used as the consensus cell type. For more information about how consensus cell types are assigned, see the cell-type-consensus module in the OpenScPCA-analysis GitHub repository.

Cell type annotation is not performed for cell line samples. For information on how to determine if a given sample was derived from a cell line, refer to section(s) describing SingleCellExperiment file contents and/or AnnData file contents.

Note: For some libraries, cell type annotations were provided from the group that submitted the original data. In these cases, the cell type annotations obtained from the submitter will be present in addition to cell type annotation performed with SingleR and CellAssign.

Cell type annotations from the OpenScPCA project

As part of the ongoing OpenScPCA project, cell types are annotated and validated on a project-by-project basis using methods and references that are most appropriate for the disease types represented in that project. If cell type annotation has been completed for all samples in a project, these curated cell types will be included alongside the automated cell type annotations. For more information on where to find these annotations in the downloaded objects, refer to section(s) describing SingleCellExperiment file contents and/or AnnData file contents.

For more details on how cells from a specific project were annotated, see the appropriate module in the OpenScPCA-analysis repository. The name of the module and other versioning information can be found in the experiment metadata for SingleCellExperiment and AnnData objects.

The OpenScPCA project is an ongoing open and collaborative effort to characterize the ScPCA Portal data. For more information on the project, including contributing your own analyses, see the OpenScPCA documentation.

CNV inference

We perform CNV inference using inferCNV, specifying the i6 HMM to quantify specific CNV events.

inferCNV uses a designated set of normal reference cells to quantify CNV events based on gene expression. We use the consensus cell type labels, as described in the cell type annotation section, to establish normal references for each sample. The specific cell types to include are determined by each sample’s diagnosis. We designate cells as either reference or query, rather than using their specific cell type labels, where the label reference was used to specify normal reference cells. inferCNV is only run if there are at least 100 cells designated as reference in a given sample. As such, inferCNV is not run on cell lines because they do not undergo cell type annotation or on samples obtained from normal or non-cancerous tissue.

We additionally specify a gene ordering file with chromosome arm designations (e.g., chr1p and chr1q are used rather than chr1) for finer-grained results. We use all other inferCNV defaults, except we set denoise = TRUE and cutoff = 0.1 (for 10x data) as recommended. Note that we keep the default inferCNV setting to remove any cells with raw RNA counts less than 100; these cells will not have inferCNV estimates.

We calculate the total CNV per cell using the feature output from the i6 HMM by summing all values in the HMM metadata table columns named has_cnv_{chr}{1:22}{p,q} (e.g., has_cnv_chr1p, has_cnv_chr1q, and so on). Any cells which inferCNV removed due to low counts will not have a total CNV estimate.

ADT quantification from CITE-seq experiments

CITE-seq libraries with reads from antibody-derived tags (ADTs) were also quantified using salmon and alevin-fry, rounded to integer values.

Reference indices were constructed from the submitter-provided list of antibody barcode sequences corresponding to each library using the --features flag of salmon index. Mapping to these indices followed the same procedures as for RNA-seq data, including mapping with selective alignment and subsequent quantification via alevin-fry.

Combining ADT counts with RNA counts

The unfiltered ADT and RNA-seq count matrices often include somewhat different sets of cell barcodes, due to stochastic variation in library construction and sequencing. When normalizing these two count matrices to the same set of cells, we chose to prioritize RNA-seq results for broad comparability among libraries with and without ADT data. Any cell barcodes that appeared only in ADT data were discarded. Cell barcodes that were present only in the RNA-seq data (i.e., did not appear in the ADT data) were assigned zero counts for all ADTs. When cells were filtered based on RNA-seq content after quantification, the ADT count matrix was filtered to match.

Processed ADT data

An ambient profile representing antibody-derived tag (ADT) proportions present in the ambient solution is calculated from the unfiltered SingleCellExperiment object using DropletUtils::ambientProfileEmpty(). Quality-control statistics were calculated with DropletUtils::cleanTagCounts() (with default parameters) using this ambient profile, along with negative/isotype control information, if present. This function flags cells as low-quality if they either have very high levels of ambient contamination and/or negative/isotype control tags (if present), or lack ambient expression altogether which may indicate failed capture. Low-quality cells identified by DropletUtils::cleanTagCounts() are flagged but not removed except during normalization, as described below. If DropletUtils::cleanTagCounts() cannot reliably determine which cells to filter, then no cells will be flagged for removal.

For all cells that would be retained if DropletUtils::cleanTagCounts() filtering were applied, log-normalized ADT counts are, by default, calculated using median-based normalization, again making use of the baseline ambient profile. In order for this normalization to succeed, all median size factor values must be positive. If any size factors are not positive or if ADT filtering failed, then only log-based normalization (with a pseudocount of one) will be performed. Normalized counts for cells that would be filtered out by DropletUtils::cleanTagCounts() are assigned as NA.

Multiplexed libraries

Multiplexed libraries, or libraries with cells or nuclei from more than one biological sample, were processed to allow demultiplexing using both hashtag oligonucleotide (HTO) results and genotype data.

Hashtag oligonucleotide (HTO) quantification

HTO reads were also quantified using salmon and alevin-fry, rounded to integer values. Reference indices were constructed from the submitter-provided list of HTO sequences corresponding to each library using the --features flag of salmon index. Mapping to these indices followed the same procedures as for RNA-seq data, including mapping with selective alignment and subsequent quantification via alevin-fry.

As with the ADT data, we retained all cells with RNA-seq data, setting HTO counts to zero for any missing cell barcodes. When cells were filtered based on RNA-seq content after quantification, the HTO count matrix was filtered to match.

HTO demultiplexing

We performed HTO demultiplexing using both DropletUtils::hashedDrops() and Seurat::HTODemux() only on the filtered cells, using default parameters for each.

We report the demultiplexed sample calls and associated statistics for both algorithms, but do not separate the multiplexed library into individual samples.

Genetic demultiplexing

For multiplex libraries where bulk RNA-seq data is available for the individual samples, we also performed demultiplexing analysis using genotype data following the methods described in Weber et al. (2021):

The genetic demultiplexing calls are reported alongside HTO demultiplexing results for each library, but we again do not separate the individual samples. For information on where the demultiplexing calls can be found, see the section on demultiplexing results in the SingleCellExperiment file contents.

Spatial transcriptomics

Mapping and quantification using Space Ranger

Processing spatial transcriptomics libraries requires two steps - gene expression quantification and tissue detection. In the absence of independent tissue detection methods to use with Alevin-fry, we used 10x Genomics’ Space Ranger to obtain both gene expression and spatial information. spaceranger count takes FASTQ files and a microscopic slide image as input and performs alignment, quantification, and tissue detection for each spot. In contrast to alevin-fry, which maps reads to a reference transcriptome index, Space Ranger aligns transcript reads to the reference genome using STAR (Dobin et al. 2012). See the 10x documentation for more information on how Space Ranger quantifies gene expression and detects tissue.

Bulk RNA samples

Preprocessing with fastp

Prior to quantifying gene expression for bulk RNA-seq samples, FASTQ files were pre-processed using fastp to perform adapter trimming, quality filtering, and length filtering. For length filtering, trimmed reads shorter than 20 basepairs were removed by using the --length_required 20 option. All other filtering and trimming was performed using the default strategies enabled in fastp.

Mapping and quantification using salmon

To quantify gene expression for bulk RNA-seq samples, we used salmon quant. Here, we performed selective alignment of the trimmed and filtered FASTQ files to a decoy-aware reference transcriptome index (Srivastava et al. 2020). The decoy-aware reference transcriptome, was created from spliced cDNA sequences with the entire genome sequence as a decoy.

Salmon parameters

A benefit of using salmon is the ability to incorporate RNA-seq specific technical biases and correct counts accordingly. We chose to enable the --seqBias and --gcBias flags, to correct for sequence-specific biases due to random hexamer primer and fragment-level GC biases, respectively.

Merged objects

In addition to providing separate objects for each sample, we also offer an option to download all samples (and therefore libraries) for a project as a single merged object. This merged object contains the gene expression data for all samples from a single project in a single file. This section describes how these merged objects were prepared.

Following post-processing of each SingleCellExperiment (see the section above on post-processing) object, all objects belonging to a single ScPCA project were merged together. These merged objects were not batch-corrected; they do not represent integrated objects.

If at least one library in the given project contained ADT data from CITE-seq experiments, the associated ADT “alternative experiment” was also merged. Any libraries in the given project which did not contain ADT data will contain NA values in the merged gene by counts matrices.

By contrast, cell hashing alternative experiments were not merged. For any projects with cell hash data, only the associated RNA data was merged in the final object.

After merging, new principal component analysis (PCA) coordinates and UMAP embeddings were calculated so that each library in the merged object is equally weighted. For this, the top 2000 high-variance genes (HVGs) were calculated by modeling variance separately for each library in the merged object. These HVGs were used as input to the PCA, which was calculated using the batchelor::multiBatchPCA function and specifying libraries as batches, and the top 50 principal components were selected. These new principal components were used to calculate the new UMAP embeddings found in the merged object.