# Processing information

## Single-cell and single-nuclei RNA-seq

### Mapping and quantification using alevin-fry

We used salmon alevin and alevin-fry to generate gene by cell counts matrices for all single-cell and single-nuclei samples. In brief, we utilized selective alignment to the splici index for all single-cell and single-nuclei samples.

#### Reference transcriptome index

For all samples, we aligned FASTQ files to a reference transcriptome index referred to as the splici index. The splici index is built using transcripts from both spliced cDNA and intronic regions. Inclusion of intronic regions in the index used for alignment allowed us to capture both reads from mature, spliced cDNA and nascent, unspliced cDNA. Alignment of RNA-seq data to an index containing intronic regions has been shown to reduce spuriously detected genes (He et al. 2021, Kaminow et al. 2021). In our hands, we have found that use of the splici index led to a more comparable distribution of unique genes found per cell to Cell Ranger than did use of an index obtained from spliced cDNA transcripts only.

#### Selective alignment

We mapped reads to the transcriptome index using salmon with the default “selective alignment” strategy. Briefly, selective alignment uses a mapping score validated approach to identify maximal exact matches between reads and the provided index. For all samples, we used selective alignment to the splici index.

A more detailed description of the mapping strategy invoked by salmon in conjunction with alevin-fry can be found in Srivastava et al. (2020).

#### Alevin-fry parameters

After mapping FASTQ files using selective alignment to the splici index, we continued with the alevin-fry pipeline using the following parameters:

1. During the generate-permit-list step of alevin-fry, we used the --unfiltered-pl option, which returns any cell with at least 1 read found in a reference barcode list. For our reference barcode list, we used a list of all possible cell barcodes from 10X Genomics.

2. We chose to use the cr-like-em resolution strategy for feature quantification and UMI de-duplication. Similar to the way Cell Ranger performs feature quantification, the cr-like-em resolution strategy assigns all UMIs that align to a single gene to that gene. In contrast to Cell Ranger, cr-like-em keeps multi-mapped reads and invokes an extra step to assign these multi-mapped reads to a UMI.

3. With initial mapping to the splici index, alevin-fry quantification resulted in separate counts for spliced and unspliced transcripts, and an ambiguous count for reads compatible with either spliced or unspliced transcripts.

### Post alevin-fry processing

#### Combining counts from spliced cDNA and intronic regions

For single-cell samples, we only included reads aligning to spliced and ambiguous cDNA transcripts in the counts matrix. For single-nuclei samples, all counts for spliced cDNA and intronic regions were summed for each gene to return the total counts summarized by gene in the counts matrix.

After combining read counts, values are rounded to integer values.

#### Filtering cells

In addition to an unfiltered counts matrix, we provide a matrix filtered to only cell barcodes from droplets that are likely to include true cells. To do this we used DropletUtils::emptyDropsCellRanger(), a function that estimates the profile of cells containing ambient RNA and tests the likelihood of all other droplets as differing from the ambient profile (Lun et al. 2019). This function more closely mimics the filtering performed in Cell Ranger than does its predecessor DropletUtils::emptyDrops(). We consider droplets with an FDR less than or equal to 0.01 to be cell-containing droplets. Only cells that pass this FDR threshold are included in the filtered counts matrix.

For some libraries, DropletUtils::emptyDropsCellRanger() may fail due to low numbers of droplets with reads or other violations of its assumptions. For these libraries, only droplets containing at least 100 UMI are included in the filtered counts matrix.

### Processed gene expression data

In addition to the raw gene expression data, we also provide a _processed.rds file containing a SingleCellExperiment object with further filtering applied, a normalized counts matrix, and results from dimensionality reduction.

Prior to normalization, low-quality cells are removed from the gene by cell counts matrix. To identify low-quality cells, we use miQC, a package that jointly models proportion of reads belonging to mitochondrial genes and number of unique genes detected. Cells with a high likelihood of being compromised (greater than 0.75) and cells that do not pass a minimum number of unique genes detected threshold of 200 are removed from the counts matrix present in the _processed.rds file.

Log-normalized counts are calculated using the deconvolution method presented in Lun, Bach, and Marioni (2016). The log-normalized counts are used to model variance of each gene prior to selecting the top 2000 highly variable genes (HVGs). These HVGs are then used as input to principal component analysis, and the top 50 principal components are selected. Finally, these principal components are used to calculate the UMAP (Uniform Manifold Approximation and Projection) embeddings.

## CITE-seq quantification

CITE-seq libraries with reads from antibody-derived tags (ADTs) were also quantified using salmon alevin and alevin-fry, rounded to integer values.

Reference indices were constructed from the submitter-provided list of antibody barcode sequences corresponding to each library using the --features flag of salmon index. Mapping to these indices followed the same procedures as for RNA-seq data, including mapping with selective alignment and subsequent quantification via alevin-fry.

### Combining CITE counts with RNA counts

The unfiltered CITE-seq and RNA-seq count matrices often include somewhat different sets of cell barcodes, due to stochastic variation in library construction and sequencing. When normalizing these two count matrices to the same set of cells, we chose to prioritize RNA-seq results for broad comparability among libraries with and without CITE-seq data. Any cell barcodes that appeared only in CITE-seq data were discarded. Cell barcodes that were present only in the RNA-seq data (i.e., did not appear in the CITE-seq data) were assigned zero counts for all ADTs. When cells were filtered based on RNA-seq content after quantification, the CITE-seq count matrix was filtered to match.

## Multiplexed libraries

Multiplexed libraries, or libraries with cells or nuclei from more than one biological sample, were processed to allow demultiplexing using both hashtag oligonucleotide (HTO) results and genotype data.

### Hashtag oligonucleotide (HTO) quantification

HTO reads were also quantified using salmon alevin and alevin-fry, rounded to integer values. Reference indices were constructed from the submitter-provided list of HTO sequences corresponding to each library using the --features flag of salmon index. Mapping to these indices followed the same procedures as for RNA-seq data, including mapping with selective alignment and subsequent quantification via alevin-fry.

As with the CITE-seq data, we retained all cells with RNA-seq data, setting HTO counts to zero for any missing cell barcodes. When cells were filtered based on RNA-seq content after quantification, the HTO count matrix was filtered to match.

### HTO demultiplexing

We performed HTO demultiplexing using both DropletUtils::hashedDrops() and Seurat::HTODemux() only on the filtered cells, using default parameters for each.

We report the demultiplexed sample calls and associated statistics for both algorithms, but do not separate the multiplexed library into individual samples.

### Genetic demultiplexing

For multiplex libraries where bulk RNA-seq data is available for the individual samples, we also performed demultiplexing analysis using genotype data following the methods described in Weber et al. (2021):

The genetic demultiplexing calls are reported alongside HTO demultiplexing results for each library, but we again do not separate the individual samples. For information on where the demultiplexing calls can be found, see the section on demultiplexing results in the SingleCellExperiment file contents.

## Spatial transcriptomics

### Mapping and quantification using Space Ranger

Processing spatial transcriptomics libraries requires two steps - gene expression quantification and tissue detection. In the absence of independent tissue detection methods to use with Alevin-fry, we used 10X Genomics’ Space Ranger to obtain both gene expression and spatial information. spaceranger count takes FASTQ files and a microscopic slide image as input and performs alignment, quantification, and tissue detection for each spot. In contrast to alevin-fry, which maps reads to a reference transcriptome index, Space Ranger aligns transcript reads to the reference genome using STAR (Dobin et al. 2012). See the 10X documentation for more information on how Space Ranger quantifies gene expression and detects tissue.

## Bulk RNA samples

### Preprocessing with fastp

Prior to quantifying gene expression for bulk RNA-seq samples, FASTQ files were pre-processed using fastp to perform adapter trimming, quality filtering, and length filtering. For length filtering, trimmed reads shorter than 20 basepairs were removed by using the --length_required 20 option. All other filtering and trimming was performed using the default strategies enabled in fastp.

### Mapping and quantification using salmon

To quantify gene expression for bulk RNA-seq samples, we used salmon quant. Here, we performed selective alignment of the trimmed and filtered FASTQ files to a decoy-aware reference transcriptome index (Srivastava et al. 2020). The decoy-aware reference transcriptome, was created from spliced cDNA sequences with the entire genome sequence as a decoy.

#### Salmon parameters

A benefit of using salmon is the ability to incorporate RNA-seq specific technical biases and correct counts accordingly. We chose to enable the --seqBias and --gcBias flags, to correct for sequence-specific biases due to random hexamer primer and fragment-level GC biases, respectively.