Downloadable files
The ScPCA Portal download packages include gene expression data, a QC report, and associated metadata for each processed sample. These files are delivered as a zip file.
For all downloads, sample folders (indicated by the SCPCS prefix) contain the files for all libraries (SCPCL prefix) derived from that biological sample.
Most samples only have one library that has been sequenced.
For multiplexed sample libraries, the sample folder name will be an underscore-separated list of all samples found in the library files that the folder contains.
See the FAQ section about samples and libraries for more information.
The files associated with each library are (example shown for a library with ID SCPCL000000):
- An unfiltered counts file: SCPCL000000_unfiltered.rds
- A filtered counts file: SCPCL000000_filtered.rds
- A processed counts file: SCPCL000000_processed.rds
- A quality control report: SCPCL000000_qc.html
Every download also includes a single single_cell_metadata.tsv file containing metadata for all libraries included in the download.
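As a rough illustration of the naming scheme described above, the following Python sketch builds the file names expected for one library. The helper name `expected_library_files` is hypothetical and not part of any ScPCA tooling; the suffixes are taken from the list of per-library files.

```python
def expected_library_files(library_id):
    """Return the four per-library file names described in this section.

    Hypothetical helper for illustration only; not part of any ScPCA tooling.
    """
    suffixes = ["unfiltered.rds", "filtered.rds", "processed.rds", "qc.html"]
    return [f"{library_id}_{suffix}" for suffix in suffixes]


# Example with the placeholder library ID used in this documentation
for name in expected_library_files("SCPCL000000"):
    print(name)
```

A list like this can be compared against the contents of an unzipped sample folder to confirm that a download is complete.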
The folder structure within the zip file is determined by whether individual samples or all samples associated with a project are selected for download.
Download folder structure for project downloads:
If a project contains bulk RNA-seq data, two tab-separated value files, bulk_quant.tsv and bulk_metadata.tsv, will be included in the download.
The bulk_quant.tsv file contains a gene by sample matrix (each row a gene, each column a sample) containing raw gene expression counts quantified by Salmon.
The bulk_metadata.tsv file contains associated metadata for all samples with bulk RNA-seq data.
See also processing bulk RNA samples.
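The gene by sample layout of bulk_quant.tsv can be read with a few lines of Python. This is a minimal sketch that assumes the first column holds gene identifiers and the remaining header fields are sample IDs; the header name `gene_id`, the gene names, and the counts in the example are made up, so check the layout against your own file.

```python
import csv
import io


def read_bulk_quant(handle):
    """Parse a gene-by-sample TSV into {gene: {sample: count}}.

    Assumes the first column holds gene identifiers and the remaining
    header fields are sample IDs (an assumption, not a documented layout).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    matrix = {}
    for row in reader:
        gene, values = row[0], row[1:]
        # Counts are parsed as floats because estimated counts may be non-integer
        matrix[gene] = {s: float(v) for s, v in zip(samples, values)}
    return matrix


# Tiny made-up table standing in for bulk_quant.tsv
example = io.StringIO(
    "gene_id\tSCPCS999990\tSCPCS999991\n"
    "GENE_A\t10\t0\n"
    "GENE_B\t3\t7\n"
)
bulk = read_bulk_quant(example)
print(bulk["GENE_B"]["SCPCS999991"])
```

For a real download, pass an open file handle for bulk_quant.tsv in place of the in-memory example.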
Download folder structure for individual sample downloads:
Note that if a sample selected for download contains a spatial transcriptomics library, the files included will differ from those pictured above. See the Spatial transcriptomics libraries section below.
Gene expression data
Single-cell or single-nuclei gene expression data is provided in three forms: an unfiltered counts file, a filtered counts file, and a processed counts file.
The unfiltered counts file, SCPCL000000_unfiltered.rds, is an RDS file containing a SingleCellExperiment object.
Within the SingleCellExperiment object is the counts matrix, where the rows correspond to genes or features and the columns correspond to cell barcodes.
Here, all potential cell barcodes that are identified after running alevin-fry are included in the counts matrix.
The object also includes summary statistics for each cell barcode and gene, as well as metadata about that particular library, such as the reference index and software versions used for mapping and quantification.
The filtered counts file, SCPCL000000_filtered.rds, is also an RDS file containing a SingleCellExperiment object with the same structure as above.
The cells in this file are those that remain after filtering using emptyDrops.
As a result, this file only contains cell barcodes that are likely to correspond to true cells.
The processed counts file, SCPCL000000_processed.rds, is an RDS file containing a SingleCellExperiment object with both the raw and normalized counts matrices.
To produce it, the filtered counts file is further filtered to remove low-quality cells, such as those with a low number of genes detected or high mitochondrial content.
This file contains the raw and normalized counts data for cell barcodes that have passed both levels of filtering.
In addition to the counts matrices, the SingleCellExperiment object stored in the file includes the results of dimensionality reduction using both principal component analysis (PCA) and UMAP.
See Single-cell gene expression file contents for more information about the contents of the SingleCellExperiment objects and the included statistics and metadata.
See also Using the provided RDS files in R.
QC Report
The included QC report serves as a general overview of each library, including processing information, summary statistics and general visualizations of cell metrics.
Metadata
The single_cell_metadata.tsv file is a tab-separated table with one row per library and the following columns.
| column_id | contents |
|---|---|
| scpca_sample_id | Sample ID in the form SCPCS000000 |
| scpca_library_id | Library ID in the form SCPCL000000 |
|  |  |
|  | 10X kit used to process library |
|  | Number of cells after filtering with emptyDrops |
|  | Original sample identifier from submitter |
|  | Original participant ID, if there are multiple samples from the same participant |
|  | Submitter name/ID |
| age | Age at time sample was obtained |
| sex | Sex of patient that the sample was obtained from |
| diagnosis | Tumor type |
| subdiagnosis | Subcategory of diagnosis or mutation status (if applicable) |
| tissue_location | Where in the body the tumor sample was located |
| disease_timing | Stage of disease at which the sample was obtained (e.g., at diagnosis or at recurrence) |
Additional metadata may also be included, specific to the disease type and experimental design of the project.
Examples of this include treatment or outcome.
Metadata pertaining to processing will also be available in this table and inside of the SingleCellExperiment object.
See the Experiment metadata section for more information on metadata columns that can be found in this file as well as inside the SingleCellExperiment object.
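Because single_cell_metadata.tsv is a plain tab-separated table, it can be filtered with Python's standard library. The sketch below selects library IDs by diagnosis; the function name `libraries_by_diagnosis` is hypothetical, the example rows are made up, and a real file contains more columns than the three shown here.

```python
import csv
import io


def libraries_by_diagnosis(handle, diagnosis):
    """Return library IDs whose 'diagnosis' column matches the given value.

    Hypothetical helper; relies on the scpca_library_id and diagnosis
    columns named in this documentation.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["scpca_library_id"]
        for row in reader
        if row["diagnosis"] == diagnosis
    ]


# Made-up rows standing in for single_cell_metadata.tsv
example = io.StringIO(
    "scpca_sample_id\tscpca_library_id\tdiagnosis\n"
    "SCPCS999990\tSCPCL999990\tDiagnosis A\n"
    "SCPCS999991\tSCPCL999991\tDiagnosis B\n"
    "SCPCS999992\tSCPCL999992\tDiagnosis A\n"
)
selected = libraries_by_diagnosis(example, "Diagnosis A")
print(selected)
```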
For projects with bulk RNA-seq data, the bulk_metadata.tsv file will be included for project downloads.
This file will contain fields equivalent to those found in the single_cell_metadata.tsv file related to processing the sample, but will not contain patient or disease specific metadata (e.g. age, sex, diagnosis, subdiagnosis, tissue_location, or disease_timing).
Multiplexed sample libraries
For libraries where multiple biological samples were combined via cellhashing or similar technology (see the FAQ section about multiplexed samples), the organization of the downloaded files and metadata is slightly different.
For project downloads, the counts and QC files will be organized by the set of samples that comprise each library, rather than in individual sample folders.
These sample set folders are named with an underscore-separated list of the sample IDs for the libraries within, e.g., SCPCS999990_SCPCS999991_SCPCS999992.
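Since the folder name is an underscore-separated list, the sample IDs can be recovered by splitting on underscores. The helper below is a hypothetical sketch; the check for the SCPCS prefix is an added safeguard based on the naming convention described in this documentation.

```python
def samples_in_folder(folder_name):
    """Split a sample set folder name into its individual sample IDs.

    Hypothetical helper; raises ValueError if any part lacks the SCPCS prefix.
    """
    sample_ids = folder_name.split("_")
    for sample_id in sample_ids:
        if not sample_id.startswith("SCPCS"):
            raise ValueError(f"Unexpected folder name component: {sample_id!r}")
    return sample_ids


print(samples_in_folder("SCPCS999990_SCPCS999991_SCPCS999992"))
```

A folder for a single-sample library yields a one-element list, so the same helper works for both cases.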
Bulk RNA-seq data, if present, will follow the same format as bulk RNA-seq for single-sample libraries.
Because we do not perform demultiplexing to separate cells from multiplexed libraries into sample-specific count matrices, sample downloads from a project with multiplexed data will include all libraries that contain the sample of interest, but these libraries will still contain cells from other samples.
For more on the specific contents of multiplexed library SingleCellExperiment objects, see the Additional SingleCellExperiment components for multiplexed libraries section.
The metadata file for multiplexed libraries (single_cell_metadata.tsv) will have the same format as for individual samples, but each row will represent a particular sample/library pair, meaning that there may be multiple rows for each scpca_library_id, one for each scpca_sample_id within that library.
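To see which samples share a multiplexed library, the sample/library rows can be grouped by scpca_library_id. This is a minimal sketch using made-up rows; the function name `samples_per_library` is hypothetical, and a real metadata file contains many more columns.

```python
import csv
import io
from collections import defaultdict


def samples_per_library(handle):
    """Group scpca_sample_id values by scpca_library_id.

    Hypothetical helper; each input row is one sample/library pair.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["scpca_library_id"]].append(row["scpca_sample_id"])
    return dict(grouped)


# Made-up rows: one multiplexed library with two samples, one single-sample library
example = io.StringIO(
    "scpca_sample_id\tscpca_library_id\n"
    "SCPCS999990\tSCPCL999990\n"
    "SCPCS999991\tSCPCL999990\n"
    "SCPCS999992\tSCPCL999991\n"
)
grouped = samples_per_library(example)
print(grouped)
```

Libraries with more than one sample ID in the result are the multiplexed ones.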
Spatial transcriptomics libraries
If a sample includes a library processed using spatial transcriptomics, the spatial transcriptomics output files will be available as a separate download from the single-cell/single-nuclei gene expression data.
For all spatial transcriptomics libraries, a SCPCL000000_spatial folder will be nested inside the corresponding sample folder in the download.
Inside that folder will be the following folders and files:
- A raw_feature_bc_matrix folder containing the unfiltered counts files
- A filtered_feature_bc_matrix folder containing the filtered counts files
- A spatial folder containing images and position information
- A SCPCL000000_spaceranger_summary.html file containing the summary html report provided by Space Ranger
- A SCPCL000000_metadata.json file containing library processing information
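The layout above can be sketched as a list of expected relative paths for one spatial library. The helper `expected_spatial_entries` is hypothetical, and the sample ID SCPCS000000 is a placeholder chosen for illustration.

```python
from pathlib import PurePosixPath


def expected_spatial_entries(sample_id, library_id):
    """List the folders and files expected for one spatial library.

    Hypothetical helper; paths are relative to the root of the download.
    """
    base = PurePosixPath(sample_id) / f"{library_id}_spatial"
    names = [
        "raw_feature_bc_matrix",
        "filtered_feature_bc_matrix",
        "spatial",
        f"{library_id}_spaceranger_summary.html",
        f"{library_id}_metadata.json",
    ]
    return [str(base / name) for name in names]


for entry in expected_spatial_entries("SCPCS000000", "SCPCL000000"):
    print(entry)
```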
A full description of all files included in the download for spatial transcriptomics libraries can also be found in the spaceranger count documentation.
Every download also includes a single spatial_metadata.tsv file containing metadata for all libraries included in the download.