ScDiagnostics BioC2024 Workshop

Anthony Christidis

anthony-alexander_christidis@hms.harvard.edu

Andrew Ghazi

Smriti Chawla

Nitesh Turaga

Ludwig Geistlinger

Robert Gentleman

2024-07-25

Source: vignettes/BioC2024_scDiagnostics_Workshop.Rmd

BioC2024_scDiagnostics_Workshop.Rmd

Overview

Description

scDiagnostics is a comprehensive toolkit designed for the analysis and diagnostics of single-cell RNA sequencing (scRNA-seq) data. This package provides functionalities for comparing principal components, visualizing canonical correlation analysis (CCA) outputs, and plotting cell type-specific MDS and PCA projections, among other tasks.

Pre-requisites

Before attending this workshop, participants should have a basic understanding of single-cell RNA sequencing (scRNA-seq) data analysis and familiarity with Bioconductor packages. Specifically, knowledge of the SingleCellExperiment package will be useful for working with single-cell datasets. While familiarity with dimensionality reduction techniques, such as Principal Component Analysis (PCA), and statistical tests will enhance the learning experience, it is not required.

A solid foundation in single-cell analysis techniques and tools will significantly benefit those looking to fully leverage the capabilities of specialized packages like scDiagnostics.” Basic proficiency in R and R Markdown is also recommended to follow along with the workshop exercises.

Setup (R / Bioconductor)

First, let’s load the necessary libraries and data.

# Load necessary libraries
library(scDiagnostics)
library(scran)
library(scater)

We will use the following datasets available in scDiagnostics for the workshop. These datasets are processed versions of the dataset HeOrganAtlasData(tissue = c("Marrow")) in the Bioconductor package scRNAseq.

# Load data
data("reference_data")
data("query_data")

We will also create a dataset where we extract only one cell type (CD4) from the data. We use the SingleR annotation to subset the data. THis will allow us to evaluate how well aligned the CD4 cells are between the reference and query datasets.

# Subset data for CD4
reference_data_subset <- reference_data[, which(reference_data$expert_annotation == "CD4")]
query_data_subset <- query_data[, which(query_data$expert_annotation == "CD4")]

# Selecting highly variable genes
ref_top_genes <- getTopHVGs(reference_data_subset, n = 500)
query_top_genes <- getTopHVGs(query_data_subset, n = 500)

# Intersect the gene symbols to obtain common genes
common_genes <- intersect(ref_top_genes, query_top_genes)
reference_data_subset <- reference_data_subset[common_genes,]
query_data_subset <- query_data_subset[common_genes,]

# Run PCA
reference_data_subset <- runPCA(reference_data_subset)
query_data_subset <- runPCA(query_data_subset)

Time outline

An example for a 45-minute workshop:

Activity	Time
Introduction	10m
Overview of Package Functions	25m
Example Workflow	10m

Learning goals

The goal of this workshop is to provide participants with a thorough understanding of the scDiagnostics package, enabling them to effectively analyze and interpret single-cell RNA sequencing data. Participants will gain hands-on experience in utilizing the package’s diverse functionalities to enhance their single-cell data analysis workflows.

By the end of this workshop, participants will understand the capabilities and applications of scDiagnostics, master visualization techniques for both multiple and single cell types, calculate correlations and distances between cell types, perform and interpret statistical tests, detect and visualize anomalies, evaluate data quality and gene overlap, and utilize additional analytical tools.

Learning objectives

By the end of this workshop, participants will be able to:

Visualize Differences Between Datasets:
- Use various plotting functions to compare query and reference datasets across multiple and single cell types.
Evaluate Gene Overlap:
- Utilize tools for gene overlap measures.
Calculate Correlations and Distances:
- Implement functions to calculate correlations and distances between different cell types.
Perform Statistical Tests:
- Apply statistical methods to compare groups and assess significance.
Detect Anomalies:
- Identify and visualize anomalies in single-cell data.

Brief Overview of scDiagnostics

Visualize Differences Between Datasets

`calculateDiscriminantSpace`

This function projects query single-cell RNA-seq data onto the discriminant space defined by reference data. The reference data is used to identify important variables and compute discriminant vectors, which are then used to project both reference and query data. Similarity between the query and reference projections is assessed using cosine similarity and Mahalanobis distance.

disc_output <- calculateDiscriminantSpace(reference_data = reference_data,
                                          query_data = query_data, 
                                          query_cell_type_col = "SingleR_annotation", 
                                          ref_cell_type_col = "expert_annotation",
                                          calculate_metrics = TRUE,
                                          alpha = 0.01)

plot(disc_output, plot_type = "scatterplot")

plot(disc_output, cell_types = "CD4-CD8", plot_type = "boxplot")

head(disc_output$`CD8-CD4`$query_mahalanobis_dist)
#> [1]  0.9361404 17.9476202  0.2322331  2.4614729  0.9040419  0.2030644
disc_output$`CD8-CD4`$mahalanobis_crit
#> [1] 9.21034

`plotCellTypePCA`

This function projects the query dataset onto the principal component space of the reference dataset and then plots the specified principal components for the specified cell types. It uses the ‘projectPCA’ function to perform the projection and ggplot2 to create the plots.

pc_plot <- plotCellTypePCA(query_data = query_data, 
                           reference_data = reference_data,
                           cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
                           query_cell_type_col = "SingleR_annotation", 
                           ref_cell_type_col = "expert_annotation", 
                           pc_subset = 1:3)
pc_plot

`boxplotPCA`

This function generates a boxplot visualization of principal components (PCs) for different cell types across two datasets (query and reference).

pc_boxplot <- boxplotPCA(query_data = query_data, 
                         reference_data = reference_data,
                         query_cell_type_col = "SingleR_annotation", 
                         ref_cell_type_col = "expert_annotation", 
                         cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
                         pc_subset = 1:5)
pc_boxplot

`plotCellTypeMDS`

This function facilitates the assessment of similarity between reference and query datasets through Multidimensional Scaling (MDS) scatter plots. It allows the visualization of cell types based on a dissimilarity matrix computed from a user-selected gene set.

mds_plot <- plotCellTypeMDS(query_data = query_data, 
                            reference_data = reference_data, 
                            cell_types = c("CD4", "CD8", "B_and_plasma"),
                            query_cell_type_col = "SingleR_annotation", 
                            ref_cell_type_col = "expert_annotation")
mds_plot

`comparePCA`

This function compares the principal components (PCs) obtained from separate PCA on reference and query datasets for a single cell type using either cosine similarity or correlation.

The S3 plot method generates a heatmap to visualize the cosine similarities between principal components from the output of the comparePCA function.

similarity_mat <- comparePCA(query_data = query_data_subset, 
                             reference_data = reference_data_subset, 
                             query_cell_type_col = "expert_annotation", 
                             ref_cell_type_col = "expert_annotation",
                             pc_subset = 1:5)
plot(similarity_mat)

`compareCCA`

This function compares the subspaces spanned by the top principal components (PCs) of the reference and query datasets using canonical correlation analysis (CCA). It calculates the canonical variables, correlations, and a similarity measure for the subspaces.

The S3 plot method generates a visualization of the output from the compareCCA function. The plot shows the cosine similarities of canonical correlation analysis (CCA) coefficients, with point sizes representing the correlations.

similarity_mat <- compareCCA(query_data = query_data_subset, 
                             reference_data = reference_data_subset, 
                             query_cell_type_col = "expert_annotation", 
                             ref_cell_type_col = "expert_annotation",
                             pc_subset = 1:5)
plot(similarity_mat)

`plotWassersteinDistance`

This function generates a density plot of Wasserstein distances under the null hypothesis that the two datasets come from the same distribution. It computes the null distribution of Wasserstein distances and compares it to the distances between reference and query data.

wasserstein_plot <- plotWassersteinDistance(query_data = query_data_subset,
                                            reference_data = reference_data_subset, 
                                            query_cell_type_col = "expert_annotation", 
                                            ref_cell_type_col = "expert_annotation", 
                                            n_resamples = 100,
                                            alpha = 0.05)
wasserstein_plot

Evaluate Gene Overlap

`calculateVarImpOverlap`

This function uses the Random Forest algorithm to calculate the importance of genes in differentiating between cell types within both a reference dataset and a query dataset. The function then compares the top genes identified in both datasets to determine the overlap in their importance scores.

var_imp <- calculateVarImpOverlap(reference_data = reference_data, 
                                    query_data = query_data, 
                                    ref_cell_type_col = "expert_annotation", 
                                    query_cell_type_col = "SingleR_annotation")
var_imp[["var_imp_comparison"]]
#>     B_and_plasma-CD8     B_and_plasma-CD4 B_and_plasma-Myeloid 
#>                 0.78                 0.84                 0.32 
#>              CD8-CD4          CD8-Myeloid          CD4-Myeloid 
#>                 0.74                 0.30                 0.36

Calculate Correlations and Distances

`calculateAveragePairwiseCorrelation`

Computes the average pairwise correlations between specified cell types in single-cell gene expression data.

The S3 plot method takes the output of the calculateAveragePairwiseCorrelation function, which should be a matrix of pairwise correlations, and plots it as a heatmap.

cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data, 
                                                      reference_data = reference_data, 
                                                      query_cell_type_col = "SingleR_annotation", 
                                                      ref_cell_type_col = "expert_annotation", 
                                                      cell_types = c("CD4", "CD8", "B_and_plasma"), 
                                                      pc_subset = 1:5,
                                                      correlation_method = "spearman")
plot(cor_matrix_avg)

Perform Statistical Tests

`calculateNearestNeighborProbabilities`

This function computes the probabilities for each query cell of belonging to either the reference or query dataset for each cell type using nearest neighbor analysis.

The S3 plot method generates a density plot showing the distribution of probabilities for each cell of belonging to either the reference or query dataset for each cell type.

nn_output <- calculateNearestNeighborProbabilities(query_data = query_data, 
                                                   reference_data = reference_data,
                                                   query_cell_type_col = "SingleR_annotation", 
                                                   ref_cell_type_col = "expert_annotation",
                                                   pc_subset = 1:5)
plot(nn_output)

`calculateHotellingPValue`

This function calculates Hotelling’s T-squared statistic for comparing multivariate means between reference and query datasets, projected onto a subset of principal components (PCs). It performs a permutation test to obtain p-values for each cell type specified.

p_values <- calculateHotellingPValue(query_data = query_data, 
                                     reference_data = reference_data, 
                                     query_cell_type_col = "SingleR_annotation", 
                                     ref_cell_type_col = "expert_annotation",
                                     pc_subset = 1:5) 
round(p_values, 5)
#> B_and_plasma          CD8          CD4      Myeloid 
#>        0.006        0.000        0.000        0.000

`regressPC`

This function performs linear regression of a covariate of interest onto one or more principal components, based on the data in a SingleCellExperiment object.

The S3 plot method generates plots to visualize the results of regression analyses performed on principal components (PCs) against cell types.

regress_res <- regressPC(reference_data = reference_data,
                         ref_cell_type_col = "expert_annotation", 
                         cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
                         pc_subset = 1:15)
plot(regress_res, plot_type = "r_squared")

plot(regress_res, plot_type = "p-value", alpha = 0.05)

Detect Anomalies

`detectAnomaly`

This function projects the query data onto the PCA space of the reference data. An isolation forest is then built on the reference data to identify anomalies in the query data based on their PCA projections. If no query dataset is provided by the user, the anomaly scores are computed on the reference data itself. Anomaly scores for the data with all combined cell types are also provided as part of the output.

The S3 plot method extracts the specified PCs from the given anomaly detection object and generates scatter plots for each pair of PCs. It uses ggplot2 to create a faceted plot where each facet represents a pair of PCs. Anomalies are highlighted in red, while normal points are shown in black.

anomaly_output <- detectAnomaly(reference_data = reference_data, 
                                query_data = query_data, 
                                ref_cell_type_col = "expert_annotation", 
                                query_cell_type_col = "SingleR_annotation",
                                pc_subset = 1:5,
                                n_tree = 500,
                                anomaly_treshold = 0.6) 
plot(anomaly_output, 
     cell_type = "CD4", 
     pc_subset = 1:5, 
     data_type = c("query", "reference")[1])

`calculateCellSimilarityPCA`

This function calculates the cosine similarity between cells based on the principal components (PCs) obtained from PCA (Principal Component Analysis) loadings.

The S3 plot method creates a heatmap plot to visualize the cosine similarities between cells and principal components (PCs).

top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores, decreasing = TRUE)[1:6])
cosine_similarities <- calculateCellSimilarityPCA(se_object = reference_data, 
                                                  cell_names = top6_anomalies, 
                                                  pc_subset = 1:25, 
                                                  n_top_vars = 50)
plot(cosine_similarities, pc_subset = 15:25)

Example Workflow

First let us see how well the SingleR annotation performed

table(Expert_Annotation = query_data$expert_annotation, 
      SingleR = query_data$SingleR_annotation)
#>                  SingleR
#> Expert_Annotation B_and_plasma CD4 CD8 Myeloid
#>      B_and_plasma          136   0   0      10
#>      CD4                     0 190   1      13
#>      CD8                     0 134 196      29
#>      Myeloid                 0   0   0      40

We can see that according to expert annotation, a lot of CD8 cells were incorrectly labeled as CD4 cells by SingleR. Let’s identify some of these cells (the top 5 deviating cells for CD4).

First, we compute the distances from each query cell to each reference cell.

distance_data <- calculateCellDistances(query_data = query_data, 
                                        reference_data = reference_data, 
                                        query_cell_type_col = "SingleR_annotation", 
                                        ref_cell_type_col = "expert_annotation",
                                        pc_subset = 1:10)

Now, let’s compute the top six anomalies for CD4 using the anomaly detection function.

cd4_anomalies <- detectAnomaly(reference_data = reference_data, 
                               query_data = query_data, 
                               query_cell_type_col = "SingleR_annotation", 
                               ref_cell_type_col = "expert_annotation",
                               pc_subset = 1:10,
                               n_tree = 500,
                               anomaly_treshold = 0.5)$CD4
cd4_top6_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:6])

Next, we plot the densities of the distances for each of these anomalous cells for CD4 and CD8.

plot(distance_data, ref_cell_type = "CD4", cell_names = cd4_top6_anomalies)

plot(distance_data, ref_cell_type = "CD8", cell_names = cd4_top6_anomalies)

We can also compute some measures of similarity between these cells and all cell types.

overlap_measures <- calculateCellDistancesSimilarity(query_data = query_data, 
                                                     reference_data = reference_data, 
                                                     cell_names = cd4_top6_anomalies,
                                                     query_cell_type_col = "SingleR_annotation", 
                                                     ref_cell_type_col = "expert_annotation",
                                                     pc_subset = 1:10) 
overlap_measures
#> $bhattacharyya_coef
#>                 Cell B_and_plasma       CD8       CD4    Myeloid
#> 1 AAACGGGCACATGACT-1    0.1925182 0.7128711 0.2116577 0.07436962
#> 2 ATTGGTGGTAACGCGA-1    0.4236378 0.3799852 0.3830778 0.27169272
#> 3 ATTGGACCATACTACG-1    0.2097601 0.8765028 0.3656771 0.09452698
#> 4 CATTCGCTCGTCACGG-1    0.1363147 0.7006176 0.2448559 0.05778528
#> 5 CACCTTGGTTATGTGC-1    0.2452612 0.9885897 0.4514740 0.04147370
#> 6 TTGCGTCTCCACTCCA-1    0.2741303 0.9457587 0.5678963 0.12581074
#> 
#> $hellinger_dist
#>                 Cell B_and_plasma       CD8       CD4   Myeloid
#> 1 AAACGGGCACATGACT-1    0.8985999 0.5358441 0.8878864 0.9620969
#> 2 ATTGGTGGTAACGCGA-1    0.7591853 0.7874102 0.7854439 0.8534092
#> 3 ATTGGACCATACTACG-1    0.8889544 0.3514217 0.7964439 0.9515635
#> 4 CATTCGCTCGTCACGG-1    0.9293467 0.5471585 0.8689903 0.9706775
#> 5 CACCTTGGTTATGTGC-1    0.8687570 0.1068190 0.7406254 0.9790436
#> 6 TTGCGTCTCCACTCCA-1    0.8519799 0.2328977 0.6573459 0.9349809