Spatial data from Slide-seq#

In this tutorial, we show how to use the optimized version of the Hotspot algorithm implemented inside Harreman by using one of the datasets used in the original Hotspot tutorial: the spatial data from the Slide-seq paper (Rodriques et al., Science, 2019). The objective of this tutorial is to show that the PyTorch version of the Hotspot algorithm that has been implemented inside Harreman generates the same results but much faster. For this, the same parameters as in the original Hotspot tutorial will be used.

In this tutorial we will cover:

  1. Computing gene autocorrelation to identify spatial genes.

  2. Computing pairwise correlations between spatial genes to identify modules.

  3. Plotting modules, correlations, and module scores.

import harreman
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import mplscience
from scipy.stats import pearsonr, spearmanr, gaussian_kde
import random
from sklearn import linear_model
import warnings
warnings.filterwarnings("ignore")

Defining the parametric vs non-parametric test function#

To compare the results obtained from both the parametric and non-parametric tests, the function below has been defined.

def corr_plot(x, y, mean_expr, max_num=10000, outlier=0.01, line_on=True, method='spearman',
              legend_on=True, size=30, dot_color=None, outlier_color="r",
              alpha=0.8, color_rate=10, corr_on=None, colorbar=True, pmax=95, pmin=5):

    if method == 'pearson':
        score = pearsonr(x, y)
    if method == 'spearman':
        score = spearmanr(x, y)
    np.random.seed(0)
    if len(x) > max_num:
        idx = np.random.permutation(len(x))[:max_num]
        x, y = x[idx], y[idx]
        mean_expr = mean_expr[idx]
    outlier = int(len(x) * outlier)

    xy = np.vstack([x, y])
    z = gaussian_kde(xy)(xy)
    idx = z.argsort()
    idx1, idx2 = idx[outlier:], idx[:outlier]

    if dot_color is None:
        c_score = mean_expr
    else:
        c_score = dot_color

    plt.set_cmap("Blues")
    vmax = np.percentile(c_score, pmax)
    vmin = np.percentile(c_score, pmin)
    plt.scatter(x, y, c=c_score, edgecolor=None, s=size, alpha=alpha, vmin=vmin, vmax=vmax)
    if colorbar:
        plt.colorbar()

    if line_on:
        clf = linear_model.LinearRegression()
        clf.fit(x.reshape(-1, 1), y)
        xx = np.linspace(x.min(), x.max(), 1000).reshape(-1, 1)
        yy = clf.predict(xx)
        plt.plot(xx, yy, "k--", label="R=%.3f" % score[0])

    if legend_on or corr_on:
        plt.legend(loc="best", fancybox=True, ncol=1)

Loading the dataset#

We load the dataset, store the total counts and log-normalize the raw counts.

url = "https://github.com/YosefLab/scVI-data/blob/master/rodriques_slideseq.h5ad?raw=true"
adata = sc.read("rodriques_slideseq.h5ad", backup_url=url)
adata.obs["total_counts"] = np.asarray(adata.X.sum(1)).ravel()

adata.layers["csc_counts"] = adata.X.tocsc()

# renormalize the data for expression viz on plots
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

adata.obsm['spatial'] = np.array(adata.obsm['spatial'])

Computing the proximity graph#

Firstly, we compute the neighborhood graph using the compute_knn_graph function. For this, we need to specify the latent (or observed) space we are using to compute our cell metric with the compute_neighbors_on_key parameter. Further, only one of compute_neighbors_on_key or distances_obsp_key is needed in order to run the function. Similarly, either the neighborhood_radius or the n_neighbors parameter needs to be used (the former needs to contain information in micrometers).

Here, we set weighted_graph=False to just use binary, 0-1 weights and n_neighbors=300 to create a local neighborhood in the tissue space with the 300 nearest neighbors for every spot. Larger neighborhood sizes can result in more robust detection of correlations and autocorrelations at a cost of missing more fine-grained, smaller-scale, spatial patterns.

harreman.tl.compute_knn_graph(adata, 
                           compute_neighbors_on_key="spatial", 
                           n_neighbors=300,
                           weighted_graph=False,
                           verbose=True)
Computing the neighborhood graph...
Computing the weights...
Finished computing the KNN graph in 4.708 seconds

Computing gene autocorrelation to determine genes with spatial variation#

Once the neighborhood graph is computed, local autocorrelation needs to be assessed using the compute_local_autocorrelation function. For this, we need to specify the gene counts matrix (with the layer_key parameter), which background model to use (with the model parameter), the per-cell scaling factor (optional, with the umi_counts_obs_key parameter), and whether we want to focus on metabolic genes or not (with the use_metabolic_genes parameter; False by default). In case we focus on metabolic enzymes, the species parameter is used to specify the organism dataset corresponds to. Further, the empirical test can also be run by permutation testing to later compare the results with the theoretical one (with the permutation_test parameter).

Here, we are focusing on all genes and we are using the Bernoulli distribution to model our count data.

harreman.hs.compute_local_autocorrelation(adata, layer_key="csc_counts", model='bernoulli', umi_counts_obs_key="total_counts", permutation_test=True)

We visualize the scatter plot to compare the p-values from the theoretical (x-axis) and empirical (y-axis) tests.

x = -np.log10(adata.uns['gene_autocorrelation_results'].Z_Pval.values)
y = -np.log10(adata.uns['gene_autocorrelation_results'].Perm_Pval.values)

genes = adata.uns['gene_autocorrelation_results'].index

X = adata[:,genes].X.tocsc()
sums = X.sum(axis=0).A1
counts = (X > 0).sum(axis=0).A1
mean_expr = sums / counts
 
x = np.where(x>3, 3, x)
y = np.where(np.isinf(y), 3, y)
corr_plot(x, y, mean_expr, method='spearman', colorbar=True, pmin=10, pmax=90)
plt.axhline(-np.log10(0.05), color='r', linestyle='--')
plt.axvline(-np.log10(0.05), color='r', linestyle='--')
plt.xlabel('-log10(p) (z-test)')
plt.ylabel('-log10(p) (permutation)')
plt.title('Autocorrelation p-value (Permutation vs Z-Score)')
plt.show()
../../_images/fad6580f4348fc87697e12402fc32d09ccfb44416f8c2fb3285905a21fb7fdd0.png

We only use those genes with a significant autocorrelation (FDR < 0.05) for the subsequent analysis.

hs_results = adata.uns['gene_autocorrelation_results']
hs_genes = hs_results.index[hs_results.Z_FDR < 0.05]

Computing local correlation to group genes into spatial modules#

To get a better idea of what spatial patterns exist, it is helpful to group the genes into modules. This is done by using the concept of local correlations - that is, correlations that are computed between genes between cells in the same neighborhood.

Here, we avoid running the calculation for all Genes x Genes pairs and instead only run this on genes that have significant spatial autocorrelation to begin with.

The compute_local_correlation function returns a Genes x Genes matrix of Z-scores for the significance of the correlation between genes. This object is retained in the AnnData object and is used in the subsequent steps.

Here, the empirical test can also be run by permutation testing to later compare the results with the theoretical one (with the permutation_test parameter).

harreman.hs.compute_local_correlation(adata, genes=hs_genes, permutation_test=True, verbose=True)
Computing pair-wise local correlation on 876 features...
Pair-wise local correlation results are stored in adata.uns with the following keys: ['lc_perm_pvals', 'lc_perm_pvals_sym', 'lcs', 'lc_zs', 'lc_z_pvals', 'lc_z_FDR']
Finished computing pair-wise local correlation in 630.758 seconds

We visualize the scatter plot to compare the p-values from the theoretical (x-axis) and empirical (y-axis) tests.

x = -np.log10(adata.uns['lc_z_pvals'].values).flatten()
y = -np.log10(adata.uns['lc_perm_pvals'].values).flatten()

genes = adata.uns['lc_z_pvals'].index

X = adata[:,genes].X.tocsc()
sums = X.sum(axis=0).A1
counts = (X > 0).sum(axis=0).A1
mean_expr = sums / counts

mean_expr_matrix = (mean_expr[:, np.newaxis] + mean_expr[np.newaxis, :]) / 2
mean_expr_values = mean_expr_matrix.flatten()
 
x = np.where(x>3, 3, x)
y = np.where(np.isinf(y), 3, y)
corr_plot(x, y, mean_expr_values, method='spearman', colorbar=False, pmin=10, pmax=90)
plt.axhline(-np.log10(0.05), color='r', linestyle='--')
plt.axvline(-np.log10(0.05), color='r', linestyle='--')
plt.xlabel('-log10(p) (z-test)')
plt.ylabel('-log10(p) (permutation)')
plt.title('Pairwise correlation p-value (Permutation vs Z-Score)')
plt.show()
../../_images/1764b6da1a1aab7e30c567b0f45b3c82cc3b9fa9245fc07a27b32a6578374201.png

Now that pairwise local correlations are calculated, we can group genes into modules.

To do this, the create_modules function is used, which performs agglomerative clustering with two caveats:

  1. If the FDR-adjusted p-value of the correlation between two branches exceeds fdr_threshold, then the branches are not merged.

  2. If two branches are to be merged and they both have at least min_gene_threshold genes, then the branches are not merged. Further genes that would join to the resulting merged module (and are therefore ambiguous) either remain unassigned (if core_only=True) or are assigned to the module with the smaller average correlations between genes, i.e. the least-dense module (if core_only=False). Unassigned genes are indicated with a module number of -1.

harreman.hs.create_modules(adata, min_gene_threshold=20, core_only=False, fdr_threshold=0.05)

Plotting module correlations#

We then can visualize the created modules.

harreman.pl.local_correlation_plot(adata)
../../_images/f3d1e59fdfd6ae56838a3c3344b6558c5c10f563c44fbb673f9dbd60b6b20502.png

Calculating module scores and visualizing them#

To characterize the general behavior of a module, module scores are computed.

harreman.hs.calculate_module_scores(adata)
modules = adata.obsm['module_scores'].columns
adata.obs[modules] = adata.obsm['module_scores']

Finally, we visualize the module scores on top of the UMAP for comparison.

with mplscience.style_context():
    sc.pl.spatial(adata, color=modules, frameon=False, vmin="p1", vmax="p99", spot_size=30, ncols=3, cmap='viridis')
../../_images/7524e781d0bc12444e730eddeeb488be4b80e87836136f6b9834c392ceec5e8f.png