Test statistic 2: Are genes a and b spatially co-localized (or interacting with each other)?#
Once metabolic genes with a relevant spatial gene expression pattern have been selected, pairwise correlation can be computed to eventually group genes into modules and define tissue zones. Here we made use of the second formula of the Hotspot algorithm (DeTomaso and Yosef, Cell systems, 2021), which is defined as follows:
where a and b are two different genes expressed by cells i and j, respectively, and X refers to the gene expression matrix of dimension genes x cells.
The weight \(w_{ij}\) represents communication strength between neighboring cells, and it is defined in the same way as in Test statistic 1.
This statistic is also relevant to address the question of which interactions happen in the defined tissue regions. To infer metabolite exchange (or ligand-receptor interaction) events present in the tissue without adding the cell type constraint, a cell-type-agnostic approach has also been implemented. This analysis, despite not making use of any novel formulas, is based on gene pairs defined in HarremanDB and/or CellChatDB (Jin et al., Nature protocols, 2024), therefore restricting the communication analysis to already defined gene pairs that are known to interact with each other. As gene pairs can be made up of either different or the same genes, the formula needs to be adapted to each case. For the former case, the pairwise correlation formula above is used, while for the latter, that is, if \(a = b\), the spatial autocorrelation formula defined in Test statistic 1 is used.
For significance testing, both an empirical and a theoretical test have been implemented. The latter corresponds to the already existing theoretical test introduced in the Hotspot method (DeTomaso and Yosef, Cell systems, 2021). In this case, instead of considering a null model that assumes the expression values of genes a and b are independent, which significantly underestimates the variance of \(H_{ab}\) if at least one gene has high autocorrelation (which is required to select these genes), a conditionally independent null hypothesis is tested. Here, we test how extreme \(H_{ab}\) is compared with independent values of gene b given the observed value of gene a, that is, \(P(H_{ab}|a)\), and vice versa, that is, \(P(H_{ab}|b)\). Eventually, we conservatively retain the least-significant result. After going through equations defined in Test statistic 1 adapted for \(H_{ab} = \sum_{i}^{}\sum_{j}^{} w_{ij} \left(X_{ai}X_{bj} + X_{bi}X_{aj}\right)\), and conditioning on gene a, the second moment of H is expressed as follows:
To assess communication significance, the statistic is converted into a Z-score using the equation below, and a significance value is obtained for every gene by comparing the Z values to the normal distribution:
P-values are then adjusted using the Benjamini-Hochberg procedure.
In the empirical test setting, cell IDs in the counts matrix corresponding to gene a (or b) are shuffled M times (\(M=1,000\) by default). Then, the correlation values for each gene according to \(H_{ab} = \sum_{i}^{}\sum_{j}^{} w_{ij} \left(X_{ai}X_{bj} + X_{bi}X_{aj}\right)\) are computed in each iteration, and \(p-value = \frac{x+1}{M+1}\) is used to calculate the p-value, where x represents the number of permuted H values that are higher than the observed one and M is the total number of permutations. Finally, similar to the parametric test, the most conservative p-values are considered. These p-values are then adjusted using the Benjamini-Hochberg method.