Visualizing spatial patterns

In the spatial analysis tutorial, we explored the polarity and colocalization metrics that allow us to identify protein-specific patterns of spatial organization across cell populations. Once a cell population has been defined with a large effect size of a spatial metric of interest, we can select components from that population and draw 2D or 3D layouts to show the spatial distribution of protein markers.

The polarity and colocalization metrics are summary metrics that indicate the presence of a general spatial pattern in a component, for example, whether one marker is clustered (polarity) or if two markers are colocalized (colocalization). However, these metrics do not tell us where in the component graph the polarity or colocalization event appear. For this, we can use a local spatial metric that can help us to score nodes in the graph based on how much the local marker abundance deviates from the expectation.

In this tutorial, we will learn how to use the Local G spatial metric to identify regions of MPX component graphs where a spatial pattern in marker abundance is present. In the 2D and 3D visualization tutorials, we could already see some patterns by simply coloring nodes based on marker abundance. However, for low abundant markers, such patterns can be quite challenging to pick up since the marker counts per node are sparse. The Local G spatial metric summarizes the abundance over an extended local neighborhood around each node which addresses this problem.

Setup

import pandas as pd
import numpy as np

from pathlib import Path
from pixelator.mpx import read, simple_aggregate

import plotly.graph_objects as go

from scipy.stats import norm
import scipy as sp

import seaborn as sns
import matplotlib.pyplot as plt

Setup

DATA_DIR = Path("<path to the directory to save datasets to>")

Load data

We will continue with the stimulated T cell data that we analyzed in the previous tutorial (Karlsson et al, Nature Methods 2024). Here, we read the data files and apply the same filtering steps as in Spatial Analysis.

baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/next/uropod-t-cells-v1.0-immunology-I"

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Uropod_control.layout.dataset.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Uropod_CD54_fixed_RANTES_stimulated.layout.dataset.pxl"

data_files = [
    DATA_DIR / "Uropod_control.layout.dataset.pxl",
    DATA_DIR / "Uropod_CD54_fixed_RANTES_stimulated.layout.dataset.pxl",
]

pg_data_combined = simple_aggregate(
    ["Control", "Stimulated"], [read(path) for path in data_files]
)

components_to_keep = pg_data_combined.adata[
    (pg_data_combined.adata.obs["molecules"] >= 8000)
    & (pg_data_combined.adata.obs["tau_type"] == "normal")
].obs.index

adata = pg_data_combined.adata
adata = adata[adata.obs.index.isin(components_to_keep), :]

Now we can select components with high colocalization for a specific marker pair to get some good example components to work with. First, we will filter our colocalization scores to keep only CD162/CD50 and then we will add the abundance data for CD162, CD37 and CD50 to the colocalization table.

colocalization_scores = pg_data_combined.colocalization

colocalization_scores = colocalization_scores[
    colocalization_scores["component"].isin(components_to_keep)
]
# Create a new columns with marker_1/marker_2 using iloc
colocalization_scores['contrast'] = colocalization_scores['marker_1'] + '/' + colocalization_scores['marker_2']
colocalization_scores = colocalization_scores[colocalization_scores['contrast'] == 'CD162/CD50']

# Fetch data from pg_data_combined and convert to a DataFrame
marker_counts = adata.X[
    :, adata.var.index.isin(["CD162", "CD37", "CD50"])
]
marker_counts = pd.DataFrame(
    marker_counts, 
    columns=["CD162", "CD37", "CD50"]
)
marker_counts["component"] = adata.obs.index

# Merge the colocalization scores and marker count data dataframes on 'component'
colocalization_scores = colocalization_scores.merge(marker_counts, on='component', how='left')

Next, we will select components with high colocalization scores for CD162/CD50. In the histogram below, we see the distribution of Pearson Z scores for the CD162/CD50 colocalization. Since we want to narrow down the selection of components, we set a threshold for the Pearson Z scores. Here, we use a threshold of 35 to keep the components with the highest colocalization scores. Note that the threshold is arbitrary and should be adjusted based on the data.

# Set Pearson Z threshold
pearson_z_threshold = 35

sns.histplot(colocalization_scores['pearson_z'], bins=50)
plt.axvline(pearson_z_threshold, linestyle="--")
plt.title("Pearson Z scores [CD162/CD50]")

Text(0.5, 1.0, 'Pearson Z scores [CD162/CD50]')

Then we apply our threshold to get a final selection of 13 components.

plot_components = colocalization_scores[
    (colocalization_scores['pearson_z'] > pearson_z_threshold) & (colocalization_scores['sample'] == 'Stimulated')
].sort_values(by='CD162', ascending=False)

plot_components = plot_components[["component", "pearson_z", "CD162", "CD50", "CD37"]]
plot_components

	component	pearson_z	CD162	CD50	CD37
434	61cb84c709f41e7b_Stimulated	41.687521	947	1222	631
454	59b284adcc91ec91_Stimulated	44.136111	789	864	448
467	23606c9599265d9a_Stimulated	40.091311	690	707	452
510	18278b69dc73ad16_Stimulated	41.707346	486	736	249
511	437d692c40d6d601_Stimulated	53.791253	450	493	286
569	50cd2ae85ea1c0e1_Stimulated	44.591328	414	651	287
441	19ab7abc50018ac8_Stimulated	40.458924	303	467	116
426	898664166bbd71ae_Stimulated	35.591388	301	626	317
555	d33f7aa55f7874ff_Stimulated	36.517807	220	238	112
474	7778298975953f4f_Stimulated	36.881728	216	420	138
681	703a192bee0b36f2_Stimulated	37.979454	197	183	178
564	f4d9b2d31012177a_Stimulated	38.389434	178	161	119
461	bd77e6c5f6e229e5_Stimulated	36.717635	153	289	108
596	60523a26c6f19593_Stimulated	35.865508	136	111	38
538	8a91a136efcae912_Stimulated	35.421162	68	118	69

Local G scores

Local G is a Z score that measures how much the marker counts in a local neighborhood deviates from the global mean. Neighborhoods are formed around each node in the graph, where the size of the neighborhood is determined by the k parameter. The larger k is, the larger the neighborhood around each node becomes, which in turn increases the spatial scale at which the score is calculated.

For instance, if we set k = 1, the local neighborhood around each node will be the direct neighbors (1 step away). If we set k = 2, the neighborhood will include the direct neighbors and the neighbors that are 2 steps away.

The effect of using a higher k is that the spatial scale increases; we are leveraging information from node that are further away in the graph, and the local G score will be more smoothed out.

We can compute the Local G metric for a single component graph with the local_g method. Here, we select one of the components from our plot_components table.

example_graph = pg_data_combined.graph(
    "60523a26c6f19593_Stimulated", simplify=True, use_full_bipartite=True
)

gi_mat = example_graph.local_g(k=1, normalize_counts=False, use_weights=True)
gi_mat

markers	ACTB	B2M	CD102	CD11a	CD11b	CD11c	CD127	CD137	CD14	CD150	...	CD82	CD84	CD86	CD9	HLA-ABC	HLA-DR	TCRVb5	mIgG1	mIgG2a	mIgG2b
node
476	-0.112989	4.244440	1.324578	-0.255276	-0.092230	-0.196027	-0.184765	-0.092230	-0.087492	-0.138440	...	0.275866	3.560649	2.291411	-0.092230	5.033729	4.049060	3.191708	-0.184765	-0.199366	-0.159923
853	-0.128300	-0.482082	-0.360847	-0.031168	-0.104727	-0.222588	-0.209801	-0.104727	-0.099348	-0.157199	...	-0.212409	-0.262182	-0.842142	-0.104727	-0.655899	-0.132802	0.125599	-0.209801	-0.226381	-0.181593
2542	-0.068183	-0.057985	-0.259402	-0.154046	-0.055656	-0.118292	-0.111496	-0.055656	-0.052797	-0.083541	...	0.865988	0.952489	0.168033	-0.055656	-0.120373	0.762917	-0.158028	-0.111496	-0.120307	-0.096505
2780	-0.141421	-1.511105	-0.450477	-0.141530	-0.115438	-0.245354	-0.231258	-0.115438	-0.109509	-0.173277	...	-0.636528	-0.414664	-1.254543	-0.115438	-1.329761	-1.267388	-0.327771	-0.231258	-0.249534	-0.200165
641	-0.071459	1.259288	1.080777	-0.161446	-0.058330	-0.123974	-0.116852	-0.058330	-0.055333	-0.087555	...	-0.321630	0.801021	0.303520	-0.058330	0.726877	0.824362	-0.165619	-0.116852	-0.126086	-0.101141
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1167	-0.040589	3.042094	-0.154419	-0.091702	-0.033131	-0.070418	-0.066372	-0.033131	-0.031430	-0.049731	...	4.445383	-0.178907	0.845398	-0.033131	3.568833	3.150288	-0.094072	-0.066372	-0.071618	-0.057448
36	-0.040589	0.534851	-0.154419	-0.091702	-0.033131	-0.070418	-0.066372	-0.033131	-0.031430	-0.049731	...	-0.182687	-0.178907	0.845398	-0.033131	2.952603	1.933318	-0.094072	-0.066372	-0.071618	-0.057448
623	-0.040589	-0.468046	-0.154419	-0.091702	-0.033131	-0.070418	-0.066372	-0.033131	-0.031430	-0.049731	...	-0.182687	-0.178907	-0.466948	-0.033131	0.857419	-0.094965	-0.094072	-0.066372	-0.071618	-0.057448
1024	-0.040589	1.537748	-0.154419	-0.091702	-0.033131	-0.070418	-0.066372	-0.033131	-0.031430	-0.049731	...	-0.182687	-0.178907	0.845398	-0.033131	2.459618	0.716349	-0.094072	-0.066372	-0.071618	-0.057448
1176	-0.040589	2.289921	-0.154419	-0.091702	-0.033131	-0.070418	-0.066372	-0.033131	-0.031430	-0.049731	...	-0.182687	4.765435	2.157744	-0.033131	2.706110	1.933318	-0.094072	-0.066372	-0.071618	-0.057448

3649 rows × 79 columns

note

There are in fact to definitions of local G: $G_i$ and $G_i^*$ . The main difference between these two variants is how they define a neighborhood around a center node $i$ ; $G_i$ excludes $i$ from the neighborhood while $G_i^*$ includes it. You can see the full definitions of the two statistics in the algorithms section for Local G. For visualization purposes, the choice between $G_i$ and $G_i^*$ is not critical, but it can affect the interpretation of the scores. You can select what method to use by setting method to one of “gi” or “gstari” in the local_G function.

In the plot below, we show the distribution (histogram) of local G scores for CD162 with k = 1 for our selected component. We see that a large fraction of nodes have G scores around 0, while a smaller fraction of the nodes (right tail) have very high scores. The latter corresponds to nodes that are located in regions of the graph where the abundance of CD162 is higher than expected.

sns.histplot(gi_mat["CD162"], bins=50)
plt.axvline(1, linestyle="--")
plt.title("Local G scores [CD162]")
plt.xlabel("Local G")
plt.ylabel("Number of nodes")

Text(0, 0.5, 'Number of nodes')

If we want to visualize the spatial distribution of the local G scores, we can project them back onto a 3D layout as we did in the 3D cell visualization tutorial.

First, let’s compute 3D weighted pmds layouts for our component graphs:

example_graph_layout_data = example_graph.layout_coordinates("wpmds_3d", 
only_keep_a_pixels=False,
get_node_marker_matrix=True)

In the plot below, we project the local G scores for CD162 onto the wpmds 3D layout. The color scale is centered at 0, since we are visualizing Z scores.

vals = gi_mat["CD162"]
# Note that we need to reindex the values to match the layout data
# to ensure the nodes have the same ordering.
vals = vals.reindex(example_graph_layout_data.index)
max_abs_val = max(abs(vals))

fig = go.Figure(
    data=[
        go.Scatter3d(
            x=example_graph_layout_data["x"],
            y=example_graph_layout_data["y"],
            z=example_graph_layout_data["z"],
            mode="markers",
            marker=dict(size=3, opacity=0.4, colorscale="RdBu", cmin=-max_abs_val, cmax=max_abs_val, reversescale=True),
            marker_color=vals
        ),
    ]
)
fig.update_layout(title="Local G scores for CD162 (k = 1)")
fig.show()

Note that the Local G scores can have extreme outliers which tend to skew the color scale. To mitigate this, we can trim the values to a certain quantile range. Below, we adjust values higher than the 99th percentile.

vals[vals > vals.quantile(0.99)]=vals.quantile(0.99)

max_abs_val = max(abs(vals))

fig = go.Figure(
    data=[
        go.Scatter3d(
            x=example_graph_layout_data["x"],
            y=example_graph_layout_data["y"],
            z=example_graph_layout_data["z"],
            mode="markers",
            marker=dict(size=3, opacity=0.4, colorscale="RdBu", cmin=-max_abs_val, cmax=max_abs_val, reversescale=True),
            marker_color=vals
        ),
    ]
)
fig.update_layout(title="Local G scores for CD162 (k = 1)")
fig.show()

If we generate the same plot using raw counts, we can see that the spatial pattern is still quite clear, but less concentrated in the uropod.

fig = go.Figure(
    data=[
        go.Scatter3d(
            x=example_graph_layout_data["x"],
            y=example_graph_layout_data["y"],
            z=example_graph_layout_data["z"],
            mode="markers",
            marker=dict(size=3, opacity=0.4, colorscale="OrRd"),
            marker_color=np.log1p(example_graph_layout_data["CD162"])
        ),
    ]
)
fig.update_layout(title="Log1p counts for CD162")
fig.show()

P values

With the local_G function, we can also compute p values for each node and marker. This allows us to identify “hot” and “cold” zones, i.e. nodes where marker abundance is significantly different from the global mean. The sign of the Z score can be used to determine if the marker is over- (“hot”) or under-expressed (“cold”) in the local neighborhood.

With an adjusted p value threshold of 0.05, we find 149 nodes in the “hot” zone and the rest are not significant (do not deviate from the global mean).

# Two-sided test
p_values = sp.stats.norm.sf(abs(gi_mat["CD162"]))*2
p_values_adj = sp.stats.false_discovery_control(p_values)

df = pd.DataFrame({
    "adj_p_val": p_values_adj,
    "Gi": gi_mat["CD162"]
})
df["zone"] = df.apply(
    lambda x: "Hot" if x["Gi"] > 0 and x["adj_p_val"] < 0.05 else "Cold" if x["Gi"] < 0 and x["adj_p_val"] < 0.05 else "Not significant",
    axis=1
)
df["node_color"] = df["zone"].apply(
    lambda x: "red" if x == "Hot" else "blue" if x == "Cold" else "grey"
)
df["zone"].value_counts()

zone
Not significant    3500
Hot                 149
Name: count, dtype: int64

If we want to visualize the “hot” and “cold” zones, we can color the nodes accordingly. Below, we color the nodes red if they are in the “hot” zone, blue if they are in the “cold” zone, and grey otherwise.

fig = go.Figure(
    data=[
        go.Scatter3d(
            x=example_graph_layout_data["x"],
            y=example_graph_layout_data["y"],
            z=example_graph_layout_data["z"],
            mode="markers",
            marker=dict(size=3, opacity=0.4, colorscale="OrRd"),
            marker_color=df["node_color"]
        ),
    ]
)
fig.update_layout(title="Hot zone CD162 (k = 1)")
fig.show()

Note that the p values are context-dependent and should be interpreted with caution. In the computation of local G, the neighborhood size is adjusted with the k parameter and when computing the marker abundance within each neighborhood, each neighboring node contributes with a certain weight that depends on its connectivity. This weighting scheme is activated by default, but can be skipped by setting use_weights = False. These parameters (k, use_weights, …) can be adjusted to suit the specific needs of the analysis but will also affect the p values.

In the plot below, we demonstrate the “smoothing” effect that we get by increasing k to a neighborhood of 4. Now, the spatial pattern is clearer and the “hot” zone in the Uropod is more pronounced.

gi_mat = example_graph.local_g(k=4, normalize_counts=False, use_weights=True)

vals = gi_mat["CD162"]
# Note that we need to reindex the values to match the layout data
# to ensure the nodes have the same ordering.
vals = vals.reindex(example_graph_layout_data.index)
vals[vals > vals.quantile(0.99)]=vals.quantile(0.99)

max_abs_val = max(abs(vals))

fig = go.Figure(
    data=[
        go.Scatter3d(
            x=example_graph_layout_data["x"],
            y=example_graph_layout_data["y"],
            z=example_graph_layout_data["z"],
            mode="markers",
            marker=dict(size=3, opacity=0.4, colorscale="RdBu", 
                        cmin=-max_abs_val, cmax=max_abs_val, 
                        reversescale=True),
            marker_color=vals
        ),
    ]
)
fig.update_layout(title="Local G scores for CD162 (k = 1)")
fig.show()

If we apply the same grouping of nodes based on the adjusted p values, we now find many more nodes in the “hot” zone. Since we are including more nodes in each local neighborhood (k = 4) when computing the local G scores, there’s a higher chance that neighborhoods will include nodes with counts for CD162, thus placing more nodes in the “hot” zone.

If we group nodes the same way as before, we now find 354 nodes in the “hot” zone.

# Two-sided test
p_values = sp.stats.norm.sf(abs(gi_mat["CD162"]))*2
p_values_adj = sp.stats.false_discovery_control(p_values)

df = pd.DataFrame({
    "adj_p_val": p_values_adj,
    "Gi": gi_mat["CD162"]
})
df["zone"] = df.apply(
    lambda x: "Hot" if x["Gi"] > 0 and x["adj_p_val"] < 0.05 else "Cold" if x["Gi"] < 0 and x["adj_p_val"] < 0.05 else "Not significant",
    axis=1
)
df["node_color"] = df["zone"].apply(
    lambda x: "red" if x == "Hot" else "blue" if x == "Cold" else "grey"
)
df["zone"].value_counts()

zone
Not significant    3369
Hot                 278
Cold                  2
Name: count, dtype: int64

fig = go.Figure(
    data=[
        go.Scatter3d(
            x=example_graph_layout_data["x"],
            y=example_graph_layout_data["y"],
            z=example_graph_layout_data["z"],
            mode="markers",
            marker=dict(size=3, opacity=0.4),
            marker_color=df["node_color"]
        ),
    ]
)
fig.update_layout(title="Hot zone CD162 (k = 4)")
fig.show()

Caveats

The Local G spatial metric is useful for visualizing spatial patterns in the data, but isolated events do not necessarily indicate biological significance. Just because a marker displays a certain spatial pattern in a single MPX cell component, this might not generalize across larger groups of MPX cell components.

Spatial patterns should be investigated across larger groups of components, for instance, by using differential polarity analysis (see our spatial analysis tutorial). Similarly, in microscopy data, observing a spatial pattern in a single cell is insufficient to establish biological significance. The pattern must be consistent across many cells to be meaningful. Once a clear pattern has been established, the local G score becomes a valuable tool for visualization and deeper analysis of MPX components displaying this spatial pattern.

Setup​

Setup​

Load data​

Local G scores​

P values​

Caveats​

Setup

Setup

Load data

Local G scores

P values

Caveats