Version: 0.19.x

Data Handling

This tutorial describes how to start working with the MPX output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.

After completing this tutorial, you will be able to:

  • Load PXL files in Python or R and access the multi-modal data contents including protein counts, metadata and spatial scores.
  • Understand and work with spatial polarity (protein clustering patterns).
  • Interpret and work with colocalization scores (protein co-clustering).
  • Directly access the spatial graph structure through the edge list.
  • Aggregate multiple PXL files into an integrated data object with sample identities.
  • Save aggregated data objects for later reuse in analyses.

Setup

First, we load the packages and functions used in this tutorial.

from pathlib import Path

from pixelator import read

DATA_DIR = Path("<path to the directory to save datasets to>")

Loading data

We begin by locating the output from the pixelator pipeline. The input for downstream analysis is the PXL file contained in the analysis directory.

baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/0.19.x/technote-v1-vs-v2-immunology-II"

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
from pathlib import Path

pg_data = read(DATA_DIR / "Sample05_V2_PBMC_r1.layout.dataset.pxl")

Component metadata

In addition to the count data, the object contains metadata for each MPX graph component, where each component corresponds to a cell. This table is useful for quality control, with information such as how many protein molecules were detected (molecules), the sequencing depth (reads), and the graph connectivity of each component (for example mean_b_pixels_per_a_pixel).

pg_data.adata.obs
pixels a_pixels b_pixels antibodies molecules reads mean_reads_per_molecule median_reads_per_molecule mean_b_pixels_per_a_pixel median_b_pixels_per_a_pixel mean_a_pixels_per_b_pixel median_a_pixels_per_b_pixel a_pixel_b_pixel_ratio mean_molecules_per_a_pixel median_molecules_per_a_pixel is_potential_doublet n_edges_to_split_doublet leiden tau_type tau
component
002f13bdbf78b499 2964 1931 1033 77 21000 65968 3.141333 3.0 3.140342 2.0 5.870281 3.0 1.869313 10.875194 6.0 False 0.0 12 normal 0.932284
00387aa4175efe8d 4936 3436 1500 78 29584 97357 3.290867 3.0 2.616997 2.0 5.994667 3.0 2.290667 8.610012 4.0 False 0.0 0 normal 0.942541
005dd44dcce1635a 5145 3326 1819 77 34845 112356 3.224451 3.0 2.837943 2.0 5.189115 3.0 1.828477 10.476548 6.0 False 0.0 2 normal 0.982450
00ccd2f98f49bd73 2826 1905 921 77 18128 56203 3.100342 3.0 3.006299 2.0 6.218241 4.0 2.068404 9.516010 5.0 False 0.0 1 normal 0.968978
0105b317c16664a6 2298 1481 817 77 13043 40489 3.104270 3.0 2.974342 2.0 5.391677 3.0 1.812729 8.806887 5.0 False 0.0 12 normal 0.923715
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
fe9dfa1d7899a406 4627 3175 1452 78 38388 123015 3.204517 3.0 2.891969 2.0 6.323691 4.0 2.186639 12.090709 7.0 False 0.0 12 normal 0.929266
feeaa3dfeb603c36 2353 1552 801 77 13130 39950 3.042650 3.0 2.684923 2.0 5.202247 3.0 1.937578 8.460052 5.0 False 0.0 12 normal 0.931461
ff05dde6ba92a240 4746 3086 1660 77 22972 71465 3.110961 3.0 2.897278 2.0 5.386145 3.0 1.859036 7.443940 4.0 False 0.0 0 normal 0.964793
ff32ebf3d43d0ad2 5026 3323 1703 78 25411 77659 3.056117 3.0 2.738188 2.0 5.342924 3.0 1.951262 7.647006 5.0 False 0.0 2 normal 0.981417
ffe179c90dda64a2 5045 3510 1535 78 38504 125633 3.262856 3.0 2.827920 2.0 6.466450 3.0 2.286645 10.969801 6.0 False 0.0 1 normal 0.974452

1101 rows × 20 columns
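As a rough illustration of how this table can feed into quality control, the sketch below rebuilds three rows of the metadata as a plain pandas DataFrame and filters on them. The `MIN_MOLECULES` threshold is a hypothetical value chosen only for the example, not a recommended cutoff. With a real dataset you would apply the same operations to `pg_data.adata.obs`.

```python
import pandas as pd

# Toy stand-in for pg_data.adata.obs: three rows and four columns
# copied from the table above.
obs = pd.DataFrame(
    {
        "molecules": [21000, 29584, 13043],
        "reads": [65968, 97357, 40489],
        "mean_b_pixels_per_a_pixel": [3.140342, 2.616997, 2.974342],
        "tau_type": ["normal", "normal", "normal"],
    },
    index=pd.Index(
        ["002f13bdbf78b499", "00387aa4175efe8d", "0105b317c16664a6"],
        name="component",
    ),
)

# Hypothetical threshold, for illustration only.
MIN_MOLECULES = 15000

# Keep components with enough detected molecules and a "normal" tau_type.
keep = (obs["molecules"] >= MIN_MOLECULES) & (obs["tau_type"] == "normal")
filtered = obs[keep]
print(filtered.index.tolist())

# reads / molecules reproduces the mean_reads_per_molecule column.
reads_per_molecule = obs["reads"] / obs["molecules"]
print(reads_per_molecule.round(6))
```

The last step shows that the mean_reads_per_molecule column in the table is the ratio of reads to molecules for each component.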

The antibody count table is also present in the data object.

pg_data.adata.to_df()
marker B2M CD102 CD11a CD11b CD11c CD123 CD14 CD150 CD152 CD154 ... CD86 CD9 CD94 HLA-ABC HLA-DR TCRVB5 ACTB mIgG1 mIgG2a mIgG2b
component
002f13bdbf78b499 1733 30 55 33 15 5 13 36 16 12 ... 12 20 0 723 14 6 52 8 13 8
00387aa4175efe8d 5128 77 402 194 159 4 29 25 25 35 ... 14 43 2 2123 38 28 194 14 19 17
005dd44dcce1635a 1970 113 358 270 527 17 418 27 24 28 ... 113 59 0 728 255 24 65 14 23 2
00ccd2f98f49bd73 5071 39 116 47 21 2 15 75 29 15 ... 9 30 0 2059 38 18 40 15 14 8
0105b317c16664a6 1587 13 68 16 15 2 20 27 5 15 ... 4 20 0 655 13 10 25 7 4 7
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
fe9dfa1d7899a406 4967 37 194 88 42 8 40 61 24 33 ... 11 63 0 1859 54 17 74 13 22 21
feeaa3dfeb603c36 1124 8 47 28 16 3 13 34 8 10 ... 5 15 0 457 16 2 33 10 12 5
ff05dde6ba92a240 2558 94 237 113 46 4 36 37 23 51 ... 29 54 0 1148 73 36 77 31 47 46
ff32ebf3d43d0ad2 1829 73 271 110 267 11 361 14 15 15 ... 112 43 0 635 200 14 41 7 21 5
ffe179c90dda64a2 6126 89 247 56 37 7 27 158 23 32 ... 16 27 0 2189 22 19 47 15 34 9

1101 rows × 84 columns
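Because the count table is an ordinary DataFrame with components as rows and markers as columns, standard pandas operations apply. The sketch below uses two rows and five columns copied from the table above. It assumes that the mIgG1, mIgG2a and mIgG2b columns are isotype controls, which the table itself does not state.

```python
import pandas as pd

# Toy stand-in for pg_data.adata.to_df(), using values from the table above.
counts = pd.DataFrame(
    {
        "B2M": [1733, 1970],
        "HLA-ABC": [723, 728],
        "mIgG1": [8, 14],
        "mIgG2a": [13, 23],
        "mIgG2b": [8, 2],
    },
    index=pd.Index(["002f13bdbf78b499", "005dd44dcce1635a"], name="component"),
)

# Assumption: columns starting with "mIgG" are isotype controls.
control_cols = [c for c in counts.columns if c.startswith("mIgG")]
control_sum = counts[control_cols].sum(axis=1)

# Most abundant marker per component (within this subset of columns).
top_marker = counts.idxmax(axis=1)

print(control_sum)
print(top_marker)
```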

Spatial data

In addition to count data corresponding to protein abundance, MPX also generates spatial data which can be used to visualize individual cells and to find spatial structures of proteins across cells. The PXL file comes with MPX polarity scores, describing the degree of spatial clustering of each protein on each cell, and MPX colocalization scores, which describe the degree of spatial colocalization of pairs of proteins on each cell. If you want to read more about spatial metrics in MPX data, click here.
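To make the idea of spatial clustering more concrete, the sketch below computes Moran's I, the statistic reported in the morans_i column of the polarity table, on a tiny toy graph in plain Python. It assumes binary edge weights and made-up node values. How pixelator builds its weights and derives z-scores and p-values is not reproduced here.

```python
def morans_i(values, edges):
    """Moran's I for node values on an undirected graph with binary weights.

    values: one number per node, indexed 0..n-1
    edges:  list of (i, j) node index pairs, each undirected edge listed once
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("Moran's I is undefined when all values are equal")
    # Each undirected edge contributes w_ij and w_ji, hence the factors of 2.
    w_total = 2 * len(edges)
    cross = 2 * sum(dev[i] * dev[j] for i, j in edges)
    return (n / w_total) * cross / denom


# Toy path graph 0 - 1 - 2 - 3 (hypothetical, for illustration only).
path_edges = [(0, 1), (1, 2), (2, 3)]

clustered = morans_i([5, 5, 0, 0], path_edges)    # similar values are adjacent
alternating = morans_i([5, 0, 5, 0], path_edges)  # neighbours always differ

print(clustered, alternating)
```

Positive values indicate that neighbouring nodes carry similar counts (clustering), while negative values indicate that neighbours tend to differ.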

MPX polarity scores

The polarity scores are stored in the polarization attribute of the PXL object.

pg_data.polarization
marker morans_i morans_z morans_p_value morans_p_adjusted component
0 ACTB -0.008266 -2.031748 0.021090 0.152425 6b437c2d776b156c
1 B2M -0.005748 -1.307924 0.095450 0.341647 6b437c2d776b156c
2 CD102 -0.001478 -0.162568 0.435429 0.487791 6b437c2d776b156c
3 CD11a 0.008608 1.567035 0.058553 0.275501 6b437c2d776b156c
4 CD11b 0.004012 0.986947 0.161834 0.405787 6b437c2d776b156c
... ... ... ... ... ... ...
82410 CD9 -0.006795 -0.365057 0.357534 0.471918 401f95f79fab5327
82411 HLA-ABC -0.003767 -0.242467 0.404209 0.482039 401f95f79fab5327
82412 HLA-DR -0.018672 -1.110681 0.133353 0.384813 401f95f79fab5327
82413 TCRVB5 -0.008693 -0.607933 0.271616 0.452067 401f95f79fab5327
82414 mIgG2a -0.010073 -0.500296 0.308433 0.460591 401f95f79fab5327

82415 rows × 6 columns
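The polarity table is in long format, with one row per marker and component, so it can be summarized with regular pandas group operations. As a minimal sketch, the example below rebuilds the ten visible rows of the table above (markers, z-scores and components only) and picks the marker with the highest morans_z in each component. With real data the same code would be applied to `pg_data.polarization`.

```python
import pandas as pd

# The ten rows shown in the table above (subset of columns).
polarization = pd.DataFrame(
    {
        "marker": [
            "ACTB", "B2M", "CD102", "CD11a", "CD11b",
            "CD9", "HLA-ABC", "HLA-DR", "TCRVB5", "mIgG2a",
        ],
        "morans_z": [
            -2.031748, -1.307924, -0.162568, 1.567035, 0.986947,
            -0.365057, -0.242467, -1.110681, -0.607933, -0.500296,
        ],
        "component": ["6b437c2d776b156c"] * 5 + ["401f95f79fab5327"] * 5,
    }
)

# Row label of the highest z-score within each component.
idx = polarization.groupby("component")["morans_z"].idxmax()
top = polarization.loc[idx]

# Map each component to its highest-scoring marker.
top_marker = dict(zip(top["component"], top["marker"]))
print(top_marker)
```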

MPX colocalization scores

The colocalization scores are stored in the colocalization attribute of the PXL object.

pg_data.colocalization
marker_1 marker_2 pearson pearson_mean pearson_stdev pearson_z pearson_p_value pearson_p_value_adjusted jaccard jaccard_mean jaccard_stdev jaccard_z jaccard_p_value jaccard_p_value_adjusted component
0 ACTB B2M -0.004560 -0.015768 0.021177 0.529221 2.983260e-01 3.279794e-01 0.238776 0.255971 0.006499 -2.645897 0.004074 0.008290 6b437c2d776b156c
1 ACTB CD102 -0.139127 -0.001299 0.022592 -6.100671 5.281217e-10 2.546859e-09 0.241212 0.256722 0.006304 -2.460207 0.006943 0.013305 6b437c2d776b156c
2 B2M CD102 -0.207535 -0.038987 0.024024 -7.015775 1.143388e-12 7.468879e-12 0.271056 0.296946 0.008064 -3.210476 0.000663 0.001633 6b437c2d776b156c
3 ACTB CD11a -0.080339 -0.009859 0.026331 -2.676688 3.717688e-03 6.546656e-03 0.245492 0.254757 0.007883 -1.175337 0.119930 0.158149 6b437c2d776b156c
4 B2M CD11a 0.033593 -0.044887 0.022971 3.416419 3.172522e-04 6.780218e-04 0.326744 0.296844 0.007272 4.111668 0.000020 0.000067 6b437c2d776b156c
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3054474 CD9 mIgG2a -0.003638 0.005354 0.023431 -0.383788 3.505677e-01 3.751922e-01 0.205582 0.205415 0.009261 0.017988 0.492824 0.494683 fdd311340a4cf64f
3054475 HLA-ABC mIgG2a 0.041583 -0.004647 0.022663 2.039919 2.067921e-02 3.121286e-02 0.190202 0.190517 0.008624 -0.036504 0.485440 0.489189 fdd311340a4cf64f
3054476 HLA-DR mIgG2a 0.016440 -0.001183 0.022256 0.791809 2.142360e-01 2.476247e-01 0.163569 0.192963 0.008628 -3.406731 0.000329 0.000869 fdd311340a4cf64f
3054477 TCRVB5 mIgG2a 0.008547 0.003769 0.028088 0.170113 4.324607e-01 4.454262e-01 0.211843 0.208188 0.011769 0.310603 0.378051 0.404345 fdd311340a4cf64f
3054478 mIgG1 mIgG2a -0.091892 0.001774 0.027947 -3.351533 4.018274e-04 8.438472e-04 0.155880 0.187808 0.011331 -2.817681 0.002419 0.005211 fdd311340a4cf64f

3054479 rows × 15 columns
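Each row of the colocalization table describes one marker pair in one component. For a single cell it is often convenient to view these scores as a symmetric marker-by-marker matrix. The sketch below does this for the first five rows of the table above, using only the pearson_z column. Pairs that are absent from the toy data, and the diagonal, are left as NaN.

```python
import pandas as pd

# First five rows of the table above (subset of columns).
coloc = pd.DataFrame(
    {
        "marker_1": ["ACTB", "ACTB", "B2M", "ACTB", "B2M"],
        "marker_2": ["B2M", "CD102", "CD102", "CD11a", "CD11a"],
        "pearson_z": [0.529221, -6.100671, -7.015775, -2.676688, 3.416419],
        "component": ["6b437c2d776b156c"] * 5,
    }
)

# Restrict to one component (one cell).
one = coloc[coloc["component"] == "6b437c2d776b156c"]

# Build an empty marker x marker matrix and fill both triangles.
markers = sorted(set(one["marker_1"]) | set(one["marker_2"]))
mat = pd.DataFrame(float("nan"), index=markers, columns=markers)
for row in one.itertuples(index=False):
    mat.loc[row.marker_1, row.marker_2] = row.pearson_z
    mat.loc[row.marker_2, row.marker_1] = row.pearson_z

print(mat)
```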

Edgelist

The edgelist lists the uniquely identified antibodies for each cell, and at the same time constitutes the bipartite graph that makes up an MPX experiment. Each row is an edge (a uniquely identified antibody), which is identified by a UMI, a UPIA, and a UPIB. The UPIA is the unique identifier for the specific A pixel, while the UPIB is the unique identifier for the B pixel. Both A pixels and B pixels may contain multiple edges, resulting in a spatial graph which can be used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.

pg_data.edgelist
upia upib umi marker sequence count unique_molecules_count component
0 CGGGTTAGTTCATAATTAGCCCATT ATTGCAGGTCGTTAGACAGATAGTG CGTCAGTGGG CD19 ACCAACTT 15 1 6b437c2d776b156c
1 CGTGATCAGACAACACGATGTGAGG GCAGGGAACGACGGGTTTGCGTGGC GAAGCGTCCT CD19 ACCAACTT 15 1 8914534a50b9a420
2 CTTTTAGTGTTTTCTATATTTCTTA CGTTTGTTTTCTGCGAAATTACTGC TGCGAGGGGG CD19 ACCAACTT 15 1 b033c579272106dc
3 TTGGTGTACTTCATTCTGTCGGGTC AGCGGGGAGCTTGCGGTTTGTACAT TAGTGGACCA CD19 ACCAACTT 14 1 d45e83b2ccb78637
4 GTGAAAAATTATCGTGTTGTGTGCT ATTACAAGGTGTATCTTGGAATTCG GTGCAAGCAT CD19 ACCAACTT 13 1 597225108b6c1cd9
... ... ... ... ... ... ... ... ...
25842387 TGTGCGCGTTTGGATTAGTTGGTCT TTGTACGAGGTCGGGGCGAGCATGG GACCATGGTT B2M TAGCGCAG 2 1 2ab91b04ef86fe1b
25842388 TACGTGGGGCGGGGTGCTCTTAGAG GATTTTGTATTTTTTACTGCTCTGT TACCATCGCG B2M TAGCGCAG 2 1 f90f28a810d96e1b
25842389 GATAATTATGCTGCCCACTAGGGAG GTATCGCCATGTGAGTACGCTGGTA CACAAATCAG B2M TAGCGCAG 2 1 e8ecdfba70823d72
25842390 GAGGATCATGGTAGGAAGGTTGCGT TATCCGGTTTCATTCTAGGCGGTTG CGAAAGGTTT B2M TAGCGCAG 2 1 cab06520f2dc08c6
25842391 GGGGTAGTCTTTAGTATTCGCACTT TTTTTGCTATTCTTGAAATGTATCT CTAATAGACG B2M TAGCGCAG 2 1 4f3ab4049ece50f0

25842392 rows × 8 columns
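To illustrate how the edgelist encodes a bipartite graph, the sketch below walks over a hypothetical miniature edgelist in plain Python. The pixel labels (A1, B1, ...) and component names (c1, c2) are made up for readability, whereas real UPIA and UPIB values are nucleotide sequences like those in the table above. It counts edges per marker in each component and computes how many distinct B pixels each A pixel connects to.

```python
from collections import Counter, defaultdict

# Hypothetical miniature edgelist: (upia, upib, marker, component).
edges = [
    ("A1", "B1", "CD19", "c1"),
    ("A1", "B2", "B2M", "c1"),
    ("A2", "B2", "B2M", "c1"),
    ("A2", "B2", "CD19", "c1"),
    ("A3", "B3", "B2M", "c2"),
]

b_neighbors = defaultdict(set)       # (component, A pixel) -> set of B pixels
marker_counts = defaultdict(Counter)  # component -> marker -> number of edges

for upia, upib, marker, component in edges:
    b_neighbors[(component, upia)].add(upib)
    marker_counts[component][marker] += 1

# Number of distinct B pixels connected to each A pixel in component c1.
c1_degrees = [len(bs) for (comp, _), bs in b_neighbors.items() if comp == "c1"]
mean_b_per_a = sum(c1_degrees) / len(c1_degrees)

print(dict(marker_counts["c1"]), mean_b_per_a)
```

Counting edges per marker in this way mirrors what the antibody count table summarizes, and the per-A-pixel neighbour count is the kind of quantity behind connectivity metrics such as mean_b_pixels_per_a_pixel.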

Aggregating data

For cases when we want to process and analyze samples in parallel, it is convenient to merge them into a single object. Here, we read multiple files and aggregate them into a single object. When doing so, the cell ID and sample ID are merged to form the new cell ID in the merged object (e.g. “S1_RCVCMP0000830”).
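The reason for merging the two IDs is that component IDs are only guaranteed to be unique within one sample. The sketch below shows the idea in plain Python. The helper `merged_component_id` is hypothetical, and the order and separator simply follow the example ID in the text above. The exact format produced by pixelator may differ.

```python
def merged_component_id(sample_id, component_id, sep="_"):
    """Hypothetical helper: combine a sample ID and a cell ID into one ID."""
    return f"{sample_id}{sep}{component_id}"


# The same component ID occurring in two different samples (made-up example).
ids = [
    merged_component_id("S1", "RCVCMP0000830"),
    merged_component_id("S2", "RCVCMP0000830"),
]

print(ids)
# The merged IDs stay distinct even though the component IDs collide.
all_unique = len(set(ids)) == len(ids)
```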

In the code chunk below, we create a merged object from four PXL files. These files represent resting and PHA-stimulated PBMC samples, each in duplicate. We label each sample with its condition and replicate number when merging. The resulting object will be explored further in the next tutorial on Quality Control.

baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/0.19.x/technote-v1-vs-v2-immunology-II"

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample06_V2_PBMC_r2.layout.dataset.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl"
from pixelator import simple_aggregate

paths = [
    DATA_DIR / "Sample05_V2_PBMC_r1.layout.dataset.pxl",
    DATA_DIR / "Sample06_V2_PBMC_r2.layout.dataset.pxl",
    DATA_DIR / "Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl",
    DATA_DIR / "Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl",
]

# Read each PXL file, then merge the datasets under the given sample names.
datasets = [read(path) for path in paths]
pg_data_combined = simple_aggregate(
    ["resting_r1", "resting_r2", "stimulated_r1", "stimulated_r2"], datasets
)
pg_data_combined
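Since the sample names encode both condition and replicate, they can be split into separate metadata columns. The sketch below does this on a toy DataFrame. It assumes the merged component metadata carries the sample name in a column called `sample`, which is an assumption here and should be checked against `pg_data_combined.adata.obs`. The row labels are hypothetical.

```python
import pandas as pd

# Toy stand-in for the merged component metadata.
# Assumption: the sample name is stored in a column named "sample".
obs = pd.DataFrame(
    {"sample": ["resting_r1", "resting_r2", "stimulated_r1", "stimulated_r2"]},
    index=["id0", "id1", "id2", "id3"],  # hypothetical component IDs
)

# Split "<condition>_<replicate>" into two new columns.
parts = obs["sample"].str.split("_", expand=True)
obs["condition"] = parts[0]
obs["replicate"] = parts[1]

print(obs)
```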

Save data

After combining the samples, we can save the merged data object to use later.

pg_data_combined.save(DATA_DIR / "combined_data.pxl", force_overwrite=True)

We have now seen how to load MPX data, inspect its key properties, and prepare the data for integrated analysis across experimental samples. In the next tutorial we will apply these skills and demonstrate how to perform quality control on MPX data.