Version: 0.18.x

Data Handling

This tutorial describes how to start working with the MPX output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.

After completing this tutorial, you will be able to:

  • Load PXL files in Python or R and access the multi-modal data contents including protein counts, metadata and spatial scores.
  • Understand and work with spatial polarity (protein clustering patterns).
  • Interpret and work with colocalization scores (protein co-clustering).
  • Directly access the spatial graph structure through the edge list.
  • Aggregate multiple PXL files into an integrated data object with sample identities.
  • Save aggregated data objects for later reuse in analyses.

Setup

First, we load the packages and functions used in this tutorial.

from pathlib import Path

from pixelator import read

# Directory where the example datasets will be stored
DATA_DIR = Path("<path to the directory to save datasets to>")

Loading data

We begin by locating the output from the pixelator pipeline. The input for downstream analysis is the PXL file contained in the analysis directory.

baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/0.18.x/technote-v1-vs-v2-immunology-II"

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
from pathlib import Path

pg_data = read(DATA_DIR / "Sample05_V2_PBMC_r1.layout.dataset.pxl")
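The `!curl` line above is notebook shell syntax and will not run in a plain Python script. As a minimal sketch of an alternative, the same file could be fetched with the standard library. The helper names `dataset_url` and `fetch_dataset` are hypothetical and not part of pixelator; unlike `curl -C -`, this sketch does not resume partial downloads, it only skips files that already exist.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL copied from the tutorial text above.
BASEURL = (
    "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/"
    "mpx-datasets/pixelator/0.18.x/technote-v1-vs-v2-immunology-II"
)


def dataset_url(filename: str, baseurl: str = BASEURL) -> str:
    """Join the base URL and a file name into a download URL."""
    return f"{baseurl.rstrip('/')}/{filename}"


def fetch_dataset(filename: str, data_dir: Path, baseurl: str = BASEURL) -> Path:
    """Download `filename` into `data_dir` unless it is already present."""
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename
    if not target.exists():
        urlretrieve(dataset_url(filename, baseurl), target)
    return target


# Example call (performs a network download, so it is left commented out):
# pxl_path = fetch_dataset("Sample05_V2_PBMC_r1.layout.dataset.pxl", DATA_DIR)
```

The returned path can then be passed to `read` exactly as in the cell above.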

Component metadata

In addition to the count data, the object contains metadata for each MPX graph component, each of which corresponds to a cell. This table is useful for quality control: it reports, for example, how many protein molecules were detected (molecules), the sequencing depth (reads), and measures of graph connectivity for each component (for example mean_b_pixels_per_a_pixel).

pg_data.adata.obs
pixels a_pixels b_pixels antibodies molecules reads mean_reads_per_molecule median_reads_per_molecule mean_b_pixels_per_a_pixel median_b_pixels_per_a_pixel mean_a_pixels_per_b_pixel median_a_pixels_per_b_pixel a_pixel_b_pixel_ratio mean_molecules_per_a_pixel median_molecules_per_a_pixel leiden tau_type tau
component
RCVCMP0000000 5833 3749 2084 79 35040 114108 3.256507 3.0 2.794345 2.0 5.026871 3.0 1.798944 9.346492 5.0 0 normal 0.935262
RCVCMP0000001 4769 3338 1431 79 33399 107689 3.224318 3.0 2.933493 2.0 6.842767 4.0 2.332635 10.005692 5.0 1 normal 0.939779
RCVCMP0000002 4227 2724 1503 78 34376 113349 3.297330 3.0 3.825624 3.0 6.933466 4.0 1.812375 12.619677 7.5 0 normal 0.922382
RCVCMP0000004 3285 2159 1126 79 13859 42368 3.057075 3.0 2.352478 2.0 4.510657 3.0 1.917407 6.419176 3.0 2 normal 0.938652
RCVCMP0000005 9958 6413 3545 79 43839 136302 3.109149 3.0 2.397318 2.0 4.336812 3.0 1.809027 6.835958 3.0 3 normal 0.945426
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
RCVCMP0003226 4556 3070 1486 78 27696 86200 3.112363 3.0 2.829642 2.0 5.845895 4.0 2.065949 9.021498 5.0 6 normal 0.963925
RCVCMP0003903 1493 1156 337 77 6151 20833 3.386929 3.0 1.844291 1.0 6.326409 3.0 3.430267 5.320934 2.0 6 normal 0.961566
RCVCMP0004008 578 376 202 72 2333 7063 3.027432 3.0 2.412234 2.0 4.490099 3.0 1.861386 6.204787 3.0 6 normal 0.940160
RCVCMP0004936 445 294 151 68 3084 11090 3.595979 3.0 2.071429 2.0 4.033113 3.0 1.947020 10.489796 7.0 15 normal 0.987061
RCVCMP0007273 552 310 242 74 1432 4558 3.182961 3.0 2.093548 1.0 2.681818 2.0 1.280992 4.619355 2.0 9 normal 0.914020

1125 rows × 18 columns
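Several of these columns appear to be simple functions of others. The sketch below uses three rows copied from the table above and plain Python dictionaries instead of the real `obs` table, to show how `pixels`, `mean_reads_per_molecule` and `a_pixel_b_pixel_ratio` relate to the raw counts, and how a molecule-count filter might look. The threshold `MIN_MOLECULES` is a hypothetical value chosen for illustration, not a recommendation.

```python
# Three rows copied from the component metadata table above.
rows = {
    "RCVCMP0000000": {"a_pixels": 3749, "b_pixels": 2084, "molecules": 35040, "reads": 114108},
    "RCVCMP0004008": {"a_pixels": 376, "b_pixels": 202, "molecules": 2333, "reads": 7063},
    "RCVCMP0007273": {"a_pixels": 310, "b_pixels": 242, "molecules": 1432, "reads": 4558},
}

MIN_MOLECULES = 2000  # hypothetical QC threshold, for illustration only


def derived_metrics(row):
    """Recompute columns that follow directly from the raw counts."""
    return {
        "pixels": row["a_pixels"] + row["b_pixels"],
        "mean_reads_per_molecule": row["reads"] / row["molecules"],
        "a_pixel_b_pixel_ratio": row["a_pixels"] / row["b_pixels"],
    }


derived = {component: derived_metrics(row) for component, row in rows.items()}

# Keep only components with enough detected molecules.
passing = [c for c, row in rows.items() if row["molecules"] >= MIN_MOLECULES]

for component, metrics in derived.items():
    print(component, metrics, "pass" if component in passing else "fail")
```

For these rows the recomputed values agree with the tabulated ones to the printed precision, which is consistent with that reading of the columns.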

The antibody count table is also present in the data object.

pg_data.adata.to_df()
marker B2M CD102 CD11a CD11b CD11c CD123 CD14 CD150 CD152 CD154 ... CD86 CD9 CD94 HLA-ABC HLA-DR TCRVB5 ACTB mIgG1 mIgG2a mIgG2b
component
RCVCMP0000000 4690 349 509 81 679 2 50 23 13 31 ... 262 68 3 1961 763 19 67 14 22 15
RCVCMP0000001 3379 94 262 156 50 15 78 105 21 34 ... 16 38 0 1215 61 15 71 24 22 12
RCVCMP0000002 3070 39 202 58 35 5 42 30 17 36 ... 18 40 0 1395 2871 26 48 17 31 17
RCVCMP0000004 2014 26 63 92 18 1 19 27 14 16 ... 15 69 0 758 35 18 31 13 15 8
RCVCMP0000005 6731 227 240 185 112 5 79 134 66 60 ... 72 172 0 2066 267 65 179 46 82 44
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
RCVCMP0003226 3245 72 217 116 246 17 464 14 23 25 ... 145 57 0 1486 546 22 62 15 26 4
RCVCMP0003903 731 23 49 99 79 5 106 2 4 3 ... 20 12 1 356 91 6 15 5 5 1
RCVCMP0004008 329 8 21 42 60 1 31 0 2 0 ... 10 13 0 150 25 1 10 1 5 0
RCVCMP0004936 329 23 1 4 1 0 6 8 3 1 ... 3 68 0 156 5 3 4 1 2 0
RCVCMP0007273 152 2 17 10 4 1 3 6 5 0 ... 9 5 0 58 176 4 20 0 1 4

1125 rows × 84 columns

Spatial data

In addition to count data corresponding to protein abundance, MPX also generates spatial data, which can be used to visualize individual cells and to find spatial structures of proteins across cells. The PXL file comes with MPX polarity scores, which describe the degree of spatial clustering of each protein on each cell, and MPX colocalization scores, which describe the degree of spatial colocalization of pairs of proteins on each cell. If you want to read more about spatial metrics in MPX data, click here.

MPX Polarity scores

The polarity scores are stored in the polarization attribute of the PXL object.

pg_data.polarization
marker morans_i morans_z morans_p_value morans_p_adjusted component
0 ACTB -0.008267 -1.422853 0.077389 0.300134 RCVCMP0000000
1 B2M -0.005889 -1.774103 0.038023 0.205942 RCVCMP0000000
2 CD102 -0.001513 -0.078050 0.468894 0.493943 RCVCMP0000000
3 CD11a 0.008562 1.460010 0.072144 0.290312 RCVCMP0000000
4 CD11b 0.004098 1.012623 0.155620 0.392762 RCVCMP0000000
... ... ... ... ... ... ...
84017 CD82 -0.008716 -0.142741 0.443247 0.488600 RCVCMP0007273
84018 CD86 -0.011350 -0.557550 0.288576 0.451475 RCVCMP0007273
84019 CD9 -0.015076 -0.286918 0.387088 0.475960 RCVCMP0007273
84020 HLA-ABC -0.014450 -1.101991 0.135233 0.374945 RCVCMP0007273
84021 HLA-DR -0.013202 -1.453868 0.072991 0.291751 RCVCMP0007273

84022 rows × 6 columns
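The `morans_i` column suggests that polarity is quantified with Moran's I, a standard measure of spatial autocorrelation. As a rough illustration of what such a score captures, the sketch below computes the textbook Moran's I on a toy path graph with unit edge weights. It is not pixelator's implementation: the table above also reports a z-score and p-values, and the file metadata lists a `log1p` transformation and 50 permutations, none of which are reproduced here.

```python
def morans_i(values, edges):
    """Textbook Moran's I for node values on an undirected graph.

    `edges` is a list of (i, j) node-index pairs, each with weight 1.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0 or not edges:
        raise ValueError("Moran's I is undefined for constant values or an empty graph")
    # Each undirected edge counts twice (i->j and j->i) in both sums.
    cross = 2 * sum(dev[i] * dev[j] for i, j in edges)
    total_weight = 2 * len(edges)
    return (n / total_weight) * (cross / denom)


# Toy graph: six nodes connected in a line.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

clustered = [5, 5, 5, 0, 0, 0]    # marker counts concentrated at one end
alternating = [5, 0, 5, 0, 5, 0]  # marker counts dispersed evenly

print(morans_i(clustered, path_edges))    # positive: neighbours are similar
print(morans_i(alternating, path_edges))  # negative: neighbours are dissimilar
```

Positive values indicate that a marker is concentrated in one region of the graph, values near zero indicate no spatial pattern, and negative values indicate dispersion.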

MPX colocalization scores

The colocalization scores are stored in the colocalization attribute of the PXL object.

pg_data.colocalization
marker_1 marker_2 pearson pearson_mean pearson_stdev pearson_z pearson_p_value pearson_p_value_adjusted jaccard jaccard_mean jaccard_stdev jaccard_z jaccard_p_value jaccard_p_value_adjusted component
0 ACTB B2M -0.004562 -0.012738 0.025900 0.315672 3.761257e-01 3.975680e-01 0.238681 0.257546 0.008556 -2.204776 0.013735 0.024148 RCVCMP0000000
1 ACTB CD102 -0.139138 -0.001331 0.022208 -6.205428 2.727414e-10 1.339170e-09 0.241310 0.258330 0.008305 -2.049466 0.020208 0.033899 RCVCMP0000000
2 B2M CD102 -0.207174 -0.040284 0.027302 -6.112788 4.895280e-10 2.333059e-09 0.271056 0.296720 0.008591 -2.987515 0.001406 0.003181 RCVCMP0000000
3 ACTB CD11a -0.080322 -0.005232 0.024710 -3.038852 1.187408e-03 2.290011e-03 0.245492 0.256245 0.007787 -1.380914 0.083653 0.116167 RCVCMP0000000
4 B2M CD11a 0.033532 -0.045293 0.019437 4.055448 2.501919e-05 6.373526e-05 0.326618 0.297709 0.008697 3.324091 0.000444 0.001126 RCVCMP0000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3111936 CD9 mIgG2a -0.003460 -0.005256 0.025624 0.070107 4.720541e-01 4.777580e-01 0.206122 0.204214 0.009976 0.191295 0.424147 0.441729 RCVCMP0003226
3111937 HLA-ABC mIgG2a 0.041451 -0.000336 0.025202 1.658047 4.865403e-02 6.731650e-02 0.190092 0.191577 0.009459 -0.156950 0.437642 0.452392 RCVCMP0003226
3111938 HLA-DR mIgG2a 0.016710 -0.001551 0.025647 0.712022 2.382256e-01 2.710816e-01 0.163395 0.192182 0.008039 -3.580660 0.000171 0.000477 RCVCMP0003226
3111939 TCRVB5 mIgG2a 0.008811 -0.001998 0.024293 0.444936 3.281831e-01 3.551442e-01 0.212500 0.206988 0.011616 0.474513 0.317567 0.352194 RCVCMP0003226
3111940 mIgG1 mIgG2a -0.092031 -0.000510 0.021879 -4.183096 1.437827e-05 3.799320e-05 0.155738 0.185831 0.010408 -2.891322 0.001918 0.004200 RCVCMP0003226

3111941 rows × 15 columns
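The `*_mean`, `*_stdev` and `*_z` columns appear to be related in the usual way for a permutation test: the z-score is the observed statistic minus the mean of the permuted statistics, divided by their standard deviation. That reading is an assumption based on the column names and the `n_permutations` setting in the file metadata; the sketch below checks it against the first row of the table above (ACTB and B2M on component RCVCMP0000000).

```python
def permutation_z(observed, perm_mean, perm_stdev):
    """Standardize an observed statistic against a permutation null."""
    return (observed - perm_mean) / perm_stdev


# Values copied from row 0 of the colocalization table above.
pearson_z = permutation_z(observed=-0.004562, perm_mean=-0.012738, perm_stdev=0.025900)
jaccard_z = permutation_z(observed=0.238681, perm_mean=0.257546, perm_stdev=0.008556)

print(round(pearson_z, 3))  # compare with pearson_z in the table
print(round(jaccard_z, 3))  # compare with jaccard_z in the table
```

Both recomputed values agree with the tabulated `pearson_z` and `jaccard_z` up to rounding of the displayed inputs. A positive z-score means the pair colocalizes more than expected under permutation, a negative one less.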

Edgelist

The edgelist is a list of the uniquely identified antibodies for each cell, and it simultaneously constitutes the bipartite graph that makes up an MPX experiment. Each row contains an edge (a uniquely identified antibody), which is identified by a UMI, a UPIA, and a UPIB. The UPIA is the unique identifier of the A pixel, and the UPIB is the unique identifier of the B pixel. Both A pixels and B pixels may contain multiple edges, resulting in a spatial graph that can be used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.

pg_data.edgelist
upia upib umi marker sequence count unique_molecules_count component
0 CGGGTTAGTTCATAATTAGCCCATT ATTGCAGGTCGTTAGACAGATAGTG CGTCAGTGGG CD19 ACCAACTT 15 1 RCVCMP0000000
1 CGTGATCAGACAACACGATGTGAGG GCAGGGAACGACGGGTTTGCGTGGC GAAGCGTCCT CD19 ACCAACTT 15 1 RCVCMP0000001
2 CTTTTAGTGTTTTCTATATTTCTTA CGTTTGTTTTCTGCGAAATTACTGC TGCGAGGGGG CD19 ACCAACTT 15 1 RCVCMP0000002
3 ACCCAGTTCTGCTCTGTCACGCAGT GATGTGGAAATCTAATATGTATACC GAGTTTGGGA CD19 ACCAACTT 14 1 RCVCMP0000004
4 TTGGTGTACTTCATTCTGTCGGGTC AGCGGGGAGCTTGCGGTTTGTACAT TAGTGGACCA CD19 ACCAACTT 14 1 RCVCMP0000005
... ... ... ... ... ... ... ... ...
26003768 ATTTAGGTACGTTTGAGGATCAGGG TTAGCTTCACAGCTGCTATCGGGTT ATCAGCAACG CD44 CGGCTCAT 2 1 RCVCMP0000255
26003769 TCTTATTTTAAAATAAAATTCTCAA CTTTAACCTTCTTCAACTATGAGAA TAAGAAGACA CD44 CGGCTCAT 2 1 RCVCMP0000779
26003770 GTCTATGTGGTAGTTAGGAAATGTA TGGATTTCTCTGCACAGAAGCGTTT ATACTGACTC CD44 CGGCTCAT 2 1 RCVCMP0000494
26003771 TAGGTAGATTGTGTCACGTTGACTT GTGCAAGGTAGATCTGGAATGGACG AGCTGTTTAC CD44 CGGCTCAT 2 1 RCVCMP0000572
26003772 GGCGCGAGTGAATCGCGTGCTGTTC TTAGTGATATGTCTCCTTCGCATGT GGGAGGACGA CD44 CGGCTCAT 2 1 RCVCMP0001151

26003773 rows × 8 columns
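To make the bipartite structure concrete, the sketch below builds neighbour sets for A and B pixels from a handful of edges and tallies markers. The identifiers are hypothetical and shortened for readability (real UPIA and UPIB values are 25-nucleotide sequences, as in the table above), and plain tuples stand in for the rows of `pg_data.edgelist`. That these tallies correspond to the count table and to metadata columns such as `mean_b_pixels_per_a_pixel` is an assumption based on the column names.

```python
from collections import Counter, defaultdict

# Hypothetical edges for one component: (upia, upib, marker).
edges = [
    ("A1", "B1", "CD19"),
    ("A1", "B2", "CD44"),
    ("A2", "B2", "CD44"),
    ("A2", "B3", "CD19"),
    ("A3", "B3", "CD44"),
]

a_neighbors = defaultdict(set)  # A pixel -> connected B pixels
b_neighbors = defaultdict(set)  # B pixel -> connected A pixels
marker_counts = Counter()       # marker -> number of edges (molecules)

for upia, upib, marker in edges:
    a_neighbors[upia].add(upib)
    b_neighbors[upib].add(upia)
    marker_counts[marker] += 1

# Average number of distinct B pixels attached to each A pixel.
mean_b_per_a = sum(len(bs) for bs in a_neighbors.values()) / len(a_neighbors)

# A pixels that share at least one B pixel with A1 are its graph neighbours.
a1_neighbours = {a for b in a_neighbors["A1"] for a in b_neighbors[b]} - {"A1"}

print(dict(marker_counts), mean_b_per_a, a1_neighbours)
```

Because A pixels only connect to B pixels and vice versa, two A pixels are spatially close when they share a B pixel, which is what makes distance-based statistics on the graph possible.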

Aggregating data

When we want to process and analyze several samples together, it is convenient to merge them into a single object. Here, we read multiple files and aggregate them into a single object. When doing so, the sample ID and cell ID are combined to form the new cell ID in the merged object (e.g. “S1_RCVCMP0000830”).
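The reason for combining the two IDs is that component IDs are only unique within a sample, so the same ID can occur in several files. A minimal sketch of the idea, following the naming pattern of the example above (the exact scheme pixelator uses internally may differ, and the second sample here is hypothetical):

```python
def merged_component_id(sample, component):
    """Combine a sample label and a component ID into one identifier."""
    return f"{sample}_{component}"


# Hypothetical per-sample component IDs; note the ID shared by both samples.
sample_components = {
    "S1": ["RCVCMP0000830", "RCVCMP0000001"],
    "S2": ["RCVCMP0000830"],
}

merged = [
    merged_component_id(sample, component)
    for sample, components in sample_components.items()
    for component in components
]

print(merged)
# The merged IDs are unique even though the raw component IDs are not.
print(len(merged) == len(set(merged)))
```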

In the code chunk below, we will create a merged object from four PXL files. These files represent a resting and a PHA-stimulated PBMC sample, each in duplicate. We will label each sample with its condition and replicate number when creating the merged object. The resulting object will be explored further in the next tutorial on Quality Control.

baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/0.18.x/technote-v1-vs-v2-immunology-II"

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample06_V2_PBMC_r2.layout.dataset.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl"
from pixelator import simple_aggregate

paths = [
    DATA_DIR / "Sample05_V2_PBMC_r1.layout.dataset.pxl",
    DATA_DIR / "Sample06_V2_PBMC_r2.layout.dataset.pxl",
    DATA_DIR / "Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl",
    DATA_DIR / "Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl",
]

pg_data_combined = simple_aggregate(
    ["resting_r1", "resting_r2", "stimulated_r1", "stimulated_r2"],
    [read(path) for path in paths],
)
pg_data_combined
Pixel dataset contains:
AnnData with 4454 obs and 84 vars
Edge list with 105218620 edges
Polarization scores with 331421 elements
Colocalization scores with 12238659 elements
Contains precomputed layouts
Metadata:
samples: {'resting_r1': {'version': '0.18.0', 'sample': 'Sample05_V2_PBMC_r1', 'analysis': {'params': {'polarization': {'polarization': {'transformation': 'log1p', 'permutations': 50, 'min_marker_count': 5, 'random_seed': None}}, 'colocalization': {'colocalization': {'transformation_type': 'rate-diff', 'neighbourhood_size': 1, 'n_permutations': 50, 'min_region_count': 5, 'min_marker_count': 5}}}}}, 'resting_r2': {'version': '0.18.0', 'sample': 'Sample06_V2_PBMC_r2', 'analysis': {'params': {'polarization': {'polarization': {'transformation': 'log1p', 'permutations': 50, 'min_marker_count': 5, 'random_seed': None}}, 'colocalization': {'colocalization': {'transformation_type': 'rate-diff', 'neighbourhood_size': 1, 'n_permutations': 50, 'min_region_count': 5, 'min_marker_count': 5}}}}}, 'stimulated_r1': {'version': '0.18.0', 'sample': 'Sample07_V2_PHA_PBMC_r1', 'analysis': {'params': {'polarization': {'polarization': {'transformation': 'log1p', 'permutations': 50, 'min_marker_count': 5, 'random_seed': None}}, 'colocalization': {'colocalization': {'transformation_type': 'rate-diff', 'neighbourhood_size': 1, 'n_permutations': 50, 'min_region_count': 5, 'min_marker_count': 5}}}}}, 'stimulated_r2': {'version': '0.18.0', 'sample': 'Sample08_V2_PHA_PBMC_r2', 'analysis': {'params': {'polarization': {'polarization': {'transformation': 'log1p', 'permutations': 50, 'min_marker_count': 5, 'random_seed': None}}, 'colocalization': {'colocalization': {'transformation_type': 'rate-diff', 'neighbourhood_size': 1, 'n_permutations': 50, 'min_region_count': 5, 'min_marker_count': 5}}}}}}
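The sample names passed to `simple_aggregate` encode both the condition and the replicate number. One way to recover the two as separate fields is sketched below; `split_sample_label` is a hypothetical helper, and how the result is attached to the merged object (for example as extra columns in its metadata table) is not shown because it depends on details not covered in this tutorial.

```python
def split_sample_label(label):
    """Split a label such as 'resting_r1' into condition and replicate."""
    condition, _, replicate = label.rpartition("_r")
    if not condition or not replicate.isdigit():
        raise ValueError(f"unexpected sample label: {label!r}")
    return {"condition": condition, "replicate": int(replicate)}


# Sample names used in the aggregation call above.
labels = ["resting_r1", "resting_r2", "stimulated_r1", "stimulated_r2"]
annotations = {label: split_sample_label(label) for label in labels}

for label, fields in annotations.items():
    print(label, fields)
```

Raising an error on unexpected labels is a deliberate choice here, so that a misnamed sample is caught early rather than silently assigned to the wrong condition.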

Save data

After combining the samples, we can save the merged data object to use later.

pg_data_combined.save(DATA_DIR / "combined_data.pxl", force_overwrite=True)

We have now seen how to load MPX data, inspect its key properties, and prepare the data for integrated analysis across experimental samples. In the next tutorial, we will apply these skills and demonstrate how to perform quality control on MPX data.