Data Handling
This tutorial describes how to start working with the MPX output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.
After completing this tutorial, you will be able to:
- Load PXL files in Python or R and access the multi-modal data contents including protein counts, metadata and spatial scores.
- Understand and work with spatial polarity (protein clustering patterns).
- Interpret and work with colocalization scores (protein co-clustering).
- Directly access the spatial graph structure through the edge list.
- Aggregate multiple PXL files into an integrated data object with sample identities.
- Save aggregated data objects for later reuse in analyses.
Setup
First, we load the packages and functions used in this tutorial.
from pathlib import Path
from pixelator import read
DATA_DIR = Path("<path to the directory to save datasets to>")
Loading data
We begin by locating the output from the pixelator pipeline. The input for downstream analysis is the PXL file contained in the analysis directory.
baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/0.18.x/technote-v1-vs-v2-immunology-II"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
from pathlib import Path
pg_data = read(DATA_DIR / "Sample05_V2_PBMC_r1.layout.dataset.pxl")
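The `!curl` lines above only work inside IPython or Jupyter. As a rough sketch for plain Python scripts, the same download can be done with the standard library. Note that `fetch_if_missing` is a hypothetical helper written for this example, not part of pixelator, and unlike `curl -C -` it does not resume partial downloads.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(base_url: str, filename: str, data_dir: Path) -> Path:
    """Download base_url/filename into data_dir unless the file already exists.

    Hypothetical helper for illustration; not part of pixelator.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename
    if not target.exists():
        # Network access only happens when the file is not yet on disk.
        urlretrieve(f"{base_url}/{filename}", str(target))
    return target
```

It could then be called as `fetch_if_missing(baseurl, "Sample05_V2_PBMC_r1.layout.dataset.pxl", DATA_DIR)` and the returned path passed to `read`.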
Component metadata
In addition to the count data, the object contains metadata for each MPX graph component, each corresponding to a cell. This table is useful for quality control: it reports, for example, how many protein molecules were detected (molecules), the sequencing depth (reads), and the graph connectivity of each component (for example, mean_b_pixels_per_a_pixel).
pg_data.adata.obs
component | pixels | a_pixels | b_pixels | antibodies | molecules | reads | mean_reads_per_molecule | median_reads_per_molecule | mean_b_pixels_per_a_pixel | median_b_pixels_per_a_pixel | mean_a_pixels_per_b_pixel | median_a_pixels_per_b_pixel | a_pixel_b_pixel_ratio | mean_molecules_per_a_pixel | median_molecules_per_a_pixel | leiden | tau_type | tau |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RCVCMP0000000 | 5833 | 3749 | 2084 | 79 | 35040 | 114108 | 3.256507 | 3.0 | 2.794345 | 2.0 | 5.026871 | 3.0 | 1.798944 | 9.346492 | 5.0 | 0 | normal | 0.935262 |
RCVCMP0000001 | 4769 | 3338 | 1431 | 79 | 33399 | 107689 | 3.224318 | 3.0 | 2.933493 | 2.0 | 6.842767 | 4.0 | 2.332635 | 10.005692 | 5.0 | 1 | normal | 0.939779 |
RCVCMP0000002 | 4227 | 2724 | 1503 | 78 | 34376 | 113349 | 3.297330 | 3.0 | 3.825624 | 3.0 | 6.933466 | 4.0 | 1.812375 | 12.619677 | 7.5 | 0 | normal | 0.922382 |
RCVCMP0000004 | 3285 | 2159 | 1126 | 79 | 13859 | 42368 | 3.057075 | 3.0 | 2.352478 | 2.0 | 4.510657 | 3.0 | 1.917407 | 6.419176 | 3.0 | 2 | normal | 0.938652 |
RCVCMP0000005 | 9958 | 6413 | 3545 | 79 | 43839 | 136302 | 3.109149 | 3.0 | 2.397318 | 2.0 | 4.336812 | 3.0 | 1.809027 | 6.835958 | 3.0 | 3 | normal | 0.945426 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
RCVCMP0003226 | 4556 | 3070 | 1486 | 78 | 27696 | 86200 | 3.112363 | 3.0 | 2.829642 | 2.0 | 5.845895 | 4.0 | 2.065949 | 9.021498 | 5.0 | 6 | normal | 0.963925 |
RCVCMP0003903 | 1493 | 1156 | 337 | 77 | 6151 | 20833 | 3.386929 | 3.0 | 1.844291 | 1.0 | 6.326409 | 3.0 | 3.430267 | 5.320934 | 2.0 | 6 | normal | 0.961566 |
RCVCMP0004008 | 578 | 376 | 202 | 72 | 2333 | 7063 | 3.027432 | 3.0 | 2.412234 | 2.0 | 4.490099 | 3.0 | 1.861386 | 6.204787 | 3.0 | 6 | normal | 0.940160 |
RCVCMP0004936 | 445 | 294 | 151 | 68 | 3084 | 11090 | 3.595979 | 3.0 | 2.071429 | 2.0 | 4.033113 | 3.0 | 1.947020 | 10.489796 | 7.0 | 15 | normal | 0.987061 |
RCVCMP0007273 | 552 | 310 | 242 | 74 | 1432 | 4558 | 3.182961 | 3.0 | 2.093548 | 1.0 | 2.681818 | 2.0 | 1.280992 | 4.619355 | 2.0 | 9 | normal | 0.914020 |
1125 rows × 18 columns
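To make the quality control use of this table concrete, here is a minimal sketch that works on a toy stand-in for `pg_data.adata.obs`, built from five of the rows shown above. The threshold `MIN_MOLECULES` is a made-up value chosen only to illustrate filtering, not a recommended cutoff.

```python
import pandas as pd

# Toy stand-in for pg_data.adata.obs, using five rows from the table above.
obs = pd.DataFrame(
    {
        "pixels": [5833, 4769, 3285, 578, 445],
        "a_pixels": [3749, 3338, 2159, 376, 294],
        "b_pixels": [2084, 1431, 1126, 202, 151],
        "molecules": [35040, 33399, 13859, 2333, 3084],
        "reads": [114108, 107689, 42368, 7063, 11090],
    },
    index=pd.Index(
        [
            "RCVCMP0000000",
            "RCVCMP0000001",
            "RCVCMP0000004",
            "RCVCMP0004008",
            "RCVCMP0004936",
        ],
        name="component",
    ),
)

# Some columns are simple functions of others, which gives a quick sanity check.
pixels_consistent = bool((obs["pixels"] == obs["a_pixels"] + obs["b_pixels"]).all())
reads_per_molecule = obs["reads"] / obs["molecules"]

# Hypothetical threshold, for illustration only.
MIN_MOLECULES = 5000
kept = obs[obs["molecules"] >= MIN_MOLECULES]

print(pixels_consistent)
print(kept.index.tolist())
```

For these rows, `reads_per_molecule` reproduces the `mean_reads_per_molecule` column of the table, and the same boolean indexing should carry over to the real `obs` table.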
The antibody count table is also present in the data object.
pg_data.adata.to_df()
component | B2M | CD102 | CD11a | CD11b | CD11c | CD123 | CD14 | CD150 | CD152 | CD154 | ... | CD86 | CD9 | CD94 | HLA-ABC | HLA-DR | TCRVB5 | ACTB | mIgG1 | mIgG2a | mIgG2b |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RCVCMP0000000 | 4690 | 349 | 509 | 81 | 679 | 2 | 50 | 23 | 13 | 31 | ... | 262 | 68 | 3 | 1961 | 763 | 19 | 67 | 14 | 22 | 15 |
RCVCMP0000001 | 3379 | 94 | 262 | 156 | 50 | 15 | 78 | 105 | 21 | 34 | ... | 16 | 38 | 0 | 1215 | 61 | 15 | 71 | 24 | 22 | 12 |
RCVCMP0000002 | 3070 | 39 | 202 | 58 | 35 | 5 | 42 | 30 | 17 | 36 | ... | 18 | 40 | 0 | 1395 | 2871 | 26 | 48 | 17 | 31 | 17 |
RCVCMP0000004 | 2014 | 26 | 63 | 92 | 18 | 1 | 19 | 27 | 14 | 16 | ... | 15 | 69 | 0 | 758 | 35 | 18 | 31 | 13 | 15 | 8 |
RCVCMP0000005 | 6731 | 227 | 240 | 185 | 112 | 5 | 79 | 134 | 66 | 60 | ... | 72 | 172 | 0 | 2066 | 267 | 65 | 179 | 46 | 82 | 44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
RCVCMP0003226 | 3245 | 72 | 217 | 116 | 246 | 17 | 464 | 14 | 23 | 25 | ... | 145 | 57 | 0 | 1486 | 546 | 22 | 62 | 15 | 26 | 4 |
RCVCMP0003903 | 731 | 23 | 49 | 99 | 79 | 5 | 106 | 2 | 4 | 3 | ... | 20 | 12 | 1 | 356 | 91 | 6 | 15 | 5 | 5 | 1 |
RCVCMP0004008 | 329 | 8 | 21 | 42 | 60 | 1 | 31 | 0 | 2 | 0 | ... | 10 | 13 | 0 | 150 | 25 | 1 | 10 | 1 | 5 | 0 |
RCVCMP0004936 | 329 | 23 | 1 | 4 | 1 | 0 | 6 | 8 | 3 | 1 | ... | 3 | 68 | 0 | 156 | 5 | 3 | 4 | 1 | 2 | 0 |
RCVCMP0007273 | 152 | 2 | 17 | 10 | 4 | 1 | 3 | 6 | 5 | 0 | ... | 9 | 5 | 0 | 58 | 176 | 4 | 20 | 0 | 1 | 4 |
1125 rows × 84 columns
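Because the count table is an ordinary data frame, the usual pandas operations apply. The sketch below uses a toy subset of the table above (three components, four markers) to find the component with the highest CD14 count and to compare a marker against an isotype control. The ratio is only an illustration of column arithmetic, not a normalization method recommended here.

```python
import pandas as pd

# Toy subset of pg_data.adata.to_df(), values copied from the table above.
counts = pd.DataFrame(
    {
        "B2M": [4690, 3379, 3245],
        "CD14": [50, 78, 464],
        "HLA-DR": [763, 61, 546],
        "mIgG1": [14, 24, 15],
    },
    index=pd.Index(
        ["RCVCMP0000000", "RCVCMP0000001", "RCVCMP0003226"], name="component"
    ),
)

# Component with the highest CD14 count.
top_cd14 = counts["CD14"].idxmax()

# Ratio of a marker to an isotype control, computed per component.
hla_dr_vs_control = counts["HLA-DR"] / counts["mIgG1"]

print(top_cd14)
print(hla_dr_vs_control.round(2).to_dict())
```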
Spatial data
In addition to count data corresponding to protein abundance, MPX also generates spatial data which can be used to visualize individual cells and to find spatial structures of proteins across cells. The PXL file comes with MPX polarity scores, describing the degree of spatial clustering of each protein on each cell, and MPX colocalization scores, which describe the degree of spatial colocalization of pairs of proteins on each cell. If you want to read more about spatial metrics in MPX data, click here.
MPX polarity scores
The polarity scores are stored in the polarization attribute of the PXL object.
pg_data.polarization
| | marker | morans_i | morans_z | morans_p_value | morans_p_adjusted | component |
---|---|---|---|---|---|---|
0 | ACTB | -0.008267 | -1.422853 | 0.077389 | 0.300134 | RCVCMP0000000 |
1 | B2M | -0.005889 | -1.774103 | 0.038023 | 0.205942 | RCVCMP0000000 |
2 | CD102 | -0.001513 | -0.078050 | 0.468894 | 0.493943 | RCVCMP0000000 |
3 | CD11a | 0.008562 | 1.460010 | 0.072144 | 0.290312 | RCVCMP0000000 |
4 | CD11b | 0.004098 | 1.012623 | 0.155620 | 0.392762 | RCVCMP0000000 |
... | ... | ... | ... | ... | ... | ... |
84017 | CD82 | -0.008716 | -0.142741 | 0.443247 | 0.488600 | RCVCMP0007273 |
84018 | CD86 | -0.011350 | -0.557550 | 0.288576 | 0.451475 | RCVCMP0007273 |
84019 | CD9 | -0.015076 | -0.286918 | 0.387088 | 0.475960 | RCVCMP0007273 |
84020 | HLA-ABC | -0.014450 | -1.101991 | 0.135233 | 0.374945 | RCVCMP0007273 |
84021 | HLA-DR | -0.013202 | -1.453868 | 0.072991 | 0.291751 | RCVCMP0007273 |
84022 rows × 6 columns
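The `morans_i` column holds Moran's I, a spatial autocorrelation statistic: positive values mean similar counts sit on neighbouring nodes (clustering), while negative values mean they tend to avoid each other. As a minimal sketch of the statistic itself, here it is computed on a toy four-node path graph with unit edge weights. This is not pixelator's implementation, which also applies a count transformation and derives `morans_z` from permutations.

```python
def morans_i(values, edges):
    """Moran's I for node values on an undirected graph with unit edge weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Each undirected edge is counted in both directions (i->j and j->i).
    cross = 2 * sum(dev[i] * dev[j] for i, j in edges)
    total_weight = 2 * len(edges)
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


# Toy graph: four nodes in a line, 0 - 1 - 2 - 3.
path = [(0, 1), (1, 2), (2, 3)]

clustered = morans_i([2, 2, 0, 0], path)    # equal counts sit next to each other
alternating = morans_i([2, 0, 2, 0], path)  # equal counts are never adjacent

print(clustered, alternating)
```

The clustered pattern gives a positive score and the alternating pattern a negative one, which is the sign convention to keep in mind when reading the table above.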
MPX colocalization scores
The colocalization scores are stored in the colocalization attribute of the PXL object.
pg_data.colocalization
| | marker_1 | marker_2 | pearson | pearson_mean | pearson_stdev | pearson_z | pearson_p_value | pearson_p_value_adjusted | jaccard | jaccard_mean | jaccard_stdev | jaccard_z | jaccard_p_value | jaccard_p_value_adjusted | component |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACTB | B2M | -0.004562 | -0.012738 | 0.025900 | 0.315672 | 3.761257e-01 | 3.975680e-01 | 0.238681 | 0.257546 | 0.008556 | -2.204776 | 0.013735 | 0.024148 | RCVCMP0000000 |
1 | ACTB | CD102 | -0.139138 | -0.001331 | 0.022208 | -6.205428 | 2.727414e-10 | 1.339170e-09 | 0.241310 | 0.258330 | 0.008305 | -2.049466 | 0.020208 | 0.033899 | RCVCMP0000000 |
2 | B2M | CD102 | -0.207174 | -0.040284 | 0.027302 | -6.112788 | 4.895280e-10 | 2.333059e-09 | 0.271056 | 0.296720 | 0.008591 | -2.987515 | 0.001406 | 0.003181 | RCVCMP0000000 |
3 | ACTB | CD11a | -0.080322 | -0.005232 | 0.024710 | -3.038852 | 1.187408e-03 | 2.290011e-03 | 0.245492 | 0.256245 | 0.007787 | -1.380914 | 0.083653 | 0.116167 | RCVCMP0000000 |
4 | B2M | CD11a | 0.033532 | -0.045293 | 0.019437 | 4.055448 | 2.501919e-05 | 6.373526e-05 | 0.326618 | 0.297709 | 0.008697 | 3.324091 | 0.000444 | 0.001126 | RCVCMP0000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3111936 | CD9 | mIgG2a | -0.003460 | -0.005256 | 0.025624 | 0.070107 | 4.720541e-01 | 4.777580e-01 | 0.206122 | 0.204214 | 0.009976 | 0.191295 | 0.424147 | 0.441729 | RCVCMP0003226 |
3111937 | HLA-ABC | mIgG2a | 0.041451 | -0.000336 | 0.025202 | 1.658047 | 4.865403e-02 | 6.731650e-02 | 0.190092 | 0.191577 | 0.009459 | -0.156950 | 0.437642 | 0.452392 | RCVCMP0003226 |
3111938 | HLA-DR | mIgG2a | 0.016710 | -0.001551 | 0.025647 | 0.712022 | 2.382256e-01 | 2.710816e-01 | 0.163395 | 0.192182 | 0.008039 | -3.580660 | 0.000171 | 0.000477 | RCVCMP0003226 |
3111939 | TCRVB5 | mIgG2a | 0.008811 | -0.001998 | 0.024293 | 0.444936 | 3.281831e-01 | 3.551442e-01 | 0.212500 | 0.206988 | 0.011616 | 0.474513 | 0.317567 | 0.352194 | RCVCMP0003226 |
3111940 | mIgG1 | mIgG2a | -0.092031 | -0.000510 | 0.021879 | -4.183096 | 1.437827e-05 | 3.799320e-05 | 0.155738 | 0.185831 | 0.010408 | -2.891322 | 0.001918 | 0.004200 | RCVCMP0003226 |
3111941 rows × 15 columns
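The columns of this table are related to each other. Judging from the displayed rows, the z-scores appear to be the observed score minus the mean of the permuted scores, divided by their standard deviation. The sketch below checks that relation against four rows copied from the table. The formula is inferred from these values rather than taken from the pixelator documentation, so treat it as an assumption.

```python
# (observed, permutation mean, permutation stdev, reported z), copied from the rows above.
rows = {
    ("ACTB", "B2M", "pearson"): (-0.004562, -0.012738, 0.025900, 0.315672),
    ("ACTB", "CD102", "pearson"): (-0.139138, -0.001331, 0.022208, -6.205428),
    ("B2M", "CD11a", "pearson"): (0.033532, -0.045293, 0.019437, 4.055448),
    ("ACTB", "B2M", "jaccard"): (0.238681, 0.257546, 0.008556, -2.204776),
}


def z_score(observed, mean, stdev):
    """Standardize an observed score against its permutation distribution."""
    return (observed - mean) / stdev


# Largest deviation between the recomputed and the reported z-scores.
max_abs_error = max(abs(z_score(o, m, s) - z) for o, m, s, z in rows.values())
print(max_abs_error < 1e-3)
```

The small remaining differences are what one would expect from the rounding of the displayed values.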
Edgelist
The edgelist is a list of uniquely identified antibodies for each cell, and it also constitutes the bipartite graph that makes up an MPX experiment. Each row contains an edge (a uniquely identified antibody) which is identified by a UMI, a UPIA, and a UPIB. The UPIA is the unique identifier for the specific A pixel, while the UPIB is the unique identifier for the B pixel. Both A pixels and B pixels may contain multiple edges, resulting in a spatial graph which can be used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.
pg_data.edgelist
| | upia | upib | umi | marker | sequence | count | unique_molecules_count | component |
---|---|---|---|---|---|---|---|---|
0 | CGGGTTAGTTCATAATTAGCCCATT | ATTGCAGGTCGTTAGACAGATAGTG | CGTCAGTGGG | CD19 | ACCAACTT | 15 | 1 | RCVCMP0000000 |
1 | CGTGATCAGACAACACGATGTGAGG | GCAGGGAACGACGGGTTTGCGTGGC | GAAGCGTCCT | CD19 | ACCAACTT | 15 | 1 | RCVCMP0000001 |
2 | CTTTTAGTGTTTTCTATATTTCTTA | CGTTTGTTTTCTGCGAAATTACTGC | TGCGAGGGGG | CD19 | ACCAACTT | 15 | 1 | RCVCMP0000002 |
3 | ACCCAGTTCTGCTCTGTCACGCAGT | GATGTGGAAATCTAATATGTATACC | GAGTTTGGGA | CD19 | ACCAACTT | 14 | 1 | RCVCMP0000004 |
4 | TTGGTGTACTTCATTCTGTCGGGTC | AGCGGGGAGCTTGCGGTTTGTACAT | TAGTGGACCA | CD19 | ACCAACTT | 14 | 1 | RCVCMP0000005 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
26003768 | ATTTAGGTACGTTTGAGGATCAGGG | TTAGCTTCACAGCTGCTATCGGGTT | ATCAGCAACG | CD44 | CGGCTCAT | 2 | 1 | RCVCMP0000255 |
26003769 | TCTTATTTTAAAATAAAATTCTCAA | CTTTAACCTTCTTCAACTATGAGAA | TAAGAAGACA | CD44 | CGGCTCAT | 2 | 1 | RCVCMP0000779 |
26003770 | GTCTATGTGGTAGTTAGGAAATGTA | TGGATTTCTCTGCACAGAAGCGTTT | ATACTGACTC | CD44 | CGGCTCAT | 2 | 1 | RCVCMP0000494 |
26003771 | TAGGTAGATTGTGTCACGTTGACTT | GTGCAAGGTAGATCTGGAATGGACG | AGCTGTTTAC | CD44 | CGGCTCAT | 2 | 1 | RCVCMP0000572 |
26003772 | GGCGCGAGTGAATCGCGTGCTGTTC | TTAGTGATATGTCTCCTTCGCATGT | GGGAGGACGA | CD44 | CGGCTCAT | 2 | 1 | RCVCMP0001151 |
26003773 rows × 8 columns
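To illustrate how the bipartite graph follows from these rows, here is a minimal sketch using a hypothetical toy edge list for a single component. The short pixel identifiers are made up for readability, and each tuple stands for one row of the edgelist.

```python
from collections import Counter, defaultdict

# Hypothetical toy edge list for one component: (upia, upib, marker).
edges = [
    ("A1", "B1", "CD19"),
    ("A1", "B1", "CD44"),
    ("A1", "B2", "CD44"),
    ("A2", "B2", "CD19"),
    ("A3", "B2", "CD44"),
]

# The two node sets of the bipartite graph.
a_pixels = {a for a, _, _ in edges}
b_pixels = {b for _, b, _ in edges}

# One row per detected antibody molecule.
molecules = len(edges)
marker_counts = Counter(marker for _, _, marker in edges)

# Degree of each A pixel: the number of distinct B pixels it connects to.
neighbours = defaultdict(set)
for a, b, _ in edges:
    neighbours[a].add(b)
a_degree = {a: len(bs) for a, bs in neighbours.items()}

print(len(a_pixels), len(b_pixels), molecules)
print(dict(marker_counts), a_degree)
```

Summaries of this kind are what the component metadata table reports. For instance, dividing `molecules` by `a_pixels` in the first metadata row shown earlier (35040 / 3749) reproduces its `mean_molecules_per_a_pixel` value.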
Aggregating data
When we want to process and analyze samples in parallel, it is convenient to merge them into a single object. Here, we read multiple files and aggregate them into a single object. When doing so, the cell ID and sample ID are merged to form the new cell ID in the merged object (e.g. “S1_RCVCMP0000830”).
In the code chunk below, we will create a merged object from four PXL files. These PXL files represent resting and PHA-stimulated PBMC samples, each in duplicate. We will update the merged object’s metadata to indicate the condition and replicate number. The resulting object will be explored further in the next tutorial on Quality Control.
baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/0.18.x/technote-v1-vs-v2-immunology-II"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample06_V2_PBMC_r2.layout.dataset.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl"
from pixelator import simple_aggregate
# Paths to the four downloaded PXL files, in the same order as the sample names below
paths = [
    DATA_DIR / "Sample05_V2_PBMC_r1.layout.dataset.pxl",
    DATA_DIR / "Sample06_V2_PBMC_r2.layout.dataset.pxl",
    DATA_DIR / "Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl",
    DATA_DIR / "Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl",
]
# Each sample name encodes the condition and the replicate number
pg_data_combined = simple_aggregate(
    ["resting_r1", "resting_r2", "stimulated_r1", "stimulated_r2"],
    [read(path) for path in paths],
)
pg_data_combined
Pixel dataset contains:
AnnData with 4454 obs and 84 vars
Edge list with 105218620 edges
Polarization scores with 331421 elements
Colocalization scores with 12238659 elements
Contains precomputed layouts
Metadata:
samples: {'resting_r1': {'version': '0.18.0', 'sample': 'Sample05_V2_PBMC_r1', 'analysis': {'params': {'polarization': {'polarization': {'transformation': 'log1p', 'permutations': 50, 'min_marker_count': 5, 'random_seed': None}}, 'colocalization': {'colocalization': {'transformation_type': 'rate-diff', 'neighbourhood_size': 1, 'n_permutations': 50, 'min_region_count': 5, 'min_marker_count': 5}}}}}, 'resting_r2': {'version': '0.18.0', 'sample': 'Sample06_V2_PBMC_r2', 'analysis': {'params': {'polarization': {'polarization': {'transformation': 'log1p', 'permutations': 50, 'min_marker_count': 5, 'random_seed': None}}, 'colocalization': {'colocalization': {'transformation_type': 'rate-diff', 'neighbourhood_size': 1, 'n_permutations': 50, 'min_region_count': 5, 'min_marker_count': 5}}}}}, 'stimulated_r1': {'version': '0.18.0', 'sample': 'Sample07_V2_PHA_PBMC_r1', 'analysis': {'params': {'polarization': {'polarization': {'transformation': 'log1p', 'permutations': 50, 'min_marker_count': 5, 'random_seed': None}}, 'colocalization': {'colocalization': {'transformation_type': 'rate-diff', 'neighbourhood_size': 1, 'n_permutations': 50, 'min_region_count': 5, 'min_marker_count': 5}}}}}, 'stimulated_r2': {'version': '0.18.0', 'sample': 'Sample08_V2_PHA_PBMC_r2', 'analysis': {'params': {'polarization': {'polarization': {'transformation': 'log1p', 'permutations': 50, 'min_marker_count': 5, 'random_seed': None}}, 'colocalization': {'colocalization': {'transformation_type': 'rate-diff', 'neighbourhood_size': 1, 'n_permutations': 50, 'min_region_count': 5, 'min_marker_count': 5}}}}}}
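The sample names passed to `simple_aggregate` encode both the condition and the replicate number. A minimal sketch of splitting them into separate fields is shown below. `split_sample_name` is a hypothetical helper, and applying it to the merged object assumes that its `adata.obs` table carries a `sample` column with these names.

```python
def split_sample_name(sample: str) -> tuple[str, int]:
    """Split a name such as 'resting_r1' into ('resting', 1)."""
    condition, _, replicate = sample.rpartition("_r")
    return condition, int(replicate)


samples = ["resting_r1", "resting_r2", "stimulated_r1", "stimulated_r2"]
annotations = {name: split_sample_name(name) for name in samples}
print(annotations)

# Assumption: the aggregated obs table has a "sample" column holding these names.
# obs = pg_data_combined.adata.obs
# obs["condition"] = obs["sample"].map(lambda s: split_sample_name(s)[0])
# obs["replicate"] = obs["sample"].map(lambda s: split_sample_name(s)[1])
```

Keeping condition and replicate as separate columns makes it easier to group or compare cells by either variable later on.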
Save data
After combining the samples, we can save the merged data object to use later.
pg_data_combined.save(DATA_DIR / "combined_data.pxl", force_overwrite=True)
We have now seen how to load MPX data, inspect its key properties and prepare the data for integrated analysis across experimental samples. In the next tutorial we will apply these skills and demonstrate how to quality control MPX data.