Data Handling

This tutorial describes how to start working with the MPX output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.

After completing this tutorial, you will be able to:

  • Load PXL files in Python or R and access the multi-modal data contents including protein counts, metadata and spatial scores.
  • Understand and work with spatial polarity (protein clustering patterns).
  • Interpret and work with colocalization scores (protein co-clustering).
  • Directly access the spatial graph structure through the edge list.
  • Aggregate multiple PXL files into an integrated data object with sample identities.
  • Save aggregated data objects for later reuse in analyses.

Setup

First, we import the function we need to read PXL files.

from pixelator import read

Loading data

We begin by locating the output from the pixelator pipeline. The input for downstream analysis is the PXL file contained in the analysis directory.

from pathlib import Path

DATA_DIR = Path.cwd().parents[3] / "data"
pg_data = read(DATA_DIR / "Sample01_human_pbmcs_unstimulated.dataset.pxl")
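When an analysis directory holds several samples, it can help to list the available PXL files before reading any of them. The sketch below uses only the standard library. The helper name `find_pxl_files` and the `*.dataset.pxl` pattern are assumptions for illustration (the pattern is taken from the file name used above); they are not part of pixelator.

```python
from pathlib import Path


def find_pxl_files(directory: Path, pattern: str = "*.dataset.pxl") -> list[Path]:
    """Return the files in `directory` that match `pattern`, sorted by name.

    Hypothetical helper: the default pattern mirrors the file name used in
    this tutorial and may need adjusting for other pipeline outputs.
    """
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"Not a directory: {directory}")
    return sorted(directory.glob(pattern))
```

Calling `find_pxl_files(DATA_DIR)` would then return the paths that can be passed to `read`, one at a time.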

Component metadata

In addition to the counts data, the object contains metadata for each MPX graph component, where each component corresponds to a cell. This table is useful for quality control: it reports, for example, how many protein molecules were detected (edges), the sequencing depth (mean_reads), and the graph connectivity of each component (mean_upia_degree).

pg_data.adata.obs
               vertices  edges  antibodies  ...  leiden  tau_type       tau
component                                   ...
RCVCMP0000000      3940  23925          77  ...       2    normal  0.983287
RCVCMP0000002      2337   6719          72  ...       1    normal  0.973446
RCVCMP0000003      2942   8596          78  ...       5    normal  0.982575
RCVCMP0000005      4890  17206          79  ...       3    normal  0.973380
RCVCMP0000006      5271  21254          76  ...       2    normal  0.986411
...                 ...    ...         ...  ...     ...       ...       ...
RCVCMP0002333       586   2494          53  ...       6    normal  0.975075
RCVCMP0002356       270    640          55  ...       3    normal  0.971325
RCVCMP0003679       278    827          43  ...       4    normal  0.975404
RCVCMP0005679       490   3560          60  ...       8    normal  0.969577
RCVCMP0006318       132    660          45  ...       6    normal  0.979106

[477 rows x 18 columns]
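Because `adata.obs` is a pandas DataFrame, ordinary boolean indexing can be used to inspect or subset components by these metrics. The sketch below builds a small stand-in table from a few of the rows printed above and keeps components with at least `MIN_EDGES` edges. The threshold is a made-up value for illustration, not a recommended cutoff.

```python
import pandas as pd

# Toy stand-in for pg_data.adata.obs, using a few rows and columns
# copied from the printed output above.
obs = pd.DataFrame(
    {
        "vertices": [3940, 2337, 270, 132],
        "edges": [23925, 6719, 640, 660],
        "antibodies": [77, 72, 55, 45],
    },
    index=pd.Index(
        ["RCVCMP0000000", "RCVCMP0000002", "RCVCMP0002356", "RCVCMP0006318"],
        name="component",
    ),
)

# Hypothetical threshold, chosen only to illustrate the filtering pattern.
MIN_EDGES = 1000

keep = obs["edges"] >= MIN_EDGES  # boolean mask, one entry per component
filtered = obs.loc[keep]

print(filtered.index.tolist())
print(f"kept {int(keep.sum())} of {len(obs)} components")
```

The same pattern should apply to the real table by replacing `obs` with `pg_data.adata.obs`.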

The antibody count table is also present in the data object.

pg_data.adata.to_df()
marker         CD274  CD44  CD25  CD279  CD41  ...  CD64  CD49D  CD158  CD314  CD35
component                                      ...
RCVCMP0000000     62   553    12      4     6  ...     0     61      3      3   534
RCVCMP0000002     11   180     7      2     1  ...     0      9      1      2     5
RCVCMP0000003     25    66     6      0     3  ...     1     19      2      4     3
RCVCMP0000005     31   347     8      8     5  ...     3     26      0     12    18
RCVCMP0000006     16   213     5      2     6  ...     1     46      2      1   524
...              ...   ...   ...    ...   ...  ...   ...    ...    ...    ...   ...
RCVCMP0002333      6    96     1      0    73  ...     6      6      1      0   140
RCVCMP0002356      1     6     0      0     9  ...     0      1      0      0    10
RCVCMP0003679      0    14     1      0     3  ...     0      2      0      0     0
RCVCMP0005679      5    15    10      2  1046  ...     6      3      0      0    27
RCVCMP0006318      3    26     0      1     9  ...     0      0      0      0    16

[477 rows x 80 columns]
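The count table returned by `to_df()` is a components-by-markers DataFrame, so row-wise summaries are straightforward. As a minimal sketch, the example below takes three components and five markers from the output above, converts counts to within-row fractions, and finds the most abundant marker per component. Because only five of the 80 markers are included, the fractions are relative to that subset only.

```python
import pandas as pd

# Toy stand-in for pg_data.adata.to_df(): three components and five markers
# copied from the printed output above.
counts = pd.DataFrame(
    {
        "CD274": [62, 11, 5],
        "CD44": [553, 180, 15],
        "CD25": [12, 7, 10],
        "CD279": [4, 2, 2],
        "CD41": [6, 1, 1046],
    },
    index=pd.Index(
        ["RCVCMP0000000", "RCVCMP0000002", "RCVCMP0005679"], name="component"
    ),
)

# Total counts per component (row sums over the included markers).
totals = counts.sum(axis=1)

# Fraction of each component's counts contributed by each marker.
fractions = counts.div(totals, axis=0)

# Marker with the highest count in each component.
top_marker = counts.idxmax(axis=1)
print(top_marker)
```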

Spatial data

In addition to count data corresponding to protein abundance, MPX also generates spatial data, which can be used to visualize individual cells and to find spatial structures of proteins across cells. The .pxl file comes with MPX polarity scores, which describe the degree of spatial clustering of each protein on each cell, and MPX colocalization scores, which describe the degree of spatial colocalization of pairs of proteins on each cell. If you want to read more about spatial metrics in MPX data, click here.

MPX polarity scores

pg_data.polarization
       morans_i  morans_p_value  ...  marker      component
0     -0.002986        0.771680  ...    ACTB  RCVCMP0000830
1     -0.016127        0.177355  ...     B2M  RCVCMP0000830
2      0.014672        0.125056  ...   CD102  RCVCMP0000830
3     -0.006782        0.590223  ...   CD11a  RCVCMP0000830
4      0.000836        0.891148  ...   CD127  RCVCMP0000830
...         ...             ...  ...     ...            ...
33474 -0.014165        0.441007  ...  HLA-DR  RCVCMP0000033
33475  0.001574        0.860511  ...    TCRb  RCVCMP0000033
33476 -0.002854        0.070234  ...   mIgG1  RCVCMP0000033
33477 -0.000523        0.491835  ...  mIgG2a  RCVCMP0000033
33478 -0.000318        0.364643  ...  mIgG2b  RCVCMP0000033

[33479 rows x 6 columns]
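The polarity table is in long format, with one row per marker and component. Moran's I values near zero suggest no spatial autocorrelation, while positive values suggest clustering. A minimal sketch of two common manipulations follows: reshaping to a components-by-markers matrix, and picking the marker with the highest Moran's I in each component. It uses six rows copied from the output above and only the columns that are visible there; the elided columns are not used.

```python
import pandas as pd

# Toy stand-in for pg_data.polarization, built from rows printed above.
polarization = pd.DataFrame(
    {
        "morans_i": [-0.002986, -0.016127, 0.014672, -0.014165, 0.001574, -0.002854],
        "morans_p_value": [0.771680, 0.177355, 0.125056, 0.441007, 0.860511, 0.070234],
        "marker": ["ACTB", "B2M", "CD102", "HLA-DR", "TCRb", "mIgG1"],
        "component": ["RCVCMP0000830"] * 3 + ["RCVCMP0000033"] * 3,
    }
)

# Long -> wide: one row per component, one column per marker.
# Marker/component pairs absent from the toy data become NaN.
wide = polarization.pivot(index="component", columns="marker", values="morans_i")

# Row label of the highest Moran's I within each component.
best_rows = polarization.groupby("component")["morans_i"].idxmax()
top = polarization.loc[best_rows, ["component", "marker", "morans_i"]]
print(top)
```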

MPX colocalization scores

pg_data.colocalization
        marker_1 marker_2  ...  jaccard_p_value_adjusted      component
0           ACTB     ACTB  ...                       NaN  RCVCMP0000000
1           ACTB      B2M  ...              4.538319e-01  RCVCMP0000000
2            B2M      B2M  ...                       NaN  RCVCMP0000000
3           ACTB    CD102  ...              4.966833e-03  RCVCMP0000000
4            B2M    CD102  ...              6.962944e-09  RCVCMP0000000
...          ...      ...  ...                       ...            ...
1211818     CD82     TCRb  ...              2.251738e-01  RCVCMP0006318
1211819     CD86     TCRb  ...              4.009157e-01  RCVCMP0006318
1211820  HLA-ABC     TCRb  ...              4.971263e-01  RCVCMP0006318
1211821   HLA-DR     TCRb  ...              4.959882e-01  RCVCMP0006318
1211822     TCRb     TCRb  ...                       NaN  RCVCMP0006318

[1211823 rows x 15 columns]
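The colocalization table has one row per marker pair and component. In the output above, rows that pair a marker with itself carry NaN in `jaccard_p_value_adjusted`. A minimal sketch of dropping those self-pairs and keeping pairs below a significance level follows. It uses the five rows shown above, and `ALPHA` is a hypothetical cutoff chosen for illustration.

```python
import pandas as pd

# Toy stand-in for pg_data.colocalization, built from rows printed above.
coloc = pd.DataFrame(
    {
        "marker_1": ["ACTB", "ACTB", "B2M", "ACTB", "B2M"],
        "marker_2": ["ACTB", "B2M", "B2M", "CD102", "CD102"],
        "jaccard_p_value_adjusted": [
            float("nan"),
            4.538319e-01,
            float("nan"),
            4.966833e-03,
            6.962944e-09,
        ],
        "component": ["RCVCMP0000000"] * 5,
    }
)

# Hypothetical significance level, for illustration only.
ALPHA = 0.05

# Drop self-pairs (marker paired with itself).
pairs = coloc[coloc["marker_1"] != coloc["marker_2"]]

# Keep pairs whose adjusted p-value falls below the cutoff.
significant = pairs[pairs["jaccard_p_value_adjusted"] < ALPHA]

labels = (significant["marker_1"] + "/" + significant["marker_2"]).tolist()
print(labels)
```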

Edgelist

The edgelist lists the uniquely identified antibody molecules of every cell and, at the same time, constitutes the bipartite graph that makes up an MPX experiment. Each row contains an edge (a uniquely identified antibody), which is identified by a UMI, a UPIA, and a UPIB. The UPIA is the unique identifier of the specific A pixel, while the UPIB is the unique identifier of the B pixel. Both A pixels and B pixels may contain multiple edges, resulting in a spatial graph that can be used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.

pg_data.edgelist
                              upia  ...      component
0        ATTAGAAGGCGTTTTGTGGTTTCGG  ...  RCVCMP0000208
1        TTAACATAGATCAGGTATAGAATGG  ...  RCVCMP0000594
2        AAAAATACAGATCACAGTTGCGTGC  ...  RCVCMP0000085
3        ATTTGTTTATTGGAGTTATATGGCT  ...  RCVCMP0000828
4        TTTAGCGTTTTAGGTTCTGTGATTT  ...  RCVCMP0000328
...                            ...  ...            ...
4655110  GACATTTATCTTAGAAAAAGAGTCA  ...  RCVCMP0000332
4655111  ATAGCGGAGTATTCGGGCTAGTGAG  ...  RCVCMP0000160
4655112  TCTGATGCTTTGAGTCGAGACGGGC  ...  RCVCMP0000598
4655113  GATCTATTGTTGTGAGGGCGATGGC  ...  RCVCMP0000005
4655114  ACCGGTTCCTGGAATAATTGGTGGG  ...  RCVCMP0000989

[4655115 rows x 9 columns]
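To make the bipartite structure concrete, the sketch below builds the A-pixel and B-pixel neighbourhoods from a handful of edges using only the standard library. The edges and their short identifiers are made up for illustration (real UPIA and UPIB values are sequences like those shown above). The degree computed here, the number of distinct B pixels connected to each A pixel, is one plausible reading of a metric such as mean_upia_degree and is not confirmed to match pixelator's definition.

```python
from collections import defaultdict

# Hypothetical toy edges for a single component: (umi, upia, upib).
edges = [
    ("umi1", "A1", "B1"),
    ("umi2", "A1", "B1"),  # a second antibody molecule on the same pixel pair
    ("umi3", "A1", "B2"),
    ("umi4", "A2", "B2"),
]

# Bipartite adjacency: A pixels only connect to B pixels and vice versa.
a_neighbours = defaultdict(set)
b_neighbours = defaultdict(set)
for _umi, upia, upib in edges:
    a_neighbours[upia].add(upib)
    b_neighbours[upib].add(upia)

# Degree of each A pixel = number of distinct B pixels it is connected to.
upia_degree = {upia: len(nbrs) for upia, nbrs in a_neighbours.items()}
mean_upia_degree = sum(upia_degree.values()) / len(upia_degree)

print(upia_degree)
print(mean_upia_degree)
```

Two A pixels that share a B pixel (here A1 and A2 via B2) are neighbours in the spatial graph, which is what makes the edgelist usable for spatial statistics.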

Aggregating data

When we want to process and analyze several samples in parallel, it is convenient to merge them into a single object. Here, we read multiple files and aggregate them into one object. During aggregation, the sample ID and the cell ID are combined to form the new cell ID in the merged object (e.g. “S1_RCVCMP0000830”).

from pixelator import simple_aggregate

paths = [
    DATA_DIR / "Sample01_human_pbmcs_unstimulated.dataset.pxl",
    DATA_DIR / "Sample02_human_pbmcs_unstimulated.dataset.pxl",
]

pg_data_combined = simple_aggregate(
    ["sample1", "sample2"], [read(path) for path in paths]
)
pg_data_combined
Pixel dataset contains:
AnnData with 1056 obs and 80 vars
Edge list with 10667394 edges
Polarization scores with 74726 elements
Colocalization scores with 2724648 elements
Metadata:
samples: {'sample1': {'version': '0.12.0', 'sample': 'Sample01_SG_unstimulated_S1_001', 'file_format_version': 1, 'analysis': {'params': {'compute_polarization': True, 'compute_colocalization': True, 'use_full_bipartite': False, 'polarization_normalization': 'clr', 'polarization_binarization': False, 'colocalization_transformation': 'log1p', 'colocalization_neighbourhood_size': 1, 'colocalization_n_permutations': 50, 'colocalization_min_region_count': 5}}}, 'sample2': {'version': '0.12.0', 'sample': 'Sample02_SG_unstimulated_S2_001', 'file_format_version': 1, 'analysis': {'params': {'compute_polarization': True, 'compute_colocalization': True, 'use_full_bipartite': False, 'polarization_normalization': 'clr', 'polarization_binarization': False, 'colocalization_transformation': 'log1p', 'colocalization_neighbourhood_size': 1, 'colocalization_n_permutations': 50, 'colocalization_min_region_count': 5}}}}
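Component IDs are only unique within a sample, which is why aggregation combines them with the sample ID. The sketch below illustrates that idea with plain pandas; it is not the implementation of `simple_aggregate`. The helper `tag_with_sample` is hypothetical, and the ID format simply follows the “S1_RCVCMP0000830” example in the text above.

```python
import pandas as pd


def tag_with_sample(obs: pd.DataFrame, sample: str) -> pd.DataFrame:
    """Return a copy of `obs` with sample-prefixed component IDs.

    Hypothetical helper for illustration; not part of pixelator.
    """
    tagged = obs.copy()
    tagged.index = pd.Index(
        [f"{sample}_{component}" for component in obs.index],
        name=obs.index.name,
    )
    tagged["sample"] = sample
    return tagged


# Toy per-sample tables. The first table's edge counts are copied from the
# metadata output earlier in this tutorial; the second table is made up.
obs_1 = pd.DataFrame(
    {"edges": [23925, 6719]},
    index=pd.Index(["RCVCMP0000000", "RCVCMP0000002"], name="component"),
)
obs_2 = pd.DataFrame(
    {"edges": [1200, 3400]},
    index=pd.Index(["RCVCMP0000000", "RCVCMP0000001"], name="component"),
)

# RCVCMP0000000 occurs in both samples; the prefix keeps the rows distinct.
combined = pd.concat([tag_with_sample(obs_1, "S1"), tag_with_sample(obs_2, "S2")])
per_sample = combined.groupby("sample").size()

print(combined)
print(per_sample)
```

Keeping a `sample` column alongside the prefixed IDs makes it easy to group or split the merged table by sample later.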

Save data

After combining the samples, we can save the merged data object to use later.

pg_data_combined.save(DATA_DIR / "combined_data.pxl")

We have now seen how to load MPX data, inspect its key properties, and prepare the data for integrated analysis across experimental samples. In the next tutorial we will apply these skills and demonstrate how to perform quality control on MPX data.