# Algorithms

This section aims to explain briefly some of the algorithms used in **pixelator**. For extended explanations, if relevant, we try to include references in each sub-section.

## Edge Collapsing

From the sampled NGS reads, **pixelator** tries to reconstruct the original molecules that have been assayed by MPX. In order to achieve this, we use the combination of the antibody barcode, the UMI and UPIA to uniquely associate all UPIBs that show that fragment combination in the reads. Once that is done, after translating each nucleotide sequence to a binary sequence (using 2-bit encoding), we use Annoy (Approximate Nearest Neighbors Oh Yeah) to cluster those fragments at a certain Hamming distance from each other (default: 2 mismatches). The algorithm will then group a set of Ab/UMI/UPIA fragments into a different groups where all fragments are separated just by less than the given Hamming distance from each other.

The reason to use Annoy is that many to many comparisons takes a lot time. Performance in a worst case scenario is O(N^2) algorithm. On the other hand, Annoy is very quick and reduces the time complexity to O(log N).

Along the groups of fragments, **pixelator** creates a frequency table and we pick the most frequent UPIB and associate that with the Ab/UMI/UPIA fragment. That fragment combination with most frequent UPIB becomes the representative molecule or **edge** from each of the groups. **Pixelator** stores all edges in a large table, and for each edge it computes the total count of UPIBs (associated to the number of reads of a certain molecule), the number of fragments, and the total number of unique UPIBs.

The number of resulting edges is dependant on many factors: the origin of the cells, the experimental conditions (e.g. any stimulation of the cells), the number or reads sequenced for the library, and the number of cells in the experiment.

## Multiplet recovery

From the edge list, **pixelator** generates an undirected bipartite graph where each Ab/UMI combination forms an edge, and each UPIA and UPIB form a node. Each graph is a representation of a putative cell but at this stage of the pipeline we use the concept of a component to refer to them.

Due to the complex nature of the assay biochemistry and the fact that the reaction is carried on a single tube, MPX edge lists may have a proportion of undesired artefactual edges. Those few edges (only a small proportion of the total edge number of the experiment) are removed in a step we call **multiplet recovery** in the `graph`

pipeline stage if the `--multiplet-recovery`

removal flag is activated (by default it is on).

This removal is achieved by using community detection algorithms that operate on the graph. Densely connected communities are assumed to represent cells and **pixelator** removes any spurious edges connecting them. Each component of the resulting edge list is assigned a recovered ID and **pixelator** saves a table of all the recovered IDs in relation to original IDs.

## Edge Rank Plot and Cell Calling

The edge rank plot is a useful tool to visualize the range and distribution of antibody counts (edges) across cells
called by **pixelator**, and to help you to assess the data quality. The edge rank plot is constructed by visualizing all
graph components (cells) ranked from highest to lowest by the number of unique antibodies (edges). **Pixelator** uses
the edge rank plot internally to perform an automatic cell calling by finding the threshold point which distinguishes
real cells from a background of small spurious components. The threshold is recognized as the elbow point, where
the component size (edges) rapidly declines in relation to the rank. Although this process is automatic, it is advisable
to plot the edge rank plot and set a manual cutoff to refine the selection of cells. The cutoff should be selected to
approximate the elbow point where the component size is sharply declining in relation to the size rank. The number of
edges that corresponds to an appropriate cutoff will vary depending on the dataset, as the number of molecules detected
varies by multiple factors, such as sequencing depth, cell type, cell size, and cell states, such as whether the cell
has undergone any type of stimulation or not.

## Antibody Count Distribution Outliers

Cell components might rarely have a diverging distribution of counts across proteins, such that the counts might be skewed to a single
protein or evenly distributed to most proteins. This can be an indication that the component is not originating from
a cell, and could instead originate from debris or from an antibody aggregate. Antibody aggregates form rarily,
and usually have either high complexity, consisting of wide range of different antibodies, or low complexity, mainly constituted
by a single antibody. **Pixelator** detects these by using the metric Tau (`tau`

), adjusted from Yanai *et al.*,
which measures the skewness of antibody counts across different antibodies for a given component. Tau is a numerical value
ranging from 0 and 1 describing the degree to which skewness the distribution of counts across markers. A Tau of 0 indicates
equal distribution of counts across all antibodies and a Tau of 1 indicates that all counts are distributed to a single antibody.
**Pixelator** classifies components by their Tau score in `tau_type`

as `high`

if the Tau score is above 0.995 or deviates from the
population median with more than 2 interquartile ranges (IQRs), and as `low`

if the Tau score is lower than 5 IQRs from the
population median. If the `tau_type`

is neither `high`

or `low`

, the `tau_type`

is set to `normal`

. A typical down-stream
quality control will involve removing components marked as `high`

or `low`

`tau_type`

.

Additionally, aggregates often have dense pixels with higher levels of antibodies detected per each pixel, which can be measured
using the `mean_umi_per_upia`

metric stored in the component meta data.

## MPX Polarity Scores

The polarity score aims to describe the degree of spatial clustering a protein exhibits upon the surface of a cell. It
is calculated as Moran's I (`morans_i`

), a statistic of spatial autocorrelation originally developed by Patrick Alfred Pierce Moran.
In the context of MPX, `morans_i`

measures the degree of spatial clustering of a protein within the graph component (the cell),
and ranges from -1 (perfectly dispersed), to 1 (perfectly clustered), where 0 represents no spatial autocorrelation. `morans_i`

can be represented as a z-score (`morans_z`

), which is calculated as the number of standard deviations the Moran's I statistics
deviates from the expected mean of an analytical random distribution. The Z-score aims to normalize the polarity score to make the metric comparable between cells,
but the two metrics are correlated in MPX data, and often yields similar result.

## MPX Colocalization Scores

The MPX colocalization score is a metric to assess the degree of cooccurence of a pair of proteins within small
neighborhoods upon a cell. The scores are calculated as 1) the Pearson correlation (`pearson`

) and 2) Jaccard index (`jaccard`

)
between counts of two markers within a neighborhood of one A pixel and its immediate neighboring A pixels. **Pixelator**
precalculates colocalization scores and performs a Monte Carlo simulation, to recast the score to reflect its deviation
from the degree of colocalization that would be expected by random chance. The resulting score is a Z-score where values
above zero indicate higher degree of colocalization than would be expected by random chance, while lower than zero indicates
less colocalization than expected by random chance.