A geometric transformer that receives a single structure as input and annotates each residue with the likelihood that it is part of a binding site.
The geometric transformer uses only atom names as input features (no mass, charge, etc.).
Similar to a convolution, their geometric attention mechanism focuses on the 8 nearest neighbors (~3.2 Å) and then increases up to 64 nearest neighbors (~8.2 Å).
They use ~300,000 chains from the PDB for training (!). This number comes from extracting all biological assemblies and clustering at 30% sequence identity.
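A minimal sketch of this kind of k-nearest-neighbor restriction (not the PeSTo code; the layer schedule below is purely illustrative):

```python
import numpy as np

def knn_indices(coords: np.ndarray, k: int) -> np.ndarray:
    """coords: (N, 3) atom coordinates; returns (N, k) indices of the k nearest neighbors."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # (N, N) pairwise distances
    np.fill_diagonal(d, np.inf)                      # exclude each atom from its own neighborhood
    return np.argsort(d, axis=-1)[:, :k]             # k nearest atoms per atom

# illustrative schedule: attention restricted to 8 neighbors early on, up to 64 later
coords = np.random.rand(500, 3) * 50.0
neighborhoods = {k: knn_indices(coords, k) for k in (8, 16, 32, 64)}
```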
They defined the most common atom names across all molecule types, which gave them 79 elements; interactions between these elements can be represented as a 79x79 matrix.
The interaction cutoff is taken as 5 Å.
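A sketch of how such a 79x79 interaction matrix could be tallied (assumption: the atom-name vocabulary below is a truncated stand-in for the 79 names defined in the paper):

```python
import numpy as np

# placeholder for the 79 canonical atom names defined in the paper
ATOM_TYPES = ["N", "CA", "C", "O", "CB"]             # ... remaining names omitted
TYPE_INDEX = {name: i for i, name in enumerate(ATOM_TYPES)}

def interaction_counts(coords: np.ndarray, names: list[str], cutoff: float = 5.0) -> np.ndarray:
    """coords: (N, 3) atom coordinates; names: N atom names; returns a TxT contact-count matrix."""
    T = len(ATOM_TYPES)
    counts = np.zeros((T, T), dtype=int)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i_idx, j_idx = np.where((d < cutoff) & (d > 0.0))  # atom pairs within the 5 Å cutoff, self excluded
    for i, j in zip(i_idx, j_idx):
        counts[TYPE_INDEX[names[i]], TYPE_INDEX[names[j]]] += 1
    return counts
```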
PeSTo outperforms ScanNet by a large margin: 0.93 vs 0.87 ROC AUC.
In some cases, processing MD trajectories of unbound proteins with PeSTo identifies certain interfaces better than when PeSTo is run on the starting static structure.
ProteinMPNN is a framework that receives a backbone and generates the most probable sequence to fit it. It is firmly aimed at protein design, where you have a binding interface or a structure that you 'need to fit'.
Using distances is better than using dihedral angles: this resulted in a sequence recovery increase from 41.2% (baseline model) to 49.0% (experiment 1; Table 1 in the paper). Interatomic distances evidently provide a better inductive bias for capturing interactions between residues than dihedral angles or N-Cα-C frame orientations.
We found that training models on backbones to which Gaussian noise (std=0.02Å) had been added improved sequence recovery on confident protein structure models generated by AlphaFold (average pLDDT>80.0) from UniRef50, while the sequence recovery on unperturbed PDB structures decreased.
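A minimal sketch of that augmentation, adding Gaussian noise (std = 0.02 Å) to backbone coordinates before training:

```python
import numpy as np

def perturb_backbone(coords: np.ndarray, std: float = 0.02, seed: int | None = None) -> np.ndarray:
    """coords: (N, 3) backbone atom coordinates in Å; returns a noised copy (std in Å)."""
    rng = np.random.default_rng(seed)
    return coords + rng.normal(0.0, std, size=coords.shape)
```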
They employed a scaffold made by Rosetta that was supposed to house a peptide recognizing a target protein. The Rosetta designs failed, but when they used ProteinMPNN to generate sequences for the scaffold, the designs bound even better than the original bare peptide.
They created a siamese EGNN: one branch is given the wild-type complex, the other the mutant, and the difference between their outputs is the ΔΔG prediction.
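A sketch of that siamese arrangement (not the authors' code; `encoder` stands in for the shared EGNN and is assumed to return a pooled per-complex embedding):

```python
import torch
import torch.nn as nn

class SiameseDDG(nn.Module):
    """Shared encoder scores WT and mutant complexes; ΔΔG is the difference of the two scores."""
    def __init__(self, encoder: nn.Module, embed_dim: int):
        super().__init__()
        self.encoder = encoder                 # the same EGNN weights serve both branches
        self.head = nn.Linear(embed_dim, 1)    # maps the pooled embedding to a scalar score

    def score(self, graph) -> torch.Tensor:
        return self.head(self.encoder(graph))

    def forward(self, wt_graph, mut_graph) -> torch.Tensor:
        return self.score(mut_graph) - self.score(wt_graph)   # predicted ΔΔG
```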
They used the AB-Bind dataset, which consisted of 645 mutants from 29 complexes.
They created a set of non-redundant antibody-antigen binders comprising 1,475 complexes, clustering antigens at 70% sequence identity.
They mutated one complex per cluster and ran FoldX, resulting in 942,723 FoldX ΔΔG data points.
On the AB-Bind dataset they achieve a Pearson correlation of 0.8; however, when stringent CDR-based cutoffs are imposed, the correlation drops dramatically, indicating overfitting.
When trained on the synthetic FoldX dataset, the model stops being sensitive to this overfitting.
Using AF2, they developed a pipeline to fold and dock proteins simultaneously. The pipeline shows good performance at distinguishing interacting from non-interacting proteins.
Acceptable models are those with DockQ > 0.23; the success rate is defined as the percentage of acceptable poses.
The best version of their model achieves a 39.4% success rate.
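For reference, the success-rate metric is just the fraction of poses above the DockQ threshold:

```python
def success_rate(dockq_scores: list[float], threshold: float = 0.23) -> float:
    """Percentage of poses with DockQ above the 'acceptable' threshold."""
    acceptable = sum(score > threshold for score in dockq_scores)
    return 100.0 * acceptable / len(dockq_scores)

# e.g. success_rate([0.05, 0.30, 0.60, 0.10]) -> 50.0
```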
AlphaFold2 outperforms other docking methods.
Using the number of Cβ atoms in contact (within 8 Å) or the pLDDT of the interface gives a ROC AUC of around 0.9 for distinguishing interacting from non-interacting proteins.
To model the interaction, they insert a chain break (a gap of 200 in the residue index) between the two chains of the input.
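A sketch of the simpler of the two interaction scores, counting inter-chain Cβ contacts within 8 Å in the predicted complex (averaging interface pLDDT would be the analogous computation):

```python
import numpy as np

def n_cb_contacts(cb_a: np.ndarray, cb_b: np.ndarray, cutoff: float = 8.0) -> int:
    """cb_a: (Na, 3), cb_b: (Nb, 3) Cβ coordinates of the two chains."""
    d = np.linalg.norm(cb_a[:, None, :] - cb_b[None, :, :], axis=-1)
    return int((d < cutoff).sum())            # number of inter-chain Cβ pairs within 8 Å
```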
They note that it is very important to create the right MSAs for AF2.
As negative cases (non-interacting proteins) they employ data from the Negatome.
They draw from the MaSIF method in that they define a triangular surface mesh; each vertex is encoded with physicochemical information, and each patch of a defined radius is then encoded numerically.
They train overlapping patches to have similar embeddings, as overlapping patches are assumed to have overlapping functions as well.
They employed contrastive learning, labelling patch pairs as positive if their centers were within 1.5 Å of each other and as negative if they were centered on vertices 5 Å apart.
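A sketch of a standard margin-based contrastive loss over such patch pairs (not the SurfaceID implementation; the margin value is an assumption):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     is_positive: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """emb_a, emb_b: (B, D) patch embeddings; is_positive: (B,) 1.0 for positive pairs, 0.0 otherwise."""
    d = F.pairwise_distance(emb_a, emb_b)
    pos = is_positive * d.pow(2)                           # pull positive pairs together
    neg = (1.0 - is_positive) * F.relu(margin - d).pow(2)  # push negative pairs beyond the margin
    return (pos + neg).mean()
```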
Their learned similarity distances cluster by curvature, hydropathy and charge.
They compared SurfaceID to structure-based similarity measurement approaches, with SurfaceID performing slightly better.
They clustered the antibody and epitope patches simultaneously. The clustering recovered distinct binding modes for HIV-1 gp120, two for influenza HA, and one for the SARS-CoV-2 RBD. The anti-HA clusters shared the same epitope but had different paratopes, showing that the algorithm distinguishes at that level.
They proposed an antibody design scheme: search for epitopes similar to the query with SurfaceID and use the corresponding antibodies as putative binders to the query.
Using a siamese network and SAbDab to predict antibody-antigen binding in a binary fashion.
They clustered the antigens at 90% sequence identity and assumed that similar antibodies from the same antigen group bind in the same manner. This resulted in 3,892 antigen pairs.
They also created a dataset of COVID-specific antibodies with 9,309 positive samples and 1,710 negatives.
They used the CKSAAP encoding but compared it against others such as one-hot, PSSM, and their own trained word2vec.
They benchmark the different encodings and models to show that CKSAAP + CNN comes out on top.
Their siamese CNN with CKSAAP achieves a staggering 0.85 PR AUC.
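A rough sketch of the CKSAAP (composition of k-spaced amino-acid pairs) encoding, assuming the standard formulation (count residue pairs separated by each gap g = 0..k_max and normalize, 400 dimensions per gap):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
PAIR_INDEX = {a + b: i for i, (a, b) in enumerate((x, y) for x in AA for y in AA)}

def cksaap(seq: str, k_max: int = 3) -> np.ndarray:
    blocks = []
    for gap in range(k_max + 1):
        counts = np.zeros(len(PAIR_INDEX))
        n_pairs = len(seq) - gap - 1
        for i in range(n_pairs):
            pair = seq[i] + seq[i + gap + 1]
            if pair in PAIR_INDEX:                 # skip non-standard residues
                counts[PAIR_INDEX[pair]] += 1
        blocks.append(counts / max(n_pairs, 1))    # normalize by the number of pairs at this gap
    return np.concatenate(blocks)                  # shape: (400 * (k_max + 1),)
```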
A mildly flexible docking tool that runs very fast compared to traditional docking methods.
They used the DIPS dataset of about 42,000 binary complexes from the PDB.
They represent proteins as graphs: nodes carry ESM2-650M embeddings, and edges carry distances alongside the trRosetta-style orientations.
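A sketch of such a graph construction (assumptions: per-residue ESM2-650M embeddings are precomputed, only Cα-Cα distances are used as edge features here, and the k-NN cutoff is illustrative; orientation features are omitted):

```python
import numpy as np

def build_graph(esm_embeddings: np.ndarray, ca_coords: np.ndarray, k: int = 16):
    """esm_embeddings: (N, 1280) precomputed per-residue ESM2-650M features; ca_coords: (N, 3)."""
    d = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=-1)[:, :k]                  # (N, k) nearest-neighbor residues
    src = np.repeat(np.arange(len(ca_coords)), k)
    dst = nbrs.reshape(-1)
    edge_attr = d[src, dst][:, None]                      # Cα-Cα distance per edge
    return esm_embeddings, (src, dst), edge_attr          # node features, edge index, edge features
```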
The graph module's output feeds into a structure module that, similar to AF2, iteratively recycles the rotations of the two proteins.
The number of trainable parameters is 4.3 M.
The losses are taken from AF2-Multimer: FAPE, lDDT-Cα, and a structure violation loss.
A DockQ score above 0.23 is considered a successful dock.
The method runs in seconds, which is significantly faster than typical docking methods.
Though faster, it does not perform better than traditional docking methods.
They show that GeoDock induces only minimal backbone movement even though its training data consists exclusively of bound complexes: the predicted structures closely resemble the initial unbound structures.
A tour de force on the impact of NGS sequencing depth and clustering on picking hits from display campaigns, and a repository of information for a detailed walk-through of a display campaign.
Altogether, NGS is a big help relative to random colony picking: better binders can be produced along with larger epitope diversity.
Campaigns were run against three (related) antigens: the SARS-CoV-2 trimer, the monomeric S1, and the RBD.
They define a set of research questions on the relation between NGS statistics and kinetics, e.g., does higher frequency in NGS correlate with higher affinity?
VH and VL sequences for 200 unique antibodies were synthesized, cloned into mammalian IgG expression vectors, and expressed and purified as complete IgG molecules. Of the 200 antibodies tested, 169 (84.5%) exhibited affinities <1 µM for RBD, S1, or the trimer. The 200 distinct antibody sequences were selected from 57 well-defined clusters: 41 clusters at the convergence of the three target populations, 1 cluster exclusive to the S1 population, 1 cluster exclusive to the RBD population, and 14 clusters derived from the trimer NGS population. The selection took the most abundant representative per cluster, regardless of whether it intersected with S1 or RBD.
Clustering methods used were 100% sequence identity, clonotyping, and their own unsupervised clustering (AbScan).
AbScan is an in-house unsupervised clustering method.
The AbScan clustering typically results in higher diversity relative to traditional clonotyping.
The abundance of the top representative in each AbScan cluster gives the best correlation with binding affinity.
One of the chief advantages of clustering is identifying sequences within a cluster of interest that carry fewer liabilities.
They trained an XGBoost model on NGS statistics and related features to discriminate binders from non-binders (though the dataset is very small).
Same cluster = same epitope
They employ an abundance threshold of 0.005% on concatenated CDRs as a basic way to discriminate binders from non-binders.
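A minimal sketch of that baseline filter (assuming NGS read counts per concatenated-CDR clone):

```python
def call_binders(cdr_counts: dict[str, int], threshold_pct: float = 0.005) -> set[str]:
    """cdr_counts maps a concatenated-CDR string to its NGS read count."""
    total = sum(cdr_counts.values())
    return {cdr for cdr, n in cdr_counts.items() if 100.0 * n / total > threshold_pct}
```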
At a high level, AbScan works as follows: an unsupervised machine learning approach clusters specific regions of interest such as HCDR3. The clustering draws on various sequence-related properties and NGS statistics (including relative abundance and round-to-round enrichment) for different regions of interest (HCDR3, HCDR3 + LCDR3, concatenated CDRs), and employs several algorithms: the elbow method, Ordering Points To Identify the Clustering Structure (OPTICS), physicochemical reduction of the amino-acid space, traditional clonotyping, and Levenshtein distance (LD).
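A rough sketch of one ingredient of such a pipeline (not the AbScan code): clustering HCDR3 sequences with OPTICS on a precomputed Levenshtein distance matrix.

```python
import numpy as np
from sklearn.cluster import OPTICS

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def cluster_hcdr3(hcdr3s: list[str], min_samples: int = 3) -> np.ndarray:
    """Returns OPTICS cluster labels (-1 = noise) for a list of HCDR3 sequences."""
    n = len(hcdr3s)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = levenshtein(hcdr3s[i], hcdr3s[j])
    return OPTICS(metric="precomputed", min_samples=min_samples).fit(dist).labels_
```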