Software for predicting T-cell epitopes. The one relevant to antibodies is NetMHCIIpan - it predicts 15-mer epitopes for each human MHC class II allele, scoring each peptide as a strong or weak binder relative to naturally occurring peptides.
MHC Class I (MHC-I) and MHC Class II (MHC-II). MHC-I predominantly presents peptides derived from intracellular proteins, whereas MHC-II predominantly presents peptides from extracellular proteins.
People use either binding-affinity data or eluted-ligand data.
Predictions can be made from either multi-allele or single-allele binding data.
The combined dataset used for training NetMHCpan-4.1 consists of 13,245,212 data points covering 250 distinct MHC class I molecules, and the combined dataset used for training NetMHCIIpan-4.0 consists of 4,086,230 data points covering a total of 116 distinct MHC class II molecules.
The core improvement is the integration of NNAlign_MA into NetMHCpan/NetMHCIIpan.
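A minimal sketch of how such output is typically thresholded (not the tool's own code; the %rank cutoffs below are the commonly cited defaults of the 4.x releases and should be checked against the version you run):

```python
# Illustrative only: classify peptides into strong/weak binders by %rank
# (rank of the prediction score against a set of random natural peptides).
# Assumed default cutoffs: MHC-I (NetMHCpan-4.1) 0.5%/2%, MHC-II (NetMHCIIpan-4.0) 2%/10%.

def classify_binder(percent_rank: float, mhc_class: str = "II") -> str:
    """Label a peptide by its %rank relative to random natural peptides."""
    strong, weak = (0.5, 2.0) if mhc_class == "I" else (2.0, 10.0)
    if percent_rank < strong:
        return "strong binder"
    if percent_rank < weak:
        return "weak binder"
    return "non-binder"

# e.g. a 15-mer with %rank 1.3 against an HLA-DR allele
print(classify_binder(1.3, mhc_class="II"))  # -> "strong binder"
```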
Polyspecificity and polyreactivity are related terms; however, the former is thought to be driven by factors such as overlapping epitopes, whereas polyreactivity is driven by excess charge or hydrophobicity.
The baculovirus particle (BVP) assay is often used to test polyreactivity: mAbs are added at high concentrations to BVP-coated plates.
They generated a polyreactivity dataset (~300 molecules) that was heterogeneous in terms of molecule type (antibodies and nanobodies), specificity (including monospecific binders) and format.
They tested different concentrations (from 6.67 nM to 667 nM) and well-coating conditions (percentage of BVP) - this was aimed at reducing noise from experimental conditions.
They tested two prediction modes: language models and structural descriptors. For language models, ProtT5, ESM2 and AntiBERTy were used; descriptors were calculated from AlphaFold2-Multimer models. The language-model predictions were superior to those calculated from the AF2-Multimer structures.
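A sketch of the language-model route, assuming a public ESM2 checkpoint from HuggingFace and a simple scikit-learn classifier on top (the sequences, labels and checkpoint size below are placeholders, not the authors' pipeline):

```python
# Sketch: embed each antibody sequence with ESM2, fit a classifier on polyreactivity labels.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t12_35M_UR50D").eval()

def embed(seq: str) -> torch.Tensor:
    """Mean-pooled last-layer ESM2 embedding of one sequence."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs).last_hidden_state   # (1, L+2, d)
    return out[0, 1:-1].mean(dim=0)               # drop special tokens, mean-pool

sequences = ["EVQLVESGGGLVQPGGSLRLSCAAS", "QVQLQQSGAELARPGASVKMSCKAS"]  # toy fragments
labels = [1, 0]  # 1 = polyreactive in the BVP assay, 0 = not (placeholder labels)

X = torch.stack([embed(s) for s in sequences]).numpy()
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X)[:, 1])  # predicted polyreactivity probability
```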
They introduced a set of single and double mutations based on the most likely variants proposed by an ensemble of language models (ESM). Most of the mutations not only preserved binding but actually improved it.
They performed evolution with the ESM-1b language model and the ESM-1v ensemble of five language models (six language models in total).
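The gist of the substitution-proposal step, sketched with a single ESM-1v model standing in for the full ensemble (masked-marginal scoring; the acceptance rule here is a simplification of the paper's consensus-across-models criterion):

```python
# Sketch: mask each position and propose substitutions whose model probability
# exceeds the wild-type residue's probability at that position.
import torch
import esm  # fair-esm package

model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()
model.eval()
batch_converter = alphabet.get_batch_converter()

def propose_substitutions(seq: str):
    """Return (position, wt, mutant) triples the model prefers over wild type."""
    _, _, tokens = batch_converter([("wt", seq)])
    proposals = []
    for i, wt in enumerate(seq):
        masked = tokens.clone()
        masked[0, i + 1] = alphabet.mask_idx          # +1 for the prepended BOS token
        with torch.no_grad():
            logits = model(masked)["logits"]
        probs = logits[0, i + 1].softmax(dim=-1)
        wt_p = probs[alphabet.get_idx(wt)]
        for aa in "ACDEFGHIKLMNPQRSTVWY":
            if aa != wt and probs[alphabet.get_idx(aa)] > wt_p:
                proposals.append((i + 1, wt, aa))
    return proposals

print(propose_substitutions("EVQLVESGGGLVQPGGSLRLSCAAS"))  # toy VH fragment
```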
In the first round of evolution, they measured by biolayer interferometry (BLI) the antigen-binding strength of variants that contain only a single-residue substitution from wild type.
In the second round, they measured variants containing combinations of substitutions, selecting substitutions that corresponded to preserved or improved binding based on the results of the first round.
They performed these two rounds for all seven antibodies, measuring 8–14 variants per antibody in round one and 1–11 variants per antibody in round two.
Across all seven antibodies, they found that 71–100% of the first-round Fab variants (containing a single-residue substitution) retained sub-micromolar binding to the antigen, and 14–71% of first-round variants led to improved binding affinity (defined as a 1.1-fold or higher improvement in Kd compared to wild type).
Thirty-six of the 76 language-model-recommended single-residue substitutions (and 18 of the 32 substitutions that led to improved affinity) occur in framework regions.
They found that Fabs for 21 out of the 31 language-model-recommended, affinity-enhancing variants that they tested had a higher melting temperature (Tm) than wild type, and all variants maintained thermostability (Tm > 70 °C).
They tested for polyspecificity and found no dramatic changes in the polyspecificity profile.
Five out of 32 affinity-enhancing substitutions (~16%) involve changing the wild-type residue to a rare or uncommon residue.
The approach based on general protein language models consistently outperformed all baseline methods, including antibody-specific ones (!).
They developed a method to predict protein-protein interactions (PPIs) from sequence, built on a large protein language model. They train ProtBERT to predict PPIs.
They use the BioGRID dataset, where interactors are included if they are confirmed by two independent sources, such as two independent experimental techniques in two separate studies. In total they have 179,018 positive pairs.
They use Negatome 2.0 as the negative dataset. It relies on various sources, such as manual curation from the literature or subunits in the PDB that do not interact with each other. A total of 3,958 pairs was used.
They use ProtBERT-BFD as the pretrained base model.
They encode each protein pair as [CLS] Protein A [SEP] Protein B [SEP], mapping the final output to a binary interaction label.
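A sketch of that encoding with HuggingFace's ProtBERT-BFD checkpoint (freshly initialized classification head; the actual SYNTERACT weights and training loop are not reproduced here):

```python
# Sketch: sequence-pair classification on top of ProtBERT-BFD.
# ProtBERT expects residues separated by spaces.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
model = BertForSequenceClassification.from_pretrained("Rostlab/prot_bert_bfd", num_labels=2)

protein_a = "M K T A Y I A K Q R"   # toy sequences
protein_b = "M S D N E L V K L"

# Builds: [CLS] protein_a [SEP] protein_b [SEP]
inputs = tokenizer(protein_a, protein_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))  # class probabilities (meaningful only after fine-tuning)
```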
They achieve 92% accuracy on the test set.
They also evaluate on a dataset where negatives are proteins from different subcellular compartments. On positive samples in this dataset the model was 85% accurate; on negative samples, SYNTERACT was only 38% accurate, classifying many of the compartment-sampled negatives as interactors.
Proposing a Bayesian scheme to optimally select generated antibodies from a previously introduced language model (GLM).
They use the 1B GLM-AB model from BioMap. Training involves a variation of MLM that masks entire spans of the sequence.
The entire point is how to 'select' better antibodies according to some unknown 'fitness function' f. If you only get a few experimental data points at a time to evaluate f, you'd better make them count. Their combination of a Bayesian scheme and a language model optimizes how the next generated sequences are picked so that the best approximation to f is reached.
They use the Absolut! framework (a computational simulation) rather than wet-lab data.
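A generic sketch of such a selection loop, with a Gaussian-process surrogate and expected-improvement acquisition standing in for their scheme (the embeddings and the fitness oracle below are synthetic placeholders, not their model or data):

```python
# Sketch: pick the next candidate to "measure" so that the surrogate of f improves fastest.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
candidates = rng.normal(size=(200, 32))                 # stand-in for LM embeddings of generated Abs
measure_fitness = lambda x: -np.linalg.norm(x - 0.5)    # stand-in for the unknown fitness f

measured_idx = list(rng.choice(len(candidates), size=5, replace=False))
y = [measure_fitness(candidates[i]) for i in measured_idx]

for _ in range(10):                                     # ten rounds of "experiments"
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
    gp.fit(candidates[measured_idx], y)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = max(y)
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    ei[measured_idx] = -np.inf                            # don't re-measure
    nxt = int(np.argmax(ei))
    measured_idx.append(nxt)
    y.append(measure_fitness(candidates[nxt]))

print("best fitness found:", max(y))
```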
Demonstration that training transformers on paired antibody data provides improvements over single-chain models. They created two models - one trained on single chains from the Jaffe dataset, the other with the pairing information.
When comparing embeddings, they only extracted one sequence from the paired transformer to make the comparison with the single-sequence transformer sound.
They showed what happens when one performs UMAP on the light-chain embeddings of the paired and unpaired transformers. The unpaired transformer produces a more random dispersal, whereas the paired model yields much tighter clustering, similar to heavy chains. Performance on heavy-chain clustering is similar for both.
They asked for predictions of masked positions in heavy chains when the chain is paired with either the native (mutated) light chain or the back-mutated germline one. The cross-entropy loss was much better when the prediction was made in the presence of the native mutated light chain.
They contrasted their paired model with ESM2 (650M). They fine-tuned ESM2 on the paired data and averaged all attention heads over all layers to get a single score. The fine-tuned ESM2 attends to the conserved cysteines and CDR regions, whereas the base ESM2 does not, focusing more on linear stretches.
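A sketch of the attention-averaging step on the base ESM2 (their fine-tuned paired checkpoint is their own; checkpoint size and the toy sequence are placeholders):

```python
# Sketch: average attention over all layers and heads into one L x L map,
# then see which positions receive the most attention.
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t12_35M_UR50D").eval()

seq = "EVQLVESGGGLVQPGGSLRLSCAASGFTFS"  # toy heavy-chain fragment
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (1, heads, L, L) tensor per layer
attn = torch.stack(out.attentions).mean(dim=(0, 1, 2))   # -> (L, L) averaged map
received = attn.mean(dim=0)                               # attention each token receives
top = torch.topk(received[1:-1], k=5).indices + 1         # skip CLS/EOS tokens
print("most-attended residue positions (1-based):", top.tolist())
```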
Introducing ESM-2 and ESMFold. Scaling the transformer parameter count to 15B allows for more accurate structure prediction.
They make available an atlas of 617 million predicted structures.
The learning objective is MLM, masking 15% of the input protein sequence.
Perplexity ranges from 1 for a perfect model to 20 for a model that makes predictions at random. Intuitively, perplexity describes the number of amino acids the model is uncertain between when it makes a prediction.
After 270k training steps the 8M parameter model has a perplexity of 10.45, and the 15B model reaches a perplexity of 6.37.
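For intuition, a masked pseudo-perplexity of a single sequence can be computed as exp of the mean cross-entropy over positions, each masked in turn (the paper's numbers are computed on held-out data during training; the small checkpoint and toy sequence below are placeholders):

```python
# Sketch: masked pseudo-perplexity of one sequence under a small ESM-2 checkpoint.
import math
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D").eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
tokens = tokenizer(seq, return_tensors="pt")["input_ids"]

nlls = []
for i in range(1, tokens.shape[1] - 1):       # skip CLS/EOS
    masked = tokens.clone()
    true_id = masked[0, i].item()
    masked[0, i] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(input_ids=masked).logits
    log_probs = logits[0, i].log_softmax(dim=-1)
    nlls.append(-log_probs[true_id].item())

print("pseudo-perplexity:", math.exp(sum(nlls) / len(nlls)))
```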
The 15B model achieves the best perplexity and structure-prediction accuracy.
For some structures, the accuracy of structure prediction jumps from 7.7 Å at 8M parameters to 7.0 Å at 35M parameters and to 3.2 Å at 150M parameters. The 3B model brings it down to 2.8 Å and the 15B model to 2.6 Å. For other structures, good prediction is only achieved at 15B.
Their structure predictor closely follows AlphaFold2, but instead of the Evoformer they use the representation from ESM-2.
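Basic ESMFold usage along the lines of the fair-esm README (assumes the package is installed with the esmfold extras and a GPU is available; the sequence is a placeholder):

```python
# Sketch: single-chain structure prediction with ESMFold, written out as a PDB file.
import torch
import esm

model = esm.pretrained.esmfold_v1().eval().cuda()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)
```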
They contrast ESM with some other language models and show that, in a zero-shot fashion, its scores correlate with experimental measurements of variant effects.
They compare the performance of ESM and DeepScan on 41 deep mutational scanning datasets collated in a single paper. They claim ESM has better overall correlations, but this is not crystal clear from the graph nor, by their own admission, from a paired t-test.
They find that pretraining on UniRef30 gives the worst performance. Performance is reasonable for UniRef50 or UniRef70, with a dip again at UniRef100.
Binding sites have much higher conservation.
The core of the protein also appears to have lower conservation.
A 100B-parameter protein model, and a 1B antibody-specific model obtained by fine-tuning on antibody data.
For PLM training they employ data from UniRef90 and the ColabFold database. After filtering and deduplication they are left with approximately 350M sequences, or 100B tokens.
On proteins, xTrimoPGLM-100B outperforms ESM2-15B on 12 of 15 downstream tasks (e.g. thermostability, structure prediction).
They train a 1B protein model and then fine-tune it on antibodies from OAS.
Their masking procedure includes span masking, not just masking individual residues at a time.
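A generic span-masking sketch, not the xTrimoPGLM implementation (the span-length distribution, mask rate and mask symbol here are arbitrary illustrative choices):

```python
# Sketch: mask ~15% of residues in contiguous spans rather than as isolated tokens.
import random

def span_mask(seq: str, mask_rate: float = 0.15, mean_span: int = 3, mask_char: str = "#"):
    """Mask roughly mask_rate of the residues in contiguous spans."""
    seq = list(seq)
    n_to_mask = max(1, int(len(seq) * mask_rate))
    masked = 0
    while masked < n_to_mask:
        span = max(1, round(random.expovariate(1 / mean_span)))  # sample a span length
        start = random.randrange(len(seq))
        for i in range(start, min(start + span, len(seq))):
            if seq[i] != mask_char:
                seq[i] = mask_char
                masked += 1
    return "".join(seq)

random.seed(0)
print(span_mask("EVQLVESGGGLVQPGGSLRLSCAASGFTFS"))
```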
They use 678M OAS sequences.
They benchmarked the antibody model on naturalness and antibody structure prediction, and the OAS-fine-tuned xTrimoPGLM model outperformed ESMFold, AlphaFold2 and IgFold.