INDI - an integrated nanobody database for immunoinformatics

nanobodies, database, INDI

Executive summary: To design nanobody-based drugs, scientists need databases that collate the increasing volume of biological sequence data in the public domain. To address this problem, we developed the Integrated Nanobody Database for Immunoinformatics (INDI) - a novel nanobody database that collates nanobody information from all the major data repositories in the public domain. By using INDI, scientists get an accurate picture of the current state of knowledge about nanobody sequence, structure, and function to accelerate their research.

Note: This article covers content published in Piotr Deszyński, Jakub Młokosiewicz, Adam Volanakis, Igor Jaszczyszyn, Natalie Castellana, Stefano Bonissone, Rajkumar Ganesan, Konrad Krawczyk. INDI—integrated nanobody database for immunoinformatics. Nucleic Acids Research, 2021; gkab1021, https://doi.org/10.1093/nar/gkab1021

Introduction

Clinical development of antibody-based therapeutics is complex and time-consuming, often taking more than a decade. This challenge arises because of the inherent complexity of antibodies.

A conventional antibody is built from two types of polypeptide chain, heavy and light, that need to be co-engineered and co-expressed. Moreover, the protein is large, which makes delivery more challenging, especially in scenarios such as tumor penetration.

This encourages scientists to use a subclass of antibodies discovered in camelids: single-domain antibodies, also known as VHHs or nanobodies. The first drug based on this novel format was approved in 2018.

However, developing nanobodies with the help of traditional laboratory approaches takes years. Computational approaches can accelerate this process to deliver life-saving therapeutics faster and make them more affordable.

Problem

Successful computational approaches that could solve the nanobody riddle need to rely on quality sequence and structure data on the biology of these molecules. The iCAN and sdAb-DB databases paved the way for collecting nanobody-related data.

But because these databases relied on manual identification of antibodies, they offer a relatively small volume of publicly available nanobody data (2391 and 1452 sequences reported, respectively).

To design nanobody-based drugs, scientists need database solutions capable of collecting the increasing volume of biological sequence data in the public domain.

Solution

To address this problem, we developed the Integrated Nanobody Database for Immunoinformatics (INDI). It’s a novel nanobody database that collates nanobody information from all the major data repositories in the public domain, largely in an automated manner.

INDI collates data from the following sources:

  • NCBI GenBank,
  • Protein Data Bank,
  • Patents,
  • Next-generation sequencing (NGS) repositories,
  • Scientific publications.

Results

The INDI database integrates nanobody sequences, structures, and their associated metadata in the public domain. Thanks to automatic updates from heterogeneous sources, the database can keep up with the new data appearing in these sources.

Thanks to INDI’s data heterogeneity, nanobody researchers can get an accurate picture of the current insights into nanobody sequence, structure, and function - and then use this knowledge to accelerate the development of:

  • analytical frameworks,
  • structural modeling,
  • de novo nanobody design protocols,
  • and deep-learning models for nanobody design.

INDI offers a reliable data foundation to develop nanobody-specific computational methods for accelerating the delivery of novel nanobody-based therapeutics.

Key use cases

Sequence-based search

Users can perform Variable Region Search and CDRH3 Search - the two common use cases of nanobody sequence identification. Variable Region Search matches against the entire variable region, whereas CDRH3 Search targets CDRH3, the most variable portion of the nanobody, which is responsible for most of the antigen contacts.
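To make the difference between the two search modes concrete, here is a minimal sketch of similarity-based ranking over a small local set of sequences. It is not INDI's actual search implementation, which this article does not describe. The `Entry` record, the `rank_by_similarity` function, the 0.8 threshold, and the toy sequences are all assumptions made for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical local record: an ID, a V-region and a CDRH3 sequence."""
    entry_id: str
    variable_region: str
    cdrh3: str


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: one minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def rank_by_similarity(entries, query, field, min_similarity=0.8):
    """Rank entries by similarity of `field` to the query.

    `field` is 'variable_region' for a whole-region search
    or 'cdrh3' for a CDRH3-only search.
    """
    hits = [(similarity(getattr(e, field), query), e.entry_id) for e in entries]
    return sorted((h for h in hits if h[0] >= min_similarity), reverse=True)


# Made-up toy sequences, not taken from INDI.
TOY_ENTRIES = [
    Entry("toy-1", "QVQLVESGGGLVQAGGSLRLSCAAS", "AAGSWYGTLYEYDY"),
    Entry("toy-2", "QVQLQESGGGLVQPGGSLRLSCAAS", "AAGSWYGTLYDYDY"),
    Entry("toy-3", "EVQLVESGGGSVQAGGSLRLSCTAS", "NVKDFGAY"),
]

if __name__ == "__main__":
    for score, entry_id in rank_by_similarity(TOY_ENTRIES, "AAGSWYGTLYEYDY", "cdrh3"):
        print(f"{entry_id}\t{score:.2f}")
```

The same function serves both modes; only the compared field changes. A production search over millions of sequences would likely need a dedicated alignment or indexing tool rather than pairwise edit distance.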

Text search

Every nanobody sequence in INDI comes with rich textual annotations. This textual content reveals biological targets, origins, and purpose of the study of these molecules. Metadata fields are heterogeneous across all the covered sources (as well as within them).

To help with information retrieval across these five diverse sources, we implemented a text index created on all the metadata fields used in these databases. By searching for keywords, users get the most relevant matches to their queries.

Users can also search by the potential targets of nanobodies reported in the depositions and by individual accession numbers. All results are displayed in an interactive table that makes it easy to sort the matching entries and view their details.
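The idea of a single text index built over heterogeneous metadata fields can be sketched as a small inverted index. This is an illustration under stated assumptions, not INDI's implementation: the record layout, the field names, the `build_index` and `search_metadata` functions, and the scoring by number of matched keywords are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each token to the IDs of records containing it in any field.

    Field names are ignored on purpose, so records from sources
    with different metadata schemas end up in the same index.
    """
    index = defaultdict(set)
    for record_id, metadata in records.items():
        for value in metadata.values():
            for token in tokenize(str(value)):
                index[token].add(record_id)
    return index


def search_metadata(index, query):
    """Rank record IDs by the number of distinct query tokens they match."""
    scores = Counter()
    for token in set(tokenize(query)):
        for record_id in index.get(token, ()):
            scores[record_id] += 1
    return sorted(scores, key=lambda rid: (-scores[rid], rid))


# Made-up toy records with deliberately different field names.
TOY_RECORDS = {
    "rec-1": {"title": "Nanobody against lysozyme", "organism": "Lama glama"},
    "rec-2": {"patent_title": "Anti-lysozyme single domain antibody",
              "claims": "VHH binding lysozyme"},
    "rec-3": {"study": "Alpaca immune repertoire", "organism": "Vicugna pacos"},
}

if __name__ == "__main__":
    toy_index = build_index(TOY_RECORDS)
    print(search_metadata(toy_index, "lysozyme nanobody"))
```

Indexing every field value under one token space is what lets a single keyword query reach records whose schemas have nothing in common.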

Bulk download

INDI also makes data available for offline use, as separate extracts of the two pillars of our data model: sequences and metadata. The sequence extract includes the V-region sequences of identified nanobodies, and every sequence entry is linked to the metadata fields in the metadata extract.
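Once downloaded, the two extracts can be recombined by following the link between each sequence entry and its metadata. The sketch below assumes, purely for illustration, that both extracts are CSV files sharing an `entry_id` column. The actual file formats and column names of the INDI extracts are not stated in this article, and the rows shown are made up.

```python
import csv
import io

# Toy stand-ins for the two downloaded files (hypothetical columns and rows).
SEQUENCES_CSV = """entry_id,sequence
toy-1,QVQLVESGGGLVQAGGSLRLSCAAS
toy-2,EVQLVESGGGSVQAGGSLRLSCTAS
"""

METADATA_CSV = """entry_id,source,target
toy-1,structure,lysozyme
toy-2,patent,
"""


def load_rows(handle):
    """Read a CSV file object into a list of dictionaries."""
    return list(csv.DictReader(handle))


def join_extracts(sequence_rows, metadata_rows, key="entry_id"):
    """Attach metadata fields to each sequence row via the shared key."""
    metadata_by_id = {row[key]: row for row in metadata_rows}
    joined = []
    for row in sequence_rows:
        # Sequences without a metadata row are kept with no extra fields.
        merged = dict(metadata_by_id.get(row[key], {}))
        merged.update(row)
        joined.append(merged)
    return joined


if __name__ == "__main__":
    sequences = load_rows(io.StringIO(SEQUENCES_CSV))
    metadata = load_rows(io.StringIO(METADATA_CSV))
    for record in join_extracts(sequences, metadata):
        print(record)
```

With real files, `io.StringIO(...)` would be replaced by `open(path, newline="")`, and the key column adjusted to whatever identifier the extracts actually use.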


If you’d like to see how INDI works, check out this page. You can carry out a nanobody-specific sequence-based search and metadata retrieval.

Need more features? Take a look at AbStudio - a solution that allows teams to create, collate, and discover antibody-specific datasets to accelerate research decision-making.