![](https://crypto4nerd.com/wp-content/uploads/2023/08/138Um1nDjooXaqvaxVeHpow-1024x454.jpeg)
CLIP and its descendants have become a staple of text-to-image models. Can we do the same for text-to-protein? Yes!
➡️ Xu, Yuan, et al present ProtST, a framework for learning joint representations of textual protein descriptions (via PubMedBERT) and protein sequences (via ESM). In addition to a contrastive loss, ProtST has a multimodal mask prediction objective: mask 15% of tokens in both the text and the protein sequence and predict them jointly from the latent representations, alongside unimodal mask prediction losses on the sequence or the text alone. The authors also design a novel ProtDescribe dataset with 550K aligned protein sequence-description pairs. ProtST excels across many protein modeling tasks in the PEER benchmark, including protein function annotation and localization, and also enables zero-shot protein retrieval right from a textual description (see an example below). Looks like ProtST has a bright future as a backbone behind many protein generative models.
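For intuition, here is a minimal sketch of the CLIP-style contrastive piece, assuming batch-aligned embeddings from a protein encoder and a text encoder already projected to a shared dimension; the function name, shapes, and temperature are illustrative assumptions, and the multimodal mask prediction objectives are omitted.

```python
import torch
import torch.nn.functional as F

def info_nce(protein_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of aligned (protein, text) pairs.

    Assumes `protein_emb` comes from a protein encoder (e.g., ESM) and
    `text_emb` from a biomedical text encoder (e.g., PubMedBERT), both
    projected to a shared dimension d. Temperature and shapes are
    illustrative, not ProtST's exact hyperparameters.
    """
    protein_emb = F.normalize(protein_emb, dim=-1)      # (B, d)
    text_emb = F.normalize(text_emb, dim=-1)            # (B, d)
    logits = protein_emb @ text_emb.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal: contrast each protein against all
    # texts in the batch, and each text against all proteins.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Zero-shot retrieval then amounts to embedding a textual query and ranking all protein embeddings by cosine similarity to it.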
Actually, ICML features several protein generation works like GENIE by Lin and AlQuraishi and FrameDiff by Yim, Trippe, De Bortoli, Mathieu, et al. Those are not yet conditioned on textual descriptions, so incorporating ProtST there looks like a no-brainer performance boost.
MPNNs on molecules have a strict locality bias that inhibits modeling long-range interactions. Kosmala et al derive Ewald Message Passing, applying the idea of Ewald summation, which breaks the interaction potential down into short-range and long-range terms. The short-range term is modeled by any GNN, while the long-range term is the novelty: it is modeled with a 3D Fourier transform and message passing over Fourier frequencies. It turns out this long-range term is pretty flexible and can be plugged into any network modeling periodic or aperiodic systems (like crystals or molecules), such as SchNet, DimeNet, or GemNet. The model was evaluated on the OC20 and OE62 datasets. If you are interested in more details, check out the 1-hour talk by Arthur Kosmala at the LOG2 Reading Group!
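To make the short-range / long-range split concrete, here is a toy sketch of the decomposition, not the paper's actual architecture: the short-range part is the usual radius graph a GNN consumes, and the long-range part summarizes the geometry with Fourier-space structure factors. The function name, shapes, and the choice of frequencies `k_vectors` are all assumptions.

```python
import torch

def ewald_style_split(pos, cutoff, k_vectors):
    """Toy decomposition of a point cloud into short- and long-range views.

    `pos` is (N, 3) atom positions, `k_vectors` is (K, 3) Fourier
    frequencies. Long-range structure is captured by structure factors
    S(k) = sum_i exp(i k . r_i); in Ewald Message Passing, messages are
    additionally passed over these frequencies.
    """
    # Short-range: neighbor pairs within a real-space cutoff (the usual GNN graph).
    dist = torch.cdist(pos, pos)                               # (N, N)
    edge_index = ((dist < cutoff) & (dist > 0)).nonzero().t()  # (2, E)

    # Long-range: project positions onto the Fourier frequencies.
    phase = pos @ k_vectors.t()                                # (N, K), k . r_i
    structure_factor = torch.stack(
        (torch.cos(phase).sum(0), torch.sin(phase).sum(0)), dim=-1
    )                                                          # (K, 2): Re/Im of S(k)
    return edge_index, structure_factor
```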
A similar idea of using Ewald summation for 3D crystals appears in PotNet by Lin et al, where the long-range connection is modeled with incomplete Bessel functions. PotNet was evaluated on the Materials Project dataset and JARVIS, so reading those two papers will give you a good understanding of the benefits Ewald summation brings to many crystal-related tasks.
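If the incomplete Bessel function is unfamiliar, here is a numerical sketch using the standard textbook definition evaluated by quadrature; this is just the special function family PotNet builds on for expressing infinite lattice sums, and PotNet's exact parametrization may differ.

```python
import numpy as np
from scipy.integrate import quad

def incomplete_bessel(nu, x, y):
    """Incomplete Bessel function K_nu(x, y) = int_1^inf t^(-nu-1) e^(-x t - y/t) dt.

    Direct quadrature of the standard definition; assumes x > 0 so the
    integral converges.
    """
    integrand = lambda t: t ** (-nu - 1.0) * np.exp(-x * t - y / t)
    value, _ = quad(integrand, 1.0, np.inf)
    return value
```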
➡️ Another look at imbuing any GNN with equivariance for crystals and molecules is given by Duval, Schmidt, et al in FAENet. The standard way is to bake certain symmetries and equivariances right into the GNN architecture (as in EGNN, GemNet, and Ewald Message Passing): this is safe but computationally expensive (especially when it comes to spherical harmonics and tensor products). Another option, often used in vision, is to show the model many augmentations of the same input so that it eventually learns the invariances from the augmentations. The authors take the second path and design a rigorous way to sample invariant or equivariant augmentations of 2D / 3D data (e.g., for energies or forces, respectively), all with fancy proofs. To that end, the data augmentation pipeline projects 2D / 3D inputs to a canonical representation (based on PCA of the covariance matrix of atomic positions) from which rotations can be sampled uniformly.
The proposed FAENet is a simple model that uses only distances, yet shows very good performance with stochastic frame averaging data augmentation while being 6-20 times faster. It works for crystal structures as well!
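Here is a minimal sketch of one stochastic frame averaging step, assuming an (N, 3) position tensor; the function name and the simplifications (no determinant correction for proper rotations, no handling of degenerate eigenvalues) are mine, not FAENet's exact pipeline.

```python
import torch

def sample_frame_projection(pos):
    """Project a 3D point cloud into one randomly sampled PCA frame.

    Stochastic frame averaging as data augmentation: center the positions,
    take the eigenvectors of their covariance matrix as a canonical basis,
    and randomly flip eigenvector signs to sample one frame element.
    """
    centered = pos - pos.mean(dim=0, keepdim=True)   # (N, 3)
    cov = centered.t() @ centered / pos.size(0)      # (3, 3) covariance
    _, eigvecs = torch.linalg.eigh(cov)              # columns = principal axes
    signs = torch.randint(0, 2, (3,)) * 2 - 1        # random +/-1 per axis
    frame = eigvecs * signs                          # flip eigenvector signs
    return centered @ frame                          # coordinates in that frame
```

Because a fresh frame is sampled at each training step rather than averaging over all frame elements, the cost stays that of a single forward pass, which is where much of the speedup over baked-in equivariant architectures comes from.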