Thomas Dorfer

I'm a data scientist at Molcure Inc. in Tokyo, Japan, with training in neuroscience. My primary interests lie in machine learning and protein modeling, particularly the use of natural language processing to decipher the hidden grammar of protein sequences.

My Work

Conference Papers

Dorfer TA, Robinson CN, Cocchi L, Mattingley JB, Sale MV, Zalesky A, Gollo LL. Whole-brain network states predict behavioral responses to transcranial magnetic stimulation. Imaging@Brisbane, Brisbane, August 2018.

Pater MRA, Dorfer TA, Gschwind L, Coynel D, Papassotiropoulos A, de Quervain DJ, Luksys G. Predicting Human Memory Performance through Multi-Voxel Pattern Analysis. Federation of European Neuroscience Societies, Berlin, July 2018.

Dorfer TA, Roberts JA, Breakspear M, Gollo LL. Slow oscillatory brain activity renders convolution with the hemodynamic response function redundant. Organization for Human Brain Mapping, Singapore, June 2018.

Dorfer TA, Pater M, Gschwind L, Papassotiropoulos A, de Quervain DJ, Luksys G. FMRI-based prediction models for free recall, recognition memory, emotional valences, arousal, and memorability of pictures. Organization for Human Brain Mapping, Singapore, June 2018.

Gollo LL, Cocchi L, Hearne L, Dorfer TA, Roberts J, Breakspear M. Can we predict the intensity of the effects of brain stimulation? Organization for Human Brain Mapping, Singapore, June 2018.

ProtLearn

ProtLearn is a feature extraction tool for protein sequences. It is a freely available Python package that allows the user to efficiently extract amino acid sequence features from proteins and peptides, which can then be used for a variety of downstream machine learning tasks.
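
To make "amino acid sequence features" concrete, here is a minimal sketch of one common feature type, amino acid composition (the relative frequency of each of the 20 standard residues). It uses plain Python and does not show ProtLearn's own API; the function name `amino_acid_composition` and the example peptides are hypothetical and chosen only for illustration.

```python
from collections import Counter

# The 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid.

    Hypothetical helper for illustration; not part of ProtLearn.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    unknown = set(counts) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    # One fixed-length vector per sequence, regardless of sequence length.
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


# Made-up example peptides; each becomes a 20-dimensional feature vector.
peptides = ["ACDA", "MKKL"]
features = [amino_acid_composition(p) for p in peptides]
```

Because every sequence maps to a vector of the same length, the resulting feature matrix can be passed directly to a standard classifier or regressor.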

Natural Language Processing for Proteins (NLProt)

The application of Natural Language Processing (NLP) to protein sequence prediction has gained traction in machine learning and computational biology. This trend has been fueled primarily by advances in deep learning and by language models such as Google's BERT and its successors RoBERTa and ALBERT. The aim of this site is to provide a comprehensive, chronologically ordered list of the literature published in this area. If you have come across relevant work that should be added to this list, please feel free to make a pull request or open an issue.

Papers (blobs) arranged by similarity and date (the more recent, the bigger).

Papers

ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost
arXiv, July 2020 | Paper

BERTology Meets Biology: Interpreting Attention in Protein Language Models
Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani
arXiv, July 2020 | Paper | Blog

PEDL: extracting protein–protein associations using deep language models and distant supervision
Leon Weber, Kirsten Thobe, Oscar Arturo Migueles Lozano, Jana Wolf, Ulf Leser
Bioinformatics, July 2020 | Paper

Signal Peptides Generated by Attention-Based Neural Networks
Zachary Wu, Kevin K. Yang, Michael J. Liszka, Alycia Lee, Alina Batzilla, David Wernick, David P. Weiner, Frances H. Arnold
ACS Synthetic Biology, July 2020 | Paper

USMPep: universal sequence models for major histocompatibility complex binding affinity prediction
Johanna Vielhaben, Markus Wenzel, Wojciech Samek, Nils Strodthoff
BMC Bioinformatics, July 2020 | Paper

Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks
Ananthan Nambiar, Simon Liu, Mark Hopkins, Maeve Heflin, Sergei Maslov, Anna Ritz
bioRxiv, June 2020 | Paper

ProGen: Language Modeling for Protein Generation
Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, Richard Socher
bioRxiv, March 2020 | Paper | Blog

UDSMProt: universal deep sequence models for protein classification
Nils Strodthoff, Patrick Wagner, Markus Wenzel, Wojciech Samek
Bioinformatics, January 2020 | Paper

Modeling aspects of the language of life through transfer-learning protein sequences
Michael Heinzinger, Ahmed Elnaggar, Yu Wang, Christian Dallago, Dmitrii Nechaev, Florian Matthes, Burkhard Rost
BMC Bioinformatics, December 2019 | Paper

Generative models for graph-based protein design
John Ingraham, Vikas K. Garg, Regina Barzilay, Tommi Jaakkola
NeurIPS, December 2019 | Paper

Accurate Protein Structure Prediction by Embeddings and Deep Learning Representations
Iddo Drori, Darshan Thaker, Arjun Srivatsa, Daniel Jeong, Yueqi Wang, Linyong Nan, Fan Wu, Dimitri Leggas, Jinhao Lei, Weiyi Lu, Weilong Fu, Yuan Gao, Sashank Karri, Anand Kannan, Antonio Moretti, Mohammed AlQuraishi, Chen Keasar, Itsik Pe'er
arXiv, November 2019 | Paper

Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models
Mustafa Abdallah, Ashraf Mahgoub, Hany Ahmed, Somali Chaterji
Scientific Reports, November 2019 | Paper

Unified rational protein engineering with sequence-only deep representation learning
Ethan C. Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, George M. Church
Nature Methods, October 2019 | Paper

Evaluating Protein Transfer Learning with TAPE
Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S. Song
bioRxiv, June 2019 | Paper | Blog

Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences
Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, Rob Fergus
bioRxiv, May 2019 | Paper

A High Efficient Biological Language Model for Predicting Protein–Protein Interactions
Yanbin Wang, Zhu-Hong You, Shan Yang, Xiao Li, Tong-Hai Jiang, Xi Zhou
Cells, February 2019 | Paper

Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations
Robin Winter, Floriane Montanari, Frank Noé, Djork-Arné Clevert
Chemical Science, November 2018 | Paper

Natural language processing in text mining for structural modeling of protein complexes
Varsha D. Badal, Petras J. Kundrotas, Ilya A. Vakser
BMC Bioinformatics, March 2018 | Paper

Identifying the missing proteins in human proteome by biological language model
Qiwen Dong, Kai Wang, Xuan Liu
BMC Systems Biology, December 2016 | Paper

Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics
Ehsaneddin Asgari, Mohammad R. K. Mofrad
PLOS ONE, November 2015 | Paper

Survey of Natural Language Processing Techniques in Bioinformatics
Zhiqiang Zeng, Hua Shi, Yun Wu, Zhiling Hong
Computational and Mathematical Methods in Medicine, October 2015 | Paper

Technical Writing

Dynamic Replay of time-series data | Utilizing matplotlib and double-ended queues in Python

April 6, 2020 · 2 min read · Read article in Towards Data Science

Artefact Correction with ICA | Illustrated with an example from the neurosciences

April 4, 2020 · 5 min read · Read article in Towards Data Science
