The HPA project started back in 2003, and we started work on the Cell Atlas in 2006. The draft human genome sequence was published in 2001, and with the knowledge that humans have approximately 20,000 genes, Prof. Mathias Uhlén decided to start a project aiming to characterize the proteins that these genes encode. This is important knowledge, because proteins are a closer proxy to function than genes are. By identifying the context in which the proteins are expressed, that is, in which organs, cells and organelles, we can start to understand their function.
Initial efforts were directed towards generating a proteome-wide collection of antibodies and validating their specificity. To pull off such a mammoth task, early efforts were aimed at establishing robust automated platforms and protocols for performing immunoassays at large scale.
The Human Protein Atlas database receives over 250,000 visits per month from researchers around the world. The data is used for everything from basic cell biology and systems biology to studies aimed at understanding and preventing disease and developing better drugs. For the HPA Cell Atlas in particular, the subcellular distribution of proteins is something that cannot be inferred well from sequencing. Thus, this data is highly complementary to sequencing-based studies aimed at understanding cellular functions.
A current bottleneck in our work is the reliable classification of the subcellular distribution of proteins in our images. Several factors make this task complex. We know that about half of all proteins are localized to multiple compartments in the cell, and we also use a range of cell types with different morphologies. There is also a significant class imbalance, with a mix of very rare and highly common patterns in the images. With the Kaggle challenge, we hope to obtain a robust classifier that can assign the subcellular location(s) of proteins across all of these cell types. This classifier will not only relieve us of time-consuming manual pattern classification, but also provide opportunities for improved analysis of the cellular architecture.
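To make the two difficulties above concrete (several locations per protein, and rare versus common patterns), here is a minimal Python sketch. It is an illustration only, not the HPA pipeline: the class names, the toy annotations, and the helper functions `multi_hot` and `inverse_frequency_weights` are all assumptions made up for this example. It encodes each image's locations as a multi-hot vector and derives per-class weights that grow as a class gets rarer.

```python
from collections import Counter

# Hypothetical class names, chosen for illustration only.
CLASSES = ["nucleoplasm", "cytosol", "mitochondria", "peroxisomes"]


def multi_hot(labels, classes=CLASSES):
    """Encode a set of location labels as a 0/1 vector, one slot per class."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]


def inverse_frequency_weights(samples, classes=CLASSES):
    """Weight each class by n_samples / (n_classes * count).

    Rare classes receive larger weights. A class that never occurs
    gets weight 0.0 here, simply to avoid dividing by zero.
    """
    counts = Counter(label for labels in samples for label in set(labels))
    n = len(samples)
    k = len(classes)
    return {c: n / (k * counts[c]) if counts[c] else 0.0 for c in classes}


# Toy annotations (made up): each image may carry several locations.
samples = [
    ["nucleoplasm"],
    ["nucleoplasm", "cytosol"],
    ["cytosol"],
    ["nucleoplasm", "mitochondria"],
]

targets = [multi_hot(s) for s in samples]
weights = inverse_frequency_weights(samples)
```

In a setup like this, the multi-hot vectors would serve as training targets for a classifier with one independent output per location, and the weights could scale each class's contribution to the loss. Whether inverse-frequency weighting is the right remedy for a given dataset is a design choice; resampling or threshold tuning are common alternatives.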
The next computational problem to tackle would be not only to identify the patterns, but to actually segment them to allow better subsequent analysis of the data. Segmenting a single fine pattern is a challenging task, and segmenting mixed patterns even more so. Being able to segment the different organelles more efficiently would enable a great leap forward towards quantitative measurements of differences in protein localization upon perturbation and a quantitative understanding of the cell as a system.
Recent developments in machine learning and novel methods for large-scale, high-parametric imaging will greatly influence the future of imaging for cellular biology. Developments in computational imaging will improve resolution over large fields of view, and ‘label-free’ applications, where structures of the cell are predicted, will enable minimally invasive and high-parametric measurements of live cells. Another area of interest is highly multiplexed technologies, where hundreds of proteins or RNAs can be visualized in the same sample. I believe that these developments, together with the establishment of open-access image repositories and improved computational models for quantifying spatial patterns and environments, will position imaging as a key technology in the quest to characterize all human cells.
Collaborative projects between academia and industry are greatly needed to advance science. Our lab develops pipelines for automated feedback microscopy, and here we depend on good communication with the microscope provider and, ideally, on being included in the company's development plans so that our efforts are aligned as closely as possible.
Learn more: Kaggle and HPA
By sponsoring this competition, Leica Microsystems is able to contribute further to the extension and improvement of biological knowledge, as well as to help build tools for precisely analyzing the vast amounts of data created.