Overview

 Our bioinformatics activities are mainly focused on the following topics:

1. Next Generation Sequencing: the emerging high-throughput sequencing technologies have led to an increasing need for bioinformatics pipelines able to accurately and efficiently process, interpret and manage sequencing data. Our activities are mainly focused on the design and implementation of ad hoc data analysis pipelines for several sequencing applications (e.g. DNA-seq, RNA-seq), their optimization on cluster and cloud environments, and the development of new algorithms and procedures to overcome the challenges specific to each application.

2. Network-based pharmacology: given a biological process under study, network-based approaches integrate different data and knowledge sources to obtain a network where the molecules of interest are linked to other similar genes or proteins. Our strategies exploit the genomic data and the topological features of the networks to identify possible combinations of hit targets where it is desirable to act with a pharmacological therapy.

3. Tissue Engineering and Developmental Biology: tissue engineering and regenerative medicine are rapidly delivering products able to recover and improve the functionality of damaged tissues and organs. However, tools for monitoring the cellular pluripotency are needed for the design and implementation of devices that would reach the clinics. Our activities in this field are devoted to the extraction of quantitative measures of the cell status from their whole-genome expression profiles and to the identification of the links between stem cells and developing oocytes and embryos.

The working group:

Research fellows and PhD students: Federica De Paoli, Ivan Limongelli, Giovanna Nicora, Ettore Rizzo, Elisabetta Sauta, Susanna Zucca.

Supervisors: Riccardo Bellazzi, Paolo Magni

Next Generation Sequencing

Sequencing technologies make it possible to “read” a DNA molecule, that is, to determine the precise order of its nucleotides. A DNA molecule cannot be read from beginning to end like a book: it has to be fragmented into small pieces, which are sequenced separately and then combined by computational algorithms, like a puzzle. The high demand for low-cost sequencing has driven the development of high-throughput or “second generation” sequencing technologies, which parallelize the sequencing process by producing millions of DNA sequences at once. Sequencers alone, however, are not enough to explain the genomic variability that may underlie diseases, behaviours and traits of living beings. Moreover, the quantity of data generated by these machines is huge (on the order of terabytes per week for a single sequencer).
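
The "puzzle" idea above can be illustrated with a toy sketch. It greedily merges fragments that share a suffix/prefix overlap. This is only an illustration under simplifying assumptions (error-free reads, a single strand, made-up sequences). It is not the laboratory's pipeline, and production tools rely on far more robust strategies such as alignment to a reference genome or graph-based assembly. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap until none remains."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return reads  # no overlapping pair left
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Made-up fragments, for illustration only.
fragments = ["ATGGCG", "GCGTAC", "TACGGA"]
contigs = greedy_assemble(fragments)
```

With these three fragments, the sketch reconstructs a single contiguous sequence. Real data also contain sequencing errors, repeats and reads from both strands, which a greedy merge like this one does not handle.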

Bioinformatics here aims to manage, handle and interpret sequencing data in order to identify or suggest correlations between genomic patterns and one or more phenotypes of interest.

Specifically, our laboratory is active and has know-how in:

  • data analysis on the main sequencing platforms on the market: Illumina (GA, HiSeq, MiSeq) and Life Technologies (SOLiD, Ion Torrent, Ion Proton);

  • set up and management of the data analysis environments on high performance, cluster and cloud computing;

  • optimization of data analysis pipelines for highly automated and highly parallel processing;

  • design and implementation of analysis pipelines for the following sequencing applications: whole-genome, targeted resequencing (gene panels, amplicons, whole-exome), RNA-seq and Cancer-seq;

  • public genomic databases and design of ad hoc genomic data resources;

  • development of new methodologies, algorithms and procedures needed by the different applications in order to increase power and improve the quality of results.

Network-based approaches to pharmacology

Complex diseases are caused by a combination of genetic and environmental factors. A disease phenotype is therefore rarely the consequence of an abnormality in a single effector gene product; rather, it reflects various pathobiological processes that interact in a complex network. In recent years, systems biology approaches and, more specifically, network-based approaches have emerged as powerful tools for studying complex diseases. These methods are often built on the knowledge of physical or functional interactions between biological entities. The interactions can be conveniently represented as networks (graphs) whose nodes (vertices) denote molecules and whose links (edges) denote interactions between them. Depending on the type of interaction, the corresponding edge may be directed (e.g. protein activation) or undirected (e.g. binding between two proteins). Physical and functional interaction networks are increasingly applied to understand and analyze complex diseases.
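
As a minimal sketch of the representation described above, the following stores molecules as nodes and interactions as directed or undirected edges in an adjacency structure. The class name, method names and the molecule labels P1, P2 and P3 are hypothetical placeholders and do not refer to real proteins or to any specific software used in the laboratory.

```python
from collections import defaultdict


class InteractionNetwork:
    """Nodes are molecules; edges are interactions, directed or undirected."""

    def __init__(self):
        self._successors = defaultdict(set)

    def add_interaction(self, source, target, directed):
        self._successors[source].add(target)
        if directed:
            # Register the target as a node without adding a reverse edge.
            self._successors[target]
        else:
            # An undirected interaction can be traversed in both directions.
            self._successors[target].add(source)

    def neighbors(self, node):
        """Molecules reachable from `node` in one step, respecting edge direction."""
        return sorted(self._successors.get(node, set()))

    def nodes(self):
        return sorted(self._successors)


network = InteractionNetwork()
network.add_interaction("P1", "P2", directed=True)   # e.g. P1 activates P2
network.add_interaction("P2", "P3", directed=False)  # e.g. P2 binds P3
```

Here `network.neighbors("P2")` contains only P3, because the activation edge points from P1 to P2 and not back, whereas the binding edge is visible from both P2 and P3.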

Network construction also facilitates a more personalized approach to disease diagnosis and treatment. The first step of rational drug design is to understand the cellular dysfunction that is caused by a disease. Single-target drugs may correct some dysfunctional aspects of the disease module, but they could also alter the activity of molecules that are situated in the neighborhood of the disease module, leading to detectable side effects. This network-based view of drug action implies that most disease phenotypes are difficult to reverse through the use of a single ‘magic bullet’, that is, an intervention that affects a single node in the network. Increasing attention is therefore being given to therapies that involve multiple targets, which may be more effective in reversing the disease phenotype than a single drug. This drug discovery approach is called polypharmacology. The efficacy of this approach has been demonstrated by combinatorial therapies for AIDS, cancer and depression, raising an important question: can one systematically identify multiple drug targets that have an optimal impact on the disease phenotype? This is an archetypical network problem and has led to the development of methods to identify optimal drug combinations, starting either from the metabolic network or from the bipartite network that links compounds to their drug-response phenotypes. Such research has led to potentially safer multi-target combinations for inflammatory conditions and to the optimization of anticancer drug combinations.

In this context, we developed a method that, given a complex disease, starts by constructing a network that integrates different data sources, including gene expression data sets, protein interactions and disease-related pathways. Our strategies exploit the topological features of the network to identify the entities involved in the disease and the core disease causative pathways. In this way, we are able to identify possible combinations of hit targets where it is desirable to act with a multicomponent therapy. The best ranked combinations are selected based on a synergistic score: for each of them a potential new therapy could be discovered.

Our strategies are furthermore focused on the analysis of interaction networks between drugs (small molecules) and genes (proteins) in order to develop methods useful for pharmacogenomic discovery. Pharmacogenomics has the potential to transform the way medicine is practiced, by replacing broad methods of screening and treatment with a more personalized approach that takes into account both clinical factors and the patient’s genetics. One area where gene-based prescribing is steadily advancing is cancer genomics.

Knowledge of the mutational status of genes can be better understood when it is integrated with information about gene expression and related to alterations in: the copy number of each gene (CNVs), a very common phenomenon in cancer; mutations in promoters and enhancers; variations in the affinity of transcription factors and DNA binding proteins; or dysregulation of epigenetic control.

The construction and analysis of a network that integrates all these data can aid the discovery of suitable genes to target. Once the potential gene or pathway targets are identified, bioinformatics methods can be used to generate predictions of potential “leads” (or drug candidates) for a high-throughput drug screen.

Tissue Engineering & Developmental Biology

Tissue engineering aims to recover and improve the functionality of damaged tissues and organs by constructing living components useful for regeneration. One of the most important steps in tissue engineering processes is the selection of appropriate cell sources for implantation. Although stem cells have been identified as a promising source, different issues must be addressed before their clinical use for tissue replacement. In this context, the application of bioinformatic approaches to genome-wide expression data may help to understand how tissues develop at a molecular level.

Our research activities aim at developing novel bioinformatic methods that provide insights into cellular development by simultaneously exploiting microarray-based data and knowledge repositories. Thanks to the collaborations of our laboratory with other research centres, the proposed methods have been applied to different fields, including stem cell differentiation and oocyte development.

Stem cells

Stem cells are self-renewing populations characterized by pluripotency, i.e. the ability to evolve into diverse mature cell types. In mammals, embryonic stem cells (ESCs) can be isolated, proliferated and differentiated in vitro into a potentially unlimited variety of tissues. Whilst ESCs have the greatest potential for clinical applications in terms of pluripotency, their use raises several ethical issues.

The recent discovery that adult somatic cells can be reprogrammed in vitro to obtain induced pluripotent stem cells (iPSCs) has paved the way for new opportunities to study diseases and develop patient-specific therapies. However, one of the challenges concerns ensuring that reprogrammed cells are actually pluripotent and have not moved into partially differentiated states. 

In this context, we found that dimension reduction techniques can be successfully applied to the transcriptome data to obtain predictive models of the differentiation stage of stem cells and reprogrammed cells. Used in combination with gene selection strategies, these models map the temporal gene expression data of samples in standard culturing conditions to a one-dimensional space, obtaining a device named Differentiation scale. Uncharacterized samples, such as iPSCs, can be projected on this graphical tool to determine their actual pluripotency with respect to normal dynamics of differentiation.
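
The idea of mapping a time course onto a one-dimensional scale can be sketched as follows. The text above does not state which dimension reduction technique is used, so this example assumes principal component analysis (computed here by power iteration) and omits the gene selection step. The expression values, the three-gene profiles and the function names `fit_scale` and `project` are all hypothetical.

```python
import math


def project(sample, mean, axis):
    """Position of one expression profile on the one-dimensional scale."""
    return sum((x - m) * a for x, m, a in zip(sample, mean, axis))


def fit_scale(reference):
    """Fit the scale as the first principal axis of a reference time course.

    `reference` holds one expression profile per row, ordered by time.
    Returns (mean, axis), oriented so that later samples score higher.
    """
    n, d = len(reference), len(reference[0])
    mean = [sum(row[j] for row in reference) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in reference]
    # Power iteration on X^T X, started from the first-to-last difference.
    axis = [reference[-1][j] - reference[0][j] for j in range(d)]
    for _ in range(200):
        scores = [sum(c[j] * axis[j] for j in range(d)) for c in centered]
        axis = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in axis))
        if norm == 0:
            raise ValueError("reference profiles show no variation")
        axis = [v / norm for v in axis]
    # Make sure the scale increases with time.
    if project(reference[-1], mean, axis) < project(reference[0], mean, axis):
        axis = [-v for v in axis]
    return mean, axis


# Made-up expression values: 4 time points (rows) x 3 genes (columns).
reference = [
    [9.0, 1.0, 5.0],  # earliest time point
    [7.0, 3.0, 5.1],
    [4.0, 6.0, 4.9],
    [1.0, 9.0, 5.0],  # latest time point
]
mean, axis = fit_scale(reference)
scores = [project(row, mean, axis) for row in reference]

# An uncharacterized sample is placed on the same scale.
new_score = project([8.5, 1.5, 5.0], mean, axis)
```

In this toy setting the reference samples are ordered along the scale by time, and the uncharacterized sample falls between the first two time points, which is the kind of reading the Differentiation scale is meant to support.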

The integration of multiple experiments with networks and knowledge on embryonic development highlights the most influential pathways during specific phases of differentiation. In particular, we are investigating the utility of methods for combining multiple gene expression data sets in order to obtain a reliable signature of cellular identity. In addition, we have studied different prioritization strategies that exploit literature-derived gene annotations and network properties. Borrowing some ideas from text mining and information retrieval, candidate marker genes emerge from the study of their annotations and from the analysis of the network connectivity patterns.

Developmental Biology

One developmental process that has not yet been fully explored is the differentiation of the mammalian oocyte during folliculogenesis. Recently, there have been increasing efforts in the characterization of oogenesis and early embryogenesis by means of knowledge repositories. Results from microarray-based studies have been analysed with bioinformatics tools for annotation and association of molecules, with the common aim of hypothesizing unknown entities that play important roles in cellular development.

A subset of “maternal effect” genes has been identified for its important role in the early stages of development: these factors can modify the oocyte’s developmental competence and gene expression in the zygote. However, the complete network of essential key regulator genes in mammals still remains unclear. In the analysis of data from oocytes, an added value is represented by the evidence that a known maternal-effect gene is often related to another gene that has not been previously considered. Such evidence can be obtained from gene annotations and association networks.

The developmental competence of oocytes has also been related to chromatin organization. Based on the presence of a ring of heterochromatin surrounding the nucleolus, two different types of oocytes have been identified in the mouse ovary:

  • surrounded nucleolus oocytes (SN)
  • not surrounded nucleolus oocytes (NSN)

These two types of oocytes have different developmental competence: in the mouse, NSN oocytes arrest development at the 2-cell stage, whereas SN oocytes may develop to term. This characterization provides a useful model for determining a priori the oocyte developmental competence that guides the subsequent phases of cellular growth. One of the most important questions in this context concerns the transcriptional changes influenced by a specific chromatin configuration, which may be responsible for a different behaviour in terms of cellular maturation.

In order to address these issues, we have developed knowledge-based bioinformatics approaches to compare the transcriptome data of developmentally competent oocytes (SN) with those of oocytes that cease development at the 2-cell stage (NSN). Using keywords extracted from the Gene Ontology and from the publications referencing each gene in PubMed, we developed knowledge-based gene association networks that allowed us to identify a core set of factors guiding the transition from oocytes to embryos. Our activities in this field are currently focused on the integration of multiple knowledge sources that would help to assess the role of each gene during development and to uncover the transcriptional link between embryogenesis and stem cell differentiation.
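
A keyword-based gene association network of the kind described above can be sketched as follows. The similarity measure is not specified in the text, so this example assumes the Jaccard index between keyword sets and an arbitrary threshold. The gene labels, the keywords and the function names are hypothetical and do not reproduce the laboratory's actual method or data.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Shared keywords divided by all keywords of the two genes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def association_network(annotations: dict, threshold: float = 0.3) -> dict:
    """Link two genes when their keyword sets are similar enough.

    `annotations` maps each gene to its set of keywords (for instance,
    terms taken from ontology annotations or publication abstracts).
    Returns a dict mapping (gene1, gene2) to the similarity score.
    """
    edges = {}
    for g1, g2 in combinations(sorted(annotations), 2):
        score = jaccard(annotations[g1], annotations[g2])
        if score >= threshold:
            edges[(g1, g2)] = score
    return edges


# Made-up annotations, for illustration only.
annotations = {
    "geneA": {"oocyte maturation", "chromatin organization", "transcription"},
    "geneB": {"chromatin organization", "transcription", "cell cycle"},
    "geneC": {"lipid metabolism"},
}
network = association_network(annotations)
```

In this example only geneA and geneB are linked, since they share two of their four distinct keywords, while geneC shares none. A known maternal-effect gene placed in such a network would point to its neighbours as candidates for further study.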