* Orientation

Research activities

The Atelier de Bio-Informatique (ABI) is an open structure that gathers biologists, biophysicists and computer scientists from various institutions who wish to work together at the interface between biology and computer science, in a multidisciplinary environment. Founded at the Curie Institute in 1985, the Atelier remained there until September 1996, when it moved to the 12 Rue Cuvier building at the Pierre and Marie Curie University (Paris VI).
ABI's activity centers on two closely connected domains: the analysis of sequences and structures in biological molecules on the one hand, and the organization of sequence data and knowledge obtained through genome sequencing on the other. More recently we have become interested in bringing together genome data, chromosome organization and molecular evolution.

Further details on these themes are given below under six headings:

Motif finding in biological sequences

Considering a string of characters on a given alphabet, we try to find motifs satisfying a certain set of constraints. Generally speaking, a motif is an abstract object whose instances are the occurrences in the string. Acting on definitions:

has raised interesting questions regarding the analysis of biological sequences (DNA, RNA and proteins). In that field, our activity has been to formalize the definition of motifs associated with different biological contexts (when and how can we state that sequence fragments show significant resemblance, as opposed to the two-sequence problem) and to develop efficient algorithms for exhaustive searches for the motifs thus defined.

The simplest example is the case in which the motif is a word over an alphabet, of fixed or maximal size, present at least r times in the string. The classical Karp, Miller and Rosenberg (KMR) algorithm easily solves this problem, with small adaptations to the case where we have n sequences and we search for the motif in q out of these n sequences.

Using strictly repeated factors is of course inadequate in most relevant biological problems. Therefore, we have relaxed this constraint by introducing a notion of similarity R (not necessarily transitive) defined on the alphabet. This relation can easily be obtained from a typical substitution matrix used in sequence analysis. We then search for sets of similar words in the sequences, the similarity being provided by the relation R linking the corresponding symbols. These sets are in fact maximal cliques of a relation over these words. As a consequence, the motif is defined by a product of cliques of R. We have developed an efficient algorithm (a generalization of KMR) to compute those maximal cliques.

In order to refine the notion of approximate repeat, we have extended the case of a single relation R over the alphabet to the case of several simultaneous relations, by replacing the notion of maximal clique by that of coverage. In a third step, we have considered the case where the occurrences are no longer strict realizations of the motif, but instead lie within a certain distance from it (within a certain range of errors such as substitutions and indels).

Finally, this last notion is generalized from the case where the distance is measured on the basis of symbols (Levenshtein) to the case where it is measured on the basis of sub-motifs, involving the use of a numerical matrix of similarity between the symbols. In this case, it is a notion analogous to the one used in Blast for comparing two sequences, but it is used here for comparing multiple sequences. In fact, the algorithms we have developed may be used as they stand or as cores for other applications. In the former case, they are useful for determining the set of motifs that are characteristic of a set of sequences. As examples of the latter, one can cite multiple alignment by blocks (peptide matching) or extensive databank searches.
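
As an illustration of the simplest case described above (a word of fixed size present at least r times in a string), the following Python sketch uses the label-doubling idea on which the Karp, Miller and Rosenberg algorithm is based. It is a simplified illustration and not the Atelier's implementation: it handles a single string and exact repeats only, and the function names (relabel, kmr_labels, repeated_words) are hypothetical.

```python
from collections import defaultdict


def relabel(items):
    """Map each distinct item to a small integer, in order of first appearance."""
    ids = {}
    return [ids.setdefault(item, len(ids)) for item in items]


def kmr_labels(s, k):
    """Label every start position so that two positions share a label
    exactly when the length-k words starting there are identical."""
    labels = relabel(s)          # labels for words of length 1
    length = 1
    while length < k:
        step = min(length, k - length)
        new_len = length + step
        n_pos = len(s) - new_len + 1
        # A word of length new_len is determined by two (possibly
        # overlapping) words of the current length, at i and i + step.
        labels = relabel([(labels[i], labels[i + step]) for i in range(n_pos)])
        length = new_len
    return labels


def repeated_words(s, k, r=2):
    """Return {word: [start positions]} for length-k words occurring >= r times."""
    groups = defaultdict(list)
    for pos, label in enumerate(kmr_labels(s, k)):
        groups[label].append(pos)
    return {s[p[0]:p[0] + k]: p for p in groups.values() if len(p) >= r}


print(repeated_words("banana", 3))
```

Extending this to n sequences with a quorum q, or to a similarity relation R on the alphabet, would require tracking which sequence each position belongs to and replacing equality of labels by clique membership; neither extension is attempted here.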

Search for structural motifs in proteins

Searching for 3D motifs common to a set of n 3D structures is a problem that can be formulated in a way very similar to the search for lexical motifs. In the case of protein structures - limited to the peptide backbone - one can use the internal coordinates of the backbone atoms and reduce the problem to finding cliques of repeated words, where the symbols of the alphabet correspond to the triplets of dihedral angles φ, ψ and ω, and the proximity of symbols is deduced from that of the associated angular values (the relation R is then defined by an interval graph). The program we have thus developed is an efficient and mathematically well-defined solution to an important problem that homology modelling studies have so far had to address manually, namely the simultaneous search for common sub-structures in n proteins.
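
To make the idea of a similarity relation on angular symbols concrete, here is a minimal Python sketch in which two triplets of dihedral angles are considered related when each pair of corresponding angles differs by at most a tolerance. The tolerance value, the use of degrees and the function names are assumptions made for illustration; the source does not specify them.

```python
def angular_gap(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def related(t1, t2, tol=30.0):
    """True when every pair of corresponding angles lies within tol degrees.

    t1 and t2 are triplets of dihedral angles; tol is a hypothetical
    tolerance chosen only for this example.
    """
    return all(angular_gap(x, y) <= tol for x, y in zip(t1, t2))


a = (-60.0, -45.0, 180.0)
b = (-40.0, -45.0, 180.0)
c = (-20.0, -45.0, 180.0)
print(related(a, b, tol=25.0), related(b, c, tol=25.0), related(a, c, tol=25.0))
```

The example triplets show why such a relation is not transitive: a is related to b and b to c, yet a and c are too far apart. This is the reason the search works with cliques rather than equivalence classes.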

One should note that the representation chosen in the above approach (internal coordinates) is limited to the search for motifs made up of consecutive atoms in the peptide chain. Hence, we have tried to relax this constraint. Our approach nevertheless remains very close to the previous ones: the lexical motifs (made up of contiguous symbols) are replaced by geometrical ones (associations of geometrical elements such as triangles). However, so far the combinatorics of the problem limits the field of application to pairwise comparisons, such as searches of 3D databanks (e.g. the PDB).

Search for secondary structures in RNA

The main initial goal of the work on this theme was to develop a tool for describing and searching for RNA secondary structures in databanks. For that purpose, we developed a language (Palingol) inspired by constraint programming techniques. Palingol aims at allowing the biologist:

One should emphasize that the representation chosen in Palingol is not limited to the search for "pure" secondary structures, but also allows handling more complex structures, such as pseudo-knots, tertiary structure elements or even "alternative" forms associated with thermodynamic equilibrium. Palingol has already been used to search for known structures (tRNA, Rho-independent terminators, triple helices), with a view to its implementation in an integrated sequence analysis environment. Palingol is also used for exploring novel secondary structures through trial and error approaches.
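
To give a flavour of what searching a sequence for a described secondary structure involves, the following Python sketch scans an RNA sequence for simple hairpins: a stem of fixed length closed by a loop of bounded size. It is not Palingol code and does not reflect Palingol's syntax or search strategy; the function name, the default parameters and the choice to accept G-U wobble pairs are assumptions made for this example.

```python
# Base pairs accepted in a stem: Watson-Crick pairs plus G-U wobble
# (an assumption for this example).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_hairpins(seq, stem=4, min_loop=3, max_loop=8):
    """Return (start, end, loop_size) for every hairpin with a perfect
    stem of `stem` base pairs and a loop of min_loop..max_loop bases.
    start and end are 0-based, inclusive positions in seq."""
    hits = []
    n = len(seq)
    for i in range(n):
        for loop in range(min_loop, max_loop + 1):
            j = i + 2 * stem + loop - 1      # last base of the 3' arm
            if j >= n:
                break
            # Base i+k on the 5' arm must pair with base j-k on the 3' arm.
            if all((seq[i + k], seq[j - k]) in PAIRS for k in range(stem)):
                hits.append((i, j, loop))
    return hits


print(find_hairpins("GGGGAAAACCCC"))
```

A declarative description language lets the biologist state constraints of this kind (stem length, loop size, allowed pairings) without writing the scanning loop by hand.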

In parallel with the development of Palingol, we have also considered the inverse question: given a set of n sequences that are thought to fold in the same way, we search for the largest common folding. This problem is similar to the ones described previously regarding the search for lexical and structural motifs. The main difference stems from the fact that a given sequence cannot be associated with a single secondary structure, but only with a larger set of potential sub-optimal structures. The search for the largest common motif then takes place in this space of ambiguous examples.

Integration of data, methods and knowledge

Following the development of the specialized database Colibri, devoted to the E. coli genome, we have oriented our work, within the framework of a GREG project, towards the development of a cooperative environment to aid the analysis of genomic data. This system should allow for the representation and exploitation of the knowledge arising from the sequencing of complete genomes. It should also help the user in choosing and chaining the analytical methods available. Finally, it should present and manage the results of the analysis in a graphical way. We therefore aim at integrating two types of knowledge: one concerning biological entities (genes, signals, maps, etc.) and the other concerning methods, covering all the methods currently available to the scientific community. Our main goal is to represent the methodologies of sequence analysis, as well as the strategies for integrating the tools, in a coherent way, and not just to provide uniform access to a set of data and methods.

A prototype of such an environment is already functional; it has been used to annotate the Bacillus subtilis and the Mycoplasma pulmonis genomes. A prototype for the analysis of Arabidopsis thaliana is underway, within the framework of Genoplante. Although it originated at the Atelier, this project is currently developed by a set of laboratories (including ours), and particularly at INRIA/Grenoble by Alain Viari, who has recently left the Atelier. Finally, apart from its purely biological interest, we consider such an environment vital since it offers a valuable platform for the integration and in situ validation of the algorithms developed at the Atelier.

Genome analysis

Following the work done at the Atelier on the annotation of the genome of Bacillus subtilis, we have become extremely interested in the analysis of problems related to the structure and organization of bacterial genomes, in collaboration with the laboratory of Antoine Danchin at Institut Pasteur. The study of life should not be restricted to the study of biochemical objects, but should, preferably, investigate their relationships. Because genomes are the blueprints of life, they cannot be considered as simple collections of genes. In the same way, cells are not tiny test tubes; they also organize complexes of nucleic acids, proteins and other molecules. In order to understand genome organization, we must therefore explore the distribution of genes along the chromosome. One way to do this is to explore the proximity of genes, extending this exploration to many more types of vicinity than their simple succession in the genomic text. We are aware that this "associationist" approach is quite primitive, and that approaches using rules should be developed and at some point combined with it. Of course "neighbor" is meant here in the broadest possible sense: objects that are neighbors share some property. This includes not only similarity in structure or dynamics, but also the existence of links they may have in common with other objects. One can immediately see the conceptual similarity between the search for common features in bacterial genomes and our interest in the search for motifs in biological sequences.

Due to their compact genomes, prokaryotes have been thought to lack long repeats. To conclude from this that any redundant sequence would be counter-selected was too hasty. Even if bacteria strive to attain minimal functional genomes, the action of transposable elements alone continuously introduces repeats in the sequence. Moreover, innovation brought about by repeats (be it only in terms of spacing or timing) may confer significant selective advantages. We have analyzed large repeats in bacterial genomes using adaptations of the KMR algorithm and suffix trees. We found different patterns related to chromosomal rearrangements, pathogenicity strategies and horizontal transfer. It seems not only that repeats are positively selected in certain circumstances of bacterial evolution, but also that they are a major motor of this evolution through the rearrangements, integrations and deletions of genetic material that they mediate. An interesting spin-off of this research arose from a collaboration between the Atelier and the laboratory of Pierre Netter at IJM, especially through the work of Guillaume Achaz, who proposed a model for the generation of repeats in yeast that seems to be of general interest. Other collaborations on this theme include studies of Mycoplasma pathogenicity with Alain Blanchard at INRA in Bordeaux.

Biases in DNA sequences

Mutational pressures leading to dramatic differences between the nucleotide composition of genomes have long been recognized among bacteria, since the pioneering work by Sueoka in the early 1960s. These patterns produce heterogeneity in the chromosome, either because there is horizontal transfer between genomes with very different compositions, or because the mechanisms causing the bias act differently on different regions of the chromosome. At the Atelier we have been interested in the study of both types of biases.

Grantham and his colleagues were the first to analyze codon usage in E. coli using Correspondence Analysis. Later on, when a large number of genes became available, Claudine Medigue from the Atelier observed in the early 1990s that the best simultaneous 2D representation of the genes and codon usage had a "rabbit head" shape. This shape could be explained by the existence of three major classes of genes differing in their codon bias. The same pattern was found in B. subtilis. The A+T-rich codon preference characterizing the third class is different from that of classes I and II. Genes of this class are in general clustered together in the chromosome, which, along with their functional classification and similarity to prophage genes or typical horizontally transferred genes (e.g. toxins), suggests a foreign origin. We are interested in the analysis of these genes, since their acquisition may have constituted an important landmark in the evolution of the species. Recently most of our work has been on the mechanisms of this acquisition, as deduced from sequence analysis.
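
Analyses of codon usage start from a table of codon counts per gene. The Python sketch below shows one simple way to compute such a row of counts and the corresponding relative frequencies for a single coding sequence. It covers only this preparatory step and not the Correspondence Analysis itself; the function names and the decision to ignore trailing bases that do not form a full codon are assumptions made for illustration.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in a coding sequence read in frame from position 0.
    Trailing bases that do not complete a codon are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(cds):
    """Relative frequency of each codon observed in the sequence."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


print(codon_frequencies("ATGAAAAAATAA"))
```

Stacking one such row per gene gives the genes-by-codons table on which a multivariate method can then be run to look for classes of genes with distinct codon preferences.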

Under no-strand-bias conditions, one would expect to find the equalities A=T and C=G within each strand of the DNA double helix. This equality was found in earlier analyses of DNA sequences and became known as Chargaff's second parity rule. However, the analysis of complete bacterial genomes by Lobry in the mid-1990s revealed an important asymmetry between the leading and the lagging replicating strands. Interestingly, we have shown that most bacterial genomes present replication biases, and when they do, they exhibit the same fundamental asymmetry: the relative abundances of G and C differ between the two strands, with G enriched in the leading strand, and this is frequently accompanied by an excess of T over A in the leading strand. These biases propagate into higher-order biases in a correlated way, thereby changing the relative frequencies of codons and amino acids of genes and corresponding proteins in each of the replicating strands. We have therefore started to tackle the question of which mutation or selection biases may be acting on the different replication strands of bacteria.
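
A common way to quantify departures from the within-strand equalities C=G and A=T is a skew statistic such as (G - C)/(G + C), computed over successive windows along one strand. The Python sketch below illustrates this. The source does not state which statistic was used at the Atelier, so the formula, the window and step sizes and the function names are assumptions made for illustration.

```python
def skew(seq, x="G", y="C"):
    """(count(x) - count(y)) / (count(x) + count(y)); 0.0 if neither occurs."""
    nx, ny = seq.count(x), seq.count(y)
    return (nx - ny) / (nx + ny) if nx + ny else 0.0


def windowed_skew(seq, window, step, x="G", y="C"):
    """Skew of x over y in successive windows; returns (start, skew) pairs."""
    return [
        (start, skew(seq[start:start + window], x, y))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: G-rich first half, C-rich second half.
print(windowed_skew("GGGGCCCC", window=4, step=4))
# The same function gives a T-over-A skew when called with x="T", y="A".
print(skew("TTTA", "T", "A"))
```

Under Chargaff's second parity rule both skews would stay close to zero along the whole sequence; a change of sign along the chromosome is the kind of signal associated with a switch between leading and lagging strands.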