Eduardo P. C. Rocha
The constraints in chromosome organisation
Bacterial chromosomes are organized adaptively with respect to processes such as the functioning of the cell factory and chromosome replication and segregation (Figure 1). A classical example of the former is the organization of functionally related genes into operons, which allows sophisticated strategies for the regulation of gene expression (Lawrence 2003). Longer-range levels of organization, resulting from the dynamic interaction of the chromosome and gene expression with the cell factory, involve the association between gene expression and cell compartmentalization, chromosome segregation or cell differentiation (Danchin et al. 2000; Shapiro et al. 2002). In E. coli, genes involved in sulphur metabolism cluster together, possibly to allow compartmentalization of this metabolism and of its toxic intermediates (Rocha et al. 2000). Translation-related genes also cluster at a supra-operonic level in many bacterial genomes (Lathe et al. 2000). For example, in E. coli and B. subtilis, highly expressed and stress-related genes are clustered on the two opposite sides of the origin of replication (Rocha et al. 2003c). Genomic islands are frequently associated with genes involved in pathogenicity and antibiotic resistance (Hacker et al. 2003; Rowe-Magnus et al. 2003), but also with catabolic pathways (van der Meer et al. 2003) and symbiosis (Sullivan et al. 1998). In many of these cases, the clustering of closely related functions may derive from efficient horizontal gene transfer (Lawrence et al. 2003). Nevertheless, neighbouring genes tend to be co-expressed even when they are not in the same operon (Korbel et al. 2004). This implies that a proper description of prokaryotic genome organisation must include supra-operonic organisation (Audit et al. 2003), and it is consistent with the conserved neighbourhood of genes in different operons between distantly related genomes (Lathe et al. 2000).

Among the various types of organisational features of the bacterial chromosome, my work has particularly focused on the organisation of the bacterial genome in relation to replication. I maintain that chromosome replication is a major cause of the arrangement and interrelation of many of the elements that constitute chromosomal organization. This is because the intrinsic mechanistic asymmetries or gradients of chromosome replication impose biases on sequence composition and gene distribution in the bacterial chromosome. Because these asymmetries play such an important role in shaping the genome, I shall start by briefly describing how replication takes place in bacteria. For further information on the mechanisms and regulation of replication initiation, elongation and termination, excellent reviews are available (Marians 1992; Bussiere et al. 1999; Kunkel et al. 2000; Giraldo 2003).
Replication in bacteria
In bacteria where replication has been thoroughly studied, such as Escherichia coli and Bacillus subtilis, chromosome replication starts at a single chromosome locus, the origin (ori). The key regulation of chromosome replication takes place at the start of the process and is tightly coupled to cell mass (Boye et al. 1996). Initiation occurs once per cell cycle, during a short time interval, at all origins within a cell (Skarstad et al. 1986). Secondary initiations of the newly replicated origins are then avoided by a transient, temporally coordinated blockage of re-initiation at the origin (Campbell et al. 1990). Replication proceeds by the bi-directional progression of the replication forks along the chromosome (Figure 2). In each replication fork, a complex with two DNA polymerases replicates the two DNA strands. The pace of the replication fork varies significantly between species. For example, the fork progresses at ~1000 nt/s in E. coli, at ~300 nt/s in the archaeon Pyrococcus abyssi (Myllykallio et al. 2000) and at ~100 nt/s in Mycoplasma capricolum (Seto et al. 1998). Thus, although the genome of M. capricolum is 7 times smaller than that of E. coli, it takes longer to replicate. Short DNA segments to which a replication terminator protein binds (ter sites) arrest the movement of the replication forks when these pass through the terminus region (Bussiere et al. 1999). The chromosome dimer is then resolved by site-specific recombination (Kuempel et al. 1991). The system of replication fork arrest diverges widely among bacteria, and deletion of ter sites is lethal neither in B. subtilis (Iismaa et al. 1987) nor in E. coli (Henson et al. 1985). In contrast, the system of chromosome resolution is strongly conserved (Recchia et al. 1999).
Elongation of the newly synthesized strand involves the displacement of the replication fork along the chromosome. The replication fork is a holoenzyme with several components (Glover et al. 2001), among which are a helicase that unwinds the chromosome (DnaB in E. coli) and two DNA polymerases III (DNAP) that replicate the two strands (Figure 2). Because DNA polymerisation occurs in the 5'->3' direction, one strand, the leading strand, is replicated continuously, whereas the other, the lagging strand, is replicated in discrete steps through the Okazaki fragments. The cycle of synthesis of an Okazaki fragment starts with the synthesis of a new 10-12 nt RNA primer by a primase (Kitani et al. 1985). Subsequently, the lagging strand polymerase transits from the 3'-OH terminus of the just completed Okazaki fragment to the new primer terminus and resumes DNA synthesis for 1000-2000 nt in E. coli and around 100 nt in P. abyssi (Matsunaga et al. 2002). The cycle ends with the removal of the RNA primer and gap filling. In contrast, the synthesis of the leading strand is essentially continuous, closely following the unwinding of the DNA duplex, with processivities surpassing 500 kb and possibly spanning the entire replichore (Marians 1992).
Thus, there are two polymerase cores within one holoenzyme particle, each replicating a different DNA strand. In E. coli, the two core polymerases are not pre-dedicated to one strand and can be interchanged without differences in processivity (Yuzhakov et al. 1996). In B. subtilis and the other Firmicutes there are two different genes coding for the alpha-subunit of the DNA polymerase (Bruck et al. 2000; Dervyn et al. 2001). One of these subunits (DnaE) is the orthologue of the E. coli subunit, whereas the other (PolC) shows only weak sequence similarity to it (Koonin et al. 1996). Recent evidence suggests complementary roles for these proteins, with DnaE involved in lagging strand synthesis and PolC involved in leading strand synthesis (Dervyn et al. 2001). An important difference between the two proteins is that PolC includes the exonuclease domain, which is coded separately in E. coli (epsilon subunit, DnaQ) (Huang et al. 1999).

Recently, two studies demonstrated the existence of multiple origins of replication in archaea of the genus Sulfolobus (Lundgren et al. 2004; Robinson et al. 2004). Sulfolobus solfataricus and S. acidocaldarius have three synchronous replication origins that are unevenly distributed in the chromosome. The forks progress at a rate of ~80-100 nt/s, leading to asynchronous termination. It remains to be understood how replication termination and chromosome segregation take place under these circumstances, but this explains the mixed patterns obtained in GC skew analyses of these genomes. These studies also highlight the mixed character of archaeal replication, which resembles that of bacteria in some respects and that of eukaryotes in others. It is thus most likely that part of the replication-associated structure of bacterial genomes is common to only some of the archaeal genomes.
Major asymmetries and gradients in replication
The mechanistic asymmetries of bacterial replication result in mutational or selective biases, differentiating the chromosome within replichores and between replicating strands (Figure 2).
Collisions between polymerases. Chromosomal replication often takes place during periods of intense transcription, and collisions between DNAP and RNA polymerase (RNAP) are inevitable. Although the RNAP transcription rate varies with the growth phase, it is usually in the range of 40-50 nt/s, and thus about 20 times slower than DNAP in E. coli (Bremer et al. 1996). Because both polymerases progress in the 5'->3' direction, transcripts coded in the lagging strand lead to head-on collisions, whereas leading strand transcripts lead to co-oriented collisions (Figure 3). Two important consequences result from this asymmetry. When the replication fork arrives at an actively transcribed operon, the polymerases will inevitably collide if the operon is in the lagging strand, but will collide with a probability of only ~19/20 (i.e. 95%) if the operon is in the leading strand (because DNAP is 20 times faster than RNAP). Thus, the probability of collision depends on the direction of transcription. But so does the outcome of the collision. In vivo studies of both E. coli rRNA genes (French 1992) and Saccharomyces cerevisiae tRNA genes (Deshpande et al. 1996) indicate that only head-on collisions interfere significantly with the progression of the replication fork. As a consequence, the differential probability and consequences of collisions between polymerases lead to an asymmetric distribution of genes between the two strands.
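The ~19/20 probability quoted for co-oriented collisions can be recovered from a simple kinematic sketch. The assumptions below are mine and are not stated in the text: both polymerases move at constant speed, a single RNAP is located at a uniformly random position along the operon when the fork reaches the operon start, and a collision occurs whenever the fork catches up with RNAP before RNAP reaches the operon end. The function names, the operon length and the Monte Carlo check are purely illustrative.

```python
import random


def co_oriented_collision_probability(v_dnap: float, v_rnap: float) -> float:
    """Chance that the fork catches a co-oriented RNAP inside the operon.

    Assumption: RNAP sits at a uniform random position x in an operon of
    length L when the fork enters at position 0. The fork catches RNAP at
    position x * v_dnap / (v_dnap - v_rnap), which lies inside the operon
    only if x < L * (1 - v_rnap / v_dnap).
    """
    if v_dnap <= v_rnap:
        return 0.0  # the fork never catches up with RNAP
    return 1.0 - v_rnap / v_dnap


def simulate(v_dnap: float, v_rnap: float, operon_length: float = 5000.0,
             trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo check of the closed form (requires v_dnap > v_rnap)."""
    if v_dnap <= v_rnap:
        raise ValueError("simulation assumes the fork is faster than RNAP")
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0.0, operon_length)
        time_for_rnap_to_finish = (operon_length - x) / v_rnap
        time_for_fork_to_catch_up = x / (v_dnap - v_rnap)
        if time_for_fork_to_catch_up < time_for_rnap_to_finish:
            hits += 1
    return hits / trials


# Speeds from the text: DNAP ~1000 nt/s, RNAP ~50 nt/s (20 times slower).
print(co_oriented_collision_probability(1000.0, 50.0))
print(simulate(1000.0, 50.0))
```

Under the same assumptions a head-on encounter always ends in a collision, since the two polymerases move towards each other; only the co-oriented case leaves RNAP a chance to finish first.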
Compositional differences between replicating strands. During the replication of the Okazaki fragments, the leading strand is kept single-stranded while the newly formed lagging strand is being synthesised. In contrast, the lagging strand stays double-stranded while the newly formed leading strand is being synthesised. Single-stranded DNA (ssDNA) can form secondary structures, and these may lead to replication errors (Leach 1994) and palindrome deletion (Trinh et al. 1991). Also, some types of substitutions increase in ssDNA more than others, and thus the different replication of the two strands leads to compositional differences between the replicating strands (Francino et al. 1997; Frank et al. 1999).
Gene dosage effects. Fast-growing bacteria have growth rates that require replication to be re-initiated before the round in progress is complete. In this way, E. coli can attain growth rates of 2.5 doublings/h (Bremer et al. 1996). However, the replication fork moves at about 600 to 1000 nt/s (Marians 1992), and duplicating the roughly 5 Mb E. coli chromosome takes from 40 to 67 minutes (Bremer et al. 1996). Hence, E. coli cells under exponential growth show 2 to 3 simultaneous rounds of replication, with a new one starting every ~20 minutes. Under these conditions, genes near the origin of replication are over-represented in the bacterial cell by a factor of 4 (2^2) to 8 (2^3) relative to genes near the terminus of replication (Chandler et al. 1975). Hence, in fast-growing bacteria highly expressed genes tend to be positioned near the origin of replication.
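The dosage figures above can be checked with a short calculation. This is a minimal sketch that assumes steady-state exponential growth, a constant fork speed and bidirectional replication from a single origin; under those assumptions the origin-to-terminus copy-number ratio is 2^(C/tau), where C is the replication time and tau the interval between initiations. The function and parameter names are illustrative and do not come from the text.

```python
def replication_time_min(genome_bp: float, fork_speed_nt_per_s: float) -> float:
    """Minutes needed to replicate a circular chromosome bidirectionally.

    Each of the two forks copies half of the chromosome.
    """
    return (genome_bp / 2.0) / fork_speed_nt_per_s / 60.0


def ori_ter_ratio(c_min: float, tau_min: float) -> float:
    """Copy-number ratio of origin-proximal to terminus-proximal genes.

    c_min: time to replicate the chromosome (minutes).
    tau_min: interval between successive initiations (minutes).
    """
    return 2.0 ** (c_min / tau_min)


# Values taken from the text: a new round every ~20 minutes.
print(ori_ter_ratio(40.0, 20.0))   # two overlapping rounds
print(ori_ter_ratio(60.0, 20.0))   # three overlapping rounds

# Fork speeds of 600 to 1000 nt/s on a 5 Mb chromosome (assumed round size).
print(replication_time_min(5_000_000, 1000.0))
print(replication_time_min(5_000_000, 600.0))
```

With a new round every 20 minutes, replication times of 40 and 60 minutes give ratios of 4 and 8, matching the 2^2 to 2^3 range. With a round 5 Mb chromosome the sketch gives replication times of roughly 42 to 69 minutes, close to the 40 to 67 minutes quoted.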

Figure 3- Differential outcome of DNAP and RNAP collisions when operons are in the leading or the lagging strand. In bacteria, translation is coupled to transcription, and this has important consequences for the model. To avoid overloading the figure, the translation machinery is not represented.
Gradients along the replichore. The directionality of chromosome replication leads to a chronological asymmetry along the replichore. Since about one hour separates the beginning from the end of replication, the favorable conditions leading to replication start may no longer prevail at the end of the process and this may result in different mutation rates and compositional biases (Daubin et al. 2003b). Dimer separation after full chromosome replication requires site specific or homologous recombination (Corre et al. 2002). This may also result in differential mutation and recombination biases near the terminus of replication.
In the following sections, these different biases are discussed in the framework of the major replication asymmetries.
Differences between leading and lagging strands
Gene strand bias
The "polymerases collision avoidance" model. The 7 rDNA operons in E. coli are all on the leading strand, resulting in co-orientation of replication and transcription (Ellwood et al. 1982), which was speculated to result from selection to prevent too frequent collisions between DNAP and RNAP (Nomura et al. 1977). These and other works showing that RNAP movement slows-down DNA replication (Pato 1975), led to the proposition that the differential outcome of collisions between DNAP and RNAP should result in a selective pressure towards coding highly expressed genes in the leading strand (Brewer 1988). The lower collision rate would carry two major advantages: i) faster DNA replication; ii) less transcript loss. It could also diminish the number of replication arrests, which are dangerous for the cell (Kuzminov 2001). In fast-growing E. coli there are ~70000 ribosomes per cell, corresponding to roughly 10000 transcripts per rRNA operon per replication cycle (Bremer et al. 1996). Under the same conditions, there are about 70 RNAP transcribing each of the 7 rRNA operons, implicating a potential for 490 collisions per replication round, i.e. about 5% of the rRNA transcription. This could significantly retard replication if collisions are head-on (Brewer 1988). Indeed, highly expressed translation-related genes, such as rRNA and ribosomal proteins, have been systematically found coded in the leading strands of genomes for which replicating strands can be identified (e.g. (Zeigler et al. 1990; Blattner et al. 1997; Rocha 2002)). A rare exception concerns the sequenced genome of Pasteurella multocida, which exhibits two recent inversions near the origin of replication that shifted two of the 6 rDNA operons and several ribosomal protein genes to the lagging strand (May et al. 2001). 
The validity of the polymerases collision avoidance model has never been thoroughly tested, owing to obvious experimental difficulties, and it has only recently started to be questioned.
Frequency of genes in the leading strand. The sequencing of the regions surrounding the replication origins of both B. subtilis (Ogasawara et al. 1992) and E. coli (Burland et al. 1993) showed that their genes are systematically coded in the leading strand. However, the sequencing of the two complete genomes resulted in very different observations (Figure 4). The frequency of leading strand genes is ~75% in B. subtilis (Kunst et al. 1997), but only ~55% in E. coli (Blattner et al. 1997). A first systematic survey of gene strand bias showed that genomes have from 55% to 80% of their genes in the leading strand, although more than 85% of ribosomal protein genes were systematically coded in the leading strand of these genomes (McLean et al. 1998). In fact, 78% of the genes of Firmicutes (including Mycoplasmas) are in the leading strand, compared with 58% for the other genomes (Rocha 2002). Interestingly, the group with higher biases coincides with the group of genomes containing two different (and probably strand-dedicated (Dervyn et al. 2001)) DNAP alpha-subunits at the replication fork (Figure 5). A causal association between these two observations is tempting, although not trivial. One could suppose that the additional structural asymmetry at the fork leads to increased asymmetry in its stability relative to different types of collisions (Rocha 2002). If this is so, genomes containing two different DNAP alpha-subunits should be less sensitive to co-oriented collisions and/or more sensitive to head-on collisions. One can think of several ways in which this could happen.

Figure 4- Density of coding sequences (yellow), and leading strand coding sequences (red) in model bacteria and highly biased genomes of Firmicutes (B. subtilis and T. tengcongensis) and others (E. coli and T. pallidum). The densities were computed in non-overlapping windows of 10 kb (T. tengcongensis and T. pallidum) and 20 kb (E. coli and B. subtilis).
Collisions between polymerases can result in dangerous replication arrests, which may be resolved by homologous recombination (Lusetti et al. 2002). B. subtilis has two different DNAP and also different mechanisms for homologous recombination, in which the RecBCD system of E. coli is functionally replaced by the AddAB system (Dubnau et al. 2002). This might render it more sensitive to replication arrests, which in E. coli have been proposed to result only from head-on collisions (French 1992). Genomes containing two different DNAP alpha-subunits lack the E. coli proofreading exonuclease protein DnaQ. Although PolC contains a homologous exonuclease motif, DnaE does not. Further, DnaQ performs a structural role in E. coli by increasing the stability of the DNAP complex and thus its processivity (Maki et al. 1987; Studwell et al. 1990). Genomes lacking DnaQ could thus be more sensitive to collisions. Interestingly, comparative studies seem to indicate that B. subtilis RNAP is more processive than that of E. coli (Artsimovitch et al. 2000). The combination of these differences might render B. subtilis DNAP more sensitive to collisions and thus favor a higher gene strand bias. However, collisions between polymerases are expected to occur between RNAP and the helicase (Figure 3), probably by DNA knotting and without direct contact (Olavarrieta et al. 2002). Although this does not necessarily invalidate the possibility that differential DNAP sensitivity could lead to higher overall biases, further experimental work will be necessary to assess this hypothesis.

Figure 5- Percentage of leading strand genes among bacterial genomes for which an origin and terminus can be reliably found. Open circles represent genomes containing two different alpha-subunits of the DNA polymerase and closed circles represent genomes for which only one subunit was found by homology searches (this graph is an update of (Rocha 2002)).
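The percentage of leading strand genes can be computed once the origin and terminus are known. The following is a minimal sketch, not the procedure used for the figures: it assumes a circular chromosome with 0-based coordinates, a single origin and terminus, and genes summarised by a midpoint and a strand (+1 or -1). All names and the toy coordinates are hypothetical.

```python
from typing import Iterable, Tuple

Gene = Tuple[int, int]  # (midpoint position in bp, strand: +1 or -1)


def on_leading_strand(pos: int, strand: int, ori: int, ter: int,
                      genome_len: int) -> bool:
    """True if the gene is transcribed in the direction of fork movement.

    The fork leaving ori towards increasing coordinates replicates the arc
    ori -> ter; genes on that arc are co-oriented when on the + strand.
    On the other arc the fork moves towards decreasing coordinates, so
    co-oriented genes are on the - strand.
    """
    forward_arc = (ter - ori) % genome_len
    offset = (pos - ori) % genome_len
    in_forward_replichore = offset < forward_arc
    return strand == 1 if in_forward_replichore else strand == -1


def leading_strand_fraction(genes: Iterable[Gene], ori: int, ter: int,
                            genome_len: int) -> float:
    """Fraction of genes coded in the leading strand."""
    gene_list = list(genes)
    if not gene_list:
        raise ValueError("no genes given")
    n_leading = sum(on_leading_strand(p, s, ori, ter, genome_len)
                    for p, s in gene_list)
    return n_leading / len(gene_list)


# Toy chromosome of 1000 bp with ori at 0 and ter at 500 (made-up values).
toy_genes = [(100, 1), (200, -1), (700, -1), (900, 1)]
print(leading_strand_fraction(toy_genes, ori=0, ter=500, genome_len=1000))
```

The modular arithmetic lets the origin sit anywhere on the circular map, which matters because annotated genomes do not always start their coordinates at the origin.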
Biased gene location and gene function. These works suggest that compositionally different DNAPs are correlated with different levels of gene strand bias, but also that expression levels can hardly account for all the observed trends. The number of genes that are highly expressed in bacteria is typically low (Andersson et al. 1990), and cannot justify the high frequency of leading strand genes in Firmicutes. Furthermore, according to the polymerases collision model, gene strand bias should be higher in fast-growing bacteria, where transcription and replication (hence collisions) are very frequent and where fast growth is an important component of the global fitness. However, one typically finds the inverse (Rocha 2002). That expression is not a determinant of gene strand bias was recently demonstrated in B. subtilis and E. coli (Rocha et al. 2003a), where essentiality seems to be the unexpected key to understand this bias. In B. subtilis, the frequency of leading strand essential genes (96%) and non-essential genes (74%) is very different and is independent of expression levels (Figure 6). Qualitatively similar results are found in E. coli when high expression is defined using codon usage biases, transcriptome or proteome data. These results seem to hold when essentiality is assigned by homology in most other bacterial genomes (Rocha et al. 2003b). In all cases, essential genes are more biased than non-essential genes, and among non-essential genes, there is rarely a significant effect attributable to expression level. Finally, when comparing the location of orthologues in close genomes, essential genes are conserved in the leading strand more often than the other genes.

Figure 6- Distribution of genes in the leading (black bar) and lagging (white bar) strands in the genomes of B. subtilis and E. coli, classified according to essentiality and expression level (Rocha et al. 2003a).
Re-thinking the polymerases collision model. The polymerases collision model does not satisfactorily explain the linkage between gene strand bias and essentiality. If the major problem associated with collisions between polymerases were the slow-down of replication, then one would expect higher biases among the genes leading to higher collision rates, i.e. among highly expressed genes. Yet, among the >3000 genes expressed during exponential growth in E. coli (Tao et al. 1999), the essential genes, many of which are expressed at low levels, are highly biased, whereas the highly expressed genes that are not essential are equally distributed between the replicating strands. This suggests that the problem of collisions is related not to their rate (i.e. to expression levels), but to gene function. In co-oriented collisions, the transcript may be finished, whereas head-on collisions result in aborted transcripts (Figure 3). The latter may be translated into truncated non-functional peptides, which is particularly deleterious for essential functions. Non-functional peptides forming part of large complexes, as is often the case for the products of essential genes, are typically dominant negative (Pakula et al. 1989). As a result, the truncated peptide may inactivate an entire complex. If this model proves correct, it suggests that aborted transcripts may be more poisonous than previously thought.
It is yet unclear if high levels of gene strand bias in genomes containing two different DNAP alpha-subunits are related with higher frequency of essential genes in the leading strand. One might consider that the problem of truncated peptides associated with collisions between polymerases is common to all expressed genes, but assumes a particular relevance in the context of genes coding for essential functions. If so, one could speculate that the asymmetry between the two types of collisions, head-on and co-oriented, would be qualitatively identical in all genomes, but have more severe consequences in genomes having two different DNAP alpha-subunits because they may be less robust to collisions, as discussed above. Hence, the overall bias would be larger in these genomes.
Compositional strand bias
Early work suggested that the frequencies of G and C and A and T (NG=NC and NA=NT) are roughly equal in ssDNA (Rudner et al. 1968). These results were integrated in the theoretical body of molecular evolution, under the form of two parity rules (Sueoka 1995). The first parity rule (PR1) indicates that when mutation and selection are symmetrical relative to both DNA strands, the rates of reciprocal substitutions are equal. The second parity rule indicates that at equilibrium, and under PR1 conditions, NG=NC and NA=NT in each DNA strand. In fact, under no-strand bias the second parity rule is the expected outcome of the first (Lobry 1995). However, the analysis of the first bacterial genomes showed that NG=NC and NA=NT do not hold when the genomes are divided into leading and lagging replicating strands (Lobry 1996a). The subsequent flood of bacterial genomes confirmed the generality of this observation (Grigoriev 1998; McLean et al. 1998; Mrázek et al. 1998; Mackiewicz et al. 1999; Rocha et al. 1999b; Lobry et al. 2002). The analysis of the first 100 available complete genomes in the bacterial branch of the tree of life using conventional techniques indicates that less than a dozen lack any type of compositional strand bias (Figure 7).
Several reviews have been dedicated to the subject of strand compositional bias (Francino et al. 1997; Frank et al. 1999; Karlin 1999). Here, I shall concentrate on the most recent results, with a natural emphasis on my own work. With the exception of Streptomyces coelicolor (see below), the leading strand is richer in G, relative to C, and to a lesser degree richer in T, relative to A (Lobry 1996a). Thus keto (leading strand) opposes amino (lagging strand) in compositional strand bias. The C-richness of the leading strands of mycoplasmas (McLean et al. 1998) is due to gene strand bias and biased composition of genes (Perrière et al. 1996), not compositional strand bias. Drawing GC skews, defined as (G-C)/(G+C) in sliding windows, has become a standard method to identify the origin and terminus of replication in many bacteria, where experimental evidence is rarely available (Lobry 1996b; Grigoriev 1998). Compositional strand bias is relatively homogeneous in the genome and is especially strong at third codon positions, as expected from a mutational bias (Lobry 1996a; McLean et al. 1998; Rocha et al. 1999b). In highly biased genomes, strand bias is the major force shaping codon usage (McInerney 1998; Romero et al. 2000). In these cases, one can easily discriminate the position of the genes relative to the replicating strand based on their codon usage or amino acid content of the respective proteins (Lafay et al. 1999; Mackiewicz et al. 1999; Rocha et al. 1999b). In Borrelia burgdorferi, one can predict with 95% accuracy the replicating strand positioning of the gene based solely on its amino acid content (97% using nucleotide composition). The accuracy of such discrimination analysis is also significant, although less pronounced, in the other genomes (Figure 8). This is one of the most striking examples of a mutational bias shaping protein composition.
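GC skew in sliding windows, as defined above, is straightforward to compute. The following is a minimal sketch in plain Python; the window and step sizes are free parameters, the function names are illustrative, and returning 0.0 for a window without G or C is my own convention.

```python
from typing import List, Tuple


def gc_skew(seq: str) -> float:
    """(G - C) / (G + C) for one sequence; 0.0 if it has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """GC skew in sliding windows, as (window start, skew) pairs.

    Only complete windows are reported; the chromosome is treated as
    linear, so no window wraps around the end of the sequence.
    """
    return [(start, gc_skew(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


# Toy sequence: a G-rich half followed by a C-rich half.
print(windowed_gc_skew("GGGGCCCC", window=4, step=4))
```

On a real genome, the positions where the skew changes sign are the candidate locations of the origin and terminus, since the leading strand is G-rich relative to C on both replichores.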

Figure 8- Accuracy of the discrimination of the leading strand genes and proteins based on their composition in nucleotides and amino acids. Accuracy is defined as the percentage of correct predictions of 30% of genes (the remaining 70% being used for learning) (Rocha et al. 1999b). Here, I analyzed 58 genomes corresponding to one strain per species.
Strand bias is probably neutral and evolves fast. Genes that switch replicating strands following chromosomal rearrangements evolve faster as they adapt to the composition of the new replicating strand. As a result they show lower average sequence similarity (Tillier et al. 2000b). The ratio of synonymous to non-synonymous rates of these switched genes is similar to that of the average orthologues, in agreement with a mutational driving force for this process (Rocha et al. 2001). The analysis of 49 orthologues that have switched replicating strand since the speciation of Chlamydia muridarum and Chlamydophila pneumoniae shows that these genes have completely adapted to the new strand (Figure 9).

Figure 9- Distribution of GC skews in the third codon position of the genes that have switched between replicating strand since speciation between C. pneumoniae and C. muridarum. GC skew is defined as (NG-NC)/(NG+NC), where NG and NC are the number of G and C in the gene sequence (Rocha et al. 2001).
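The per-gene quantity in this caption restricts the skew to third codon positions. A minimal sketch follows, assuming the input is an in-frame coding sequence read 5'->3' on its own coding strand; the function name and the handling of a trailing partial codon are illustrative choices of mine.

```python
def gc3_skew(cds: str) -> float:
    """(NG - NC) / (NG + NC) counted at third codon positions only.

    cds: in-frame coding sequence on its coding strand. A trailing
    partial codon is ignored. Returns 0.0 if no third position is G or C.
    """
    cds = cds.upper()
    usable_length = len(cds) - len(cds) % 3   # drop any partial codon
    third_positions = cds[2:usable_length:3]  # every third base, from base 3
    n_g = third_positions.count("G")
    n_c = third_positions.count("C")
    return (n_g - n_c) / (n_g + n_c) if (n_g + n_c) else 0.0


# Codons ATG GCG TTC have third positions G, G, C.
print(gc3_skew("ATGGCGTTC"))
```

Because third codon positions are largely synonymous, a skew measured there is less constrained by protein function, which is why it is a convenient readout of the mutational bias acting on a gene after it changes strand.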
Mechanism(s) causing compositional strand bias. Hypotheses aiming at identifying the causes of compositional strand bias must take into account the compositional changes it creates, its apparently neutral basis, and its near ubiquity.
Asymmetrical functioning and processivity of the two DNAP at each replication fork could result in different mutation frequencies in the two strands, although there are conflicting reports on this issue (see (Frank et al. 1999) for a discussion on this subject). The lagging strand has been proposed to be less mutagenic based on the analysis of mutations in lacZ (Fijalkowska et al. 1998). The leading strand DNAP is highly processive to stay on the template throughout replication, while the lagging complex needs to allow rapid cycling and may thus dissociate more easily from the template, leaving a mismatch for correction. However, lacZ is a native lagging strand gene, and adaptation to the leading strand composition is expected to involve higher substitutions rates (Szczepanik et al. 2001). Radman proposed that the MutSHL mismatch repair system would be more efficient in correcting errors in the lagging strand because it requires DNA nicks to proceed and these are more readily available in the lagging strand (Radman 1998). However, some genomes without the mismatch repair system have strand biases (e.g. Actinobacteria), and some genomes with MutSHL do not (e.g. Synechocystis and Anabaena). More importantly, all the above hypotheses allow explaining different mutation rates between strands. However, the relevant variable regarding compositional strand bias is the substitutions’ asymmetry and different mutation rates per se do not lead to compositional asymmetry (Lobry et al. 1999). Recent works identified strand compositional bias in archaea (Lopez et al. 1999), mitochondria (Reyes et al. 1998; Mohr et al. 1999) and phages (Mrázek et al. 1998; Grigoriev 1999; Miller et al. 2003). These elements possess different DNAP and repair mechanisms, different G+C compositions (Sueoka 1962) and mutation rates (Drake et al. 1998). Nevertheless, they all have a qualitatively similar compositional strand bias. 
Thus, the major source of compositional asymmetry is probably to be found in a fundamental property such as the chemical stability of DNA.
The cytosine deamination theory proposes that compositional strand bias is caused by the chemical instability of cytosine in ssDNA (Frank et al. 1999). Relative to dsDNA, the rate of cytosine deamination increases by a factor of 140 in ssDNA, and by a further factor of 4 when cytosine is methylated (see (Lutsenko et al. 1999) for a recent review). In the latter case, the deamination produces a T (instead of U), which is not corrected by the uracil-DNA glycosylase (Coulondre et al. 1978). The leading strand is more exposed in the single-stranded state (in order to serve as template for the synthesis of the new lagging strand), and C->T mutations would lead to GC and TA skews (and to larger GC skews in G+C poor genomes). Still, GC skews are larger than expected, relative to TA skews, under the cytosine deamination hypothesis (McLean et al. 1998). Data on inverted genes of Chlamydia and Bacillus were concordant with a major effect of cytosine deamination in establishing the bias, but only when a smaller contribution of asymmetric C->G substitutions was considered (Rocha et al. 2001). Caution is necessary when interpreting the data of the latter analysis because the substitutions could not be oriented, and the genomes had accumulated multiple substitutions. Yet, when the joint contributions of cytosine deamination and C->G asymmetry were taken into consideration, there was a satisfactory explanation of the association between GC and TA skews in the available set of bacterial genomes (Rocha et al. 2001). As a result, cytosine deamination of leading strand ssDNA is currently seen as the most likely and important cause of compositional strand bias. However, it may fail to describe the entire complexity of this phenomenon, since Streptomyces coelicolor leading strands are G-poor and C-rich (Bentley et al. 2002). S. coelicolor is the least biased genome displayed in Figure 8, and its GC skew is mostly present in the hypervariable outer arms of the chromosome, not in the more conserved central regions. Further, the closely related S. avermitilis shows no significant compositional strand bias (Omura et al. 2001). It is still unknown whether the inverted bias of S. coelicolor is indeed related directly to compositional strand bias, to its extreme GC composition (72%), or to the frequent inversions, fusions, deletions and insertions of large plasmids at these chromosomal regions in Streptomyces (Chen et al. 2002).
Variation in the intensity of compositional strand bias. As is clear from Figure 8, compositional strand bias varies significantly from genome to genome. This may be related with the different stability of the genomes, as chromosome shuffling will tend to level off the bias (Rocha et al. 1999c; Tillier et al. 2000b; Mackiewicz et al. 2001a; Achaz et al. 2003). Different length of the Okazaki fragments may also contribute to modulate the bias, since a smaller exposure in the ssDNA state would lead to lower levels of cytosine deamination (Mrázek et al. 1998). Consistent with this hypothesis, the intensity of compositional strand bias is positively correlated with the duration of the single stranded state of the H-genes during mitochondrial replication (Reyes et al. 1998). Eukaryotes have Okazaki fragments that are 10 times smaller than those in E. coli, and lack extensive replication strand bias (Gierlik et al. 2000). Pyrococcus abyssi and Sulfolobus acidocaldarius also have small Okazaki fragments (Matsunaga et al. 2002), which could explain the lower biases typically found in Archaea (Lopez et al. 2001). Finally, some bacteria are more exposed than others to DNA mutagenic agents. Obligatory intracellular bacteria may be particularly well protected from this point of view, and this could partly explain why they tend to exhibit higher biases.
Gradients along the replichores
Gene distribution
Gene dosage effects. When the time required to replicate the chromosome exceeds the cell doubling time, the dosage of genes near the origin increases exponentially with the number of simultaneous replication rounds. Fast-growing bacteria, such as E. coli or B. subtilis, with multiple simultaneous replication rounds, selectively accumulate highly expressed transcription and translation genes near the replication origin because of this effect (Figure 10). For slow-growing bacteria, the positioning of highly expressed genes near the origin does not carry sufficient selective advantage, and there is no appreciable bias in the distribution of highly expressed genes along the replichore (Figure 10).
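The exponential dependence of gene dosage on overlapping replication rounds can be made concrete with a minimal sketch. It assumes the classical Cooper-Helmstetter description of steady-state exponential growth, which the text does not cite explicitly: a gene at relative position x along the replichore (0 at the origin, 1 at the terminus) is present, on average, 2^(C(1-x)/τ) times per copy of the terminus, where C is the chromosome replication time and τ the doubling time. The function name and the numerical values in the usage note are illustrative.

```python
def relative_dosage(position, replication_time, doubling_time):
    """Average gene copies relative to the terminus.

    position: relative distance from the origin along the replichore
              (0.0 = origin, 1.0 = terminus).
    replication_time, doubling_time: same time unit (e.g. minutes).
    Assumes steady-state exponential growth (Cooper-Helmstetter model).
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie between 0 (origin) and 1 (terminus)")
    return 2 ** (replication_time * (1.0 - position) / doubling_time)
```

For instance, with a hypothetical replication time of 40 minutes and a doubling time of 20 minutes, `relative_dosage(0.0, 40, 20)` gives a four-fold excess of origin-proximal genes, whereas a doubling time of 400 minutes gives a ratio close to 1, in line with the absence of an appreciable dosage effect in slow-growing bacteria.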
Gene dosage effects have rarely been tested experimentally, but they have frequently been invoked to explain the deleterious effect of chromosomal rearrangements (Liu et al. 1996; Roth et al. 1996). Experimentally induced inversions in the E. coli chromosome that alter the distance of genes to the origin of replication can halve growth rates (Louarn et al. 1985). Similarly, systematic translocations of an expressed gene in the S. enterica typhimurium chromosome indicate that positioning it closer to the origin leads to higher expression levels (Schmid et al. 1987). Essential genes are also not randomly placed along the replichore, but this is not an intrinsic feature of essential genes; it is caused by the over-representation of highly expressed genes in this subset (Figure 11). Thus, among the constraints imposed by replication on the distribution of genes in the bacterial chromosome, expression plays a role in the distribution of genes as a function of the distance to the origin of replication, whereas essentiality constrains the replicating strand in which genes are coded.

Figure 11- Distribution of essential genes as a function of the distance (%) to the origin of replication in E. coli and B. subtilis (Rocha 2004). A black bar represents essential highly expressed genes, whereas a white bar represents essential non-highly expressed genes. Essential genes were retrieved from the literature for B. subtilis (Kobayashi et al. 2003) and from the PEC database for E. coli (http://www.shigen.nig.ac.jp/ecoli/pec/). Highly expressed genes were defined as those with the top 10% CAI values in the genome (Sharp et al. 1987).
Phage integration and gene transfer
The genomes of B. subtilis (Kunst et al. 1997) and E. coli K12 (Blattner et al. 1997) show A+T rich ter regions associated with the presence of prophages. Since recombination is involved both in the resolution of the replicated chromosomes and in the integration of horizontally transferred genes, it was tempting to relate the two. Horizontal gene transfer (HGT) was suggested to cluster near the terminus of replication because of two effects. First, because horizontally transferred genes are typically weakly (or not) expressed, gene dosage effects would lead to their positioning away from the origin. Second, local hyper-recombination at dif sites would favor the insertion of horizontally transferred sequences. However, in B. subtilis the HGT at the terminus is mostly caused by the presence of one single large SPβ prophage (Kunst et al. 1997). Similarly, in E. coli, lambdoid prophages tend to cluster closer to the terminus of replication (Campbell 1992), and it has recently been suggested that hyper-recombination in the E. coli ter region is mostly caused by these sites (Corre et al. 2000). The availability of a large number of genomes from bacteria closely related to E. coli, B. subtilis and S. pneumoniae allows the re-appraisal of these analyses by identifying genes arising from HGT. The results suggest a very small bias towards larger rates of HGT in the second half of the chromosome for Enterobacteria and Streptococcus, and smaller than expected HGT in the terminus of Bacillus (Figure 12). Although the regions immediately adjacent to the origin seem to under-represent HGT systematically, there is no clear bias for an over-representation of HGT at the terminus.
Figure 12- Distribution of HGT elements in the replichore of three bacterial groups (Rocha 2004). The E. coli group includes the strains K12, O157:H7, CFT073, Shigella flexneri, S. enterica typhimurium and S. enterica typhi. The Bacillus group includes B. subtilis, B. halodurans, O. iheyensis, L. monocytogenes and L. innocua. The Streptococcus group includes two strains of S. pyogenes, S. pneumoniae, S. agalactiae and one strain of S. mutans. In each group one counts as HGT every gene that does not have an orthologue in any of the other genomes. Orthologues were defined as in (Rocha et al. 2001).
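The counting rule described in the caption of Figure 12 (a gene counts as HGT when it has no orthologue in any other genome of its group) can be sketched as follows. This is only an illustration of the bookkeeping, not the pipeline used for the figure: the input layout, the function names and the assumption that the terminus lies diametrically opposite the origin are hypothetical, and orthologue detection itself is taken as given.

```python
def distance_to_origin_pct(position, origin, genome_length):
    """Distance (%) from a gene to the origin along its replichore.

    Assumes a circular chromosome with the terminus diametrically
    opposite the origin, so each replichore spans half the genome.
    """
    arc = abs(position - origin) % genome_length
    arc = min(arc, genome_length - arc)
    return 100.0 * arc / (genome_length / 2)


def hgt_fraction_by_bin(genes, n_bins=10):
    """Fraction of putative HGT genes in bins of distance to the origin.

    genes: iterable of (distance_pct, n_orthologues) pairs, where
           n_orthologues is the number of other genomes in the group
           carrying an orthologue of the gene.
    Returns one fraction per bin, or None for empty bins.
    """
    totals = [0] * n_bins
    hgt = [0] * n_bins
    for distance_pct, n_orthologues in genes:
        index = min(int(distance_pct / 100 * n_bins), n_bins - 1)
        totals[index] += 1
        if n_orthologues == 0:  # no orthologue anywhere else: counted as HGT
            hgt[index] += 1
    return [h / t if t else None for h, t in zip(hgt, totals)]
```

Note that this criterion also counts genes lost in all the other genomes of the group, so the fractions are best read as upper bounds on recent acquisitions.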
Heterogeneities in nucleotide composition
Early analyses of nucleotide composition in bacterial genomes showed high inter-genomic (Sueoka 1962) but low intra-genomic variability (Rolfe et al. 1958). Further, each bacterial lineage has a characteristic pattern of codon usage, which is determined by several factors such as its G+C content, its tRNA set and gene expression levels (Grantham et al. 1980; Ikemura 1981). These observations led to the development of HGT detection methods based on sequence composition (Médigue et al. 1991; Lawrence et al. 1997). These methods assume that nucleotide composition is relatively uniform along the chromosome. However, G+C content among genes expressed at low levels is lower in late-replicating regions of the E. coli genome (Deschavanne et al. 1995). It also shows significant variations, although not as a function of replichore position, in Mycoplasma genitalium (Kerr et al. 1997). This may lead to an overestimation of HGT by methods based on nucleotide composition, although real HGT events are most often found by these methods (Daubin et al. 2003a). The heterogeneity of bacterial genomes has recently been confirmed and extended by an analysis using several complete genomes (Daubin et al. 2003b). Half of the sequenced genomes show A+T richness at the terminus of replication, and a significant part of the others show significant heterogeneities along the chromosome that do not coincide with the replication terminus. Tests done in E. coli, Chlamydia and Helicobacter showed that HGT was not the cause of these heterogeneities, and one must then conclude that the composition of bacterial chromosomes is more heterogeneous than previously thought. A significant part of this heterogeneity is correlated with the organization of the chromosome relative to replication, although it leads to smaller differences in nucleotide composition than compositional strand bias (Figure 13).
Figure 13- Compositional strand bias is the difference in G and C between genes in the lagging and leading strands of 87 genomes (average values ΔG=-2.1% and ΔC=1.9%). G+C poorness at the terminus is the difference in G and C between genes in the 90% of the genome around the origin and the 10% around the terminus of replication (average values ΔG=0.3% and ΔC=0.2%) (Rocha 2004). When the chromosomes are split 50%:50%, the differences are smaller (average values ΔG=0.1% and ΔC=0.1%). The graphs represent box-plots, where the boxes delimit the 1st and 3rd quartiles. The centre of the diamond indicates the mean, which is always significantly different from zero (P<0.01).
The mechanisms establishing these replichore compositional gradients are probably highly variable. The genome of Corynebacterium diphtheriae shows a remarkably lower G+C content at the terminus, but the closely related, and largely co-linear, genome of C. efficiens does not show such a pattern (Cerdeno-Tarraga et al. 2003).
The mutation rates of TA -> GC transitions and transversions in a lacZ revertant at 4 loci along the chromosome of Salmonella enterica Typhimurium showed significant heterogeneities, but no clear gradient in the origin-to-terminus direction (Hudson et al. 2002). However, AT richness near the terminus is the result of the balance between GC->AT and TA->GC mutations, and this balance has yet to be analyzed. Two other possible causes for replichore compositional gradients can be proposed. First, the recombination at dif sites may induce a bias associated with replication and/or repair of the end of both concatenates (see Daubin et al. 2003b and references therein). Second, if the end of replication takes place under conditions of nutrient depletion, then A+T richness at the terminus could result from the relatively larger availability of these nucleotides in bacterial cells (Sharp et al. 1989; Rocha et al. 2002b). Exponentially growing bacteria are simultaneously replicating regions near the terminus and the origin of replication, and no base composition heterogeneity should be observed under these circumstances. However, under low growth rates, for which conditions of nutrient depletion are more plausible, nucleotide scarcity may be important for the late-replicating regions of the chromosome. Given the under-representation of guanine and cytosine in the bacterial cytosol (Danchin et al. 1984), nucleotide scarcity would lead to an A and T enrichment in the terminus region. Since A+T richness shows a significant increase just in the vicinity of the terminus, this would suppose a sudden decrease of the G+C nucleotide pool at the end of chromosome replication.
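The statement that AT richness results from the balance between GC->AT and TA->GC mutations can be illustrated with a two-state sketch. It assumes a purely mutational model with constant per-site rates and no selection, which is a simplification not made in the text; the function names and rate values are hypothetical. Under these assumptions the equilibrium A+T fraction is u/(u+v), where u is the GC->AT rate and v the AT->GC rate, so measuring only one direction of mutation is not enough to predict composition.

```python
def equilibrium_at_content(rate_gc_to_at, rate_at_to_gc):
    """Equilibrium A+T fraction under a two-state mutation model.

    At equilibrium the flux AT->GC (f * v) equals the flux GC->AT
    ((1 - f) * u), hence f = u / (u + v).
    """
    return rate_gc_to_at / (rate_gc_to_at + rate_at_to_gc)


def simulate_at_content(start_at, rate_gc_to_at, rate_at_to_gc, generations):
    """Iterate the expected A+T fraction over discrete generations."""
    at_fraction = start_at
    for _ in range(generations):
        gained = (1.0 - at_fraction) * rate_gc_to_at  # GC sites becoming AT
        lost = at_fraction * rate_at_to_gc            # AT sites becoming GC
        at_fraction += gained - lost
    return at_fraction
```

In this toy model, a local A+T enrichment at the terminus could arise from a higher GC->AT rate, a lower TA->GC rate, or both, which is why the balance between the two needs to be measured.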
Conclusion
Bacterial genomes show lower heterogeneity in coding density and nucleotide composition than most eukaryotes. Still, they show important variations in several aspects of genome organization, many of which relate to the asymmetries induced by the mechanism of chromosome replication. Interestingly, one single key asymmetry may lead to different biases. For example, compositional strand bias and gene strand bias both originate in the asymmetric nature of the replication fork, which divides the chromosome into leading and lagging strands. Yet, the two biases are poorly correlated (Tillier et al. 2000a; Rocha 2002). This is because their direct causes are different: in one case asymmetric exposure of ssDNA creates asymmetric composition, whereas in the other case the collision between polymerases favors leading strand genes. In Table 1, I summarize the different types of bias that have been found and their mechanistic basis. I also tentatively assign to each bias a mutational or selective basis. In some cases, e.g. compositional strand bias or gene strand bias, such assignment is relatively straightforward, but in others, it is not. It is also not unlikely that in some cases mutational biases have been recruited for other (possibly selective) purposes. In any case, comparative genome analysis will have to take such biases progressively into account in oligonucleotide frequency or even phylogenetic studies, since the evolution of the sequences clearly depends on their location in the chromosome. Experimental work in organisms other than E. coli will also be important to understand the diversity of biases associated with bacterial replication.
The wealth of expression data coming from genomics and microbial cell biology may also allow a better understanding of the associations between replication-associated biases and other structuring variables such as cell division, gene expression and compartmentalization of the cell.

Table 1- Comparative analysis of the different biases arising in chromosomes from the asymmetric replication mechanisms operating in bacterial cells. A "?" indicates that a consensus in the community still seems to be lacking, or that the available data are still scarce. See comments and references in the text.
| Replication bias | Selection / Mutation | Effect | Major basis of the bias |
|---|---|---|---|
| Between strands | | | |
| Nucleotide composition | M | Glead/Glag >> Clead/Clag and Tlead/Tlag > Alead/Alag | Chemical vulnerability of ssDNA |
| Gene distribution I | S | Most essential genes are coded in the leading strand | Collisions between DNAP and RNAP |
| Gene distribution II | S | Most genes are coded in the leading strand | Collisions between DNAP and RNAP; DNAP composition |
| Palindrome deletion | M | Large palindromes are deleted | Slipped mispair in ssDNA |
| chi sites | S (?) | χ sites are more abundant in the leading strand | Recombination repair of stalled replication forks |
| Along replichores | | | |
| Gene distribution | S | Highly expressed genes cluster near the origin | Gene dosage in fast growing bacterial cells |
| Nucleotide composition | M | A+T rich terminus | Prophages, recombination (?), nucleotide scarcity (?) |
| Rate of sequence evolution | M (?) | Sequence divergence increases along the replichore | Recombination (?), nucleotide scarcity (?) |
| Both | | | |
| Polarized motifs | S | Motifs are over-represented in the leading strand and near dif sites | Chromosome segregation and/or resolution (?) |


