Introduction to comparative genomics
Comparative genomics is a widely studied, holistic approach in which two or
more genomes from different organisms are compared to study the differences and
similarities between them. This field of biological research compares the
genomes of organisms at different levels. The genomic features that are normally
compared include:
DNA or protein sequences,
Genes,
Gene order,
Mapping positions and maps,
Function and evolution.
Because the comparison is made at several levels, it gives multiple viewpoints
from which to understand how organisms differ from each other.
What is a genome made of?
All living creatures are different from one another, so their genomes also
vary. What makes comparative genomics interesting to study is that all genomes
are made up of DNA. This DNA carries the complete information of the organism,
and, remarkably, that information is encoded by only four nucleotides:
adenine (A), thymine (T), guanine (G), and cytosine (C).
Ever since the discovery of DNA’s double-helical structure, understanding the
sequence of these nucleotides in linear DNA molecules has been important. Because
of this, DNA sequencing has emerged as a fundamental approach in the study of
comparative genomics.
Principle of comparative genomics
The fundamental principle of comparative genomics is that a sequence that stays
conserved, or even identical, across multiple and/or distant species is likely to
be constrained (kept similar by evolutionary pressure), which indicates a
biological function. However, the opposite is not always true: a DNA sequence can
have a biological function without being conserved in any other species’ genome.
This is especially true for novel lineage-specific changes, where time has not
yet afforded the sequence a signature of conservation. Conservation also does not
necessarily imply identity: a position can be constrained to prefer two or three
of the four bases.
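The point that a constrained position may still tolerate two or three bases can be illustrated by scoring each column of a multiple alignment. The following is a minimal sketch, not a standard tool: the function name `column_conservation` and the toy alignment are hypothetical, the score is based on Shannon entropy, and gaps are assumed to be absent.

```python
import math
from collections import Counter

def column_conservation(alignment):
    """Return a conservation score in [0, 1] for each alignment column.

    Score = 1 - H / 2, where H is the Shannon entropy (in bits) of the base
    frequencies in the column and 2 bits is the maximum for four bases.
    Assumption: sequences are already aligned and contain no gap characters.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
        scores.append(1 - entropy / 2)
    return scores

# Toy alignment (made-up data): column 0 is identical in all sequences,
# column 1 tolerates two bases, column 3 is unconstrained.
toy_alignment = ["ACGT", "ACGA", "ATGC", "ATGG"]
scores = column_conservation(toy_alignment)
```

A column that accepts only two of the four bases scores between a fully identical column and an unconstrained one, which is the sense in which conservation does not require identity.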
Goals of comparative genomics
The following are some of the major goals of comparative genomics:
1. The principal goal is to identify all the DNA sequences
present in an organism that are functional.
2. Determining the biological roles of those sequences.
3. Determining the evolutionary history of the organism.
4. Comparing genomes across species to identify the
conserved coding regions of the genome.
The bioinformatics tools used in comparative genomics help in understanding the
complete genomic sequence.
Importance of comparative genomics
A vital step in understanding the genome is the identification of DNA sequences
that have been conserved over millions of years. Comparative genomics helps to
highlight the genes that are essential to life and pinpoints the genomic signals
that coordinate the functions of genes across species. Treating human disease or
improving human health through the latest gene-level approaches is likewise only
possible by studying numerous biological systems.
As time passes, sequencing is becoming more common, easier, and less expensive,
so its applications are extending to fields such as agriculture, zoology, and
biotechnology. Comparative genomics clarifies the evolutionary history of
different species and the evolutionary relationships between them, and ultimately
helps scientists understand the evolution of the appearance, behavior, and
biology of organisms.
Tools for genome-scale sequence alignment
Alignment of the two genomic sequences to be compared is considered the first
and foremost step in comparative genomics. After the sequence elements between
the conserved regions have been aligned, the whole-genome alignment is done.
Computational tools are useful for multiple purposes, such as aligning DNA
sequences, visualizing conservation levels between sequences, and identifying
highly conserved regions. Different approaches and tools are used to analyze the
sequences of interest. Some of the algorithms and their URLs are mentioned below.
Conventional algorithms such as Smith-Waterman are exact but computationally
expensive, so they are rarely applied directly to whole genomes; instead, newer
tools such as VISTA are widely used.
The VISTA family of tools is a comprehensive suite of programs and databases
used for comparative analysis of multiple genomic sequences.
VISTA can be used in two ways:
1. Submitting sequences and alignments for analysis through the different VISTA servers.
2. Examining alignments of the whole genomes of different species.
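To make the conventional approach concrete, the following is a minimal sketch of Smith-Waterman local alignment with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) and the example sequences are arbitrary assumptions chosen for illustration; genome-scale tools such as VISTA rely on different, faster strategies that are not shown here.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring parameters are illustrative assumptions, not recommended values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; scores are never allowed below 0.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Two made-up sequences that share the local region "ACGT".
result = smith_waterman("TTACGTAA", "GGACGTCC")
```

The matrix has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths, which is why this exact method becomes impractical for whole genomes.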
Facts and figures
In 1995, the first complete genome sequence of a free-living organism, the
bacterium Haemophilus influenzae, was officially published. Since then, many
genomes have been completely sequenced, and the sequencing of other organisms is
still in progress, which indicates that this field is still emerging with some
remarkable results.
The following table compares the sizes of several complete and draft genome
sequences of archaea, bacteria, and eukaryotes, and showcases both the vast range
of genome sizes and the large size of some eukaryotic genomes.
Methanosarcina acetivorans str. C2A
Helicobacter pylori 26695
Haemophilus influenzae Rd
Escherichia coli K12
Guillardia theta nucleomorph
Saccharomyces cerevisiae S288C
Oryza sativa L. ssp. indica (draft)
Homo sapiens (draft)
Different levels of comparing genomes
1. Comparative analysis of genome structure
This approach comprises the comparison of overall nucleotide statistics and of
genome structure at the DNA level as well as the gene level. The details of these
analyses are as follows:
A. Comparison of overall nucleotide statistics
Overall nucleotide statistics, such as genome size, total G+C content, regions of
differing G+C content, and genome signatures such as codon usage biases, amino
acid usage biases, and the ratio of the observed dinucleotide frequency to the
frequency expected under a random nucleotide distribution, present a global view
of the similarities and differences between genomes. For example, the mouse genome
is about 10% smaller than the human genome because it contains fewer DNA repeats.
Although the total G+C content of the two Helicobacter pylori strains J99 and
26695 is the same, they differ in G+C content in certain regions, which imparts
specific characteristics to each strain.
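Two of the statistics named above, total G+C content and the ratio of observed to expected dinucleotide frequency, can be computed directly from a sequence. The following is a minimal sketch with hypothetical function names. It assumes a single strand containing only A, C, G, and T, and it takes the expected frequency of a dinucleotide to be the product of its two single-base frequencies.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def dinucleotide_odds_ratios(seq):
    """Observed/expected ratio for every dinucleotide present in seq.

    ratio(xy) = f(xy) / (f(x) * f(y)), where f is a relative frequency.
    Assumption: only the given strand is counted (no reverse complement).
    """
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    return {
        pair: (count / n_di) / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair, count in di.items()
    }

# Toy input (made-up sequence): AC occurs more often than chance predicts.
ratios = dinucleotide_odds_ratios("ACACAC")
```

A ratio well above 1 marks a dinucleotide that is over-represented relative to the base composition, and a ratio well below 1 marks one that is under-represented; the set of ratios is one form of genome signature.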
B. Comparison of genome structure at DNA level
In this approach, synteny and genome rearrangement events are studied first.
Conserved synteny gives information about evolution and common ancestors. Synteny
means “gene loci on the same chromosome”. It also refers to the situation in which
“two regions of two genomes have similarity of sequence and rough conservation of
the order of genes in those regions”, and are thus likely to be related by common
descent. Synteny is detected by two methods:
1. Identification of long conserved sequence elements.
2. Comparison of proteins using BLASTp.
Secondly, the breakpoints, that is, the boundaries of syntenic regions, are
studied. This involves examining the G+C content, gene density, and density of
different DNA repeats at the breakpoints. Comparison of breakpoints also gives
information about evolution.
Thirdly, DNA repeats are analysed. DNA repeats are repetitive DNA sequences that
are contained in most genomes. By analysing the content and distribution of DNA
repeats, their function can be predicted. For example, when the distribution of
L1 elements, a type of repeat, was analysed in a region of the X chromosome in
human, mouse, and bovine, more L1 elements were found on one strand of DNA than
on the other in all three species. It is therefore suspected that L1 elements
could have a function. A tool to analyze DNA repeats is
C. Comparison of genome structure at gene level
In this analysis, gene order is compared between two species. Gene order
correlates with the evolutionary distance between genomes, because chromosomal
breakage and the exchange of chromosomal fragments alter gene order. For example,
when the gene orders of Saccharomyces cerevisiae and Candida albicans were
compared, the result showed that gene order is remarkably different in the two
yeast species.
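One simple way to quantify how far gene order has diverged is to count how many neighbouring gene pairs are preserved between two genomes. The following is a minimal sketch under stated assumptions: the function names and gene identifiers are hypothetical, each genome is given as a single ordered list, orthologous genes carry the same identifier, and each identifier occurs once per genome.

```python
def adjacent_pairs(gene_order):
    """Set of unordered neighbouring gene pairs along one ordered gene list."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}

def conserved_adjacency_fraction(order_a, order_b):
    """Fraction of neighbouring pairs in genome A that are also neighbours in B.

    Only genes present in both lists are considered, so lineage-specific
    genes do not count as breaks in gene order.
    """
    shared = set(order_a) & set(order_b)
    a = [g for g in order_a if g in shared]
    b = [g for g in order_b if g in shared]
    pairs_a, pairs_b = adjacent_pairs(a), adjacent_pairs(b)
    if not pairs_a:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a)

# Made-up example: genome B carries an inversion of the block g3-g4-g5.
genome_a = ["g1", "g2", "g3", "g4", "g5"]
genome_b = ["g1", "g2", "g5", "g4", "g3"]
fraction = conserved_adjacency_fraction(genome_a, genome_b)
```

Pairs are treated as unordered, so genes inside an inverted block still count as neighbours and only the adjacency at the breakpoint is lost. A low fraction between two genomes would be consistent with many rearrangement events since their common ancestor.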
2. Comparative analysis of coding regions
This approach involves the identification of gene-coding regions, the
comparison of gene content, and the comparison of protein content, as explained below:
Identification of gene-coding regions
In this analysis, gene identification algorithms are used to find and compare
the coding regions of DNA. The four main approaches are as follows:
1. Based on direct evidence of transcription
2. Based on homology with known genes
3. Statistical/ab initio approaches
4. Genome comparison
A. Comparison of gene content
After the coding regions have been found, the gene content is compared across
the genomes. The two basic statistics of this comparison are:
1. Estimation of the total number of genes.
2. Estimation of the percentage of the genome that codes for genes, average gene
length, gene density, codon usage, etc.
The gene set can be compared with those of other genomes using a pairwise
sequence comparison tool such as BLASTN.
B. Comparison of protein content
In this type of analysis, the products of genes from different genomes are
compared. This approach is also known as comparative proteomics. The comparison
is useful for identifying specific pathways or functional categories that show
high diversity among the genomes. The KEGG pathway database and the Gene Ontology
(GO) hierarchy are widely used resources for pathways and functional categories.
3. Comparative analysis of non-coding regions
Genomes contain a large amount of non-coding DNA; for example, about 97% of the
human genome is non-coding. Studies of non-coding regions are important because
these regions may have roles in the regulation of transcription, DNA replication,
and other biological functions. However, regulatory elements in the non-coding
regions of a genome are hard to identify.
Regulatory elements are identified by comparative analysis of the non-coding
regions of the genomes of many species. This approach presumes that, because of
selective pressure, regulatory elements evolve at a slower rate than
non-regulatory sequences.
This approach has been used successfully to discover regulatory elements
involved in the regulation of gene expression for many genes, including HBB
(encoding β-globin), BTK (encoding Bruton’s tyrosine kinase), the interleukin
genes IL4, IL5, and IL13, the cystic fibrosis transmembrane conductance regulator
gene, and the stem cell leukemia (SCL) locus.
Applications of comparative genomics
By establishing genome correspondence, comparative genomics helps in gene
identification. Real genes are identified through comparative genomics based on
their patterns of nucleotide conservation across evolutionary time. The alignments
of known genes also reveal the conserved reading frame of the protein.
The genome of a species encodes genes and other functional elements, interspersed
with non-functional nucleotides, in a single string of DNA. Recognizing
protein-coding genes typically relies on finding open reading frames (ORFs) that
are too long to have likely occurred by chance. Since stop codons occur at a
frequency of roughly 1 in 20 in random sequence, ORFs of at least 60 amino acids
will occur frequently by chance. This poses a huge challenge for higher
eukaryotes such as mammals, in which genes are typically broken into many small
exons. The basic problem is distinguishing functional ORFs from spurious ORFs. In
mammalian genomes, estimates of the number of hypothetical genes have ranged from
28,000 to more than 120,000. Internal coding exons have been readily identified
by comparative analysis of the human genome with the mouse genome.
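The chance-ORF argument above can be checked with a short calculation. The following is a minimal sketch under simplifying assumptions: bases are taken to be independent and equally frequent, so 3 of the 64 codons are stop codons (about 1 in 21), and the ORF scanner looks only at the forward strand. The function names and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def prob_no_stop(n_codons):
    """Probability that n random codons contain no stop codon.

    Assumption: uniform, independent bases, so 61 of 64 codons are non-stop.
    """
    return (61 / 64) ** n_codons

def find_orfs(seq, min_codons=60):
    """Return (start, end) slices of ATG...stop ORFs on the forward strand.

    An ORF is reported only if it has at least min_codons codons before the
    stop codon; end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

# Chance that a random stretch of 60 codons is free of stop codons.
p60 = prob_no_stop(60)

# Tiny made-up sequence with one short ORF (threshold lowered for the demo).
demo_orfs = find_orfs("CCATGAAATTTTAGCC", min_codons=3)
```

Under these assumptions a random stretch of 60 codons escapes all stop codons a little more than 5% of the time, so a scanner with a 60-codon threshold will still report many spurious ORFs in a large genome. This is why conservation across species is used as additional evidence.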
Regulatory motifs are short DNA sequences (6 to 15 bp) that are used to control
the expression of genes. Each motif is recognized by a specific DNA-binding
protein called a transcription factor (TF). A transcription factor binds precise
sites in the promoter region of its target genes, with some degree of sequence
variation. Thus, different binding sites may contain slight variations of the
same underlying motif, and the definition of a regulatory motif should capture
these variations while remaining as specific as possible. Comparative genomics
distinguishes regulatory motifs from non-functional patterns based on their
conservation. Examples include:
1. Identification of TF DNA-binding motifs using comparative genomics and de novo motif discovery.
2. Regulatory motifs of human promoters were identified by comparison with other mammals.
Another important finding of comparative genomics is the gene and regulatory element by comparison of
Evolutionary theory is the foundation of comparative genomics, while the results
of comparative genomics have in turn developed the theory of evolution. When two
or more genome sequences are compared, their evolutionary relationships can be
inferred. Comparative genomics exploits both similarities and differences in the
proteins, RNA, and regulatory regions of different organisms to conclude how
selection has acted upon these elements. Elements responsible for similarities
between different species remain conserved through time (stabilizing selection),
while those responsible for differences are divergent (positive selection).
One of the important goals of the field is the identification of the mechanisms
of eukaryotic genome evolution. However, this is complicated by the multiplicity
of events that have taken place throughout the history of individual lineages,
leaving only distorted and superimposed traces in the genome of each living
organism. For this reason, comparative genomic studies of small model organisms,
for example Caenorhabditis elegans and the closely related Caenorhabditis
briggsae, are of great importance for enhancing our understanding of
evolutionary mechanisms.
The agricultural field also gains the benefits of comparative genomics.
Identifying the loci of advantageous genes is a key step in breeding crops that
are optimized for greater yield, cost-efficiency, quality, and disease
resistance. For example, one genome-wide association study conducted on 517 rice
landraces revealed 80 loci associated with several categories of agronomic
performance, such as grain weight, amylose content, and drought tolerance. Many
of these loci were previously uncharacterized. This methodology is powerful and
quick, while previous methods were very time-consuming.
The medical field also benefits from the study of comparative genomics.
Vaccinology has experienced useful advances due to comparative genomics. In
reverse vaccinology, researchers discover candidate antigens for vaccine
development by analyzing the genome of a pathogen or a family of pathogens.
The development of multiprotective vaccines results from comparing different
related pathogens. Researchers employed such an approach to create a universal
vaccine for Group B Streptococcus, a group of bacteria responsible for severe
neonatal infections.
Comparative genomics can also be used to generate specificity for vaccines
against pathogens that are closely related to commensal microorganisms. For
example, researchers used comparative genomic analysis of commensal and
pathogenic strains of E. coli to identify pathogen-specific genes as a basis for
finding antigens that result in an immune response against pathogenic strains
but not commensal ones.
Comparative genomics also opens new avenues in other areas of research. As DNA
sequencing technology has become more accessible, the number of sequenced genomes
has grown. With the increasing pool of available genomic data, the effectiveness
of comparative genomic inference has grown as well. A notable case of this
increased potency is found in recent primate research. Comparative genomic
methods allowed researchers to gather information about genetic variation,
differential gene expression, and evolutionary dynamics in primates that was
indiscernible using previous data and methods. The Great Ape Genome Project
used comparative genomic methods to investigate genetic variation with
reference to the six great ape species,
finding healthy levels of variation in their gene pool despite decreased
Comparative microbial genomics helps to investigate historical epidemics and
deaths, and how the approaches developed may be applicable to more recent and