A further increase in mapping resolution is accomplished by manipulating
cloned DNA fragments directly. Because DNA is the physical material of the
genome, the procedures are generally called
physical mapping.
One goal of physical mapping is to identify a set of overlapping
cloned fragments that together encompass an entire chromosome or an entire
genome. The resulting physical map is useful in three ways. First, the genetic
markers carried on the clones can be ordered and hence contribute to the overall
genome mapping process. Second, when the contiguous clones have been obtained,
they represent an ordered library of DNA sequences that can be exploited
for future genetic analysis; for example, to correlate mutant phenotypes
with disruptions of specific
molecular regions. Third, these clones form the raw material that will be
sequenced in large-scale genome projects.
In the preparation of physical maps of genomes, vectors that can carry very large inserts are naturally the most useful. Cosmids, YACs (yeast artificial chromosomes), BACs (bacterial artificial chromosomes), and PACs (phage P1-based artificial chromosomes) have been the main types. Cosmids and YACs were introduced in Chapters 12 and 13. BACs (Figure 14-12) are based on the 7-kb F plasmid of E. coli. Recall that F can carry large fragments of E. coli DNA as F′ derivatives (Chapter 7). In a similar manner, BACs used as cloning vectors can carry inserts of foreign DNA as large as 300 kb, although the average is about 100 kb. PACs are produced by comparable engineering of phage P1; they carry inserts similar in size to those of BACs.
Although the maximum insert sizes of BACs and PACs are not as large as those of YACs, the former types have several advantages over YACs. First, they can be amplified in bacteria and isolated and manipulated simply with basic bacterial plasmid technology. Second, BACs and PACs form fewer hybrid inserts than YACs do. Hybrid inserts are composed of several different fragments; their presence can thwart attempts to order the clones.
However, despite these useful vectors, the task of genomic cloning is a daunting one. Even so-called small genomes still contain huge amounts of DNA. Consider, for example, the 100-Mb genome of the tiny nematode Caenorhabditis elegans; because an average cosmid insert is about 40 kb, at least 2500 cosmids would be required to span this genome, and many more would have to be isolated and characterized to find such a complete set. YACs can contain on the order of 1 Mb, so here the task is somewhat simpler.
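The arithmetic behind these estimates can be sketched as follows. This is only a lower bound, assuming inserts of exactly average size laid end to end with no overlap; the function name `min_clones` is introduced here for illustration.

```python
import math

def min_clones(genome_bp: int, insert_bp: int) -> int:
    """Smallest number of non-overlapping inserts that could tile the genome."""
    return math.ceil(genome_bp / insert_bp)

GENOME_BP = 100_000_000  # C. elegans genome, about 100 Mb (figure from the text)
COSMID_BP = 40_000       # average cosmid insert, about 40 kb
YAC_BP = 1_000_000       # YAC insert, on the order of 1 Mb

cosmids = min_clones(GENOME_BP, COSMID_BP)
yacs = min_clones(GENOME_BP, YAC_BP)
print(cosmids, yacs)  # 2500 100
```

Because real clones are picked at random and must overlap to be ordered, a practical library has to be several times larger than this minimum.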
Cloning a whole genome begins by amassing a large number of randomly cloned inserts. The contents of these clones must be characterized in some way, and overlaps must be determined. A set of overlapping clones is called a contig. In the early phases of a genome project, contigs are numerous and represent cloned “islands” of the genome. But, as more and more clones are characterized, contigs enlarge and merge into one another, and eventually the project should end up with a number of contigs equal to the number of chromosomes.
If good chromosomal landmarks are known, FISH analysis can be used to locate the approximate positions of the large inserts. Figure 14-14 shows results of a FISH analysis that generates a rough ordering of BAC and PAC clones in human chromosomes. FISH is thus a method for placing BAC and PAC clones in order.
Ordering by clone fingerprints.
The genomic insert carried by a vector has its own unique sequence, which
can be used to generate a DNA fingerprint. For example, a multiple restriction-enzyme
digestion can generate a set of bands whose number and positions are a unique
“fingerprint” of that clone. The different bands generated by separate clones
can be aligned either visually or by using a computer program to determine
if there is any overlap between the inserted DNAs. In this way, the contig
can be built up.
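As a rough illustration of how a computer program might compare two fingerprints, the sketch below counts bands whose sizes agree within a tolerance. The band sizes, the 2 percent tolerance, and the function names are all assumptions made for this example, not values from any particular mapping project.

```python
def shared_bands(bands_a, bands_b, tolerance=0.02):
    """Greedily count bands in A that match an unused band in B.

    Two bands match if their sizes differ by at most `tolerance`
    (as a fraction of the larger band), mimicking gel resolution.
    """
    remaining = sorted(bands_b)
    shared = 0
    for size in sorted(bands_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del remaining[i]  # each band in B may be matched only once
                break
    return shared

def overlap_score(bands_a, bands_b, tolerance=0.02):
    """Fraction of the smaller fingerprint's bands that are shared."""
    return shared_bands(bands_a, bands_b, tolerance) / min(len(bands_a), len(bands_b))

# Hypothetical restriction-fragment sizes in base pairs.
clone_a = [9400, 6100, 4300, 2250, 1800, 950]
clone_b = [7200, 4310, 2240, 1805, 1200, 600]
clone_c = [8800, 5200, 3100, 1500, 700]

print(overlap_score(clone_a, clone_b))  # 0.5, so A and B probably overlap
print(overlap_score(clone_a, clone_c))  # 0.0, no evidence of overlap
```

A high score suggests, but does not prove, that two inserts overlap, because unrelated fragments can have similar sizes by chance.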
Unique short sequences of large cloned inserts can be used as tags to align the various clones into contigs. For example, if clone A has tags 1 and 2 and clone B has tags 2 and 3, clones A and B must overlap in the region of tag 2. The practical procedure is to amass a large set of random clones with small genomic inserts (say, in λ phage) and sequence short regions of each. From these sequences, pairs of PCR primers are designed that will amplify the short specific sequence of DNA flanked by the primers. These short DNA sequences are known as sequence-tagged sites (STSs). Even though initially the location of these STSs in the genome is not known, a panel of many STSs can be used to characterize clones with large genomic inserts (such as YAC clones). The clones that are shown to have specific STSs in common must have overlapping inserts and therefore can be aligned into contigs. An example of this process is shown in Figure 14-15.
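The tag logic can be expressed as a minimal sketch: clones that share at least one STS are joined, and chains of such joins form a contig. The clone names, tag numbers, and the function `build_contigs` are hypothetical and serve only to illustrate the principle.

```python
from itertools import combinations

def build_contigs(sts_content):
    """Group clones into contigs; clones sharing at least one STS are joined."""
    parent = {clone: clone for clone in sts_content}

    def find(clone):
        # Follow parent links to the representative clone of the group.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    for a, b in combinations(sts_content, 2):
        if sts_content[a] & sts_content[b]:  # at least one STS in common
            parent[find(a)] = find(b)

    contigs = {}
    for clone in sts_content:
        contigs.setdefault(find(clone), set()).add(clone)
    return sorted(sorted(members) for members in contigs.values())

# Hypothetical STS content of five clones.
sts_content = {
    "A": {1, 2},
    "B": {2, 3},
    "C": {3, 4},
    "D": {7, 8},
    "E": {8, 9},
}
contigs = build_contigs(sts_content)
print(contigs)  # [['A', 'B', 'C'], ['D', 'E']]
```

Here clones A and C share no tag directly, yet they fall into one contig because both overlap clone B.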
Short stretches of sequence are sometimes obtained from cDNA clones. These stretches are known as expressed sequence tags (ESTs). ESTs are obtained by sequencing into the cDNA insert by using a primer based on the vector sequence. They can be used to align the cDNAs on the contig, thus anchoring the gene map to the physical map. Further, if part of the open reading frame (ORF) of the transcript is contained within the EST, the “virtual” translation of the ORF can provide a “sneak preview” of the function of the protein encoded by the mRNA from which the cDNA was derived.
The combination of these physical methods has resulted in the cloning
of whole genomes of several organisms. For example, the C. elegans
genome is now available as sets of cosmid or YAC contigs. Furthermore,
the DNA of the contigs has been arranged on nitrocellulose filters in ordered
arrays; so, to find out where a specific piece of DNA of interest lies in
the genome, that DNA is used as a probe on the contig filters, and a positive
hybridization signal announces the precise location of the DNA (Figure 14-16).
Several of the smaller human chromosomes have been fully cloned as overlapping
sets of YAC clones (contigs). We shall examine the cloning of the Y chromosome
as an example because it illustrates several of the techniques of physical
mapping. The STS map of the Y chromosome was in fact obtained by two different
methods: YAC alignment and deletion analysis.
YAC alignment.
Flow sorting yielded a sample of Y chromosomes, from which λ clones
were made. From clones that did not contain repetitive DNA, STS primers were
designed. In all, 160 primer pairs were made. A Y chromosome YAC library of
10,368 clones was obtained in which the average insert size was 650 kb. From
these numbers, each point on the Y chromosome was estimated to have been
sampled an average of four times. The YAC clones were divided into 18 pools
of 576 YACs, and the pools were screened with the STS primers. Subdivision
of positive pools led rapidly to the assignment of a particular STS to specific
YACs. The total STS content of each YAC was assessed, and overlaps between
the YACs were determined in the same way as that shown in the generalized
example in Figure 14-15.
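The pooling strategy can be sketched as below. Only the figures of 10,368 clones in 18 pools of 576 come from the text; the way positive pools are subdivided (six sub-pools at each step), the clone numbers carrying the STS, and the function names are assumptions made for illustration, with a simple membership test standing in for the PCR assay.

```python
def screen(pool, is_positive_pool, n_subpools=6):
    """Subdivide a pool until single positive clones remain.

    Returns (positive_clones, number_of_pool_tests).
    """
    tests = 0
    positives = []
    stack = [pool]
    while stack:
        current = stack.pop()
        tests += 1  # one PCR assay per pool tested
        if not is_positive_pool(current):
            continue  # negative pool: none of its clones needs testing
        if len(current) == 1:
            positives.append(current[0])
            continue
        size = -(-len(current) // n_subpools)  # ceiling division
        stack.extend(current[i:i + size] for i in range(0, len(current), size))
    return sorted(positives), tests

N_CLONES, POOL_SIZE = 10_368, 576
pools = [list(range(s, s + POOL_SIZE)) for s in range(0, N_CLONES, POOL_SIZE)]

carriers = {1200, 1215, 7777}  # hypothetical YACs containing one STS

def assay(pool):
    """Stand-in for PCR on pooled DNA: positive if any clone carries the STS."""
    return any(clone in carriers for clone in pool)

found, tests = [], 0
for pool in pools:
    hits, n = screen(pool, assay)
    found += hits
    tests += n

print(len(pools), found, tests)
```

The point of pooling is that the number of assays stays far below the 10,368 that would be needed to test every clone individually for each STS.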
Deletion analysis.
Various types of Y chromosome deletions occur naturally. For example, some
XX males contain truncated fragments of the Y, whereas some XY females have
deletions of the region containing the maleness (testis-determining) gene
(see Chapters 2 and 23). These Y deletions were maintained in cell culture and formed the basis
for aligning the Y chromosome STSs. Each deletion was tested for STS content.
Because by nature the deletions were nested sets, the STS content could be
used not only to develop an STS map, but also to map the coverage of the deletions.
The principle is illustrated in Figure 14-17. The STS maps produced by YAC alignment and by deletion analysis were identical.
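The reasoning from nested deletions to an STS order can be sketched as follows, under the simplifying assumption that every cell line retains a fragment anchored at the same end of the chromosome, so that the retained STS sets nest perfectly. The cell-line names, STS names, and the function are hypothetical.

```python
def order_by_nested_deletions(retained):
    """Order STS groups outward from the retained end of the chromosome.

    `retained` maps each deletion line to the set of STSs it still carries.
    STSs that first appear in the same line cannot be ordered relative
    to one another and are returned together as one interval.
    """
    lines = sorted(retained.values(), key=len)
    # Check the nesting assumption: each smaller set lies inside the next.
    for smaller, larger in zip(lines, lines[1:]):
        if not smaller <= larger:
            raise ValueError("retained STS sets are not nested")
    intervals, seen = [], set()
    for sts_set in lines:
        new = sts_set - seen
        if new:
            intervals.append(sorted(new))
        seen |= sts_set
    return intervals

# Hypothetical STS content of three deletion lines.
retained = {
    "line1": {"sts1"},
    "line2": {"sts1", "sts2", "sts3"},
    "line3": {"sts1", "sts2", "sts3", "sts4"},
}
sts_order = order_by_nested_deletions(retained)
print(sts_order)  # [['sts1'], ['sts2', 'sts3'], ['sts4']]
```

The same table read the other way gives the extent of each deletion: line1 breaks between sts1 and the sts2/sts3 interval, and line2 breaks between that interval and sts4.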
Several different strategies have been successfully applied to genome projects. Their advantages and disadvantages depend on the size and complexity of the genome. Of particular importance is the frequency of repetitive DNA in the genome.
Random clone sequencing. The first genome of a free-living organism to be sequenced was that of the bacterium Haemophilus
influenzae. Genomic DNA was mechanically sheared and used to obtain
a large number of random clones that were presumed to overlap each other
in numerous ways. Primers based on adjacent vector DNA were used to sequence
short regions at the ends of the cloned Haemophilus inserts. Then
these short sequences were used (much like sequence-tagged sites) to align
the genomic clones. Because so many random short sequences were obtained,
together they encompassed most of the Haemophilus genome. Gaps were
filled in by “primer walking”; that is, by using the end of a cloned
sequence as a primer to sequence into adjacent uncloned fragments.
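The idea of aligning random short sequences by their overlaps can be illustrated with a toy greedy merger. This is a minimal sketch on a made-up 14-base sequence, not the software used in the Haemophilus project; it assumes error-free reads from one strand and no repeated sequences.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b (>= min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return reads  # no overlaps left; each remaining string is a contig
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]

# Made-up reads sampled from the sequence ATGCGTACGTTAGC.
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))              # ['ATGCGTACGTTAGC']
print(greedy_assemble(["ATGCGT", "TTAGC"]))  # two contigs: a gap remains
```

When reads fail to overlap, as in the second call, the result is more than one contig; these are the gaps that primer walking is used to close. Repetitive DNA breaks the assumption that an overlap implies adjacency, which is one reason why the amount of repetitive DNA matters in choosing a strategy.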
Most genomic sequencing programs start with a set of ordered clones. We
have seen that an ordered set of YAC clones was developed for the human Y
chromosome and other human chromosomes. However, YAC clones are not suitable
for sequencing directly. YACs are subcloned into overlapping BACs or PACs.
The BACs or PACs are again aligned into contigs by using STSs or the alignment
of clone fingerprints. The BAC or PAC clones are again subcloned into smaller
inserts for sequencing. At this level, multiple overlapping clones are sequenced
randomly (without establishing clone alignment) so that any BAC or PAC clone
is sequenced as many as five times in all.
We shall follow the methods used to identify the genomic sequence of the cystic fibrosis (CF) gene as an example. No primary biochemical defect was known at the time that the gene was isolated, so it was very much a gene in search of a function. Linkage to molecular markers had located the gene to the long arm of chromosome 7, between bands 7q22 and 7q31.1. The CF gene was thought to be inside this region, flanked by the gene met (a proto-oncogene; see Chapter 22) at one end and a molecular marker, D7S8, at the other end. But between these markers lay 1.5 centimorgans (map units) of DNA, a vast uncharted terrain of 1.5 million bases. Additional markers within the region were obtained by using new probes derived from a chromosome 7 library made by flow sorting.
However, the two key techniques that were used to traverse the huge genetic distances were chromosome walking (Chapter 13) and a related technique called chromosome jumping. The latter technique provides a way of jumping across potentially unclonable areas of DNA and generates widely spaced landmarks along the sequence that can be used as initiation points for multiple bidirectional chromosomal walks.
Chromosome jumping is illustrated in Figure 14-19. In this procedure, large fragments are created by partial restriction cleavage of the DNA in the region believed to contain the gene of interest. Each DNA fragment is then circularized, thus bringing the beginning and end of the fragment together. This junction is cut out and cloned into a phage vector; the collection of such junction clones makes up a jumping library. A probe from the beginning of the stretch of DNA under investigation can be used to screen the jumping library to find the clone that contains the beginning sequence. When this clone is found, the other end of the junction sequence is excised and used to screen the library again to make a second jump. From each jump position, chromosome walks can be made in both directions to search for genelike sequences.
A restriction map of the overall region was obtained with rare-cutting restriction enzymes, and the restriction sites were used to position and orient the sequences obtained from jumping and walking. When enough sequencing had been done to cover representative parts of the overall region, the hunt for any genes along this stretch began. Genes were sought by several techniques. First, human genes were known to be generally preceded at the 5′ end by clusters of cytosines and guanines, called CpG islands, and several of these clusters were found. Second, it was reasoned that a gene would show homology to the DNA of other animals, because of evolutionary conservation, so candidate sequences were used to probe what were called zoo blots of genomic DNA from a range of animals. Third, genes should have appropriate start and stop signals. Fourth, genes should be transcribed, and transcripts should be found.
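The first of these criteria lends itself to a simple calculation. The sketch below scores a stretch of sequence by its G + C content and by the ratio of observed to expected CpG dinucleotides. The thresholds used (G + C above 50 percent, ratio above 0.6) are commonly quoted rules of thumb and are an assumption here, not values given in the text; the sequences are made up and far shorter than a real island.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # C followed immediately by G
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0  # CpGs expected from base composition
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(seq, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to one window of sequence."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return gc_fraction > min_gc and obs_exp > min_ratio

print(cpg_stats("CGCGCGCGCG"))                    # (1.0, 2.0)
print(looks_like_cpg_island("CGCGCGCGCG"))        # True
print(looks_like_cpg_island("ATATATCCAAGGTTAT"))  # False
```

In practice such a test would be applied to windows a few hundred bases long slid along the sequenced region.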
Ultimately, a strong candidate gene was found spanning 250 kb of the region. Some CF symptoms are expressed in sweat glands; so cDNA was prepared from cultured sweat gland cells, and a 6500-nucleotide cDNA homologous to the candidate gene was detected. When this cDNA was sequenced from unaffected individuals and from CF patients, the cDNA of the patients showed a deletion of three base pairs, eliminating a phenylalanine from the protein. It was therefore very likely that this was the CF coding sequence. From its cDNA nucleotide sequence, an amino acid sequence was inferred. In turn, from this inferred sequence, the three-dimensional structure of the protein was predicted. This protein is structurally similar to ion-transport proteins in other systems, suggesting that a transport defect is the primary cause of CF. When used to transform mutant cell lines from CF patients, the wild-type gene restored normal function; this phenotypic “rescue” was the final confirmation that the isolated sequence was in fact the CF gene.
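Why a deletion of exactly three base pairs removes a single amino acid and leaves the rest of the protein intact can be shown with a toy translation. The 15-base sequence below is made up for illustration and is not the CF gene sequence; the codon table contains only the five codons that the example needs.

```python
# Only the codons needed for this toy example (standard genetic code).
CODON_TABLE = {"ATG": "M", "GCT": "A", "TTT": "F", "GGT": "G", "AAA": "K"}

def translate(dna):
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)

def delete(dna, start, length):
    """Remove `length` bases beginning at 0-based position `start`."""
    return dna[:start] + dna[start + length:]

normal = "ATGGCTTTTGGTAAA"     # ATG GCT TTT GGT AAA
mutant = delete(normal, 6, 3)  # removes the TTT (phenylalanine) codon

print(translate(normal))  # MAFGK
print(translate(mutant))  # MAGK
```

Because the deletion is a multiple of three, the reading frame downstream is preserved, so the mutant protein differs from the normal one only by the missing phenylalanine (F).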
Genetics focuses on the nature of genes, and a major goal is to characterize
their structure and function. Recombinant DNA technology has allowed individual
genes to be isolated in a test tube and then characterized at the molecular
level. The technology is based on restriction enzymes, which cut DNA into
defined fragments. Restriction target sites can be mapped and act as DNA landmarks.
Restriction fragments often have sticky ends, enabling them to be inserted
into a vector capable of replicating in a bacterial cell. Such molecular
hybrids are known as recombinant DNA. Bacteria amplify a single recombinant
DNA molecule to form a DNA clone. Common vectors are plasmids, phages, and
cosmids. An entire genome can be cloned in a set of clones known as a library.
A specific clone can be found in a library by using a probe that specifically
binds to the DNA or to the protein of the desired clone. Specific clones
can also be isolated by their ability to transform null mutants. Tagging
also is useful for cloning a gene: transforming DNA or a transposon is used
to cause a mutation by insertion, and the DNA adjacent to the tag is isolated.
Chromosome walking provides a way of isolating a gene by sequential isolation
of overlapping clones, starting from a marker linked to the desired gene.
Cloned DNA can be sequenced by several methods, including the arrest of DNA
chain growth by dideoxynucleotides. The polymerase chain reaction uses primers
to amplify DNA sequences. It is a way of rapidly isolating DNA whose sequence
is already partly known and of detecting small amounts of one specific
type of DNA. Gel electrophoresis separates variously sized DNA or RNA molecules
from a mixture. Probes can detect specific DNA or RNA molecules on the gel,
in procedures known as Southern and Northern analyses, respectively.