Low coverage genome sequence and ovine 60K SNP chip creation
The consortium has investigated various methods by which low-coverage
sequence of the ovine genome could be produced. In addition to various
simulations using bovine and ovine sequence, this work has included
pilot studies using Sanger resequencing from existing sheep sequence and
Roche 454 GS20 sequencing of previously sequenced sheep BACs to provide
baseline information.
An important aspect of the process is to identify SNPs, determine their
genomic location, estimate their minor allele frequency (MAF), and provide
sufficient known unique surrounding sequence to design probes for their
detection. This is a challenge using existing technology; however, several
points have already been established:
- The sequencing would make use of the virtual sheep genome to provide
a framework for genome assembly.
- The new sequencing technologies produce very short reads, which means
assembly needs to use a divide and conquer approach; even then, assembly
through even modest lengths of repetitive sequence is a challenge.
- The short contigs almost certainly need to be ordered and orientated
against a reference genome such as the bovine.
- The best technology for SNP detection and estimation of MAF provides
insufficient genomic sequence for probe design and genome positioning.
Current Approach
The International Sheep Genomics Consortium’s immediate objective
is to skim sequence the ovine genome to identify SNPs for a 60K SNP chip.
Roche 454 FLX sequencing is a new technology based on pyrosequencing.
Simulations based on the limited ovine genomic sequence available, together
with results from pilot ovine resequencing projects, identified the
following strategy: Roche 454 FLX technology would be used to produce 3x
whole genome coverage, consisting of 0.5x shotgun sequence coverage from
each of 6 ewes. Each animal would represent a different breed, and
the resulting sequence would be assembled using the bovine genome as a
framework, then reorganized into ovine order using the virtual sheep genome.
This approach has been estimated to provide assembled and ordered sequence
for approximately 60% of the ovine genome. It would also detect 286,000
probable SNPs with defined genomic locations, of which 180,000 would
potentially be useful candidates for constructing a 60K ovine SNP chip
comprising equally spaced SNPs. Based on available information, this would
provide a resource in which the mean linkage disequilibrium (r²) between
adjacent SNPs would exceed 0.25, which is suitable for whole
genome association studies.
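The figures in the paragraph above can be checked with a little arithmetic. The sketch below is illustrative only: the genome size of about 3 Gbp is an assumption inferred from the statement elsewhere in this document that 0.5x coverage corresponds to 1.5 Gbp, and the function names are hypothetical.

```python
# Figures quoted in the text.
N_EWES = 6
COVERAGE_PER_EWE = 0.5      # fold shotgun coverage per animal
CHIP_SNPS = 60_000          # SNPs on the planned chip
CANDIDATE_SNPS = 180_000    # SNPs potentially useful for selection

# Assumption: genome size inferred from "0.5 x genome coverage (1.5 Gbp)".
GENOME_BP = 1.5e9 / COVERAGE_PER_EWE


def total_coverage(n_animals, coverage_each):
    """Combined fold coverage when reads from all animals are pooled."""
    return n_animals * coverage_each


def mean_spacing_bp(genome_bp, n_snps):
    """Average distance between SNPs if they were spread evenly."""
    return genome_bp / n_snps


def candidates_per_slot(n_candidates, n_chosen):
    """How many candidate SNPs compete for each position on the chip."""
    return n_candidates / n_chosen


if __name__ == "__main__":
    print("pooled coverage:", total_coverage(N_EWES, COVERAGE_PER_EWE))
    print("mean spacing (bp):", mean_spacing_bp(GENOME_BP, CHIP_SNPS))
    print("candidates per slot:", candidates_per_slot(CANDIDATE_SNPS, CHIP_SNPS))
```

Under these assumptions, a 60K chip corresponds to roughly one SNP per 50 kb on average, with about three candidate SNPs available for each chip position.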
Sequencing for this phase is nearing completion, with AgResearch in New
Zealand sequencing the Romney, Texel and Scottish Blackface breeds and
Baylor HGSC in Houston, Texas sequencing the Merino, Poll Dorset and
Awassi breeds.
The project has recently had two components added to identify more SNPs,
estimate their minor allele frequency more accurately, and improve the
genome assembly. The first extension is to include ~4 Gbp of reduced
representational sequencing with an Illumina Solexa Genome Analyser to
identify numerous additional SNPs and estimate their minor allele frequency
using a technique outlined by Smith et al. (2008). The second extension is
to improve assembly by creating paired end reads of various insert sizes
and sequencing lengths using a combination of next generation and Sanger
sequencing.
Roche FLX skim sequencing method
Sequencing
- Six female animals, each of a different breed, were selected (Fig. 1)
  - different breeds help identify SNPs with higher minor allele
  frequencies (MAF)
  - females chosen to equalise representation of the X chromosome
- DNA isolated from white blood cells using standard proteinase K digestion
and salt ethanol precipitation
- Each animal sequenced to 0.5x genome coverage (1.5 Gbp) via Roche
454 FLX
- Two 454 FLX libraries made per animal, with each library titrated
and the best used
Assembly
- 454 reads repeat masked with an in-house repeats database consisting
of Repbase bovine repeats coupled with CAP3-assembled ovine FLX sequence
segments found to have >1000 hits in the bovine genome
- Unique hits matched to their location on the bovine genome
  - MEGABLAST used with options -D 3 -t 21 -W 11 -q -3 -r 2 -G 5 -E 2 -s 56 -N 2 -F "m D" -U T
  - A hit is defined as unique where only a single hit occurred with
  an E-value of less than 1e-5, or where multiple hits were present and
  the ratio E(top hit)/E(second hit) was less than 1e-20
- Raw reads matching bovine scaffold segments (typically < 2 Mbp)
retrieved and assembled using Newbler
- Newbler ovine contigs positioned and orientated on the bovine scaffold
- Result summarised as a virtual ovine sequence (MELD, see Fig 2)
- MELDed ovine segments reordered into ovine genome order using ovine
BAC end sequence (BES) information and the virtual sheep genome (VSG)
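The uniqueness rule used when matching reads to the bovine genome can be written as a small predicate. This is a minimal sketch, not the consortium's code: the input format (a list of E-values per read) is hypothetical, and it assumes that the best hit must also pass the 1e-5 cutoff when several hits are present, which the text does not state explicitly.

```python
E_VALUE_CUTOFF = 1e-5   # single-hit threshold quoted in the text
RATIO_CUTOFF = 1e-20    # top/second E-value ratio quoted in the text


def is_unique_hit(e_values):
    """Decide whether a read's hits amount to a unique placement.

    e_values: E-values of every hit reported for one read
    (hypothetical input format).
    """
    if not e_values:
        return False                      # no hit, nothing to place
    ranked = sorted(e_values)             # best (smallest) E-value first
    top = ranked[0]
    if len(ranked) == 1:
        return top < E_VALUE_CUTOFF
    second = ranked[1]
    if second == 0.0:
        # Two hits both reported with E-value 0 cannot be told apart.
        return False
    # Assumption: the best hit must also pass the single-hit cutoff.
    return top < E_VALUE_CUTOFF and top / second < RATIO_CUTOFF


if __name__ == "__main__":
    print(is_unique_hit([1e-30]))          # single strong hit
    print(is_unique_hit([1e-50, 1e-10]))   # clear winner among two hits
    print(is_unique_hit([1e-50, 1e-45]))   # two similar hits
```

The ratio test keeps reads whose best placement is overwhelmingly better than the runner-up, which is one way to avoid placing reads from repeated or duplicated regions.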
SNP Detection
- Align sequence reads to the 454 MELD sequence (Fig 3)
- Filter high quality SNPs based on:
  - unique genomic match,
  - high quality sequence, no flanking SNPs,
  - not within or adjacent to a homopolymeric run,
  - at least 2 reads of the minor allele, preferably from different
  animals,
  - at least 50 bp of flanking sequence on both sides
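The filter criteria listed above can be combined into a single check per candidate SNP. The sketch below is an illustration under stated assumptions: the `CandidateSNP` record and its fields are hypothetical, and the homopolymer run length (4) and base quality cutoff (20) are placeholder values, since the text gives no thresholds for them. Only the minimum of 2 minor allele reads and 50 bp of flank come from the text; the preference for reads from different animals is not enforced here.

```python
from dataclasses import dataclass

MIN_MINOR_READS = 2     # from the text
MIN_FLANK_BP = 50       # from the text
MIN_HOMOPOLYMER = 4     # assumption: run length treated as homopolymeric
MIN_BASE_QUALITY = 20   # assumption: cutoff for "high quality sequence"


@dataclass
class CandidateSNP:
    """Hypothetical record for one candidate SNP on the MELD sequence."""
    context: str             # MELD sequence around the SNP (upper case)
    offset: int              # 0-based index of the SNP within `context`
    unique_match: bool       # reads place uniquely on the genome
    min_base_quality: int    # lowest quality among supporting bases
    minor_allele_reads: int  # reads carrying the minor allele
    has_flanking_snp: bool   # another SNP lies in the flanking sequence


def near_homopolymer(seq, pos, min_run=MIN_HOMOPOLYMER):
    """True if `pos` is inside or directly next to a run of identical bases."""
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        # The run occupies [i, j); adjacency adds one base on each side.
        if j - i >= min_run and i - 1 <= pos <= j:
            return True
        i = j
    return False


def passes_filters(snp):
    """Apply every listed criterion; all must hold for the SNP to be kept."""
    left_flank = snp.offset
    right_flank = len(snp.context) - snp.offset - 1
    return (
        snp.unique_match
        and snp.min_base_quality >= MIN_BASE_QUALITY
        and not snp.has_flanking_snp
        and not near_homopolymer(snp.context, snp.offset)
        and snp.minor_allele_reads >= MIN_MINOR_READS
        and min(left_flank, right_flank) >= MIN_FLANK_BP
    )
```

The homopolymer check reflects a known weakness of pyrosequencing, where the length of a run of identical bases is hard to call accurately, so apparent variants next to such runs are treated with suspicion.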
Current modifications and extensions
- The consortium is now also using reduced representational sequencing
with the Illumina Solexa Genome Analyser (Curt Van Tassell, Tim Smith
& James Kijas, pers. comm.).
  - 60 animals (primarily female) and ~1% of the genome sequenced to
  20X depth/run
  - 4 Solexa sequencing runs of 1 Gbp with 35 bp reads should generate
  at least 150,000 high quality SNPs
  - Solexa sequences to be positioned on Roche 454 MELDed sequence to
  provide the genome location of each SNP and sufficient flanking
  sequence for probe design
  - We expect the majority of SNPs selected for use on the 60K chip
  will originate from this approach
- Paired end reads to aid de novo assembly are being created
on a limited trial basis by Baylor HGSC
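The scale of the reduced representational sequencing can be sanity-checked from the numbers quoted above. This is a rough sketch: the ~3 Gbp genome size is an assumption inferred from the 0.5x = 1.5 Gbp figure given earlier, the function names are hypothetical, and the raw depth computed here is an upper bound that assumes every sequenced base is usable and falls in the targeted ~1% of the genome.

```python
GENOME_BP = 3e9          # assumption: inferred from 0.5x = 1.5 Gbp
TARGET_FRACTION = 0.01   # ~1% of the genome (from the text)
RUN_YIELD_BP = 1e9       # 1 Gbp per Solexa run (from the text)
READ_LENGTH_BP = 35      # read length (from the text)
N_RUNS = 4               # number of runs (from the text)


def reads_per_run(yield_bp, read_length_bp):
    """Approximate number of reads produced by one run."""
    return yield_bp / read_length_bp


def raw_depth_per_run(yield_bp, genome_bp, target_fraction):
    """Upper-bound fold depth over the targeted fraction of the genome."""
    return yield_bp / (genome_bp * target_fraction)


if __name__ == "__main__":
    print("reads per run:", round(reads_per_run(RUN_YIELD_BP, READ_LENGTH_BP)))
    print("raw depth per run:",
          round(raw_depth_per_run(RUN_YIELD_BP, GENOME_BP, TARGET_FRACTION), 1))
    print("total yield (Gbp):", N_RUNS * RUN_YIELD_BP / 1e9)
```

Under these assumptions the raw depth per run comes out somewhat above the 20X quoted in the text, and the four runs together give the ~4 Gbp total mentioned for this extension.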
Groups involved
- Funding: Ovita (New Zealand), ISL Grant (Sydney University, Australia), Genesis Faraday (United Kingdom)
- Roche 454 FLX sequencing: AgResearch, University of Otago and Baylor HGSC
- Illumina Solexa reduced representational sequencing: CSIRO, Illumina, USDA
- Assembly: AgResearch, Baylor HGSC, CSIRO
- SNP detection: AgResearch, Baylor HGSC, CSIRO, USDA
Time Frame
| Sequence 3X coverage using 454 FLX | complete early January 2008 |
| Assemble and create MELD sequence | target late January 2008 |
| Illumina RRS sequencing | target early February 2008 |
| SNP detection and selection | target mid March 2008 |
| SNP chip creation and validation | target end June 2008 |