17-21 March 2025
To foster international participation, this course will be held online.
This course will introduce biologists and bioinformaticians to the concepts of de novo genome assembly and annotation, providing a theoretical framework and practical examples. A variety of
sequencing technologies and their applications to generate haplotype-phased, high-quality reference genomes will be presented and discussed. They include Illumina short reads (for both assembly
and gene annotation), PacBio HiFi (‘High Fidelity’) and CLR (‘Continuous Long Read’) reads, Oxford Nanopore long and ultralong reads, as well as scaffolding technologies including optical mapping
and proximity ligation (Hi-C). Special attention will be given to quality control throughout the assembly process (e.g. tools such as GenomeScope, Merqury, Pretext) as well as to the mitigation of
consensus and structural errors and to manual curation. The concept of telomere-to-telomere (T2T) genome assembly, and the means to achieve it, will also be introduced. Annotation tools using Illumina
RNA-Seq and PacBio Iso-Seq data will be introduced. By the end of the course, students will understand what is needed to generate an annotated, curated, high-quality
reference genome.
The course is aimed at researchers interested in learning more about genome assembly and annotation. It will include information useful to both beginners and more advanced users. We will start by
introducing general assembly and annotation concepts and algorithms, providing a historical context. We will then describe all major components of a typical genome assembly workflow using the
Vertebrate Genomes Project assembly pipeline as an example. We will further analyse the multiple ways a genome can be annotated to maximize its utility for downstream analyses. There will be a mix
of lectures and hands-on practical exercises, using both a graphical interface (https://assembly.usegalaxy.eu/) and the basic command line. Prior experience with Linux is welcome but not required.
No prior background in DNA sequencing is required.
- Understand the concepts related to de novo genome assembly and annotation for genomes of all sizes, from viruses to mammals
- Learn the strengths and weaknesses of different sequencing technologies, including Illumina short-read sequencing, Pacific Biosciences and Oxford Nanopore
long-read sequencing, as well as scaffolding technologies including optical mapping and proximity ligation (Hi-C), for de novo genome assembly and annotation
- Gain hands-on experience with common tools for de novo genome assembly, assembly quality evaluation, assembly visualization and manual curation
- Gain hands-on experience with feature annotation (e.g. genes, repeats)
Monday – Classes from 2-8 pm Berlin time
Session 1: Introduction
In this kick-off session, students will be introduced to genomics. This introduction will include a historical background on the DNA sequencing and genome
assembly techniques developed since the discovery of nucleic acids. It will range from early strategies to popular high-throughput approaches, including a general overview of so-called
‘third-generation’ approaches (long-read sequencing) and scaffolding approaches (optical mapping, proximity ligation with Hi-C). This introduction to DNA sequencing will be accompanied by an overview
of the evolution of bioinformatics throughout the 20th Century and the early 21st Century. Special attention will be paid to the evolution of genome assembly approaches, including those that will
be employed in the rest of the course. We will discuss some of the latest developments in the field. State-of-the-art applications, targeted to ecological and evolutionary studies, will also be
introduced.
Session 2: Principles of bioinformatics in genome assembly algorithms
In this module we will explore in more depth the bioinformatic concepts that are pivotal to understanding the different steps of a
genome assembly pipeline. We will dive into past and current assembly algorithms (greedy, de Bruijn graphs, OLC, string graphs) and we will also present the most common methods and file formats
used. The importance of raw data evaluation (contamination, adapter trimming, read correction) and reference tools to assess genome properties (e.g. GenomeScope2) will be presented. We will also
present and discuss popular QC tools to evaluate consensus and structural accuracy (e.g. Merqury, BUSCO). We will also review key concepts in genome assembly and evaluation (e.g. N* metrics,
consensus accuracy, haplotype switches). This section will be interactive, with hands-on questions and cases to be addressed by the students. We will test some of these algorithms with small test
datasets and analyse the outputs. Assembly graphs will be visualized and discussed with tools such as Bandage.
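As a small illustration of the N* metrics mentioned above, the following Python sketch computes Nx values (N50, N90, etc.) from a list of contig lengths. It is not part of the course material: the function name `nx` and the example contig lengths are hypothetical, chosen only to show the usual definition (the length of the contig at which the cumulative length, taken from the longest contig downward, first reaches x% of the total assembly size).

```python
def nx(lengths, x=50):
    """Return the Nx value for a list of contig lengths.

    Nx is the length of the contig at which the running total of
    lengths (sorted longest first) first reaches x% of the assembly size.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")

    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


# Hypothetical contig lengths (in bp), for illustration only.
contigs = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(contigs, 50)
n90 = nx(contigs, 90)
print(f"N50 = {n50}, N90 = {n90}")
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why it is usually reported alongside measures such as consensus accuracy and gene completeness.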
- Aureliano Bombarely, Professor at IBMCP (Instituto de Biología Molecular y Celular de Plantas) in Valencia, Spain
- Kirsty McCaffrey, Bioinformatics Assistant, The Vertebrate Genome Laboratory, The Rockefeller University, New York, NY
- Marco Sollitto, Postdoctoral Associate, University of Florence / The Vertebrate Genome Laboratory, The Rockefeller University, New York, NY
- Bonhwang Koo, Data Support Assistant, The Vertebrate Genome Laboratory, The Rockefeller University, New York, NY
- Terence Murphy, Head of Comparative Genomics Resources, NCBI
Cancellation Policy:
More than 30 days before the start date: 30% cancellation fee.
Less than 30 days before the start date: no refund.
Physalia-courses cannot be held responsible for any travel fees, accommodation or other expenses you incur as a result of the cancellation.