Assembly and Annotation of Genomes

Dates

17-21 March 2025

 

To foster international participation, this course will be held online.

 

Overview

This course will introduce biologists and bioinformaticians to the concepts of de novo genome assembly and annotation, providing a theoretical framework and practical examples. A variety of sequencing technologies and their applications to generate haplotype-phased, high-quality reference genomes will be presented and discussed. These include Illumina short reads (for both assembly and gene annotation), PacBio HiFi (‘High Fidelity’) and CLR (‘Continuous Long Read’) reads, Oxford Nanopore long and ultralong reads, as well as scaffolding technologies including optical mapping and proximity ligation (Hi-C). Special attention will be given to quality control throughout the assembly process (e.g. tools such as GenomeScope, Merqury, Pretext) as well as to the mitigation of consensus and structural errors and to manual curation. The concept of telomere-to-telomere (T2T) genome assembly, and the means to achieve it, will also be introduced, as will annotation tools that use Illumina RNA-Seq and PacBio Iso-Seq data. By the end of the course, students will understand what is needed to generate a high-quality, annotated and curated reference genome.

 

Target Audience & Assumed Background

The course is aimed at researchers interested in learning more about genome assembly and annotation. It will include information useful to both beginners and more advanced users. We will start by introducing general assembly and annotation concepts and algorithms, providing a historical context. We will then describe all major components of a typical genome assembly workflow, using the Vertebrate Genomes Project assembly pipeline as an example. We will further analyse the multiple ways a genome can be annotated to maximize its utility for downstream analyses. There will be a mix of lectures and hands-on practical exercises, using both graphical interfaces (https://assembly.usegalaxy.eu/) and the basic command line. Prior experience with Linux is welcome but not required. No prior background in DNA sequencing is required.

Learning outcomes

-       Understand the concepts related to de novo genome assembly and annotation for genomes of all sizes, from viruses to mammals

-       Learn the strengths and weaknesses of different sequencing technologies, including Illumina short-read sequencing, Pacific Biosciences and Oxford Nanopore long-read sequencing, as well as scaffolding technologies including optical mapping and proximity ligation (Hi-C), for de novo genome assembly and annotation

-       Gain hands-on experience with common tools for de novo genome assembly, assembly quality evaluation, assembly visualization and manual curation

-       Gain hands-on experience with feature annotation (e.g. genes, repeats)

Program

 

Monday – Classes from 2-8 pm Berlin time

 

Session 1: Introduction
In this kick-off session, students will be introduced to genomics. The introduction will include a historical background on the DNA sequencing and genome assembly techniques developed since the discovery of nucleic acids. It will range from early strategies to popular high-throughput approaches, including a general overview of so-called ‘third-generation’ approaches (long-read sequencing) and scaffolding approaches (optical mapping, Hi-C proximity ligation). This introduction to DNA sequencing will be accompanied by an overview of the evolution of bioinformatics throughout the 20th century and the early 21st century. Special attention will be paid to the evolution of genome assembly approaches, including those that will be employed in the rest of the course. We will discuss some of the latest developments in the field. State-of-the-art applications targeted to ecological and evolutionary studies will also be introduced.

Session 2: Principles of bioinformatics in genome assembly algorithms
In this module we will explore in depth the bioinformatic concepts that are pivotal to understanding the different steps of a genome assembly pipeline. We will dive into past and current assembly algorithms (greedy, de Bruijn graphs, OLC, string graphs) and present the most common methods and file formats used. The importance of raw data evaluation (contamination, adapter trimming, read correction) and reference tools to assess genome properties (e.g. GenomeScope2) will be presented. We will also present and discuss popular QC tools to evaluate consensus and structural accuracy (e.g. Merqury, BUSCO), and review key concepts in genome assembly and evaluation (e.g. N* metrics, consensus accuracy, haplotype switches). This section will be interactive, with hands-on questions and cases to be addressed by the students. We will test some of these algorithms with small test datasets and analyse the outputs. Assembly graphs will be visualized and discussed with tools such as Bandage.
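
As a small taste of the N* metrics mentioned above, the sketch below computes Nx and Lx (for example N50 and L50) from a list of contig lengths. It is only an illustration and not part of the course material: the function name `nx_metric` and the example lengths are made up for this purpose, and it uses the common definition in which Nx is the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the total assembly length.

```python
def nx_metric(lengths, x=50):
    """Return (Nx, Lx) for a list of contig or scaffold lengths.

    Nx: length of the contig at which the running total of lengths,
        taken from longest to shortest, first reaches x% of the assembly.
    Lx: number of contigs needed to reach that point.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")

    ordered = sorted(lengths, reverse=True)   # longest contigs first
    threshold = sum(ordered) * x / 100        # x% of the total assembly size

    running_total = 0
    for count, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= threshold:
            return length, count


# Hypothetical contig lengths (in bp), for illustration only.
contigs = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_metric(contigs, 50)
n90, l90 = nx_metric(contigs, 90)
print(f"N50 = {n50}, L50 = {l50}")
print(f"N90 = {n90}, L90 = {l90}")
```

Because the statistic depends only on the sorted lengths, a higher N50 indicates a more contiguous assembly but says nothing about its correctness, which is why it is usually read together with consensus and structural accuracy measures.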

 


Invited Speakers

 

 

-Aureliano Bombarely, Professor at IBMCP (Instituto de Biología Molecular y Celular de Plantas) in Valencia, Spain

 

-Kirsty McCaffrey, Bioinformatics Assistant, The Vertebrate Genome Laboratory, The Rockefeller University, New York, NY


-Marco Sollitto, Postdoctoral Associate, University of Florence / The Vertebrate Genome Laboratory, The Rockefeller University, New York, NY


-Bonhwang Koo, Data Support Assistant, The Vertebrate Genome Laboratory, The Rockefeller University, New York, NY


-Terence Murphy, Head of Comparative Genomics Resources, NCBI

Cost overview

Package 1

 

530 €


Cancellation Policy:

> 30 days before the start date = 30% cancellation fee

< 30 days before the start date = No refund.

 

Physalia-courses cannot be held responsible for any travel fees, accommodation or other expenses you incur as a result of the cancellation.