Session content

Monday – Classes from 2-8 pm Berlin time

Session 1: Introduction
In this kick-off session students will be introduced to genomics. This introduction will include a historical background on the DNA sequencing and genome assembly techniques developed since the discovery of nucleic acids. It will range from early strategies to popular high-throughput approaches, including a general overview of so-called ‘third-generation’ approaches (long-read sequencing) and scaffolding approaches (optical mapping, proximity ligation Hi-C). This introduction to DNA sequencing will be accompanied by an overview of the evolution of bioinformatics throughout the 20th century and the early 21st century. Special attention will be paid to the evolution of genome assembly approaches, including those that will be employed in the rest of the course. We will discuss some of the latest developments in the field. State-of-the-art applications, targeted to ecological and evolutionary studies, will also be introduced.

Session 2: Principles of bioinformatics in genome assembly algorithms
In this module we will explore in more depth the bioinformatic concepts that are pivotal to understanding the different steps of a genome assembly pipeline. We will dive into past and current assembly algorithms (greedy, de Bruijn graphs, OLC, string graphs) and we will also present the most common methods and file formats used. The importance of raw data evaluation (contamination, adapter trimming, read correction) and reference tools to assess genome properties (e.g. GenomeScope2) will be presented. We will also present and discuss popular QC tools to evaluate consensus and structural accuracy (e.g. Merqury, BUSCO), and review key concepts in genome assembly and evaluation (e.g. N* metrics, consensus accuracy, haplotype switches). This section will be interactive, with hands-on questions and cases to be addressed by the students. We will test some of these algorithms with small test datasets and analyse the outputs. Assembly graphs will be visualized and discussed with tools such as Bandage.
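
As an illustration of the N* metrics mentioned above, the following is a minimal Python sketch, not part of the course material. It computes Nx (for example N50 or N90) from a list of contig lengths, using the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the total assembly length. The function name `nx_metric` and the contig lengths are made up for the example.

```python
def nx_metric(lengths, x=50):
    """Return Nx for a list of contig (or scaffold) lengths.

    Nx is the length L such that contigs of length >= L together
    cover at least x percent of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")

    total = sum(lengths)
    running = 0
    # Walk through the contigs from longest to shortest and stop as soon
    # as the cumulative length reaches x percent of the assembly.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * x:
            return length


# Hypothetical contig lengths (total = 300), for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx_metric(contig_lengths, 50))
print("N90:", nx_metric(contig_lengths, 90))
```

Contiguity metrics of this kind say nothing about correctness, which is why they are considered alongside consensus and structural accuracy measures.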

Tuesday – Classes from 2-8 pm Berlin time

Session 3: Hands-on genome assembly
In this module we will start our genome assembly using the Galaxy GUI. The preliminary setup for this session will be presented during the previous sessions. We will conduct the session in breakout rooms. We will use the VGP assembly pipeline with publicly available datasets. We will introduce and discuss the computational requirements and the input data needed, that is, at least one long-read technology (either PacBio CLR/HiFi or Nanopore) and one scaffolding technology (optical maps or Hi-C). Please note that Illumina reads will only be used for scaffolding and annotation during the practical session. After an overview of the pipeline and the QC tools employed, we will run it.

Session 4: From contigs to scaffolds
This module is dedicated to the technologies and algorithms used to achieve chromosome-level assemblies. It features an in-depth review of popular scaffolding technologies, including Bionano optical mapping and various proximity ligation techniques (Hi-C) such as Dovetail Omni-C, as well as the latest approaches for finished T2T genome assembly based on ultra-long (UL) nanopore reads. We will present and discuss the tools available for the scaffolding technologies (e.g. SALSA2) and T2T assembly (e.g. Verkko), as well as various tools for QC and visualization (e.g. Pretext and HiGlass). We will employ the scaffolding tools on the outputs from the previous session. Reference-guided scaffolding tools will also be briefly discussed.

Wednesday – Classes from 2-8 pm Berlin time

Session 5: Hands-on genome assembly
In this module we will continue our genome assembly using the VGP pipeline. We will review any issues that may have occurred in Session 4. We will employ the tools described and tested in Session 4 to scaffold the contigs previously generated. We will evaluate the results of the scaffolding and address potential issues. Students will be asked to share and discuss their results together.

Session 6: Genome curation
In this session, we will discuss the importance of QC and of manual validation of the assemblies generated by automated pipelines. We will explain how to improve the assembly of the genomes. We will investigate assembly errors and discuss the reasons for these errors and potential fixes. We will present tools that aid manual genome curation (e.g. HiGlass).

Thursday – Classes from 2-8 pm Berlin time

Session 7: Hands-on genome curation
In this module we will work to understand how a genome assembly can be improved by manual curation. After detecting and discussing potential misassemblies, we will look into the tools available for manual curation and use them to improve some assemblies that will be made available as examples.

Session 8: From chromosome-level genomes to annotated genomes
In this module we will introduce and provide a detailed overview of genome annotation techniques. This will not be limited to the annotation of genes, but will also include other important features such as tandem and interspersed repeats (centromeric, pericentromeric, telomeric repeats, transposable elements). We will also describe the data types required to use some of these techniques (e.g. RNA-seq, Iso-Seq). We will discuss current challenges in genome annotation.

Friday – Classes from 2-8 pm Berlin time

Session 9: Hands-on genome annotation
In this module we will analyse a transcriptome. This will either be from the same species we have worked on for the assembly, or from another species if no expression data is available. We will take advantage of multiple lines of evidence to correct gene models, and will also discuss the concept of alternative splicing. To that end, we will use software available on the Galaxy platform (such as MAKER) that takes into account different gene predictors.

Session 10: Look at a big one
In this module students will be introduced, through a practical exercise, to the principles of Telomere-to-Telomere genome assembly, that is, the techniques available to generate gapless and nearly error-free reference genomes for any species, from input data to manual curation. We will also discuss what lies beyond individual reference genomes: accurate pangenomes that can capture the genetic makeup of a species.