6th-9th May 2025
To foster international participation, this course will be held online
Topic: Machine learning for sequence data with time and causation
This course introduces methods and approaches for analysing data, chiefly longitudinal (sequence) data with repeated measurements in time or space, when time and cause-effect relationships matter. Time and
causation pose specific challenges in all aspects of processing and analysis, from visualization and exploratory data analysis to modelling, validation and the interpretation of results. The
course will outline the main challenges of dealing with time and causation when analysing sequence data: first briefly, from the classical statistical perspective;
then, more extensively, from the machine learning perspective. Time and causation also need special attention when resolving biases in the data and results, e.g. confounding, collider
and mediator bias, and when disentangling cause-effect relations. Specific areas that will be covered include modelling of sequence data, forecasting (prediction of time-series data), survival
analysis, graph models, Bayesian networks and machine learning algorithms, with applications to epidemiology and gene-expression experiments.
The course is structured in modules over four days. The first two days will mostly cover the basic concepts and the classical statistical perspective; the last two days will be devoted to the
machine learning approach. Each day will include lectures with class discussions of key concepts and practical hands-on sessions with collaborative exercises where students will interact with the
whole class and instructors to apply the acquired skills. Results will be interpreted and discussed during and after each exercise. At the end of the course, the class will take a quiz together to
recap and highlight the most important concepts covered, and there will be time to discuss specific research problems and questions from participants.
The course is aimed at advanced students, researchers and professionals interested in learning how to deal with time and causation in sequence data, and how to analyse such data in the context of
real-life applications in biology. It will include information useful for both absolute beginners and more advanced users willing to delve into some aspects of the implementation of longitudinal
models and scripting code. We will start by introducing the general concepts and approaches to deal with sequence data in the presence of time and cause-effect relationships; we will then explore
applications to specific scientific domains (e.g. forecasting, epidemiology, gene expression) and extensions to machine learning methods.
Attendees are expected to have a background in biology and to be familiar with research problems involving prediction, inference and pattern discovery; previous exposure to inferential and predictive experiments
would be beneficial. There will be a mix of lectures and hands-on practical exercises using Python, Markdown/Jupyter Notebooks and the Linux command line. Some basic understanding of Python
programming and of the Linux environment will be advantageous, but is not required.
At the end of the course the student will have an understanding of:
- how to recognise and treat spatial and temporal dependencies in the data
- how to disentangle cause-effect relationships in the data
- the most common methods to analyse data with time and/or cause components
- methods and principles of machine learning for sequence data
- specific applications to life-science domains like epidemiology and gene expression experiments
- how to design, analyse and interpret scientific experiments with time and cause components
Day 1 – Classes from 2-8 PM Berlin time
- Sequence data: examples and challenges
- Time is pervasive; cause-effect relations are tricky
- The classical statistical perspective
- Confounding, collider and mediator biases
- Statistical models to analyse data with repeated records over time (multiple time points) and space (multiple locations)
Day 2 – Classes from 2-8 PM Berlin time
- Graph models and Bayesian networks
- Cross-validation with temporal, spatial and cause data structures
- The machine-learning perspective: predicting time series, performance metrics
Day 3 – Classes from 2-8 PM Berlin time
- A primer on longitudinal data in epidemiology: time series of disease incidence/prevalence, survival analysis
- Imputation of missing data with time/space/cause dependencies (RFi, KNNi, etc.)
- More ML: deep learning and transformer models for the analysis of sequence data
Day 4 – Classes from 2-8 PM Berlin time
- Analysis of residuals and model diagnostics
- Case study: multi-omics analysis. A study in interpretability on HeLa cell cycling, integrating mRNA, translation data and proteomics, from raw data to final insights
- Final recap quiz
- Discussion of your own research problems and wrap-up
Should you have any further questions, please send an email to info@physalia-courses.org
Cancellation Policy:
> 30 days before the start date = 30% cancellation fee
< 30 days before the start date = no refund.
Physalia-courses cannot be held responsible for any travel fees, accommodation or other expenses incurred by you as a result of the cancellation.