Fundamentals of Data Analysis

This page contains information on the Fundamentals of Data Analysis course, including the syllabus, lecture notes, and other resources.

The course has been updated over the years. Details for each academic year are given below.

Fondamenti di Analisi dei Dati e Laboratorio 2025/2026

Changes in the academic year 2025/2026

In the academic year 2025/2026, the syllabus and the course have been completely revised to adapt the workload to the move of the course from the Master's Degree to the Bachelor's Degree. The resulting course focuses on core concepts and leaves out details that are covered by other courses or are better suited to Master's studies.

Structure of the course

The course includes a theory module (6 CFU) and a laboratory module (3 CFU). Theory sessions focus on core topics and algorithms, but also cover intuition and examples. Laboratory sessions focus on the solution of real data analysis problems using the Python language and related data science libraries through live coding sessions in the classroom.

Examination

The examination consists of a written exam and a project in Python. Students who pass two in-itinere (mid-course) tests held during the course are exempted from the written exam.

Data Science Challenge

At the end of the course, a data analysis challenge will be organized. Students will have to solve a real data analysis project within 24/48 hours and present the results afterwards. The challenge can be tackled individually or in small groups of up to 3 students. If the presentation is deemed sufficient, the students are exempted from the project.

Notes

High-quality, open-source notes will be provided during the course. For reference, consider the notes from previous years, which can be found here. Please note that, due to changes in the syllabus, those notes are significantly more extensive than the ones intended for this year, which will be released lecture by lecture. The old notes will always remain available for reference, but their content is not subject to examination. Students should instead refer to the current year's notes.

Syllabus Overview

This flyer gives brief information on the course. A tentative syllabus follows:

Introduction to Data Analysis: Aims, relevance of data, the data analysis lifecycle, and course structure.
Exploratory Data Analysis and Descriptive Statistics: Data acquisition, types of data, tabular datasets, measures of central tendency and spread, data visualization, wrangling, and normalization.
Introduction to Laboratories: Python fundamentals, essential libraries (NumPy, SciPy, Matplotlib, Seaborn, Plotly, Statsmodels, scikit-learn), and hands-on data cleaning.
Probability for Data Analysis and Data Distributions: Uncertainty, random variables, probability estimation, joint and conditional probabilities, expectation, variance, covariance, and common probability distributions.
Data Association: Pearson's chi-square statistic, Cramér's V, covariance, and various correlation coefficients (Pearson, point-biserial, Spearman, Kendall).
Statistical Inference Tools for Data Analysis: Sampling, standard error, confidence intervals, bias-variance trade-off, hypothesis testing, and normality assessment.
Linear Regression: Simple and multiple linear regression, estimating coefficients, model accuracy assessment, variable selection, qualitative predictors, and interaction terms.
Logistic Regression: Relationship between continuous and binary variables, logistic function, model interpretation, and multinomial logistic regression.
Introduction to Predictive Analysis: Overfitting, empirical risk minimization, generalization, model selection (train/val/test split, cross-validation), and regularization (Ridge, Lasso).
Classification Problems: Classification vs. regression, evaluation measures, logistic regression as a discriminative classifier, softmax regression, and handling data imbalance.
Data Representation and Clustering: Importance of data representation, feature extractors, supervised vs. unsupervised learning, and K-Means clustering.
Density Estimation: Parametric vs. non-parametric density estimation, Kernel Density Estimation, Parzen Window, and Maximum Likelihood.
Principal Component Analysis (PCA) for Unsupervised Dimensionality Reduction: Definition of PCA and applications.
Supervised Dimensionality Reduction: Fisher Linear Discriminant and Linear Discriminant Analysis (LDA).