EduXchange.NL

Statistics for Data Scientists

MAT32806

About this course

In many areas of biological, environmental and social science, new tools and strategies are being developed to measure multiple features of subjects and objects of interest. These new types of data typically occur in large volumes, are high-dimensional, or are hierarchically structured. For example, in genetics (humans, animals, plants), data can be available at the levels of DNA, RNA, proteins, metabolites and all kinds of phenotypes. In food science, the effects of diet and lifestyle variables can be investigated with respect to physical and mental indicators of performance and well-being. In social science, indicators of economic success can be studied in relation to educational, socio-economic, psychological and lifestyle variables as well as social media behavior. These modern data require new techniques for analysis and visualization, which are provided by a new science that combines elements of statistics, mathematics, computer science and substantive knowledge: Data Science.
Statistics takes a central place in Data Science, as it offers a general framework for model building, inference and evaluation in a wide range of data science applications. Statistics provides strategies for evaluating the reliability of data analysis outcomes, even when these results are obtained by techniques outside the classical statistical domain of, for example, regression and analysis of variance. Modern statistics offers powerful techniques for the analysis of contemporary data, such as penalized regression and classification (e.g. ridge, lasso, elastic net), Bayesian methods, mixed modelling, generalized linear and additive modelling, decision trees and random forests. The course will introduce such techniques across the domains of supervised and unsupervised statistical learning.
The course thus has two main objectives: (i) to acquaint the student with a coherent set of modern techniques at the interface of Statistics and Data Science, and (ii) to support the development of skills with which students can choose, build, and evaluate the best modelling strategies for a wide range of complex data challenges. Case studies will serve to illustrate strategies for model building and evaluation and to operationalize key concepts such as dimension reduction, sparsity, hierarchical modelling, and penalization. All analyses will be performed with the statistical programming language R.

Learning outcomes

After successful completion of this course, students are expected to be able to:

  • explain and compare a broad range of modern statistical methods in data science;
  • select an appropriate data analysis method based on the characteristics of the data;
  • apply data analysis methods for data science (in R);
  • evaluate the reliability of the outcomes of an analysis;
  • communicate results from the data science life cycle to a multidisciplinary data science team.

Required prior knowledge

Assumed Knowledge:
MAT20306 Advanced Statistics or MAT22306 Quantitative Research Methodology and Statistics or MAT24306 Advanced Statistics for Nutritionists

Link to more information

If anything remains unclear, please check the FAQ of Wageningen University.

Offering(s)

  • Start date
      6 January 2025
    • Ends
      31 January 2025
    • Term
      Period 3
    • Location
      Wageningen
    • Instruction language
      English
    • Register between
      1 Jun, 00:00 - 24 Nov 2024
  • Start date
      10 March 2025
    • Ends
      2 May 2025
    • Term
      Period 5
    • Location
      Wageningen
    • Instruction language
      English
    • Register between
      1 Jun, 00:00 - 9 Feb 2025
These offerings are valid for students of TU Eindhoven