About this course
Advances in technology and information processing are rapidly changing many fields of plant sciences, animal sciences and ecology, including research, agriculture and conservation. For example, distributed sensor networks now allow the acquisition of huge volumes of data on many relevant aspects, ranging from soil and vegetation characteristics and abiotic conditions such as weather to the behaviour of animals. The availability of unprecedented amounts of data is unlocking potential, but it also creates a major challenge: processing and analysing these data effectively. In the current data-centred digital era, driven by technological change, the volume of data will continue to grow rapidly as the costs of data collection, storage and processing decrease. Fostered by these technological developments, researchers and various branches of business are increasingly embracing data science: a concept that unifies data processing, statistics, artificial intelligence and their related algorithms to extract knowledge from data. Hence, data science is becoming an integral part of decision making in many fields, including precision agriculture, livestock management and nature conservation, as it enables automated prediction and classification (e.g., Is this animal ill? Is this plant a weed? Is this apple ready to pick? When should we harvest?).
To keep up with these technological developments, students need to become acquainted with the accompanying terms, concepts and methodology. This is especially important because data science can require a different approach to using data and conducting science than the approaches students are familiar with. In particular, the large volumes of data usually come from various sources, each with its own characteristics, uncertainties and measurement errors. The data from these different sources need to be integrated, and the inherent heterogeneity should be accounted for. Moreover, the collected sensor data are generally not immediately fit for analysis, so pre-processing of the raw data is needed. After initial data pre-processing, the engineering of informative and discriminating features (i.e., measurable properties of the phenomenon being observed) is a crucial step in creating effective algorithms. Furthermore, the collection of large volumes of data leads to a shift away from frequentist hypothesis testing towards analytics that are more focussed on prediction, classification, pattern recognition or anomaly detection. To this end, machine learning techniques are often used, usually supported by high-performance computing.
This course covers the main elements of using a data science approach to solve agricultural or ecological problems. The students will be guided through the main concepts and skills that are required to become a successful data scientist working in ecology. These skills relate to three pillars of data science expertise: (1) mathematics and statistics; (2) computer science and programming; and (3) domain knowledge, i.e., the understanding of patterns and processes governing (agro-)ecological systems. Hence, this course builds upon, and expands, the understanding and skills gained in other courses, and focuses on combining these in an interdisciplinary way to solve (agro-)ecological problems with a data-driven approach. Approaches to solving common (agro-)ecological problems will be discussed, as well as the problems common to the associated data: the usually large degree of spatio-temporal (auto)correlation and the non-independence between individuals. Methods to deal with these issues will be discussed, including algorithms that specifically account for them.
During the course, students will increase their knowledge and skills via hands-on experience in which the principles and methods taught are put into practice. Using large datasets from current cutting-edge science projects (e.g., data gathered about animal behaviour via wearable sensors such as GPS and inertial measurement units, or data about vegetation via airborne or ground-based spectral sensors), different steps in the data science lifecycle will be covered and practised: from problem definition; data management, cleaning and pre-processing; data exploration; feature engineering; selecting and training algorithms; optimizing hyperparameters; validating algorithms; testing predictions; to visualization and communication of results. The students will be trained to apply different machine learning techniques and to critically evaluate their merits. In doing so, students will acquire and expand data science skills that will prepare them for a quantitative MSc thesis and that will benefit their future career in academia or business.
After successful completion of this course, students are expected to be able to:
- Explain important concepts in data science needed to solve typical ecological problems;
- Explain how key features of ecological data influence the selection, training, validation and evaluation of algorithms;
- Identify and select machine learning algorithms appropriate to specific ecological problems;
- Create a reproducible workflow (loading raw data, data processing, feature engineering, and machine learning algorithms) to efficiently analyse ecological datasets;
- Critically evaluate the reliability and adequacy of trained algorithms;
- Create ecological insight from data using a data science approach;
- Communicate the key elements and findings of a data science project clearly and concisely.
Experience with programming in R is needed to follow and successfully complete this course. For example, students who have followed a course in which R is heavily used, e.g., CSA34306 Ecological Modelling and Data Analysis in R, will likely have sufficient background knowledge to participate in this course. We strongly urge students without prior experience with programming in R to learn it before the start of the course, either by:
- following the online course ‘R programming’ on Coursera (https://www.coursera.org/learn/r-programming): this course can be audited for free, and following the first 2 weeks of this course will suffice; or
- studying the free online book ‘Hands-On Programming with R’ (https://rstudio-education.github.io/hopr/), of which parts 1 and 2 provide sufficient prerequisite knowledge.
We advise students who are unsure about their level of R skills to go through the first 2 parts of the online book ‘Hands-On Programming with R’ (https://rstudio-education.github.io/hopr/). If most elements discussed in these first 2 parts are understood, then the understanding of R programming is sufficient to participate in this course.
We assume a general understanding of ecology, mathematics and statistics. Familiarity with the concept of data science (e.g., INF34306 Data Science Concepts), the application of statistical methods to ecological data (e.g., CSA34306 Ecological Modelling and Data Analysis in R), and algorithms used in data science (e.g., MAT32806 Statistics for Data Scientists; FTE35306 Machine Learning; GRS34806 Deep Learning in Data Science) is helpful but not required.