Knowledge of the most relevant statistical methods for the analysis of large data sets. Students will learn how to run a complete analysis workflow in R, from the collection and management of large data sets, through the choice of the most appropriate models, to the final interpretation and contextualization of the results.
Prerequisites
Knowledge of basic statistical concepts such as inference, confidence intervals, hypothesis testing, and the simple regression model. For the coding part, some familiarity with R or Python is recommended, even if acquired through online courses (e.g. Coursera).
Teaching methods
The class integrates theoretical lectures with practical sessions in R and Python, in which students learn how to implement and analyze the most appropriate models for the available data. A tutor will help students each week to acquire the necessary theoretical and practical knowledge. All the material used during the lectures (slides, scripts, data) will be available on the Kiro platform.
Assessment
The final mark is the weighted average of two scores: 1) The midterm exam covers approximately the first 6 weeks of lectures. It is a multiple-choice quiz of 20 to 25 questions. The mark ranges from 0 to 30. 2) The second exam is based on a real data analysis project. Students can work in teams (up to 3 members) or individually. Each team is required to choose a dataset from a list of data providers and to apply, using R (or Python), some of the models discussed during the lectures. The mark ranges from 0 to 30. Students who do not take the midterm, or who fail it, will take an oral exam on the first 6 weeks of lectures during the final project presentation.
Textbooks
1) An Introduction to Statistical Learning, Gareth M. James, Daniela Witten, Trevor Hastie, Robert Tibshirani (PDF and supplementary material available at https://www.statlearning.com/)
Contents
The aim of this course is to study and apply the most relevant statistical models for the analysis of large data sets. The perspective is both theoretical and applied: choosing and applying suitable models to exploit the full informative content of (large) data sets, with particular attention to the correct and contextualized interpretation of the final results. The course makes interactive use of open-source software such as R and Python, so that students learn the complete analysis workflow in practice. Particular emphasis will be given to social network data, textual data, and business and financial case studies. Some of the models that will be covered are: Naive Bayes classifier, Latent Dirichlet Allocation, clustering algorithms, penalized regression, and Support Vector Machines.
Language of instruction
English
Other information
Some lectures will be taught by experts in the big data field.
Students enrolled in the Inclusive Learning Modalities programme (“Modalità didattiche inclusive”) are requested to contact the Professor and the Degree Course Coordinator in order to assess specific needs and define targeted support actions.