Contribution of the course to the overall degree programme goals
This course is part of the educational activities of the Bachelor in Informatics.
The goal of this course is to enable students to understand and exploit predictive data analysis techniques, including both supervised methods (classification and regression) and unsupervised methods (clustering and recommendation), with a focus on web data (e.g., text documents, the web graph). The course includes the use of data mining software tools through the Python programming language.
Expected learning outcomes
The course discusses fundamental techniques for predictive and descriptive data analysis, with a focus on web data.
Students will achieve the following learning outcomes:
Knowledge and understanding: i) understanding the principles of unsupervised learning; ii) understanding the principles of supervised learning; iii) understanding the principles of web content mining.
Applying knowledge and understanding: i) being able to apply supervised and unsupervised analysis techniques; ii) being able to use data analysis software tools (e.g., scikit-learn).
Making judgements: i) being able to choose the most appropriate technique for a given problem and to evaluate its performance.
Communication: i) reporting a comprehensive comparative analysis of different data analysis methods.
Students should have achieved the learning outcomes of the courses "Programming", "Probability and Statistics", and "Linear Algebra" (even without passing the corresponding exams).
- Knowledge Discovery in Databases
- Similarity search in text:
  - Text processing: tokenization, stemming, lemmatization, stopwords
  - Similarity functions: Jaccard, Euclidean, cosine
  - Advanced similarity approximations: k-shingles, Locality-Sensitive Hashing, SimHash
- Web Mining - Recommender systems:
  - Content-based and collaborative filtering (user-based and item-based)
- Dimensionality Reduction:
  - Distance measures, curse of dimensionality, PCA
- Clustering:
  - k-means, k-medoids, hierarchical clustering, DBSCAN
  - Intrinsic and extrinsic evaluation
- Classification and Regression:
  - k-NN, Naive Bayes, Decision Trees
  - Ensemble methods: Bagging, Boosting, Random Forests
  - Bias and variance, overfitting and underfitting
  - Imbalanced data
  - Evaluation: accuracy measures, cross-validation
- Web Mining - Document Ranking:
  - Classification and regression for document ranking
  - Graph analysis with PageRank
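As an illustration of the "Similarity search in text" topic listed above, the following is a minimal sketch of tokenization, k-shingles, and the Jaccard and cosine similarity functions in plain Python. The helper names (`tokenize`, `shingles`, `jaccard`, `cosine`) and the two example sentences are assumptions chosen for illustration; they are not taken from the course material.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens, k=2):
    """Return the set of k-shingles (contiguous runs of k tokens)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two sets."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)


def cosine(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example documents.
doc_a = tokenize("the quick brown fox")
doc_b = tokenize("the quick red fox")

word_jaccard = jaccard(set(doc_a), set(doc_b))               # 3 shared / 5 total words
shingle_jaccard = jaccard(shingles(doc_a), shingles(doc_b))  # 1 shared / 5 total 2-shingles
tf_cosine = cosine(doc_a, doc_b)                             # dot = 3, both norms = 2

print(word_jaccard, shingle_jaccard, tf_cosine)
```

The shingle-based score is lower than the word-based one because shingles also capture word order, which is why they are a common starting point for near-duplicate detection with Locality-Sensitive Hashing.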
Liu, B. Web Data Mining. 2nd edition. Springer, 2011.
Selected readings provided during the course.
Learning outcomes are verified by a written exam and a project.
The written exam consists of questions and short exercises on the theory of the topics covered in the course. It evaluates the theoretical knowledge gained by the student.
The project requires the student to conduct a comparative analysis of different tools applied to a specific dataset or problem.
The student must choose and justify the most appropriate solution and deliver a report, to be discussed with the teacher. The project work evaluates the student's ability to apply the theoretical knowledge to a real-world case study.
Lessons include both theoretical and practical sessions.
Teaching material is delivered through the Moodle platform.
During the course, the Python programming language is used together with the scikit-learn library. Students are encouraged to bring their own laptops.
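To give an idea of the kind of practical work involved, the following is a minimal sketch of evaluating a k-NN classifier with 5-fold cross-validation in scikit-learn. The choice of the bundled Iris dataset, of `n_neighbors=5`, and of accuracy as the metric are assumptions made for illustration only; the datasets and settings actually used in the course may differ.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Load a small dataset bundled with scikit-learn (assumed here for illustration).
X, y = load_iris(return_X_y=True)

# k-NN classifier; k = 5 is an arbitrary illustrative choice.
model = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: each fold is held out once as the test set.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()

print("accuracy per fold:", scores)
print("mean accuracy:", mean_accuracy)
```

Replacing `KNeighborsClassifier` with another estimator (for example a decision tree or a random forest) while keeping the same cross-validation call is one simple way to set up the comparative analysis required by the project.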