DATA AND WEB MINING

Academic year
2020/2021
Official course title
DATA AND WEB MINING
Course code
CT0509 (AF:337525 AR:178728)
Modality
On campus classes
ECTS credits
6
Degree level
Bachelor's Degree Programme
Educational sector code
INF/01
Period
1st Semester
Course year
3
This course is part of the educational activities of the Bachelor in Informatics.
The goal of this course is to enable students to understand and exploit predictive data analysis techniques, covering both supervised methods (classification and regression) and unsupervised methods (clustering and recommendation), and their application to web data (e.g., text documents, the web graph). The course includes the use of data mining software tools through the Python programming language.
The course discusses fundamental techniques for predictive and descriptive data analysis, with a focus on web data.

Students will achieve the following learning outcomes:

Knowledge and understanding: i) understanding the principles of unsupervised learning; ii) understanding the principles of supervised learning; iii) understanding the principles of web content mining.

Applying knowledge and understanding: i) being able to apply supervised and unsupervised analysis techniques; ii) being able to use data analysis software tools (e.g., scikit-learn).

Making judgements: i) being able to choose the most appropriate method for a given problem and to evaluate its performance.

Communication: i) reporting a comprehensive comparative analysis of different data analysis methods.
Students should have achieved the learning outcomes of courses "Programming", "Probability and Statistics", "Linear Algebra"
(even without passing the corresponding exams).
- Knowledge Discovery in Databases
- Similarity search in text:
  - Text processing: tokenization, stemming, lemmatization, stopwords
  - Similarity functions: Jaccard, Euclidean, Cosine
  - Advanced Similarity approximations: k-shingles, min-hashing
  - Advanced Similarity approximations: Locality-Sensitive Hashing, Sim-Hashing
- Web Mining - Recommender systems:
  - Content-based, Collaborative Filtering, user-based and item-based
- Dimensionality Reduction:
  - Distance measures, curse of dimensionality, PCA
- Clustering:
  - k-means, k-medoids, Hierarchical, DBSCAN
  - Intrinsic and extrinsic Evaluation
- Classification and Regression:
  - k-NN, Decision Trees
  - Bias and Variance, overfitting and underfitting
  - Ensemble methods: Bagging, Boosting, Random Forests
  - Random Forests for feature selection, outlier detection
  - Imbalanced data
  - Evaluation: accuracy measures, cross-validation
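
As an informal illustration of the similarity functions listed under similarity search in text, the following is a minimal sketch in plain Python of Jaccard similarity over character k-shingles and cosine similarity over term counts. The function names, the example documents and the choice of k = 3 are assumptions made for illustration only and are not taken from the course material.

```python
import math
from collections import Counter


def shingles(text, k=3):
    """Return the set of character k-shingles of a lowercased text
    whose whitespace has been collapsed to single spaces."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets.
    By convention here, two empty sets have similarity 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts,
    using naive whitespace tokenization (no stemming or stopword removal)."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example documents.
doc_a = "data mining on the web"
doc_b = "web data mining"

print("Jaccard on 3-shingles:", jaccard(shingles(doc_a), shingles(doc_b)))
print("Cosine on term counts:", cosine(doc_a, doc_b))
```

Min-hashing and Locality-Sensitive Hashing, also listed above, approximate the Jaccard value computed here without comparing every pair of shingle sets exactly.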
Lecture notes. Selected readings provided during the course.

- Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber, Jian Pei. Morgan Kaufmann/Elsevier. Third Edition. 2012.
- Web Data Mining. Bing Liu. Springer. Second Edition. 2011.
Learning outcomes are verified by a written exam and a project.

The written exam consists of questions and short exercises on the theory of the topics discussed during the course. It evaluates the theoretical knowledge gained by the student.

The project requires the student to conduct a comparative analysis of different tools applied to a specific dataset or problem.
The student must choose and justify the most appropriate solution and deliver a report, to be discussed with the teacher. The project work evaluates the student's ability to apply the theoretical knowledge to a real-world case study.
Lessons include both theoretical and practical sessions.
Teaching material is delivered through the Moodle platform.
During the course, the Python programming language is used together with the scikit-learn library. Students are encouraged to bring their own laptops.
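
To give a rough idea of what working with scikit-learn looks like, the sketch below compares two classifiers with 5-fold cross-validation. It assumes scikit-learn is installed; the Iris dataset, the two models and their hyperparameters are illustrative assumptions and do not reflect the actual course exercises or project.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset, chosen here only for illustration.
X, y = load_iris(return_X_y=True)

# Two candidate models with arbitrarily chosen hyperparameters.
models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.3f}")
```

A comparison of this kind reports the same evaluation measure for every method on the same folds, so that the differences between methods can be discussed on equal terms.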
Italian
written and oral
Definitive programme.
Last update of the programme: 27/04/2020