DIGITAL TEXT ANALYSIS

Academic year
2020/2021
Official course title
LINGUISTICA COMPUTAZIONALE (DIGITAL TEXT ANALYSIS)
Course code
LM5480 (AF:318025 AR:176098)
Modality
On campus classes
ECTS credits
6
Degree level
Master's Degree Programme (DM270)
Educational sector code
L-LIN/01
Period
1st Semester
Course year
2
As part of the Philological-Editorial curriculum of the Master's Degree in Language Sciences, this course aims to provide students with hands-on training in the basic techniques for the computational annotation and analysis of written text.

The main goals of this course are:

- to provide students with the basic technical tools for the computational treatment of textual data
- to introduce students to the fundamental automated knowledge extraction techniques
- to introduce students to the Python programming language and to some of its modules, including NLTK, spaCy and gensim
- to stimulate critical thinking and the ability to think outside the box
1. Knowledge and understanding
- familiarity with the Python programming language and with some of its NLP/text mining packages (NLTK, spaCy, gensim)
- familiarity with the most commonly used techniques of (morphosyntactic) linguistic annotation
- learning of the basic techniques for the extraction of linguistic knowledge from corpora
- knowledge of the principal levels of linguistic annotation
- familiarity with the most commonly used techniques for the representation of structured information extracted from text

2. Applying knowledge and understanding
- knowledge of the features and limitations of the most common computational linguistics tools and approaches, so as to be able to pick the most appropriate solution for a given linguistic research issue
- use of Python for the implementation of scripts for the quantitative and computational analysis of text
- ability to advance and test original and sound hypotheses

3. Making judgements
- ability to implement self-development strategies to improve technical skills
- awareness of the technical and deontological issues connected to the automatic treatment of language
- ability to compare competing hypotheses

4. Communication skills
- ability to write a report describing the process, progress and results of original scientific research
- ability to interact with researchers with a different scientific background
- ability to interact with the other students and the professor

5. Learning skills
- ability to learn novel scripting languages (e.g. R, Perl, MATLAB, JavaScript, SQL)
- ability to acquire technical knowledge on issues only indirectly linked to the automatic treatment of language (e.g. statistical analysis, the creation of web pages, the management of a database)
- ability to learn novel technical tools for the automatic treatment of language (e.g. annotation tools, corpora management and query tools)
Basic notions of general linguistics

Basic familiarity with computers is assumed, but no previous experience with programming or specific software is expected
1. Intro to Python Programming
2. Python programming basics
3. Variable Types
4. Text manipulation in Python
5. Writing structured programs
6. Corpora: creation and manipulation
7. The automatic annotation of a corpus
8. Descriptive statistics for corpus linguistics / Zipf's law
9. Collocations and association measures
10. The use of databases in linguistics
11. The creation of textual databases
12. Mapping for digital humanities
13. Topic modeling
14. Principles of stylometry
15. Recap
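To give a flavour of the hands-on work in sessions 8 and 9, here is a minimal sketch of type/token counting and the rank-frequency pattern behind Zipf's law, using only the Python standard library (the toy corpus and variable names are illustrative, not taken from the course notebooks):

```python
# Illustrative only: frequency counting and the rank-frequency pattern
# behind Zipf's law, in the spirit of sessions 8-9. The toy corpus is
# invented for the example.
from collections import Counter

corpus = (
    "the cat sat on the mat and the dog sat on the rug "
    "the cat and the dog ran"
)
tokens = corpus.split()   # naive whitespace tokenisation
freqs = Counter(tokens)   # type/token counts

# Zipf's law: ranking the types by frequency, frequency is roughly
# inversely proportional to rank, so rank * frequency stays near-constant
# in large corpora (a toy corpus like this one only hints at the pattern).
for rank, (word, freq) in enumerate(freqs.most_common(5), start=1):
    print(rank, word, freq, rank * freq)
```

Real course exercises would use NLTK or spaCy for tokenisation rather than `str.split`, but the counting logic is the same.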
Although the Jupyter notebooks available on the university e-learning platform are mostly self-contained, the following background readings will provide the student with an in-depth explanation of the key concepts of the course:

- M. Baroni (2009) Distributions in text. In A. Lüdeling and M. Kytö (eds.), Corpus linguistics: An international handbook, Vol. 2, Mouton de Gruyter: 803-821. Available online at: http://sslmit.unibo.it/~baroni/publications/hsk_39_dist_rev2.pdf
- S. Bird, E. Klein and E. Loper (2016) Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, Updated 1st edition, O’Reilly (ch. 2.1, 2.2, 3.2, 3.4, 3.5, 4.6-4.8, 5.1-5.4, 8). Available online at: https://www.nltk.org/book/
- D.M. Blei (2012) Probabilistic topic models. Communications of the ACM, 55 (4): 77-84. Available online at: http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
- M. Davies (2015) Corpora: An introduction. In D. Biber and R. Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics, Cambridge University Press: 11-31.
- A. Dimitriadis and S. Musgrave (2009) Designing linguistic databases: A primer for linguists. In S. Musgrave, A. Dimitriadis, and M. Everaert (eds.), The use of databases in cross-linguistic studies, Mouton de Gruyter: 13-75. Available online at: https://pdfs.semanticscholar.org/621e/96de1ce52e2b62469afd2fa76853282207ec.pdf
- S. Evert (2009) Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus linguistics: An international handbook, Vol. 2, Mouton de Gruyter: 1212-1248 (sections 1-4). Available online at: http://www.stefan-evert.de/PUB/Evert2007HSK_extended_manuscript.pdf
- S.T. Gries and A. L. Berez (2017) Linguistic Annotation in/for Corpus Linguistics. In N. Ide and J. Pustejovsky (eds.), Handbook of Linguistic Annotation, Springer: 379-409. Available online at: http://www.stgries.info/research/2017_STG-ALB_LingAnnotCorpLing_HbOfLingAnnot.pdf
- S.T. Gries and J. Newman (2010) Creating And Using Corpora. In R. J. Podesva and D. Sharma (eds.), Research Methods in Linguistics, Cambridge University Press: 257-287. Available online at: http://www.stgries.info/research/2013_STG-JN_CreatingUsingCorpora_ResMethLing.pdf
- S. Jänicke, G. Franzini, M.F. Cheema and G. Scheuermann (2015). On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges. In EuroVis (STARs): 83-103. Available online at: https://www.informatik.uni-leipzig.de/~stjaenicke/Survey.pdf
- T. Neal, K. Sundararajan, A. Fatima, Y. Yan, Y. Xiang and D. Woodard (2017) Surveying stylometry techniques and applications. ACM Computing Surveys (CSUR), 50 (6): 1-36. Available online at: https://dl.acm.org/doi/abs/10.1145/3132039
- P. Svensson (2010) The Landscape of Digital Humanities. Digital Humanities Quarterly, 4 (1): 1–31. Available online at: http://digitalhumanities.org/dhq/vol/4/1/000080/000080.html
Learning assessment is based on a series of programming exercises.

PROGRAMMING EXERCISES (ATTENDING STUDENTS)

Students attending at least 70% of the classes qualify as "attending" students. Their learning is assessed through three sets of exercises, assigned roughly every 4-5 weeks. Each assignment must be submitted electronically by the due date.

The partial grade is calculated as follows:

- first assignment: 20% of the partial grade
- second assignment: 40% of the partial grade
- third assignment: 30% of the partial grade
- in-class participation: before each lab session, students are required to complete short programming exercises as homework, which are then briefly discussed during the lab session. The exercises must be submitted before the beginning of each lab session. Students should solve at least 50% of the exercises; otherwise they incur a penalty of 1% of the partial grade for each notebook that is either insufficient or not submitted on time (up to a maximum of 10% of the partial grade).
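The scheme above can be summed up in a small sketch (the function name and the 0-100 score scale are my assumptions, not the course's; note that the listed assignment weights cover 90% of the partial grade, and the homework penalty is subtracted from that base):

```python
# Hypothetical sketch of the partial-grade computation described above.
# Scores are assumed to be on a 0-100 scale; the three assignment weights
# (20%, 40%, 30%) are taken from the syllabus, and each insufficient or
# late homework notebook costs 1 percentage point, capped at 10.
def partial_grade(a1, a2, a3, bad_notebooks):
    base = 0.20 * a1 + 0.40 * a2 + 0.30 * a3
    penalty = min(bad_notebooks, 10)  # 1% per notebook, max 10%
    return max(base - penalty, 0)
```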

PROGRAMMING EXERCISES (NON-ATTENDING STUDENTS)

Non-attending students are evaluated through a single set of programming exercises published on the university e-learning platform. Students are required to submit their solutions electronically by the due date.
Lab sessions are structured as follows:

- overview of the session key concepts and principles
- work on the programming exercises in the relevant Jupyter notebook available on the university e-learning platform
Italian
written and oral
Definitive programme.
Last update of the programme: 05/10/2020