COMPUTATIONAL LINGUISTICS MOD. 2

Academic year
2020/2021
Official course title
COMPUTATIONAL LINGUISTICS MOD. 2
Course code
LMJ070 (AF:318582 AR:175734)
Modality
On campus classes
ECTS credits
6 out of 12 of COMPUTATIONAL LINGUISTICS
Degree level
Master's Degree Programme (DM270)
Educational sector code
L-LIN/01
Period
1st Semester
Course year
2
This course is part of the Language Sciences, Language and Cognition and English Linguistics curricula of the Master's Degree in Language Sciences, and of the English and American Literary and Cultural Studies curricula of the Master's Degree in European, American and Postcolonial Languages and Literatures. It aims to give students hands-on training in the basic techniques for the computational annotation and analysis of written text.

The main goals of this course are:

- to provide students with the basic technical tools for the computational analysis of textual data
- to introduce the student to the Python programming language
- to strengthen the student's ability to reflect on the properties of language
- to stimulate critical thinking and the ability to think outside the box
1. Knowledge and understanding
- familiarity with the Python programming language and with the NLTK package
- ability to design and implement simple algorithms
- familiarity with the main distributional semantics approaches
- learning of the basic techniques for the extraction of linguistic knowledge from corpora
- knowledge of the principal levels of linguistic annotation

2. Applying knowledge and understanding
- knowledge of the features and limitations of the most common computational linguistics tools and approaches, so as to be able to pick the most appropriate solution for a given linguistic research issue
- use of Python for the implementation of scripts for the quantitative and computational analysis of text
- ability to advance and test original and sound hypotheses (relevant for the non-attending students only)

3. Making judgements
- ability to implement self-development strategies to improve technical skills
- awareness of the technical and ethical issues connected to the automatic treatment of language
- ability to retrieve the most relevant literature and to use it critically (relevant for the non-attending students only)
- ability to select a suitable theoretical framework to answer a research question of interest (relevant for the non-attending students only)
- ability to compare competing hypotheses (relevant for the non-attending students only)

4. Communication skills
- ability to write a report describing the process, progress and results of an original piece of scientific research (relevant for the non-attending students only)
- ability to interact with researchers from different scientific backgrounds (including computational linguists and cognitive scientists)
- ability to interact with the other students and the professor

5. Learning skills
- ability to learn new scripting languages (including R, Perl, MATLAB, JavaScript)
- ability to acquire technical knowledge pertaining to issues only indirectly linked to the automatic treatment of language (e.g. statistical analysis, the creation of web pages, the management of a database)
- ability to learn novel technical tools for the automatic treatment of language (e.g. annotation tools, corpora management and query tools)
Basic notions of general linguistics (morphology and syntax)

Basic mathematics skills

Basic familiarity with computers, but no special experience with programming or software is expected
1. Intro to Python Programming / The Jupyter Notebook
2. Python programming basics
3. Strings
4. Functions
5. Lists
6. Working with Files
7. Dictionaries, sets and more
8. Regular expressions
9. Searching Text With Python
10. Modules and packages
11. NLTK's corpora: An introduction
12. Working with Tagged Corpora
13. Text (pre-)processing using NLTK
14. Measuring the Association Between Words
15. Recap: Introduction to Natural Language Processing with Python
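As an illustration of the kind of script that topics such as "Measuring the Association Between Words" lead up to, the following is a minimal sketch that scores adjacent word pairs with pointwise mutual information (PMI). It is not taken from the course notebooks: the function name `pmi_scores`, the toy sentence and the choice of PMI as the association measure are all assumptions made for this example, and it uses only the Python standard library rather than NLTK.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return a dict mapping each adjacent word pair to its PMI (log base 2).

    Hypothetical helper written for illustration only.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Relative frequencies used as simple probability estimates.
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy text invented for this sketch (not course material).
text = "the cat sat on the mat and the cat slept"
tokens = text.lower().split()
scores = pmi_scores(tokens)

# Print the three most strongly associated pairs.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for pair, score in ranked[:3]:
    print(pair, round(score, 2))
```

On real corpora, PMI tends to overrate pairs of rare words, so a frequency threshold or a different association measure would normally be applied; Evert (2009) in the supplementary readings discusses the alternatives.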
Even if the Jupyter notebooks available on the university e-learning platform are mostly self-contained, the following background readings will provide the student with an in-depth explanation of the key concepts of the course:

MANDATORY READINGS:

- S. Bird, E. Klein and E. Loper (2016) Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, Updated 1st edition, O’Reilly (ch. 2.1, 2.2, 3.2, 3.4, 3.5, 4.6-4.8, 5.1-5.4, 8). Available online at: https://www.nltk.org/book/
- A. B. Downey (2015) Think Python: How to Think Like a Computer Scientist, 2nd edition, O’Reilly (ch. 1, 2, 3, 5, 10, 11.1-11.5, 12.1-12.3, 14.1-14.4). Available online at: https://www.greenteapress.com/thinkpython/thinkpython.html
- D. Jurafsky and J. H. Martin (2008/2019) Speech and Language Processing, 2nd or 3rd edition (ch. 2.1). The draft version of the relevant chapter from the 3rd edition is available online at: https://web.stanford.edu/~jurafsky/slp3/2.pdf

(SUGGESTED) SUPPLEMENTARY READINGS:

- S. Evert (2009) Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus linguistics: An international handbook, Vol. 2, Mouton de Gruyter: 1212-1248 (sections 1-4). The extended version is available online at: http://www.stefan-evert.de/PUB/Evert2007HSK_extended_manuscript.pdf
Different assessment methods will be applied to the evaluation of attending and non-attending students.

ATTENDING STUDENTS

Students attending at least 70% of the classes qualify as "attending" students. Their learning is assessed through three sets of exercises, one assigned every 4 to 5 weeks. Each assignment must be submitted electronically by the due date.

The final grade will be calculated as follows:

- first assignment: 25% of the final grade
- second assignment: 35% of the final grade
- third assignment: 30% of the final grade
- in-class participation: in laboratory sessions, short programming exercises will be given as homework and briefly discussed during the following lab. Prior to the beginning of each lab session, all students are required to submit the exercises assigned in the previous session. Students who have not attempted at least 50% of the exercises will be penalized at a rate of 2% of the maximum final grade for each notebook that is either insufficient or not submitted on time (up to a maximum of 10% of the maximum final grade).

NON-ATTENDING STUDENTS

Non-attending students are required to carry out a programming project, which must be described in detail in a written report and discussed face to face with the instructor during the oral exam. The aim of the project is to build an automatically annotated corpus and to use Python to extract the linguistic information needed to perform an innovative quantitative linguistic analysis. Note that the specific topic of the project must be agreed upon with the instructor in advance. The final report must be submitted electronically at least one week prior to the exam.

The project will be graded as follows:

- quality of the code: 40% of the final grade
- knowledge of the relevant literature and of the state-of-the-art: 30% of the final grade
- quality of the report: 20% of the final grade
- one-on-one discussion with the instructor: 10% of the final grade
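The weighting above can be sketched as a simple weighted sum. This is only an illustration under stated assumptions: the syllabus does not say how component scores are combined beyond the percentages, so the linear combination, the 0-30 scale of the example scores, and the names `WEIGHTS` and `project_grade` are all hypothetical.

```python
# Weights copied from the list above; they sum to 1.0.
WEIGHTS = {
    "code": 0.40,
    "literature": 0.30,
    "report": 0.20,
    "discussion": 0.10,
}


def project_grade(component_scores):
    """Combine component scores (all on the same scale) into one grade.

    Hypothetical helper: assumes a plain weighted sum.
    """
    if set(component_scores) != set(WEIGHTS):
        raise ValueError("expected exactly these components: " + ", ".join(WEIGHTS))
    return sum(WEIGHTS[name] * score for name, score in component_scores.items())


# Invented example scores, assumed to be on a 0-30 scale.
example = {"code": 28, "literature": 26, "report": 30, "discussion": 27}
print(round(project_grade(example), 1))
```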
Lab sessions are structured as follows:

- discussion of some programming exercises from the past homework
- overview of the session's key concepts and principles
- work on the programming exercises in the relevant Jupyter notebook available on the university e-learning platform
English
written and oral
Definitive programme.
Last update of the programme: 29/04/2020