# KNOWLEDGE, INTERACTION AND INTELLIGENT SYSTEMS-1

2021/2022
Official course title
KNOWLEDGE, INTERACTION AND INTELLIGENT SYSTEMS-1
Course code
PHD157-1 (AF:364606 AR:193149)
Modality
On campus classes
ECTS credits
2
Degree level
Corso di Dottorato (D.M.45)
Educational sector code
INF/01
Period
Annual
Course year
1
Where
VENEZIA
Moodle
Go to Moodle page
The term "big data" indicates data so large that traditional algorithmic techniques cannot cope with neither storing nor analyzing it. Very often, however, the sheer size of such datasets is not proportional to their effective information content. Think about DNA sequencing: one human genome requires approximately 1 GB to be stored, and advanced techniques allow to sequence dozens of genomes in few days. Clearly, classic algorithmic techniques do not scale: for instance, all Italian genomes would require approximately 60 Petabytes just to be stored. Any two human genomes, however, are 99.99% similar: the key is compression. The real problem is: can we develop algorithms that operate directly on compressed data (without first decompressing it)?

The course tackles the problem of representing and manipulating big data through the use of compressed data structures. This research direction merges techniques from algorithms, data structures, and information theory in order to obtain structures able to, simultaneously, accelerate operations typical of information retrieval while occupying a space proportional to the compressed data (often, thousands of times smaller than the original datasets).
At the end of the course, the student will be able to:
1) understand the main lossless compression techniques used to represent unstructured (plain text) and structured (e.g. trees, graphs) data.
2) understand the relation between compression and computation, and how this can be exploited to accelerate operations on compressed data.
3) implement basic compressed data structures in C++ using the sdsl library.
- algorithms and data structures (e.g. sorting, hashing, binary search, big-O notation)
- probability theory
- basic knowledge of C/C++
1) Entropy, Shannon's theorem, prefix-free codes, compression
2) Bitvectors with rank/select/access, wavelet trees, geometric data structures
3) Compressed suffix arrays, FM-index, Burrows-Wheeler transform
4) Indexes based on Lempel-Ziv compressors
5) Practical compressed data structures: the sdsl library
- Navarro, Gonzalo. Compact data structures: A practical approach. Cambridge University Press, 2016.
- articoli di ricerca
One of the following:
- discussion of an existing article in the field
- implementation of a compressed index in C++
- original research (proposal/implementation of a new technique that may lead to an original publication)
Frontal lectures, slides, blackboard.
English
oral
Definitive programme.
Last update of the programme: 31/05/2021