The course introduces modern approaches to retrieving information from a collection of documents. It describes the architecture of modern search engines and information retrieval systems and highlights the issues a designer must face when designing and implementing them.
Course Prerequisites
The student should have a basic knowledge of Internet and Web architecture, be able to develop applications using object-oriented languages (preferably Java), and know how to implement simple data structures, such as stacks, queues, lists and trees.
Teaching Methods
The course includes lectures and a series of laboratory sessions in which students build an information retrieval project.
Assessment Methods
Written examination and project
Texts
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein, Introduction to Algorithms.
Contents
Advanced Data Structures for Information Retrieval (Linked List, Hash Table, Binary Tree, B-Tree, Binary Heap)
The Architecture of Modern Information Retrieval Systems
Dictionary and Posting List Management (Tokenization, Stemming, Porter's Algorithm, Linguistic Preprocessing)
Optimization Methods for Information Retrieval
Index Types (Biword Index, Positional Index, Permuterm Index, k-Gram Index, Soundex Index)
Data Structures for Dictionaries (Prefix Tree, Prefix Binary Tree, Prefix B-Tree)
Identification of Syntactic and Semantic Errors (Edit Distance, k-Gram Overlap, Jaccard Similarity Coefficient)
Index Construction Algorithms (Blocked Sort-Based Indexing Algorithm, Single-Pass In-Memory Indexing Algorithm, Distributed Indexing Algorithms, Dynamic Indexes)
Index Compression Techniques (Heaps' Law, Zipf's Law, Dictionary Compression, Postings File Compression, Gamma Code)
Identification of Duplicates (Fingerprint, Shingling, Signature, Min Hashing)
Document Ranking (Weighted Search, Inverse Document Frequency)
Document Representation in Vector Form (Bag of Words, Word Embedding, Document Embedding)
Document Similarity and Distance (Cosine Distance, Jaccard Distance, Edit Distance)
Word Embedding for Syntactic and Semantic Document Analysis, Sentiment Analysis, Text and Document Classification, Prediction of Next Words
Neural Networks for Word Embedding (Word2Vec, Continuous Bag of Words, Skip-Gram)
Neural Networks for Document Embedding (Doc2Vec, Distributed Memory Model of Paragraph Vectors, Paragraph Vector with a Distributed Bag of Words, FastText)
Solr
Image Retrieval Systems, Image Feature Extraction Techniques (Local Binary Patterns, Haar Wavelet Transform, Histogram of Oriented Gradients)
Document Databases and MongoDB
Course Language
English
More information
The teaching material will be available on the Kiro teaching page.