Information Retrieval

Course Title: Information Retrieval

Course No: CSC423

Nature of the Course: Theory + Lab

Semester: 7

Full Marks: 60 + 20 + 20

Pass Marks: 24 + 8 + 8

Credit Hours: 3

Course Description

This course familiarizes students with the core concepts and techniques of information retrieval, focusing mainly on clustering, classification, search engines, ranking, and query operations.

Course Objectives

The main objective of this course is to provide knowledge of different information retrieval techniques so that students will be able to develop an information retrieval engine.

Course Contents

1. Introduction to IR and Web Search

2 hrs

1.1. IR Fundamentals

Introduction

Data vs Information Retrieval

Logical view of the documents

Architecture of IR System

1.2. Web Search and History

Web search system

History of IR

Related areas

2. Text properties, operations and preprocessing

5 hrs

2.1. Text Preprocessing Techniques

Tokenization

Text Normalization

Stop-word removal

Morphological Analysis

Word Stemming (Porter Algorithm)

Case folding

Lemmatization

2.2. Word Statistics and Indexing

Word statistics (Zipf's law, Heaps' Law)

Index term selection

Inverted indices

Positional Inverted index

2.3. Natural Language Processing in IR

Natural Language Processing in Information Retrieval

Basic NLP tasks – POS tagging

Shallow parsing

3. Basic IR Models

5 hrs

3.1. Boolean and Vector Models

Classes of Retrieval Model

Boolean model

Term weighting mechanism – TF, IDF, TF-IDF weighting

Cosine Similarity

Vector space model

3.2. Probabilistic and Other Models

Probabilistic models (the binary independence model)

Language models

KL-divergence

Smoothing

Non-Overlapping Lists

Proximal Nodes Model

4. Evaluation of IR

2 hrs

4.1. Evaluation Metrics

Precision

Recall

F-Measure

MAP (Mean Average Precision)

DCG (Discounted Cumulative Gain)

Known-item Search Evaluation

5. Query Operations and Languages

4 hrs

5.1. Query Enhancement Techniques

Relevance feedback and pseudo relevance feedback

Query expansion (with a thesaurus or WordNet and correlation matrix)

Spelling correction (Edit distance, k-gram indexes, Context-sensitive spelling correction)

5.2. Query Languages

Single-Word Queries

Context Queries

Boolean Queries

Structural Query

Natural Language

6. Web Search

6 hrs

6.1. Search Engines and Spidering

Search engines (working principle)

Spidering (Structure of a spider, Simple spidering algorithm, multithreaded spidering, Bot)

Directed spidering (Topic directed, Link directed)

6.2. Crawlers and Link Analysis

Crawlers (Basic crawler architecture)

Link analysis (HITS, PageRank)

Query log analysis

6.3. Advanced Web Search Topics

Handling the "invisible" Web

Snippet generation

CLIR (Cross Language Information Retrieval)

7. Text Categorization

4 hrs

7.1. Categorization Fundamentals

Categorization

Learning for Categorization

General learning issues

7.2. Learning Algorithms

Bayesian (naïve)

Decision tree

Rocchio

8. Text Clustering

4 hrs

8.1. Clustering Fundamentals

Clustering

8.2. Clustering Algorithms

Hierarchical clustering

k-means

k-medoid

Expectation maximization (EM)

Text shingling

9. Recommender System

3 hrs

9.1. Recommendation Techniques

Personalization

Collaborative filtering recommendation

Content-based recommendation

10. Question Answering

5 hrs

10.1. Information Extraction

Information bottleneck

Information Extraction

Ambiguities in IE

10.2. QA System Architecture

Architecture of QA system

Question processing

Paragraph retrieval

Answer processing

11. Advanced IR Models

5 hrs

11.1. Latent Semantic Analysis

Latent Semantic Indexing (LSI)

Singular value decomposition

Latent Dirichlet Allocation

11.2. String Searching and Pattern Matching

Efficient string searching

Knuth-Morris-Pratt

Boyer-Moore family

Pattern matching

Laboratory Works

The laboratory work should cover all the topics mentioned in the course. It should include implementations of the Boolean Retrieval Model, the Vector Space Model, tokenization, similarity measurement, the Porter stemmer, a web spider, text categorization, and a recommender system.

1. Program to demonstrate the Boolean Retrieval Model and the Vector Space Model
2. Tokenize the words of large documents, distinguishing types from tokens
3. Program to find the similarity between documents
4. Implement the Porter stemmer
5. Build a spider that follows only links to Nepali documents
6. Group online news into different categories such as sports, entertainment, and politics
7. Build a recommender system for an online music store
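
As a starting point for lab works 1 and 3, the following is a minimal sketch rather than a prescribed solution. It assumes a tiny in-memory collection (the hypothetical DOCS dictionary), a simple regular-expression tokenizer, Boolean AND queries only, and raw term frequency multiplied by log(N/df) as the TF-IDF weight. All function names are illustrative and do not come from the syllabus or the textbooks.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "information retrieval with inverted index",
    2: "web search engine and ranking",
    3: "ranking documents in information retrieval",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Boolean model: return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def tfidf_vectors(docs):
    """Vector space model: sparse TF-IDF vector (term -> weight) per document."""
    n_docs = len(docs)
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        vectors[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(sorted(boolean_and(index, "information retrieval")))
    vectors = tfidf_vectors(DOCS)
    print(cosine_similarity(vectors[1], vectors[3]))
```

A fuller lab solution would likely add OR and NOT operators, stop-word removal, stemming, and query-to-document ranking. These are left out here to keep the sketch short.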

Text Books

1. Modern Information Retrieval, Ricardo Baeza-Yates, Berthier Ribeiro-Neto
2. Information Retrieval: Data Structures & Algorithms, Bill Frakes

Notes:

This syllabus follows the official B.Sc. CSIT curriculum of Tribhuvan University. In case of any doubt or revision, the university's published syllabus shall be considered authoritative.
