Information Retrieval
Course Title: Information Retrieval
Course No: CSC423
Nature of the Course: Theory + Lab
Semester: 7
Full Marks: 60 + 20 + 20
Pass Marks: 24 + 8 + 8
Credit Hours: 3
Course Description
Course Objectives
Course Contents
1.1. IR Fundamentals
- Introduction
- Data vs Information Retrieval
- Logical view of the documents
- Architecture of IR System
1.2. Web Search and History
- Web search system
- History of IR
- Related areas
2.1. Text Preprocessing Techniques
- Tokenization
- Text Normalization
- Stop-word removal
- Morphological Analysis
- Word Stemming (Porter Algorithm)
- Case folding
- Lemmatization
2.2. Word Statistics and Indexing
- Word statistics (Zipf's law, Heaps' Law)
- Index term selection
- Inverted indices
- Positional Inverted index
2.3. Natural Language Processing in IR
- Natural Language Processing in Information Retrieval
- Basic NLP tasks – POS tagging
- Shallow parsing
3. Basic IR Models
5 hrs
3.1. Boolean and Vector Models
- Classes of Retrieval Model
- Boolean model
- Term weighting mechanism – TF, IDF, TF-IDF weighting
- Cosine Similarity
- Vector space model
3.2. Probabilistic and Other Models
- Probabilistic models (the binary independence model)
- Language models
- KL-divergence
- Smoothing
- Non-Overlapping Lists
- Proximal Nodes Model
4. Evaluation of IR
2 hrs
4.1. Evaluation Metrics
- Precision
- Recall
- F-Measure
- MAP (Mean Average Precision)
- DCG (Discounted Cumulative Gain)
- Known-item Search Evaluation
5.1. Query Enhancement Techniques
- Relevance feedback and pseudo relevance feedback
- Query expansion (with a thesaurus or WordNet and correlation matrix)
- Spelling correction (edit distance, k-gram indexes, context-sensitive spelling correction)
5.2. Query Languages
- Single-Word Queries
- Context Queries
- Boolean Queries
- Structural Query
- Natural Language
6. Web Search
6 hrs
6.1. Search Engines and Spidering
- Search engines (working principle)
- Spidering (Structure of a spider, Simple spidering algorithm, multithreaded spidering, Bot)
- Directed spidering (Topic directed, Link directed)
6.2. Crawlers and Link Analysis
- Crawlers (Basic crawler architecture)
- Link analysis (HITS, Page ranking)
- Query log analysis
6.3. Advanced Web Search Topics
- Handling "invisible" Web
- Snippet generation
- CLIR (Cross Language Information Retrieval)
7.1. Categorization Fundamentals
- Categorization
- Learning for Categorization
- General learning issues
7.2. Learning Algorithms
- Naïve Bayes
- Decision tree
- KNN
- Rocchio
8. Text Clustering
4 hrs
8.1. Clustering Fundamentals
- Clustering
8.2. Clustering Algorithms
- Hierarchical clustering
- k-means
- k-medoid
- Expectation maximization (EM)
- Text shingling
9.1. Recommendation Techniques
- Personalization
- Collaborative filtering recommendation
- Content-based recommendation
10.1. Information Extraction
- Information bottleneck
- Information Extraction
- Ambiguities in IE
10.2. QA System Architecture
- Architecture of QA system
- Question processing
- Paragraph retrieval
- Answer processing
11.1. Latent Semantic Analysis
- Latent Semantic Indexing (LSI)
- Singular value decomposition
- Latent Dirichlet Allocation
11.2. String Searching and Pattern Matching
- Efficient string searching
- Knuth-Morris-Pratt
- Boyer-Moore family
- Pattern matching
Laboratory Works
- 1. Program to demonstrate the Boolean Retrieval Model and the Vector Space Model
- 2. Tokenize the words of large documents according to type and token
- 3. Program to find the similarity between documents
- 4. Implement the Porter stemmer
- 5. Build a spider that tracks only the links of Nepali documents
- 6. Group online news into different categories such as sports, entertainment, and politics
- 7. Build a recommender system for an online music store
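As a possible starting point for laboratory tasks 1 and 3, the sketch below shows one way to combine a Boolean AND query over an inverted index with TF-IDF cosine similarity between documents. The toy corpus, the whitespace tokenizer, and all function names are assumptions made for illustration; the syllabus does not prescribe a data set, a weighting variant, or an implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; not part of the syllabus.
DOCS = {
    "d1": "information retrieval finds relevant documents",
    "d2": "boolean retrieval uses an inverted index",
    "d3": "vector space retrieval ranks documents by cosine similarity",
}


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop-word removal)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Boolean model: ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def tfidf_vectors(docs):
    """Vector space model: weight each term by tf * log(N / df)."""
    n_docs = len(docs)
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    return {
        doc_id: {
            term: tf * math.log(n_docs / df[term])
            for term, tf in Counter(toks).items()
        }
        for doc_id, toks in tokens.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(sorted(boolean_and(index, ["retrieval", "documents"])))
    vectors = tfidf_vectors(DOCS)
    print(cosine(vectors["d1"], vectors["d3"]))
```

In this corpus the term "retrieval" occurs in every document, so its IDF weight is zero and it contributes nothing to the cosine score; only rarer shared terms such as "documents" do.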
Text Books
- 1. Modern Information Retrieval, Ricardo Baeza-Yates, Berthier Ribeiro-Neto
- 2. Information Retrieval: Data Structures & Algorithms, Bill Frakes