This project implements a Boolean Information Retrieval Model for a collection of computer science journal abstracts. It focuses on building:
- Inverted Index: Maps terms to the documents containing them.
- Positional Index: Tracks term positions within documents to support proximity queries.
The system supports Boolean queries with up to three terms connected by AND, OR, and NOT operators, and proximity queries to find documents where two terms appear within k words of each other.
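As a rough illustration of the two structures described above, the sketch below builds both indexes from already-preprocessed token lists. The function name `build_indexes` and the sample documents are hypothetical and not taken from the project's code; the real script may organise its indexes differently.

```python
from collections import defaultdict

def build_indexes(docs):
    """Build an inverted index and a positional index.

    docs maps a document ID to its list of (already preprocessed) tokens.
    """
    inverted = defaultdict(set)                          # term -> {doc_id, ...}
    positional = defaultdict(lambda: defaultdict(list))  # term -> {doc_id: [positions]}
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            inverted[term].add(doc_id)
            positional[term][doc_id].append(pos)
    return inverted, positional

# Hypothetical sample data: two tiny documents of stemmed tokens.
docs = {
    "1": ["inform", "retriev", "model"],
    "2": ["boolean", "retriev", "boolean", "queri"],
}
inverted, positional = build_indexes(docs)
print(sorted(inverted["retriev"]))   # ['1', '2']
print(dict(positional["boolean"]))   # {'2': [0, 2]}
```

The inverted index is enough for AND, OR, and NOT, since those only need document sets; the positional index additionally keeps word offsets, which proximity queries require.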
- Abstracts.zip: Contains 448 abstracts in English (each file is a unique document).
- Stopword-List.txt: List of stop words to filter during preprocessing.
- Gold Standard: A set of 10 queries to evaluate correctness.
- Preprocessing pipeline with tokenization, stop-word removal, and Porter stemming.
- Construction and saving/loading of both inverted and positional indexes.
- Query parser and executor for:
  - Boolean queries (`AND`, `OR`, `NOT`)
  - Proximity queries (e.g., `"word1 word2 /k"` for terms within k words)
- Simple and intuitive GUI built with `customtkinter` to enter queries and display results.
- Measures and displays query execution time.
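The preprocessing step listed above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function name `preprocess` is hypothetical, and a regular expression stands in for the NLTK tokenizer so the example runs without extra downloads. The stemmer is passed in as a parameter; the comment shows how NLTK's `PorterStemmer` would be plugged in.

```python
import re

def preprocess(text, stop_words, stem=lambda token: token):
    """Case-fold, tokenize, drop stop words, then stem the remaining tokens."""
    # Simple stand-in tokenizer (assumption): runs of lowercase letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in stop_words]

# To apply Porter stemming, pass NLTK's stemmer instead of the identity default:
#   from nltk.stem import PorterStemmer
#   preprocess(text, stop_words, stem=PorterStemmer().stem)

stop_words = {"the", "of", "a", "is"}   # hypothetical subset of a stop-word list
print(preprocess("The Retrieval of a Document is fast.", stop_words))
# ['retrieval', 'document', 'fast']
```

Queries should go through the same function as the documents, so that a query term is reduced to the same form as the indexed term.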
- Python 3.7+
- Packages:
  - `nltk` (for tokenization and stemming)
  - `customtkinter` (for GUI)
  - `Pillow` (for image handling)
  - `chardet` (for encoding detection)

Install dependencies via:

```bash
pip install nltk customtkinter pillow chardet
```

Make sure to download the NLTK tokenizer models:

```python
import nltk
nltk.download('punkt')
```

- Prepare your data:
  - Extract `Abstracts.zip` to a folder.
  - Place `Stopword-List.txt` in the same directory as the script.
  - Ensure background images (`Artboard 2.png` and `SEO analytics team-amico (2).png`) are available in the working directory for the GUI.
- Run the script:

  ```bash
  python birs.py
  ```

- Using the GUI:
  - Enter a Boolean Query using up to three terms with `AND`, `OR`, and `NOT` (e.g., `machine AND learning NOT neural`).
  - Or enter a Proximity Query in the format `"word1 word2 /k"` (e.g., `data science /3`).
  - Click the corresponding Search button.
  - Results show the list of document IDs matching the query and the query execution time.
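The execution time shown with the results can be measured by wrapping the query call in a timer. The sketch below is one possible way to do it with `time.perf_counter`; the helper `timed` and the stand-in `lookup` function are hypothetical and only illustrate the idea.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Hypothetical stand-in for a query function over an inverted index.
def lookup(index, term):
    return sorted(index.get(term, set()))

index = {"learn": {"12", "7"}}
result, elapsed_ms = timed(lookup, index, "learn")
print(result, f"({elapsed_ms:.3f} ms)")
```

`perf_counter` is a monotonic, high-resolution clock, which makes it a better fit for short intervals than `time.time`.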
- Preprocessing: Tokenizes text, applies stemming and stop-word removal.
- Indexing: Builds inverted and positional indexes from preprocessed documents.
- Query Processing:
  - `boolean_query()`: Evaluates Boolean expressions over the inverted index.
  - `proximity_query()`: Checks word proximity using the positional index.
- GUI: Built with `customtkinter` and `tkinter.Canvas` to provide a user-friendly interface.
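To make the Boolean evaluation step more concrete, here is a minimal sketch of how a flat query could be evaluated with set operations. It is not the project's `boolean_query()`. The name `boolean_query_sketch` and its parameters are hypothetical, and it rests on several assumptions: operators are upper-case, evaluation is strictly left to right with no precedence or parentheses, `NOT` complements the following term against the set of all document IDs, and a missing connective (as in `learning NOT neural`) is read as `AND`.

```python
def boolean_query_sketch(query, inverted, all_docs, normalize=str.lower):
    """Evaluate a flat Boolean query left to right (no precedence, no parentheses)."""
    result = None    # running set of matching document IDs
    op = "AND"       # connective applied to the next operand
    negate = False   # True when the next operand is preceded by NOT
    for token in query.split():
        if token in ("AND", "OR"):
            op = token
        elif token == "NOT":
            negate = True
        else:
            # Copy the posting set so the index itself is never modified.
            docs = set(inverted.get(normalize(token), set()))
            if negate:
                docs = all_docs - docs
            if result is None:
                result = docs
            elif op == "AND":
                result &= docs
            else:
                result |= docs
            op, negate = "AND", False   # assumption: a missing connective means AND
    return sorted(result or [])

# Hypothetical index over four documents.
inverted = {"machine": {"1", "2", "3"}, "learning": {"2", "3"}, "neural": {"3"}}
all_docs = {"1", "2", "3", "4"}
print(boolean_query_sketch("machine AND learning NOT neural", inverted, all_docs))  # ['2']
print(boolean_query_sketch("machine OR neural", inverted, all_docs))                # ['1', '2', '3']
```

In a full pipeline, `normalize` would be the same preprocessing applied to the documents (case folding plus stemming), so that query terms match the indexed forms.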
- Each document is identified by its filename, which serves as the document ID.
- Stop-word removal and stemming keep the index smaller and let a query match different inflected forms of the same word.
- Proximity queries require correct formatting: two terms followed by `/k`, where `k` is an integer.
- The code supports case folding and simple tokenization but does not handle phrases longer than two words for proximity queries.
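The proximity format and check described in these notes can be sketched as follows. This is an illustrative example, not the project's `proximity_query()`. The names `parse_proximity` and `proximity_query_sketch` are hypothetical, and "within k words" is assumed here to mean that the two positions differ by at most `k`, in either order; the actual script may count the distance differently.

```python
import re

def parse_proximity(query):
    """Split 'word1 word2 /k' into (word1, word2, k); raise ValueError otherwise."""
    match = re.fullmatch(r"\s*(\S+)\s+(\S+)\s*/\s*(\d+)\s*", query)
    if match is None:
        raise ValueError("expected format: word1 word2 /k")
    return match.group(1), match.group(2), int(match.group(3))

def proximity_query_sketch(query, positional):
    """Return IDs of documents where both terms occur at most k positions apart."""
    w1, w2, k = parse_proximity(query)
    postings1 = positional.get(w1, {})
    postings2 = positional.get(w2, {})
    matches = []
    # Only documents containing both terms can match.
    for doc_id in postings1.keys() & postings2.keys():
        if any(abs(p1 - p2) <= k
               for p1 in postings1[doc_id]
               for p2 in postings2[doc_id]):
            matches.append(doc_id)
    return sorted(matches)

# Hypothetical positional index: term -> {doc_id: [positions]}.
positional = {
    "data": {"1": [0, 9], "2": [4]},
    "science": {"1": [2], "2": [10]},
}
print(proximity_query_sketch("data science /3", positional))  # ['1']
print(proximity_query_sketch("data science /6", positional))  # ['1', '2']
```

The nested position comparison is quadratic in the number of occurrences per document, which is acceptable for short abstracts; a merge over the two sorted position lists would scale better for long documents.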