This repository contains a comprehensive solution to a series of data preprocessing, tokenization, and feature selection tasks using Python. The project is divided into three parts, each focusing on different aspects of data science and machine learning.
- Handling Missing Values:
  - Implemented imputation techniques to handle missing values in the dataset.
- Normality Tests:
  - Applied normality tests to numerical columns, clearly stating hypotheses and commenting on the results.
- Encoding and Scaling:
  - Encoded categorical variables and scaled numerical features to prepare the data for further analysis.
- The dataset `noisy_data.csv` is included in this repository.
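The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the column names (`age`, `income`, `city`) and the synthetic data are hypothetical stand-ins for `noisy_data.csv`, and the specific choices (median/mode imputation, the Shapiro-Wilk test from SciPy, one-hot encoding, standard scaling) are assumptions, since the text does not say which techniques were used.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for noisy_data.csv; the real columns are not listed in this README.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 50),
    "income": rng.exponential(30000, 50),
    "city": rng.choice(["A", "B", "C"], 50),
})
df.loc[[3, 17], "age"] = np.nan    # inject missing numeric values
df.loc[[5, 20], "city"] = np.nan   # inject missing categorical values

num_cols = ["age", "income"]
cat_cols = ["city"]

# 1. Imputation (assumed strategy): median for numeric, most frequent value for categorical.
for col in num_cols:
    df[col] = df[col].fillna(df[col].median())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# 2. Normality test (Shapiro-Wilk, an assumed choice).
#    H0: the column is drawn from a normal distribution.
#    H1: the column is not drawn from a normal distribution.
alpha = 0.05
normality = {}
for col in num_cols:
    stat, p_value = stats.shapiro(df[col])
    normality[col] = p_value
    verdict = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"{col}: W={stat:.3f}, p={p_value:.4f} -> {verdict}")

# 3. Encoding and scaling: one-hot encode categoricals, standardize numerics.
encoded = pd.get_dummies(df, columns=cat_cols)
encoded[num_cols] = StandardScaler().fit_transform(encoded[num_cols])
print(encoded.head())
```

To run this against the real file, replacing the synthetic frame with `pd.read_csv("noisy_data.csv")` and adjusting `num_cols` and `cat_cols` should be enough.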
- Text Tokenization:
  - Utilized `RegexpTokenizer()` and `word_tokenize()` from the NLTK library to tokenize the text in `wiki.txt`.
  - Removed stop words and punctuation symbols, ensuring only meaningful tokens were retained.
- Year Extraction:
  - Applied regular expressions to extract all year mentions from the text.
- Comparison:
  - Compared the results of the two tokenization methods and provided insights into the differences.
- The text file `wiki.txt` is included in this repository.
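A rough sketch of the tokenization and year-extraction steps is shown below. The helper names, the sample sentence, the tiny stop-word set, and the year pattern (four-digit years from 1000 to 2099) are all illustrative assumptions and not taken from the project. Note that `word_tokenize()` needs NLTK's Punkt resources to be downloaded first, so the sketch defines that helper but only calls the helpers that need no extra data.

```python
import re
import string

from nltk.tokenize import RegexpTokenizer, word_tokenize

# Tiny illustrative stop-word set (an assumption). In practice,
# nltk.corpus.stopwords.words("english") is the usual source once the
# "stopwords" corpus has been downloaded.
STOP_WORDS = {"the", "in", "and", "was", "a", "of", "it"}


def clean(tokens):
    """Lowercase tokens and drop stop words and pure-punctuation tokens."""
    kept = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS or all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept


def regexp_tokens(text):
    # \w+ keeps runs of letters, digits and underscores, so punctuation
    # is never emitted as a token.
    return clean(RegexpTokenizer(r"\w+").tokenize(text))


def word_tokens(text):
    # Requires the Punkt resources, e.g. nltk.download("punkt")
    # (or "punkt_tab" on newer NLTK releases).
    return clean(word_tokenize(text))


def extract_years(text):
    # Four-digit years between 1000 and 2099; the range is an assumption.
    return re.findall(r"\b(?:1\d{3}|20\d{2})\b", text)


sample = "The wiki was founded in 2001, and it wasn't redesigned until 2010."
print(regexp_tokens(sample))
print(extract_years(sample))  # ['2001', '2010']
```

One difference that a comparison like this tends to surface: the `\w+` pattern splits a contraction such as "wasn't" at the apostrophe into "wasn" and "t", whereas `word_tokenize()` splits it into "was" and "n't" and also emits punctuation marks as separate tokens, which then have to be filtered out.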
- Feature Selection Techniques:
  - Applied various feature selection techniques, including Correlation, Chi-Square, Mutual-Information, and Random Forest Feature Importance.
- Feature Importance Visualization:
  - Visualized the importance of selected features using bar charts.
- Comparison and Analysis:
  - Analyzed and commented on the results obtained from different feature selection techniques, identifying the best and worst methods.
- The Melbourne Housing dataset is available for download on Kaggle.
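The four feature selection techniques and the bar chart can be sketched along these lines. Everything specific here is an assumption made for illustration: the data is synthetic, the feature names (`rooms`, `distance`, `noise`) are hypothetical, and the price is binarized at its median so that the chi-square test, which expects a categorical target and non-negative features, applies. The project itself may have framed the target differently.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the housing data (hypothetical columns).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "rooms": rng.integers(1, 6, n),
    "distance": rng.uniform(1, 40, n),
    "noise": rng.normal(0, 1, n),   # unrelated to the target by construction
})
price = 200_000 * X["rooms"] - 10_000 * X["distance"] + rng.normal(0, 50_000, n)
y = (price > price.median()).astype(int)  # assumed binarization of the target

# chi2 requires non-negative inputs, so rescale each feature to [0, 1] first.
X_nonneg = MinMaxScaler().fit_transform(X)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

scores = pd.DataFrame({
    "correlation": X.corrwith(y).abs(),              # absolute Pearson correlation
    "chi2": chi2(X_nonneg, y)[0],                    # chi-square statistic per feature
    "mutual_info": mutual_info_classif(X, y, random_state=0),
    "rf_importance": forest.feature_importances_,    # impurity-based importance
}, index=X.columns)
print(scores.round(3))

# Bar chart of one of the rankings; the other columns can be plotted the same way.
ax = scores["rf_importance"].sort_values(ascending=False).plot.bar(
    title="Random forest feature importance"
)
ax.set_ylabel("importance")
plt.tight_layout()
plt.savefig("feature_importance.png")
```

The four scores live on different scales, so comparing methods is usually easier by looking at the feature rankings each one produces than at the raw values.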
- Python 3.x
- Libraries: pandas, numpy, matplotlib, seaborn, nltk, scikit-learn (imported as `sklearn`)
pip install pandas numpy matplotlib seaborn nltk scikit-learn