Data Preprocessing and Feature Selection Techniques in Python

Overview

This repository contains a comprehensive solution to a series of data preprocessing, tokenization, and feature selection tasks using Python. The project is divided into three parts, each focusing on different aspects of data science and machine learning.

Part 01: Data Preprocessing on Noisy Data

Tasks:

Handling Missing Values:
- Implemented imputation techniques to handle missing values in the dataset.
Normality Tests:
- Applied normality tests to numerical columns, clearly stating hypotheses and commenting on the results.
Encoding and Scaling:
- Encoded categorical variables and scaled numerical features to prepare the data for further analysis.
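
The three tasks above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`age`, `income`, `city`), the synthetic data, the choice of the Shapiro-Wilk test, and the use of `scipy` are all assumptions, since the contents of noisy_data.csv are not described here.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for noisy_data.csv; the real column names may differ.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 200),
    "income": rng.lognormal(10, 1, 200),
    "city": rng.choice(["A", "B", "C"], 200),
})
df.loc[::10, "age"] = np.nan    # inject missing numeric values
df.loc[::15, "city"] = np.nan   # inject missing categorical values

num_cols = ["age", "income"]
cat_cols = ["city"]

def shapiro_report(frame, columns, alpha=0.05):
    """Shapiro-Wilk test per column.

    H0: the column is drawn from a normal distribution.
    H1: it is not. H0 is rejected when p < alpha.
    """
    results = {}
    for col in columns:
        stat, p_value = stats.shapiro(frame[col].dropna())
        results[col] = {"statistic": stat, "p_value": p_value,
                        "reject_h0": bool(p_value < alpha)}
    return results

report = shapiro_report(df, num_cols)

# Impute, then scale numeric columns; impute, then one-hot encode categoricals.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("num", numeric_steps, num_cols), ("cat", categorical_steps, cat_cols)],
    sparse_threshold=0,  # always return a dense array
)
X = preprocess.fit_transform(df)
```

Median imputation is used here because it is less sensitive to skewed columns such as the synthetic `income`; mean imputation or a model-based imputer may suit the real data better.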

Dataset:

The dataset noisy_data.csv is included in this repository.

Part 02: Text Tokenization and Regular Expression Extraction

Tasks:

Text Tokenization:
- Utilized RegexpTokenizer() and word_tokenize() from the NLTK library to tokenize the text in wiki.txt.
- Removed stop words and punctuation symbols, ensuring only meaningful tokens were retained.
Year Extraction:
- Applied regular expressions to extract all year mentions from the text.
Comparison:
- Compared the results of the two tokenization methods and provided insights into the differences.
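
A minimal sketch of the tokenization and stop-word removal steps, using `RegexpTokenizer` only. The sample sentence and the small stop-word set are assumptions made for illustration; they stand in for the contents of wiki.txt and for NLTK's fuller stop-word corpus, which needs `nltk.download("stopwords")` before use.

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample text standing in for the contents of wiki.txt.
text = "Python was created by Guido van Rossum, and it was first released in 1991."

# Small illustrative stop-word set (an assumption, not NLTK's list).
STOP_WORDS = {"was", "by", "and", "it", "in", "the", "a", "of"}

# r"\w+" keeps runs of word characters, so punctuation is dropped during tokenization.
tokenizer = RegexpTokenizer(r"\w+")
tokens = [token.lower() for token in tokenizer.tokenize(text)]

# Keep only tokens that are not stop words.
meaningful = [token for token in tokens if token not in STOP_WORDS]
```

`word_tokenize()` differs in that it emits punctuation marks as separate tokens, so a comparable pipeline would need an extra filter (for example `token.isalnum()`) after tokenizing. It also depends on NLTK's Punkt tokenizer data being downloaded first.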

Dataset:

The text file wiki.txt is included in this repository.
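
As a sketch of the year-extraction task, a pattern like the one below could be applied to the text read from wiki.txt. The pattern is an assumption: it treats any standalone four-digit number from 1000 to 2099 as a year, which may over- or under-match depending on the text. The sample sentence is hypothetical.

```python
import re

# Standalone four-digit numbers in the range 1000-2099 (an assumed definition of "year").
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def extract_years(text):
    """Return every year-like match in order of appearance."""
    return YEAR_PATTERN.findall(text)

# Hypothetical sample; 12000 is not matched because of the word boundaries.
sample = "Founded in 1887, rebuilt in 1954, and listed in 2003; it has 12000 visitors."
years = extract_years(sample)
```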

Part 03: Feature Selection on Melbourne Housing Dataset

Tasks:

Feature Selection Techniques:
- Applied various feature selection techniques, including Correlation, Chi-Square, Mutual-Information, and Random Forest Feature Importance.
Feature Importance Visualization:
- Visualized the importance of selected features using bar charts.
Comparison and Analysis:
- Analyzed and commented on the results obtained from different feature selection techniques, identifying the best and worst methods.
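
The four scoring methods named above can be compared side by side, as in the sketch below. The data is synthetic and the column names (`Rooms`, `Distance`, `Noise`, `Price`) are placeholders, not a claim about how the notebook handles melb_data.csv. Because chi-square expects non-negative features and a categorical target, this sketch min-max scales the features and bins the target into quartiles, which is one possible workaround and an assumption here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import chi2, mutual_info_regression
from sklearn.preprocessing import MinMaxScaler

# Synthetic housing-like data: price depends on Rooms and Distance, not on Noise.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n),
    "Distance": rng.uniform(1, 30, n),
    "Noise": rng.normal(0, 1, n),
})
y = (200_000 * X["Rooms"] - 10_000 * X["Distance"]
     + rng.normal(0, 50_000, n)).rename("Price")

# 1. Absolute Pearson correlation with the target.
correlation = X.corrwith(y).abs()

# 2. Chi-square on scaled features against a quartile-binned target.
y_binned = pd.qcut(y, q=4, labels=False)
chi_scores, _ = chi2(MinMaxScaler().fit_transform(X), y_binned)

# 3. Mutual information between each feature and the continuous target.
mi_scores = mutual_info_regression(X, y, random_state=0)

# 4. Impurity-based importances from a random forest.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

scores = pd.DataFrame({
    "correlation": correlation,
    "chi_square": chi_scores,
    "mutual_info": mi_scores,
    "random_forest": forest.feature_importances_,
}, index=X.columns)
```

Each column of `scores` is on its own scale, so the methods are best compared by feature ranking rather than raw value. A bar chart per method can be drawn with `scores["random_forest"].plot.bar()`.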

Dataset:

The Melbourne Housing dataset is available on Kaggle. A copy, melb_data.csv, is included in this repository.

Installation and Usage

Prerequisites

Python 3.x
Libraries: pandas, numpy, matplotlib, seaborn, nltk, scikit-learn (imported as sklearn)

Installation

pip install pandas numpy matplotlib seaborn nltk scikit-learn
