Project Overview
This project, initially inspired by our participation in the UTA Datathon 2023, focused on building a machine-learning model to classify text based on its potential interest for fact-checking. Our goal was to predict whether a statement would be considered “check-worthy” by the general public, using labeled data to train models that could distinguish statements worth fact-checking from those that are not.
Task Objective
To create a machine learning model capable of identifying check-worthy statements based on labeled input data and evaluating it on unseen test data.
Dataset: The data consisted of two main files:
checkworthy_labeled.tsv – The labeled training data with columns for ID, Text, and Category (Yes or No).
checkworthy_eval.tsv – The evaluation file, with the same structure, used to assess our model’s predictions.
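Assuming the files follow the tab-separated layout described above (ID, Text, Category columns), loading them with Pandas might look like the sketch below. The file written here is a tiny synthetic sample purely for illustration; in the actual project the real checkworthy_labeled.tsv would be read directly.

```python
import pandas as pd

# Write a tiny synthetic sample in the same layout as checkworthy_labeled.tsv
# (tab-separated, columns: ID, Text, Category) purely for illustration.
sample = (
    "ID\tText\tCategory\n"
    "1\tThe budget grew 40 percent last year.\tYes\n"
    "2\tI love this city.\tNo\n"
)
with open("checkworthy_labeled_sample.tsv", "w") as f:
    f.write(sample)

# sep="\t" tells read_csv the file is tab-separated rather than comma-separated.
train = pd.read_csv("checkworthy_labeled_sample.tsv", sep="\t")
print(train.columns.tolist())  # ['ID', 'Text', 'Category']
```

The same call with the evaluation filename would load checkworthy_eval.tsv, since both files share one structure.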
Tools & Libraries
We utilized the following libraries for efficient data handling, model training, and evaluation:
Pandas and NumPy: For data processing and manipulation
Scikit-learn: For basic ML models like Logistic Regression and metrics
Matplotlib: For visualizations
Transformers, TensorFlow, and Keras: For fine-tuning pre-trained transformer models and building the deep-learning architecture
Approach
We adopted two distinct modeling approaches to analyze performance:
Logistic Regression
Data Loading and Preprocessing: Loaded the dataset with Pandas and converted the text data into a numerical matrix using the CountVectorizer class from Scikit-learn.
Model Building: Created two baseline models—LogisticRegressionCV and SGDClassifier—to assess performance on test data.
Hyperparameter Tuning: Fine-tuned the hyperparameters using GridSearchCV to optimize performance.
Validation & Testing: Split the data into 80% training and 20% validation sets. Achieved an accuracy of 83.28% on the validation set and 93.31% on the evaluation dataset, securing 4th place in the competition.
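The logistic-regression pipeline above can be sketched as follows. The texts and labels here are synthetic stand-ins for the labeled TSV data, and the cv=3 setting is an illustrative choice, not the competition configuration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the labeled data (the real set comes from checkworthy_labeled.tsv).
texts = [f"Crime dropped {i} percent last year." for i in range(10)] + \
        [f"I really enjoyed day {i} of the trip." for i in range(10)]
labels = np.array(["Yes"] * 10 + ["No"] * 10)

# Bag-of-words features, as in the preprocessing step above.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# 80/20 split mirrors the validation setup described above.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels
)

# LogisticRegressionCV picks its regularization strength by cross-validation.
clf = LogisticRegressionCV(cv=3, max_iter=1000).fit(X_tr, y_tr)
val_acc = accuracy_score(y_val, clf.predict(X_val))
print(f"validation accuracy: {val_acc:.2f}")
```

An SGDClassifier baseline or a GridSearchCV sweep would slot into the same pipeline by swapping the estimator, which is what makes the Scikit-learn setup convenient for comparing the two baselines.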
Fine-Tuning a Pre-trained Transformer Model (BERT)
Data Preprocessing: Employed BERT’s preprocessing and encoder modules for tokenizing and encoding text, adding necessary tokens such as [CLS] and [SEP].
Model Architecture: Constructed a deep learning model with a BERT encoder, followed by a dropout layer to prevent overfitting and a dense layer with sigmoid activation for binary classification.
Training and Hyperparameter Setting: Trained the model for 20 epochs using the Adam optimizer and binary cross-entropy loss. Defined metrics such as binary accuracy, precision, and recall for robust evaluation.
Model Evaluation: Achieved an accuracy of 82.26% on the validation set and 92.63% on the evaluation dataset, securing 7th place.
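The classification head and loss from the steps above can be sketched in NumPy. The BERT encoder itself is replaced here by a placeholder random feature vector (768 dimensions, the pooled output size of BERT-base), and the dropout layer is omitted since it is only active during training; this is a minimal illustration of the sigmoid head and binary cross-entropy, not the full fine-tuning setup:

```python
import numpy as np

def sigmoid(z):
    # Squashes logits into (0, 1), giving a probability for the "Yes" class.
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # The training loss described above; clipping avoids log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

rng = np.random.default_rng(0)

# Placeholder for the pooled BERT encoder output for a batch of 4 statements.
pooled = rng.normal(size=(4, 768))

# Dense layer with sigmoid activation: one weight vector plus bias -> probability.
W = rng.normal(scale=0.01, size=768)
b = 0.0
probs = sigmoid(pooled @ W + b)

labels = np.array([1, 0, 1, 0])  # 1 = check-worthy ("Yes"), 0 = not ("No")
loss = binary_cross_entropy(labels, probs)
print(f"loss: {loss:.4f}")
```

In the actual Keras model, this head sits on top of the BERT encoder and dropout layer, and the Adam optimizer updates both the head and the encoder weights against this same loss.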
Results and Insights
Logistic Regression achieved higher accuracy and demonstrated the interpretability of linear models on textual data. Validation score: 83.28%; Evaluation score: 93.31%.
The BERT model yielded strong results after fine-tuning, showing the potential of transformers in text classification tasks. Validation score: 82.26%; Evaluation score: 92.63%.
Ranked Achievements:
Rank 4: Logistic Regression
Rank 7: BERT Model
Key Learnings
This project demonstrated the value of both traditional ML techniques and transformer-based models in text classification; notably, the simpler and more interpretable linear model slightly outperformed the fine-tuned transformer here. Working with deep learning frameworks like TensorFlow and Transformers allowed us to better understand how to leverage pre-trained models for specific tasks, further enhancing our skill set in natural language processing (NLP) and model evaluation.