Text Classification using NLP

NLP

2023

Natural Language Processing



Project Overview

This project, initially inspired by our participation in the UTA Datathon 2023, focused on building a machine-learning model to classify text by its potential interest for fact-checking. Our goal was to predict whether the general public would consider a statement “check-worthy,” using labeled data to train models that could distinguish factual statements from potentially dubious ones.

Task Objective

To create a machine learning model capable of identifying check-worthy statements based on labeled input data and evaluating it on unseen test data.

Dataset: The data consisted of two main files:

checkworthy_labeled.tsv – The labeled training data with columns for ID, Text, and Category (Yes or No).

checkworthy_eval.tsv – The evaluation file, with the same structure, used to assess our model’s predictions.
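Since both files share the ID / Text / Category schema and are tab-separated, loading them with Pandas is straightforward. A minimal sketch (the inline sample rows are invented; in practice you would pass the path to checkworthy_labeled.tsv):

```python
import io

import pandas as pd

# Invented sample rows mirroring the checkworthy_labeled.tsv schema.
sample_tsv = (
    "ID\tText\tCategory\n"
    "1\tThe unemployment rate fell to 3.5% last quarter.\tYes\n"
    "2\tI think pizza is the best food ever.\tNo\n"
)

# The files are tab-separated, so sep="\t" is required.
train_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(train_df.shape)          # (2, 3)
print(list(train_df.columns))  # ['ID', 'Text', 'Category']
```

The same call with checkworthy_eval.tsv loads the evaluation set, since the two files share one structure.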

Tools & Libraries

We utilized the following libraries for efficient data handling, model training, and evaluation:

Pandas and NumPy: For data processing and manipulation

Scikit-learn: For basic ML models like Logistic Regression and metrics

Matplotlib: For visualizations

Transformers, TensorFlow, and Keras: For fine-tuning pre-trained transformer models and building the deep-learning architecture

Approach

We adopted two distinct modeling approaches and compared their performance:

Logistic Regression

Data Loading and Preprocessing: Loaded the dataset with Pandas and converted the text into a numerical feature matrix using Scikit-learn’s CountVectorizer class.

Model Building: Created two baseline models—LogisticRegressionCV and SGDClassifier—to assess performance on test data.

Hyperparameter Tuning: Fine-tuned the hyperparameters using GridSearchCV to optimize performance.

Validation & Testing: Split the data into 80% training and 20% validation sets. Achieved an accuracy of 83.28% on the validation set and 93.31% on the evaluation dataset, securing 4th place in the competition.
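The steps above can be sketched as a single bag-of-words pipeline. This is a hedged illustration rather than our exact competition code; the six sentences and labels below are invented stand-ins for the real training file:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy corpus standing in for checkworthy_labeled.tsv.
texts = [
    "The GDP grew by 2.1 percent in 2022.",
    "Crime rates doubled over the last decade.",
    "I love rainy afternoons.",
    "That movie was fantastic.",
    "The senator voted against the bill twice.",
    "My cat is adorable.",
]
labels = ["Yes", "Yes", "No", "No", "Yes", "No"]

# 80/20 train/validation split, as in the write-up.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Bag-of-words features feeding a cross-validated logistic regression;
# hyperparameters could then be tuned further with GridSearchCV.
model = make_pipeline(CountVectorizer(), LogisticRegressionCV(cv=2, max_iter=1000))
model.fit(X_train, y_train)

preds = model.predict(X_val)
print(preds)  # one "Yes"/"No" prediction per validation sentence
```

Swapping LogisticRegressionCV for SGDClassifier in the pipeline gives the second baseline with no other changes.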

Fine-Tuning a Pre-trained Transformer Model (BERT)

Data Preprocessing: Employed BERT’s preprocessing and encoder modules for tokenizing and encoding text, adding necessary tokens such as [CLS] and [SEP].

Model Architecture: Constructed a deep learning model with a BERT encoder, followed by a dropout layer to prevent overfitting and a dense layer with sigmoid activation for binary classification.

Training and Hyperparameter Setting: Trained the model for 20 epochs using the Adam optimizer and binary cross-entropy loss. Tracked binary accuracy, precision, and recall for robust evaluation.

Model Evaluation: Achieved an accuracy of 82.26% on the validation set and 92.63% on the evaluation dataset, securing 7th place.
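The [CLS]/[SEP] framing from the preprocessing step can be illustrated without any deep-learning dependency. This toy function is a simplification (real BERT uses WordPiece subword tokenization, not whitespace splitting), intended only to show the shape of the encoder's input:

```python
def frame_for_bert(text, max_len=10):
    """Wrap a whitespace-tokenized sentence with BERT-style special tokens.

    Simplified sketch: real BERT preprocessing uses WordPiece subwords
    and maps tokens to integer IDs; here we keep plain strings.
    """
    # [CLS] marks the start (its embedding is used for classification),
    # [SEP] marks the end of the segment.
    tokens = ["[CLS]"] + text.lower().split()[: max_len - 2] + ["[SEP]"]
    # Pad to a fixed length so every example has the same shape.
    tokens += ["[PAD]"] * (max_len - len(tokens))
    return tokens

print(frame_for_bert("The Earth orbits the Sun"))
# ['[CLS]', 'the', 'earth', 'orbits', 'the', 'sun', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
```

In the actual model, BERT's preprocessing layer produces these padded token sequences as integer IDs, which the encoder turns into the vectors fed to the dropout and sigmoid layers described above.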

Results and Insights

Logistic Regression achieved higher accuracy and demonstrated the interpretability of linear models on textual data. Validation score: 83.28%; Evaluation score: 93.31%.

BERT Model yielded strong results after fine-tuning, showing the potential of transformers in text classification tasks. Validation score: 82.26%; Evaluation score: 92.63%.

Ranked Achievements:

Rank 4: Logistic Regression

Rank 7: BERT Model

Key Learnings

This project demonstrated the value of both traditional ML techniques and transformer-based models in text classification, showing the strengths of interpretability and accuracy. Working with deep learning frameworks like TensorFlow and Transformers allowed us to better understand how to leverage pre-trained models for specific tasks, further enhancing our skill set in natural language processing (NLP) and model evaluation.

