Project Overview
This project, initially inspired by our participation in the UTA Datathon 2023, focused on building a machine-learning model to classify text based on its potential interest for fact-checking. Our goal was to predict whether a statement would be considered “check-worthy” by the general public, using labeled data to train models that could distinguish statements worth fact-checking from those that are not.
Task Objective
To create a machine learning model capable of identifying check-worthy statements based on labeled input data and evaluating it on unseen test data.
Dataset: The data consisted of two main files:
checkworthy_labeled.tsv – The labeled training data with columns for ID, Text, and Category (Yes or No).
checkworthy_eval.tsv – The evaluation file, with the same structure, used to assess our model’s predictions.
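Assuming the files follow the tab-separated layout described above (ID, Text, Category columns), loading them with Pandas might look like the sketch below. The file written here is a tiny synthetic sample purely for illustration; in the actual project the real checkworthy_labeled.tsv would be read directly.

```python
import pandas as pd

# Write a tiny synthetic sample in the same layout as checkworthy_labeled.tsv
# (tab-separated, columns: ID, Text, Category) purely for illustration.
sample = (
    "ID\tText\tCategory\n"
    "1\tThe budget grew 40 percent last year.\tYes\n"
    "2\tI love this city.\tNo\n"
)
with open("checkworthy_labeled_sample.tsv", "w") as f:
    f.write(sample)

# sep="\t" tells read_csv the file is tab-separated rather than comma-separated.
train = pd.read_csv("checkworthy_labeled_sample.tsv", sep="\t")
print(train.columns.tolist())  # ['ID', 'Text', 'Category']
```

The same call with the evaluation filename would load checkworthy_eval.tsv, since both files share one structure.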
Tools & Libraries
We utilized the following libraries for efficient data handling, model training, and evaluation:
Pandas and NumPy: For data processing and manipulation
Scikit-learn: For basic ML models like Logistic Regression and metrics
Matplotlib: For visualizations
Transformers, TensorFlow, and Keras: For fine-tuning pre-trained transformer models and building the deep-learning architecture
Approach
We adopted two distinct modeling approaches to analyze performance:
Logistic Regression
Data Loading and Preprocessing: Loaded the dataset with Pandas and converted the text data into a numerical matrix using the CountVectorizer class from Scikit-learn.
Model Building: Created two baseline models—LogisticRegressionCV and SGDClassifier—to assess performance on test data.
Hyperparameter Tuning: Fine-tuned the hyperparameters using GridSearchCV to optimize performance.
Validation & Testing: Split the data into 80% training and 20% validation sets. Achieved an accuracy of 83.28% on the validation set and 93.31% on the evaluation dataset, securing 4th place in the competition.
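The logistic-regression pipeline above can be sketched as follows. The texts and labels here are synthetic stand-ins for the labeled TSV data, and the cv=3 setting is an illustrative choice, not the competition configuration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the labeled data (the real set comes from checkworthy_labeled.tsv).
texts = [f"Crime dropped {i} percent last year." for i in range(10)] + \
        [f"I really enjoyed day {i} of the trip." for i in range(10)]
labels = np.array(["Yes"] * 10 + ["No"] * 10)

# Bag-of-words features, as in the preprocessing step above.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# 80/20 split mirrors the validation setup described above.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels
)

# LogisticRegressionCV picks its regularization strength by cross-validation.
clf = LogisticRegressionCV(cv=3, max_iter=1000).fit(X_tr, y_tr)
val_acc = accuracy_score(y_val, clf.predict(X_val))
print(f"validation accuracy: {val_acc:.2f}")
```

An SGDClassifier baseline or a GridSearchCV sweep would slot into the same pipeline by swapping the estimator, which is what makes the Scikit-learn setup convenient for comparing the two baselines.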
Fine-Tuning a Pre-trained Transformer Model (BERT)
Data Preprocessing: Employed BERT’s preprocessing and encoder modules for tokenizing and encoding text, adding necessary tokens such as [CLS] and [SEP].
Model Architecture: Constructed a deep learning model with a BERT encoder, followed by a dropout layer to prevent overfitting and a dense layer with sigmoid activation for binary classification.
Training and Hyperparameter Setting: Trained the model for 20 epochs using the Adam optimizer and binary cross-entropy loss. Defined metrics such as binary accuracy, precision, and recall for robust evaluation.
Model Evaluation: Achieved an accuracy of 82.26% on the validation set and 92.63% on the evaluation dataset, securing 7th place.
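The classification head and loss from the steps above can be sketched in NumPy. The BERT encoder itself is replaced here by a placeholder random feature vector (768 dimensions, the pooled output size of BERT-base), and the dropout layer is omitted since it is only active during training; this is a minimal illustration of the sigmoid head and binary cross-entropy, not the full fine-tuning setup:

```python
import numpy as np

def sigmoid(z):
    # Squashes logits into (0, 1), giving a probability for the "Yes" class.
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # The training loss described above; clipping avoids log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

rng = np.random.default_rng(0)

# Placeholder for the pooled BERT encoder output for a batch of 4 statements.
pooled = rng.normal(size=(4, 768))

# Dense layer with sigmoid activation: one weight vector plus bias -> probability.
W = rng.normal(scale=0.01, size=768)
b = 0.0
probs = sigmoid(pooled @ W + b)

labels = np.array([1, 0, 1, 0])  # 1 = check-worthy ("Yes"), 0 = not ("No")
loss = binary_cross_entropy(labels, probs)
print(f"loss: {loss:.4f}")
```

In the actual Keras model, this head sits on top of the BERT encoder and dropout layer, and the Adam optimizer updates both the head and the encoder weights against this same loss.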
Results and Insights
Logistic Regression achieved higher accuracy and demonstrated the interpretability of linear models on textual data. Validation score: 83.28%; Evaluation score: 93.31%.
The BERT model yielded strong results after fine-tuning, showing the potential of transformers in text classification tasks. Validation score: 82.26%; Evaluation score: 92.63%.
Ranked Achievements:
Rank 4: Logistic Regression
Rank 7: BERT Model
Key Learnings
This project demonstrated the value of both traditional ML techniques and transformer-based models in text classification; notably, the simpler and more interpretable linear model slightly outperformed the fine-tuned transformer here. Working with deep learning frameworks like TensorFlow and Transformers allowed us to better understand how to leverage pre-trained models for specific tasks, further enhancing our skill set in natural language processing (NLP) and model evaluation.