Email Classification of Text Data Using Machine Learning and Natural Language Processing Technique

Oluwaseyi Ijogun, Hayden Wimmer, Carl Rebman

Research output: Contribution to book or proceeding › Conference article › peer-review

Abstract

Spam and phishing emails are among the most critical problems in social networks. They raise several issues: the cost of dealing with them in large quantities, privacy breaches that result in the loss of sensitive information, the time taken to identify them, and cybersecurity threats due to malicious content. Using a spam and phishing detection approach, a model can quickly recognize such emails and classify them before they become a threat to the organization. In this study, a supervised learning approach based on machine learning and natural language processing was used to improve email classification. The dataset was prepared and dynamically classified into three categories: spam-ham, spam-phishing, and ham-phishing. The classification workflow comprised data preprocessing, feature selection, model training, model testing, and evaluation of classification results and performance. Five machine learning algorithms were used, and the results were evaluated using eight performance indexes. The results show that the XGBoost classifier outperformed the other machine learning algorithms on these datasets. This research could help to categorize emails into different folders based on their content, intent, or relevance, improve user experience, and better manage email inboxes by automatically filtering, sorting, and prioritizing messages.
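The dataset split and the evaluation step described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the toy corpus, the function names `pairwise_datasets` and `binary_metrics`, and the choice of four metrics are all assumptions, since the abstract does not list the eight performance indexes or describe the data format. No classifier is trained here; the sketch covers only how a three-class corpus might be divided into the three named categories and how predictions might be scored.

```python
# Hypothetical toy corpus of (text, label) pairs; not the paper's data.
EMAILS = [
    ("win a free prize now", "spam"),
    ("cheap meds limited offer", "spam"),
    ("meeting moved to 3pm", "ham"),
    ("lunch tomorrow?", "ham"),
    ("verify your account password here", "phishing"),
    ("your bank account is locked, login now", "phishing"),
]

# The three label pairs named in the abstract.
PAIRS = [("spam", "ham"), ("spam", "phishing"), ("ham", "phishing")]


def pairwise_datasets(emails, pairs=PAIRS):
    """Build one binary dataset per label pair from a three-class corpus."""
    return {
        f"{a}-{b}": [(text, label) for text, label in emails if label in (a, b)]
        for a, b in pairs
    }


def binary_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall, and F1 for one positive class.

    These four are common choices and are assumed here; the abstract does
    not say which eight indexes the study used.
    """
    scored = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in scored if t == positive and p == positive)
    fp = sum(1 for t, p in scored if t != positive and p == positive)
    fn = sum(1 for t, p in scored if t == positive and p != positive)
    tn = len(scored) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(scored) if scored else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    for name, rows in pairwise_datasets(EMAILS).items():
        print(name, len(rows))
```

Treating each pair as its own binary task keeps the positive class unambiguous when precision and recall are computed, which is one plausible reason to split the data this way.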

Original language: English
Title of host publication: Optimization and Data Science in Industrial Engineering - First International Conference, ODSIE 2023, Proceedings
Editors: A. Mirzazadeh, Zohreh Molamohamadi, Efran Babaee Tirkolaee, Gerhard-Wilhelm Weber, Janny Leung
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 212-236
Number of pages: 25
ISBN (Print): 9783031814549
State: Published - 2025
Event: 1st International Conference on Optimization and Data Science in Industrial Engineering, ODSIE 2023 - Istanbul, Turkey
Duration: Nov 16 2023 → Nov 17 2023

Publication series

Name: Communications in Computer and Information Science
Volume: 2204
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 1st International Conference on Optimization and Data Science in Industrial Engineering, ODSIE 2023
Country/Territory: Turkey
City: Istanbul
Period: 11/16/23 → 11/17/23

Scopus Subject Areas

  • General Computer Science
  • General Mathematics

Keywords

  • Classification
  • NLP
  • Phishing
  • Spam
