Building a Spam Email Detection System Using FFNN, LSTM, and NLTK
Spam email detection is a classic application at the intersection of natural language processing and machine learning. In this blog, I discuss how to develop a spam detection system using Feedforward Neural Networks (FFNN), Long Short-Term Memory (LSTM) networks, and the Natural Language Toolkit (NLTK). Careful preprocessing combined with well-regularized neural network models can classify an email as spam or not spam with high accuracy.
Spam emails are unsolicited messages that often carry malicious content or advertisements. Detecting such emails programmatically can save users from potential threats and unnecessary clutter. We will use FFNN and LSTM for classification and leverage NLTK for advanced text preprocessing techniques like lemmatization.
Before starting, ensure the required libraries are installed. Below is the code to set up the environment:
!pip install tensorflow
!pip install pandas
!pip install scikit-learn
!pip install nltk
!pip install seaborn
Importing Libraries
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score, classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt
import seaborn as sns
Downloading NLTK Data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')  # required by nltk.word_tokenize below
Data Preprocessing
Preprocessing text data is critical for training an effective model. Steps include:
- Lemmatization: Reducing words to their base forms.
- Stopword Removal: Eliminating common words (like “the”, “and”) that do not contribute to classification.
- Tokenization and Padding: Converting text into sequences of tokens and padding them for uniform input length (for LSTM).
# Build the stopword set and lemmatizer once; calling stopwords.words()
# for every token inside the loop is very slow
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    filtered_tokens = [lemmatizer.lemmatize(word) for word in tokens
                       if word.isalpha() and word not in stop_words]
    return ' '.join(filtered_tokens)
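A quick sanity check on a made-up message (the exact output depends on your NLTK data, but it should look roughly like this):

sample = "Congratulations!! You have WON a free prize, click here to claim it."
print(preprocess_text(sample))
# e.g. 'congratulation free prize click claim'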
# Load dataset (example CSV file)
# NOTE: column names depend on your dataset; this assumes a raw-text column
# named 'text' and a 'label' column (0 = ham, 1 = spam). Adjust as needed.
data = pd.read_csv('emails.csv')
data['processed_text'] = data['text'].apply(preprocess_text)
# Tokenization
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['processed_text'])
X = tokenizer.texts_to_sequences(data['processed_text'])
X = pad_sequences(X, maxlen=100) # Padding sequences to a fixed length
y = data['label'] # Assuming 'label' column contains 0 for ham and 1 for spam
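Spam datasets are often imbalanced, so it is worth checking the class distribution before splitting (assuming the 'label' column described above):

# Fraction of ham (0) vs. spam (1); heavy skew may call for class weights
print(data['label'].value_counts(normalize=True))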
Splitting Data
# stratify keeps the spam/ham ratio consistent across the train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
Building the Models
FFNN Model
The FFNN is a type of artificial neural network where connections between nodes do not form cycles. Here’s how we define and compile the FFNN model:
ffnn_model = Sequential([
    Dense(128, input_dim=X_train.shape[1], activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])
ffnn_model.compile(optimizer=Adam(learning_rate=0.001),
                   loss='binary_crossentropy',
                   metrics=['accuracy'])
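One caveat: this FFNN consumes the same padded token-ID sequences as the LSTM, and Dense layers treat those IDs as raw magnitudes, which carry no linguistic meaning. A common alternative is to feed an FFNN bag-of-words counts instead; a minimal sketch using scikit-learn's CountVectorizer (an option, not part of the pipeline above):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000)  # cap the vocabulary size
X_bow = vectorizer.fit_transform(data['processed_text']).toarray()
# An FFNN built on X_bow would use input_dim=X_bow.shape[1]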
LSTM Model
The LSTM network is a type of recurrent neural network (RNN) well-suited for sequential data like text. Here’s how we define and compile the LSTM model:
lstm_model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=128, input_length=100),
    LSTM(128, return_sequences=False),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])
lstm_model.compile(optimizer=Adam(learning_rate=0.001),
                   loss='binary_crossentropy',
                   metrics=['accuracy'])
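Before training, you can inspect either architecture with Keras's built-in summary:

ffnn_model.summary()
lstm_model.summary()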
Training the Models with Early Stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
# Train FFNN
ffnn_history = ffnn_model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping])
# Train LSTM
lstm_history = lstm_model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping])
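Note that the calls above validate on the test set, which lets early stopping peek at test data. A cleaner setup holds out part of the training data instead, for example:

# Hold out 10% of the training data for validation; keep the test set untouched
ffnn_history = ffnn_model.fit(X_train, y_train, epochs=20, batch_size=32,
                              validation_split=0.1, callbacks=[early_stopping])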
Evaluating the Models
Once trained, the models’ performance is evaluated using metrics such as accuracy, the confusion matrix, and the F1-score.
# Predicting on the test set
ffnn_pred = (ffnn_model.predict(X_test) > 0.5).astype("int32")
lstm_pred = (lstm_model.predict(X_test) > 0.5).astype("int32")
# Confusion Matrices
ffnn_conf_matrix = confusion_matrix(y_test, ffnn_pred)
lstm_conf_matrix = confusion_matrix(y_test, lstm_pred)
# Visualization
sns.heatmap(ffnn_conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('FFNN Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
sns.heatmap(lstm_conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('LSTM Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# Classification Reports
print("FFNN Classification Report")
print(classification_report(y_test, ffnn_pred))
print("LSTM Classification Report")
print(classification_report(y_test, lstm_pred))
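The classification reports already include per-class F1, but since f1_score was imported above, the headline number can also be printed directly:

print("FFNN F1:", f1_score(y_test, ffnn_pred))
print("LSTM F1:", f1_score(y_test, lstm_pred))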
Visualizing Training History
# FFNN Training History
plt.plot(ffnn_history.history['accuracy'], label='Train Accuracy')
plt.plot(ffnn_history.history['val_accuracy'], label='Validation Accuracy')
plt.plot(ffnn_history.history['loss'], label='Train Loss')
plt.plot(ffnn_history.history['val_loss'], label='Validation Loss')
plt.title('FFNN Training History')
plt.xlabel('Epochs')
plt.ylabel('Metrics')
plt.legend()
plt.show()
# LSTM Training History
plt.plot(lstm_history.history['accuracy'], label='Train Accuracy')
plt.plot(lstm_history.history['val_accuracy'], label='Validation Accuracy')
plt.plot(lstm_history.history['loss'], label='Train Loss')
plt.plot(lstm_history.history['val_loss'], label='Validation Loss')
plt.title('LSTM Training History')
plt.xlabel('Epochs')
plt.ylabel('Metrics')
plt.legend()
plt.show()
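Classifying a New Email
To classify a previously unseen email, apply the same preprocessing, tokenization, and padding used at training time. A small helper, built from the objects defined above:

def predict_spam(text, model, threshold=0.5):
    """Return 'spam' or 'ham' for a raw email string."""
    processed = preprocess_text(text)
    seq = tokenizer.texts_to_sequences([processed])
    padded = pad_sequences(seq, maxlen=100)  # must match the training maxlen
    return "spam" if model.predict(padded)[0][0] > threshold else "ham"

print(predict_spam("WIN a FREE prize now, click here!!!", lstm_model))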
(The F1 scores, confusion matrices, and training graphs above summarize each model's results.)
Conclusion
I have demonstrated how to build a spam email detection system using FFNN, LSTM, and NLTK. We achieved effective classification by preprocessing email text with lemmatization and training robust neural networks. This approach can be further enhanced by:
- Using advanced embeddings (e.g., Word2Vec, GloVe, or BERT) for richer text representation (see the sketch after this list).
- Training on larger, more diverse datasets for improved generalization.
- Leveraging ensemble models to combine the strengths of multiple classifiers.
- I trained for only a few epochs here; you can increase the epoch count and tune the models further to improve accuracy.
- Exploring other models or algorithms that may detect spam even more effectively.
- If you can think of more options for improvement, please let me know in the comments.
- GitHub Repo Link: https://github.com/ashikp/spamdetection-ai
- Dataset Link: https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv
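For the embeddings idea above, here is a minimal sketch of loading pre-trained GloVe vectors into the Embedding layer. The file path assumes a local download of the GloVe 100-dimensional vectors, and the 100 must match the layer's output_dim:

embedding_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:  # downloaded separately
    for line in f:
        values = line.split()
        embedding_index[values[0]] = np.asarray(values[1:], dtype='float32')

embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, 100))
for word, i in tokenizer.word_index.items():
    vector = embedding_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
# Then: Embedding(..., weights=[embedding_matrix], trainable=False)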
Thanks and Peace