Deep Learning Based Cyber-Attack Detection in IoT Networks

— an LSTM + CNN + Transformer Ensemble

TL;DR

Industrial IoT networks emit hundreds of multivariate sensor readings per second, and detecting cyber-attacks hidden in those streams in real time is a hard problem. In this project I trained three complementary deep models — LSTM, 1D-CNN, and a Transformer encoder — on the SWaT (Secure Water Treatment) dataset, and wired them into a Kafka-based streaming pipeline. Decisions are made by majority voting: an alarm is raised the moment at least 2 out of 3 models cross their thresholds.

All code and trained models on GitHub: github.com/mraknar/iot-anomaly-detection

The Problem

Three properties make attack detection in IIoT (Industrial IoT) particularly painful:

Severe class imbalance. In SWaT, 96.21% of samples are normal and only 3.79% are attacks. Plain accuracy is misleading — a model that always predicts "normal" already scores 96%.
Strong temporal dependency. The signature of an attack does not live in a single reading. It accumulates across roughly 10 consecutive sensor steps.
Low-latency requirement. In a critical environment like a water treatment plant, the alarm needs to fire within a second or two. Offline batch analysis is not enough.

A single model struggles to satisfy all three at once. That is why I combined three different architectures with complementary strengths.

The Dataset — SWaT (Secure Water Treatment)

A public benchmark dataset collected from iTrust Lab's scaled-down water treatment testbed.

1,441,719 time steps
51 sensor / actuator variables (pressure, flow, level, valve states, etc.)
Labels: Normal (0) and Attack (1)
Class distribution: 96.21% / 3.79%

Preprocessing pipeline:

Strip column names; drop timestamp and label columns from model inputs
Min-Max normalization fitted only on the training split, then serialised to scaler.pkl and reused unchanged at inference time — this is the only reliable way to avoid train/serve skew
Sliding-window segmentation: SEQ_LEN = 10, so each model input is a (10, 51) tensor
Stratified train/test split preserving the original 96/4 ratio

The Three Models

I deliberately chose architectures with complementary behaviour:

1) LSTM

Two stacked LSTM layers (64 → 32 units) with dropout. Its job is to capture gradual drift and long-term temporal dependencies in sensor behaviour. Behaviour: the slowest of the three to react at attack onset, but once it "wakes up" it does not miss attacks (high recall).

2) 1D-CNN

Two convolutional layers (64 → 32 filters, kernel size 3) + global max pooling + dropout. Its job is to catch local morphological patterns — short-lived spikes, sudden discontinuities. Behaviour: the most balanced precision–recall trade-off of the three.

3) Transformer Encoder

Multi-head self-attention (4 heads) + feed-forward block + global average pooling. Its job is to capture long-range, global interactions across the window. Behaviour: the fastest to react at attack onset — it can jump from probability 0.05 to 0.94 within one or two steps.

Training setup

Loss: binary cross-entropy with class weights to penalise the minority class
Optimizer: Adam (lr = 1e-3 for LSTM/CNN, 1e-4 for the Transformer) with gradient clipping
20 epochs, batch size 128
EarlyStopping(patience=7) + ModelCheckpoint(monitor='val_accuracy')

All three models reach ~99% validation accuracy offline, but the real story is in their different precision–recall profiles — and that is exactly what the ensemble feeds on.

The Pipeline — Real-Time Streaming with Kafka

Offline evaluation alone is an academic exercise. I wanted to see how the models behave in a real streaming scenario, so I wired up:

SWaT CSV  ──►  producer.py  ──►  Kafka topic  ──►  consumer.py  ──►  3 models + voting  ──►  ALARM/SAFE
                                  (Kafka + Zookeeper in Docker)

Producer (producer.py) — reads SWaT rows, serialises each to JSON, and publishes them to the swat_sensor_data topic. The demo scenario sends 50 normal rows followed by 50 attack rows so the live transition from normal to attack is observable.
Kafka broker — Zookeeper + Kafka brought up via Docker Compose; the topic is auto-created.
Consumer (consumer.py) — listens to the topic, maintains a 10-step sliding buffer, normalises the data using the saved scaler.pkl, and feeds it as a (1, 10, 51) tensor to all three models in parallel.

The Decision Rule — Majority Voting

Instead of trusting a single model output, I empirically tuned three different thresholds:

Model	Threshold	Rationale
LSTM	0.30	Noisy outputs but high recall → low threshold
CNN	0.40	Balanced — neutral threshold
Transformer	0.50	Very confident outputs → standard threshold

Decision rule: alarm = (votes ≥ 2) — strict majority across three independent decisions.

This significantly reduces false positives compared to any single model, while the Transformer's early sensitivity ensures recall is not sacrificed.

Live Behaviour Observed

Normal regime (before the attack):
  LSTM  ≈ 0.03–0.10
  CNN   ≈ 0.25–0.35
  TRANS ≈ 0.05–0.10
  → 0/3 votes → SAFE

Attack onset:
  TRANS jumps first    (~0.94–0.99)   → 1/3 votes
  CNN  crosses shortly after          → 2/3 votes → ALARM
  LSTM joins last (cumulative drift accumulates)

This ordering is a perfect match for what the literature predicts: Transformers detect global deviation early, CNNs catch local morphology, and LSTMs add temporal stability. The ensemble combines three different time scales of evidence.

Resources

GitHub Repository — all code, trained models, Docker setup, and the training notebook
SWaT Dataset — available on Kaggle

You can view and download my presentation PDF file below.

sunum İndir

Time Series Classification in IoT Networks

— an LSTM + CNN + Transformer Ensemble

TL;DR

The Problem

The Dataset — SWaT (Secure Water Treatment)