Building an Attention Residual UNet for CT Liver Segmentation
Medical image segmentation has evolved dramatically from manual tracing to fully automated deep learning pipelines. In this post, I walk through my implementation of an Attention Residual UNet for segmenting liver, hepatic vessels, and tumors from CT scans—covering the architecture decisions, data processing, training setup, and lessons learned.
Why This Matters
The liver reveals a lot about a patient's health through medical imaging. CT and MRI scans allow doctors to assess various conditions, but manual segmentation is:
- Time-consuming: 30-60 minutes per scan for expert radiologists
- Inconsistent: Subject to inter-observer variability
- A bottleneck: Slows down diagnosis and treatment planning
Automating this process can significantly improve diagnostic speed and consistency—which is exactly what this project aims to do.
The Evolution of Segmentation Methods
Before diving into my implementation, it helps to understand what came before:
Traditional Approaches
Thresholding segments regions based on pixel intensity—works when target structures have consistent values, but fails when liver tissue overlaps with surrounding organs on the intensity spectrum.
Region growing starts from seed points and expands based on similarity criteria—requires manual initialization and struggles with inhomogeneous tissues.
The Deep Learning Revolution
CNNs changed everything by automatically learning spatial hierarchies. But vanilla CNNs weren't designed for dense prediction tasks like segmentation.
UNet (2015) introduced the encoder-decoder architecture with skip connections—suddenly, we could produce pixel-level predictions while preserving spatial details. This became the foundation for medical image segmentation.
ResNet solved the vanishing gradient problem in deep networks through residual connections—enabling much deeper architectures that could learn more complex features.
My Architecture: Attention Residual UNet
I combined all three innovations into a single architecture:
The UNet Backbone
The classic encoder-decoder structure:
Encoder Path: Multiple convolutional blocks, each with two conv layers + ReLU + batch norm, followed by max pooling that halves spatial dimensions. This captures increasingly abstract features.
Decoder Path: Mirrors the encoder with upsampling (transposed convolutions) to reconstruct spatial resolution. Each level combines upsampled features with corresponding encoder features via skip connections.
Skip Connections: The key insight—directly connect encoder blocks to decoder blocks at each level:
D_i = f(E_i) + g(D_{i+1})
This preserves high-resolution details crucial for precise boundaries.
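To make the structure concrete, here is a minimal PyTorch sketch of one encoder block and one decoder level. The class and parameter names (`ConvBlock`, `UpBlock`) are mine, not from the project's code, and the skip combination follows the sum in the formula above; concatenation is the other common choice.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Encoder unit: two 3x3 convs, each followed by batch norm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class UpBlock(nn.Module):
    """Decoder level: a transposed conv doubles spatial resolution, then the
    encoder skip is added elementwise (D_i = f(E_i) + g(D_{i+1})) before a
    conv block refines the combined features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = ConvBlock(out_ch, out_ch)

    def forward(self, x, skip):
        return self.conv(self.up(x) + skip)

# Sanity check: one decoder step on toy tensors.
enc_feat = torch.randn(1, 8, 64, 64)    # encoder features at this level
deep_feat = torch.randn(1, 16, 32, 32)  # features from the level below
out = UpBlock(16, 8)(deep_feat, enc_feat)  # -> (1, 8, 64, 64)
```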
Adding Residual Connections
I replaced standard conv blocks with residual blocks throughout:
y = x + F(x, {W_i})
The input x is added directly to the block's output, allowing gradients to flow through the shortcut. This stabilizes training in deeper networks and helps preserve fine anatomical details through the network's depth.
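A minimal residual block in the same style, with a 1×1 shortcut conv to match channel counts when they differ (the exact block layout in the project may vary; this is the standard pattern):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): two convs form F, and the input is added back through
    a shortcut. A 1x1 conv on the shortcut matches channels when needed."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch))
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

x = torch.randn(2, 8, 32, 32)
y = ResidualBlock(8, 16)(x)  # channels change, shortcut handles it
```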
Attention Mechanisms
Here's where it gets interesting. I added spatial attention gates just before each skip connection in the decoder:
F_att = α ⊙ F
The attention map α is computed using scaled dot-product attention:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
This forces the network to focus on relevant regions (liver boundaries, vessels) while suppressing irrelevant background. The result: sharper segmentation boundaries and better handling of ambiguous regions.
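As a sketch of the gating pattern F_att = α ⊙ F, here is a minimal spatial attention gate in PyTorch. For brevity I've used the additive formulation popularized by Attention U-Net; the project computes α via scaled dot-product attention instead, but the final gating step (multiply the skip features by α) is the same. Names and channel sizes are illustrative:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Computes a spatial attention map alpha from the decoder's gating
    signal g and the encoder skip features x, then returns alpha * x.
    Assumes g and x share the same spatial size."""
    def __init__(self, g_ch, x_ch, inter_ch):
        super().__init__()
        self.w_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)
        self.w_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)
        self.psi = nn.Sequential(
            nn.Conv2d(inter_ch, 1, kernel_size=1),
            nn.Sigmoid())  # alpha in (0, 1) per spatial location
        self.relu = nn.ReLU(inplace=True)

    def forward(self, g, x):
        alpha = self.psi(self.relu(self.w_g(g) + self.w_x(x)))
        return x * alpha  # F_att = alpha ⊙ F

g = torch.randn(1, 16, 32, 32)  # gating signal from the decoder
x = torch.randn(1, 8, 32, 32)   # encoder skip features
gated = AttentionGate(16, 8, 8)(g, x)
```

Because α is a sigmoid output, the gate can only suppress features, never amplify them, which matches its role of muting irrelevant background.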
Data Processing
I used the Medical Segmentation Decathlon dataset—a standardized benchmark for medical segmentation tasks.
Liver Dataset
- 54 contrast-enhanced CT images
- Each image: 70-300 slices
- Training: 43 images | Validation: 11 images
- Excluded images >300 slices to manage compute
Hepatic Vessel Dataset
- 303 labeled CT images
- Training: 172 images | Validation: 44 images
- Slice threshold: <80 slices (larger volumes offered diminishing returns)
The slice threshold was a practical choice—very large volumes significantly increase training time without proportional improvement in segmentation quality.
Training Setup
Configuration
| Parameter | Value |
|---|---|
| Batch Size | 32 |
| Initial Learning Rate | 1×10⁻⁴ |
| Optimizer | Adam |
| LR Scheduler | Reduce by 0.5 if no improvement for 3 epochs |
| Early Stopping | Stop if validation loss doesn't improve for 5 epochs |
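The scheduler and early-stopping rows translate almost directly into PyTorch. The loop below simulates a run on a fixed list of validation losses to show the stopping behavior; the model is a placeholder and the loss values are made up:

```python
import torch

model = torch.nn.Conv2d(1, 2, 3, padding=1)  # stand-in for the UNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate after 3 epochs without improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

def train(val_losses, patience=5):
    """Stop when validation loss fails to improve for `patience` epochs."""
    best, bad = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        scheduler.step(loss)  # plateau scheduler watches the same metric
        if loss < best - 1e-6:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch  # early stop
    return len(val_losses) - 1

# Loss improves for 3 epochs, then plateaus -> stops 5 epochs later.
stopped_at = train([0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63, 0.64, 0.65])
```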
Loss Function: Dice-Cross Entropy
I used a combined loss that balances two objectives:
Dice Loss: Directly optimizes the overlap metric we care about—the Dice coefficient:
Dice = (2 × |P ∩ T|) / (|P| + |T|)
Cross-Entropy: Provides stable gradients and handles class imbalance better during early training.
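A sketch of the combined loss in PyTorch. The 50/50 weighting and the smoothing constant are assumptions for illustration, not the project's actual values:

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, smooth=1.0, ce_weight=0.5):
    """Weighted sum of cross-entropy and soft Dice loss.
    logits: (N, C, H, W) raw scores; target: (N, H, W) class indices."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])
    one_hot = one_hot.permute(0, 3, 1, 2).float()      # (N, C, H, W)
    inter = (probs * one_hot).sum(dim=(0, 2, 3))       # |P ∩ T| per class
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = (2 * inter + smooth) / (union + smooth)     # smoothed Dice
    return ce_weight * ce + (1 - ce_weight) * (1 - dice.mean())

# Sanity check: near-perfect logits should give a near-zero loss.
logits = torch.full((1, 2, 4, 4), -10.0)
logits[:, 0] = 10.0  # strongly predict class 0 everywhere
target = torch.zeros(1, 4, 4, dtype=torch.long)
loss = dice_ce_loss(logits, target)
```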
Model Capacity
- Liver model: 8 base features (memory-constrained due to large volumes)
- Vessel model: 16 base features (smaller volumes allowed more capacity)
Features double at each encoder level, so after four doublings the bottleneck reaches 128 features (liver) and 256 features (vessels), respectively.
Results
After 25 epochs (liver) and 18 epochs (vessels), the models achieved:
| Task | Structure | Dice Score |
|---|---|---|
| Liver Segmentation | Liver | 0.88 |
| Liver Segmentation | Tumor | 0.73 |
| Vessel Segmentation | Hepatic Vessels | 0.73 |
| Vessel Segmentation | Tumor | 0.66 |
What Worked
The liver segmentation at 0.88 Dice is solid—the model captures boundaries accurately for this larger, well-defined structure. Training and validation loss converged smoothly, indicating good generalization.
What Didn't
Tumor segmentation struggled (0.66-0.73 Dice). The main culprit: class imbalance. Tumors occupy a tiny fraction of the total image volume compared to background. The model learns to predict "not tumor" everywhere because that's mostly correct.
Lessons Learned
Class Imbalance is Real
Even with Dice loss (which handles imbalance better than pure cross-entropy), small structures get overwhelmed. Future work should explore:
- Focal loss: Down-weight easy negatives, focus on hard examples
- Weighted Dice: Penalize tumor misclassifications more heavily
- Oversampling: Balance training batches toward tumor-containing slices
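Of these, focal loss is the easiest to sketch. A minimal multi-class version (γ = 2 is the default from the original focal loss paper, not something tuned for this dataset):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """Focal loss: scale per-pixel CE by (1 - p_t)^gamma, so confident,
    easy predictions (high p_t) contribute almost nothing."""
    ce = F.cross_entropy(logits, target, reduction="none")
    pt = torch.exp(-ce)  # probability assigned to the true class
    return ((1 - pt) ** gamma * ce).mean()

# Easy examples (true class strongly favored) are down-weighted vs plain CE.
logits = torch.tensor([[4.0, -4.0], [5.0, -5.0]])
target = torch.tensor([0, 0])
fl = focal_loss(logits, target)
ce = F.cross_entropy(logits, target)
```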
The Cascaded Approach
Training separate models for liver (stage 1) and tumors within liver ROI (stage 2) could improve results by:
1. Reducing the search space for tumor detection
2. Allowing stage-specific hyperparameters
3. More efficient use of compute
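Stage 1's output becomes stage 2's input. A sketch of the hand-off, cropping the CT volume to the predicted liver's bounding box so the tumor model only sees liver context; the helper name and margin are illustrative:

```python
import numpy as np

def roi_from_mask(volume, liver_mask, margin=8):
    """Crop `volume` to the bounding box of the stage-1 liver mask,
    padded by `margin` voxels and clipped to the volume bounds."""
    coords = np.argwhere(liver_mask > 0)
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + margin + 1, volume.shape)
    box = tuple(slice(int(l), int(h)) for l, h in zip(lo, hi))
    return volume[box], box  # keep `box` to map predictions back

# Toy volume with a 10-voxel cubic "liver" and a 2-voxel margin.
vol = np.random.rand(32, 32, 32)
mask = np.zeros((32, 32, 32))
mask[10:20, 10:20, 10:20] = 1
crop, box = roi_from_mask(vol, mask, margin=2)
```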
Attention Helps, But Isn't Magic
Attention gates improved boundary sharpness and helped with ambiguous regions, but couldn't overcome fundamental class imbalance. They're a tool, not a solution.
Try It Yourself
I've deployed an interactive demo at /demos/ct-segmentation where you can:
- Upload NIfTI files
- Visualize 3D segmentation results
- See the model's predictions in real-time
The full code is available on GitHub.
What's Next
Potential improvements I'm exploring:
- nnU-Net: Self-configuring framework that automatically adapts to each dataset
- Transformer-based architectures: UNETR, Swin-UNETR for better long-range dependencies
- Semi-supervised learning: Leverage unlabeled CT scans to improve generalization
Medical imaging is a fascinating application of deep learning—high stakes, limited data, and real clinical impact. The gap between research benchmarks and production-ready systems is significant, but projects like this help bridge it one step at a time.