Comparing SHAP and CRAFT Across Architectures for PEMFC SEM Images
Bronze Award, Undergraduate / High-School Paper Competition, ASK 2026 (Annual Symposium of the Korea Information Processing Society)
Background
AI-driven materials development is increasingly active in renewable-energy research. The proton-exchange membrane fuel cell (PEMFC) uses a platinum (Pt) catalyst to generate electricity from hydrogen and oxygen and is a core module of hydrogen fuel-cell vehicles. Its catalyst layer degrades over long-term operation, which reduces performance, so accurate diagnosis of degradation is essential for lifetime prediction.
While deep classifiers can distinguish degradation states from scanning electron microscopy (SEM) images with high accuracy, their decision processes are opaque to domain experts. Explainable AI (XAI) methods address this, but most prior work studies a single model, leaving open whether explanations transfer across architectures.
Research Questions
- Are SHAP attributions consistent across model architectures?
- Do the concepts extracted by CRAFT vary with model architecture?
Dataset and Classifiers
- 22 pristine (0 cycles) and 50 degraded (200K cycles) catalyst SEM images (50K magnification, 2 kV, SE), 72 images in total.
- 80/20 train/test split with ImageNet-pretrained initialization.
- All eight CNN/Transformer architectures reached 100% accuracy on the test set (5 pristine + 10 degraded images).
- For the XAI comparison we selected three architecturally distinct models: GoogLeNet (Inception), DenseNet121 (dense connections), and MaxViT-T (multi-axis Vision Transformer).
- Random seed fixed at 42.
- Without pretraining, MaxViT-T reached only 80% accuracy, confirming that transfer learning is essential for this small-data regime.
Because all models reach the same accuracy, model selection cannot rely on performance, which makes the dependence of XAI explanations on architecture the central question.
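The reported test-set composition (5 pristine + 10 degraded) is consistent with a per-class 80/20 split. The sketch below shows one way such a split could be produced. It is an assumption, not the authors' pipeline: the file names and the `stratified_split` helper are hypothetical, and rounding the per-class test count up is assumed only because it reproduces the reported numbers.

```python
import random

def stratified_split(items_by_class, test_percent=20, seed=42):
    """Split each class separately so the test set keeps the class ratio.

    Assumption: the per-class test count is rounded up, which reproduces
    the 5 pristine + 10 degraded test images reported in the text.
    """
    rng = random.Random(seed)  # fixed seed, as stated in the text (42)
    train, test = {}, {}
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        # Integer ceiling division avoids floating-point rounding surprises.
        n_test = (len(shuffled) * test_percent + 99) // 100
        test[label] = shuffled[:n_test]
        train[label] = shuffled[n_test:]
    return train, test

# Hypothetical file names standing in for the 22 + 50 SEM images.
dataset = {
    "pristine_0K": [f"pristine_{i:02d}.png" for i in range(22)],
    "degraded_200K": [f"degraded_{i:02d}.png" for i in range(50)],
}
train, test = stratified_split(dataset)
for label in dataset:
    print(label, "train:", len(train[label]), "test:", len(test[label]))
```

With only 15 test images, each misclassification moves accuracy by about 6.7 percentage points, so the 80% figure for the non-pretrained MaxViT-T corresponds to 12 of 15 images correct.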
Expert Reference
Following identical-location SEM (IL-SEM) studies [Shokhen 2022; Strandberg 2024], we used reported degradation indicators as the expert reference: pristine samples show homogeneous flat surfaces; degraded samples exhibit Pt agglomeration, carbon shrinkage, cracks, and dark regions.
Pt-agglomerate quantification confirmed statistically significant morphological change: the per-image agglomerate count increased by 60% (131 ± 15 → 209 ± 24) and the median individual area decreased by 14% (78 → 67 px). With degradation, Pt redistributes into more numerous, smaller agglomerates.
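The two percentages follow directly from the reported values. A short check using only the numbers in the text (the helper name `relative_change` is ours):

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

count_change = relative_change(131, 209)  # per-image agglomerate count
area_change = relative_change(78, 67)     # median individual area (px)

print(f"agglomerate count: {count_change:+.1f}%")  # +59.5%
print(f"median area:       {area_change:+.1f}%")   # -14.1%
```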
SHAP Meta-Analysis
We computed pixel-level Shapley values with a Gaussian-blur masker (σ = 128) and aggregated across 26 settings of seven segmentation algorithms.
- Cross-architecture consensus is sparse: 2.7% at 0K, 0.8% at 200K.
- Inter-model IoU is 0.1–0.2: the important regions differ by architecture even for attribution-based explanations.
- However, at 200K, 34% of the Pt-agglomerate region overlaps with the cross-model consensus, demonstrating that combining attribution with domain knowledge recovers physically meaningful structure.
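The three quantities in this list (pairwise IoU, cross-model consensus, and overlap with the Pt-agglomerate region) can be sketched on toy data as follows. The text does not specify how attributions were binarized into "important" regions, so the sketch assumes each model already has a binary mask, takes consensus to mean pixels marked by every model, and uses a hypothetical Pt-agglomerate mask. All masks here are synthetic.

```python
from itertools import combinations

def iou(a, b):
    """Intersection over union of two sets of pixel coordinates."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def consensus(masks):
    """Pixels marked important by every model (assumed definition)."""
    return set.intersection(*masks.values())

def coverage(region, mask):
    """Fraction of `region` that falls inside `mask`."""
    return len(region & mask) / len(region) if region else 0.0

def block(r0, r1, c0, c1):
    """Rectangular set of (row, col) pixels, end-exclusive."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}

# Synthetic masks on a 10x10 image; not derived from real attributions.
masks = {
    "GoogLeNet": block(0, 4, 0, 4),
    "DenseNet121": block(2, 6, 2, 6),
    "MaxViT-T": block(3, 7, 1, 5),
}
pt_region = block(3, 5, 3, 5)  # hypothetical Pt-agglomerate mask

for (name_a, a), (name_b, b) in combinations(masks.items(), 2):
    print(f"IoU {name_a} vs {name_b}: {iou(a, b):.2f}")

agree = consensus(masks)
print("consensus share of image:", len(agree) / (10 * 10))
print("Pt region inside consensus:", coverage(pt_region, agree))
```

The toy masks illustrate why a small consensus area and a sizeable overlap with a domain-defined region can coexist: the consensus is measured against the whole image, while the overlap is measured against the much smaller region of interest.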
CRAFT Analysis
We ran CRAFT for the three models × four patch sizes (16, 32, 48, 64 px). Given the pixel scale (3.97 nm/px from the scale bar, 14.2 nm/px after resizing to 224×224), the patches correspond to physical receptive fields of ≈ 227 / 454 / 680 / 907 nm.
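The pixel-to-nanometre conversion can be reproduced as below. Only the two pixel scales and the patch sizes come from the text; the function and constant names are ours. Because 14.2 nm/px is itself a rounded value, the computed sizes need not match the quoted ones to the last digit.

```python
NM_PER_PX_ORIGINAL = 3.97  # from the scale bar (stated in the text)
NM_PER_PX_RESIZED = 14.2   # after resizing to 224x224 (stated, rounded)
PATCH_SIZES_PX = (16, 32, 48, 64)

def patch_extent_nm(patch_px, nm_per_px=NM_PER_PX_RESIZED):
    """Physical side length of a square patch, in nanometres."""
    return patch_px * nm_per_px

for p in PATCH_SIZES_PX:
    print(f"{p:2d} px -> {patch_extent_nm(p):.1f} nm")

# Downscaling factor implied by the two stated pixel scales.
resize_factor = NM_PER_PX_RESIZED / NM_PER_PX_ORIGINAL
print(f"implied downscaling factor: {resize_factor:.2f}x")
```

With the rounded scale this gives 227.2, 454.4, 681.6, and 908.8 nm. The two largest differ from the quoted 680 and 907 nm by about 2 nm, which is consistent with an unrounded scale slightly below 14.2 nm/px.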
- At 16 px, GoogLeNet extracted dark fine-structure concepts (mean intensity 37) consistent with carbon-support corrosion.
- MaxViT-T’s dominant concept had an intensity of 118 and DenseNet121’s had 81; all three models attended to darker regions at 200K than at 0K (136–163), a shared trend.
- As the patch size grew, the contrast between 0K and 200K shrank, and at 48–64 px reversals occurred; CRAFT analysis combined with domain knowledge therefore requires patches small enough to capture fine structure.
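One plausible contributor to the shrinking contrast is simple averaging: mean intensity is computed over the whole patch, so a small dark structure weighs less as the patch grows. The toy example below illustrates only this averaging effect on a synthetic image. It is not CRAFT itself (which factorizes network activations) and uses no SEM data; the image layout and intensity values are invented for illustration.

```python
def make_image(size=64, background=150, spot=30, spot_size=16):
    """Synthetic grayscale image: bright background, dark square in one corner."""
    return [
        [spot if r < spot_size and c < spot_size else background
         for c in range(size)]
        for r in range(size)
    ]

def patch_mean(image, top, left, patch):
    """Mean intensity of the square patch with the given top-left corner."""
    total = sum(
        image[r][c]
        for r in range(top, top + patch)
        for c in range(left, left + patch)
    )
    return total / (patch * patch)

image = make_image()
for patch in (16, 32, 48, 64):
    mean = patch_mean(image, 0, 0, patch)
    print(f"{patch:2d} px patch: mean intensity {mean:.1f}")
```

In this toy setting the 16 px patch reports the dark structure's own intensity (30), while the 64 px patch is dominated by the background (142.5), so a dark feature of fixed physical size becomes progressively harder to distinguish from a bright surface.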
Conclusion
Both methods exhibit architecture-dependent explanations, yet they agree on trends consistent with established degradation indicators: SHAP’s cross-model consensus concentrates around Pt agglomerates, and CRAFT picks up brighter surfaces at 0K and darker degradation structures at 200K.
Two takeaways follow. First, equally accurate models can still produce different XAI explanations. Second, interpretation should therefore rely on multi-method, multi-model comparison combined with domain knowledge.
Future work will extend the binary 0K vs. 200K classification to a multi-class setting (50K, 100K, 150K, 200K) to track how a model’s reasoning shifts as degradation progresses.
Acknowledgement
This work was supported by the basic R&D project of the Korea Institute of Energy Research (C6-2402-08).

