Publisher-Adaptive HTML Parsing System for Figure and Caption Extraction from Scientific Papers

Dec 3, 2025
Woonghee Lee, Inseop Kim, Junhyung Lee, Chanwoo Lee
Abstract
A system that adaptively extracts figure images and their captions from academic-paper web pages, whose HTML structure varies by publisher, together with the body sentences that refer to each figure. It is useful for large-scale literature surveys and metadata pipelines that must handle heterogeneous publisher layouts.
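As a rough illustration of the idea in the abstract, the sketch below picks a parsing rule by hostname, collects each figure's image URL and caption, and then attaches body sentences that cite the same figure number. It is not the registered software. The hostnames, the rule table, the class names `fig` and `caption`, and every function name here are assumptions made for the example, and real publisher markup is usually more irregular.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical per-publisher rules (assumed, not taken from any real site):
# (container tag, container class or None, caption tag, caption class or None)
RULES = {
    "example-publisher-a.org": ("figure", None, "figcaption", None),
    "example-publisher-b.com": ("div", "fig", "p", "caption"),
}
DEFAULT_RULE = ("figure", None, "figcaption", None)

FIG_LABEL = re.compile(r"\bFig(?:ure|\.)?\s*(\d+)", re.IGNORECASE)
# Split on sentence-ending punctuation, but not right after "Fig."
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])(?<![Ff]ig\.)\s+(?=[A-Z])")


@dataclass
class Figure:
    image_url: str = ""
    caption: str = ""
    mentions: list = field(default_factory=list)


def _has_class(attrs, cls):
    """True if no class is required or the element carries that class."""
    if cls is None:
        return True
    return cls in (dict(attrs).get("class") or "").split()


class FigureParser(HTMLParser):
    """Collects figure containers and body paragraphs for one rule."""

    def __init__(self, rule):
        super().__init__()
        self.box_tag, self.box_cls, self.cap_tag, self.cap_cls = rule
        self.figures, self.paragraphs = [], []
        self.current = None      # Figure being filled, if inside a container
        self.depth = 0           # nesting depth of box_tag inside the container
        self.in_caption = False
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if self.current is None:
            if tag == self.box_tag and _has_class(attrs, self.box_cls):
                self.current, self.depth = Figure(), 1
            elif tag == "p":
                self.paragraphs.append("")
                self.in_paragraph = True
            return
        if tag == self.box_tag:
            self.depth += 1
        if tag == "img" and not self.current.image_url:
            self.current.image_url = dict(attrs).get("src") or ""
        elif tag == self.cap_tag and _has_class(attrs, self.cap_cls):
            self.in_caption = True

    def handle_endtag(self, tag):
        if self.current is None:
            if tag == "p":
                self.in_paragraph = False
            return
        if tag == self.cap_tag:
            self.in_caption = False
        if tag == self.box_tag:
            self.depth -= 1
            if self.depth == 0:
                self.figures.append(self.current)
                self.current = None

    def handle_data(self, data):
        if self.current is not None:
            if self.in_caption:
                self.current.caption += data
        elif self.in_paragraph:
            self.paragraphs[-1] += data


def extract_figures(html, page_url):
    """Return figures with absolute image URLs, captions, and citing sentences."""
    host = urlparse(page_url).hostname or ""
    parser = FigureParser(RULES.get(host, DEFAULT_RULE))
    parser.feed(html)
    parser.close()
    sentences = [s.strip() for p in parser.paragraphs
                 for s in SENTENCE_SPLIT.split(" ".join(p.split())) if s.strip()]
    for fig in parser.figures:
        fig.caption = " ".join(fig.caption.split())
        fig.image_url = urljoin(page_url, fig.image_url)
        label = FIG_LABEL.match(fig.caption)
        if label:
            number = label.group(1)
            fig.mentions = [s for s in sentences if number in FIG_LABEL.findall(s)]
    return parser.figures
```

Calling `extract_figures(html, page_url)` with the page source and its URL returns a list of `Figure` objects. Keeping the publisher differences in a small rule table means a new layout can be supported by adding one entry, while unknown hosts fall back to the standard `<figure>`/`<figcaption>` markup.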
Type
Publication
Software Copyright Registration · Korea Copyright Commission
Authors
Woonghee Lee
Senior Researcher · Principal Investigator
Woonghee Lee leads the DAIS Research Group at the Korea Institute of Energy Research (KIER). His research applies deep learning to image, signal, and text data — currently focused on AI methods for energy and materials research, including time-series and document analysis, robust deep models, and AI agents for research workflows.