Publisher-Adaptive HTML Parsing System for Figure and Caption Extraction from Scientific Papers
Dec 3, 2025
Woonghee Lee
Inseop Kim
Junhyung Lee
Chanwoo Lee
Abstract
This system adaptively extracts figure images and their captions from academic-paper
web pages, whose HTML structure varies by publisher, along with the body sentences
that refer to each figure. It is useful for large-scale literature surveys and metadata
pipelines that must handle heterogeneous publisher layouts.
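
The abstract does not describe how the registered software is implemented, so the following is only a minimal sketch of the general idea, assuming a per-publisher rule table drives a single parser. The publisher keys, class names (`fig-box`, `fig-legend`), the `RULES` table, and the `extract` function are all hypothetical; only Python's standard `html.parser` and `re` modules are used.

```python
import re
from html.parser import HTMLParser

# Hypothetical rules: (tag, class) for the figure container and its caption.
# A class of None means "any class". These are not real publisher selectors.
RULES = {
    "publisher-a.example": {"figure": ("figure", None), "caption": ("figcaption", None)},
    "publisher-b.example": {"figure": ("div", "fig-box"), "caption": ("p", "fig-legend")},
}
LABEL = re.compile(r"\bFig(?:ure|\.)?\s*(\d+)", re.IGNORECASE)


def _matches(tag, attrs, spec):
    want_tag, want_class = spec
    classes = (dict(attrs).get("class") or "").split()
    return tag == want_tag and (want_class is None or want_class in classes)


class FigureParser(HTMLParser):
    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.figures = []      # one dict per figure container found
        self.paragraphs = []   # text of <p> elements outside figures
        self._fig = None       # figure currently being read, if any
        self._depth = 0        # nesting depth of the container tag
        self._in_caption = False
        self._in_para = False

    def handle_starttag(self, tag, attrs):
        if self._fig is None:
            if _matches(tag, attrs, self.rule["figure"]):
                self._fig, self._depth = {"src": None, "caption": ""}, 1
            elif tag == "p":
                self._in_para = True
                self.paragraphs.append("")
            return
        if tag == self.rule["figure"][0]:
            self._depth += 1   # same tag nested inside the container
        if tag == "img" and self._fig["src"] is None:
            self._fig["src"] = dict(attrs).get("src")
        if _matches(tag, attrs, self.rule["caption"]):
            self._in_caption = True

    def handle_endtag(self, tag):
        if self._fig is None:
            if tag == "p":
                self._in_para = False
            return
        if self._in_caption and tag == self.rule["caption"][0]:
            self._in_caption = False
        if tag == self.rule["figure"][0]:
            self._depth -= 1
            if self._depth == 0:   # container closed: store the figure
                self._fig["caption"] = " ".join(self._fig["caption"].split())
                self.figures.append(self._fig)
                self._fig = None

    def handle_data(self, data):
        if self._in_caption:
            self._fig["caption"] += data
        elif self._in_para and self._fig is None:
            self.paragraphs[-1] += data


def extract(html, publisher):
    """Return figures with src, caption, and body sentences citing the same number."""
    parser = FigureParser(RULES[publisher])
    parser.feed(html)
    parser.close()
    body = " ".join(" ".join(p.split()) for p in parser.paragraphs)
    # Naive split: break after ., ! or ? only when an uppercase letter follows,
    # so "Fig. 1" stays inside one sentence.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", body)
    for fig in parser.figures:
        m = LABEL.search(fig["caption"])
        num = m.group(1) if m else None
        fig["mentions"] = [s for s in sentences if num and num in LABEL.findall(s)]
    return parser.figures


SAMPLE_A = """<article>
<p>The yield rises with temperature, as shown in Fig. 1. Other effects are minor.</p>
<figure><img src="/img/f1.png"><figcaption>Figure 1. Yield versus temperature.</figcaption></figure>
</article>"""
SAMPLE_B = """<div class="content">
<p>Figure 1 compares both cells. See the appendix for details.</p>
<div class="fig-box"><div class="inner"><img src="b1.jpg"/></div>
<p class="fig-legend">Fig. 1: Cell comparison.</p></div>
</div>"""

for name, page in (("publisher-a.example", SAMPLE_A), ("publisher-b.example", SAMPLE_B)):
    print(name, extract(page, name))
```

In this sketch, supporting another layout means adding an entry to `RULES` rather than changing the parser. A production system would also need a more robust sentence splitter and handling for labels such as "Figs. 1 and 2" or "Fig. S1", which the regular expression above ignores.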
Type
Publication
Software Copyright Registration · Korea Copyright Commission

Authors
Senior Researcher · Principal Investigator
Woonghee Lee leads the DAIS Research Group at the Korea Institute of
Energy Research (KIER). His research applies deep learning to image, signal,
and text data, with a current focus on AI methods for energy and materials
research, including time-series and document analysis, robust deep models,
and AI agents for research workflows.