Publisher-Adaptive HTML Parsing System for Figure and Caption Extraction from Scientific Papers
Dec 3, 2025
Woonghee Lee
Inseop Kim
Junhyung Lee
Chanwoo Lee
Abstract
A system that adaptively extracts figure images and their captions from academic-paper
web pages, whose HTML structure varies by publisher, together with the body sentences
that refer to each figure. It is useful for large-scale literature surveys and metadata
pipelines that must handle heterogeneous publisher layouts.
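
The idea described above can be illustrated with a minimal sketch. This is not the registered software: the rule table `RULES`, the host names, the CSS class names, and the function `extract_figures` are all hypothetical, and the standard-library `html.parser` module stands in for whatever parser the system actually uses. The sketch picks a per-publisher rule by host name, collects image URLs and caption text inside each figure container, and then looks for body sentences that cite the same figure number.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-publisher rules: (tag, class) of the figure wrapper and caption.
RULES = {
    "publisher-b.example": {"figure": ("div", "fig-box"), "caption": ("p", "fig-caption")},
}
# Fallback for unknown publishers: the standard <figure>/<figcaption> pair.
DEFAULT_RULE = {"figure": ("figure", None), "caption": ("figcaption", None)}
FIG_LABEL = re.compile(r"\bFig(?:ure|\.)?\s*(\d+)", re.IGNORECASE)


class FigureParser(HTMLParser):
    def __init__(self, rule):
        super().__init__()
        self.fig_tag, self.fig_cls = rule["figure"]
        self.cap_tag, self.cap_cls = rule["caption"]
        self.figures, self.body = [], []
        self.fig_depth = 0  # nesting depth of the figure wrapper tag
        self.cap_depth = 0  # nesting depth of the caption tag

    @staticmethod
    def _matches(tag, attrs, want_tag, want_cls):
        if tag != want_tag:
            return False
        classes = (dict(attrs).get("class") or "").split()
        return want_cls is None or want_cls in classes

    def handle_starttag(self, tag, attrs):
        if self.fig_depth == 0:
            if self._matches(tag, attrs, self.fig_tag, self.fig_cls):
                self.figures.append({"images": [], "caption": ""})
                self.fig_depth = 1
            return
        if tag == self.fig_tag:
            self.fig_depth += 1
        if tag == "img" and dict(attrs).get("src"):
            self.figures[-1]["images"].append(dict(attrs)["src"])
        if self.cap_depth > 0:
            if tag == self.cap_tag:
                self.cap_depth += 1
        elif self._matches(tag, attrs, self.cap_tag, self.cap_cls):
            self.cap_depth = 1

    def handle_endtag(self, tag):
        if self.cap_depth > 0 and tag == self.cap_tag:
            self.cap_depth -= 1
        if self.fig_depth > 0 and tag == self.fig_tag:
            self.fig_depth -= 1
            if self.fig_depth == 0:
                self.cap_depth = 0

    def handle_data(self, data):
        if self.fig_depth == 0:
            self.body.append(data)  # naive: also picks up script/style text
        elif self.cap_depth > 0:
            self.figures[-1]["caption"] += data


def find_mentions(body_text, number):
    # Normalise "Fig." to "Figure" so the abbreviation's dot does not split sentences.
    text = re.sub(r"\bFigs?\.", "Figure", body_text)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pattern = re.compile(rf"\bFigure\s*{number}\b")
    return [s for s in sentences if pattern.search(s)]


def extract_figures(url, html):
    host = urlparse(url).hostname or ""
    parser = FigureParser(RULES.get(host, DEFAULT_RULE))
    parser.feed(html)
    parser.close()
    body_text = " ".join(" ".join(parser.body).split())
    for fig in parser.figures:
        fig["caption"] = " ".join(fig["caption"].split())
        label = FIG_LABEL.search(fig["caption"])
        fig["mentions"] = find_mentions(body_text, label.group(1)) if label else []
    return parser.figures


if __name__ == "__main__":
    sample = ('<p>Fig. 1 shows the pipeline.</p><div class="fig-box">'
              '<img src="/f1.png"><p class="fig-caption">Fig. 1. Pipeline.</p></div>')
    print(extract_figures("https://publisher-b.example/paper/1", sample))
```

Keeping the publisher-specific knowledge in a data table rather than in code means that, under this sketch's assumptions, supporting a new layout only requires adding one entry to `RULES`. Pages from unknown hosts fall back to the standard `<figure>` and `<figcaption>` elements.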
Type
Publication
Software Copyright Registration · Korea Copyright Commission