DocEng 2011: A Versatile Model for Web Page Representation

The 11th ACM Symposium on Document Engineering
Mountain View, California, USA
September 19-22, 2011

A Versatile Model for Web Page Representation, Information Extraction and Content Re-Packaging

Bernhard Krüpl-Sypien, Ruslan Fayzrakhmanov, Wolfgang Holzinger, Mathias Panzenböck, Robert Baumgartner

Presented by Bernhard Krüpl-Sypien.

ABSTRACT

On today's Web, designers put great effort into creating visually rich websites that boast a multitude of interactive elements. In contrast, most web information extraction (WIE) algorithms are still based on attributed tree methods, which struggle to deal with this complexity. In this paper, we introduce a versatile model to represent web documents. The model is based on gestalt theory principles and tries to capture the most important aspects in a formally exact way. It (i) represents and unifies access to visual layout, content and functional aspects; (ii) is implemented with semantic web techniques that can be leveraged for, e.g., automatic reasoning. Considering the visual appearance of a web page, we view it as a collection of gestalt figures, based on gestalt primitives, each representing a specific design pattern, be it navigation menus or news articles. Based on this model, we introduce our WIE methodology, a re-engineering process involving design patterns, statistical distributions and text content properties. The complete framework consists of the UOM model, which formalizes the mentioned components, and the MANM layer, which gives hints on structure and serialization and provides the foundations for document re-packaging. Finally, we discuss how we have applied and evaluated our model in the area of web accessibility.
