Identifying Semantic Blocks in Web Pages Using Gestalt Laws of Grouping

Semantic block identification is an approach to retrieving information from web pages and applications. As website design evolves, however, traditional methodologies no longer perform well. This paper proposes a new model that merges web page content into semantic blocks by simulating human perception. A layer tree is constructed to remove hierarchical inconsistencies between the DOM tree representation and the visual layout of the web page. Subsequently, the Gestalt laws of grouping are interpreted as rules for semantic block detection. To operationalize these laws, the normalized Hausdorff distance, the CIE-Lab color difference, the normalized compression distance, and a series of visual features are used. Finally, a classifier is trained to combine the operationalized laws into a unified rule for identifying semantic blocks in a web page. Experiments are conducted to compare the effectiveness of the model with that of a state-of-the-art algorithm, VIPS. The first experiment shows that the proposed model (GLM) generates more true positives and fewer false negatives than VIPS. The second experiment, on a large-scale test set, yields an average precision of 90.53% and a recall of 90.85%, approximately 25% better than VIPS.
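
To make the three named measures concrete, the following is a minimal Python sketch of their textbook definitions, not the implementation used in this paper. Several choices are assumptions made for illustration only: the function names, the use of zlib as the compressor for the normalized compression distance, the CIE76 form of the CIE-Lab color difference, and normalization of the Hausdorff distance by a caller-supplied reference length such as the page diagonal.

```python
import math
import zlib


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty 2-D point sets."""
    def directed(p, q):
        # Largest distance from any point of p to its nearest point in q.
        return max(min(math.dist(u, v) for v in q) for u in p)
    return max(directed(a, b), directed(b, a))


def normalized_hausdorff(a, b, scale):
    """Hausdorff distance divided by a reference length.

    Assumption: `scale` is something like the page diagonal; the
    normalization actually used by the paper is not given in this text.
    """
    return hausdorff(a, b) / scale


def delta_e_cie76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between (L, a, b) triples."""
    return math.dist(lab1, lab2)


def ncd(x, y):
    """Normalized compression distance between two byte strings.

    Assumption: zlib stands in for the compressor; any real compressor
    only approximates the ideal (Kolmogorov) quantity.
    """
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical inputs: corner points of two element boxes, two Lab colors,
# and two markup fragments.
box_a = [(0, 0), (100, 0), (100, 20), (0, 20)]
box_b = [(0, 30), (100, 30), (100, 50), (0, 50)]
proximity = normalized_hausdorff(box_a, box_b, scale=1000.0)
color_gap = delta_e_cie76((50.0, 0.0, 0.0), (50.0, 3.0, 4.0))
markup_gap = ncd(b"<li><a href='#'>one</a></li>", b"<li><a href='#'>two</a></li>")
```

In each case a smaller value indicates that two elements are closer in position, color, or content, which is the kind of signal a trained classifier could combine when deciding whether to merge them.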