Identifying Semantic Blocks in Web Pages Using Gestalt Laws
of Grouping
Semantic block identification is an approach to retrieve information from web
pages and applications. As website design evolves, however, traditional methodologies no longer
perform well. This paper proposes a new model to merge web page content into
semantic blocks by simulating human perception. A layer tree is constructed to remove hierarchical
inconsistencies between the DOM tree representation and the visual layout of the
web page. Subsequently, the Gestalt laws of grouping are interpreted as the rules for semantic
block detection. During interpretation, the normalized Hausdorff distance, the CIE-Lab
color difference, the normalized compression distance, and the series of visual information
are proposed to operationalize these Gestalt laws. Finally, a classifier is trained to combine
each operationalized law into a unified rule for identifying semantic blocks from the web
page. Experiments are conducted to compare the performance of the model with a state-of-the-art
algorithm, VIPS. The results of the first experiment show that the proposed GLM
model generates more true positives and fewer false negatives than VIPS. The second experiment,
on a large-scale test set, yields an average precision of 90.53% and a recall of
90.85%, approximately 25% better than VIPS.
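
For illustration, the three distance measures named above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation evaluated in the experiments: the function names are hypothetical, the Hausdorff normalizer `norm` (for example, the page diagonal) is an assumed choice, the color difference uses the simple CIE76 formula (Euclidean distance in CIE-Lab space), and `zlib` stands in for whichever compressor is actually used for the normalized compression distance.

```python
import math
import zlib


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def normalized_hausdorff(a, b, norm):
    """Symmetric Hausdorff distance between two point sets, divided by `norm`.

    `norm` is an assumed normalizer (e.g. the page diagonal in pixels).
    """
    h = max(directed_hausdorff(a, b), directed_hausdorff(b, a))
    return h / norm


def delta_e_cie76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between two (L*, a*, b*) triples."""
    return math.dist(lab1, lab2)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, approximated with zlib (assumed compressor)."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Two 10x10 boxes, given by their corner points, 10 px apart horizontally.
box_a = [(0, 0), (10, 0), (0, 10), (10, 10)]
box_b = [(20, 0), (30, 0), (20, 10), (30, 10)]
proximity = normalized_hausdorff(box_a, box_b, norm=100.0)

# Two colors that differ only in the a* and b* channels.
color_diff = delta_e_cie76((50.0, 0.0, 0.0), (50.0, 3.0, 4.0))

# Markup fragments: identical structure versus unrelated content.
item = b"<li><a href='#'>item</a></li>" * 20
para = b"<p>The quick brown fox jumps over the lazy dog.</p>" * 20
same_content = ncd(item, item)
diff_content = ncd(item, para)
```

In each case a smaller value means the two elements are closer, more alike in color, or more alike in content, so such values could serve as input features for the classifier that combines the operationalized laws.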