Detecting Visually Similar Web Pages from Humanistic Perspective


We propose a novel approach to web page similarity detection that takes a human perspective. The approach does not rely on feature- or element-based methods, such as textual, link, or web structure analysis, as the basis of comparison. Instead, we apply Gestalt theory and treat a web page as a single indivisible entity for the purpose of comparison. The concept of supersignals, as a realization of Gestalt principles, provides a theoretical rationale for the conjecture that web pages must be treated as single indivisible entities. Moreover, we use algorithmic complexity theory to objectify, and directly compare, these indivisible supersignals. We illustrate the effectiveness of our approach by applying it to the problem of detecting phishing web pages. Via a large-scale, real-world case study, we demonstrate that 1) phishing web page detection can be successfully achieved via a similarity approach; and 2) our approach is highly effective at detecting similar web pages.
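To make the idea of comparing whole pages through algorithmic complexity more concrete, the following is a minimal sketch and not a statement of the measure used in this work. It assumes the normalized compression distance (NCD), a standard computable approximation from algorithmic complexity theory, with zlib as a stand-in compressor. The function names and the sample byte strings are hypothetical, and a page is represented here simply as a byte string, which is also an assumption.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, used as a rough, computable
    stand-in for the (uncomputable) algorithmic complexity of `data`."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their information;
    values near 1 suggest they share little. Each input is treated as a
    whole, with no feature or element extraction.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical page representations (assumption: raw bytes of each page).
    genuine = b"<html><body><form>Sign in to Example Bank</form></body></html>" * 20
    lookalike = genuine.replace(b"Example Bank", b"Examp1e Bank")
    unrelated = bytes(range(256)) * 5

    print("genuine vs lookalike:", ncd(genuine, lookalike))
    print("genuine vs unrelated:", ncd(genuine, unrelated))
```

Under these assumptions, a candidate page whose distance to a known legitimate page falls below some chosen threshold would be flagged as visually similar; the threshold itself would have to be determined empirically.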