
Detecting Visually Similar Web Pages: Application to Phishing Detection

We propose a novel approach for detecting visual similarity between two web pages. The proposed approach applies Gestalt theory and considers a webpage as a single indivisible entity. The concept of supersignals, as a realization of Gestalt principles, supports our contention that web pages must be treated as indivisible entities. We objectify, and directly compare, these indivisible supersignals using algorithmic complexity theory. We apply our new approach to the domain of anti-Phishing technologies, which at once gives us both a reasonable ground truth for the concept of “visually similar,” and a high-value application of our proposed approach.
Phishing attacks involve sophisticated, fraudulent websites that are realistic enough to fool a significant number of victims into providing their account credentials. There is a constant tug-of-war between anti-Phishing researchers, who create new schemes to detect Phishing scams, and Phishers, who create countermeasures. Our approach to Phishing detection is based on one major signature of a Phishing webpage that those con artists cannot easily change: visual similarity. The only way to defeat this characteristic appears to be to make the Phishing webpage visually dissimilar from its target, which also dramatically reduces the success rate of the scam and thus its criminal profits. For this reason, our application appears to be quite robust against a variety of common countermeasures Phishers have employed. To verify the practicality of our proposed method, we perform a large-scale, real-world case study, based on “live” Phish captured from the Internet.
Compression algorithms (as a practical, operational realization of algorithmic complexity theory) are a critical component of our approach. Out of the vast number of compression techniques in the literature, we must determine which one is best suited to our visual similarity problem. We therefore compare nine compressors (including both 1-dimensional string compressors and 2-dimensional image compressors), and determine that the LZMA algorithm performs best for our problem.
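As an illustration of how a compressor can stand in for algorithmic complexity, the following is a minimal sketch of the normalized compression distance (NCD) computed with LZMA through Python's standard `lzma` module. It is not the thesis implementation: the function names, the `preset` value, and the choice to pass each page as a raw byte string are assumptions made for the example, since the abstract does not say how a page is serialized before compression.

```python
import lzma


def compressed_size(data: bytes) -> int:
    """Length in bytes of the LZMA-compressed form of data.

    preset=9 is an assumption for this sketch, not a setting
    taken from the thesis.
    """
    return len(lzma.compress(data, preset=9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest the
    inputs share most of their structure; values near 1 suggest
    they share little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because real compressors add container overhead and are not perfectly symmetric, the distance of an input to itself is small but generally not exactly zero, so any decision threshold would need to be tuned empirically.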
With this determination made, we test the LZMA-based similarity technique in a realistic anti-Phishing scenario. We construct a whitelist of protected sites, and compare the performance of our similarity technique when presented with a) some of the most popular legitimate sites, and b) live Phishing sites targeting the protected sites. The accuracy of our technique is extremely high in this test: the true positive rate reaches 100% and the false positive rate is 0.8%.
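The evaluation described above can be sketched roughly as follows. This is a hedged illustration only: the abstract does not state the decision rule, so the nearest-neighbour comparison against the whitelist, the `threshold` parameter, and all function names here are hypothetical. The `distance` argument is any page-to-page distance, such as a compression-based one.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Page = TypeVar("Page")


def nearest_distance(page: Page,
                     protected: Sequence[Page],
                     distance: Callable[[Page, Page], float]) -> float:
    """Distance from page to its closest protected (whitelisted) site."""
    return min(distance(page, site) for site in protected)


def flag_as_phish(page: Page,
                  protected: Sequence[Page],
                  distance: Callable[[Page, Page], float],
                  threshold: float) -> bool:
    """Assumed rule: flag a page that looks too much like a protected site.

    The page is assumed NOT to be hosted on a protected domain; that
    check is outside the scope of this sketch.
    """
    return nearest_distance(page, protected, distance) <= threshold


def true_and_false_positive_rates(phish_flags: Sequence[bool],
                                  legit_flags: Sequence[bool]
                                  ) -> Tuple[float, float]:
    """TPR = flagged phish / all phish; FPR = flagged legit / all legit."""
    tpr = sum(phish_flags) / len(phish_flags)
    fpr = sum(legit_flags) / len(legit_flags)
    return tpr, fpr
```

Under this rule, a true positive is a live Phishing page that falls within the threshold of some protected site, and a false positive is a popular legitimate page that does.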
Finally, we undertake a more detailed investigation of the LZMA compression technique. Other authors have argued that compression techniques map objects to an implicit feature space consisting of the dictionary elements generated by the compressor. In testing this possibility on live Phishing data, we find that derived variables computed directly from the dictionary elements are indeed excellent predictors. In fact, by taking advantage of the specific characteristics of dictionary compression algorithms, a modified and refined LZMA algorithm slightly improves the accuracy of our already near-perfect NCD (normalized compression distance) classification application.

Software Engineering and Intelligent Systems

Identifier: oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:AEU.10048/1682
Date: 06 1900
Creators: Teh-Chung, Chen
Contributors: James Miller (Electrical and Computer Engineering), Scott Dick (Electrical and Computer Engineering), Vicky Zhao (Electrical and Computer Engineering), Osmar Zaiane (Computing Science), Jens Weber (Software Engineering, University of Victoria)
Source Sets: Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
Language: English
Detected Language: English
Type: Thesis
Format: 3218303 bytes, application/pdf
Relation: T.C Chen, TOIT (2010), http://portal.acm.org/citation.cfm?id=1754393&picked=prox&cfid=3322350&cftoken=44257891
