Talk:Website fingerprinting
Hm.. interesting idea. Try this:
Which sites to store?
The Alexa.com top 1 million, plus selected DMOZ.org categories such as banks.
Which pages to store?
The main page and the login pages.
How to find the login pages?
Check the link text of each <a> link on the main page against a set of keywords, e.g. login, signin, sign-in. Index each page that matches, up to a limit of 500.
If no matching link text is found, then check each link's destination page for login forms. They can be identified by input tags with the type set to password, and by other keywords. If this doesn't work then try training a Bayesian text classifier.
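A rough sketch of the two checks above, using only the Python standard library. The keyword tuple, the function names and the reading of "limit of 500" as a cap on matched links per site are my assumptions, not something fixed by the idea itself.

```python
from html.parser import HTMLParser

# Assumed keyword set; extend as needed.
LOGIN_KEYWORDS = ("login", "log in", "signin", "sign-in", "sign in")


class LinkTextCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_login_links(html, limit=500):
    """Return hrefs whose link text contains a login keyword, capped at `limit`."""
    parser = LinkTextCollector()
    parser.feed(html)
    matches = [href for href, text in parser.links
               if any(keyword in text.lower() for keyword in LOGIN_KEYWORDS)]
    return matches[:limit]


class PasswordFieldDetector(HTMLParser):
    """Sets `found` when an <input type="password"> is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.found = True


def has_login_form(html):
    """Fallback check for a destination page: does it contain a password input?"""
    detector = PasswordFieldDetector()
    detector.feed(html)
    return detector.found
```

You would run find_login_links on the main page first, and only fetch the destination pages and call has_login_form on them if it comes back empty.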
How to make a fingerprint?
- a hash of the HTML tags, e.g. html,head,title,/title,/head
- a hash of the HTML except for URLs, e.g. remove the text inside <a href="">, style declarations, etc.
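Both fingerprint ideas could look something like this. SHA-256, the regular expressions and the function names are my own picks for illustration; nothing above settles on a particular hash or on exactly which attributes to strip.

```python
import hashlib
import re
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Records the sequence of opening and closing tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_fingerprint(html):
    """Hash of the tag sequence, e.g. html,head,title,/title,/head"""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return hashlib.sha256(",".join(parser.tags).encode("utf-8")).hexdigest()


# Assumed set of URL-carrying attributes; a real list would be longer.
URL_ATTR = re.compile(r'''\b(href|src|action)\s*=\s*("[^"]*"|'[^']*')''', re.IGNORECASE)
STYLE_BLOCK = re.compile(r"<style\b.*?</style>", re.IGNORECASE | re.DOTALL)
STYLE_ATTR = re.compile(r'''\bstyle\s*=\s*("[^"]*"|'[^']*')''', re.IGNORECASE)


def content_fingerprint(html):
    """Hash of the HTML with URL values and style declarations removed."""
    stripped = STYLE_BLOCK.sub("", html)
    stripped = URL_ATTR.sub(r'\1=""', stripped)
    stripped = STYLE_ATTR.sub("", stripped)
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()
```

The tag hash survives any change to text or attributes, so it is the looser of the two. The content hash only tolerates changed URLs and styles, which is roughly what a lazily mirrored phishing copy would differ in.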
How do I know if the fingerprint method is good?
Collect an archive of mirrored phishing websites and test.
How do I implement it?
Collect the fingerprints and identify login pages using WhatWeb (http://www.morningstarsecurity.com/research/whatweb). I wrote WhatWeb, BTW. You'll need to write a couple of custom plugins; maybe I will help you... maybe not.
Make a web browser plugin.
It sends the URL + fingerprint to the server.
Make a server
It receives URLs + page fingerprints and responds with:
- url found, fingerprint matches. all good in da hood
- url found, fingerprint doesn't match. maybe they redesigned their website, maybe it's a MITM attack, better check the actual URL for verification from the central server.
- url not found, fingerprint not found. whatever... don't lose your trust in small businesses
- url not found, fingerprint found. maybe it's a phishing site.
- url not found, fingerprint found for >1 trusted websites. probably a false positive for a CMS default login page.
Meh.. that's about it. I wouldn't bother the user unless you get condition 2 or 4. In those cases, pop something up and let them choose what to do, e.g. redirect to the trusted site with the same fingerprint.
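The five responses boil down to a small lookup. Here is a minimal sketch, assuming the server keeps a plain dict of trusted URL to fingerprint; the names Verdict, index_by_fingerprint, classify and should_alert are made up for illustration, and the enum values follow the order of the conditions listed above.

```python
from enum import Enum


class Verdict(Enum):
    OK = 1               # url found, fingerprint matches
    CHANGED = 2          # url found, fingerprint doesn't match
    UNKNOWN = 3          # url not found, fingerprint not found
    POSSIBLE_PHISH = 4   # url not found, fingerprint found
    SHARED_TEMPLATE = 5  # url not found, fingerprint found for >1 trusted websites


def index_by_fingerprint(by_url):
    """Invert {url: fingerprint} into {fingerprint: set of trusted urls}."""
    index = {}
    for url, fingerprint in by_url.items():
        index.setdefault(fingerprint, set()).add(url)
    return index


def classify(url, fingerprint, by_url, by_fingerprint):
    """Map a (url, fingerprint) report from the browser plugin to one of the five conditions."""
    if url in by_url:
        return Verdict.OK if by_url[url] == fingerprint else Verdict.CHANGED
    owners = by_fingerprint.get(fingerprint, set())
    if not owners:
        return Verdict.UNKNOWN
    if len(owners) > 1:
        return Verdict.SHARED_TEMPLATE
    return Verdict.POSSIBLE_PHISH


def should_alert(verdict):
    """Only conditions 2 and 4 are worth bothering the user about."""
    return verdict in (Verdict.CHANGED, Verdict.POSSIBLE_PHISH)
```

For condition 4 the single entry in by_fingerprint[fingerprint] is the trusted site you could offer to redirect to.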
What's the greatest challenge involved in this?
Who's gonna pay for the bandwidth? It could work for a single corp, an antivirus vendor or Google.
Who wrote this up?
Andrew Horton / urbanadventurer