Web Science/Part2: Emerging Web Properties/Search Engine Ecosystem


Survival of the fittest

Fit for whom?
- Search engine operator, search users, advertisers
- Unfit for spammers
Key performance indicators (multi-criteria optimization problem!)
- Value per click
  - User: usability, relevance of search results, coverage of the Web
  - Operator: advertising revenues, low-cost and scalable technical infrastructure, low personnel costs
  - Advertiser: click-through and conversion rate

part 1

what is a search engine?

why is it important?
what is keyword search?

Search engine history

Archie, 1990
Gopher, 1991
WebCrawler, Lycos, Yahoo search 1994
AltaVista search 1996
Google search 1998
Sequels: Baidu, Yandex, Bing
Alternatives: ask.com, wolframalpha.com
Vertical search: for products: amazon.com; for people: peoplefinder.com; for egosearch (identity theft prevention): garlik.com, ...

Search system architecture

what is a web crawler?
what is a search index (inverted index)?
(for now) blackbox ranking
binary search relevance
interface (auto completion, search results,...)
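The inverted index and binary relevance named above can be illustrated with a minimal sketch. It assumes a toy corpus of three hard-coded documents; the class name `InvertedIndexSketch` and the methods `buildIndex` and `search` are hypothetical names chosen for this illustration. A document counts as relevant only if it contains every query term (a Boolean AND query), and no ranking is applied yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {

	// Build the index: each term maps to the sorted set of ids of the
	// documents that contain it (the posting list).
	static Map<String, TreeSet<Integer>> buildIndex(String[] documents) {
		Map<String, TreeSet<Integer>> index = new TreeMap<>();
		for (int docId = 0; docId < documents.length; docId++) {
			for (String term : documents[docId].toLowerCase().split("\\W+")) {
				if (term.isEmpty()) {
					continue;
				}
				index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
			}
		}
		return index;
	}

	// Binary relevance: intersect the posting lists of all query terms.
	static List<Integer> search(Map<String, TreeSet<Integer>> index, String query) {
		TreeSet<Integer> result = null;
		for (String term : query.toLowerCase().split("\\W+")) {
			if (term.isEmpty()) {
				continue;
			}
			TreeSet<Integer> postings = index.getOrDefault(term, new TreeSet<>());
			if (result == null) {
				result = new TreeSet<>(postings);
			} else {
				result.retainAll(postings);
			}
		}
		return result == null ? new ArrayList<>() : new ArrayList<>(result);
	}

	public static void main(String[] args) {
		// Toy corpus (assumption): in a real system a crawler supplies the pages.
		String[] documents = {
			"Web crawlers download web pages",
			"The index maps terms to pages",
			"Crawlers feed the index" };
		Map<String, TreeSet<Integer>> index = buildIndex(documents);
		System.out.println(search(index, "crawlers index")); // prints [2]
	}
}
```

The lookup touches only the posting lists of the query terms instead of scanning every document, which is the reason search engines invert the document-to-term relation.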

ranking in search I: application of tf-idf

show how tf-idf can be used for ranking.
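One way tf-idf ranking might look in code is sketched here. It assumes one common variant of the weighting (raw term counts for tf and log(N / df) for idf); other variants exist, and the outline does not say which one the course uses. The class `TfIdfSketch`, its methods, and the three sample documents are hypothetical.

```java
class TfIdfSketch {

	static String[] tokenize(String text) {
		return text.toLowerCase().split("\\W+");
	}

	// Term frequency: how often the term occurs in one document.
	static int tf(String term, String document) {
		int count = 0;
		for (String token : tokenize(document)) {
			if (token.equals(term)) {
				count++;
			}
		}
		return count;
	}

	// Inverse document frequency: log(N / df), where df is the number of
	// documents containing the term. Terms that occur nowhere get 0.
	static double idf(String term, String[] documents) {
		int df = 0;
		for (String document : documents) {
			if (tf(term, document) > 0) {
				df++;
			}
		}
		return df == 0 ? 0.0 : Math.log((double) documents.length / df);
	}

	// Score of a document for a query: sum of tf * idf over the query terms.
	static double score(String query, String document, String[] documents) {
		double total = 0.0;
		for (String term : tokenize(query)) {
			total += tf(term, document) * idf(term, documents);
		}
		return total;
	}

	public static void main(String[] args) {
		String[] documents = {
			"search engines rank web pages",
			"web pages link to web pages",
			"advertisers pay search engines per click" };
		for (int i = 0; i < documents.length; i++) {
			System.out.println(i + ": " + score("web pages", documents[i], documents));
		}
	}
}
```

Sorting the documents by descending score yields the ranking; in this sample the second document scores highest for the query "web pages" because it contains both terms twice.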

ranking in search II: random surfer model

explaining random surfer model

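import java.util.Arrays;

public class RandomSurfer {
	public static void main(String[] args) {
		// Column-stochastic transition matrix: transitionMatrix[j][i] is the
		// probability of moving from page i to page j, so each column sums to 1.
		double[][] transitionMatrix = { { 0., 1. / 3., 1., 1. / 3., 0. },
				{ 1. / 2., 0., 0., 0., 0. }, { 0., 1. / 3., 0., 1. / 3., 1. },
				{ 1. / 2., 0., 0., 0., 0. }, { 0., 1. / 3., 0., 1. / 3., 0. } };
		int numberOfNodes = 5;
		int steps = 100;

		// frequency[p] counts how often the surfer is on page p after a move.
		int[] frequency = new int[numberOfNodes];
		int page = 0;
		for (int i = 0; i < steps; i++) {
			// Make one random move.
			double r = Math.random();
			double sum = 0.0;
			// Walk down the column of the current page, accumulating probabilities.
			for (int j = 0; j < numberOfNodes; j++) {
				sum += transitionMatrix[j][page];
				// The first page whose cumulative probability exceeds r is the target.
				if (r < sum) {
					System.out.println("Go from: " + page + " to: " + j);
					page = j;
					break;
				}
			}
			frequency[page]++;
		}
		// Visit counts approximate the long-run importance of each page.
		System.out.println(Arrays.toString(frequency));
	}
}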
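The simulation above estimates how often each page is visited. As a complement, the following sketch computes the long-run visit probabilities directly by repeatedly multiplying the same transition matrix with a probability vector (power iteration). It is a minimal illustration, not the course's reference solution: the class `StationaryDistributionSketch` and its methods are hypothetical names, and no damping (teleportation) factor is used because the snippet above does not use one either.

```java
import java.util.Arrays;

class StationaryDistributionSketch {

	// Same column-stochastic matrix as in the simulation:
	// MATRIX[j][i] is the probability of moving from page i to page j.
	static final double[][] MATRIX = { { 0., 1. / 3., 1., 1. / 3., 0. },
			{ 1. / 2., 0., 0., 0., 0. }, { 0., 1. / 3., 0., 1. / 3., 1. },
			{ 1. / 2., 0., 0., 0., 0. }, { 0., 1. / 3., 0., 1. / 3., 0. } };

	// One step of the chain: next[j] = sum over i of m[j][i] * current[i].
	static double[] step(double[][] m, double[] current) {
		double[] next = new double[current.length];
		for (int j = 0; j < m.length; j++) {
			for (int i = 0; i < current.length; i++) {
				next[j] += m[j][i] * current[i];
			}
		}
		return next;
	}

	// Start from the uniform distribution and apply the matrix repeatedly.
	static double[] stationary(double[][] m, int iterations) {
		double[] rank = new double[m.length];
		Arrays.fill(rank, 1.0 / m.length);
		for (int k = 0; k < iterations; k++) {
			rank = step(m, rank);
		}
		return rank;
	}

	public static void main(String[] args) {
		System.out.println(Arrays.toString(stationary(MATRIX, 100)));
	}
}
```

For this particular graph the vector converges to (1/3, 1/6, 2/9, 1/6, 1/9), which can be checked by hand from the balance equations. The relative visit frequencies of the simulation approach the same values as the number of steps grows.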

comparison tf-idf vs random surfer

Random surfer + tf-idf
showing how to combine the two models.
even more ranking methods can be included
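The outline does not say how the two models are combined. One simple possibility, shown here purely as an assumption, is a weighted sum of a normalized text score (such as tf-idf) and a normalized link score (such as the random surfer probability). The class `CombinedRankingSketch`, the weight `alpha`, and the sample scores are all hypothetical; scores are assumed to be non-negative.

```java
class CombinedRankingSketch {

	// Rescale non-negative scores so that the largest one becomes 1.
	static double[] normalize(double[] values) {
		double max = 0.0;
		for (double v : values) {
			max = Math.max(max, v);
		}
		double[] result = new double[values.length];
		for (int i = 0; i < values.length; i++) {
			result[i] = max == 0.0 ? 0.0 : values[i] / max;
		}
		return result;
	}

	// alpha = 1 ranks by text score only, alpha = 0 by link score only.
	static double[] combine(double[] textScores, double[] linkScores, double alpha) {
		double[] text = normalize(textScores);
		double[] link = normalize(linkScores);
		double[] combined = new double[text.length];
		for (int i = 0; i < text.length; i++) {
			combined[i] = alpha * text[i] + (1 - alpha) * link[i];
		}
		return combined;
	}

	// Index of the highest-scoring document.
	static int best(double[] scores) {
		int bestIndex = 0;
		for (int i = 1; i < scores.length; i++) {
			if (scores[i] > scores[bestIndex]) {
				bestIndex = i;
			}
		}
		return bestIndex;
	}

	public static void main(String[] args) {
		double[] textScores = { 0.8, 1.6, 0.0 }; // hypothetical query-dependent scores
		double[] linkScores = { 0.6, 0.1, 0.3 }; // hypothetical query-independent scores
		System.out.println(best(combine(textScores, linkScores, 0.5)));
	}
}
```

Further signals can be added in the same way by giving each its own weight, which matches the remark that even more methods can be included.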

relevance is a choice: Trust issues with search engines

understand that algorithms are programmed by humans and that it is up to us whether to trust a search engine and which one to choose
manipulations will be hard to detect (magic keyword: Barack Obama)
large search engines are among the most powerful institutions on the web (financially, but also in terms of impact)

SPAM and SEO

understand that search results can be manipulated
metadata (schema.org)


part 2

multi stakeholder system

search engine
end user
web site owner
advertiser
(web master (SEO))

economics of a search engine

understand the concept of keyword based advertising
understand the auction system of keywords
understand the model of the shared economy and man-in-the-middle business models
taken from b:Strategy_for_Information_Markets/Search_engine_business_models and w:Vickrey_auction
- http://onlinelibrary.wiley.com/doi/10.1111/j.1540-6261.1961.tb02789.x/pdf
w:Generalized_second-price_auction
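The keyword auction can be illustrated with a simplified generalized second-price auction: advertisers are ordered by bid, the highest bidder gets the top ad slot, and each winner pays per click the bid of the advertiser ranked directly below. The sketch assumes bids are the only criterion; it ignores ad quality scores and reserve prices. The class `GspAuctionSketch`, the method `pricesPerClick`, and the bid values are hypothetical.

```java
import java.util.Arrays;

class GspAuctionSketch {

	// Returns the price per click for each filled slot, best slot first.
	// The winner of slot k pays the (k + 1)-th highest bid, or 0 if no
	// lower bid exists.
	static double[] pricesPerClick(double[] bids, int slots) {
		double[] sorted = bids.clone();
		Arrays.sort(sorted); // ascending order
		int n = sorted.length;
		int filled = Math.min(slots, n);
		double[] prices = new double[filled];
		for (int slot = 0; slot < filled; slot++) {
			int next = n - 2 - slot; // position of the next-highest bid
			prices[slot] = next >= 0 ? sorted[next] : 0.0;
		}
		return prices;
	}

	public static void main(String[] args) {
		double[] bids = { 2.0, 4.0, 1.0, 3.0 }; // hypothetical bids per click
		System.out.println(Arrays.toString(pricesPerClick(bids, 2))); // prints [3.0, 2.0]
	}
}
```

With a single slot this reduces to the second-price (Vickrey) auction: the highest bidder wins and pays the second-highest bid.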

personalization of search results

key methods of personalization (using a cookie)
graph view of user interests
collaborative filtering
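Collaborative filtering can be sketched as follows: a user's unknown interest in a topic is estimated from the interests of similar users. This example assumes a user-based variant with cosine similarity and a similarity-weighted average; the interest matrix, the class `CollaborativeFilteringSketch`, and its methods are hypothetical, and all values are assumed to be non-negative.

```java
class CollaborativeFilteringSketch {

	// Cosine similarity between two interest vectors (0 if either is all zero).
	static double cosine(double[] a, double[] b) {
		double dot = 0.0, normA = 0.0, normB = 0.0;
		for (int i = 0; i < a.length; i++) {
			dot += a[i] * b[i];
			normA += a[i] * a[i];
			normB += b[i] * b[i];
		}
		return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
	}

	// Estimate a user's interest in an item as the average of the other
	// users' values, weighted by how similar those users are to this user.
	static double predict(double[][] interests, int user, int item) {
		double weighted = 0.0;
		double totalSimilarity = 0.0;
		for (int other = 0; other < interests.length; other++) {
			if (other == user) {
				continue;
			}
			double similarity = cosine(interests[user], interests[other]);
			weighted += similarity * interests[other][item];
			totalSimilarity += similarity;
		}
		return totalSimilarity == 0.0 ? 0.0 : weighted / totalSimilarity;
	}

	public static void main(String[] args) {
		// Rows: users; columns: interest in three topics (0 = no data yet).
		double[][] interests = {
			{ 5, 1, 0 },
			{ 4, 0, 5 },
			{ 0, 5, 1 } };
		System.out.println(predict(interests, 0, 2));
	}
}
```

In the sample data the first user resembles the second user much more than the third, so the estimate for the third topic lands close to the second user's value. A search engine could use such estimates to reorder results, which is also how filter bubbles can arise.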

filter bubble effects

Technologies for your own search engine

Hadoop
Solr
Nutch
Elasticsearch

The most successful search engines owe their success to winning two competitions: one for search users and one for advertising customers. Both competitions will be explained in the next two weeks.

Advertising

Stakeholders

advertiser
customer
content owner/portal
advertising network

Intermediaries:

markets (eBay)
advertising networks (DoubleClick, ...)

Moving the advertisement service out of the portal and into an ad network:

customer: more exact profile, better ad targeting
content owner/portal: better targeted ads lead to higher revenue
advertiser: higher click-through rate/conversion rate
ad network: valuable business model

Technology
Business model
Pricing, auctions
real-time bidding