This learning resource is created as a Wiki2Reveal course about an Open Search infrastructure in which

  • the generation of the underlying web index,
  • the ranking algorithms of the search results,
  • the Application Programming Interface (API),
  • ...

is based on

  • Open Source,
  • Open Data of the web index,
  • Open Content (e.g. Creative Commons, ...)

Design of the Learning Resource

The learning resource is built upon the Open Community Approach, so that students can explore and learn about the principles of search engines and experiment with an indexer and with building applications on top of their own web index.

This is designed as a learning resource about the generic principles of search engines, the client-side skills needed to get appropriate search results for your needs, and the computer science requirements and constraints of the backend that runs a search engine.

Please keep in mind that the experiments and the digital learning environment are designed in such a way that the activities can be performed on an average computer or even a mobile device, so that learning does not depend on a high-performance cluster or good bandwidth. The basic principles of search engines can be addressed with small databases and a few documents to be scanned and processed.


Learning Tasks

The following learning tasks are designed for a basic introduction into search engines with minor programming activities.

  • (Requirements of an Index) Assume you want to look for a specific keyword in your LibreOffice document. You go to "Find/Replace" in the editor, enter the keyword, and after pressing the Enter key the search function in LibreOffice provides the search results (i.e. the locations in the document where the search algorithm finds matches of the keyword in the text loaded in your editor). Explain why this approach is not feasible for searching the web for specific keywords.
    • Theoretically you could start with a given set of start pages,
    • load those pages,
    • search for the keywords in those pages,
    • add the results to an array of results, and
    • then follow the links in those start pages and load the linked documents too,
    • search for the keywords in those linked documents, ...
    • this approach can be repeated until a specific stop condition is fulfilled.
The approach can be implemented, but what are its limitations and drawbacks?
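The load, search, and follow-links steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web (a real crawler would fetch every page over the network), and the names `naive_search` and `max_pages` are made up for this example.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would download each page from a remote server instead.
PAGES = {
    "a.example": ("open search index", ["b.example", "c.example"]),
    "b.example": ("mobile device search", ["c.example"]),
    "c.example": ("ranking of results", ["a.example"]),
}

def naive_search(start_pages, keyword, max_pages=100):
    """Load pages breadth-first, search each one, and follow its links."""
    results = []
    visited = set()
    queue = deque(start_pages)
    # Stop condition: no pages left, or the page budget is used up.
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited or url not in PAGES:
            continue
        visited.add(url)
        text, links = PAGES[url]       # "load" the page
        if keyword in text.split():    # search for the keyword
            results.append(url)        # add the match to the results
        queue.extend(links)            # follow the links of this page
    return results, len(visited)

hits, loaded = naive_search(["a.example"], "search")
print(hits, loaded)
```

Note that every single query has to load every reachable page again, which is the main drawback the task asks you to think about.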
  • (Server Load) Assume the workflow above were performed for every search query submitted to a search engine. Analyze and estimate the workload per minute for a server that has to answer all remote fetches of page content.
  • (Local Search Engine) Copy a few text documents of your choice (e.g. 10) into a folder and create a first database that maps keywords to document file names.
Search Index
Keyword        Filename
example        doc1.txt
example        doc4.txt
extra          doc4.txt
device         doc3.txt
device         doc1.txt
mobile         doc3.txt
moblie         doc2.txt
mobile         doc1.txt
mobiledevice   doc4.txt
...            ...
Perform the following tasks for creating the web index.
  • Remove all special characters with a regular expression, e.g. s/[^A-Za-z0-9\-]/ /g,
  • split the documents into words with a regular expression,
  • add the words to the database table together with the filename in which the algorithm (indexer) found the keyword.
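The three indexing steps above could look roughly like the following sketch. It assumes the documents are already held in memory as strings (reading them from a folder is left out), and the function names `tokenize` and `build_index` are hypothetical. Converting words to lowercase is an extra assumption that is not part of the task description.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Replace special characters by blanks, then split into lowercase words."""
    cleaned = re.sub(r"[^A-Za-z0-9\-]", " ", text)
    return cleaned.lower().split()

def build_index(documents):
    """Map each keyword to the set of file names that contain it."""
    index = defaultdict(set)
    for filename, text in documents.items():
        for word in tokenize(text):
            index[word].add(filename)
    return index

# Hypothetical document contents, used instead of real files in a folder.
documents = {
    "doc1.txt": "An example: a mobile device.",
    "doc2.txt": "A moblie phone!",
}
index = build_index(documents)
print(sorted(index["mobile"]))
```

A set per keyword avoids duplicate rows when a word occurs several times in the same document.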
Using the example index in the table above, perform a search with the simple index for the following keywords:
  • "mobile"
  • "device"
  • "mobile device"
What are the challenges for creating such an index?
Would you index the words "the", "of", "an", "a", ... in an English document?
What is the benefit of an index in comparison to the previous theoretical approach of a web search?
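As a possible starting point for the three example queries, the sketch below looks keywords up in the example index from the table. The function name `search` is hypothetical, and treating a query with several words as an AND search (every keyword must occur in the file) is an assumption; other interpretations are equally valid.

```python
# Index taken from the example table (including the misspelled "moblie").
INDEX = {
    "example": {"doc1.txt", "doc4.txt"},
    "extra": {"doc4.txt"},
    "device": {"doc1.txt", "doc3.txt"},
    "mobile": {"doc1.txt", "doc3.txt"},
    "moblie": {"doc2.txt"},
    "mobiledevice": {"doc4.txt"},
}

def search(index, query):
    """Return the files that contain every keyword of the query (AND search)."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    result = set(index.get(keywords[0], set()))
    for word in keywords[1:]:
        result &= index.get(word, set())
    return result

for query in ["mobile", "device", "mobile device"]:
    print(query, "->", sorted(search(INDEX, query)))
```

Each lookup only touches the index, not the documents themselves. With exact matching, doc2.txt ("moblie") and doc4.txt ("mobiledevice") are never returned for these queries.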
  • (Abstract/Description for Results) Use the search engine in Wikipedia for a topic you are interested in. How are the search results presented to you?
  • (Typos in Documents and Query String) Documents are not necessarily free of typos, and the keywords typed by the user may also contain swapped characters (see "moblie" instead of "mobile" above). Apply fuzzy logic or a mathematical metric $d$ on the domain of words that provides a measure of proximity (e.g. $d(\text{mobile},\text{mobile})=0.0$, $d(\text{mobile},\text{moblie})=0.15$, and $d(w_1,w_2)$ is large for clearly different words $w_1$ and $w_2$). How could you define the metric $d$ that provides a measure of distance between words? The metric satisfies $d(w_1,w_2)=0$ if and only if $w_1=w_2$, i.e. the words $w_1$ and $w_2$ are equal. How would you add pages to the results that contain very similar keywords (e.g. $d(w_1,w_2)\le\varepsilon$; if $\varepsilon=0$, only exact matches with the keywords are added to the search results)? How would you deal with the word "mobiledevice", which has a missing blank as a typo? Would you consider matches of keywords with substrings?
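One possible choice for such a distance is the Levenshtein edit distance (the number of single-character insertions, deletions, and substitutions), which is a metric on words. The sketch below divides it by the length of the longer word to obtain a value between 0 and 1; this scaling, the function names, and the threshold parameter `epsilon` are assumptions made for illustration. The resulting values differ from the illustrative 0.15 used in the task, which comes from an unspecified metric.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def distance(a, b):
    """Edit distance scaled to [0, 1] by the length of the longer word."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def fuzzy_lookup(index, keyword, epsilon):
    """Collect files of all indexed words within distance epsilon of keyword."""
    files = set()
    for word, filenames in index.items():
        if distance(keyword, word) <= epsilon:
            files |= filenames
    return files

# Small excerpt of the example index from the table.
sample_index = {
    "mobile": {"doc1.txt", "doc3.txt"},
    "moblie": {"doc2.txt"},
    "mobiledevice": {"doc4.txt"},
}
print(sorted(fuzzy_lookup(sample_index, "mobile", 0.0)))
print(sorted(fuzzy_lookup(sample_index, "mobile", 0.34)))
```

The swapped characters in "moblie" count as two substitutions here. A variant that treats a swap of neighbouring characters as a single operation would rank such typos closer, and a separate substring check (for example `"mobile" in "mobiledevice"`) is one way to approach the missing blank.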
  • (Programming Language) In which language would you like to implement the indexer (NodeJS, Python, Perl, ...)? Compare the programming languages for implementing this task.
  • (Number of Web Pages) You have now worked through a basic introduction to an index that supports search. Next, we look at the number of web pages available on the web. Try to find a quantification of the web sites that are freely accessible and can be indexed by a web crawler. Make a rough estimate of how long it would take to index that number of web pages with your own computer. Please take into account how long it takes to download a specific page from a web server and store it on your machine, and the time needed to index those pages.
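The rough estimate in the last task is a simple multiplication, sketched below. All numbers in the example call are placeholders and not measurements or researched figures; replace them with the page count you found and the timings you measured on your own machine. The function name `indexing_time_years` is hypothetical, and the sketch assumes pages are processed strictly one after another.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def indexing_time_years(num_pages, download_seconds, index_seconds):
    """Time to download and index all pages one after another, in years."""
    total_seconds = num_pages * (download_seconds + index_seconds)
    return total_seconds / SECONDS_PER_YEAR

# Placeholder values only; substitute your own research and measurements.
years = indexing_time_years(num_pages=1_000_000_000,
                            download_seconds=0.5,
                            index_seconds=0.1)
print(f"{years:.1f} years")
```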

See also
