Modelling Similarity of Text

Learning goals

Associated units

  1. Know the properties of a similarity measure
  2. Be able to relate similarity and distance measures
  3. Know of two applications for modelling similarity
  • Understand how text documents can be modeled as sets
  • Know the Jaccard coefficient as a similarity measure on sets
  • Know a trick for remembering the formula
  • Be aware of the possible outcomes of the Jaccard index
  • As always, be able to criticize your model
  • Be familiar with the vector space model for text documents
  • Be aware of term frequency and (inverse) document frequency
  • Have reviewed the definitions of basis and dimension
  • Realize that the angle between two vectors can be seen as a similarity measure
  • Be aware of a unigram Language Model
  • Know Laplace (aka add-one) smoothing
  • Know the query likelihood model
  • Know the Kullback-Leibler divergence
  • See how a similarity measure can be derived from the Kullback-Leibler divergence
  • Understand that different modeling choices can produce very different results.
  • Have a feeling for how you could statistically compare the differences between the models.
  • Know how you could extract keywords from documents with the tf-idf approach.
  • Try to argue which model you like best in a certain scenario.
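
The goals above note that different modelling choices can produce very different results. As a minimal sketch of that point, the following Python snippet compares the Jaccard coefficient on word sets with the cosine of the angle between raw term-frequency vectors. The function names, the whitespace tokenizer, the example documents, and the handling of empty documents are assumptions made for illustration; they are not taken from the course material.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace splitting; other tokenizers are possible.
    return text.lower().split()


def jaccard(doc_a, doc_b):
    # Set model: |A intersect B| / |A union B|, a value between 0 and 1.
    a, b = set(tokenize(doc_a)), set(tokenize(doc_b))
    if not a and not b:
        return 1.0  # convention chosen here for two empty documents
    return len(a & b) / len(a | b)


def cosine_tf(doc_a, doc_b):
    # Vector space model with raw term frequencies (no idf weighting here).
    a, b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here when a document is empty
    return dot / (norm_a * norm_b)


d1 = "the web is a graph"
d2 = "the web the web the web"
print(jaccard(d1, d2))    # set model ignores how often a word repeats
print(cosine_tf(d1, d2))  # vector model is affected by repetition
```

For these two hypothetical documents the set model sees 2 shared words out of 5 distinct words, while the term-frequency vectors give a different score because repeated words change the direction of the vector.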