How do I compute document similarity using Python?

This presentation pairs a video with Python code. It was written by Jonathan Mugan. Dr. Mugan specializes in artificial intelligence and machine learning.

How do I find documents similar to a particular document?

We will use a library in Python called gensim.

Let’s create some documents. 

We will use NLTK to tokenize.

A document will now be a list of tokens.
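
The two steps above can be sketched as follows. The example sentences are hypothetical, since the presentation's own documents are not reproduced here, and the regular-expression `tokenize` helper is a standard-library stand-in for NLTK's tokenizer, so it runs without downloading NLTK data. Treat it as an illustration of the idea rather than the presentation's exact code.

```python
import re

# Hypothetical example documents (not the ones used in the presentation).
raw_documents = [
    "I like cats and dogs.",
    "Dogs chase cats.",
    "Python is a programming language.",
]


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens.

    A simple stand-in for an NLTK tokenizer, assumed here for illustration.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


# Each document is now a list of tokens instead of a single string.
tokenized_documents = [tokenize(text) for text in raw_documents]

print(tokenized_documents[0])
```

A real tokenizer such as NLTK's handles contractions and abbreviations more carefully than this regular expression does.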

We will create a dictionary from a list of documents.

A dictionary maps every word to a number.
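
To make the word-to-number mapping concrete, here is a minimal plain-Python sketch. The `build_dictionary` helper and the sample documents are hypothetical. In gensim this role is played by `gensim.corpora.Dictionary`, which exposes the mapping as `token2id` and may assign different ids than this sketch does.

```python
def build_dictionary(tokenized_documents):
    """Assign each distinct token the next unused integer id."""
    token_to_id = {}
    for tokens in tokenized_documents:
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


# Hypothetical tokenized documents.
docs = [["dogs", "chase", "cats"], ["cats", "sleep"]]
token_to_id = build_dictionary(docs)

print(token_to_id)
```

Ids here follow first-appearance order, so "cats" gets a single id even though it occurs in both documents.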

What you will find in the full presentation:

  • Create corpus
  • Create tf-idf model
  • Similarity measure object
  • Convert query document
  • Similar documents
  • Exercises
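
The outlined steps (corpus, tf-idf model, similarity measure, query conversion, similar documents) can be sketched end to end in plain Python. This is a library-free illustration under stated assumptions, not the presentation's gensim code: the helper names and sample documents are hypothetical, and the weighting used (term count times log2 of the inverse document frequency, followed by length normalization) is one common tf-idf variant. In gensim the corresponding pieces are `Dictionary.doc2bow`, `gensim.models.TfidfModel`, and the classes in `gensim.similarities`.

```python
import math
from collections import Counter

# Hypothetical tokenized documents and query.
docs = [
    ["dogs", "chase", "cats"],
    ["cats", "sleep", "all", "day"],
    ["python", "is", "a", "language"],
]
query = ["dogs", "and", "cats"]

# Dictionary: map every distinct token to an integer id.
token_to_id = {}
for tokens in docs:
    for token in tokens:
        token_to_id.setdefault(token, len(token_to_id))


def to_bow(tokens):
    """Bag of words as {token_id: count}; tokens not in the dictionary are dropped."""
    return Counter(token_to_id[t] for t in tokens if t in token_to_id)


# Corpus: one bag of words per document.
corpus = [to_bow(tokens) for tokens in docs]

# Document frequency: number of documents containing each token id.
doc_freq = Counter(token_id for bow in corpus for token_id in bow)


def tfidf(bow):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weights = {i: c * math.log2(len(corpus) / doc_freq[i]) for i, c in bow.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {i: w / norm for i, w in weights.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * b.get(i, 0.0) for i, w in a.items())


# Convert the query the same way as the documents, then compare.
doc_vectors = [tfidf(bow) for bow in corpus]
query_vector = tfidf(to_bow(query))
sims = [cosine(query_vector, vec) for vec in doc_vectors]

# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)
print(ranking, sims)
```

The query word "and" does not appear in the dictionary, so it is ignored, and the third document shares no tokens with the query, so its similarity is zero.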

