Term document matrix and cosine similarity in Python
I have the following situation that I want to address using Python (preferably
using numpy and scipy):
I have a collection of documents that I want to convert to a sparse
term-document matrix.
I then want to extract the sparse vector representation of each document (i.e. a
row in the matrix) and find the top 10 most similar documents using cosine
similarity within a certain subset of documents (i.e. the documents are labelled
with categories, and I want to find similar documents within the same category).
How do I achieve this in Python? I know I can use scipy.sparse.coo_matrix
to represent documents as sparse vectors and take the dot product of
length-normalized vectors to find cosine similarity, but how do I convert the
entire corpus to a large but sparse term-document matrix (so that I can also
extract its rows as scipy.sparse.coo_matrix row vectors)?
Thanks.
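One possible approach, sketched below using only numpy and scipy: build the matrix in COO form from (row, column, count) triples, convert it to CSR so rows can be sliced, then L2-normalize the rows so that a dot product equals cosine similarity. The function names build_term_document_matrix and top_k_similar, the whitespace tokenizer, and the use of raw term counts (rather than, say, TF-IDF weights) are all assumptions made for illustration, not part of the question.

```python
from collections import Counter

import numpy as np
from scipy import sparse


def build_term_document_matrix(docs):
    """Return (CSR matrix of raw term counts, vocab dict term -> column).

    Rows are documents, columns are terms. Hypothetical helper.
    """
    vocab = {}
    rows, cols, vals = [], [], []
    for doc_id, text in enumerate(docs):
        # Naive lowercase/whitespace tokenizer (an assumption).
        counts = Counter(text.lower().split())
        for token, n in counts.items():
            # Assign the next free column index to unseen terms.
            col = vocab.setdefault(token, len(vocab))
            rows.append(doc_id)
            cols.append(col)
            vals.append(n)
    coo = sparse.coo_matrix(
        (vals, (rows, cols)),
        shape=(len(docs), len(vocab)),
        dtype=np.float64,
    )
    # CSR supports efficient row slicing; COO does not support indexing.
    return coo.tocsr(), vocab


def top_k_similar(matrix, labels, doc_id, k=10):
    """Return up to k (doc index, cosine similarity) pairs, best first,
    restricted to documents sharing doc_id's label. Hypothetical helper.
    """
    labels = np.asarray(labels)
    # Row-wise L2 norms; empty documents get norm 1 to avoid dividing by 0.
    norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0
    normalized = sparse.diags(1.0 / norms) @ matrix
    # Candidates: same category, excluding the query document itself.
    candidates = np.flatnonzero(labels == labels[doc_id])
    candidates = candidates[candidates != doc_id]
    # Dot products of unit-length rows are cosine similarities.
    sims = (normalized[candidates] @ normalized[doc_id].T).toarray().ravel()
    order = np.argsort(-sims, kind="stable")[:k]
    return [(int(candidates[i]), float(sims[i])) for i in order]


docs = [
    "apple banana apple",
    "apple banana",
    "banana cherry",
    "apple apple apple",
    "dog cat",
]
labels = ["fruit", "fruit", "fruit", "fruit", "pets"]

matrix, vocab = build_term_document_matrix(docs)
print(top_k_similar(matrix, labels, doc_id=0, k=10))
```

A single document can be pulled out as a COO row vector with matrix[i].tocoo(). For a large corpus it would likely be worth normalizing the matrix once and reusing it across queries instead of recomputing the norms inside every call, as this sketch does.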