How to perform clustering on Word2Vec



I have a semi-structured dataset in which each row pertains to a single user:


id, skills
0,"java, python, sql"
1,"java, python, spark, html"
2,"business management, communication"



It is semi-structured because the skills can only be selected from a list of 580 unique values.



My goal is to cluster users, or to find similar users, based on their skill sets. I have tried a Word2Vec model, which gives very good results for identifying similar skills. For example,


model.most_similar(["Data Science"])



gives me:


[('Data Mining', 0.9249375462532043),
('Data Visualization', 0.9111810922622681),
('Big Data', 0.8253220319747925),...



This model works very well for individual skills, but not for groups of skills. How can I use the vectors from the Word2Vec model to cluster users who have similar groups of skills?





kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne
– grshankar
Aug 28 at 7:42




1 Answer



You need to vectorize your strings using your Word2Vec model.
You can do it like this:


from gensim.models import KeyedVectors

model = KeyedVectors.load("path/to/your/model")
w2v_vectors = model.wv.vectors  # the vectors for every word in your model
# word -> row index in w2v_vectors (gensim 3.x API; gensim 4 replaced .vocab with .key_to_index)
w2v_indices = {word: model.wv.vocab[word].index for word in model.wv.vocab}



Then you can use them like this:


import numpy as np

def vectorize(line):
    """Return one vector for an iterable of tokens, or None if no token is in the vocabulary."""
    words = []
    for word in line:  # line is an iterable, for example a list of tokens
        try:
            w2v_idx = w2v_indices[word]
        except KeyError:  # no vector for this word in the w2v model, so skip it
            continue
        words.append(w2v_vectors[w2v_idx])
    if not words:
        return None
    words = np.asarray(words)
    min_vec = words.min(axis=0)  # element-wise minimum over the word vectors
    max_vec = words.max(axis=0)  # element-wise maximum over the word vectors
    return np.concatenate((min_vec, max_vec))



This gives you a vector that represents your line (document, etc.).
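
To go from rows like the ones in the question to a matrix you can cluster, each skills string first has to be split into tokens. Here is a minimal sketch of that step, assuming the data is a CSV shaped like the sample in the question. `toy_vectors` is a hypothetical stand-in for the Word2Vec lookup (its values are made up), and `min_max_vector` and `build_matrix` are illustrative helpers, not part of gensim or sklearn:

```python
import csv
import io

import numpy as np

# Hypothetical 2-dimensional vectors standing in for a trained Word2Vec model.
toy_vectors = {
    "java": np.array([1.0, 0.0]),
    "python": np.array([0.8, 0.2]),
    "sql": np.array([0.6, 0.4]),
    "communication": np.array([0.0, 1.0]),
}

raw = '''id,skills
0,"java, python, sql"
1,"java, python, spark, html"
2,"business management, communication"
'''


def min_max_vector(tokens, vectors):
    """Concatenate the element-wise min and max of the known token vectors."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None  # no token of this row is in the vocabulary
    stacked = np.asarray(found)
    return np.concatenate((stacked.min(axis=0), stacked.max(axis=0)))


def build_matrix(csv_text, vectors):
    """Return (ids, X): user ids and one row vector per user that could be vectorized."""
    ids, rows = [], []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Each skill is kept as a single token, so multi-word skills stay intact.
        tokens = [s.strip() for s in record["skills"].split(",")]
        vec = min_max_vector(tokens, vectors)
        if vec is not None:  # skip users with no known skill
            ids.append(int(record["id"]))
            rows.append(vec)
    return ids, np.vstack(rows)


ids, X = build_matrix(raw, toy_vectors)
print(ids, X.shape)
```

Keeping `ids` alongside `X` matters because rows without any known skill are dropped, so row positions in `X` no longer match the original row numbers.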



Once you have the vectors for all of your lines, you need to cluster them. You can use DBSCAN from sklearn for this.


from sklearn.cluster import DBSCAN

# these parameter values are only an example; tune them for your data
dbscan = DBSCAN(metric='cosine', eps=0.07, min_samples=3)
# X is a matrix in which each row is the vector of one document (line) you want to cluster
cluster_labels = dbscan.fit_predict(X)
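
DBSCAN returns one integer label per row of X, and it uses -1 for points it treats as noise. Below is a small sketch of turning those labels back into groups of users. The `user_ids` and label values are made up for illustration, and `group_by_cluster` is a hypothetical helper rather than anything sklearn provides:

```python
from collections import defaultdict


def group_by_cluster(user_ids, labels):
    """Map each cluster label to its user ids; users labelled -1 (noise) are returned separately."""
    clusters = defaultdict(list)
    noise = []
    for user_id, label in zip(user_ids, labels):
        if label == -1:
            noise.append(user_id)
        else:
            clusters[int(label)].append(user_id)
    return dict(clusters), noise


# Made-up labels, shaped like the output of dbscan.fit_predict(X).
user_ids = [0, 1, 2, 3, 4]
labels = [0, 0, -1, 1, 1]
clusters, noise = group_by_cluster(user_ids, labels)
print(clusters)
print(noise)
```

If most users end up in the noise list, that usually suggests eps is too small or min_samples too large for the data.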



Good luck!





