Towards Generating Word Embedding — Machine Learning
Definition: A word embedding is a dense representation of a word as a vector of real numbers.
There are two main approaches for learning word embeddings, both relying on contextual knowledge.
- Count-based: An unsupervised approach based on matrix factorization of a global word co-occurrence matrix. Raw co-occurrence counts do not work well on their own, so the counts are typically reweighted and factorized into low-dimensional dense vectors.
- Context-based: A supervised approach. Given a local context, we design a model to predict a target word from its neighbors; in the course of training, the model learns efficient dense word vector representations.
Count-Based Vector Space Model
Count-based vector space models rely heavily on word frequencies and a global co-occurrence matrix, with the assumption that words appearing in the same contexts share similar or related semantic meanings.
This can introduce bias when the underlying dataset is skewed.
For example, suppose the corpus is drawn from a population that does not believe women are suited to professional or corporate life: the resulting embeddings will reflect that bias, because the co-occurrence matrix will accumulate all of the negative words that occur alongside words referring to women.
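As a minimal sketch of how such a co-occurrence matrix can be built (the toy sentences and the window size below are made up purely for illustration):

from collections import defaultdict

def co_occurrence(sentences, window=2):
    """Count how often each pair of words appears within `window` positions of each other."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for i, target in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if i != j:
                    counts[target][words[j]] += 1
    return counts

# Toy corpus; in practice this would be a large collection of documents.
sentences = ["the king rules the kingdom", "the queen rules the kingdom"]
matrix = co_occurrence(sentences)
print(dict(matrix["king"]))   # {'the': 2, 'rules': 1}

Count-based models such as LSA or GloVe then reweight and factorize a matrix like this into dense, low-dimensional word vectors.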
Context-Based Vector Space Model
Context-based methods build predictive models that aim directly at predicting a word given its neighbors. The dense word vectors are part of the model parameters, and the best vector representation of each word is learned during the model training process.
Skip-Gram Model
Suppose that you have a sliding window of a fixed size moving along a sentence: the word in the middle is the “target” and those on its left and right within the sliding window are the context words.
The skip-gram model is trained to predict the probabilities of a word being a context word for the given target. Each context-target pair is treated as a new observation in the data.
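As a minimal sketch (the window size and the example sentence are chosen just for illustration), the context-target pairs that skip-gram treats as training observations can be enumerated like this:

def skipgram_pairs(words, window=2):
    """Yield (target, context) pairs within a sliding window of the given size."""
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                yield (target, words[j])

sentence = "the man who passes the sentence should swing the sword".split()
print(list(skipgram_pairs(sentence, window=1))[:4])
# [('the', 'man'), ('man', 'the'), ('man', 'who'), ('who', 'man')]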
Continuous Bag-of-Words (CBOW)
It predicts the target word from source context words.
Because there are multiple context words, we average their corresponding word vectors, each constructed by multiplying the word's input (one-hot) vector with the embedding matrix W. Because this averaging stage smooths over much of the distributional information, the CBOW model is often considered better suited to smaller datasets.
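A minimal sketch of the CBOW forward pass, assuming a toy vocabulary size, embedding dimension, and randomly initialized matrices (a full implementation would also include the training loop):

import numpy as np

vocab_size, embed_dim = 10000, 128
W_in = np.random.randn(vocab_size, embed_dim) * 0.01    # input embedding matrix W
W_out = np.random.randn(embed_dim, vocab_size) * 0.01   # output (embedding-to-vocabulary) matrix

def cbow_forward(context_ids):
    """Average the context word vectors, then score every vocabulary word."""
    h = W_in[context_ids].mean(axis=0)   # averaging smooths over the context
    scores = h @ W_out                   # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()           # softmax over the vocabulary

probs = cbow_forward([12, 45, 7, 300])   # indices of the context words (arbitrary here)
predicted_target = probs.argmax()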
Example: word2vec on “Game of Thrones”
import re

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from gensim.models import Word2Vec

STOP_WORDS = set(stopwords.words('english'))

def get_words(txt):
    """Tokenize a piece of text and drop stop words."""
    return [w for w in re.findall(r'\b(\w+)\b', txt) if w not in STOP_WORDS]

def parse_sentence_words(input_file_names):
    """Returns a list of lists of words. Each sublist is a sentence."""
    sentence_words = []
    for file_name in input_file_names:
        with open(file_name, encoding='utf-8', errors='ignore') as f:
            for line in f:
                line = line.strip().lower()
                # Keep ASCII characters only (Python 3 equivalent of the original decode/encode step).
                line = line.encode('ascii', 'ignore').decode('ascii')
                sent_words = [get_words(sent) for sent in sent_tokenize(line)]
                sent_words = [sw for sw in sent_words if len(sw) > 1]
                if len(sent_words) > 1:
                    sentence_words += sent_words
    return sentence_words

# You should see five .txt files after unzipping 'a_song_of_ice_and_fire.zip'.
input_file_names = ["001ssb.txt", "002ssb.txt", "003ssb.txt",
                    "004ssb.txt", "005ssb.txt"]
GOT_SENTENCE_WORDS = parse_sentence_words(input_file_names)

# vector_size (called `size` in gensim < 4.0): the dimensionality of the embedding vectors.
# window: the maximum distance between the current and predicted word within a sentence.
model = Word2Vec(GOT_SENTENCE_WORDS, vector_size=128, window=3, min_count=5, workers=4)
model.wv.save_word2vec_format("got_word2vec.txt", binary=False)
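Once the vectors are saved in word2vec text format, they can be loaded back and queried. A minimal sketch, assuming gensim's KeyedVectors and that a token such as "stark" survived the min_count cutoff:

from gensim.models import KeyedVectors

# Load the vectors trained above and look up nearest neighbours in the embedding space.
got_vectors = KeyedVectors.load_word2vec_format("got_word2vec.txt", binary=False)
print(got_vectors.most_similar("stark", topn=5))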
Vivek Gupta: https://www.linkedin.com/in/vivekg-/
Follow me on Quora: https://www.quora.com/profile/Vivek-Gupta-1493
Check out my legal space here: https://easylaw.quora.com