# 14.1. Word Embedding (word2vec)

Natural language is a complex system used to express meanings. In this
system, words are the basic units of meaning. As the name implies,
*word vectors* are vectors used to represent words, and can also be
considered as feature vectors or representations of words. The technique
of mapping words to real vectors is called *word embedding*. In recent
years, word embedding has gradually become part of the basic knowledge
of natural language processing.

## 14.1.1. One-Hot Vectors Are a Bad Choice

In Section 8.5, we used one-hot vectors to represent words (where characters were treated as words). Suppose that the number of different words in the dictionary (the dictionary size) is \(N\), and each word corresponds to a different integer (index) from \(0\) to \(N-1\). To obtain the one-hot vector representation for any word with index \(i\), we create a length-\(N\) vector of all 0s and set the element at position \(i\) to 1. In this way, each word is represented as a vector of length \(N\) that can be used directly by neural networks.
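The construction just described can be sketched in a few lines of plain Python. The helper name `one_hot` and the five-word toy dictionary are assumptions made for illustration only; they are not part of the original text.

```python
def one_hot(index, vocab_size):
    """Return a length-`vocab_size` list with a 1 at `index` and 0s elsewhere."""
    if not 0 <= index < vocab_size:
        raise ValueError("index must lie in [0, vocab_size)")
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


# Hypothetical toy dictionary: the words and their order are illustrative.
vocab = ["the", "man", "loves", "his", "son"]
word_to_index = {word: i for i, word in enumerate(vocab)}

loves_vec = one_hot(word_to_index["loves"], len(vocab))
print(loves_vec)  # [0, 0, 1, 0, 0]
```

Note that the vector length grows with the dictionary size \(N\), while only a single entry is ever nonzero.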

Although one-hot word vectors are easy to construct, they are usually
not a good choice. A main reason is that one-hot word vectors cannot
accurately express the similarity between different words, such as the
*cosine similarity* that we often use. For vectors
\(\mathbf{x}, \mathbf{y} \in \mathbb{R}^d\), their cosine similarity
is the cosine of the angle between them:

\[\frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} \in [-1, 1]. \tag{14.1.1}\]

Since the cosine similarity between one-hot vectors of any two different words is 0, one-hot vectors cannot encode similarities among words.
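As a quick check of this claim, the sketch below computes the cosine similarity (dot product divided by the product of the norms) for every pair of one-hot vectors in a dictionary of size 5. The function names and the dictionary size are illustrative assumptions.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two nonzero vectors given as lists."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def one_hot(index, vocab_size):
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


N = 5  # assumed toy dictionary size
sims = [[cosine_similarity(one_hot(i, N), one_hot(j, N)) for j in range(N)]
        for i in range(N)]

for row in sims:
    print(row)  # 1.0 on the diagonal, 0.0 everywhere else
```

Every pair of different words gets similarity 0 because their one-hot vectors have no nonzero position in common, so the dot product in the numerator vanishes.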

## 14.1.2. Self-Supervised word2vec

The word2vec tool was proposed to address the above issue. It maps each
word to a fixed-length vector, and these vectors can better express the
similarity and analogy relationships among different words. The word2vec
tool contains two
models, namely *skip-gram* [Mikolov et al., 2013b] and
*continuous bag of words* (CBOW) [Mikolov et al., 2013a].
For semantically meaningful representations, their training relies on
conditional probabilities that can be viewed as predicting some words
using some of their surrounding words in corpora. Since supervision
comes from the data without labels, both skip-gram and continuous bag of
words are self-supervised models.

In the following, we will introduce these two models and their training methods.

## 14.1.3. The Skip-Gram Model

The *skip-gram* model assumes that a word can be used to generate its
surrounding words in a text sequence. Take the text sequence “the”,
“man”, “loves”, “his”, “son” as an example. Let us choose “loves” as the
*center word* and set the context window size to 2. As shown in
Fig. 14.1.1, given the center word “loves”, the skip-gram
model considers the conditional probability for generating the *context
words*: “the”, “man”, “his”, and “son”, which are no more than 2 words
away from the center word:

\[P(\textrm{``the''},\textrm{``man''},\textrm{``his''},\textrm{``son''}\mid\textrm{``loves''}). \tag{14.1.2}\]
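To make the notion of a context window concrete, here is a minimal sketch that collects the context words of a chosen center word. The helper `context_words` is a hypothetical name introduced for this illustration, and positions are 0-indexed as usual in Python.

```python
def context_words(tokens, center_pos, window_size):
    """Return the words at most `window_size` positions away from the center word."""
    start = max(0, center_pos - window_size)
    end = min(len(tokens), center_pos + window_size + 1)
    return [tokens[i] for i in range(start, end) if i != center_pos]


tokens = ["the", "man", "loves", "his", "son"]

print(context_words(tokens, tokens.index("loves"), 2))
# ['the', 'man', 'his', 'son']

# Near the start of the sequence the window is simply truncated.
print(context_words(tokens, 0, 2))
# ['man', 'loves']
```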

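Assume that the context words are independently generated given the center word (i.e., conditional independence). In this case, the above conditional probability can be rewritten as

\[P(\textrm{``the''}\mid\textrm{``loves''})\cdot P(\textrm{``man''}\mid\textrm{``loves''})\cdot P(\textrm{``his''}\mid\textrm{``loves''})\cdot P(\textrm{``son''}\mid\textrm{``loves''}). \tag{14.1.3}\]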

In the skip-gram model, each word has two \(d\)-dimensional vector
representations for calculating conditional probabilities. More
concretely, for any word with index \(i\) in the dictionary, denote
by \(\mathbf{v}_i\in\mathbb{R}^d\) and
\(\mathbf{u}_i\in\mathbb{R}^d\) its two vectors when used as a
*center* word and a *context* word, respectively. The conditional
probability of generating any context word \(w_o\) (with index
\(o\) in the dictionary) given the center word \(w_c\) (with
index \(c\) in the dictionary) can be modeled by a softmax operation
on vector dot products:

\[P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}, \tag{14.1.4}\]

where the vocabulary index set is \(\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}\). Suppose that we are given a text sequence of length \(T\), where the word at time step \(t\) is denoted as \(w^{(t)}\). Assume that context words are independently generated given any center word. For context window size \(m\), the likelihood function of the skip-gram model is the probability of generating all context words given any center word:

\[\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}), \tag{14.1.5}\]

where any time step that is less than \(1\) or greater than \(T\) can be omitted.
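The following sketch puts the two ingredients of this subsection together: the softmax over dot products that models \(P(w_o \mid w_c)\), and the logarithm of the likelihood accumulated over every center word and its in-range context words. It is only an illustration under stated assumptions: the function names, the randomly initialized vectors, and the toy sequence of word indices are all hypothetical, and time steps are 0-indexed rather than running from \(1\) to \(T\).

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def skipgram_prob(o, c, U, V):
    """P(w_o | w_c): softmax over the dot products u_i . v_c for all words i."""
    scores = [dot(u_i, V[c]) for u_i in U]
    max_score = max(scores)  # subtracted for numerical stability only
    exps = [math.exp(s - max_score) for s in scores]
    return exps[o] / sum(exps)


def skipgram_log_likelihood(seq, m, U, V):
    """Sum of log P(w^(t+j) | w^(t)) over all t and 0 < |j| <= m.

    Context positions that fall outside the sequence are skipped.
    """
    total = 0.0
    for t, c in enumerate(seq):
        for j in range(-m, m + 1):
            if j == 0 or not 0 <= t + j < len(seq):
                continue
            total += math.log(skipgram_prob(seq[t + j], c, U, V))
    return total


random.seed(0)
vocab_size, d = 5, 3  # assumed toy sizes
# V[i]: vector of word i as a center word; U[i]: vector as a context word.
V = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(vocab_size)]
U = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(vocab_size)]

seq = [0, 1, 2, 3, 4]  # word indices for "the", "man", "loves", "his", "son"
probs = [skipgram_prob(o, 2, U, V) for o in range(vocab_size)]
print(sum(probs))  # the conditional probabilities sum to 1 (up to rounding)
print(skipgram_log_likelihood(seq, 2, U, V))
```

Working with the log-likelihood rather than the raw product avoids numerical underflow when many probabilities smaller than 1 are multiplied together.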

### 14.1.3.1. Training

The skip-gram model parameters are the center word vector and context word vector for each word in the vocabulary. In training, we learn the model parameters by maximizing the likelihood function (i.e., maximum likelihood estimation). This is equivalent to minimizing the following loss function:

\[- \sum_{t=1}^{T} \sum_{-m \leq j \leq m,\ j \neq 0} \log P(w^{(t+j)} \mid w^{(t)}). \tag{14.1.6}\]

When using stochastic gradient descent to minimize the loss, in each iteration we can randomly sample a shorter subsequence, calculate the (stochastic) gradient for this subsequence, and use it to update the model parameters. To calculate this (stochastic) gradient, we need to obtain the gradients of the log conditional probability with respect to the center word vector and the context word vector. In general, according to (14.1.4) the log conditional probability involving any pair of the center word \(w_c\) and the context word \(w_o\) is

\[\log P(w_o \mid w_c) = \mathbf{u}_o^\top \mathbf{v}_c - \log\left(\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)\right). \tag{14.1.7}\]

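Through differentiation, we can obtain its gradient with respect to the center word vector \(\mathbf{v}_c\) as

\[\begin{aligned}\frac{\partial \log P(w_o \mid w_c)}{\partial \mathbf{v}_c} &= \mathbf{u}_o - \frac{\sum_{j \in \mathcal{V}} \exp(\mathbf{u}_j^\top \mathbf{v}_c)\,\mathbf{u}_j}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\\ &= \mathbf{u}_o - \sum_{j \in \mathcal{V}} \left(\frac{\exp(\mathbf{u}_j^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\right) \mathbf{u}_j\\ &= \mathbf{u}_o - \sum_{j \in \mathcal{V}} P(w_j \mid w_c)\, \mathbf{u}_j.\end{aligned} \tag{14.1.8}\]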

Note that the calculation in (14.1.8) requires the conditional probabilities of all words in the dictionary with \(w_c\) as the center word. The gradients for the other word vectors can be obtained in the same way.
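The gradient with respect to the center word vector is the context vector \(\mathbf{u}_o\) minus the probability-weighted average of all context vectors, \(\sum_{j \in \mathcal{V}} P(w_j \mid w_c)\,\mathbf{u}_j\). The sketch below evaluates this expression and compares it with a central finite-difference approximation. The function names and the hand-picked toy vectors are assumptions for illustration.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def softmax_probs(c, U, V):
    """P(w_j | w_c) for every word j in the vocabulary."""
    scores = [dot(u_j, V[c]) for u_j in U]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob(o, c, U, V):
    return math.log(softmax_probs(c, U, V)[o])


def grad_center(o, c, U, V):
    """Gradient of log P(w_o | w_c) w.r.t. v_c: u_o - sum_j P(w_j | w_c) u_j."""
    probs = softmax_probs(c, U, V)  # touches every word in the vocabulary
    d = len(V[c])
    expected_u = [sum(p * u_j[k] for p, u_j in zip(probs, U)) for k in range(d)]
    return [U[o][k] - expected_u[k] for k in range(d)]


# Hypothetical toy vectors: 3 words, dimension 2.
U = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]  # context word vectors
V = [[0.3, 0.1], [-0.2, 0.5], [0.0, -0.4]]  # center word vectors
o, c = 1, 0

analytic = grad_center(o, c, U, V)

# Central finite differences on each coordinate of v_c.
eps = 1e-6
numeric = []
for k in range(len(V[c])):
    V_plus = [row[:] for row in V]
    V_minus = [row[:] for row in V]
    V_plus[c][k] += eps
    V_minus[c][k] -= eps
    numeric.append((log_prob(o, c, U, V_plus) - log_prob(o, c, U, V_minus)) / (2 * eps))

print(analytic)
print(numeric)  # should agree with `analytic` to several decimal places
```

The call to `softmax_probs` loops over the whole vocabulary, which is exactly the cost that the note above points out.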

After training, for any word with index \(i\) in the dictionary, we obtain both word vectors \(\mathbf{v}_i\) (as the center word) and \(\mathbf{u}_i\) (as the context word). In natural language processing applications, the center word vectors of the skip-gram model are typically used as the word representations.

## 14.1.4. The Continuous Bag of Words (CBOW) Model

The *continuous bag of words* (CBOW) model is similar to the skip-gram
model. The major difference from the skip-gram model is that the
continuous bag of words model assumes that a center word is generated
based on its surrounding context words in the text sequence. For
example, in the same text sequence “the”, “man”, “loves”, “his”, and
“son”, with “loves” as the center word and the context window size being
2, the continuous bag of words model considers the conditional
probability of generating the center word “loves” based on the context
words “the”, “man”, “his” and “son” (as shown in `fig_cbow`), which is

\[P(\textrm{``loves''}\mid\textrm{``the''},\textrm{``man''},\textrm{``his''},\textrm{``son''}). \tag{14.1.9}\]

Since there are multiple context words in the continuous bag of words
model, these context word vectors are averaged in the calculation of the
conditional probability. Specifically, for any word with index \(i\)
in the dictionary, denote by \(\mathbf{v}_i\in\mathbb{R}^d\) and
\(\mathbf{u}_i\in\mathbb{R}^d\) its two vectors when used as a
*context* word and a *center* word, respectively (the roles are switched
relative to the skip-gram model). The conditional probability of
generating any center word \(w_c\) (with index \(c\) in the
dictionary) given its surrounding context words
\(w_{o_1}, \ldots, w_{o_{2m}}\) (with indices
\(o_1, \ldots, o_{2m}\) in the dictionary) can be modeled by

\[P(w_c \mid w_{o_1}, \ldots, w_{o_{2m}}) = \frac{\exp\left(\frac{1}{2m}\mathbf{u}_c^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}}) \right)}{\sum_{i \in \mathcal{V}} \exp\left(\frac{1}{2m}\mathbf{u}_i^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}}) \right)}. \tag{14.1.10}\]

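For brevity, let \(\mathcal{W}_o= \{w_{o_1}, \ldots, w_{o_{2m}}\}\) and \(\bar{\mathbf{v}}_o = \left(\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}} \right)/(2m)\). Then (14.1.10) can be simplified as

\[P(w_c \mid \mathcal{W}_o) = \frac{\exp\left(\mathbf{u}_c^\top \bar{\mathbf{v}}_o\right)}{\sum_{i \in \mathcal{V}} \exp\left(\mathbf{u}_i^\top \bar{\mathbf{v}}_o\right)}. \tag{14.1.11}\]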

Suppose that we are given a text sequence of length \(T\), where the word at time step \(t\) is denoted as \(w^{(t)}\). For context window size \(m\), the likelihood function of the continuous bag of words model is the probability of generating all center words given their context words:

\[\prod_{t=1}^{T} P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}). \tag{14.1.12}\]
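As a minimal sketch of the conditional probability used by the continuous bag of words model, the code below averages the context word vectors into \(\bar{\mathbf{v}}_o\) and applies a softmax over the dot products with every center word vector. The function name `cbow_prob`, the random toy vectors, and the choice of a full window of \(2m = 4\) context words are assumptions for illustration.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cbow_prob(c, context, U, V):
    """P(w_c | W_o): softmax over u_i . v_bar, with v_bar the mean context vector."""
    d = len(V[0])
    # Average the context word vectors (len(context) equals 2m for a full window).
    v_bar = [sum(V[o][k] for o in context) / len(context) for k in range(d)]
    scores = [dot(u_i, v_bar) for u_i in U]
    max_score = max(scores)  # subtracted for numerical stability only
    exps = [math.exp(s - max_score) for s in scores]
    return exps[c] / sum(exps)


random.seed(0)
vocab_size, d = 5, 3  # assumed toy sizes
# In this model V[i] is the vector of word i as a context word, U[i] as a center word.
V = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(vocab_size)]
U = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(vocab_size)]

context = [0, 1, 3, 4]  # indices of "the", "man", "his", "son"
probs = [cbow_prob(c, context, U, V) for c in range(vocab_size)]
print(probs[2])    # probability assigned to the center word "loves"
print(sum(probs))  # sums to 1 over all candidate center words (up to rounding)
```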

### 14.1.4.1. Training

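Training continuous bag of words models is almost the same as training skip-gram models. The maximum likelihood estimation of the continuous bag of words model is equivalent to minimizing the following loss function:

\[-\sum_{t=1}^{T} \log P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}). \tag{14.1.13}\]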

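Notice that

\[\log P(w_c \mid \mathcal{W}_o) = \mathbf{u}_c^\top \bar{\mathbf{v}}_o - \log\left(\sum_{i \in \mathcal{V}} \exp\left(\mathbf{u}_i^\top \bar{\mathbf{v}}_o\right)\right). \tag{14.1.14}\]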

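Through differentiation, we can obtain its gradient with respect to any context word vector \(\mathbf{v}_{o_i}\) (\(i = 1, \ldots, 2m\)) as

\[\frac{\partial \log P(w_c \mid \mathcal{W}_o)}{\partial \mathbf{v}_{o_i}} = \frac{1}{2m} \left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} \frac{\exp(\mathbf{u}_j^\top \bar{\mathbf{v}}_o)\,\mathbf{u}_j}{\sum_{k \in \mathcal{V}} \exp(\mathbf{u}_k^\top \bar{\mathbf{v}}_o)} \right) = \frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} P(w_j \mid \mathcal{W}_o)\, \mathbf{u}_j \right). \tag{14.1.15}\]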

The gradients for the other word vectors can be obtained in the same way. Unlike the skip-gram model, the continuous bag of words model typically uses context word vectors as the word representations.
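The gradient of \(\log P(w_c \mid \mathcal{W}_o)\) with respect to a context word vector is \(\frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} P(w_j \mid \mathcal{W}_o)\,\mathbf{u}_j\right)\), and it can be checked numerically in the same spirit as for the skip-gram model. In the sketch below, the function names and toy vectors are hypothetical, and the context words are assumed to be distinct so that each context vector appears exactly once in the average.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cbow_probs(context, U, V):
    """P(w_j | W_o) for every candidate center word j."""
    d = len(V[0])
    v_bar = [sum(V[o][k] for o in context) / len(context) for k in range(d)]
    scores = [dot(u_j, v_bar) for u_j in U]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_log_prob(c, context, U, V):
    return math.log(cbow_probs(context, U, V)[c])


def grad_context_vector(c, context, U, V):
    """Gradient w.r.t. any one context vector: (u_c - sum_j P(w_j | W_o) u_j) / (2m)."""
    probs = cbow_probs(context, U, V)
    d = len(U[0])
    expected_u = [sum(p * u_j[k] for p, u_j in zip(probs, U)) for k in range(d)]
    return [(U[c][k] - expected_u[k]) / len(context) for k in range(d)]


# Hypothetical toy vectors: 5 words, dimension 2.
U = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.2, 0.2], [-0.1, 0.6]]  # center vectors
V = [[0.3, 0.1], [-0.2, 0.5], [0.0, -0.4], [0.7, -0.3], [0.1, 0.1]]  # context vectors
c, context = 2, [0, 1, 3, 4]  # distinct context indices, so 2m = 4

analytic = grad_context_vector(c, context, U, V)

# Central finite differences on the vector of the first context word.
eps = 1e-6
target = context[0]
numeric = []
for k in range(len(V[target])):
    V_plus = [row[:] for row in V]
    V_minus = [row[:] for row in V]
    V_plus[target][k] += eps
    V_minus[target][k] -= eps
    numeric.append(
        (cbow_log_prob(c, context, U, V_plus) - cbow_log_prob(c, context, U, V_minus))
        / (2 * eps)
    )

print(analytic)
print(numeric)  # should agree with `analytic` to several decimal places
```

Because the context vectors enter only through their average, every distinct context word in the window receives the same gradient.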

## 14.1.5. Summary

Word vectors are vectors used to represent words, and can also be considered as feature vectors or representations of words. The technique of mapping words to real vectors is called word embedding.

The word2vec tool contains both the skip-gram and continuous bag of words models.

The skip-gram model assumes that a word can be used to generate its surrounding words in a text sequence, while the continuous bag of words model assumes that a center word is generated based on its surrounding context words.

## 14.1.6. Exercises

What is the computational complexity for calculating each gradient? What could be the issue if the dictionary size is huge?

Some fixed phrases in English consist of multiple words, such as “new york”. How can we train their word vectors? Hint: see Section 4 in the word2vec paper [Mikolov et al., 2013b].

Let us reflect on the word2vec design by taking the skip-gram model as an example. What is the relationship between the dot product of two word vectors in the skip-gram model and the cosine similarity? For a pair of words with similar semantics, why may the cosine similarity of their word vectors (trained by the skip-gram model) be high?