Semantic Context & Analogy Prediction in Reviews via word2vec
A popular natural language processing technique originating from researchers at Google, word2vec was a state-of-the-art algorithm for roughly a decade, until the attention-based transformer took its place. Word2vec models use large corpora of text to learn either which word fits a given context, or which contexts tend to surround a given word. In this article, we study the latter model, known as skip-gram word2vec, in detail and implement its architecture on a corpus of restaurant reviews, with the goal of creating an embedding space for words and testing both the reliability of the generated space and the ability of our model to complete analogies (‘a’ is to ‘b’ as ‘c’ is to ‘d’). To accomplish this, the model must learn embeddings (i.e. representations) within a vector space such that words which are syntactically and semantically similar lie near one another. If the model succeeds in embedding the words within such a space, it can then learn the likely contexts of any given word.
This article is organized into five sections. Section 0 covers the preprocessing functions needed when working with the corpus of text data. Section I covers the model, which consists of functions used to create word embeddings, compute losses, and conduct forward/backward passes (with the underlying theory explained along the way). Section II introduces the training algorithm. Section III explains the functionality required for using the trained model to complete user-created analogies. The final section, Section IV, executes this pipeline on a corpus of restaurant reviews and analyzes the model’s performance. All of the code can be found in the repository linked at the end of the article; each section corresponds to a specific Python file implementing the corresponding functionality.
Section 0: Preprocessing
We want our pipeline to be an automated procedure, so we assume we simply begin with a corpus of text data organized into five categories (one per possible star rating). Our automated process must be able to read through the given text, build a vocabulary, assign each vocabulary word a unique identifier, count how often each word appears, and slide a window word-by-word through the text to extract all contexts (i.e. neighbouring words). We will do this by introducing the following four pseudocode functions.
def load_data(task_data.npy) -> reviews_1_star, reviews_5_star
def build_vocabulary(corpus) -> corpus, vocabulary, counts
def compute_token_to_index(vocabulary, counts) -> token_to_idx, idx_to_token, idx_to_counts
def get_token_pairs_from_window(sentence, window_size, token_to_idx) -> list(token_index, token_context_idx)
For this section, we will not explicitly write out the code for each function (there are functions in later sections that we want to examine in more detail); instead, we will explain the mechanism behind each one. All code is available in the linked repository, so please refer to it for further details.
Def. 1: With the body of data given as described above, the first function simply takes the data file and returns lists of strings (one list per star rating); later, in Section IV, these lists are combined to create our working corpus.
Def. 2: The second function takes this corpus, counts how often each word appears, and collects the words into a tuple. Before returning the three variables shown, we filter these so that only the 200 most common words are kept, the reason being that the model performs better when it sees, during training, words that appear in multiple contexts. At this point, the function returns a filtered copy of our corpus (where each infrequent word is replaced with ‘Unknown’), a tuple of strings in which each word appears in corpus order and exactly as often as it occurs in the corpus, and a tuple of integers counting how often each word appears.
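As a rough illustration, here is a minimal sketch of what build_vocabulary might look like. It keeps each retained word only once for simplicity; the parameter names max_words and unknown_token are placeholders, and the repository version (which, as described above, stores the vocabulary words in corpus order) may differ in detail.

```python
from collections import Counter

def build_vocabulary(corpus, max_words=200, unknown_token='Unknown'):
    # Count how often each word appears in the corpus (a sequence of tokens).
    counts = Counter(corpus)
    # Keep only the most frequent words; everything else becomes the unknown token.
    kept = dict(counts.most_common(max_words))
    filtered_corpus = [w if w in kept else unknown_token for w in corpus]
    vocabulary = tuple(kept.keys())
    word_counts = tuple(kept.values())
    return filtered_corpus, vocabulary, word_counts
```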
Def. 3: The third function takes the ‘vocabulary’ and ‘counts’ tuples and constructs three mappings in the form of dictionaries. The first mapping assigns each unique token (i.e. word) a unique identifier (an integer index). The second mapping inverts the first: the keys are now the integer indices and the values are the tokens. The last mapping uses the indices as keys and maps each to an integer counting the number of times the word identified by that index appears in our corpus.
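A possible sketch of these three mappings, assuming vocabulary and counts are parallel sequences as returned above:

```python
def compute_token_to_index(vocabulary, counts):
    # Assign each unique token an integer index, in the order the tokens appear.
    token_to_idx = {token: idx for idx, token in enumerate(vocabulary)}
    # Invert the first mapping: index -> token.
    idx_to_token = {idx: token for token, idx in token_to_idx.items()}
    # Map each index to the number of times its token occurs in the corpus.
    idx_to_counts = {token_to_idx[token]: count for token, count in zip(vocabulary, counts)}
    return token_to_idx, idx_to_token, idx_to_counts
```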
Def. 4: This last function is the sliding window, which moves word-by-word through the corpus and returns the indices of all words lying within a certain window size around the word under consideration. Consider the following example, where the window size is 2, the sentence is “Where are my keys to my car?”, and the window rests on the word “keys.” The function must return a list of tuple pairs, which in this case would be [(keys, are), (keys, my), (keys, to), (keys, my)]. Note, however, that the function returns the unique indices corresponding to these words, not the words themselves.
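A minimal sketch of this sliding window, assuming the sentence is already tokenized and every token has an entry in token_to_idx:

```python
def get_token_pairs_from_window(sentence, window_size, token_to_idx):
    # Pair each word's index with the index of every neighbour that lies
    # within `window_size` positions to its left or right.
    pairs = []
    for i, word in enumerate(sentence):
        lo = max(0, i - window_size)
        hi = min(len(sentence), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((token_to_idx[word], token_to_idx[sentence[j]]))
    return pairs
```

For the example sentence above, the four pairs centred on “keys” come out exactly as listed, just with each word replaced by its index.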
With these four preprocessing functions, we can take any text data and extract the surrounding words needed for our model to train on. With this information, our model will learn which words tend to surround other words. The underlying theory behind this prediction will be introduced in the next section, before we introduce the concrete prediction model.
Section I.A: Theory
Our prediction model functions by taking an arbitrary word and predicting which words are most likely to appear within a certain window size around it. In the preprocessing section, we mentioned that each word is represented by a unique index. We will now use these indices to represent each word as a one-hot vector (zeros in every entry except the one matching the word’s index). When this one-hot vector is fed into the model, the model should yield a vector of the same size, with higher probabilities at the indices of words it believes are likely to surround the word under consideration. A visualization of such a system can be seen in the following diagram, where the internal mechanism is shown explicitly as operations with the matrices U and V (N is the number of unique words, D is the embedding dimension, and the window size is 2).
As we can see, the one-hot vector is multiplied by a (to-be-learned) matrix U to obtain the word embedding corresponding to the input. This embedding is then multiplied by another (to-be-learned) matrix V, and the result is passed through a (not shown) softmax function to normalize the entries and return a valid probability vector. A system that yields estimated probabilities motivates the introduction of a risk function that seeks to maximize the probability assigned to the correct predictions, as shown by the following formalization.
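In the notation above, and assuming U has shape D×N and V has shape N×D (the orientation implied by the backward-pass code later on), the forward computation and the corresponding true risk can be sketched as:

$$
\mathbf{e} = U\mathbf{x}, \qquad \mathbf{z} = V\mathbf{e}, \qquad
\hat{p}(w_o \mid w_c) = \mathrm{softmax}(\mathbf{z})_o = \frac{e^{z_o}}{\sum_{k=1}^{N} e^{z_k}},
$$

$$
\mathcal{R}(U, V) = \mathbb{E}_{(w_c,\, w_o)}\big[-\log \hat{p}(w_o \mid w_c)\big],
$$

where x is the one-hot vector of the centre word w_c and w_o is a context word drawn from its window.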
In training, we never have access to the true risk function, since the underlying probability distribution is not analytically available. We must therefore turn to an empirical risk function, where the loss depends on the true targets (available during training) and the estimated probabilities (produced by the forward pass of our model). In problems where we want to estimate the probabilities of different outcomes, the cross-entropy (equivalently, the negative log-likelihood) is the natural choice: given conditionally independent samples, maximizing the likelihood with respect to our parameters is equivalent to minimizing the cross-entropy. We therefore use an empirical risk function that is a Monte Carlo estimate of the cross-entropy between our true and predicted distributions.
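Written out for a minibatch of m training pairs, with one-hot targets y^{(i)} and predicted probability vectors ŷ^{(i)}, this empirical risk is simply the averaged cross-entropy:

$$
\hat{\mathcal{R}}(U, V) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{N} y^{(i)}_{k}\,\log \hat{y}^{(i)}_{k}
= -\frac{1}{m}\sum_{i=1}^{m} \log \hat{y}^{(i)}_{c_i},
$$

where c_i is the index of the true context word of the i-th pair; the second form follows because each y^{(i)} has a single non-zero entry.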
Having now shown our model’s theoretical forward functionality and loss measure, the last topic to touch on is training. We can see from the diagram that we are working with a shallow neural network. Such a network allows for training using common principles of backpropagation. This requires taking the derivative of the loss function with respect to our parameters, which in our case are our two matrices. We will leave the more explicit demonstration of this algorithm for the following two sections, where we move to implement our discussed tools.
Section I.B: Model
Now that we have discussed the theory behind the model’s functions, we move to implement it via the following pseudocode class.
class Embedding():
    def __init__(self, vocabulary_size, embedding_dim):
        self.vocabulary_size = vocabulary_size
        self.embedding_dim = embedding_dim
        self.ctx = None
        self.U = None
        self.V = None
        self.reset_parameters()

    def reset_parameters(self) -> None
    def one_hot(self, some_sentence) -> one_hot_matrix
    def softmax(self, x, axis) -> prob_vector
    def loss(self, y_true, y_predicted) -> cross_entropy_loss_float
    def forward(self, x_tokens, y_true_tokens) -> y_predicted
    def backward(self) -> dL_dV, dL_dU
We can see that when an instance of this class is created, we automatically initialize the necessary components of the model seen in section I.A. In addition to the already-mentioned components and dimensions, we also introduce a variable named ctx designed to store the values needed during backpropagation, as well as a function (def. 1) which resets ctx to ‘None’ and initializes our U and V matrices with normally distributed random numbers. Now we will move to explain the rest of the functions in the class.
Def. 2: This function takes any sentence represented by the unique indices derived in the preprocessing section, and returns a matrix in which each row is the one-hot representation of the corresponding index. This matrix is then fed row-by-row into our prediction model.
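A minimal sketch of this encoding, written as a standalone function for brevity (in the class it would be a method drawing vocabulary_size from self):

```python
import numpy as np

def one_hot(sequence, vocabulary_size):
    # One row per token index: all zeros except for a single one at the
    # position given by that token's unique index.
    matrix = np.zeros((len(sequence), vocabulary_size))
    matrix[np.arange(len(sequence)), sequence] = 1.0
    return matrix
```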
Def. 3: This function computes the vector-wise softmax of a matrix (along the axis set by the ‘axis’ argument). To make it numerically stable, we shift each input vector by its maximum value, which is safe because the softmax function is shift-invariant.
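A numerically stable softmax along the lines described, again sketched as a standalone function:

```python
import numpy as np

def softmax(x, axis=0):
    # Subtract the maximum along the chosen axis before exponentiating; the
    # result is unchanged because softmax is invariant to constant shifts.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)
```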
Def. 4: This function computes the cross-entropy loss between our true and predicted values, and returns the value as a floating point number.
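A sketch of this loss, assuming the samples are stored column-wise (one one-hot target and one probability vector per column) and with a small epsilon guarding against log(0):

```python
import numpy as np

def loss(y_true, y_predicted, eps=1e-12):
    # Average cross-entropy over the batch; y_true is one-hot, y_predicted
    # is the softmax output of the forward pass.
    batch_size = y_true.shape[1]
    return float(-np.sum(y_true * np.log(y_predicted + eps)) / batch_size)
```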
Def. 5: This function computes the forward pass of our prediction model, and functions exactly as described in section I.A. In addition, we save the values needed during backpropagation in our ctx variable, in order to not have to recompute these values.
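A sketch of the forward pass under the same shape assumptions as above (U of shape D×N, V of shape N×D, samples stored column-wise, and one_hot/softmax behaving like the sketches shown earlier). It returns the loss to match the training loop shown later, although returning the predicted probabilities, as in the class skeleton, works just as well:

```python
def forward(self, x_tokens, y_true_tokens):
    # One-hot encode the centre and context tokens; transpose so that each
    # sample occupies a column.
    x = self.one_hot(x_tokens).T
    y = self.one_hot(y_true_tokens).T
    # Embedding lookup, logits, and normalized probabilities.
    embedding = self.U @ x
    logits = self.V @ embedding
    prob = self.softmax(logits, axis=0)
    # Cache everything backward() will need.
    self.ctx = (embedding, logits, prob, x, y)
    return self.loss(y, prob)
```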
For the last function, we would like to more explicitly demonstrate how it works. It is meant to be executed after having called the forward pass, and conducts a single backpropagation step to return the derivative of the loss with respect to both matrices.
def backward(self):
    # Unpack the values cached in ctx during the forward pass.
    embedding, logits, prob, x, y = self.ctx
    m, n = y.shape

    # Closed-form gradients of the cross-entropy loss with respect to V and U.
    dL_dV = np.matmul(self.U @ x, (prob - y).T) / m
    dL_dV = dL_dV.T
    dL_dU = np.matmul(np.matmul(self.V.T, prob - y), x.T) / m

    return {'V': dL_dV, 'U': dL_dU}
The first step consists of unpacking the values saved in ctx during the forward pass. We then plug these values into the closed-form expressions obtained by differentiating the cross-entropy loss with respect to each matrix, and return the results in dictionary form as seen above. Once these gradients are returned, they are fed into the optimizer, which uses momentum-based gradient descent to update our parameters. This is covered in the coming section.
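For completeness, the closed-form gradients that the snippet implements can be written as follows, with X the matrix of one-hot inputs, Y the one-hot targets, P the predicted probabilities, and m the normalizing constant used in the code:

$$
\frac{\partial L}{\partial V} = \frac{1}{m}\,(P - Y)\,(UX)^{\top},
\qquad
\frac{\partial L}{\partial U} = \frac{1}{m}\,V^{\top}(P - Y)\,X^{\top}.
$$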
Section II: Training
In training, we are interested in using the gradients found by the ‘backward()’ function to update the current U and V matrices. We do this by introducing another class called ‘Optimizer’, which contains two functions. The first function simply initializes the previous gradients to zero, and the second function, named ‘step()’, performs the momentum-aided gradient update. Since we are using momentum, we need not only the current gradient and parameter values, but also the previous gradient values. The update step is shown below, where the weight ‘w’ stands for either U or V.
dw_new = dw * (1 - self.momentum) + self.momentum * dw_prev
w_new = w - self.learning_rate * dw_new
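A minimal sketch of such an Optimizer, assuming it is handed the model so it can read and update U and V directly; the learning rate and momentum values are placeholders, and here the smoothed gradient is carried forward as dw_prev (the repository may store the raw gradient instead):

```python
import numpy as np

class Optimizer:
    def __init__(self, model, learning_rate=0.1, momentum=0.9):
        self.model = model
        self.learning_rate = learning_rate
        self.momentum = momentum
        # Previous gradients, initialized to zero.
        self.prev_grads = {'U': np.zeros_like(model.U), 'V': np.zeros_like(model.V)}

    def step(self, grads):
        # Blend the new gradient with the previous one, then take a descent step.
        for name in ('U', 'V'):
            dw_new = grads[name] * (1 - self.momentum) + self.momentum * self.prev_grads[name]
            setattr(self.model, name, getattr(self.model, name) - self.learning_rate * dw_new)
            self.prev_grads[name] = dw_new
```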
Now that we have functions to execute the forward pass, the backward pass, and the gradient updates, we can combine them into a training algorithm. The algorithm’s runtime is controlled by an integer limiting the number of training steps, and we use minibatches to achieve faster convergence.
for i in range(MAX_ITERATIONS):
    x_tokens, y_true_tokens = get_batch(data, BATCH_SIZE, probabilities)
    loss = model.forward(x_tokens, y_true_tokens)
    grad = model.backward()
    optim.step(grad)
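The helper get_batch is not shown above; a plausible sketch, assuming data is the list of (centre_index, context_index) pairs produced in Section 0 and probabilities is an optional per-pair sampling distribution (summing to one), could look like this:

```python
import numpy as np

def get_batch(data, batch_size, probabilities=None):
    # Sample `batch_size` (centre, context) index pairs, optionally weighted
    # by `probabilities` (e.g. to down-weight pairs of very frequent words).
    chosen = np.random.choice(len(data), size=batch_size, p=probabilities)
    x_tokens = np.array([data[i][0] for i in chosen])
    y_true_tokens = np.array([data[i][1] for i in chosen])
    return x_tokens, y_true_tokens
```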
Section III: Analogy Estimation
The success of the word embedding model in word2vec is thought to come from the fact that "a word is characterized by the company it keeps," a quote from the linguist J.R. Firth in the 1950s. This is the idea underlying the distributional hypothesis in the field of semantic modeling, in which researchers model language using high-dimensional vectors and spaces. Just as in linear algebra, vectors that are "similar" with respect to some measure tend to be separated by relatively small distances within the space. In a distributional semantic space, such as the one created by our embedding matrix U, we want to preserve these mathematical properties, and we measure our version of “similarity” using the cosine similarity (namely, the cosine of the angle between two vectors).
Now that we have trained our model, we have the space available to test these capabilities. To test them, we use the following functions.
def cosine_distance(x, y) -> cos_dist
def get_analogies(embedding_matrix, triplet, token_to_idx, idx_to_token, n) -> list(most_likely_embeddings)
The first function simply computes the cosine distance between two vectors. The second finds the ‘n’ most likely words to complete an analogy, given a user-defined triplet of word tokens. More technically, it returns an ordered list of the word embeddings that minimize the cosine distance to the vector constructed from the triplet.
Consider word embeddings a, b, c, and d. To say “a is to b as c is to d” is to set up an equivalence: a relates to b in the same way that c relates to d. Mathematically, whatever distinguishes a from b should be the same as whatever distinguishes c from d, i.e. a - b = c - d. This relationship carries over directly to the embedding space.
If we want our model to predict one of these vectors, say d, then we simply need to compute b - a + c.
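A sketch of the two functions, assuming the embedding matrix stores one word embedding per column (so that looking up a column is equivalent to the one-hot multiplication described in the next section), that the triplet supplies (a, b, c) as in the formula above, and that the n closest candidates are returned as tokens for readability; the repository’s argument conventions may differ:

```python
import numpy as np

def cosine_distance(x, y):
    # 1 minus cosine similarity: 0 for parallel vectors, larger for dissimilar ones.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def get_analogies(embedding_matrix, triplet, token_to_idx, idx_to_token, n):
    # Embed the triplet (a, b, c), form the analogy vector b - a + c, and rank
    # every vocabulary word by cosine distance to it.
    a, b, c = (embedding_matrix[:, token_to_idx[t]] for t in triplet)
    target = b - a + c
    distances = [cosine_distance(target, embedding_matrix[:, i])
                 for i in range(embedding_matrix.shape[1])]
    # Exclude the query words themselves before taking the n best candidates.
    exclude = {token_to_idx[t] for t in triplet}
    ranked = [i for i in np.argsort(distances) if i not in exclude]
    return [idx_to_token[i] for i in ranked[:n]]
```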
Section IV: Evaluation
To test the capabilities of our generated embedding space, we create the following triplets.
triplets = [['is', 'was', 'were'], ['lunch', 'day', 'night'], ['i', 'my', 'your']]
For each word, we use one of our preprocessing functions to look up its unique index, then build its one-hot vector and multiply it with our embedding matrix U. This matrix-vector multiplication gives us the corresponding embedding, and we then compute the analogy exactly as defined in the previous section. All of this happens inside our get_analogies() function, which returns the following results when we limit the number of candidates to one, i.e. the single best candidate completing each analogy.
`is` is to `was` as [are] is to `were`
`lunch` is to `day` as [dinner] is to `night`
`i` is to `my` as [you] is to `your`
References
The colored node embedding diagram and matrix block system diagram were taken from: S. Günnemann, Machine Learning for Graphs and Sequential Data. Technische Universität München, 2023.
Link to GitHub
To download the Python files and test the model for yourself, click here.