Grammatical Learning & Sentence Generation via Hidden Markov Model Filtering, Smoothing, & Inference

In Natural Language Processing (NLP), we are interested in constructing a model which is able to process and manipulate language, such that, from our point of view, it seems as though the computer is learning the language. From this learning, the computer should then be able to extract information from text data for the purpose of organizing, classifying, and even generating sentences of its own, indistinguishable from those of a person. Such a task requires the computer to learn many of the nuances of human speech present within a language, which is itself a task of developing an abstract understanding of speech. In other words, for a computer to learn how and when words are organized within a sentence, it must learn to model the temporal organization of abstract (i.e. latent) representations of words (e.g. nouns, verbs, and adjectives) and the inter-dependent probabilities by which they are realized. A model which lends itself nicely to capturing dependencies between observed and unobserved (latent) variables is the Hidden Markov Model (HMM).

In this article, we will use HMMs to classify reviews for restaurants in Las Vegas as either 1-star or 5-star, and then use these trained models to generate reviews of their own. We will investigate the theory behind these models in section I, so that in section II we may proceed to implement these models along with their necessary functions. We conclude this article with section III, where we then evaluate the performance of these models by classifying and generating new reviews, and furthermore analyze the advantages as well as shortcomings of this model structure.

Section I: Theory

In the introduction to this article, I mentioned how HMMs are a well-suited model for our NLP task, since we are looking to learn temporal, abstract, and inter-dependent representations for words. To elucidate this concept, take a look at the following example, which shows an ideally-trained HMM generating (observable) words, along with their abstract (unobservable) representations.

        JJ:  Adjective
        NN:  Noun, Singular or Mass
        VBZ: Verb, 3rd Person Singular or Present
        DT:  Determiner
        IN:  Preposition or Subordinating Conjunction

From the user’s perspective, the outputs in gray are the observable and interpretable outputs of our model, while the nodes in white serve as intermediary embeddings from which the gray outputs are realized. We can see that this ideal HMM begins by deciding upon an abstract word representation, which takes the form of some Part-of-Speech (PoS) and is dependent on the previously realized PoS in the chain.

Organizing the sequential, probabilistic structure of a sentence by these grammatical categories makes sense for two reasons. Firstly, there are rules of grammar which logically determine which PoSs may follow other PoSs in order for a sentence to be considered grammatically valid. These rules provide a learnable structure from which our HMM can train more easily. Secondly, the set of all possible word realizations is much larger than the set of all PoSs, which makes modeling the probabilistic dependencies between all actual words far more computationally complex and burdensome than simply modeling the sequential structure between PoSs, which follow the same rules of grammar as the words themselves anyway.

In order to formalize the above example into an implementable model, we express the Markov dependencies explicitly by the following expressions.
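To make this concrete, let z_t denote the hidden PoS at position t and x_t the word emitted at position t (notation chosen here for consistency with the description that follows); the Markov structure then corresponds to the standard HMM factorization:

        P(X, Z) \;=\; P(z_1)\, P(x_1 \mid z_1) \prod_{t=2}^{T} P(z_t \mid z_{t-1})\, P(x_t \mid z_t),

        \text{with } P(z_1 = k) = \pi_k, \quad P(z_t = k \mid z_{t-1} = j) = A_{jk}, \quad P(x_t = n \mid z_t = k) = B_{kn}.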

We see that we have three sets of probabilities, whose Markov structure allows for the above probability decomposition. The first PoS in the chain is drawn according to the initial distribution π. Each PoS may then transition into another PoS, with a probability determined by the matrix A, and each realized PoS may emit a specific word, with a probability determined by the matrix B.

Recall that the gray nodes correspond to observed variables, while the white nodes correspond to unobserved variables. If we are interested in training our HMM to learn from a corpus of text data, unless we go through the entire data set and label the PoS corresponding to each word, the only information our model will have access to are the observed variables. Given only these observed variables, we would like for our model to learn the transition probabilities between all states. After our model learns these transition probabilities, namely the vector π and matrices A and B, then we can proceed to use our trained HMM to classify or generate new sequences.

Above we see our learning objective specified mathematically and adapted to our HMM after introducing the latent, hidden variables. Normally, after specifying the objective function, the most common way to find a solution is to take the derivative with respect to the model parameters θ and use it to either analytically derive a solution or numerically find one using gradient descent. The problem here is that all three objective functions shown above are intractable to compute. With regard to the first and second formulations, we do not have the analytical expression P(X) available to plug some observed sequence into and find the probability of said sequence. We do, however, have the ability to derive analytic expressions for the third formulation, which take the form of functions of elements from our transition matrices. However, the marginalization shown by the summation means that if we want to take the gradient of this function, we would have to write it out explicitly first, which requires summing over an exponential number of terms, making the resulting maximization unreasonably expensive. As a result, all three of the above objective functions are intractable, so where do we go from here?

Variational Inference (VI) is a methodology for approximating intractable probability distributions via reformulations which involve more favorable expressions and the introduction of a free-to-choose optimization variable. These methods are designed to address our problem exactly, since we have an analytically intractable objective P(X), which we would like to express via a favorable term which we do have analytically available, namely the joint distribution P(X,Z) (shown at the beginning of this section).
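Written out for an arbitrary distribution q(Z) over the latent variables (a standard identity, stated here in the notation of this article), this reformulation reads:

        \log P(X) \;=\; \underbrace{\mathbb{E}_{q(Z)}\!\left[\log \frac{P(X, Z)}{q(Z)}\right]}_{\text{ELBO}(q,\,\theta)} \;+\; \underbrace{\mathrm{KL}\!\left(q(Z) \,\big\|\, P(Z \mid X)\right)}_{\geq\, 0}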

Notice that this VI-reformulation results in two terms; the first term is known as the Evidence Lower Bound (ELBO) and serves as a tractable, proxy function to our intractable objective. Since it is a lower bound to our main objective, if this bound lies very close to the original function, then maximizing this proxy function can yield approximate solutions to our main objective.

Now let’s focus our attention on the second term of the VI-reformulation. We can see that, in addition to P(X,Z), there seems to be another term which we might need to have analytically available, namely the posterior P(Z|X). Thankfully, to optimize the VI-reformulation, variational Bayesian methods do not require an expression for the posterior, thanks to the simple fact that P(X) is constant in q. This means that for a fixed θ, maximizing the ELBO with respect to q automatically minimizes the Kullback-Leibler divergence between q and the posterior.

With two optimization variables, an alternating optimization is generally performed. Thanks to the above insight, we see that when we maximize with respect to q, we can choose to optimize only the favorable ELBO and neglect the KL-divergence term, which contains the commonly unavailable expression P(Z|X). In the next alternation, all terms are functions of θ, so we do not have the luxury of optimizing only the ELBO and expecting the same behavior as when optimizing with respect to q. What is normally done is that the KL-divergence term is dropped for this alternation, and we instead hope that our lower bound is tight enough to yield a good enough solution.

The alternating optimization procedure of the ELBO which I just described is the common procedure for optimizing P(X) in general latent variable models. However, we have a specific latent variable model, the hidden Markov model, and we are interested in using the EM-algorithm to optimize P(X). It turns out that the EM-algorithm is a specific type of variational Bayesian method which, under certain circumstances, is even better at optimizing P(X) than the procedure I just described. The difference here is that we can actually tractably compute P(Z|X) thanks to our HMM structure. Since q is a free-to-choose optimization variable, when we optimize with respect to q, we can simply set it equal to our posterior. Doing this will, in turn, yield a KL-divergence of zero, which means that when we optimize our ELBO w.r.t. θ, we are no longer optimizing a strict lower bound, but rather a simpler expression for the exact same function as our main objective!

Let us now delineate how our optimization problem w.r.t q and θ can be solved via alternating between our E-step (optimizing q) and our M-step (optimizing θ).
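In symbols, one EM iteration i can be summarized as follows (a standard statement, with θ collecting π, A, and B; the entropy of q is absorbed into a constant that does not depend on θ):

        \text{E-step:}\quad q^{(i+1)}(Z) = P(Z \mid X, \theta^{(i)}), \qquad \text{ELBO}(q^{(i+1)}, \theta) = \sum_{Z_{1:T}} P(Z \mid X, \theta^{(i)}) \log P(X, Z \mid \theta) + \text{const.}

        \text{M-step:}\quad \theta^{(i+1)} = \arg\max_{\theta} \; \text{ELBO}(q^{(i+1)}, \theta)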

This is called the “Expectation”-step, since the goal is to set up the expected value (the ELBO) under the distribution of the posterior. Notice that the sum over Z_1:T implies a summation over an exponential number of terms, on the order of O(K^T). When applying the definition of the joint distribution for our HMM, the number of terms falls drastically to O(TK^2). We can see that we no longer need to sum over all sequences of length T where each entry can take K different values. Rather, our transition matrix A shows that we need only consider pairs of K different latent variables, and since the sequence is T entries long, we arrive at an order of O(TK^2).
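Concretely, substituting the HMM joint from earlier collapses this expectation into sums over single states and adjacent pairs of states (a standard rearrangement; γ and ξ denote the posterior marginals that appear again below):

        \text{ELBO}(q^{(i+1)}, \theta) \;=\; \sum_{k=1}^{K} \gamma_1(k) \log \pi_k \;+\; \sum_{t=2}^{T} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi_{t-1}(j,k) \log A_{jk} \;+\; \sum_{t=1}^{T} \sum_{k=1}^{K} \gamma_t(k) \log B_{k, x_t} \;+\; \text{const.},

        \text{where } \gamma_t(k) = P(z_t = k \mid x_{1:T}) \text{ and } \xi_t(j,k) = P(z_t = j,\, z_{t+1} = k \mid x_{1:T}).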

This is called the “Maximization”-step, since we now maximize the expected value (the ELBO) set up in the previous step. By again applying the definition of the joint distribution for our HMM, the task simplifies once more. We are able to forgo gradient descent and can instead take the derivative of the expression above with respect to each of the three parameters, set this derivative to zero, and solve for the variable.
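Carrying this out under the usual normalization constraints yields the familiar closed-form updates (stated here for a single training sequence, in the notation above; for a whole corpus, the numerators and denominators are additionally summed over sentences):

        \pi_k = \gamma_1(k), \qquad A_{jk} = \frac{\sum_{t=1}^{T-1} \xi_t(j,k)}{\sum_{t=1}^{T-1} \gamma_t(j)}, \qquad B_{kn} = \frac{\sum_{t=1}^{T} \gamma_t(k)\, \mathbb{1}[x_t = n]}{\sum_{t=1}^{T} \gamma_t(k)}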

We are finally very close to completing our optimization procedure for finding the optimal parameters for our HMM; however, there is one last thing missing. As noted in the above picture, we still have a few terms (in blue) which we have not yet derived expressions for. Thanks to hidden Markov models being such a common latent variable model, there already exists an efficient algorithm for finding these exact terms.

The Forwards-Backwards algorithm is designed to compute posterior marginals for all latent state variables in probabilistic graphs given a sequence of observations. Since our implementation in section II will make heavy use of this algorithm along with the variables α and β, we will briefly remark on how this algorithm can find the unknown posteriors.

The Forwards-algorithm is shown above, where it incrementally computes the posterior distribution of each latent variable in the sequence given all past and present observations (hence online). Looking again at the posteriors we must supply to the M-step, we see that we need the posteriors under all observations in the sequence. Thus, we must also consider future observations, and as a result, the Backwards-algorithm starts at the end of the sequence and makes a backward pass.
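Both passes are simple recursions over the sequence. In the notation used here, α is defined as the joint probability of the observations so far and the current state (so that summing its final entries recovers P(X), as exploited in section II), while β collects the probability of the future observations:

        \alpha_t(k) = P(x_{1:t},\, z_t = k): \qquad \alpha_1(k) = \pi_k\, B_{k, x_1}, \qquad \alpha_t(k) = B_{k, x_t} \sum_{j=1}^{K} \alpha_{t-1}(j)\, A_{jk}

        \beta_t(k) = P(x_{t+1:T} \mid z_t = k): \qquad \beta_T(k) = 1, \qquad \beta_t(k) = \sum_{j=1}^{K} A_{kj}\, B_{j, x_{t+1}}\, \beta_{t+1}(j)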

The final expression missing is the second term in blue, which can easily be computed in closed form using the α and β terms computed by the Forwards-Backwards algorithm.
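In terms of α and β, these posteriors follow directly (a standard result, written in the notation above, with P(X) = Σ_k α_T(k)):

        \gamma_t(k) = \frac{\alpha_t(k)\, \beta_t(k)}{\sum_{j=1}^{K} \alpha_t(j)\, \beta_t(j)}, \qquad \xi_t(j,k) = \frac{\alpha_t(j)\, A_{jk}\, B_{k, x_{t+1}}\, \beta_{t+1}(k)}{P(X)}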

At this point we have all of the necessary components to implement our hidden Markov model and EM-algorithm. By alternating between the E- and M-steps, we can approximately solve our objective by iterating until we lie within some ε-fixed point. Thanks to this algorithm being so commonly used when working with statistical models with latent variables, we can also rely on the proofs which have shown that the derivative of the likelihood will be arbitrarily close to zero at any fixed point we reach.

The EM-algorithm adapted to HMMs has a specific name: the Baum-Welch algorithm. This algorithm works exactly as described in this section, with the added efficiency of forgoing the calculation of the entire posterior in the E-step and instead calculating only the posteriors ξ and γ. Recall the equation where we simplified our ELBO after plugging in the entire posterior calculated by the E-step. Only the terms in blue show up in this simplification, which means that after taking the derivative and finding a closed-form expression for the optimal parameters, only the terms in blue remain. Therefore, our E-step will no longer consist of evaluating the entire posterior, but rather only of evaluating the necessary posteriors ξ and γ.

In the next section, we will implement the tools we have derived in this section to train our HMM on a corpus of restaurant reviews from Las Vegas, and use this trained model to classify and generate new reviews.

Section II: Implementation

In the previous section, we saw how an HMM is parameterized and operates; how it uses the EM-algorithm to infer the optimal model parameters (i.e. train); and how all variables necessary for training can be found efficiently. To implement our model in an object-oriented program, we will create two separate classes. The first class will simply initialize and keep track of the HMM parameters.

        import numpy as np

        class HMM_Parameters:
              def __init__(self, K, N):
                  self.A = self.random_tensor(K, K)   # state-transition matrix (K x K)
                  self.B = self.random_tensor(K, N)   # emission matrix (K x N)
                  self.π = self.random_tensor(K)      # initial-state distribution (K)
              def random_tensor(self, *shape):
                  # standard-Gaussian samples, made non-negative and row-normalized
                  t = np.abs(np.random.randn(*shape))
                  return t / t.sum(axis=-1, keepdims=True)

We initialize our parameters via sampling from a standard Gaussian (taking magnitudes and normalizing so that each row forms a valid probability distribution). This class will be called by our main class, which contains the model along with all the necessary functions discussed in the previous section. Our main class will consist of some text preprocessing functions, our HMM, the forwards-backwards algorithm, our E-step where we evaluate only the necessary posteriors, and our M-step which uses them to update our parameters.

        class HMM:
              def __init__(self, corpus, K):
                  self.corpus = corpus                    # list of tokenized sentences
                  self.word_list = sorted({w for sentence in corpus for w in sentence})
                  self.word_dic = {w: i for i, w in enumerate(self.word_list)}
                  self.N = len(self.word_list)            # vocabulary size
                  self.K = K                              # number of hidden states
                  self.model_params = HMM_Parameters(K, self.N)
              def forwards_backwards(self, sentence_in):
                  ...                                     # -> α, β, log_likelihood
              def loglik_corpus(self):
                  ...                                     # -> loglikelihood_corpus
              def E_step(self, sentence_in):
                  ...                                     # -> sum_ξ, sum_γ, sum_γ_x
              def update_params(self):
                  ...                                     # -> None
              def learn_params(self, num_iterations):
                  ...                                     # -> history_loglikelihood
Upon initialization of this class, we automatically call the previous class to initialize random values for the parameters A, B, and π, along with their dimensions N and K, where as in the previous section, N denotes the number of possible observable realizations (i.e. size of set of all words in corpus), and K denotes the number of possible hidden state realizations which is a user-defined quantity. Along with these variables, we also require some corpus of text data to learn from; a list of all words in the corpus; and a dictionary which maps each word to a unique integer identifier (since NLP models require numerical/vectorial representations for words).

The definitions within the class are separated in this article for readability: those above are used only during training, while those further below can also be used outside of training. We will describe the functionality of the training-related definitions first:

  • Def. 1: This function takes a sentence and executes the forwards algorithm to find and return the vector α for every hidden state k, exactly as in section I (online). Then the backwards algorithm is executed to find and return the vector β for every hidden state k, exactly as in section I (offline). Additionally, since summing the components of the vector α corresponds to marginalizing the joint distribution P(X,Z) over all latent states, we can immediately retrieve the probability of the given sentence P(X) and return this float value as well. A sketch of this routine is given after this list.

  • Def. 2: This function is used within def. 5 (which is the main training algorithm) in order to calculate the log-likelihood of the corpus (i.e. training set) based on the current parameters at some iteration within ‘num_iterations’. This value acts as a sort of “loss” value, though it is a quantity we look to maximize, not minimize. By evaluating this float value over the training iterations, we can track the accuracy with which our (initially random) parameters A, B, and π are able to fit to our training corpus.

  • Def. 3: This function calculates the tensors ξ and γ exactly as at the end of section I. Notice that when calculating ξ, we end up with a [K, K, T] tensor, and when calculating γ, we get a [K, T] tensor. Since in the M-step these tensors are summed over the time axis, we sum their last components to return ‘sum_ξ’ and ‘sum_γ’, which are [K, K] and [K, 1] respectively. To further facilitate calculations, we notice that the numerator for calculating the B matrix in the M-step requires summing over the entries of γ only at the time steps where the word ‘n’ was observed. As a result, we expand this matrix over an axis of size N (i.e. the vocabulary size), with entries for γ only in the positions which led to a non-zero indicator function, returning a [K, N] matrix via ‘sum_γ_x’.

  • Def. 4: This function calls the previous E-step function and implements the M-step exactly as in the end of section I. After a single E- and M-step, it updates the parameters of the HMM class.

  • Def. 5: This function calls the previous function to compute one round of a single E- and M-step, for as many rounds as ‘num_iterations’ specifies. It also tracks the log-likelihood of the parameter values at every iteration and returns this history so that we may monitor the magnitude of the “loss” function.
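To make Def. 1 concrete, here is a minimal sketch of an unscaled forwards-backwards pass written as a standalone function (the interface and names are illustrative; the class method wraps the same recursions around the stored parameters, and a practical implementation would scale the vectors or work in log-space to avoid underflow on long sentences):

        import numpy as np

        def forwards_backwards(pi, A, B, x):
            # pi: (K,) initial distribution, A: (K, K) transitions,
            # B: (K, N) emissions, x: list of integer word ids of length T
            T, K = len(x), len(pi)
            alpha, beta = np.zeros((T, K)), np.zeros((T, K))

            # forward pass (online): alpha[t, k] = P(x_1:t, z_t = k)
            alpha[0] = pi * B[:, x[0]]
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]

            # backward pass (offline): beta[t, k] = P(x_t+1:T | z_t = k)
            beta[T - 1] = 1.0
            for t in range(T - 2, -1, -1):
                beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

            # summing the final alphas marginalizes P(X, Z) over Z, giving P(X)
            log_likelihood = np.log(alpha[-1].sum())
            return alpha, beta, log_likelihood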

The rest of the functions in this class shown below can also be used outside of training.

              def sentence_to_X(self, sentence_in):
                  ...                                     # -> X (list of word ids)
              def X_to_sentence(self, X_in):
                  ...                                     # -> sentence (list of words)
              def is_in_vocab(self, sentence_in):
                  ...                                     # -> boolean
              def loglik_sentence(self, sentence_in):
                  ...                                     # -> loglik_of_sentence
              @staticmethod
              def classify_review(hmm_1, hmm_5, prior, sentence_in):
                  ...                                     # -> int (1 or 5)
              def generate_sentence(self, length):
                  ...                                     # -> sentence (list of words)
  • Def. 1: This preprocessing function, used during and after training, converts sentences into sequences of integers which act as the unique identifiers for each word.

  • Def. 2: This preprocessing function operates as the inverse mapping of the previous function, mapping each unique integer to its corresponding word.

  • Def. 3: When using our HMM after training to work with new sentences, which may contain words never seen during training, we must filter each new sentence to remove these unknown words. Otherwise, when classifying reviews, the probability of the sentence will be zero, yielding a log-likelihood of -inf.

  • Def. 4: This post-training subfunction is used to classify reviews by evaluating the log-likelihood via our trained HMM.

  • Def. 5: This post-training function calls the previous function to evaluate the log-likelihood of some sentence under both the 1-star-trained HMM and the 5-star-trained HMM, and also computes the log-prior (i.e. the log of the frequency of 1-star reviews in the training set). Following Bayes’ theorem, we then add each log-likelihood to its corresponding log-prior, and if the resulting probability under the 1-star model is higher than under the 5-star model, we return the integer 1; otherwise, 5.

  • Def. 6: This post-training function uses the 1-star and 5-star trained HMMs to generate new sequences which should resemble a 1-star and a 5-star review, respectively. A sketch of the sampling procedure is given below.
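A minimal sketch of such a sampler, written as a standalone function for clarity (parameter names are illustrative; the class method of section II draws on the stored π, A, B, and word list in the same way):

        import numpy as np

        def generate_sentence(pi, A, B, word_list, length):
            # ancestral sampling: draw a hidden state, emit a word, transition, repeat
            z = np.random.choice(len(pi), p=pi)              # z_1 ~ π
            words = []
            for _ in range(length):
                x = np.random.choice(B.shape[1], p=B[z])     # x_t ~ B[z_t, :]
                words.append(word_list[x])
                z = np.random.choice(len(pi), p=A[z])        # z_t+1 ~ A[z_t, :]
            return words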

Section III: Evaluation

We will be working with a subset of reviews for restaurants in Las Vegas, which are either 1-star or 5-star reviews. After an 80:20 training-test split, where the reviews are stored as tokenized sentences (lists of words), we will look at two examples to inspect what these sentences look like.

        >>  ['This', 'place', 'tops', 'the', 'least', 'favorite', 'list', 'by', 'a', 'long', 'shot']
        >>  ['Filet', 'mignon', 'and', 'lobster', 'tail', 'was', 'very', 'good']

Now let us initialize two HMMs, one to train on 1-star reviews and the other to train on 5-star reviews. We will initialize them with eight possible hidden state realizations, since English grammar is traditionally described with eight parts of speech. We hope that by setting this hyperparameter to eight, our HMMs will learn embedded approximations to these grammatical categories, along with the sequential interdependencies with which they are realized. After initializing these HMMs, we then train each for fifty iterations (i.e. fifty EM-rounds), and track the log-likelihood of the 1-star HMM on the left and the 5-star HMM on the right.

        hmm_1 = HMM(reviews_1star_train, 8)
        hmm_5 = HMM(reviews_5star_train, 8)
        history_loglik_1 = hmm_1.learn_params(num_iterations=50)
        history_loglik_5 = hmm_5.learn_params(num_iterations=50)

Let us now proceed to test the generative capabilities of both models. We can do this by calling the ‘generate_sentence()’ function in our HMM class to generate a short review of seven words.

        sample_1star = hmm_1.generate_sentence(7)
        sample_5star = hmm_5.generate_sentence(7)

        >>  ['Now', 'tasty', 'but', 'not', 'ordered', 'in', 'appetizing']
        >>  ['We', 'amazing', 'we', 'enjoy', 'home', 'the', 'filet']

We can see that the model succeeds in modeling the sequential rules of grammar (i.e. modeling valid sequences of hidden state realizations), but does not succeed in realizing legitimate sentences out of these hidden states (i.e. realizing observations x from z). This makes sense, given that in a hidden Markov model each observable variable is conditionally independent of every other observation given its hidden state. Since each realization x depends only on the currently realized hidden state z, after modeling a valid sequence of different parts-of-speech, what this model is essentially doing is simply realizing a typical word corresponding to whatever part-of-speech is present at that point in the sequence. Thus, although this model is able to approximately learn the rules of grammar, it would require some kind of information transfer between the observable states in order to generate much more realistic sentences.

Although grammatical learning was much more successful than speech-realistic learning in our HMM, we suspect that the classification capabilities of our HMMs will yield positive results. After filtering the reviews within the test set for any never-before-seen words, and using the training set to compute the frequency of 1-star reviews, we then call the following function to evaluate the classification accuracy of our HMMs.
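A sketch of how this evaluation might be wired together, assuming the filtered test sentences and their labels are held in hypothetical lists ‘reviews_test’ and ‘labels_test’:

        import numpy as np

        # prior probability of a 1-star review, estimated from the training set
        prior_1star = len(reviews_1star_train) / (len(reviews_1star_train) + len(reviews_5star_train))

        predictions = [HMM.classify_review(hmm_1, hmm_5, prior_1star, sentence)
                       for sentence in reviews_test]
        accuracy = np.mean([pred == label for pred, label in zip(predictions, labels_test)])
        print(f"Classification accuracy for {len(reviews_test)} test instances: {accuracy}")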

        >> Classification accuracy for 206 test instances: 0.7669902912621359

We find that over three in four reviews are classified correctly by our trained models, with an accuracy of roughly 77% - which I interpret as being neither good nor useless. We saw that these HMMs were much more successful in modeling the different grammatical parts-of-speech of the English language; however, such a model would probably be better suited to classifying whether a sentence is a valid English sentence or not, rather than classifying the sentiment of a sentence. I would assume that 1-star and 5-star reviews hold almost no correlation with the grammatical structure of the review, which would mean that our HMM is essentially operating as a counting mechanism: it counts how often a word appears in 1-star versus 5-star reviews and decides, based on these counts, which rating corresponds to a new review.

HMMs are, however, used within the field of speech recognition, though the modern mechanisms which do so successfully add many more sophistications to the base model presented in this article. The current state of the art for text generation lies with the attention-based Transformer model, which models and exploits many more statistical dependencies than the HMM.

References

The hidden Markov model diagrams were taken from: S. Günnemann, Machine Learning for Graphs and Sequential Data. Technische Universität München, 2023.

Link to GitHub

To download the Python files and test the model for yourself, click here.
