Language model from scratch with Pytorch: GloVe

GloVe captures both the global statistical properties of a corpus and the local context modeling of Word2Vec. By combining the strengths of both approaches, GloVe quickly became a standard technique for generating word vectors.

mz bai
Sep 21, 2024

For those interested in Word2Vec, see the earlier article in this series.

Illustration of GloVe (generated with a ChatGPT prompt and DALL-E 3)

A deeper look at text representation in NLP

In the early stages of text representation in NLP, one-hot encoding was the dominant method for encoding tokens into categories — for example, classifying ‘apple’ separately from ‘banana.’ Alternatively, count vectors, and their more advanced version, Bag of Words (BoW), were used to represent the importance of tokens within a document.

However, these methods suffered from the curse of dimensionality, especially when applied to large corpora.

Later, statistical approaches like Tf-Idf and Latent Semantic Analysis (LSA) introduced denser, more meaningful text representations. Their ability to capture statistical patterns with compact dimensions quickly made them standard encoding techniques.

The introduction of neural network language models (NNLM) and recurrent neural network language models (RNNLM) revolutionized text representation further. These models leveraged loss functions designed around various linguistic structures — semantic, syntactic, and lexical — enabling them to learn meaningful representations in lower-dimensional spaces.

Word2Vec, in particular, demonstrated its ability to capture compositionality and semantic similarity, excelling in tasks such as word analogy and basic QA reasoning.

Local context modeling and global statistics modeling

GloVe and Word2Vec both belong to the family of neural network language models (NNLM), but they stem from different lineages.

GloVe is part of the global statistics modeling approach to building text representations, which also includes methods like Hyperspace Analogue to Language (HAL), COALS, Tf-Idf, and Latent Semantic Analysis (LSA).

These methods create word representations using global information, such as term frequency across the entire corpus, document-level statistics, or co-occurrence counts.

They often employ matrix factorization to uncover latent concept combinations and relationships between words.

Similar techniques like PPMI and Hellinger PCA also belong to this lineage, sharing many operational similarities.

In contrast, Word2Vec belongs to the local context modeling group. This approach captures relationships between words by analyzing their local context — words that appear in similar contexts tend to have similar meanings.

Examples of models from this group include SkipGram, CBOW, and vLBL/ivLBL.

Despite their differences, these two lineages share some similarities. Both rely on sampling words and their context through sliding windows and aim to capture word relationships based on their neighbors, reflecting an intuitive understanding of language structure.

Word-context co-occurrence matrix and probability ratios

The following table illustrates a fundamental idea behind GloVe. Given a probe word k, the ratio of how often k co-occurs with “ice” versus with “steam” is large when k is “solid” and small when k is “gas”. That is, “ice” co-occurs more strongly with words like “snow” or “solid” than with “gas” or “liquid”, while “steam” shows the opposite tendency.

This relationship becomes apparent when using ratios of co-occurrence counts rather than relying solely on conditional probabilities.

Examining these ratios makes it clear that co-occurrence counts carry useful information about the similarity between words and their contexts.

This insight is foundational to the design of GloVe’s loss function.

Co-occurrence probability table: how often a probe word k appears with word i (“ice”) compared with word j (“steam”)
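In the notation of the GloVe paper, where X_{ij} counts how often word j appears in the context of word i, the quantities behind the table are:

P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}, \qquad X_i = \sum_k X_{ik}

The ratio P_{ik} / P_{jk} is large when k relates only to word i, small when k relates only to word j, and close to 1 when k relates to both or to neither.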

Iterative weighted least squares on a log-bilinear model

Similar to SkipGram and ivLBL, GloVe is a log-bilinear model that uses the inner product of word vectors and context vectors to estimate either the probability of the target word or the co-occurrence count.

Additionally, GloVe employs a co-occurrence count-weighted mean squared error (MSE) loss function. This weighted MSE function reduces the importance of rare co-occurrence events and reverts to standard least squares when the co-occurrence event frequency exceeds a specified cutoff.

GloVe objective function
Weighting function f with α = 3/4
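Written out, the objective and weighting function from the GloVe paper (with the paper’s defaults x_max = 100 and α = 3/4) are:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

f(x) = (x / x_{\max})^{\alpha} \text{ if } x < x_{\max}, \quad f(x) = 1 \text{ otherwise}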

Relationship to Word2Vec

GloVe can be viewed as a global extension of SkipGram. Specifically, it can be seen as applying a co-occurrence count weighted version of the SkipGram loss function.

The derivation of this relationship is detailed on the 5th page of the GloVe paper.
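The end point of that derivation is a least-squares objective weighted by the raw counts,

\hat{J} = \sum_{i,j} X_i \left( w_i^\top \tilde{w}_j - \log X_{ij} \right)^2

which GloVe then generalizes by replacing the pre-factor X_i with the capped weighting function f(X_{ij}) shown above.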

Evaluation

The model was trained on WikiText2, using a vocabulary of approximately 20,000 words. Each word vector has 300 dimensions.

For evaluation, we used the same word analogy task dataset from Microsoft as used in Word2Vec. The model achieved an accuracy of 5%.
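A minimal sketch of how such an analogy evaluation can be run with the 3CosAdd rule (the helper names and the dataset loading here are illustrative, not the article’s exact code):

```python
import torch.nn.functional as F

def analogy_accuracy(emb, questions, word2idx):
    """Answer 'a : b :: c : ?' questions by nearest neighbour of b - a + c.

    emb       -- (V, D) tensor of trained word vectors
    questions -- iterable of (a, b, c, expected) word tuples
    word2idx  -- vocabulary lookup; questions with out-of-vocab words are skipped
    """
    emb = F.normalize(emb, dim=1)                 # unit rows, so dot product = cosine
    correct = total = 0
    for a, b, c, expected in questions:
        if any(w not in word2idx for w in (a, b, c, expected)):
            continue
        ia, ib, ic = word2idx[a], word2idx[b], word2idx[c]
        query = emb[ib] - emb[ia] + emb[ic]       # e.g. king - man + woman
        scores = emb @ query
        scores[[ia, ib, ic]] = -float("inf")      # never return the question words
        correct += int(scores.argmax().item() == word2idx[expected])
        total += 1
    return correct / max(total, 1)
```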

Code
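Below is a minimal PyTorch sketch of the pieces described above, the log-bilinear model and the count-weighted loss; this is a sketch rather than the article’s original implementation, with x_max = 100 and α = 0.75 taken from the paper’s defaults:

```python
import torch
import torch.nn as nn

class GloVe(nn.Module):
    """Two embedding tables plus per-word biases, one scalar output per (i, j) pair."""
    def __init__(self, vocab_size, dim=300):
        super().__init__()
        self.wi = nn.Embedding(vocab_size, dim)   # center-word vectors
        self.wj = nn.Embedding(vocab_size, dim)   # context-word vectors
        self.bi = nn.Embedding(vocab_size, 1)     # center-word biases
        self.bj = nn.Embedding(vocab_size, 1)     # context-word biases
        for p in self.parameters():
            nn.init.uniform_(p, -0.5 / dim, 0.5 / dim)

    def forward(self, i, j):
        # w_i . w~_j + b_i + b~_j
        dot = (self.wi(i) * self.wj(j)).sum(dim=1)
        return dot + self.bi(i).squeeze(1) + self.bj(j).squeeze(1)

def glove_loss(pred, counts, x_max=100.0, alpha=0.75):
    """Count-weighted squared error against log co-occurrence counts.

    counts holds the nonzero X_ij values for the (i, j) pairs in the batch.
    """
    weight = torch.clamp(counts / x_max, max=1.0) ** alpha
    return (weight * (pred - torch.log(counts)) ** 2).mean()
```

Training iterates over the nonzero entries of the co-occurrence matrix; following the paper, the final word vectors are taken as the sum of the center and context embeddings, wi + wj.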

Reference

GloVe: Global Vectors for Word Representation

Linguistic Regularities in Sparse and Explicit Word Representations

Word Embeddings through Hellinger PCA

Learning word embeddings efficiently with noise-contrastive estimation

Appendix A: PPMI

Positive Pointwise Mutual Information (PPMI) is a sparse text representation, similar to one-hot encoding or count vectors, but it operates in an even higher-dimensional space.

PPMI encodes each word based on how much more often the word and its context co-occur than they would if they were independent. The same measure was used in Mikolov’s Word2Vec paper to iteratively build phrases by joining adjacent words with strong associations.

In this case, the “context” refers to both the surrounding words and their positions. For example, with a context window of size 5, a word is represented by a vector up to four times the size of the vocabulary (one block of vocabulary-sized counts per context position).

PPMI is the positive part of the log-ratio between the joint probability of two events and the product of their marginals:

PPMI(w, c) = \max\left(0, \log \frac{P(w, c)}{P(w)\,P(c)}\right)

Since each word is encoded only from its appearances in the central position, learning reliable PPMI representations requires large corpora to provide enough co-occurrence evidence.

Illustration of PPMI co-occurrence counting.
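A minimal numpy sketch of the computation, assuming a word-context co-occurrence count matrix has already been collected (positional contexts would simply add more columns):

```python
import numpy as np

def ppmi(counts, eps=1e-8):
    """Positive PMI from a (V, C) word-context co-occurrence count matrix."""
    total = counts.sum()
    p_joint = counts / total                      # P(w, c)
    p_word = p_joint.sum(axis=1, keepdims=True)   # P(w)
    p_ctx = p_joint.sum(axis=0, keepdims=True)    # P(c)
    pmi = np.log((p_joint + eps) / (p_word * p_ctx + eps))
    return np.maximum(pmi, 0.0)                   # clip negative associations to zero
```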

Appendix B: Hellinger PCA

Hellinger PCA uses an n-gram context co-occurrence matrix as the base, then applies Principal Component Analysis (PCA) using the Hellinger distance as the metric and minimizing the reconstruction error.

I implemented this approach using kernel PCA with the Hellinger distance as the kernel. For word representation, the n-gram was set to unigram only.

As a result, Hellinger PCA achieved the best accuracy of 25.2% on the word analogy dataset among the five models tested.

However, it is quite space-consuming, since the input matrix, eigenvalues, and eigenvectors all need to be stored for inference.

Flow of Hellinger PCA
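A minimal sketch of the direct form of Hellinger PCA (square-root transform followed by PCA/SVD); the kernel PCA route used above amounts to the same thing when the kernel is the inner product of the square-rooted rows:

```python
import numpy as np

def hellinger_pca(counts, dim=300):
    """Hellinger PCA on a (V, C) word-context co-occurrence count matrix.

    Rows are normalized to conditional distributions P(c | w); after an
    element-wise square root, Euclidean distance between rows is proportional
    to the Hellinger distance, so plain PCA/SVD minimizes the reconstruction
    error under that metric.
    """
    probs = counts / counts.sum(axis=1, keepdims=True)    # P(c | w) per row
    roots = np.sqrt(probs)
    centered = roots - roots.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :dim] * s[:dim]                           # (V, dim) word embeddings
```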

Appendix C: vLBL / ivLBL

The Vector Log-Bilinear Language Model (vLBL) and its inverse (ivLBL) are part of the local context lineage of neural language models. Similar to CBOW and SkipGram, these models use word-context pairs as learning tasks, but vLBL/ivLBL extend this approach by incorporating position-based weights and biases.

Additionally, they use Noise Contrastive Estimation (NCE) as an efficient method for calculating the loss function, which allows for more scalable training on large datasets.

NCE approximates the true distribution by reframing learning as classifying data samples against noise samples: the posterior probability that a sample is real is computed from the ratio of the model’s likelihood to the noise distribution’s likelihood, which removes the need for a normalization constant.
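A minimal PyTorch sketch of that NCE objective, where the scores are the model’s unnormalized log-probabilities (for vLBL/ivLBL, the position-weighted log-bilinear inner products); variable names here are illustrative:

```python
import math
import torch.nn.functional as F

def nce_loss(true_scores, noise_scores, log_pn_true, log_pn_noise, k):
    """Noise Contrastive Estimation: classify data samples against noise samples.

    true_scores  -- (B,)   model log-score s(w, c) of each observed word
    noise_scores -- (B, k) model log-scores of k sampled noise words
    log_pn_true  -- (B,)   log Pn(w) of each observed word under the noise distribution
    log_pn_noise -- (B, k) log Pn(w') of each noise word
    """
    log_k = math.log(k)
    data_logit = true_scores - (log_pn_true + log_k)    # logit of P(data | w, c)
    noise_logit = noise_scores - (log_pn_noise + log_k)
    # maximize log sigma(data_logit) + sum over noise of log(1 - sigma(noise_logit))
    loss = -(F.logsigmoid(data_logit) + F.logsigmoid(-noise_logit).sum(dim=1))
    return loss.mean()
```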

ivLBL achieves consistently better performance than vLBL, with the context embedding yielding the best results among the tested configurations, reaching 23.0% accuracy on the word analogy dataset.

It could reach 27% accuracy when noise samples were generated for the context instead of for the target word.
