The Rise of NLP: Word Embeddings (Part 1)

--

From spewing out incoherent word predictions to holding a human–computer conversation that appears “sentient” to the tester, Natural Language Processing (NLP) has come a long way. This article series aims to summarise the building concepts of NLP as a helpful guide for beginners wishing to dabble in this field.

Understanding Human Language

The building blocks of human language are individual words. Newborn infants tap into the power of communication by picking up scattered individual words; therefore, when the idea of teaching computers human language was conceived, it was only logical to kickstart the process by feeding words into the machine brain.

A problem hence arises: since computers can only decipher numbers, how are we to represent the meaning of words in a way that computers can understand?

Denotational Semantics

The most common way of perceiving the relationship between words and their meanings is to liken a word to a symbol that signifies an idea or object.

In denotational semantics, words are viewed as discrete symbols. To speak the computer's language, these symbols are represented as one-hot vectors containing only the integers 0 and 1.

Let the size of these vectors be denoted by n (the number of unique words in the corpus); each vector consists of n-1 values of 0 and a single value of 1, placed at a unique position to represent each unique word.

Visualisation of One-Hot Vectors
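As a minimal sketch (the toy vocabulary and helper function below are illustrative, not taken from any particular library), one-hot vectors can be built in Python like this:

import numpy as np

# Toy vocabulary; n is the number of unique words in the corpus
vocab = ["student", "pupil", "teacher", "school"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of n-1 zeros and a single 1 at the word's unique position
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("student"))  # [1. 0. 0. 0.]
print(one_hot("pupil"))    # [0. 1. 0. 0.]

# The dot product of any two distinct one-hot vectors is always 0,
# which is exactly the limitation discussed next
print(one_hot("student") @ one_hot("pupil"))  # 0.0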

However, an inherent flaw exists with this view of semantics: these vectors are unable to represent the underlying similarity between synonyms. The vectors for student and pupil differ only in the position of the 1, and their dot product is zero, just as it is for any other pair of words; there is no way to see any relationship between them, handicapping the ability of machines to use vocabulary in a flexible manner.

Up to this point, the common solution in NLP was to create sets of synonyms and hypernyms by hand (such as WordNet); a few problems nevertheless remain unsolved.

Firstly, as the English vocabulary contains more than 600 thousand words, the size of these vectors and synonym sets can grow to an astronomical amount, taking up too much storage space.

Secondly, manual human labour is required to create and adapt these sets, introducing subjectivity and bias into the lexical relationships. Manual effort also means these synonym sets will never catch up with the latest terms and colloquialisms. In addition, there is no way to quantify the similarity between words.

Distributional Semantics

To solve the aforementioned problems that stem from the denotational view, a new way of perceiving words and their meanings emerged. In distributional semantics, a word's meaning is derived from its context, that is, the neighbouring words that frequently appear alongside the target word. This view aims to let machine models encode word similarity by themselves, without human labour.

The context of a word w is the set of words that appear nearby, within a fixed-size window. The many contexts in which w appears build up a representation of w.

Context Words Representing the Word “bank”
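As a rough illustration (the sentence and window size here are invented for this sketch), the fixed-size window contexts of a word can be collected as follows:

# Collect the context words of a target word within a fixed-size window
def get_contexts(tokens, target, window=2):
    contexts = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + right)
    return contexts

sentence = "the customer opened a bank account at the local bank branch".split()
print(get_contexts(sentence, "bank"))
# [['opened', 'a', 'account', 'at'], ['the', 'local', 'branch']]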

These word representations are also known as word vectors or word embeddings. Unlike one-hot vectors, these word vectors are statistically similar to the vectors of other words that appear in similar contexts to the target word. In essence, this means that the word vectors for student and pupil will be similar.

There are two methods of word vectorisation: count-based and predictive-based approaches.

Count-based

Certain combinations of words and phrases occur more frequently than others; for example, “so far so good” appears far more often than “so near so bad”. Combinations with higher frequencies are usually the more common ones in our everyday usage of language.

In count-based methods, these co-occurrence frequencies are counted from the corpus and documented in a co-occurrence matrix.

Co-occurrence Matrix for a Corpus of Three Sentences (Image Source)
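A co-occurrence matrix for a tiny three-sentence corpus can be built along these lines (a sketch; the window size of 1 and the toy corpus are chosen for illustration):

import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
window = 1

# Build the vocabulary and an empty |V| x |V| matrix
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sent in tokens for word in sent})
index = {word: i for i, word in enumerate(vocab)}
matrix = np.zeros((len(vocab), len(vocab)), dtype=int)

# Count how often each pair of words appears within the window
for sent in tokens:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                matrix[index[word], index[sent[j]]] += 1

print(vocab)
print(matrix)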

It can be inferred that the matrix grows with the vocabulary, increasing the number of dimensions and thereby requiring more storage; issues with sparsity (low information density) also arise from the many 0s in the matrix, leading to less robust models.

The solution to the aforementioned sparsity and storage issues is to use low-dimensional vectors that store most of the important information in a fixed number of dense dimensions. The high-dimensional vectors go through dimensionality reduction using Singular Value Decomposition (SVD).

In SVD, the co-occurrence matrix is broken down into three matrices that can be multiplied together to recreate the original matrix.

Visualisation of SVD Formula (Image Source)
Examples of SVD Formula (Image Source)

The U and V matrices break down the dimensions of the input, whereas the Σ matrix indicates the importance of these dimensions. Unimportant dimensions (the value 1.3 in the example figure above) are set to 0, which effectively removes those dimensions from the matrix.
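Continuing the toy example, here is a minimal sketch of this reduction using NumPy's SVD (the matrix values correspond to the toy corpus above, with the vocabulary reordered for readability; the singular values in S play the role of Σ, and keeping k = 2 dimensions is an arbitrary choice):

import numpy as np

# A small co-occurrence matrix (rows and columns share the same word order)
vocab = ["i", "like", "enjoy", "deep", "learning", "nlp", "flying"]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
], dtype=float)

# Full SVD: X = U @ diag(S) @ Vt, singular values in S sorted by importance
U, S, Vt = np.linalg.svd(X)

# Keep only the k largest singular values; the rest are effectively set to 0
k = 2
word_vectors = U[:, :k] * S[:k]   # one dense k-dimensional vector per word

for word, vector in zip(vocab, word_vectors):
    print(word, vector.round(2))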

Predictive-based

In this method, computer models are tasked with assigning a probability to a sequence of words (these models are known as language models). A logical sentence such as “The cat jumped over the puddle.” should be assigned a higher probability than “Stock boil fish is toy.” because the former is syntactically and semantically valid, whereas the latter makes no sense.

One type of language model is the n-gram model, where n is the number of words in a gram (written unit). There are unigrams, bigrams, trigrams, 4-grams, and so on.

In the unigram model, all words are assumed to occur completely independently of one another. For the sentence “The cat jumped over the puddle.”, the unigrams are simply [“The”, “cat”, “jumped”, “over”, “the”, “puddle”], and probabilities are assigned according to the frequency of occurrence of each unigram. This assumption is unreasonable because the probability of the next word usually depends on the previous sequence of words, so the unigram model is not a good choice.

On the other hand, the bigram model assumes that the occurrence probability of a word depends on the one word before it. For the same sentence, the bigrams are [“The cat”, “cat jumped”, “jumped over”, “over the”, “the puddle”], which are subsequently assigned probabilities of occurrence. This model is still limited because context is often longer than two words, but it works well enough in practice.

Trigrams and 4-grams work in the same way, except that each written unit contains three or four words respectively.
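A rough sketch of how unigram and bigram probabilities could be estimated from raw counts (the toy corpus below is invented for illustration):

from collections import Counter

corpus = "the cat jumped over the puddle . the cat sat on the mat .".split()

# Unigram model: each word's probability is simply its relative frequency
unigram_counts = Counter(corpus)
total = len(corpus)
p_the = unigram_counts["the"] / total

# Bigram model: probability of a word given the single word before it
bigram_counts = Counter(zip(corpus, corpus[1:]))
p_cat_given_the = bigram_counts[("the", "cat")] / unigram_counts["the"]

print(f"P(the) = {p_the:.2f}")                 # unigram probability
print(f"P(cat | the) = {p_cat_given_the:.2f}") # bigram probability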

In the Word2Vec framework, the algorithm calculates the probability of occurrence of each context word (the other words within a window around the target) with respect to a centre word (the target word), then adjusts the word vectors so that these probabilities match what is observed in the corpus.

In a multidimensional vector space, similar words are placed in a cluster. Similarity between vectors can now be computed using cosine similarity.

Blue: Context Word, Orange: Centre Word (Image Source)
Word Clusters in Multidimensional Vector Space (Image Source)
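Cosine similarity itself is straightforward to compute; here is a minimal sketch with made-up two-dimensional vectors:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: near 1 = very similar direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

student = np.array([0.8, 0.6])   # hypothetical embeddings
pupil   = np.array([0.7, 0.7])
banana  = np.array([-0.5, 0.9])

print(cosine_similarity(student, pupil))   # close to 1
print(cosine_similarity(student, banana))  # much lower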

The Word2Vec model has two variants, Continuous Bag of Words (CBOW) and Skip-Gram, both using a shallow feed-forward neural network. In CBOW, the model has to predict the centre word from the given context words (predict “awesome” when given “thank”, “you”, “for”, “such”, “top”); the context words can be in any order. In contrast, the Skip-Gram model predicts the context words from the centre word (predict “thank”, “you”, “for”, “such”, “top” when given “awesome”).
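As a sketch of the idea behind Skip-Gram (not a full training loop; the vocabulary and randomly initialised vectors are placeholders), the probability of a context word given a centre word is a softmax over the dot products of their vectors, and training would nudge the vectors so that observed context words receive higher probabilities:

import numpy as np

rng = np.random.default_rng(0)
vocab = ["thank", "you", "for", "such", "top", "awesome"]
dim = 8

# Two embedding tables: one for centre words, one for context words
centre_vectors  = rng.normal(size=(len(vocab), dim))
context_vectors = rng.normal(size=(len(vocab), dim))

def p_context_given_centre(context_idx, centre_idx):
    # Softmax over dot products of the centre vector with every context vector
    scores = context_vectors @ centre_vectors[centre_idx]
    probabilities = np.exp(scores) / np.exp(scores).sum()
    return probabilities[context_idx]

centre = vocab.index("awesome")
for word in ["thank", "you", "for", "such", "top"]:
    print(word, p_context_given_centre(vocab.index(word), centre))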

The word vectors generated by Word2Vec are based on probabilities calculated from a local context (neighbouring words) instead of utilising the entire corpus.

Global Vectors for Word Representation (GloVe)

Tabularised Pros and Cons for Count-based and Predictive-based Word Vectorisation

As count-based and predictive-based methods each have their own set of pros and cons, an algorithm named GloVe was developed to combine the best of both worlds.

GloVe is based on the idea of encoding components of meaning as differences between vectors, using ratios of co-occurrence probabilities; it draws on corpus-wide counts (from count-based methods) while still learning word vectors that model probabilities (from predictive-based methods).

Tabulated Ratio of Co-occurrence Probabilities (Image Source)

From the table above, it can be read that the word solid is highly relevant only to ice, gas is highly relevant only to steam, water is highly relevant to both ice and steam, and fashion is irrelevant to both. When the probe word k is relevant (or irrelevant) to both words, the ratio of the two probabilities will be approximately 1.
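A sketch of how such ratios could be computed from raw co-occurrence counts (the counts below are invented and are not the values behind the table above):

import numpy as np

# Hypothetical co-occurrence counts of probe words with "ice" and "steam"
probes = ["solid", "gas", "water", "fashion"]
counts_with_ice   = np.array([180, 20, 600, 5], dtype=float)
counts_with_steam = np.array([ 10, 150, 580, 4], dtype=float)

# P(k | ice) and P(k | steam): co-occurrence counts normalised per word
p_given_ice   = counts_with_ice / counts_with_ice.sum()
p_given_steam = counts_with_steam / counts_with_steam.sum()

# The ratio is large for ice-only words, small for steam-only words,
# and close to 1 for words relevant (or irrelevant) to both
for probe, ratio in zip(probes, p_given_ice / p_given_steam):
    print(f"P({probe}|ice) / P({probe}|steam) = {ratio:.2f}")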

The word vectors produced by GloVe are based on the ratio of co-occurrence probabilities from a global context, utilising the entire corpus. The algorithm is fast to train, scales to huge corpora and maintains good performance even with small corpora and small vectors.
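In practice, pretrained GloVe vectors can be loaded and explored with a library such as gensim; this is a sketch assuming gensim and its downloadable data are available, and the exact model name and method names may vary between versions:

import gensim.downloader as api

# Download pretrained GloVe vectors (trained on Wikipedia + Gigaword, 100 dimensions)
glove = api.load("glove-wiki-gigaword-100")

# Words appearing in similar contexts end up close together in vector space
print(glove.most_similar("student", topn=5))
print(glove.similarity("student", "pupil"))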

Given that much work has been done to ensure words are properly represented to computers, these word embeddings can then be used for language modelling, which will be explored in the next part of this series.

--

Liana

Third-year software engineering student who is interested in AI. Speaks English, Mandarin, Malay, Java, Python, C++... in decreasing proficiency.