What might seem like an outdated idea is an important step toward understanding machine learning and the math that governs it.

Let y denote the class of a data point, which takes one of two values (shown as red and blue blobs), and let our data have two attributes, X¹ and X².

Now, we may have a case where our data is linearly separable and we can clearly draw a line separating both classes.

But, we might also encounter situations where that is just not possible.
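To make the two situations concrete, here is a minimal sketch that uses a perceptron as a separability probe. The perceptron is an illustrative choice on my part, not necessarily the method the article goes on to develop, and the function name `train_perceptron` and the toy AND/XOR labels are assumptions made for this example. If a separating line exists, the perceptron update eventually completes a full pass with no mistakes; if none exists, it never does.

```python
def train_perceptron(points, labels, epochs=100):
    """Train a 2-feature perceptron; return ((w1, w2, b), converged)."""
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            # Labels are +1 or -1; a point is misclassified when y * score <= 0.
            score = w1 * x1 + w2 * x2 + b
            if y * score <= 0:
                # Nudge the line toward classifying this point correctly.
                w1 += y * x1
                w2 += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            # A full pass without errors: the current line separates the classes.
            return (w1, w2, b), True
    return (w1, w2, b), False


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]   # linearly separable toy labels
xor_labels = [-1, 1, 1, -1]    # no single line separates these

_, and_separable = train_perceptron(points, and_labels)
_, xor_separable = train_perceptron(points, xor_labels)
print(and_separable, xor_separable)
```

Note that a `False` result only means no separating line was found within the epoch budget; for the XOR labels, no such line exists at all.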

This article walks through a use case of CNNs for NLP, with practical implementation details in PyTorch and Python. It is based on the paper “Character-Aware Neural Language Models”.

**Language models:** A language model assigns a probability to a sentence or a phrase. Given some words, we can predict the next word as the one that gives the phrase the highest probability. We often see language models used for text completion in messaging services.

The children play in the ________

A good language model would be able to predict the next word as ‘park’ or something else that makes sense.
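As a rough illustration of next-word prediction, the sketch below builds a simple bigram count model. This is a deliberately simpler technique than the character-aware neural model from the paper, swapped in only to show the idea of assigning probabilities to candidate next words. The toy corpus and the function names `train_bigram_counts` and `next_word_probs` are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev_word, next_word in zip(tokens, tokens[1:]):
            counts[prev_word][next_word] += 1
    return counts


def next_word_probs(counts, prev_word):
    """Estimate P(next word | previous word) from the raw counts."""
    following = counts.get(prev_word, Counter())
    total = sum(following.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in following.items()}


# Hypothetical toy corpus, just large enough to make one continuation dominate.
corpus = [
    "the children play in the park",
    "the dogs run in the park",
    "we walk in the park",
    "the children play in the garden",
]

counts = train_bigram_counts(corpus)
probs = next_word_probs(counts, "the")
best_word = max(probs, key=probs.get)
print(best_word, probs[best_word])
```

A bigram model only looks at the single preceding word, so it cannot use the rest of the sentence; neural language models exist largely to overcome that limitation.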

…

This article aims to provide intuition for word embeddings using information from some of the most famous papers in the field.

The other choice that we have is to represent each word in a document as a one-hot encoded vector. Let’s try to think about some problems with this.

**Vocabulary Size**: Each word would be represented by a vector as long as your vocabulary. That length could be on the order of millions.
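The size problem is easy to see in a small sketch. The helper names `build_vocab` and `one_hot` and the two toy documents below are assumptions made for illustration; the point is only that every vector is as long as the vocabulary and contains a single 1.

```python
def build_vocab(documents):
    """Map each distinct word to an integer index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


docs = [
    "the children play in the park",
    "the dogs run in the park",
]

vocab = build_vocab(docs)
park_vector = one_hot("park", vocab)
print(len(vocab), park_vector)
```

With only seven distinct words the vectors are tiny, but the length grows one-for-one with the vocabulary, so a vocabulary of a million words means a million-dimensional vector per word.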

*This article aims to give the reader a flavor of the topic so that they can explore it in more depth on their own, using the resources mentioned at the end.*

Whether you’re doing least-squares approximation or reading a new paper in deep learning, a fundamental understanding of linear algebra is critical to internalizing key concepts.

First, a few definitions that we should all know:

**Column space:** You can picture each column of a matrix as a vector in space. The set of all linear combinations of these column vectors, that is, the space they span, is the column space.
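One way to make the definition tangible is to test whether a given vector lies in the column space of a matrix. The sketch below relies on a standard fact: a vector is in the column space exactly when appending it as an extra column does not increase the rank. The helper names `rank` and `in_column_space` and the example matrix are assumptions made for illustration, and exact fractions are used to avoid floating-point issues.

```python
from fractions import Fraction


def rank(matrix):
    """Compute the rank of a matrix (list of rows) by Gaussian elimination."""
    rows = [[Fraction(value) for value in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row == n_rows:
            break
        # Find a row at or below pivot_row with a nonzero entry in this column.
        candidate = next(
            (r for r in range(pivot_row, n_rows) if rows[r][col] != 0), None
        )
        if candidate is None:
            continue
        rows[pivot_row], rows[candidate] = rows[candidate], rows[pivot_row]
        # Eliminate this column's entries below the pivot.
        for r in range(pivot_row + 1, n_rows):
            factor = rows[r][col] / rows[pivot_row][col]
            rows[r] = [a - factor * p for a, p in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    return pivot_row


def in_column_space(matrix, vector):
    """True if `vector` is a linear combination of the columns of `matrix`."""
    augmented = [list(row) + [value] for row, value in zip(matrix, vector)]
    return rank(augmented) == rank(matrix)


# The second column is twice the first, so the column space is a single line.
A = [[1, 2], [2, 4], [3, 6]]

print(rank(A))
print(in_column_space(A, [1, 2, 3]))
print(in_column_space(A, [1, 0, 0]))
```

Here the two columns are dependent, so they span only a line through the origin: a vector along that line is in the column space, and any vector off it is not.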

ML/DS enthusiast