In previous parts of this class we have talked about the word2vec concept in the context of natural language processing. The key idea here is that each word in the vocabulary will be mapped to a vector, and these vectors are also sometimes called embeddings. So if w_i represents the i-th word in a vocabulary, what we're going to do is map it to a d-dimensional vector, and we're going to do that for every word in our vocabulary. We've talked about this in the context of the long short-term memory for natural language processing. We're now going to extend this concept to what's called the transformer network, or an attention-based network. In so doing, the hope is that we'll gain some further insight into what these word embeddings or word vectors represent.

So the key idea of this word2vec mapping, word to vector, is that we're going to take our vocabulary, which is composed of V words, and map it to a collection of corresponding vectors or codes. We use the notation c for code, and so if we have V words in our vocabulary, we now have V codes, each of which is a d-dimensional vector. After doing this, we represent our vocabulary of words by a codebook. Again, a codebook is the collection of d-dimensional vectors, one for each of the words in our vocabulary.

There are many ways to learn these code vectors, and in fact we have talked about different ways of learning them in other parts of this class, including the long short-term memory. We're going to, as I said earlier, extend this to a new type of model, the transformer. The key idea in all of these methods is that, for each word in a given document, we should be able to predict the next or surrounding words. The idea is that words have meaning, and that meaning implies that a given word should indicate that a particular other word might be present with high probability. These word vectors, which we are going to learn, are meant to capture that concept.

What we want to do now is try to achieve a further understanding of what these vectors mean. This will help motivate the model that we're going to talk about subsequently. Just to simplify the discussion, let's assume that each of these word vectors is ten-dimensional. In this figure, each of the boxes represents one of the ten numbers, corresponding to the ten components of the vector. The idea we want to think about is that there are underlying topics associated with each of the ten numbers in our ten-dimensional word vector. Each of these topics represents characteristics of words, or of what one might be trying to represent with words. So if a given word is aligned with, say, the k-th topic, then we would expect the corresponding component of the vector to be positive, because the word aligns with that topic. If the k-th topic is not connected to the word, or does not align with the word, then we would expect the value of the k-th element to be negative. We'll see this further as we proceed.

Just notionally, let's look at a particular word, Paris, as an example. What I'm doing here is assigning themes or topics to each of the components of that ten-dimensional vector. These themes or topics will be learned in the context of our machine learning. The assignment of particular names to them here, sport, politics, history, is notional.
We don't explicitly do that, but it's meant to give some understanding of the underlying concepts associated with this word vector. If we look at the word Paris: historically Paris has played an important role. It's the capital of France, and it has played an important role in politics. So we see, notionally, that the second component of the vector is positive, which means that Paris is positively aligned with politics. Also, as the capital city of an important country, France, Paris has a very important role in history, so the third component, which corresponds to history, is positive. The next, fourth, component notionally corresponds to action, or a verb. Paris is a proper noun; it doesn't have much explicitly to do with action, so we see that the action component is negative. Paris is a name, the name of a city, so the next component, which corresponds to names, is positive.

The idea we're trying to reflect here is that the underlying components of the word vector have some topical or thematic meaning. We don't explicitly have to know what that meaning is; it is uncovered through the machine learning. So the idea we want to gather from this is that the word vector is representing, component by component, the underlying thematic meaning of the word: each component is positive if the word is aligned with the corresponding topic, and negative if it is not. That's what this word2vec concept is meant to reflect.

Now, what we're going to do next is take this idea of what the word vector is actually representing, and use it to constitute a new form of machine learning model for natural language processing.
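To make the codebook idea concrete, here is a minimal sketch in Python with NumPy, assuming a tiny hand-built vocabulary. The topic labels and all numerical values below are hypothetical, chosen only to mirror the notional Paris example above; in an actual word2vec or transformer model the code vectors are learned from data, and the individual components do not come with explicit names.

```python
# Minimal sketch of the word2vec "codebook" idea.
# Topic labels and all numbers are illustrative, not learned values.
import numpy as np

d = 10  # embedding dimension used in the lecture's example

# Notional names for the ten components (a real model never assigns these).
topics = ["sport", "politics", "history", "action", "names",
          "topic6", "topic7", "topic8", "topic9", "topic10"]

# Toy codebook: one d-dimensional code vector per word in the vocabulary.
# In practice these vectors are learned by predicting surrounding words.
codebook = {
    # Paris: aligned with politics, history, and names; not with action.
    "Paris":    np.array([-0.2,  0.9,  0.8, -0.7,  0.6,
                           0.1, -0.3,  0.2, -0.1,  0.4]),
    # football: aligned with sport and action (hypothetical second word).
    "football": np.array([ 0.9, -0.4, -0.1,  0.8, -0.5,
                           0.2,  0.1, -0.2,  0.3, -0.1]),
}

def embed(word):
    """Map a word w_i in the vocabulary to its d-dimensional code c_i."""
    return codebook[word]

def cosine(u, v):
    """Similarity between word vectors: high when their component signs agree."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Inspect the notional Paris vector component by component.
paris = embed("Paris")
for topic, value in zip(topics, paris):
    sign = "aligned" if value > 0 else "not aligned"
    print(f"{topic:>10}: {value:+.1f}  ({sign})")

print("similarity(Paris, football):", round(cosine(paris, embed("football")), 3))
```

The dictionary here simply plays the role of the codebook: a lookup from each of the V words to its d-dimensional code. The cosine similarity at the end is only meant to illustrate that words whose components align with the same underlying topics end up close together, which is the property the learned embeddings are intended to have.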