CS224 Lecture 1 – Word2Vec
1 Introduction to Natural Language Processing
Natural Language is a discrete/symbolic/categorical system.
1.1 Examples of tasks
There are many different levels in NLP. For instance:
Easy:
 Spell Checking
 Keyword Search
 Finding Synonyms
Synonym: a word that means the same as another word (noun)
Medium:
 Parsing information from websites, documents, etc.
Parse: to analyze the grammatical structure of a sentence (verb)
Hard:
 Machine Translation
 Semantic Analysis (What is the meaning of a query statement?)
 Coreference (e.g. What does “he” or “it” refer to in a given document?)
Coreference: the relation between expressions that refer to the same entity
 Question Answering
1.2 How to represent words?
1.2.1 WordNet

Use e.g. WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships).
hypernym: a word with a broader meaning that includes a more specific word (a superordinate term)
Problems with resources like WordNet
 Great as a resource but missing nuance (e.g. “proficient” is listed as a synonym for “good”, which is only correct in some contexts)
 Missing new meanings of words
 Subjective
 Requires human labor to create and adapt
 Can’t compute accurate word similarity
1.2.2 Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols: hotel, conference, motel – a localist representation.
Words can be represented by one-hot vectors; one-hot means a single 1 and the rest are 0s. Vector dimension = number of words in the vocabulary.
 Problem with words as discrete symbols:
For example, if user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.
But:
hotel = [0\quad 0\quad 0\quad 0\quad 0\quad 1\quad \cdots \quad 0]\\motel = [0\quad 0\quad 0\quad 0\quad 1\quad 0\quad \cdots \quad 0]
These two vectors are orthogonal, so one-hot vectors give no natural notion of similarity.
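A minimal sketch of this problem (the toy vocabulary and word positions below are made up for illustration): the one-hot vectors for “hotel” and “motel” have zero dot product, so they carry no similarity signal.

```python
import numpy as np

# Toy vocabulary; the word order (and hence the indices) is arbitrary.
vocab = ["the", "a", "Seattle", "stay", "conference", "hotel", "motel"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index and 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

hotel = one_hot("hotel", vocab)
motel = one_hot("motel", vocab)
print(hotel @ motel)  # 0.0 -- orthogonal, so "hotel" and "motel" look unrelated
```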

Solution:
 Could try to rely on WordNet’s list of synonyms to get similarity?
 But it is well known to fail badly: incompleteness, etc.
 Instead: Learn to encode similarity in the vectors themselves
1.2.3 Representing words by their context
 Distributional semantics: A word’s meaning is given by the words that frequently appear close by.
 “You shall know a word by the company it keeps” (J.R. Firth 1957: 11)
 One of the most successful ideas of modern statistical NLP
 When a word “w” appears in a text, its context is the set of words that appear nearby (within a fixed-size window).
 Use the many contexts of “w” to build up a representation of “w”.
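A small sketch of collecting fixed-size window contexts (the whitespace tokenization, window size, and example sentence are illustrative assumptions):

```python
def contexts(tokens, target, m=2):
    """Collect the words within a window of size m around every occurrence of target."""
    result = []
    for t, word in enumerate(tokens):
        if word != target:
            continue
        window = tokens[max(0, t - m):t] + tokens[t + 1:t + 1 + m]
        result.append(window)
    return result

tokens = "government debt problems turning into banking crises as happened in 2009".split()
print(contexts(tokens, "banking", m=2))
# [['turning', 'into', 'crises', 'as']]
```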
2 Word2Vec

Word Vectors: In word2vec, a distributed representation of a word is used. Each word is represented by a distribution of weights across the elements of a vector. So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.
Note: Word vectors are sometimes called word embeddings or word representations. They are a distributed representation.
2.1 Word2Vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.
Idea:
 We have a large corpus of text
 Every word in a fixed vocabulary is represented by a vector
 Go through each position t in the text, which has a center word c and context (“outside”) words o
 Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
 Keep adjusting the word vectors to maximize this probability
2.2 Word2Vec: Objective Function
Example windows and process for computing:
P(w_{t+j}\mid w_t)
For each position t=1, \dots, T, predict context words within a window of fixed size m, given center word w_t.
Likelihood=L(\theta)=\prod_{t=1}^T\prod_{-m\leq j \leq m, j\neq 0}P(w_{t+j}\mid w_t;\theta)
where \theta is all variables to be optimized.
 The objective function J(\theta) is the average negative log likelihood:
The objective function is sometimes also called a cost or loss function.
J(\theta)=-\frac{1}{T}\log L(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\sum_{-m\leq j \leq m, j\neq 0}\log P(w_{t+j}\mid w_t;\theta)
So our goal of maximizing predictive accuracy is equivalent to minimizing the objective function.
Minimizing objective function \Leftrightarrow Maximizing predictive accuracy.
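A sketch of computing J(\theta) over a tokenized corpus, assuming a hypothetical prob(center, context) function that returns P(o\mid c;\theta) (how that probability is defined comes next):

```python
import math

def objective(tokens, prob, m=2):
    """Average negative log likelihood J(theta) over the corpus."""
    T = len(tokens)
    total = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):   # skip the center word and corpus edges
                continue
            total += math.log(prob(tokens[t], tokens[t + j]))
    return -total / T
```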
How do we calculate P(w_{t+j}\mid w_t;\theta)?
To calculate P(w_{t+j}\mid w_t;\theta), we use two vectors per word w:
 v_w when w is a center word
 u_w when w is a context word
Then for a center word c and a context word o:
P(o\mid c)=\frac{\exp(u_o^Tv_c)}{\sum_{w\in V}\exp(u_w^Tv_c)}
In the formula above, exponentiation makes anything positive, and the dot product u_o^Tv_c measures the similarity of o and c: a larger dot product means a larger probability. The denominator \sum_{w\in V}\exp(u_w^Tv_c) normalizes over the entire vocabulary to give a probability distribution.
The following is the softmax function \mathbb{R}^n\rightarrow(0,1)^n:
softmax(x_i)=\frac{\exp(x_i)}{\sum_{j=1}^{n}\exp(x_j)}=p_i
The softmax function applies the standard exponential function to each element x_i of the input vector x and normalizes these values by dividing by the sum of all the exponentials; this normalization ensures that the components of the output vector p sum to 1.
The softmax function above maps arbitrary values x_i to a probability distribution p_i
 max because it amplifies probability of largest x_i
 soft because it still assigns some probability to smaller x_i
 Frequently used in Deep Learning.
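A minimal numpy softmax (the max subtraction for numerical stability and the example scores are added assumptions, not from the lecture); applying it to the dot products u_w^Tv_c gives P(o\mid c):

```python
import numpy as np

def softmax(x):
    """Map a vector of arbitrary scores to a probability distribution."""
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow; the result is unchanged
    return e / e.sum()

# Example: scores u_w^T v_c for every word w in a 5-word vocabulary (made-up values).
scores = np.array([1.2, -0.3, 0.8, 2.5, 0.0])
p = softmax(scores)
print(p, p.sum())  # probabilities summing to 1; the largest score gets the largest share
```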
\theta represents all model parameters. Suppose we have V words and each word vector is d-dimensional. Every word w has two vectors: the center word vector v_w and the context vector u_w. So \theta \in \mathbb{R}^{2dV}, and we can optimize these parameters by walking down the gradient.
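A sketch of that parameter layout, with example sizes V and d chosen for illustration: two V\times d matrices (center and context vectors), 2dV scalars in total.

```python
import numpy as np

V, d = 10000, 100                              # example vocabulary size and dimension
center_vecs = 0.01 * np.random.randn(V, d)     # v_w for every word w
context_vecs = 0.01 * np.random.randn(V, d)    # u_w for every word w
theta = np.concatenate([center_vecs.ravel(), context_vecs.ravel()])
print(theta.shape)                             # (2 * d * V,)
```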
Useful basics about derivative:
\frac{\partial x^Ta}{\partial x}=\frac{\partial a^Tx}{\partial x}=a
Writing it out with indices proves it.
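A quick finite-difference check of the identity (random vectors and a small dimension, chosen only for this sanity check):

```python
import numpy as np

d, eps = 5, 1e-6
a = np.random.randn(d)
x = np.random.randn(d)

# Finite-difference approximation of d(x^T a)/dx, one coordinate at a time.
grad = np.array([
    (np.dot(x + eps * np.eye(d)[i], a) - np.dot(x, a)) / eps
    for i in range(d)
])
print(np.allclose(grad, a, atol=1e-4))  # True: the gradient is a
```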
We need to minimize
J(\theta)=-\frac{1}{T}\log L(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\sum_{-m\leq j \leq m, j\neq0}\log P(w_{t+j}\mid w_t;\theta)
Minimizing J(\theta) amounts to maximizing each term \log P(o\mid c)=\log \frac{\exp(u_o^Tv_c)}{\sum_{w=1}^V\exp(u_w^Tv_c)}.
Taking the partial derivative of \log P(o\mid c) with respect to v_c, we get the following:
\log P(o\mid c)=\log \frac{\exp(u_o^Tv_c)}{\sum_{w=1}^V\exp(u_w^Tv_c)}=u_o^Tv_c-\log \sum_{w=1}^V\exp(u_w^Tv_c)
\frac{\partial \log P(o\mid c)}{\partial v_c}=u_o-\frac{\sum_{x=1}^V u_x\exp(u_x^Tv_c)}{\sum_{w=1}^V\exp(u_w^Tv_c)}=u_o-\sum_{x=1}^Vu_xP(x\mid c)
The derivation above illustrates the skip-gram model and how to update its parameters by following the gradient of J(\theta).
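A sketch of one update of the center vector v_c using that gradient, u_o-\sum_x P(x\mid c)u_x. The matrix shapes, word indices, and learning rate below are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

V, d, lr = 1000, 50, 0.05                  # example sizes and learning rate
U = 0.01 * np.random.randn(V, d)           # context vectors u_w (one row per word)
Vmat = 0.01 * np.random.randn(V, d)        # center vectors v_w (one row per word)

c, o = 3, 17                               # indices of the center and observed context word
p = softmax(U @ Vmat[c])                   # P(x | c) for every word x in the vocabulary
grad_vc = U[o] - U.T @ p                   # u_o - sum_x P(x|c) u_x
Vmat[c] += lr * grad_vc                    # ascend log P(o|c), i.e. descend J(theta)
```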