CS224 Lecture1 Word2Vec
文章目录
CS224 Lecture1  Word2Vec
1 Introduction of Natural Language Processing
Natural Language is a discrete/symbolic/categorical system.
1.1 Examples of tasks
There are many different levels in NLP. For instance:
Easy:

Spell Checking

Keywords Search

Finding Synonyms

Synonym: 同义词 noun
Medium

Parsing information from websites, documents, etc.

Parse: 对句子进行语法分析 verb
Hard

Machine Translation

Semantic Analysis (What is the meaning of query statement?)

Coreference (e.g. What does “he” or “it” refer to given a document?)
Coreference: 共指关系，指代的词

Question Answering
1.2 How to represent words?
1.2.1 WordNet

Use e.g. WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships).
hypernym: 上位词
Problems with resources like WordNet
 Great as a resource but missing nuance (e.g. “proficient” is listed as a synonym for “good”, This is only correct in some contexts.)
 Missing new meanings of words
 Subjective
 Requires human labor to create and adapt
 Can’t compute accurate word similarity
1.2.2 Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols: hotel, conference, motel  a localist representation.
Words can be represented by onehot vectors, onehot means one 1 and the rest are 0. Vector dimension = number of words in vocabulary.

Problem with words as discrete symbols:
For example, if user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.
But:
$$hotel = [0\quad 0\quad 0\quad 0\quad 0\quad 1\quad \cdots \quad 0]\motel = [0\quad 0\quad 0\quad 0\quad 1\quad 0\quad \cdots \quad 0]$$
These two vectors are orthogonal. But for onehot vectors didn’t shown there’s natural notion of similarity.

Solution:

Could try to rely on WordNet’s list of synonyms to get similarity?
 But it is wellknown to fail badly: incompleteness, etc.

Instead: Learn to encode similarity in the vectors themselves

1.2.3 Representing words by their context

Distributional semantics: A word’s meaning is given by the words that frequently appear closeby.
 “You shall know a word by the company it keeps”(J.R. Firth 1957: 11)
 One of the most successful ideas of modern statistical NLP

When a word “$w$” appears in a text, its context is the set of words that appear nearby(within a fixedsize windows).

Use the many contexts of “$w$” to build up a representation of “$w$”.
2 Word2Vec

Word Vectors: In word2vec, a distributed representation of a word is used. Each word is represented by a distribution of weights across those elements. So instead of a onetoone mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.
Note: Word vectors are sometimes called word embeddings or word representations. They are a distributed representation.
2.1 Word2Vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.
Idea:
 We have a large corpus of text
 Every word in a fixed vocabulary is represented by a vector
 Go through each position $t$ in the text, which has a center word $c$ and context (“outside”) words $o$
 Use the similarity of the word vectors for $c$ and $o$ to calculate the probability of $o$ given $c$ (or vice versa)
 Keep adjusting the word vectors to maximize this probability
2.2 Word2Vec: Objective Function
Example windows and process for computing:
$$P(w_{t+j}w_t)$$
For each position, $t=1, \dots , T$ , predict context words within a window of fixed size $m$, given center word $w_j$.
$$Likelihood=L(\theta)=\prod_{t=1}^T\prod_{m\leq j \leq m}P(w_{t+j}w_t;\theta) $$
where $\theta$ is all variables to be optimized.

The objective function $J(\theta)$ is the average negative log likelihood:
Sometimes, objective function called $cost$ or $loss$ function.
$$J(\theta)=\frac{1}{T}\log L(\theta)=\frac{1}{T}\sum_{t=1}^{T}\sum_{m\leq j \leq m}\log P(w_{t+j}w_t;\theta)$$
So, our target that maximizing predictive accuracy is equivalent to minimizing objective function.
Minimizing objective function $\Leftrightarrow$ Maximizing predictive accuracy.
How to calculate $P(w_{t+j}w_t;\theta)$ ?
To calculate $P(w_{t+j}w_t;\theta)$ use two vectors per word $w$:
 $v_w$ when $w$ is a center word
 $u_w$ when $w$ is a context word
Then for a center word $c$ and a context word $o$:
$$P(oc)=\frac{\exp(u_o^Tv_c)}{\sum_{w\in V}\exp(u_w^Tv_c)}$$
In the formula above $Exponentiation$ makes anything positive, $u_o^Tv_c$ is dot product that compares similarity of $o$ and $c$. Large dot product = Large probability. Denominator $\sum_{w\in V}\exp(u_w^Tv_c)$normalize over entire vocabulary to give probability distribution.
The follow is an example of $softmax function$ $\mathbb{R}\rightarrow\mathbb{R}$
$$softmax(x_i)=\frac{\exp(x_i)}{\sum_{j=1}^{n}\exp(x_j)}=p_i$$
Softmax Function apply the standard exponential function to each element $x_i$ of the input vector $x$ and normalize these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector $p$ is 1.
The softmax function above maps arbitrary values $x_i$ to a probability distribution $p_i$
 $max$ because it amplifies probability of largest $x_i$
 $soft$ because it still assigns some probability to smaller $x_i$
 Frequently used in Deep Learning.
$\theta$ represents all model parameters. For example we have $V$ words and each vector of word has $d$ dimensional. And every word such as word $w$ have two vectors, one is the center word vector $v_w$ and another is the context vector $u_w$. So, $\theta \in \mathbb{R}^{2dV}$ , we can optimize these parameters by walking down the gradient.
Useful basics about derivative:
$$\frac{\partial x^Ta}{\partial x}=\frac{\partial a^Tx}{\partial x}=a$$
Write out with indices can proof it.
We need to minimize
$$J(\theta)=\frac{1}{T}\log L(\theta)=\frac{1}{T}\sum_{t=1}^{T}\sum_{m\leq j \leq m, j\neq0}\log P(w_{t+j}w_t;\theta)$$
minimize $J(\theta)$ is equivalent to minimize $\log P(oc)=\log \frac{\exp(u_o^Tv_c)}{\sum_{w=1}^V\exp(u_w^Tv_c)}$.
Take the partial of $\log P(oc)=\log \frac{\exp(u_o^Tv_c)}{\sum_{w=1}^V\exp(u_w^Tv_c)}$, we can get the follow results:
$$\log P(oc)=\log \frac{\exp(u_o^Tv_c)}{\sum_{w=1}^V\exp(u_w^Tv_c)}=\log \exp(u_o^Tv_c)\log \sum_{w=1}^V\exp(u_w^Tv_c)$$
$$\frac{\partial P(oc)}{\partial v_c}=u_o\frac{\sum_{x=1}^V u_x\exp(u_w^Tv_c)}{\sum_{w=1}^V\exp(u_w^Tv_c)}=u_o\sum_{x=1}^Vu_xP(xc)$$
The statement above illustrate the skipgram language model and how to update the parameters of $J(\theta)$.