CS224N Lecture 1 - Word2Vec

1 Introduction to Natural Language Processing

Natural Language is a discrete/symbolic/categorical system.

1.1 Examples of tasks

There are many tasks in NLP, at different levels of difficulty. For instance:

Easy:

  • Spell Checking

  • Keywords Search

  • Finding Synonyms

  • Synonym: a word having the same or nearly the same meaning as another word (noun)

Medium:

  • Parsing information from websites, documents, etc.

  • Parse: to analyze the grammatical structure of a sentence (verb)

Hard:

  • Machine Translation

  • Semantic Analysis (What is the meaning of a query statement?)

  • Coreference (e.g. What does “he” or “it” refer to given a document?)

    Coreference: the relation between expressions that refer to the same entity (e.g. a pronoun and the noun it refers to)

  • Question Answering

1.2 How to represent words?

1.2.1 WordNet

  • Use e.g. WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships).

    Hypernym: a superordinate term; the more general word in an “is a” relationship (e.g. “color” is a hypernym of “red”)

https://image.i-ll.cc/uPic/20200217/YDeWBF.png?imageMogr2/auto-orient/blur/1x0/quality/75|watermark/2/text/WmhhbyBDaGnigJhzIEJsb2c=/font/dGltZXMgbmV3IHJvbWFu/fontsize/240/fill/IzAwMDAwMA==/dissolve/75/gravity/SouthEast/dx/10/dy/10|imageslim

https://image.i-ll.cc/uPic/20200217/bZlTpI.png?imageMogr2/auto-orient/blur/1x0/quality/75|watermark/2/text/WmhhbyBDaGnigJhzIEJsb2c=/font/dGltZXMgbmV3IHJvbWFu/fontsize/240/fill/IzAwMDAwMA==/dissolve/75/gravity/SouthEast/dx/10/dy/10|imageslim
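
For a concrete feel, here is a minimal sketch of querying WordNet through NLTK; it assumes NLTK is installed and the WordNet corpus has been downloaded, and the query words are just illustrative examples, not the lecture's code.

```python
# Minimal sketch using NLTK's WordNet interface.
# Assumes: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") containing the adjective "good"
for synset in wn.synsets("good", pos=wn.ADJ)[:5]:
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])

# Hypernym chain ("is a" relationships) for the noun "panda"
panda = wn.synsets("panda", pos=wn.NOUN)[0]
print([s.name() for s in panda.closure(lambda s: s.hypernyms())])
```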

Problems with resources like WordNet

  • Great as a resource but missing nuance (e.g. “proficient” is listed as a synonym for “good”, which is only correct in some contexts)
  • Missing new meanings of words
  • Subjective
  • Requires human labor to create and adapt
  • Can’t compute accurate word similarity

1.2.2 Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols: hotel, conference, motel - a localist representation.

Words can be represented by one-hot vectors: “one-hot” means a single 1 with all other entries 0. Vector dimension = number of words in the vocabulary.

  • Problem with words as discrete symbols:

    For example, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.

    But:

    hotel = [0 0 0 0 0 1 0]
    motel = [0 0 0 0 1 0 0]

    These two vectors are orthogonal: with one-hot vectors there is no natural notion of similarity (see the sketch after this list).

  • Solution:

    • Could try to rely on WordNet’s list of synonyms to get similarity?

      • But it is well-known to fail badly: incompleteness, etc.
    • Instead: Learn to encode similarity in the vectors themselves
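
A minimal numpy sketch of the orthogonality problem above; the toy vocabulary and its ordering are made up for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary; a real vocabulary may have hundreds of thousands of words.
vocab = ["the", "a", "Seattle", "conference", "motel", "hotel", "walk"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """One-hot vector for `word`: a single 1, the rest 0 (dimension = |vocab|)."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

hotel, motel = one_hot("hotel"), one_hot("motel")
print(hotel)                          # [0. 0. 0. 0. 0. 1. 0.]
print(motel)                          # [0. 0. 0. 0. 1. 0. 0.]
print("similarity:", hotel @ motel)   # 0.0 -- orthogonal, no similarity signal
```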

1.2.3 Representing words by their context

  • Distributional semantics: A word’s meaning is given by the words that frequently appear close-by.

    • “You shall know a word by the company it keeps” (J.R. Firth 1957: 11)
    • One of the most successful ideas of modern statistical NLP
  • When a word “w” appears in a text, its context is the set of words that appear nearby (within a fixed-size window).

  • Use the many contexts of “w” to build up a representation of “w” (a small sketch of this follows the figure below).

    https://image.i-ll.cc/uPic/20200217/SeMoji.png?imageMogr2/auto-orient/blur/1x0/quality/75|watermark/2/text/WmhhbyBDaGnigJhzIEJsb2c=/font/dGltZXMgbmV3IHJvbWFu/fontsize/240/fill/IzAwMDAwMA==/dissolve/75/gravity/SouthEast/dx/10/dy/10|imageslim
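
A rough sketch of collecting the contexts of a word with a fixed-size window; the sentence is adapted from the lecture's “banking” example and the window size is arbitrary.

```python
from collections import Counter

def context_counts(tokens, target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for t, word in enumerate(tokens):
        if word != target:
            continue
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        counts.update(tokens[lo:t] + tokens[t + 1:hi])
    return counts

corpus = "government debt problems turning into banking crises as happened in 2009".split()
print(context_counts(corpus, "banking", window=2))
# Counter({'turning': 1, 'into': 1, 'crises': 1, 'as': 1})
```

Aggregating such context counts over many occurrences of “w” across a large corpus is the starting point for building a vector representation of “w”.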

2 Word2Vec

  • Word Vectors: In word2vec, a distributed representation of a word is used: each word is represented by a distribution of weights across the elements of a vector. So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.

    Note: Word vectors are sometimes called word embeddings or word representations. They are a distributed representation.

2.1 Word2Vec: Overview

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.

Idea:

  • We have a large corpus of text
  • Every word in a fixed vocabulary is represented by a vector
  • Go through each position t in the text, which has a center word c and context (“outside”) words o
  • Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
  • Keep adjusting the word vectors to maximize this probability
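
The iteration described above can be sketched as producing a (center, outside) pair for every position t and every non-zero offset inside the window; the sentence and window size below are just illustrative.

```python
def training_pairs(tokens, window=2):
    """Yield (center, outside) word pairs for every position t in the text."""
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            yield center, tokens[t + j]

text = "problems turning into banking crises".split()
for center, outside in training_pairs(text, window=2):
    print(center, "->", outside)
```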

2.2 Word2Vec: Objective Function

Example windows and process for computing:

$$P(w_{t+j} \mid w_t)$$

https://image.i-ll.cc/uPic/20200217/TvsULE.png?imageMogr2/auto-orient/blur/1x0/quality/75|watermark/2/text/WmhhbyBDaGnigJhzIEJsb2c=/font/dGltZXMgbmV3IHJvbWFu/fontsize/240/fill/IzAwMDAwMA==/dissolve/75/gravity/SouthEast/dx/10/dy/10|imageslim

For each position $t = 1, \dots, T$, predict the context words within a window of fixed size $m$, given the center word $w_t$.

$$\text{Likelihood} = L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t; \theta)$$

where $\theta$ represents all the variables to be optimized.

  • The objective function $J(\theta)$ is the average negative log-likelihood:

    The objective function is sometimes called the cost or loss function.

$$J(\theta) = -\frac{1}{T}\log L(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{-m \le j \le m,\ j \ne 0}\log P(w_{t+j} \mid w_t; \theta)$$

So our goal of maximizing predictive accuracy is equivalent to minimizing the objective function:

Minimizing the objective function $J(\theta)$ $\Leftrightarrow$ maximizing predictive accuracy.
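
As a small numeric sketch of this relationship: $J(\theta)$ is just the average negative log of the predicted probabilities, so pushing the probabilities of the observed context words up pushes $J(\theta)$ down. The probability values below are made up for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities P(w_{t+j} | w_t; theta) for a few
# (center, context) pairs -- made-up numbers, only to show the direction of J.
probs = np.array([0.20, 0.05, 0.10, 0.40])

J = -np.mean(np.log(probs))          # average negative log-likelihood
print(J)                             # ~1.96

# Doubling every probability (better predictions) lowers the objective.
print(-np.mean(np.log(probs * 2)))   # ~1.26
```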

How do we calculate $P(w_{t+j} \mid w_t; \theta)$?

To calculate $P(w_{t+j} \mid w_t; \theta)$, we use two vectors per word $w$:

  • $v_w$ when $w$ is a center word
  • $u_w$ when $w$ is a context word

Then for a center word c and a context word o:

$$P(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V}\exp(u_w^T v_c)}$$

In the formula above, exponentiation makes everything positive, and the dot product $u_o^T v_c$ measures the similarity of $o$ and $c$: a larger dot product means a larger probability. The denominator $\sum_{w \in V}\exp(u_w^T v_c)$ normalizes over the entire vocabulary to give a probability distribution.

The formula for $P(o \mid c)$ is an example of the softmax function $\mathbb{R}^n \to \mathbb{R}^n$:

$$\text{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n}\exp(x_j)} = p_i$$

The softmax function applies the standard exponential function to each element $x_i$ of the input vector $x$ and normalizes these values by dividing by the sum of all the exponentials; this normalization ensures that the components of the output vector $p$ sum to 1.

The softmax function above maps arbitrary values $x_i$ to a probability distribution $p_i$:

  • “max” because it amplifies the probability of the largest $x_i$
  • “soft” because it still assigns some probability to smaller $x_i$
  • Frequently used in Deep Learning.
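
A minimal numpy sketch of the softmax function; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the formula above requires.

```python
import numpy as np

def softmax(x):
    """Map arbitrary real scores x to a probability distribution p."""
    exp_x = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return exp_x / exp_x.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)          # ~[0.659 0.242 0.099]: "max" amplifies the largest score,
                  # "soft" still gives the smaller scores some probability
print(p.sum())    # 1.0
```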

$\theta$ represents all the model parameters. For example, if we have $V$ words and each word vector is $d$-dimensional, then every word $w$ has two vectors: the center word vector $v_w$ and the context vector $u_w$. So $\theta \in \mathbb{R}^{2dV}$, and we can optimize these parameters by walking down the gradient.

A useful basic fact about derivatives:

$$\frac{\partial\,(x^T a)}{\partial x} = \frac{\partial\,(a^T x)}{\partial x} = a$$

Writing it out with indices proves it:
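
Written out for the $k$-th component (same $a$ and $x$ as above):

$$\frac{\partial\,(a^T x)}{\partial x_k} = \frac{\partial}{\partial x_k}\sum_{i} a_i x_i = a_k \quad\Longrightarrow\quad \frac{\partial\,(a^T x)}{\partial x} = a$$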

We need to minimize

$$J(\theta) = -\frac{1}{T}\log L(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{-m \le j \le m,\ j \ne 0}\log P(w_{t+j} \mid w_t; \theta)$$

Minimizing $J(\theta)$ is equivalent to minimizing each term $-\log P(o \mid c) = -\log\frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V}\exp(u_w^T v_c)}$.

Taking the partial derivative of $\log P(o \mid c)$ with respect to $v_c$, we get the following. First expand the logarithm:

$$\log P(o \mid c) = \log\frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V}\exp(u_w^T v_c)} = \log\exp(u_o^T v_c) - \log\sum_{w=1}^{V}\exp(u_w^T v_c)$$

Then differentiate with respect to $v_c$:

$$\frac{\partial \log P(o \mid c)}{\partial v_c} = u_o - \frac{\sum_{x=1}^{V} u_x \exp(u_x^T v_c)}{\sum_{w=1}^{V}\exp(u_w^T v_c)} = u_o - \sum_{x=1}^{V} P(x \mid c)\, u_x$$

The derivation above illustrates the skip-gram model and how to compute the gradients used to update the parameters of $J(\theta)$.

https://image.i-ll.cc/uPic/20200220/QTV0B5.png?imageMogr2/auto-orient/blur/1x0/quality/75|watermark/2/text/WmhhbyBDaGnigJhzIEJsb2c=/font/dGltZXMgbmV3IHJvbWFu/fontsize/240/fill/IzAwMDAwMA==/dissolve/75/gravity/SouthEast/dx/10/dy/10|imageslim
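
To make the gradient concrete, the sketch below computes $P(o \mid c)$ with a softmax and the analytic gradient $u_o - \sum_{x} P(x \mid c)\, u_x$, then checks it against a finite-difference estimate. The vocabulary size, dimension, and random vectors are arbitrary, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                       # toy vocabulary size and vector dimension
U = rng.normal(size=(V, d))        # context ("outside") vectors u_w, one row per word
v_c = rng.normal(size=d)           # center word vector v_c
o = 3                              # index of the observed outside word

def log_prob(v):
    """log P(o | c) = u_o^T v_c - log sum_w exp(u_w^T v_c)."""
    scores = U @ v
    return scores[o] - np.log(np.sum(np.exp(scores)))

# Analytic gradient: u_o - sum_x P(x|c) u_x
p = np.exp(U @ v_c)
p /= p.sum()
analytic = U[o] - p @ U

# Finite-difference estimate of d log P(o|c) / d v_c
eps = 1e-6
numeric = np.array([
    (log_prob(v_c + eps * np.eye(d)[i]) - log_prob(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```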