Week 2

As announced in the first lecture, there will be two lectures this week, one on Monday and one on Thursday, to make sure that all of you have a sufficient background.

Lecture 2: Words, tokenization, tagged text

This lecture will consider

  • some basic linguistics concepts related to words
  • the processes of tokenization and normalization
  • tagged text
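As a rough illustration of the first two topics, here is a minimal sketch of tokenization and normalization using only Python's standard library. It is not the method used in the lecture or in the readings. The function names `tokenize` and `normalize`, the regular expression, and the example sentence are all assumptions made for this illustration, and real tokenizers handle many more cases (abbreviations, URLs, numbers, and so on).

```python
import re

def tokenize(text):
    # Hypothetical rule: a token is either a word (optionally with
    # internal apostrophes or hyphens) or a single punctuation mark.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

def normalize(tokens):
    # Hypothetical normalization: case folding, and dropping tokens
    # that contain no letter or digit (i.e. pure punctuation).
    return [t.lower() for t in tokens if any(c.isalnum() for c in t)]

sentence = "Mr. Smith didn't buy the state-of-the-art phone."
tokens = tokenize(sentence)
normalized = normalize(tokens)
print(tokens)
print(normalized)
```

Note that this simple rule splits "Mr." into "Mr" and ".", which shows why tokenization usually needs more than a single regular expression.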

Presentation

Recordings

Mandatory reading

Jurafsky and Martin, Speech and Language Processing, 3rd ed. (version of 30 Dec. 2020!)

  • Ch. 2 Regular expressions, etc.
    • Sec. 2.0
    • Sec. 2.2 Words
    • Sec. 2.3 Corpora
    • Sec. 2.4 Normalization, except 2.4.3 and the technical details of 2.4.1
  • Ch. 8 Sequence Labelling ...
    • Sec. 8.1 and 8.2

NLTK Book

  • Ch. 3, sec. 6 Normalizing Text
  • Ch. 3, sec. 8 Segmentation
  • Ch. 5, sec. 1 Using a tagger
  • Ch. 5, sec. 2 Tagged corpora

Wikipedia

Recommended reading

Wikipedia

Lecture 3, Thursday 2 Sept., 12:15 at SED

Presentation

Recordings

Unfortunately, the recording equipment in SED did not work, but the following recordings from 2020 cover roughly the same material:

  • First part
  • Second part
    (This opens in PowerPoint because there were problems with the recording equipment last year as well.)

Readings

OpenIntro, 3rd ed. (in the 4th ed., add one to the chapter numbers)

  • Ch. 2, "Probability", sec. 2.1-2.4
  • Ch. 3, "Distributions of random variables":
    • Sec. 3.3.1 Bernoulli distribution
    • Sec. 3.4.1 Binomial distribution
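To make the two distributions concrete, here is a minimal sketch of their probability mass functions in Python. The function names `bernoulli_pmf` and `binomial_pmf` are chosen for this illustration and do not come from the readings. The formula used is the standard binomial one, P(X = k) = C(n, k) p^k (1 - p)^(n - k), where the Bernoulli distribution is the special case n = 1.

```python
from math import comb

def bernoulli_pmf(x, p):
    # Probability of outcome x (1 = success, 0 = failure)
    # when the success probability is p.
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p if x == 1 else 1 - p

def binomial_pmf(k, n, p):
    # Probability of exactly k successes in n independent
    # Bernoulli trials, each with success probability p.
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example: exactly 2 successes in 4 trials with p = 0.5.
print(binomial_pmf(2, 4, 0.5))
```

With n = 4 and p = 0.5 there are C(4, 2) = 6 ways to place the two successes, each with probability 0.5^4 = 0.0625, so the result is 0.375.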

 

Published Aug. 29, 2021 3:46 PM - Last modified Oct. 14, 2021 8:37 AM