Lecture 2, Thursday Sept. 1:
Words, tokenization, tagged text
This lecture will consider:
- some basic linguistic concepts related to words
- the processes of tokenization and normalization
- tagged text
Presentation
Recordings
Mandatory readings
Jurafsky and Martin, Speech and Language Processing, 3rd ed. (edition of 12 Jan. 2022!)
- Ch. 2 Regular expressions, etc.
- Sec. 2.0
- Sec. 2.2 Words
- Sec. 2.3 Corpora
- Sec. 2.4 Normalization, except 2.4.3 and the technical details of 2.4.1
- Ch. 8 Sequence Labelling ...
- Sec. 8.1 and 8.2
NLTK book
- Ch. 3, sec. 6 Normalizing Text
- Ch. 3, sec. 8 Segmentation
- Ch. 5, sec. 1 Using a tagger
- Ch. 5, sec. 2 Tagged corpora
Wikipedia
Recommended reading
Wikipedia
Probabilities - background and tutorial
Last year's slides and the readings below indicate what we expect you to know about probabilities. Many of you have a background in probabilities, but some of you may lack it. If anybody is interested, we will arrange a tutorial on probabilities sometime between Fri. Sept. 2 and Wed. Sept. 7. We can decide on a time in the lecture on Sept. 1. (Sept. 1 at 14:00 turned out not to be an option.) If you are interested, you may send me (jtl) an email indicating possible times.
Presentation
Readings
OpenIntro (3rd ed.; in the 4th ed., add one to the chapter numbers)
- Ch. 2, "Probability", sec. 2.1-2.4
- Ch. 3, "Distributions of random variables":
- Sec. 3.3.1 Bernoulli distribution
- Sec. 3.4.1 Binomial distribution