Lecture 2: Words, tokenization, tagged text
This week we will look at:
- some basic linguistics concepts related to words
- the processes of tokenization and normalization
- tagged text
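As a rough illustration of these three topics, here is a minimal sketch using only the Python standard library. The function names (`tokenize`, `normalize`, `parse_tagged`) and the regular expression are assumptions made for this example, not the method used in the readings or the lectures. Normalization is reduced to simple case folding, and tagged text is assumed to use the common `word/TAG` notation.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation marks.

    A word may contain internal apostrophes or hyphens (e.g. "didn't").
    This is a simplification; real tokenizers handle many more cases.
    """
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)


def normalize(tokens):
    """Case-fold every token (one simple form of normalization)."""
    return [token.lower() for token in tokens]


def parse_tagged(text, sep="/"):
    """Turn 'word/TAG' items separated by whitespace into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, found, tag = item.rpartition(sep)
        if not found:
            # No separator in this item: keep the word, leave the tag empty.
            word, tag = item, None
        pairs.append((word, tag))
    return pairs


tokens = tokenize("The cat didn't sit.")
print(tokens)
print(normalize(tokens))
print(parse_tagged("The/DT cat/NN sat/VBD ./."))
```

Splitting on the last separator (`rpartition`) rather than the first means that an item such as `./.` is still read as the token `.` with the tag `.`.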
Slides
Recording
Mandatory reading
Jurafsky and Martin, Speech and Language Processing, 3rd ed. (edition of 16 Oct. 2019!)
- Ch. 2 Regular expressions etc
- Sec. 2.0
- Sec. 2.2 Words
- Sec. 2.3 Corpora
- Sec. 2.4 Normalization, except 2.4.3 and the technical details of 2.4.1
- Ch. 8 Part-of-speech tagging
- Sec. 8.1 and 8.2
Bird, Klein and Loper, Natural Language Processing with Python (the NLTK book)
- Ch. 3, sec. 6 Normalizing Text
- Ch. 3, sec. 8 Segmentation
- Ch. 5, sec. 1 Using a tagger
- Ch. 5, sec. 2 Tagged corpora
Wikipedia
Recommended reading
Wikipedia
Lab-session 1, Tuesday 25 August at Sed
Note that the group sessions have moved to Sed, which has 28 seats!
Bring either your laptop or a keyboard and mouse!