What?
The content - and the learning goals - for the course are described in broad terms at the main course page. More specifically, we will consider various steps involved in NLP systems from tokenization, sentence splitting, and tagging to named-entity recognition, information extraction and more. Central to the course will be the use of experiments in NLP, as many of you will carry out such experiments as part of your master's thesis project. We will in particular consider how experiments should be set up and evaluated, as well as various machine learning algorithms, and what makes linguistic material special when it comes to machine learning experiments.
The course will follow a path similar to the one in 2019, but with some revisions. The first 7 weeks (Aug.+Sept.) will consider basic steps in an NLP process and various NLP techniques, based on chapters 1-9 in Jurafsky and Martin, Speech and Language Processing (3rd ed., 16 Oct 2019). Compared to 2019, some of the more elementary material will be skipped, and some more advanced approaches will be included. In October, we will consider dialogue systems, and in mandatory assignment 3, you will make your own chatbot.
Our bachelor programs were renewed with the incoming students in 2017. In particular, the bachelor program in Informatics: Language Technology was established (replacing the former program called Informatics: Language and Communication). This is the first year where the master's program in Informatics: Language Technology will recruit students from the bachelor program of the same name, and some of the revisions to this course are made to accommodate these updates. In particular, it is a goal not to repeat too much of the curriculum from IN1140 and IN2110.
At the same time, we are also recruiting more master's students to the Informatics: Language Technology program from other bachelor programs and universities than before. Moreover, there are also students from programs other than the I:LT master's taking this course. Last year, most of the students were from other programs. For many, IN4080 is their first encounter with NLP. It is a challenge to accommodate students with such varying backgrounds: on the one hand making the curriculum comprehensible to those without a background in NLP, and at the same time avoiding too much repetition of IN1140 or IN2110. We will do our best. But it is inevitable that some of you will have heard some of this before, while for others there will be much that is new and the learning curve might feel steep.
Recommended background
Natural Language Processing is an interdisciplinary subject building on insights from various fields including
- Language and Linguistics
- Computer Science in general and programming in particular
- Probability theory and statistics (and mathematics)
- Machine Learning and "Data Science"
Here is some more on assumed background and recommendations on what to read.
Language and linguistics
You should be familiar with some core concepts of linguistics, like "parts of speech" and "sentence structure". If you have not taken any courses in linguistics or NLP/Computational Linguistics you should consult some of the following.
- Chapter 3, "Linguistic Essentials", pp. 81-115, in Manning and Schütze: Foundations of Statistical Natural Language Processing. This is the best overview of what will be assumed in the course. Unfortunately, the book is not online, but you can find it in the library.
- Jurafsky and Martin, Speech and Language Processing (3rd ed.): sections 3.1 and 8.1 cover some of the background on words.
- In the same book, chapter 12, Constituency Grammars, sections 12.1-12.3, introduces some of the key concepts of sentence syntax.
- Related to sentence syntax, we also recommend reading sections 8.1-8.3 in the NLTK book: Natural Language Processing with Python, by Bird, Klein and Loper.
Programming in Python
The course will not be a heavy programming course, but you have to be able to write programs to solve simple tasks. Moreover, many of the tools we will use are Python modules. We assume that you know how to program and that you are able to teach yourself Python if you are not already familiar with it. Sources for learning Python include:
- The official Python documentation, in particular the Tutorial and Library.
- Think Python: How to Think Like a Computer Scientist is an easy introduction to Python.
- The Scipy Lecture Notes give a condensed introduction to Python together with introductions to some of the packages we will apply.
- The first four chapters of the NLTK book also give a Python introduction.
The Natural Language Toolkit (NLTK)
We will use several Python packages and toolkits for NLP and machine learning during the course. One of them, NLTK, is used in several bachelor courses. We will also use some parts of it in this course for simpler tasks like tokenization, tagging and counting frequencies. You are advised to familiarize yourself with the first three chapters of the NLTK book as soon as possible, in particular chapter 1 and chapter 2, sec. 2.1-2.2.
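To give a feel for the kind of simple task mentioned above, here is a minimal sketch of tokenization and frequency counting. It deliberately uses only the Python standard library as a stand-in, so it is not NLTK's own tokenizer or frequency class; the function names `tokenize` and `word_frequencies` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens.

    A crude regex tokenizer, used here only as an illustrative stand-in
    for the tokenizers that NLTK provides.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


def word_frequencies(text):
    """Count how often each purely alphabetic token occurs in the text."""
    return Counter(tok for tok in tokenize(text) if tok.isalpha())


# Hypothetical sample text, chosen only for this example.
sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)
print(freqs.most_common(2))  # [('the', 3), ('mat', 2)]
```

In the course itself, the corresponding NLTK tools described in chapters 1 and 2 of the NLTK book would be used instead of these hand-written helpers.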
Probability theory and Statistics
Probabilistic concepts are heavily used in modern NLP and you need a basic understanding of them. The number of students lacking a background in probability has varied from year to year, and in some years we have offered a tutorial on the basic concepts for those with no background. We will consider whether there is a wish for it this year.
Do you need a book on statistics? We will cover all the concepts on the slides, so a book is not strictly required. But if you would like more to read, most books on statistics will do. To keep it cheap, we will use some parts of a book that is freely available on the web, OpenIntro Statistics (3rd ed.).
Other useful sources
- If you already own a book on statistics, it will probably cover what we will consider, e.g. the STK1000 book, Moore and McCabe, Introduction to the Practice of Statistics.
- In earlier semesters, some students recommended Khan Academy.
What first?
Question: If I lack some of this background, in which order should I attack it?
- If you lack experience with Python and NLTK, that is most urgent. We are going to use it from the first week.
- Then, if you don't have knowledge of linguistics, that's next on your agenda.
- If you already know Python, NLTK and some linguistics, it is time for probabilities and statistics. As mentioned, we may give some tutorials, but it is wise to get a head start and use the first weeks of the semester.