# Exercise Set 1
<h2> IN4080 - 2025 </h2>

*You will probably not manage to work through all of this exercise set during the group session. Continue
to work on it by yourself after the group session and return to the teacher in later group sessions if you
have any questions.*


This exercise set requires the following packages:
* Spacy 
* Numpy 
* Pandas
* Matplotlib

## Part 0: Set up a working environment on your PC

Follow the installation instructions on the course web page to set up a working environment on your
own computer.
The following exercises should be solved interactively in Python. We recommend using a Jupyter
Notebook, but you can also work in a standard interactive Python prompt.

### Jupyter notebooks? 
  The Jupyter Notebook is an interactive computing environment. 

  Read about: 
  - [Notebook Basics](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Notebook%20Basics.html) 
  - [Running Code](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Running%20Code.html) 


## Part 1: Preprocessing Data with SpaCy

We will work with the book *Peter Pan*, which is available as plain text on Project Gutenberg. 


a) Download the Plain Text UTF-8 file from https://www.gutenberg.org/ebooks/16 and place it in your working
directory. Rename the file if necessary. Open the file in a text editor. Are there any particularities in the document 
that you’ll need to watch out for when processing it?


*Answer with text here:*


b) This is a Jupyter notebook which is organized into cells. There are essentially two types of cells: text (Markdown)
cells and code cells. You can execute a code cell by clicking on the Play button or by pressing Ctrl+Enter.
Create a code cell below, load in the Peter Pan text, and print the length of the text. What does this number tell you? 


<details>
  <summary>Code Solution</summary>

```python
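# 'utf-8-sig' strips the byte order mark (BOM) if the file starts with one
with open('peterpan.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()

print(len(text))  # number of characters, not words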
```
</details> 

c) We would now like to preprocess the raw text. The NLP pipeline from `SpaCy` does a lot of the heavy lifting for us.
Run the text through the SpaCy pipeline and print out the number of sentences in the resulting `Doc` object.
Does the number of sentences correspond to your expectations? Inspect the data.

[SpaCy Usage Documentation](https://spacy.io/usage)

d) It turns out that the initial text contains a lot of line breaks that make the processing harder than
necessary. Replace all line breaks in the raw text with a space.  

*Hint:* use regular expressions and the `re` library.

If you are not familiar with regular expressions, let your teachers know. We won’t need much of them now, but they 
can come in quite handy for a lot of text processing tasks, so it’s worth investing some time in learning the basics.
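
As a small illustration of `re.sub`, the sketch below works on a made-up string rather than on the book. The names `toy_text` and `flattened` are hypothetical, and collapsing each run of line breaks into a single space is only one possible choice.

```python
import re

# Made-up sample with a single line break and a blank line (two line breaks).
toy_text = "All children, except one,\ngrow up.\n\nThey soon know it."

# Replace every run of one or more line breaks with a single space.
flattened = re.sub(r"\n+", " ", toy_text)

print(flattened)
```

Replacing each `\n` individually would instead leave two spaces wherever the original had a blank line.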

e) Run the new text through the sentence tokenizer again. How do you judge the result? Have new errors been introduced?

## Part 2: Frequency distributions and Pandas

In this exercise set, we want to study the frequency distributions of words in the text.
Python provides a `Counter` class (in the `collections` module) that counts repeated elements in a list and stores the counts as a
dictionary. Assuming `tokens` is a list of the token strings from your `Doc`, we can set up the counter, get the most common elements, and the count for a specific element as follows:

```python
import collections

c = collections.Counter(tokens)
print(c.most_common(10))
print(c['with'])
```

a) Print out the 10 most frequent tokens in the text.


b) You will see that punctuation marks are among the most frequent items in the result. Remove them from the counter.

<details>
<summary> Hints </summary>
	<ul>
		<li> <code>string.punctuation</code> is a string containing the ASCII punctuation symbols (you’ll need to import string first)</li>
		<li>You can delete the 'and' item from the counter with <code>del c['and']</code>.</li>
	</ul>
</details>
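
The two hints can be combined as in the following sketch, which uses a made-up token list instead of the book. `toy_tokens` and `toy_counter` are hypothetical names; the loop relies on the fact that deleting a missing key from a `Counter` does not raise an error.

```python
import collections
import string

# Made-up tokens; in the exercise these would come from your own token list.
toy_tokens = ["Wendy", ",", "come", "with", "me", ",", "and", "tell", "me", "."]
toy_counter = collections.Counter(toy_tokens)

# Delete every single-character ASCII punctuation symbol from the counter.
# Counter ignores deletions of keys that are not present.
for symbol in string.punctuation:
    del toy_counter[symbol]

print(toy_counter.most_common(3))
```

Note that this only removes tokens that consist of exactly one ASCII punctuation character. Tokens such as typographic quotes or multi-character dashes would need separate handling.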


c) It would be nice if we could display the frequency distribution as a nicely formatted table. Let us use a
DataFrame from the Pandas package for this. In general, it is easy to populate a DataFrame with the
contents of a Python dictionary, as in the following code snippet:

```python
import pandas as pd

d = {'apple': 5, 'orange': 8, 'banana': 51, 'strawberry': 20}
df = pd.DataFrame(d.items(), columns=['fruit', 'number'])

df.head()    # Display the first rows
```

You can do the same with your word counter. The snippets further down assume that the columns are named `word` and `count`.

How many types and how many tokens does the text contain? What is its type-token-ratio?
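
To make the terms concrete: the number of types is the number of distinct words, the number of tokens is the total number of occurrences, and the type-token ratio is the first divided by the second. The sketch below computes all three on a made-up five-word list; `toy_counter`, `toy_df` and the other variable names are hypothetical.

```python
import collections

import pandas as pd

# Made-up tokens: "the" occurs twice, every other word once.
toy_counter = collections.Counter(["the", "boy", "and", "the", "dog"])
toy_df = pd.DataFrame(toy_counter.items(), columns=["word", "count"])

n_types = len(toy_df)                  # one row per distinct word
n_tokens = int(toy_df["count"].sum())  # total number of occurrences
ttr = n_types / n_tokens               # type-token ratio

print(n_types, n_tokens, ttr)
```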


d) With Pandas, you can easily select rows of a dataframe according to a particular criterion. For example,
this command displays all words that occur ten times or more:

```python
df[df['count'] >= 10]
```

How many hapaxes (words that occur exactly once) are in the dataset? What percentage of all word types are hapaxes?
How many word types start with an uppercase A?
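
The same selection pattern works with other conditions, including string tests through the `.str` accessor. The following sketch uses a made-up frequency table; `toy_df` and its values are hypothetical and only serve to show the mechanics.

```python
import pandas as pd

# Made-up frequency table with the column names used in this exercise set.
toy_df = pd.DataFrame({
    "word": ["Peter", "the", "Away", "crocodile", "And"],
    "count": [40, 300, 1, 1, 12],
})

# Rows whose count is exactly 1.
hapaxes = toy_df[toy_df["count"] == 1]
n_hapaxes = len(hapaxes)
hapax_share = 100 * n_hapaxes / len(toy_df)  # percentage of all types

# Rows whose word starts with an uppercase "A".
starts_with_a = toy_df[toy_df["word"].str.startswith("A")]

print(n_hapaxes, hapax_share, len(starts_with_a))
```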

## Part 3: Plotting with Matplotlib

Matplotlib is a package for making plots and figures in Python. When using Jupyter notebooks, the plots
are directly displayed in the notebook. This code snippet generates a simple plot:

```python 
import numpy as np
import matplotlib.pyplot as plt

numbers = np.arange(10)
print(numbers)
plt.plot(numbers)
```

a) Let us create a plot from our frequency distribution. You can directly use the `plot()` method of the
dataframe as follows:

```python
df = df.sort_values('count', ascending=False)
df.plot(x='word', y='count')
```

Does the result of this plot correspond to your expectations? 

b) Let's make some more visualizations. 

* Modify the command to only display the 20 most frequent words. 
* Try to display all words on the x-axis.
* For frequency plots, it is more natural to use bar charts. Switch the type of the plot with the `kind='bar'` parameter.

c) Zipf’s law states that the product of the frequency of a word and of its rank is approximately constant.
Let us verify this law on a subset of our frequency distribution. Select the first 2000 words of the
sorted frequency table. Reset the indices in the resulting dataframe so that we can derive the rank from each index (index plus one, since indices start at 0). Now, add a new column with the product of rank and frequency.

<details>
<summary>Hints</summary>

```python
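# Keep the 2000 most frequent words; the new index 0..1999 follows the sort order.
df_zipf = df[:2000].reset_index(drop=True)
# Ranks start at 1, so add 1 to the zero-based index before multiplying.
df_zipf['z'] = (df_zipf.index.values + 1) * df_zipf['count']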
```
</details>

What are the highest, lowest and average values of z that you observe? Plot the z values as a line chart.
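
For comparison, this is what a perfectly Zipfian distribution would look like. The counts below are made up so that frequency equals 1200 divided by rank; `toy_zipf` is a hypothetical name, and real text will only approximate this behaviour.

```python
import pandas as pd

# Made-up counts that follow count = 1200 / rank exactly.
toy_zipf = pd.DataFrame({
    "word": ["w1", "w2", "w3", "w4"],
    "count": [1200, 600, 400, 300],
})

# The table is already sorted by count, so rank is the zero-based index plus one.
toy_zipf["rank"] = toy_zipf.index + 1
toy_zipf["z"] = toy_zipf["rank"] * toy_zipf["count"]

print(toy_zipf)
```

In this idealized case every value of z is identical, so the highest, lowest and average values coincide and the line chart is flat.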