# coding: utf-8

# # IN4080 2018, Mandatory assignment 2, part B (= Exercise set 6)

# **See mandatory assignment 2, part A (= Exercise set 5) for general requirements**

# In this exercise we will experiment with sequence classification and tagging. We will combine some of the tools for tagging from NLTK with scikit-learn to build various taggers.
# 
# We jump into the NLTK book, chapter 6, sections 6.1.5 Exploiting context and 6.1.6 Sequence classification. You are advised to read them before you start.
# 
# We start by importing NLTK and the tagged sentences from the news section of Brown, similarly to the NLTK book.
# 
# Then we split the set of sentences into a train set and a test set.

# In[1]:

import re
import pprint
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

# Like NLTK, our tagger will have two parts: a feature extractor, here called **pos_features**, and a general class for building taggers, **ConsecutivePosTagger**.
# 
# We have made a few adjustments to the NLTK setup. We are using the *pos_features* from section 6.1.5 together with the *ConsecutivePosTagger* from section 6.1.6. The *pos_features* in section 6.1.5 does not consider history, but to get a format that works together with *ConsecutivePosTagger*, we have included an argument for history in *pos_features* which is not used initially. (It gets used by the *pos_features* in section 6.1.6 of the NLTK book.)
# 
# Secondly, we have made the feature extractor a parameter of *ConsecutivePosTagger*, so that it can easily be replaced by other feature extractors while keeping *ConsecutivePosTagger*.

# In[2]:

def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

# In[3]:

class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents, features=pos_features):
        self.features = features
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = self.features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

# Following the NLTK book, we train and test a classifier.

# In[4]:

tagger = ConsecutivePosTagger(train_sents)
print(round(tagger.evaluate(test_sents), 4))

# This should give results comparable to the NLTK book.

# The Brown tags that come with NLTK are extended compared to the original tag set. To get better results from our tagger, we strip off the part after the hyphen and stick to the original Brown tags, a tagset of 87 tags. We can then repeat the training and testing, and we see a slightly improved result.
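# As a small illustration (the exact output depends on the corpus split, and the example tag is chosen only for illustration), we can peek at a few of the extended tags and see how splitting on the hyphen recovers the original tag:

# In[ ]:

# Illustration only: show a few extended tags and strip one of them.
print(train_sents[0][:5])
print('NN-TL'.split('-')[0])   # an extended tag such as 'NN-TL' is reduced to the original 'NN'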
# In[5]:

def originize(tagged_sents):
    """Change tags to original Brown tags in tagged_sents"""
    return [[(word, tag.split('-')[0]) for (word, tag) in sent]
            for sent in tagged_sents]

orig_train_sents = originize(train_sents)
orig_test_sents = originize(test_sents)

orig_tagger_1 = ConsecutivePosTagger(orig_train_sents)
print(round(orig_tagger_1.evaluate(orig_test_sents), 4))

# We will use these original Brown tags for the rest of this exercise.

# ## Ex1: Corpus and baseline (5 points)

# ### Part a

# To get a good tagger, we need a reasonably sized training corpus. Ideally, we would have used the complete Brown corpus in this exercise, but some of the experiments we will run would then be quite time-consuming. Hence, we will follow the NLTK book and use only the News section. Since this is a rather homogeneous domain, and we also pick our test data from the same domain, we can still get decent results.
# 
# Towards the end of the exercise set, we will see what happens if we take our best model from the News section to a bigger domain.
# 
# Beware that even with this reduced corpus some of the experiments will take several minutes, and when we build the full tagger in exercise 5, an experiment may take more than an hour. So make sure you start the work early enough. (You might do other things while the experiments run.)
# 
# We will be a little more cautious than the NLTK book, though, and split the News section into three sets:
# 
# - 10% for final testing, which we tuck aside for now; call it *news_test*
# - 10% for development testing; call it *news_dev_test*
# - 80% for training; call it *news_train*
# 
# And we will use the original Brown tags as explained.
# 
# - Make the data sets, and repeat the training and evaluation with *news_train* and *news_dev_test* (one possible set-up is sketched after this exercise).
# - Please report accuracies with 4 decimal places and stick to that throughout.

# ### Part b

# One of the first things we should do in an experiment like this is to establish a reasonable baseline. A reasonable baseline here is to return, for each word, its most likely tag. Replace *pos_features* with a feature extractor *word_feature* which uses the word itself as the only feature. Repeat the experiment and report the accuracy of the tagger with this feature (trained on the 80% *news_train* and tested on *news_dev_test*).

# #### Deliveries:
# Code. Results. Answer: Does the NLTK *pos_features* tagger beat your baseline?
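# A minimal sketch of one possible set-up for Ex1 follows. The slicing below and the name *word_feature* are only suggestions; any equivalent split and baseline extractor is fine.

# In[ ]:

# Sketch (assumption): build the three news sets with original Brown tags
# and define a baseline extractor that only looks at the word itself.
news_sents = originize(brown.tagged_sents(categories='news'))
size = int(len(news_sents) * 0.1)
news_test = news_sents[:size]             # 10%, tucked aside for final testing
news_dev_test = news_sents[size:2*size]   # 10%, development test set
news_train = news_sents[2*size:]          # 80%, training set

def word_feature(sentence, i, history):
    """Baseline: the word itself is the only feature."""
    return {"word": sentence[i]}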
# ## Ex2: scikit-learn and tuning (10 points)

# We will replace the naive Bayes classifier with logistic regression. To do that, we move over to scikit-learn. We have reimplemented the ConsecutivePosTagger to use scikit-learn classifiers, and we have made the classifier a parameter so that it can easily be exchanged. We start with the BernoulliNB classifier, which should correspond to the way it is done in NLTK, at least in section 6.1.5.

# In[ ]:

import numpy as np
import sklearn
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer

class ScikitConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents, features=pos_features,
                 clf=BernoulliNB(alpha=0.5)):
        # Using pos_features as the default feature extractor and
        # BernoulliNB (with alpha/Lidstone 0.5) as the default classifier.
        self.features = features
        train_features = []
        train_labels = []
        for tagged_sent in train_sents:
            history = []
            untagged_sent = nltk.tag.untag(tagged_sent)
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = features(untagged_sent, i, history)
                train_features.append(featureset)
                train_labels.append(tag)
                history.append(tag)
        v = DictVectorizer()
        X_train = v.fit_transform(train_features)
        y_train = np.array(train_labels)
        clf.fit(X_train, y_train)
        self.classifier = clf
        self.dict = v

    def tag(self, sentence):
        test_features = []
        history = []
        for i, word in enumerate(sentence):
            featureset = self.features(sentence, i, history)
            test_features.append(featureset)
        X_test = self.dict.transform(test_features)
        tags = self.classifier.predict(X_test)
        return zip(sentence, tags)

# ### Part a

# Train the ScikitConsecutivePosTagger on the *news_train* set and test on the *news_dev_test* set, first with the *pos_features* and then with the *word_feature* extractor. Do you get the same results as with the original NLTK set-up?

# ### Part b

# I get inferior results compared to using the NLTK set-up with the same feature extractors. The only explanation I could find is that the smoothing is too heavy. Therefore, try again with alpha in [1, 0.5, 0.1, 0.01, 0.001, 0.0001] for both feature extractors. What do you find to be the best value of alpha for each of the two?
# 
# With the best choices of alpha, do you get the same results as with NLTK, worse results or better results? What is the new baseline? Does *pos_features* beat the baseline?

# #### Deliveries:
# Code. Results of the runs. Answers to the questions.

# ## Ex3: Logistic regression (10 points)

# ### Part a

# It is time to move on to logistic regression. Import *LogisticRegression* and use it with standard settings instead of *BernoulliNB*. Which results do you now get with the two different feature extractors?

# ### Part b

# The results are a little disappointing. We expect that also here the reason may be smoothing. Smoothing for *LogisticRegression* is done by regularization. In scikit-learn, regularization is expressed by the parameter C; a smaller C means heavier smoothing. (C is the inverse of the parameter $\alpha$ in the lectures.) Try with C in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0] and see which value yields the best results for each of the two feature extractors.

# ### Part c

# If you get results similar to mine, *LogisticRegression* does not perform better than *BernoulliNB* with *word_feature*, while it does perform better than *BernoulliNB* with *pos_features*. Answer in 3-10 sentences why you think that is the case.

# #### Deliveries:
# Code. Results of the runs. Answers to the questions.

# ## Ex4: Features (10 points)

# ### Part a

# The baseline uses only one feature, the word itself, while *pos_features* does not consider the word itself at all. The goal is now to construct a better feature extractor. The first immediate candidate is to use the features from both of these feature extractors. Construct such a feature extractor. Train and test a tagger with this extractor, using *LogisticRegression* and the optimal C from the previous point.

# ### Part b

# Also add a feature for the next word in the sentence, and train and test similarly (a sketch of such a combined extractor follows below).
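# A minimal sketch of what such a combined extractor might look like (the name *combined_features* and the exact feature keys are assumptions; adjust freely):

# In[ ]:

# Sketch (assumption): the features from word_feature and pos_features,
# plus the next word in the sentence.
def combined_features(sentence, i, history):
    features = {"word": sentence[i],
                "suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    features["prev-word"] = "<START>" if i == 0 else sentence[i-1]
    features["next-word"] = "<END>" if i == len(sentence) - 1 else sentence[i+1]
    return features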
# ### Part c

# Try to add more features to get an even better tagger. Only your imagination sets the limits for what you may consider. Some candidates: Is the word a number? Does it contain capital letters? And so on. What is the best feature set you can come up with? Train and test it as before.
# 
# If you use sources for finding tips about good features (like articles, web pages, NLTK code, etc.), make references to the sources and explain what you got from them.
# 
# Observe that the way *ScikitConsecutivePosTagger.tag()* is written, it extracts the features from a whole sentence before it tags it. Hence it does not support preceding tags as features. It is possible to rewrite *ScikitConsecutivePosTagger.tag()* to tag one word at a time and to use the *history*, which keeps the preceding tags in the sentence. If you like, you may try that. However, we got surprisingly little gain from including preceding tags as features, and you are not requested to try it.

# #### Deliveries:
# Code. Results of the runs. Answers to the questions.

# ## Ex5: Larger corpus and evaluation (10 points)

# ### Part a

# We can now test our best tagger so far on the *news_test* set. Do that. How does the result compare to testing on *news_dev_test*?

# ### Part b

# But we are looking for bigger fish. How good are our settings when trained on a bigger corpus?
# 
# We will use nearly the whole Brown corpus, but we will hold back two categories for later evaluation: *adventure* and *hobbies*. We will also initially stay clear of *news* to be sure not to mix training and test data.
# 
# Call the Brown corpus with all categories except these three *rest*. Shuffle the tagged sentences from *rest* and strip the tags down to the original Brown tags. Then split it 80%-10%-10% into *rest_train*, *rest_dev_test* and *rest_test*.
# 
# We can then merge these sets with the corresponding sets from *news* to get our final training and test sets:
# 
# `train = rest_train + news_train`
# 
# `test = rest_test + news_test`
# 
# The first thing we should do is to establish a new baseline. Go back to the *word_feature* feature extractor, train a classifier on *train* and test it on *test*. What do you see? (Beware, this may take some time.)

# ### Part c

# We can then build our tagger for this larger domain. Use the best settings from the earlier exercises, train on *train* and test on *test*. What is the accuracy of your tagger? (Beware, this may take even longer.)

# ### Part d

# Test the big tagger first on *adventure*, then on *hobbies*. Discuss in a few sentences why you see different results than when testing on *test*. Why do you think you got different results on *adventure* than on *hobbies*?

# #### Deliveries:
# Code. Results of the runs. Answers to the questions.

# ## Ex6: Comparing to other taggers (5 points)

# ### Part a

# In the lectures, we spent quite some time on the HMM tagger. NLTK comes with an HMM tagger which we may train and test on our own corpus. It can be trained by
# 
# `news_hmm_tagger = nltk.HiddenMarkovModelTagger.train(news_train)`
# 
# and tested similarly to our other taggers. Train and test it, first on the *news* sets and then on the big *train*/*test* sets. How does it perform compared to your best tagger? What about speed? (One way to time the runs is sketched at the end of this exercise set.)

# ### Part b

# NLTK also comes with an averaged perceptron tagger which we may train and test. It is currently considered the best tagger included with NLTK. It can be trained as follows:

# In[ ]:

news_per_tagger = nltk.PerceptronTagger(load=False)
news_per_tagger.train(news_train)

# It is tested similarly to our other taggers.
# 
# Train and test it, first on the *news* sets and then on the big *train*/*test* sets. How does it perform compared to your best tagger? Did you beat it? What about speed?

# #### Deliveries:
# Code. Results of the runs. Answers to the questions.
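# To compare speed, one simple option is to wrap training and evaluation in timers. Below is a minimal sketch for the HMM tagger on the news sets (which sets to time, and how, is up to you; the same pattern works for the other taggers):

# In[ ]:

import time

# Sketch (assumption): time training and evaluation of the HMM tagger.
start = time.time()
news_hmm_tagger = nltk.HiddenMarkovModelTagger.train(news_train)
print("HMM training took {:.1f} seconds".format(time.time() - start))

start = time.time()
print("HMM accuracy:", round(news_hmm_tagger.evaluate(news_test), 4))
print("HMM evaluation took {:.1f} seconds".format(time.time() - start))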