Project Exam for STK4900 and STK9900
Spring semester 2007
The exam in STK4900/STK9900 consists of this project exam and an oral
examination.
The written solution to the project exam (in Norwegian or English) must be
handed in no later than May 30th at 2 pm, either by regular mail or by e-mail to
You are not allowed to collaborate with other students on the project exam.
The oral examinations take place June 12th to June 14th. Details will be posted
on the course web-page.
The project exam consists of four exercises. In each exercise you will analyse
a data set and interpret the analyses you have done. You may use the software
package of your choice, but whether you use R or not, you must be able to
answer all questions. We recommend that you use R.
The written solution to the exercises should be divided into two parts. In the
main part you answer the questions and present the numerical results and plots
that are necessary for your arguments. In an appendix you should document the
computer code you have used to obtain the results in the main part. (You should
only include the final code, not all the trial and error.)
If you have questions regarding the exercise texts or technicalities regarding
R, please send an e-mail to borgan@math.uio.no.
Remember to write your name, date of birth
(cf. exercise 4) and e-mail address on the solution.
EXERCISE 1
Foresters need to be able to assess the amount of timber in a part of a
forest. Therefore they need a simple and quick method to estimate the
volume of a tree.
It is difficult to estimate the volume of a living tree. It is, however,
fairly easy to measure its height, and even easier to measure its diameter at
ground level. Foresters therefore need to have available a formula that relates
the volume of a tree to its diameter and/or height.
On the course web-page you find the data set trees.txt, which contains measurements of the
diameter, height and volume of a sample of 31 trees from a forest in the
For each tree the measurements are:
- DIAMETER    Diameter in inches
- HEIGTH      Height in inches
- VOLUME      Volume in cubic feet
a) Study the relation between volume
and diameter and between volume and height using simple linear regression.
Which of the two variables diameter and height is by itself the best predictor
for volume?
b) Study the relation between volume,
diameter and height using multiple linear regression.
c) Assess the fit of the model in
question b using suitable plots of the residuals. Make sure that you comment on
what each of the plots tells you.
d) Is there a better way to describe
the relation between volume, diameter and height than the one in question b? If
so, perform such an analysis and interpret the results.
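As an illustration of what the simple linear regressions in question a compute, here is a minimal sketch of a least-squares fit of one response on one predictor, returning the intercept, the slope and R-squared. The function name and the toy numbers are hypothetical; they are not taken from trees.txt, and the sketch does not replace the output of a statistical package such as R.

```python
def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # In simple linear regression R-squared is the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical toy numbers for illustration only (NOT values from trees.txt).
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]
intercept, slope, r_squared = simple_linear_regression(toy_x, toy_y)
```

Comparing R-squared from the fit on diameter with R-squared from the fit on height is one way to judge which variable is the better single predictor.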
EXERCISE 2
On the course web-page you find the data set insects.txt, which contains results from an
experiment where one wanted to assess the toxicity of the substance rotenone.
Groups of about 50 insects were exposed to various doses of rotenone, and the
number of insects that died at each dose level was recorded.
The variables in the data set are coded as follows:
- LOGDOSE     Logarithm of the concentration of rotenone (base 10 logarithm)
- NUMBER      Number of insects in total
- DEAD        Number of insects that died
a) Fit a suitable regression model for
the relation between the proportions of insects that died and the doses of
rotenone. Give your reasons for your choice of regression model.
b) Assess the fit of the model in
question a by a suitable plot and a formal test, and give an interpretation of
the model.
c) Use the fitted model to estimate
LD50, i.e. the dose required to kill half the members of a tested
population.
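The following sketch shows how LD50 can be obtained from fitted coefficients, assuming (this is an assumption, not something the exercise prescribes) that the model in question a is a logistic regression with LOGDOSE as covariate. Setting the fitted probability to 0.5 makes the linear predictor zero, so the base 10 log-dose is minus the intercept divided by the slope. The function name and the coefficient values are hypothetical.

```python
def ld50_from_logit(intercept, slope):
    """LD50 on the original dose scale from a logistic model
    logit(p) = intercept + slope * log10(dose)."""
    if slope == 0:
        raise ValueError("slope must be non-zero")
    # p = 0.5 means logit(p) = 0, so log10(dose) = -intercept / slope.
    log10_ld50 = -intercept / slope
    # LOGDOSE is a base 10 logarithm, so back-transform with a power of 10.
    return 10.0 ** log10_ld50


# Hypothetical coefficients for illustration only (NOT fitted to insects.txt).
example_ld50 = ld50_from_logit(-2.0, 4.0)
```

With another link function (for example probit) the same idea applies, but the linear predictor value that corresponds to a probability of 0.5 must be checked for that link.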
EXERCISE 3
In this exercise you will analyse data on accidents in a portfolio of private
cars in a medium-sized English insurance company during a three-month period in
the 1970s.
On the course web-page you find the data set claims.txt, which contains the
number of insurance claims according to the age of the driver, the motor volume
of the car and the area where the driver lived. The data set also contains
information on the number of insured persons in each group (defined by age,
motor volume and area).
The variables in the data set are as follows:
- AGE         Age of the driver:
                1=less than 25 years
                2=25-29 years
                3=30-35 years
                4=more than 35 years
- VOLUME      Motor volume:
                1=less than
                2=1 -
                3=1.5 -
                4=more than
- AREA        Area where the driver lived:
                4=
                1-3=other districts
- NUMBER      Number of insured persons in the group
- ACCIDENTS   Number of accidents in the group
a) Explain why it is reasonable to
assume that the number of accidents in a given group is Poisson distributed,
and describe a type of regression model that is suitable for analysing the
data.
b) Perform an analysis of the data
using the regression formulation described in question a. As part of the
analysis, you should investigate whether all the factors (age, volume and area)
are needed to describe the accident rate, and whether there are interactions
between the factors.
c) Give an interpretation of the model you arrived at in question b.
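Because the groups differ in size, the number of insured persons acts as an exposure. The sketch below shows one common way to handle this, assuming a Poisson model with log link where log(NUMBER) enters as an offset; this is an assumption for illustration, not the prescribed answer. The function names and the numbers are hypothetical.

```python
import math


def observed_rate(accidents, number_insured):
    """Crude accident rate per insured person in a group."""
    if number_insured <= 0:
        raise ValueError("number_insured must be positive")
    return accidents / number_insured


def expected_accidents(number_insured, linear_predictor):
    """Expected count under a log-link Poisson model with offset:
    log(E[ACCIDENTS]) = log(NUMBER) + linear_predictor,
    so exp(linear_predictor) is the accident rate per insured person."""
    return number_insured * math.exp(linear_predictor)


# Hypothetical numbers for illustration only (NOT values from claims.txt).
toy_rate = observed_rate(10, 200)
toy_expected = expected_accidents(200, math.log(0.05))
```

Under this formulation the exponential of a regression coefficient can be read as a rate ratio between two levels of a factor.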
EXERCISE 4
In this exercise you will study the importance of some covariates for the birth weight of a child. To this end you will use a sample of 500 birth weights from pregnancies that lasted at least 38 weeks. Each student should use their own sample of 500 birth weights; the course web-page describes how to obtain your sample. Make sure that you follow these instructions carefully.
The variables in the data set are coded as follows:
- AGE         Age of the mother at the start of the pregnancy
- WEEKS       Length of the pregnancy in weeks
- SEX         Sex of the child (0=boy, 1=girl)
- PARITY      Parity of the child (1 for the first child, 2 for the second
              child, etc.)
- WEIGHT      Birth weight in grams
Analyse the data with the aim of finding a model that can be used to predict
the weight of a newborn child (based on the four covariates: age of the mother,
length of the pregnancy, and sex and parity of the child). Describe how you
arrive at your model, and give an interpretation of the model.