Assignment 1 - STK4900 - Spring 2008 - Universitetet i Oslo

Project Exam for STK4900 and STK9900

Spring semester 2007

The exam in STK4900/STK9900 consists of this project exam and an oral examination.

The written solution to the project exam (in Norwegian or English) must be handed in no later than May 30th at 2 pm, either by regular mail or by e-mail, to

Ørnulf Borgan, Matematisk institutt, Universitetet i Oslo, P.B. 1053 Blindern, 0316 Oslo
e-mail: borgan@math.uio.no

You are not allowed to collaborate with other students on the project exam.

The oral examinations take place June 12th to June 14th. Details will be posted on the course web-page.

The project exam consists of four exercises. In each exercise you will analyse a data set and interpret the analyses you have done. You may use the software package of your choice, but whether you use R or not, you must be able to answer all questions. We recommend that you use R.

The written solution to the exercises should be divided into two parts. In the main part you answer the questions and present the numerical results and plots that are necessary for your arguments. In an appendix you should document the computer code you have used to obtain the results in the main part. (You should only include the final code, not all of your trial and error.)

If you have questions regarding the exercise texts or technicalities regarding R, please send an e-mail to borgan@math.uio.no.

Remember to write your name, date of birth (cf. exercise 4) and e-mail address on the solution.

EXERCISE 1

Foresters need to be able to assess the amount of timber in a part of a forest. Therefore they need a simple and quick method to estimate the volume of a tree.

It is difficult to estimate the volume of a living tree. It is, however, fairly easy to measure its height, and even easier to measure its diameter at ground level. Foresters therefore need to have available a formula that relates the volume of a tree to its diameter and/or height.

At the course web-page you find the data set trees.txt which contains measurements of the diameter, height and volume of a sample of 31 trees from a forest in the US. (These measurements were taken after the trees were cut down.)

For each tree the measurements are:

- DIAMETER: Diameter in inches, measured 4.5 feet above the ground
- HEIGTH: Height in inches
- VOLUME: Volume in cubic feet

a) Study the relation between volume and diameter and between volume and height using simple linear regression. Which of the two variables, diameter or height, is by itself the better predictor of volume?

b) Study the relation between volume, diameter and height using multiple linear regression.

c) Assess the fit of the model in question b using suitable plots of the residuals. Make sure that you comment on what each of the plots tells you.

d) Is there a better way to describe the relation between volume, diameter and height than the one in question b? If so, perform such an analysis and interpret the results.
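
As a reminder of what a simple linear regression computes, the following is a minimal Python sketch (standard library only) of a least-squares fit of y = a + b*x together with the coefficient of determination R^2, which is one common way to compare two single-predictor models. The function name `simple_linear_regression` and the toy numbers are hypothetical and are not taken from trees.txt; in practice one would use the regression routines of R or another package.

```python
from statistics import mean


def simple_linear_regression(x, y):
    """Least-squares fit of y = a + b*x. Returns (a, b, r_squared)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    b = sxy / sxx                      # slope
    a = y_bar - b * x_bar              # intercept
    r_squared = sxy ** 2 / (sxx * syy) # squared correlation
    return a, b, r_squared


# Toy numbers for illustration only (NOT the tree measurements).
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.1, 4.9, 7.2, 8.8]
a, b, r2 = simple_linear_regression(toy_x, toy_y)
print(f"intercept={a:.3f}, slope={b:.3f}, R^2={r2:.4f}")
```

Fitting each predictor separately in this way and comparing the R^2 values (along with residual plots) is one possible basis for judging which predictor does better on its own.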

EXERCISE 2

At the course web-page you find the data set insects.txt which contains results from an experiment where one wanted to assess the toxicity of the substance rotenone. Groups of about 50 insects were exposed to various doses of rotenone, and the number of insects that died at each dose level was recorded.

The variables in the data set are coded as follows:

- LOGDOSE: Logarithm of the concentration of rotenone (base 10 logarithm)
- NUMBER: Number of insects in total
- DEAD: Number of insects that died

a) Fit a suitable regression model for the relation between the proportions of insects that died and the doses of rotenone. Give your reasons for your choice of regression model.

b) Assess the fit of the model in question a by a suitable plot and a formal test, and give an interpretation of the model.

c) Use the fitted model to estimate LD50, i.e. the dose required to kill half the members of a tested population.
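
To illustrate the LD50 calculation, here is a minimal Python sketch under the assumption that a logistic regression model of the form logit(p) = b0 + b1*LOGDOSE has been fitted; this is only one possible model choice, not necessarily the intended one. The death probability equals 0.5 exactly when b0 + b1*LOGDOSE = 0, so the base 10 logarithm of LD50 is -b0/b1. The function names and the coefficients below are hypothetical and are not estimates from insects.txt.

```python
import math


def logistic_probability(log_dose, b0, b1):
    """Death probability under the assumed model logit(p) = b0 + b1*log_dose."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * log_dose)))


def ld50(b0, b1):
    """Return (log10 of LD50, LD50 on the original dose scale)."""
    if b1 == 0:
        raise ValueError("the slope b1 must be non-zero")
    log_ld50 = -b0 / b1          # solves b0 + b1*x = 0, i.e. p = 0.5
    return log_ld50, 10.0 ** log_ld50


# Hypothetical coefficients for illustration only (NOT fitted values).
b0, b1 = -4.0, 8.0
log_ld50, dose = ld50(b0, b1)
print(f"log10(LD50)={log_ld50:.3f}, LD50={dose:.3f}")
```

Because LOGDOSE is a base 10 logarithm, the back-transformation to the dose scale uses a power of 10 rather than the exponential function.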

EXERCISE 3

In this exercise you will analyse data on accidents in a portfolio of private cars in a medium-sized English insurance company during a three-month period in the 1970s.

At the course web-page you find the data set claims.txt, which contains the number of insurance claims according to the age of the driver, the motor volume of the car and the area where the driver lived. The data set also contains information on the number of insured persons in each group (defined by age, motor volume and area).

The variables in the data set are as follows:

- AGE: Age of the driver:
    1=less than 25 years
    2=25-29 years
    3=30-35 years
    4=more than 35 years
- VOLUME: Motor volume:
    1=less than 1 litre
    2=1 - 1.5 litres
    3=1.5 - 2 litres
    4=more than 2 litres
- AREA: Area where the driver lived:
    4=London and other big cities
    1-3=other districts
- NUMBER: Number of insured persons in the group
- ACCIDENTS: Number of accidents in the group

a) Explain why it is reasonable to assume that the number of accidents in a given group is Poisson distributed, and describe a type of regression model that is suitable for analysing the data.

b) Perform an analysis of the data using the regression formulation described in question a. As part of the analysis, you should investigate whether all the factors (age, volume and area) are needed to describe the accident rate, and whether there are interactions between the factors.

c) Give an interpretation of the model you arrived at in question b.
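
For orientation, the sketch below shows in Python how a log-linear Poisson model for rates is usually structured: the expected number of accidents in a group is the number of insured persons times exp(linear predictor), so log(NUMBER) enters as an offset and exp(coefficient) is read as a rate ratio. This is a sketch under that assumed formulation, not a fitted model; the function names and all numerical values are hypothetical and do not come from claims.txt.

```python
import math


def expected_accidents(n_insured, intercept, effects=()):
    """Expected count mu = n * exp(intercept + sum of effects).

    Equivalently log(mu) = log(n) + intercept + effects, where log(n)
    is the offset of the log-linear Poisson model.
    """
    return n_insured * math.exp(intercept + sum(effects))


def poisson_log_likelihood(counts, means):
    """Poisson log-likelihood of observed counts given expected counts."""
    return sum(y * math.log(m) - m - math.lgamma(y + 1)
               for y, m in zip(counts, means))


# Hypothetical values for illustration only (NOT estimates from the data):
# a baseline rate of 0.10 accidents per insured person, and one factor
# level that multiplies the rate by 1.5.
intercept = math.log(0.10)
effect = math.log(1.5)
mu = expected_accidents(200, intercept, [effect])
rate_ratio = math.exp(effect)
print(f"expected accidents={mu:.1f}, rate ratio={rate_ratio:.2f}")
```

In a model of this form, an interaction between two factors means that the rate ratio for one factor differs across the levels of the other.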

EXERCISE 4

In this exercise you will study the importance of some covariates on the birth weight of a child. To this end you shall use a sample of 500 birth weights from pregnancies where the pregnancy lasted at least 38 weeks. Each student should use their own sample of 500 birth weights, and it is described on the course web-page how you should proceed to obtain your sample. Make sure that you follow these instructions carefully.

The variables in the data set are coded as follows:

- AGE: Age of the mother at the start of the pregnancy
- WEEKS: Length of the pregnancy in weeks
- SEX: Sex of the child (0=boy, 1=girl)
- PARITY: Parity of the child (1 for the first child, 2 for the second child, etc.)
- WEIGHT: Birth weight in grams

Analyse the data with the aim of finding a model that can be used to predict the weight of a newborn child (based on the four covariates age of the mother, length of the pregnancy, and sex and parity of the child). Describe how you arrive at your model, and give an interpretation of the model.