Dec 29, 2023

Assignment Task


1. The activity of enzyme X is measured in enzyme units (U), where 1 U = 1 μmol min−1 (one micromole per minute), in five different batches of cells at several different temperatures, and the results are recorded in enzyme_activity.csv.

Is there a significant difference between the average enzyme activity at each temperature at the 5% significance level?

Give your educated guess, hypotheses and decision.

Recreate a plot similar to this, changing the title to reflect your decision and highlighting any pairs of temperatures between which there is a significant difference
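One possible approach to the comparison above is a one-way ANOVA followed by Tukey's honest significant difference test to find which pairs of temperatures differ. The sketch below uses simulated data in place of enzyme_activity.csv; the column names temperature and activity are assumptions, and it treats the five batches as independent replicates, which may not match the real design.

```r
# Simulated stand-in for enzyme_activity.csv (made-up values, assumed column names)
set.seed(1)
enzyme <- data.frame(
  temperature = factor(rep(c("20C", "30C", "40C"), each = 5)),
  activity    = c(rnorm(5, 10, 1), rnorm(5, 12, 1), rnorm(5, 20, 1))
)

# One-way ANOVA: H0 is that mean activity is the same at every temperature
fit <- aov(activity ~ temperature, data = enzyme)
print(summary(fit))

# Tukey HSD gives an adjusted p-value for every pair of temperatures
tukey <- TukeyHSD(fit, conf.level = 0.95)
print(tukey$temperature)
```

Pairs whose adjusted p-value (column "p adj") is below 0.05 are the ones to highlight on the plot.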

2. Protein concentration assays are best read as curves, but we are going to simplify by using linear regression, which can be appropriate over certain concentration ranges.

The file prot_conc.csv contains 20 measurements of standards of known protein concentration (μg/ml) and the absorbances (unitless) recorded.

3. The file water_o2.csv contains 50 measurements from freshwater ponds and streams in an area. The data include the temperature of the water (degrees Celsius), the pH of the water and the oxygen level of the water (milligrams per litre).

Visualise each possible pair of measurements and give your educated guess as to whether any are correlated and, if so, whether the correlation will be positive or negative.

Test if each pair is correlated, give the hypotheses and your decision.
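The pairwise tests could be run with cor.test, as in the minimal sketch below. It uses simulated data in place of water_o2.csv, and the column names temperature, ph and oxygen are assumptions. Pearson correlation is the cor.test default; whether it is the right choice depends on the real data.

```r
# Simulated stand-in for water_o2.csv (made-up values, assumed column names)
set.seed(2)
water <- data.frame(
  temperature = runif(50, 5, 25),
  ph          = rnorm(50, 7, 0.5)
)
water$oxygen <- 14 - 0.3 * water$temperature + rnorm(50, 0, 0.5)

# Every pair of columns, tested with H0: the true correlation is zero
pair_names <- combn(names(water), 2, simplify = FALSE)
results <- lapply(pair_names, function(p) cor.test(water[[p[1]]], water[[p[2]]]))
names(results) <- sapply(pair_names, paste, collapse = " vs ")

# Correlation estimate and p-value for each pair
print(sapply(results, function(r) c(estimate = unname(r$estimate), p.value = r$p.value)))
```

For the visual check, pairs(water) draws a scatterplot for every pair of columns in one figure.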

4. The file driving_simulator.csv contains 100 measurements of the number of accidents people had while driving a simulator and the number of hours of sleep they had in the previous 24 hours.

Imagine you are a driving safety campaigner trying to use this data to complete the following slogans:

“Sleep saves lives. Just one more hour’s sleep could prevent XXXXX traffic accidents”

“Don’t drive tired, people who have not slept at all are predicted to have XXXX accidents”

What are the appropriate numbers to use in each slogan?
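Both slogans map onto the coefficients of a simple linear regression: the slope is the predicted change in accidents per extra hour of sleep, and the intercept is the predicted number of accidents at zero hours of sleep. The sketch below illustrates this with simulated data in place of driving_simulator.csv; the column names sleep_hours and accidents are assumptions.

```r
# Simulated stand-in for driving_simulator.csv (made-up values, assumed column names)
set.seed(3)
driving <- data.frame(sleep_hours = runif(100, 0, 10))
driving$accidents <- 12 - 1 * driving$sleep_hours + rnorm(100, 0, 1)

fit <- lm(accidents ~ sleep_hours, data = driving)
coefs <- coef(fit)

# Slope: change in predicted accidents for one more hour of sleep
print(coefs["sleep_hours"])

# Intercept: predicted accidents for someone who has not slept at all
print(coefs["(Intercept)"])

# The same intercept, obtained through predict()
zero_sleep <- predict(fit, newdata = data.frame(sleep_hours = 0))
print(zero_sleep)
```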

5. A microbiologist is studying the effect of various pollutants on the doubling time of a specific bacterial species. The researcher exposed multiple growth colonies of the bacteria to different pollutants or just growth media as a control. The doubling time of the bacteria was measured for each growth colony.

The researcher wants to determine if there is a significant difference in the average doubling time among the different pollutant treatments. Perform an ANOVA analysis on the data and interpret the results.

Doubling times (minutes) and pollutant used are recorded in pollutants.csv. Give an educated guess, hypotheses and a decision.

6. The file class_survey.csv contains multiple pieces of information about 100 students in a class: their hair colour, eye colour, county of birth, which class they belong to and which society they have joined.

Construct a table counting how many students are found in each potential category combination of county of birth and class membership.

Do the categories appear to be independent of each other? Visualise so you can give an educated guess, hypotheses and decision.

How many degrees of freedom does this test have and why?
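As a rough illustration, table() builds the contingency table and chisq.test() reports the degrees of freedom, which for a test of independence equal (rows − 1) × (columns − 1). The data below are made up, and the column names county and class and their category labels are hypothetical stand-ins for whatever class_survey.csv contains.

```r
# Made-up stand-in for class_survey.csv (hypothetical column names and categories)
survey <- data.frame(
  county = rep(c("County_A", "County_B", "County_C"), times = c(40, 35, 25)),
  class  = rep(c("Class_X", "Class_Y"), length.out = 100)
)

# Contingency table: one cell per county/class combination
counts <- table(survey$county, survey$class)
print(counts)

# Chi-squared test of independence
test <- chisq.test(counts)
print(test$parameter)  # degrees of freedom

# The same value from the table dimensions
df_by_hand <- (nrow(counts) - 1) * (ncol(counts) - 1)
print(df_by_hand)
```

Once the row and column totals are fixed, only (rows − 1) × (columns − 1) cells can vary freely, which is where the formula comes from.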

7. The file test_independence.csv contains counts of numbers of individuals which fall into certain categories.

NOTE: The first column of this file contains the rownames. When reading in the file use row.names = 1 to tell R to use those as rownames or edit the input to set the rownames to be these names instead of numbers and then delete this column.
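The effect of row.names = 1 can be seen in the small sketch below, which writes a temporary file and reads it back. The file contents are made up, and every category name apart from Mountain and Green_eye is hypothetical.

```r
# Write a small made-up file whose first column holds the row names
tmp <- tempfile(fileext = ".csv")
writeLines(c(",Green_eye,Blue_eye",
             "Mountain,3,7",
             "Coast,6,4"), tmp)

# row.names = 1 tells read.csv to use the first column as row names
counts <- read.csv(tmp, row.names = 1)
print(rownames(counts))

# Only the count columns remain as data
print(as.matrix(counts))
```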

Do the categories appear to be independent of each other? Visualise so you can give an educated guess, hypotheses and decision.

8. There are two possible tests of independence. Which is more appropriate here?

How many individuals would you expect to see who are both Mountain and Green_eye? Is this bigger or smaller than the observed number in this category?

If the two categories are not independent, describe what you think is the biggest difference from independence as “there are more/fewer [category combination] than I would expect from the overall proportion in [category]”.

NOTE: Particular combinations of test and data may produce the error “FEXACT error 7(location). LDSTP=18630 is too small for this problem”. The answer to this is to add simulate.p.value=TRUE inside the brackets when running the test.
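Under independence, the expected count in a cell is (row total × column total) / grand total. The sketch below computes this by hand for a made-up table and then runs Fisher's exact test with simulate.p.value = TRUE. The counts and every category name apart from Mountain and Green_eye are hypothetical, and this small table would not itself trigger the FEXACT error.

```r
# Made-up counts (hypothetical categories apart from Mountain and Green_eye)
counts <- matrix(c(3, 7, 5,
                   6, 4, 5,
                   8, 2, 5),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("Mountain", "Coast", "Plain"),
                                 c("Green_eye", "Blue_eye", "Brown_eye")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
print(expected["Mountain", "Green_eye"])
print(counts["Mountain", "Green_eye"])

# chisq.test stores the same expected counts (warning suppressed: small expected values)
chisq_expected <- suppressWarnings(chisq.test(counts))$expected

# Fisher's exact test with a Monte Carlo p-value instead of the exact calculation
set.seed(6)
fisher_result <- fisher.test(counts, simulate.p.value = TRUE, B = 2000)
print(fisher_result$p.value)
```

Because the p-value is simulated, it changes slightly between runs unless the random seed is fixed.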

9. The file bicep_increase.csv contains 100 measurements of people following a weightlifting program intended to increase bicep size. The columns represent the increase in bicep size from start to end of training (inches), the number of weeks spent training and sex.

Recreate the following plot with appropriate axis labels and use it to make an educated guess about the following questions.

  • Is it possible to predict the increase in bicep size from the number of weeks of training?

  • What is the average increase that you would predict someone will see after 21 weeks of training?

  • What range of average values would you predict someone will see after 21 weeks based on this sample?

  • What range of increases would you predict an individual who has trained for 21 weeks might see?

  • Are there individual values which lie outside the confidence interval? Are there individual values which lie outside the prediction interval? How can this be?
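The difference between the two ranges asked for above can be sketched with predict() on a fitted linear model: interval = "confidence" gives a range for the average increase, and interval = "prediction" gives a range for a single individual. The data below are simulated in place of bicep_increase.csv, and the column names weeks and increase are assumptions.

```r
# Simulated stand-in for bicep_increase.csv (made-up values, assumed column names)
set.seed(5)
bicep <- data.frame(weeks = sample(4:30, 100, replace = TRUE))
bicep$increase <- 0.1 * bicep$weeks + rnorm(100, 0, 0.3)

fit <- lm(increase ~ weeks, data = bicep)
new_person <- data.frame(weeks = 21)

# Range for the AVERAGE increase after 21 weeks
conf_int <- predict(fit, newdata = new_person, interval = "confidence", level = 0.95)
print(conf_int)

# Range for an INDIVIDUAL's increase after 21 weeks
pred_int <- predict(fit, newdata = new_person, interval = "prediction", level = 0.95)
print(pred_int)
```

Both intervals share the same fitted value, but the prediction interval is wider because it also includes the scatter of individuals around the line.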

