(Briefly Discussing a) Large Data Set

Tomorrow is my final A Level Maths exam. It's on stats and mechanics, which isn't my favourite part of maths (despite me generally liking data analysis), and I can't say I'm not slightly worried about it. Yes, I have been revising; yes, I know my hypothesis tests. And yes, I think I know why the data point for wind direction on the 29th October 1987 in Leeming is invalid.

To be exact, it lists the wind direction (which can only take values from 0 to 360, as in angles) as 999 - obviously it's an anomaly, and I half suspect it could be a copyright trap, or else a genuine mistake from the weather station in Leeming.
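As a quick sanity check, here's a minimal Python sketch of how a value like that could be flagged. The function name and most of the sample readings are made up for illustration; only the 999 comes from the data set.

```python
def is_valid_wind_direction(degrees):
    """Return True if the bearing lies in the 0 to 360 range used for wind direction."""
    return 0 <= degrees <= 360


# Hypothetical readings; only the 999 is taken from the Large Data Set.
readings = [270, 90, 999, 0]

# Keep every reading that falls outside the valid range.
invalid = [r for r in readings if not is_valid_wind_direction(r)]
print(invalid)  # [999]
```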

There are five main locations on the Large Data Set - Camborne, a town located in the south-west; Hurn, a former RAF base (now the site of Bournemouth Airport) along the south coast; Heathrow, referring to the London airport that's one of the busiest in the world; Leeming, home to an RAF base in Yorkshire; and Leuchars, a Scottish town near St Andrews and Dundee. Thanks to the Met Office collating months' worth of weather data in these locations, Edexcel have enabled teenagers to access the data for free, provided they sit an exam at some point.

Cloudy day, maybe about 6-7 oktas
The data included is on temperature, wind direction/speed/gust, pressure, hours of sunshine, and rainfall. Some of the data is repeated in various forms - wind direction is given both in degrees (as discrete data) and as cardinal directions (categorical data). Windspeed is given either in knots or on the Beaufort scale, which assigns a description to the wind - 6kn is "light" on the scale, whereas 11kn is "moderate". My personal favourite is cloud cover, which is measured in oktas from 0 to 8, with 0 being clear skies and 8 meaning not an inch of blue is visible.
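To make the Beaufort idea concrete, here's a rough Python sketch that turns a windspeed in whole knots into a Beaufort force and then into a description. The force boundaries are the standard Beaufort ones as I understand them; grouping forces 1 to 3 as "light", 4 as "moderate" and 5 as "fresh" is my assumption about how the data set labels them, and the function names are invented for this example.

```python
from bisect import bisect_left

# Upper bound in knots of Beaufort forces 0 to 11; anything above 63kn is force 12.
# Assumes windspeeds are whole numbers of knots.
FORCE_UPPER_BOUNDS_KN = [0, 3, 6, 10, 16, 21, 27, 33, 40, 47, 55, 63]


def beaufort_force(knots):
    """Return the Beaufort force (0 to 12) for a windspeed in whole knots."""
    return bisect_left(FORCE_UPPER_BOUNDS_KN, knots)


def beaufort_description(knots):
    """Return a description; the light/moderate/fresh grouping is an assumption."""
    force = beaufort_force(knots)
    if force == 0:
        return "calm"
    if force <= 3:
        return "light"
    if force == 4:
        return "moderate"
    if force == 5:
        return "fresh"
    return f"force {force}"


print(beaufort_description(6))   # light
print(beaufort_description(11))  # moderate
```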

Sometimes the units can get very strange too - visibility is measured in decametres (tens of metres), something which caught most of my class out. Pressure is measured in hectopascals, or hundreds of pascals, unlike the more typical bars or atm (although one hectopascal is the same as one millibar, so millibars are fine). This should come as no surprise - the Met Office are meteorologists, and the units reflect this.
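For anyone else caught out by the units, here's a small Python sketch of the conversions. The function names and the example figures are made up for illustration, not taken from the data set.

```python
METRES_PER_DECAMETRE = 10
PASCALS_PER_HECTOPASCAL = 100


def decametres_to_metres(dam):
    """Convert a visibility in decametres to metres."""
    return dam * METRES_PER_DECAMETRE


def hectopascals_to_pascals(hpa):
    """Convert a pressure in hectopascals to pascals."""
    return hpa * PASCALS_PER_HECTOPASCAL


def hectopascals_to_millibars(hpa):
    """One hectopascal equals one millibar, so the number is unchanged."""
    return hpa


# Hypothetical readings: a visibility of 2500 dam and a pressure of 1013 hPa.
print(decametres_to_metres(2500))       # 25000
print(hectopascals_to_pascals(1013))    # 101300
print(hectopascals_to_millibars(1013))  # 1013
```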

Sometimes, the exam will want to ask you how to compare these variables, perhaps calculate a mean value, and carry out hypothesis tests on correlations. One thing I've learned is to always assume pressure will be on the x axis, as it's almost always the explanatory variable. As pressure increases, for instance, rainfall decreases or sunshine increases. Thankfully we don't need to know why this is geographically the case.

The main locations I want to make sure I understand, however, aren't in the UK. They are:

  • Jacksonville, located in south-eastern US;
  • Beijing, in eastern China;
  • Perth, in south-western Australia.

They only have five variables that can be measured, namely:

  • Daily Mean Air Temperature (the British locations only have the daily mean temperature)
  • Rainfall over a 24 hour period
  • Daily Mean Pressure 
  • Daily Mean Windspeed in both knots and the Beaufort scale

There's only one categoric variable therefore, and that's the Beaufort scale. 

The other main discrepancy is the existence of trace (tr) values in the data set. They essentially refer to any values of rainfall recorded as being between 0 and 0.05 mm. When calculating the mean, they count as zero.
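Here's a minimal Python sketch of that rule, assuming the rainfall column arrives as text with "tr" marking a trace. The function name and the sample values are invented for the example.

```python
def mean_rainfall(values):
    """Mean daily rainfall in mm, counting trace ("tr") entries as zero."""
    cleaned = [0.0 if v == "tr" else float(v) for v in values]
    return sum(cleaned) / len(cleaned)


# Hypothetical five days of rainfall, two of them trace values.
sample = ["0.0", "tr", "2.5", "tr", "5.5"]
print(mean_rainfall(sample))  # total of 8.0 mm over 5 days
```

Note that the trace days still count towards the number of days, so they pull the mean down rather than being ignored.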

Also, I shouldn't forget that the data set only includes data between May and October inclusive - the other months are forgotten about, so any mean values aren't fully reflective of the UK as a whole. 

Extra content

About a year ago, I wrote a blogpost called "July in terms of the weather", where I averaged out the data points for highest and lowest temperatures in every July from 1948 onwards, as recorded in Heathrow. Now, I can finally compare and contrast my data from 1987 and 2015 using the large data set:

July 1987 with LDS: 17.5°C; July 1987 with min/max values: 17.5°C

July 2015 with LDS: 18.8°C; July 2015 with min/max values: 18.8°C

I didn't expect them to line up so neatly.

Also, AQA and OCR students have been going over a different large data set; AQA are looking at car mileage, whereas OCR are analysing what seems to be almost every local authority. Apologies to anyone caught out by this blogpost, and I hope the exam goes well.

Back to revising...
