ParkRunning and Crosstalking! Interactive graphs + Linear Regression in R
Park Run Results: Brockwell Park, 10th June
I am currently in training for a 10km running race. As part of my training I have started doing the park run every Saturday morning to get some speed practice. This Saturday I did the Brockwell park run in South London and got my second fastest parkrun time of 26 minutes 47 seconds. Park run publishes the complete results each week, showing the times of all finishers along with their gender, age band and personal best for the course. I decided to use this data to make some charts of this week's results.
At work I have recently come across the interactive charting packages “plotly” and “crosstalk”. These are relatively new packages that create interactive charts (similar to those in R Shiny) but can be hosted in an HTML webpage without the need for a Shiny server. This makes it easier to share interactive graphs, as all you need is a single HTML document.
Finishing times: Men vs Women | Plotly
Plotly is a graphing library that makes interactive, publication-quality graphs online. Find out more here. It is easy to convert a ggplot to a plotly graph using the ggplotly function.
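library(ggplot2)
library(plotly)

# Overlaid histograms of finishing times, one per gender
p <- ggplot(ParkRun, aes(x = Time_mins, fill = Gender)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.6)
ggplotly(p)  # convert the ggplot into an interactive plotly graph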
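As a rough sketch of the "single HTML document" idea mentioned earlier, a plotly graph can be written to a standalone file with `saveWidget` from the htmlwidgets package. The `toy` data frame and the output file name here are made up so the example runs on its own; they are not the real parkrun results. Depending on the htmlwidgets version, `selfcontained = TRUE` may need pandoc to be installed.

```r
library(ggplot2)
library(plotly)
library(htmlwidgets)

# Made-up finishing times, only so the example runs on its own
toy <- data.frame(
  Time_mins = c(21, 24, 26, 27, 29, 31, 33, 36),
  Gender    = rep(c("M", "F"), times = 4)
)

p <- ggplot(toy, aes(x = Time_mins, fill = Gender)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.6)

# Write the interactive chart to one standalone HTML file
out_file <- file.path(tempdir(), "parkrun_histogram.html")
saveWidget(ggplotly(p), out_file, selfcontained = TRUE)
```

The resulting file can be opened in a browser or shared directly, with no R session running behind it.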
Finishing times vs Number of Runs | CrossTalk
Crosstalk adds interactivity to HTML widgets, which means it works with plotly graphs, as shown below. The dataset you want to use has to be wrapped in a SharedData object, which can then be linked to a selector (here I have used a checkbox). Note that the plot shows no obvious correlation between number of park runs and finishing time! Find out more here.
library(crosstalk)
library(plotly)

# Wrap the data so the checkbox filter and the plot share one selection
shared_PR <- SharedData$new(ParkRun)
bscols(
  widths = c(3, NA),
  list(filter_checkbox("Gender", "Gender", shared_PR, ~Gender, inline = TRUE)),
  plot_ly(data = shared_PR, x = ~Time_mins, y = ~`Number of Runs`, color = ~Gender)
)
Predicting Finish Times!
Finally, for a bit of fun, I had a go at predicting the run times based on the information given by Park Run. I fitted a multiple linear regression model and found that the following were all significant when predicting run times:
- age / gender category
- whether a runner was part of a club or not
- whether this run was their first time
- their previous personal best
Because not all runners had a previous PB recorded, I filled in the missing values with the median PB of the other runners of the same gender.
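The median fill described above can be sketched in base R roughly as follows. This is not the code used for the post: the `runners` data frame and its values are invented for illustration, and `impute_median` and `PB5_filled` are hypothetical names. Only `Gender` and `PB5` mirror column names that appear in the model.

```r
# Toy data: column names mirror the post, the values are made up
runners <- data.frame(
  Gender = c("M", "M", "M", "F", "F", "F"),
  PB5    = c(22.0, NA, 26.0, 25.0, 29.0, NA)
)

# Replace missing values in a vector with the median of the observed ones
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# ave() applies the function within each gender group and keeps row order
runners$PB5_filled <- ave(runners$PB5, runners$Gender, FUN = impute_median)

print(runners)
```

The median is used rather than the mean so that a handful of very slow or very fast personal bests does not pull the filled-in value around.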
The graph below shows the fitted values from the model plotted against the actual values, with my run time highlighted! The model predicts my time to be almost two and a half minutes slower than the time I actually ran. Still, just from looking at the graph, the fit isn’t too bad!
Note: This regression was a bit of fun, and I did not check that all the model assumptions held or evaluate the fit.
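library(dplyr)
library(plotly)

# Multiple linear regression of finishing time on the predictors listed above
model1 <- lm(Time_mins ~ factor(Category) + factor(Club2) + factor(Type) + PB5, data = ParkRun)
t <- fitted.values(model1)
t2 <- filter(ParkRun, valid == 2)  # subset of rows paired with the fitted values
Combined <- cbind(t, t2)
pal <- c("red", "skyblue")
# Fitted vs actual times, coloured by whether the runner is me
plot_ly(data = Combined, x = ~Time_mins, y = ~t, color = ~factor(me), alpha = 0.7, colors = pal)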
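For anyone wanting to go a step further than the note above, a minimal sketch of some basic fit checks in base R might look like this. It uses simulated data with a single predictor rather than the real parkrun results, so `toy`, `fit` and the numbers it produces are purely illustrative.

```r
# Simulated data standing in for the parkrun results (values are made up)
set.seed(1)
n <- 100
toy <- data.frame(PB5 = runif(n, 18, 40))
toy$Time_mins <- toy$PB5 + 1.5 + rnorm(n, sd = 1)

fit <- lm(Time_mins ~ PB5, data = toy)

# Residual = actual time minus fitted time; a negative residual means
# the runner was faster than the model predicted
res <- residuals(fit)
print(summary(res))

# Proportion of the variance in finishing time explained by the model
print(summary(fit)$r.squared)

# Standard diagnostic plots: residuals vs fitted values, and a normal Q-Q plot
plot(fit, which = 1:2)
```

A residuals-vs-fitted plot with no obvious pattern and a roughly straight Q-Q plot are the usual informal signs that the linearity and normality assumptions are not badly violated.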