It’s easy to get swept up in the excitement. Sweating on every tick of the market. Listening to the live market news and updates, trying to time it just right.

I had a holiday booked with my partner, so I took a couple of weeks off and didn’t look at the markets. When I returned, it was like I never left. Everyone was still on the same hamster wheel. The market had moved, but it was kind of the same.

Having formal mathematics training lent me some perspective. It reminded me of the famous fractal pattern popularised by Benoit Mandelbrot where certain geometric shapes look similar to themselves as you continually zoom in (a process known as self-similarity^{1}).

It got me thinking, was this all just random noise. Was this all just a giant illusion based on psychology, sales targets and the need to fill column inches?

I recently completed Michael Kemp’s excellent book *The Ulysses Contract: How to never worry about the share market again.*^{2}

Kemp breaks down common myths around ‘authority’ in financial markets and the tendancy for humans to fall prey to cognitive biases and essentially forgetting history when it comes to the markets. I agreed with nearly every word in the book. Although admitedly, I did blush at how my attitudes differed when I was a freshly minted finance graduate.

Author and interesting person Nassim Nicholas Taleb also writes about this in his 2001 classic *Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets.*^{3}. Despite his writing style being extrememly abrasive, the underlying points are compelling. Are we all blinded by randomness? Commentating, analysing and taking credit for what is essentially random noise happening *to* us?

**It got me thinking, could the average punter tell the difference between the stock market and random noise?**

To test this, I simulated a type of data called a ‘random walk’. A random walk is a type of mathematical, time-series model. It describes a process of sequential observations (say, daily share prices). Each day’s share price is simulated by taking yesterday’s price and just adding randomly generated ‘noise’. It doesn’t try to emulate the shape or style of share market. It’s really dumb. It just takes the price one day and either adds or subtracts a small random amount (of normally distributed noise).

The funny thing about this model is it often contains^{4}:

- long periods of apparent trends up or down
- sudden and unpredictable changes in direction.

Sound like the share market anyone??

This can be written as:

Really, this is just a special case of a class of time-series models called ARIMA models. In particular it is an Autoregressive model, that is, a model that is formed from linear combination of previous values.

Where we set just one term , and specify

I have simulated 1000 values of a random walk (AR(1)) model and placed this next to 1000 recent observations of a randomly selected Australian stock’s daily closing price.

Can you tell which is which?

I don’t really know. In the short term, it does *look* random. In the long term, the Australian share market has continued to exhibit long term growth. This doesn’t mean its dynamics aren’t driven my a random walk, after all these models can incorporate drift.

There is even a formal theory in finance called the Random Walk hypothesis^{5} in which some financial heavyweights have argued that stock prices evolve based on a random walk process and are thus - **unpredictable.**

Randomness is very tricky - the implication here isn’t that the markets are meaningless or the underlying dynamics are random or that you cant make money.

Rather, we can think of stock prices being *modelled* well by a random walk process. So what you can do about a process that is modelled well by a random walk?

You can start by calling bullshit on attempts to explain or predict short term price fluctuations. Even if they sound confident and wear a suit.

https://en.wikipedia.org/wiki/Self-similarity↩︎

https://www.goodreads.com/en/book/show/88826560↩︎

https://www.goodreads.com/book/show/38315.Fooled_by_Randomness↩︎

https://otexts.com/fpp3/stationarity.html#random-walk-model↩︎

https://en.wikipedia.org/wiki/Random_walk_hypothesis↩︎

Is this real? Where is it? Could I geolocate it using just OSINT^{1} techniques?… Yeah of course.

I have a loose theory that no matter who or where you are, there is probably sufficient data for a sufficiently motivated person to find you.

I saw this pop up on Twitter^{2} and other tabloid sites a while ago and thought it would be fun to try and geolocate it from the Twitter post alone.

A family in Australia has remained defiant in selling their nearly 5-acre property in the last few years as developers have been forced to build around them. Most recently, they declined a whopping $50 million offer for their home. Slap bang in the middle of a new-build… pic.twitter.com/pULUqpe1em

— Historic Vids (@historyinmemes) June 22, 2023

The main clue we get is:

About 40 minutes from Sydney’s central core, the property offers panoramic views of the Blue Mountains.

*(Actually the tweet and some articles do mention the suburb, but let’s ignore that so we don’t spoil the fun)*

- If the property is 40 minutes from Sydney and has views of the Blue Mountains, its likely to be somewhere West of the city.

- We can use the {osrm} package in R to construct drive time isochrones. This leverages the Open Source Routing Machine, based on Open Street Map data to calculate polygons that represent a given drive time from a set of coordinates. If we set this as 35-45 min drive time from the center of Sydney, we should get a ‘ring’ around Sydney containing the property location.

- By looking at the final frame we can see some distinguishing features in the image:

- There is a roundabout nearby

- There is a ‘Secondary Road’ (not a primary or trunk road, but more significant than roads found in villages etc)
- There are several ‘dead-end’ roads. Defined as cul-de-sacs that are not turning-circle roundabouts, but have no other exit point.

We can now construct a query to the OpenStreetMap Overpass API which stores features about the metadata of the street network. Let’s look for roundabouts on secondary roads within 500m of a no-exit road.

```
[out:json];
(
way[junction=roundabout][highway=secondary]
({{bbox}});
) -> .roundabout;
(
node[noexit=yes]
({{bbox}});
) -> .culdesac;
(
way.roundabout(around.culdesac:500);
);
out body;
>;
out skel qt;
```

- The above query was run over all of Sydney, and exported as a
`geojson`

file, which I then intersected with our drive-time ring above.

- This creates a shortlist of 101 candidate roundabouts. We now need to manually inspect each one to match it to the tweet. As these are unsorted, the average search time to find our roundabout of interest will be which means I have have to manually check 50 images on average. To streamline this, I wrote a function to loop through all the candidate roundabouts, and automatically export a satellite image at roughly the level of zoom that would help me identify the right frame.

And found it. Don’t think I need to share the exact address, not that it’s a secret or anything.

Open Source Intelligence↩︎

https://x.com/historyinmemes/status/1671673688683413504?s=20↩︎

While many (including me) have leveled a fair amount of criticism towards such solutions, I thought it would be worth seeing what the fuss was about.

Could I go head-to-head on the same predictive modelling challenge and compete with the might of Microsoft’s AutoML solution? Even worse, would I enjoy it? Even more worse, could I win??

**Objective**: Create the most accurate time series forecasting model

**Data Source**: Half-hourly electricity demand for Victoria, Australia^{1}

**Training data**: 51,120 records from 2012-01-01 to 2014-11-30

**Test data**: 1488 records from 2014-12-01 to 2014-12-31

**Method 1**: Use Microsoft Azure’s Automatic ML product.

**Method 2**: Hand code a statistical time series model in R

names | type | description |
---|---|---|

Time | datetime | Time stamp |

Demand | double | Target Variable: Electricity Demand |

Temperature | double | Temperature for the day |

Date | date | Date |

Holiday | logical | Was it a holiday date? |

The process to set up a new AutoML job was very easy and assumes you are working under somewhat sanitized conditions (which I was in this case).

Once you kick it off, it chugged away for an hour and 33 minutes. To my horror, I realized it takes the ‘kitchen sink’ approach and fits a suite of 41 (!) different machine learning models at the training data. Hyperparameter tuning is done by constructing a validation set using K-Fold cross validation.

The best performing model is then selected and then predictions are run on the test set. It’s a little concerning that Test set evaluation is only in ‘Preview’ mode. It was also very confusing to dig out the results on the test set. Most of the metrics prominently displayed are overly confident in-sample accuracy results.

The winning model in my case was a ‘Voting Ensemble’ of three models

- MaxAbsScaler, ExtremeRandomTrees
- StandardScalerWrapper, XGBoostRegressor

- StandardScalerWrapper, LightGBM

Overall the process was very easy and user friendly. It look a long time to train, but I didn’t have to think about anything - at all (which is usually time consuming) so overall it was a quick solution. I trained the model on a Standard_DS11_v2 (2 cores, 14 GB RAM, 28 GB disk) compute instance which costs $0.2 per hour. So it cost money, but not much.

Performance evaluation to follow below…

The process for doing this myself involved much more thought and brain-effort. Here are some notes.

The data set is quite complicated as its sub-daily and has (probably) three seasonal periods (daily, weekly, yearly). There was also maybe some trend and outliers to deal with. The data set also contained covariates such as Temperature and Holiday indicators.

Due to the seasonal complexity many traditional statistical methods were not appropriate like straight ARIMA (autoregressive integrated moving average) and ETS (exponential smoothing). While STL (Seasonal and Trend decomposition using Loess) can handle multiple seasonal periods I wanted a method to handle the covariates (like Temperature and Holidays). My next step was to think of Time Series Linear Regression models. However, accounting for yearly seasonality with 30min data meant fitting 17,520 (2 * 24 * 365) parameters just for this seasonal period. Which seemed excessive.

For longer, multiple-seasonal periods, using Fourier terms can be a good idea. Here a smaller number of terms in a fourier series can be estimated to approximate a more complex function. This type of *Dynamic Harmonic Regression*^{2} can also handle exogenous covariates and we can even fit the model with ARIMA errors to account for the short term dynamics of time series data.

In fact, this very approach was outlined in the excellent *Forecasting: Principles and Practice*^{3} using this very same example data set. I decided to borrow (steal) the ideas of creating a piece-wise linear trend for temperature. I also went a bit crazy with encoding specific holiday dummy variables and some other tweaks.

Overall I found this method slow to fit, and not overly performant. I decided next to try fitting a Prophet^{4} model. Prophet is an open-source automated algorithm for time series forecasting developed by Facebook. It uses a Bayesian framework to fit complex, non-linear, piece-wise regression models. For complex time series data, it provides a decent, fast framework including exogenous variables, holiday and seasonal effects. I didn’t do any principled hyperparameter tuning, but I did fiddle around with the model a bit.

So who won?

The AutoML platform did :( , but only just. Below is the comparison of RMSE and MAPE. The AutoML is red, my predictions are in blue. I stuffed up over Christmas a bit, which admittedly is a tricky hold-out month for testing.

Method | Metric | Value |
---|---|---|

Azure AutoML | RMSE | 213 |

Azure AutoML | MAPE | 3.56 |

Me | RMSE | 274 |

Me | MAPE | 4.96 |

So overall it was pretty close, but in terms of pure predictive performance, the AutoML platform did pip me at the post. Admittedly, the solution I arrived at was probably more of an ML solution than a ‘classical’ time series method given it is still an automated algorithm. If I had more time and patience I probably could have pursued a more complex regression model. In fact in *Forecasting: Principles and Practice*, the authors also cite the performance of a straight Dynamic Harmonic Regression is limited, however they go on to propose other innovative approaches^{5}^{6}, including splitting the problem into separate models for each 30min period and using regression splines to better capture exogenous effects. So it can be done, but not without a huge amount of effort.

This all led me to think: If the data are quite complex for a time series problem, then of course a more Machine Learn-y solution would outperform. I wonder what would happen if we repeated the same exercise but with many fewer data points and some quirky time series characteristics.

My hypothesis is, the machine learning models will not have sufficient data to fit well. On the other hand, my experience and gestalt will enable me to select and encode a statistical model that is appropriate and gain an edge on a black-box type of solution.

**Objective**: Create the most accurate time series forecasting model

**Data Source**: Monthly Medicare Australia prescription data^{7}, Anatomical Therapeutic Chemical index classification A10

**Training data**: 163 records from Jul 1991 to Jan 2005 (black line)

**Test data**: 41 records from Feb 2005 to Jun 2008 (grey line)

names | type | description |
---|---|---|

Cost | double | Cost of the scripts in $AUD |

Month | double | Month time stamp |

Here we have less than 200 data points, but we can visually inspect the time series and see that there is a clear trend, the process is multiplicative and there is a single, yearly seasonal pattern.

The AutoML platform again used a Voting Ensemble, churned out in 43 minutes, but this time using:

- ProphetModel (it must have copied me from last round ;))

- Exponential Smoothing

Given the multiplicative process here, I modeled the log transformed data. (I did try a more generalized Box-Cox transformation, but got better performance with a straight natural log transform). I tried an ARIMA model, using model selection via the Hyndman-Khandakar algorithm^{8}, which resulted in a `ARIMA(2,0,1)(1,1,2)[12] w/ drift>`

.

Yay! I won this round. Quite easily.

Method | Metric | Value |
---|---|---|

Azure AutoML | RMSE | 2.43 |

Azure AutoML | MAPE | 9.22 |

Me | RMSE | 1.63 |

Me | MAPE | 7.23 |

Well, I call it a draw.

Here are some of my closing thoughts from this experiment.

An ML solution might be a good choice if:

- You have lots of data

- You care a lot about prediction

- You don’t have to be too transparent

- Interpretation is not very important

- You have a very complex time series data set

I would caveat this with not just blindly modelling your problems away. You still need to understand the process to ensure your predictions are well calibrated and you don’t fall prey to over fitting.

A more classical statistical modelling approach might be a good choice if:

- You want a more flexible framework

- You need to / want to encode domain knowledge

- You want a more interpretable model
- You have fewer data

The good news is, if you are sufficiently smart and motivated (which I am sure you are) you can certainly compete in terms of model performance with an ML solution, even on complex problems. The bad news is, it’s harder and you need to think a bit. You can’t just delegate all your thinking to the machines. Not yet anyway.

O’Hara-Wild M, Hyndman R, Wang E, Godahewa R (2022). *tsibbledata: Diverse Datasets for ‘tsibble’*. https://tsibbledata.tidyverts.org/, https://github.com/tidyverts/tsibbledata/.

Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2023-06-05.

Thanks to the Tidyverts team https://tidyverts.org/. The new an improved time series stack in R makes all this so easy.

**Note**: None of this was super-rigorous, and I certainly tilted the board in my favour here and there. It was just fun and a chance to play around with a tool that I have previously avoided for no real reason.

Source: Australian Energy Market Operator;

*tsibbledata*R package↩︎Young, P. C., Pedregal, D. J., & Tych, W. (1999). Dynamic harmonic regression. Journal of Forecasting, 18, 369–394. https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-131X(199911)18:6%3C369::AID-FOR748%3E3.0.CO;2-K↩︎

Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2023-06-05.↩︎

Taylor SJ, Letham B. 2017. Forecasting at scale. PeerJ Preprints 5:e3190v2 https://doi.org/10.7287/peerj.preprints.3190v2↩︎

Fan, S., & Hyndman, R. J. (2012). Short-term load forecasting based on a semi-parametric additive model. IEEE Transactions on Power Systems, 27(1), 134–141. https://ieeexplore.ieee.org/document/5985500↩︎

Hyndman, R. J., & Fan, S. (2010). Density forecasting for long-term peak electricity demand. IEEE Transactions on Power Systems, 25(2), 1142–1153. https://ieeexplore.ieee.org/document/5345698↩︎

Source: Medicare Australia;

*tsibbledata*R package↩︎Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(1), 1–22. https://doi.org/10.18637/jss.v027.i03↩︎

Here I’ll cover some options to deploy this environment to the cloud so you can access it anywhere.

A common pattern is to create a Virtual Machine (VM) with a cloud service provider (such as AWS, Azure, GCP) and run your code there. I’ll cover an example using Microsoft Azure.

- Deploy a VM with an Ubuntu operating system. Go ahead and choose the compute power you need.

- Configure a custom network rule to allow traffic on port 8787 for RStudio

3. Log into your new VM terminal using SSH

Install Docker Engine by following these steps

Clone and Deploy the docker container from Step 2 in my guide.

The above is fine, but arguably if you are setting up a VM from scratch for development purposes I’m not sure what benefit there is from using a docker container. You may as well just directly install what you want and consider the VM a ‘container’.

However, if you plan to make this available to other users in your organisation, or to adapt this guide for Shiny App development you may be interested in other features such as TLS/SSL security, scale up, advanced networking, continuous integration, continuous deployment, staging/production deployment slots etc. This represents a shift from development sandpit to ‘web app’. For this case, Azure App Service may be a lower hassle option. This is Microsoft’s enterprise grade, web app deployment managed service.

In the **Virtual Machine** model you are setting up compute infrastructure, deploying and running containers directly - then fiddling with the infrastructure layer for everything else. In **App Service** you deploy your custom docker container (here containing RStudio Server) to Azure Container Registry (kind of like DockerHub). Azure App Services then builds and serves your app from there - without you having to stand up and manage an Infra layer directly.

Create Azure Container Registry (ACR) (or some other Docker repository) using this help guide

Run and test your container locally

Deploy your local container to ACR using this help guide

Create a new web app in Azure App Services using this help guide

Configuration:

- I didn’t have to fiddle with ports, presumably it reads the exposed ports in the docker file and does this magically.

- For custom environment variables like the RStudio Server password, I had to manually add this in the config section.

and it worked just fine:

However it is often the case that users are operating in a restricted computing environment, such as in a corporate or government setting. Alternatively you may wish to create a custom development environment to test or replicate some other specific setup. This is a good case to move away from locally managed software to containerization, such as Docker.

I have set up a Github repository that sets up a local data science development environment in the browser.

It builds a docker container including:

- Ubuntu 20.04 LTS
- R version 4.2
- RStudio Server 2022.02.3+492
- All tidyverse packages and devtools
- tex & publishing-related package

The image builds on the rocker/verse image from Rocker Project.

Some other enhanced configuration options are included in the Dockerfile, such as preloading you RStudio preferences to get the same look and feel you have locally, the option to install other CRAN packages & mounting local volumes to persist your work locally.

Go here for Step by step instructions:

]]>The typical choice when calculating binomial proportion confidence intervals is the asymptotic, or normally approximated ‘Wald’ interval where success probability is measured by:

In many settings, such as marketing analytics or manufacturing processes the sample proportion is close to 0 or 1. Evaluating asymptotic confidence intervals near these boundary conditions will lead to underestimation of the error, and in some cases producing an interval outside .

Fortunately other methods exist, such as Wilson’s score interval, exact methods and Bayesian approaches. The recommendation here is to examine the probability coverage and explore alternative methods for sample size and CI calculation, especially when the parameter is near the boundary conditions, or in cases of very small n.

```
library(binom)
library(tidyverse)
n <- 50
p <- c(0.01, 0.5, 0.99)
```

`x <- purrr::map_df(p, .f = ~binom.confint(x = n * .x, n = n, methods = 'all'))`

```
ggplot(x, aes(colour = factor(x))) +
geom_point(aes(mean, method), show.legend = F) +
geom_errorbarh(aes(xmin = lower, xmax = upper, y = method), show.legend = F) +
geom_vline(xintercept = c(0, 1), lty = 2, col = "grey") +
facet_wrap(~(x*2/100)) +
theme_bw() +
labs(title = "A variety of binomial confidence interval methods for p = 0.01, 0.5 & 0.99",
subtitle = "Note unusual behaviour near 0.01 and 0.99")
```

`cov <- purrr::map_df(p, ~binom.coverage(.x, n, conf.level = 0.95, method = "all"))`

```
ggplot(cov, aes(colour = factor(p))) +
geom_point(aes(coverage, method), show.legend = F) +
geom_vline(xintercept = 0.95, lty = 2) +
facet_wrap(~(p)) +
theme_bw() +
labs(title = "Probability coverage for a variety of binomial confidence interval methods",
subtitle = "Reference line at 0.95 coverage")
```

A good discussion is contained in:

Wallis, Sean A. (2013). “Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods” (PDF). **Journal of Quantitative Linguistics.** 20 (3): 178–208. doi:10.1080/09296174.2013.799918. S2CID 16741749.

https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

]]>There are widgets produced and we need to audit of them. Some sort of rejection threshold is needed on that sample to decide if the whole batch of widgets has met a specified quality level.

Typically, a binomial distribution would be appropriate for measuring the probability of successes (in this case defects found) in independent trials with probability .

The word *independent* is doing a lot of work here as it implies that we are sampling *with* replacement in order to maintain a fixed probability parameter .

In cases where you are taking draws from a population *without* replacement (such as when you do destructive inspections on a widget) the underlying population changes with each draw and so does the probability .

In this case, modelling the process using a hypergeometric distribution may be a more appropriate choice.

It similarly describes the probability of successes in draws without replacement. However, instead of specifying a parameter , we provide the population size , which contains success states in the population.

Let’s say we have 2000 widgets manufactured and we want to sample 50 (ignore why 50, that is a whole separate question). We have an assumed quality level of 10% defective units (which we define as ‘success’ for complicated reasons).

Q: Based on a sample of 50 widgets how many defective units would be considered unlikely (95% CI) to occur randomly given our assumed quality level, and therefore result in us rejecting the entire batch?

We can compare the binomial probability mass function with the hypergeometric and observe they are essentially the same.

`library(tidyverse)`

```
tibble(
x = seq.int(0, 50, by = 1),
binomial = dbinom(x, size = 50, prob = 0.1),
hypergeom_2000 = dhyper(x, m = 200, n = 1800, k = 50),
) |>
pivot_longer(cols = -1, names_to = 'distribution', values_to = 'density') |>
ggplot(aes(x, density, col = distribution)) +
geom_line() +
geom_point() +
xlim(c(0, 20)) +
theme_bw() +
labs(x = "Observed defective units in sample")
```

However, if we had a smaller population of say 100 or 70 widgets, how would this compare?

```
tibble(
x = seq.int(0, 50, by = 1),
binomial = dbinom(x, size = 50, prob = 0.1),
hypergeom_2000 = dhyper(x, m = 200, n = 1800, k = 50),
hypergeom_100 = dhyper(x, m = 10, n = 90, k = 50),
hypergeom_070 = dhyper(x, m = 7, n = 63, k = 50)
) |>
pivot_longer(cols = -1, names_to = 'distribution', values_to = 'density') |>
ggplot(aes(x, density, col = distribution)) +
geom_line() +
geom_point() +
xlim(c(0, 20)) +
theme_bw() +
labs(x = "Observed defective units in sample")
```

We can see these curves are markedly different. And indeed the 95% confidence intervals obtained are narrower for the hypergeometric case.

`qbinom(p = c(0.025, 0.975), size = 50, prob = 0.1)`

`[1] 1 9`

`qhyper(p = c(0.025, 0.975), m = 10, n = 90, k = 50)`

`[1] 2 8`

We can see from a random draw of 1 million samples from each PMF that they both have the same expected values, but the variance is smaller in the hypergeometric case.

```
X <- rbinom(n = 1e6, size = 50, prob = 0.1)
Y <- rhyper(nn = 1e6, m = 10, n = 90, k = 50)
mean(X)
```

`[1] 5.003297`

`var(X)`

`[1] 4.503315`

`mean(Y)`

`[1] 5.000162`

`var(Y)`

`[1] 2.27195`

As a consequence of removing samples in each draw we influence the probability of a subsequent success. If our and is very large relative to our sample this wont make much of an impact, but it can be impactful for smaller populations, or relatively larger samples.

From our example above, failing to use a hypergeometric distribution to model this process for smaller populations will result in wider, more conservative acceptance regions which can increase consumer risk in a manufacturing process.

Typical guidance on when to use each distribution is given in manufacturing standards such as *AS 1199.1-2003: Sampling Procedures for Inspection by Attributes* and typically involves how you structure your sampling scheme.

In the help page for `lme4::predict.merMod()`

is the following note:

- There is no option for computing standard errors of predictions because it is difficult to define an efficient method that incorporates uncertainty in the variance parameters; we recommend bootMer for this task.

There are some useful resources out there but it took a while to track down, so this post may serve as a good reference in the future.

Let’s go through an example using the famous `sleepstudy`

data showing the average reaction time per day (in milliseconds) for subjects in a sleep deprivation study.

```
library(lme4)
library(tidyverse)
data("sleepstudy")
```

We would like to model the relationship between `Reaction`

and `Days`

```
ggplot(sleepstudy, aes(Days, Reaction)) +
geom_point(show.legend = FALSE) +
theme_bw()
```

Fitting a basic linear model:

```
fit_lm <- lm(Reaction ~ Days, data = sleepstudy)
summary(fit_lm)
```

```
Call:
lm(formula = Reaction ~ Days, data = sleepstudy)
Residuals:
Min 1Q Median 3Q Max
-110.848 -27.483 1.546 26.142 139.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 251.405 6.610 38.033 < 2e-16 ***
Days 10.467 1.238 8.454 9.89e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 47.71 on 178 degrees of freedom
Multiple R-squared: 0.2865, Adjusted R-squared: 0.2825
F-statistic: 71.46 on 1 and 178 DF, p-value: 9.894e-15
```

```
ggplot(sleepstudy, aes(Days, Reaction)) +
geom_point(show.legend = FALSE) +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
theme_bw()
```

But this ignores the fact these data are not independent. We have multiple observation per subject. Some look like a good fit, others not.

```
ggplot(sleepstudy, aes(Days, Reaction, col = Subject)) +
geom_point(show.legend = FALSE) +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
facet_wrap(~Subject) +
theme_bw()
```

Let’s add a random intercept term for `Subject`

. For simplicity we will leave out any other random effects.

```
fit <- lme4::lmer(Reaction ~ Days + (1|Subject), data = sleepstudy)
summary(fit)
```

```
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (1 | Subject)
Data: sleepstudy
REML criterion at convergence: 1786.5
Scaled residuals:
Min 1Q Median 3Q Max
-3.2257 -0.5529 0.0109 0.5188 4.2506
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 1378.2 37.12
Residual 960.5 30.99
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 251.4051 9.7467 25.79
Days 10.4673 0.8042 13.02
Correlation of Fixed Effects:
(Intr)
Days -0.371
```

New fitted lines can be drawn, showing the adjusted intercept for each subject (original regression line kept for reference).

```
sleepstudy |>
mutate(pred = predict(fit, re.form = NULL)) |>
ggplot(aes(Days, Reaction, col = Subject)) +
geom_point(show.legend = FALSE) +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1], col = "grey") +
geom_line(aes(Days, pred), show.legend = FALSE) +
facet_wrap(~Subject) +
theme_bw()
```

Let’s try and generate prediction intervals using `lme4::bootMer()`

as suggested.

First on the in-sample data.

```
# predict function for bootstrapping
predfn <- function(.) {
predict(., newdata=new, re.form=NULL)
}
# summarise output of bootstrapping
sumBoot <- function(merBoot) {
return(
data.frame(fit = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs=.5, na.rm=TRUE))),
lwr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs=.025, na.rm=TRUE))),
upr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs=.975, na.rm=TRUE)))
)
)
}
# 'new' data
new <- sleepstudy
```

Notes:

In the

`predict()`

function we specify`re.form=NULL`

which identifies which random effects to condition on. Here`NULL`

includes all random effects. Obviously here you can compute individual predictions assuming you feed it with the correct grouping level in your data.In the

`lme4::bootMer()`

function we set`use.u=TRUE`

. This conditions on the random effects and only provides uncertainly estimates for the i.i.d. errors resulting from the fixed effects of the model.

If use.u is TRUE and type==“parametric”, only the i.i.d. errors are resampled, with the values of u staying fixed at their estimated values.

`boot <- lme4::bootMer(fit, predfn, nsim=250, use.u=TRUE, type="parametric")`

```
new |>
bind_cols(sumBoot(boot)) |>
ggplot(aes(Days, Reaction, col = Subject, fill = Subject)) +
geom_point(show.legend = FALSE) +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
geom_line(aes(Days, fit), show.legend = FALSE) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, show.legend = FALSE) +
facet_wrap(~Subject) +
theme_bw()
```

However, this gets complicated if we want to make predictions for *new* subjects.

We can no longer condition on the random effects, as the new subject level will not have a fitted random intercept value. Instead we need to effectively make a population level prediction (i.e. set the random effect to zero.). This makes sense as we don’t know what the random effect ought to be for a given, unobserved subject.

But we don’t want the prediction interval to just cover the uncertainty in the population level estimate. If we are interested in individual predictions, how can we incorporate the uncertainly of the random effects in the prediction intervals?

Lets generate a new, unobserved subject.

```
new_subject <- tibble(
Days = 0:9,
Subject = factor("999")
)
```

We provide a new predict function that doesn’t condition on the random effects by using `re.form = ~0`

. This lets us input and obtain predictions for new subjects.

```
predfn <- function(.) {
predict(., newdata=new_subject, re.form=~0, allow.new.levels=TRUE)
}
```

```
new_subject |>
bind_cols(predicted = predfn(fit)) |>
ggplot(aes(Days, predicted, col = Subject)) +
geom_point() +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
theme_bw() +
ylim(c(150, 450))
```

However using `predict`

just results in a completely deterministic prediction as shown above.

An alternative approach is to use `lme4::simulate()`

which will simulate responses for subjects non-deterministically using the fitted model object.

Below we can see a comparison on both approaches.

```
predfn <- function(.) {
predict(., newdata=new_subject, re.form=~0, allow.new.levels=TRUE)
}
sfun <- function(.) {
simulate(., newdata=new_subject, re.form=NULL, allow.new.levels=TRUE)[[1]]
}
```

```
new_subject |>
bind_cols(simulated = sfun(fit)) |>
bind_cols(predicted = predfn(fit)) |>
pivot_longer(cols = c(3, 4), names_to = "type", values_to = "val") |>
ggplot(aes(Days, val, col = type)) +
geom_point() +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
theme_bw() +
ylim(c(150, 450))
```

We can use this `simulate()`

function in our bootstrapping to resample responses from the fitted model (rather than resampling deterministic population predictions).

This time we set `use.u=FALSE`

to provide uncertainly estimates from both the model errors and the random effects.

If use.u is FALSE and type is “parametric”, each simulation generates new values of both the “spherical” random effects uu and the i.i.d. errors , using rnorm() with parameters corresponding to the fitted model x.

`boot <- lme4::bootMer(fit, sfun, nsim=250, use.u=FALSE, type="parametric", seed = 100)`

```
new_subject |>
bind_cols(sumBoot(boot)) |>
bind_cols(predicted = predfn(fit)) |>
ggplot(aes(Days, predicted, col = Subject, fill = Subject)) +
geom_point() +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
geom_line(aes(Days, fit), show.legend = FALSE) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, show.legend = FALSE) +
theme_bw() +
ylim(c(150, 450))
```

So while we don’t have a conditional mode of the random effect (because its a new subject) we can derive a bootstrapped estimate of the prediction interval by resampling the random effects and model errors on simulated data values.

For comparison, here is what the same prediction interval would look like if we just used an unconditional population prediction. While the overall gist is the same, despite also resampling both the random effects and the i.i.d. errors, the interval is narrower as it is resampling just the deterministic population predictions of the model.

`boot <- lme4::bootMer(fit, predfn, nsim=250, use.u=FALSE, type="parametric", seed = 100)`

```
new_subject |>
bind_cols(sumBoot(boot)) |>
bind_cols(predicted = predfn(fit)) |>
ggplot(aes(Days, predicted, col = Subject, fill = Subject)) +
geom_point() +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
geom_line(aes(Days, fit), show.legend = FALSE) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, show.legend = FALSE) +
theme_bw() +
ylim(c(150, 450))
```

Most of the material and code is taken from a variety of sources below. In particular the lme4 github issue. Also, the `merTools`

package has a nice vignette comparing these methods with their own solution.

https://tmalsburg.github.io/predict-vs-simulate.html https://github.com/lme4/lme4/issues/388 https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#predictions-andor-confidence-or-prediction-intervals-on-predictions

Choosing a husband: In the court of King Arthur, a damsel, Elfreda needs to choose a husband. Each day for the next 100 days Knights will come to Calemlot to be interviewed by Arthur to be Elfreda’s husband. However once the Knights left the castle they did not return. She needed to decide as soon as she saw a knight whether to marry or pass. In the challenge, Elfreda is given the advice from Merlin to not pick any of the first 37, just look and decide who is best, then pick the next knight who is better than that.

The questions are:

Why did Merlin instruct Elfreda to ignore the first 37 suitors?

What did Merlin think was the probability that Elfreda would choose the very best knight with this strategy?

This is a well known problem in Optimal Stopping theory. When you look at the first few knights you have no information about how they rank relative to all 100. But by the time you have assessed most of the 100 knights you have information but you have no agency. The knights have all moved on and you either miss out or have to settle for whoever is left.

This problem comes up every time I look for a parking spot on a busy day. Do I take the first spot I find a long way from the beach? I could try to get a closer spot but may not find one and could miss out entirely. Most likely the earlier spots are now taken.

**When is the optimal time to stop looking and take the leap?**

The strategy here is:

- Just
*Look*at a given number of knights without commiting to any

- Note the best ranking knight of out of the ones we looked at

- Commit to the very next knight that is
*better*than the best ranked so far.

We assume we look at least one knight before stopping our search and assume that if we fail to find a superior knight after our *looking* period - we miss out! We obviously wont settle for less. (I guess I could force this to accept the last one in line - not a big deal).

We can simulate this problem by sampling 100 random knights and assigning them a random ranking of 1-100 (assume 100 is best).

```
set.seed(111)
sample(100)
```

```
[1] 78 84 83 47 25 59 69 35 72 26 49 45 74 8 100 96 24 48
[19] 95 7 21 15 1 9 63 40 85 93 71 52 28 38 88 61 92 30
[37] 5 53 37 6 36 41 70 42 18 27 29 23 32 89 86 57 16 90
[55] 39 4 68 55 99 98 79 43 54 97 65 50 94 44 10 91 56 80
[73] 19 73 33 11 12 67 81 62 76 60 77 34 2 20 13 51 14 64
[91] 75 87 66 17 22 3 31 46 82 58
```

Next we can look at the cumulative maximum ranking. This will track the score of the best knight seen so far. We also run, say, 10,000 simulations of this problem to analyse.

```
library(tidyverse)
set.seed(111)
n <- 100
sims <- 1e4
knight_sims <- expand_grid(sim = 1:sims) |>
mutate(rankings = map(sim, ~sample(1:n))) |>
unnest(rankings) |>
group_by(sim) |>
mutate(stop_after_days = 1:n,
max_rank = cummax(rankings))
knight_sims
```

```
# A tibble: 1,000,000 × 4
# Groups: sim [10,000]
sim rankings stop_after_days max_rank
<int> <int> <int> <int>
1 1 78 1 78
2 1 84 2 84
3 1 83 3 84
4 1 47 4 84
5 1 25 5 84
6 1 59 6 84
7 1 69 7 84
8 1 35 8 84
9 1 72 9 84
10 1 26 10 84
# … with 999,990 more rows
```

We can write a function that can map over each day and determine what rank knight we would end up with if we stopped searching after x days and committed to the next best knight.

```
optimal_stop <- function(samp, i) {
if (i == 0) {
samp[1]
}
if (i > length(samp) | i < 0) {
stop("Invalid choice")
}
if (any(samp > cummax(samp)[i]) == FALSE) {
# if you want to default to the last option
#samp[length(samp)]
# if you would rather miss out if no superior choice
NA
} else{
idx <- which.max(samp > cummax(samp)[i])
samp[idx]
}
}
```

We want to know how often we end up with the optimal (top ranked) knight depending on which day we stop looking.

```
results <- knight_sims |>
mutate(result = map_int(stop_after_days, ~optimal_stop(rankings, .x))) |>
mutate(is_optimal = result == max(rankings, na.rm = TRUE)) |>
group_by(stop_after_days) |>
summarise(optimal_stops = sum(is_optimal, na.rm = TRUE),
n = n(),
optimal_stop_pct = optimal_stops / n,
var = sqrt(optimal_stop_pct * (1 - optimal_stop_pct) / n))
results
```

```
# A tibble: 100 × 5
stop_after_days optimal_stops n optimal_stop_pct var
<int> <int> <int> <dbl> <dbl>
1 1 533 10000 0.0533 0.00225
2 2 846 10000 0.0846 0.00278
3 3 1116 10000 0.112 0.00315
4 4 1352 10000 0.135 0.00342
5 5 1572 10000 0.157 0.00364
6 6 1771 10000 0.177 0.00382
7 7 1941 10000 0.194 0.00396
8 8 2098 10000 0.210 0.00407
9 9 2225 10000 0.222 0.00416
10 10 2360 10000 0.236 0.00425
# … with 90 more rows
```

We can plot the success proportion of bagging the most optimal knight for each stopping day.

```
ggplot(results, aes(stop_after_days, optimal_stop_pct)) +
geom_line() +
geom_ribbon(aes(stop_after_days, ymax = optimal_stop_pct + 1.96 * var,
ymin = optimal_stop_pct - 1.96 * var),
alpha = 0.22) +
theme_bw() +
geom_vline(xintercept = 37, lty = 2, col = 'red') +
geom_hline(yintercept = 0.368, lty = 2, col = 'blue') +
scale_y_continuous(labels = scales::percent) +
labs(title = "What is the probability of picking the best knight?",
subtitle = "Based on stopping searching at x days and choosing the next best knight",
x = "Look for this many days",
y = "Probability of optimal choice")
```

Here is the number of knights we should observe without picking any

`which.max(results$optimal_stop_pct)`

`[1] 37`

And here is the probability of choosing the best knight with this strategy

`max(results$optimal_stop_pct)`

`[1] 0.3755`

**So we should let the first 37 knights pass, just like Merlin said. If we do this, we will give ourselves roughly 37% probability of picking the most optimal knight, which is the best we can do.**

So why 37 days and 37% probability?

If we assume we reject the first knights and then choose the next best knight we get to probability

We want to know which maximises the probability for large . As approaches infinity we can let be the limit of and use for and for we get the integral

If we take the derivative with respect to x and set the equation to zero we can find the value of which maximises this probability:

and gives the optimal stopping number (for any large n) of

`1/exp(1)`

`[1] 0.3678794`

`100/exp(1)`

`[1] 36.78794`

Ferguson, T. S. (1989). Who solved the secretary problem?. Statistical science, 4(3), 282-289.

Brian Christian and Tom Griffiths. 2016. Algorithms to Live By: The Computer Science of Human Decisions. Henry Holt and Co., Inc., USA.

https://en.wikipedia.org/wiki/Secretary_problem

So which three clubs do you pick?

This was easy for me as I could only reliably hit two clubs: the five iron and the seven iron. I usually included a driver so as not be mocked by the septuagenarians at the par-5 tees.

I also have an embarrassingly crap collection of golf clubs as many have been sent to an early retirement in water features.

However, many friends who now play golf have a bewildering selection of exotic clubs from hybrid driving irons to 60-degree lob-wedges.

It seems like in there is always some new technology or method to master in data science broadly. This is interesting, but also exhausting and fills me with existential dread that I am becoming irrelevant.

It got me thinking, if you had to play statistics (or data science) for the rest of your career with just three algorithms/models, what would they be?

**Random Forest**: Uncontroversial ensemble method that can quickly get a performance ceiling with minimal fuss. It lacks some interpretability, but I’ll keep it in my bag for when I want to squeeze as much AUC as I can in a prediction setting.**Generalised Additive Model (GAM)**: You could argue this is quite broad, but that’s okay. I’ve had good luck with GAMs as they provide a solid statistical approach that I am comfortable with, along with the flexibility of using smooths for non-linear terms that seems to be a key feature in the problems I have to solve. If I had to only pick one club this would probably be it.**k-medoids clustering:**Clustering methods are a great third club to have in your bag. They solve a different problem to the modelling tasks above and provide a good hedge for those unsupervised tasks that come along fairly regularly. While everyone knows k-means, I enjoy the increased interpretation of k-medoids as the cluster centers are always one of the input data points. It also allows for more generalised distance methods (such as Gower’s distance) which I have found useful for mixed data types.

We can include `lm()`

as the default putter you get for free.

Which three clubs would you pick in your bag?

Problem solved.

This website lets you type in the vague descriptions of a small child and it will return a mathematically ranked list of closest matching Bluey episodes.

https://deanmarchiori.shinyapps.io/blueysearch/

The website is a Shiny app (deployed to shinyapps.io), which contains all 130 episode titles, descriptions and thumbnails.

All of the episode descriptions were tokenized into ‘terms’ using the {tidytext} R package and formed into a Document-Term-Matrix. Rather than use the typical term counts, a binary indicator was used if a term appears in an episode. This was preferred as the user’s text input is unlikely to really mimic the detail of an episode description, which threw the similarity measure out a bit.

Once a user inputs text, the Shiny app dynamically forms a new term vector and compares it to the all-episode’s matrix using cosine distance. The episodes are then ranked based on smallest cosine distance and displayed to the users.

For the source code visit: https://github.com/deanmarchiori/bluey-search

Our directory contains a folder called `myapp`

which contains our shiny app file and other supporting files.

At the top level we have our dockerfile and other config files. These should be modified accordingly.

This directory structure can be cloned from my github repo

```
dockerised-shiny/
├── Dockerfile
├── myapp
│ └── app.R
├── README.md
├── shiny-server.conf
└── shiny-server.sh
```

This should be adapted as required.

```
# Using rocker/rver::version, update version as appropriate
FROM rocker/r-ver:3.5.0
# install dependencies
RUN apt-get update && apt-get install -y \
sudo \
gdebi-core \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev \
libxt-dev \
libxml2-dev \
libssl-dev \
wget
# Download and install shiny server
RUN wget --no-verbose https://download3.rstudio.org/ubuntu-14.04/x86_64/VERSION -O "version.txt" && \
VERSION=$(cat version.txt) && \
wget --no-verbose "https://download3.rstudio.org/ubuntu-14.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
gdebi -n ss-latest.deb && \
rm -f version.txt ss-latest.deb && \
. /etc/environment && \
R -e "install.packages(c('shiny', 'rmarkdown'), repos='$MRAN')" && \
cp -R /usr/local/lib/R/site-library/shiny/examples/* /srv/shiny-server/
# Copy configuration files into the Docker image
COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY shiny-server.sh /usr/bin/shiny-server.sh
# Copy shiny app to Docker image
COPY /myapp /srv/shiny-server/myapp
# Expose desired port
EXPOSE 80
CMD ["/usr/bin/shiny-server.sh"]
```

To build the Docker image (called `myapp`

)

`docker build -t myapp .`

To run a container based on our Docker image:

This will run the docker image ‘myapp’ in a container (in detached mode) and expose post 80. It will name it ‘myapp’ and remove it when exited.

`docker run --rm -p 80:80 --name myapp -d myapp`

http://127.0.0.1/myapp/

`docker images `

`docker ps -a`

For individual containers add the container ID

`$ docker rm`

To remove all exited containers :

`$ docker rm $(docker ps -a -q -f status=exited)`

Remove all unused containers, networks, images (both dangling and unreferenced), and optionally, volumes.

`docker system prune -a`

`docker save -o ~/myapp.tar myapp`

```
docker load -i myapp.tar
docker run myapp
```

https://github.com/deanmarchiori/dockerised-shiny https://hub.docker.com/r/rocker/shiny

https://www.docker.com/get-started

https://www.bjoern-hartmann.de/post/learn-how-to-dockerize-a-shinyapp-in-7-steps/

This feed contains a list of current incidents from the NSW RFS, and includes location data and Major Fire Update summary information where available. Click through from the feed to the NSW RFS website for full details of the update.

GeoJSON is a lightweight data standard that has emerged to support the sharing of information with location or geospatial data. It is widely supported by modern applications and mobile devices.

See here: https://www.rfs.nsw.gov.au/news-and-media/stay-up-to-date/feeds for attribution and guidelines. Please read these important guidelines before using this data.

Load packages

```
library(sf)
library(mapview)
library(tidyverse)
```

Pull incidents

```
url <- "http://www.rfs.nsw.gov.au/feeds/majorIncidents.json"
fires <- st_read(url)
```

```
Reading layer `majorIncidents' from data source
`http://www.rfs.nsw.gov.au/feeds/majorIncidents.json' using driver `GeoJSON'
Simple feature collection with 22 features and 7 fields
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: 141.9267 ymin: -37.06285 xmax: 153.0005 ymax: -29.11298
Geodetic CRS: WGS 84
```

Optional step to get points only

```
# points only
fire_pt <- fires %>%
st_cast("POINT")
```

Optional Step to get Polygons only. Note the hack to aply a zero distance buffer.

```
#' Polygons only
fire_poly <- fires %>%
st_buffer(dist = 0) %>%
st_union(by_feature = TRUE)
```

Mapping data interactively

```
mapview(fire_poly, layer.name = "RFS Current Incident Polygons", zcol = "category") +
mapview(fire_pt, layer.name = "RFS Current Incident Locations", zcol = "category")
```