Forecasting the number of parties to the TPNW/Simulate number of parties to the TPNW

The statistical computations to produce the figure in the Wikiversity article on "Forecasting the number of parties to the TPNW" are documented and explained and can be replicated by "knitting" the R Markdown code below in Rstudio.
Process

RStudio RMarkdown knit icon
Open an instance of RStudio. If you don't already have it installed, you can go to "RStudio Cloud" and click "get started for free" (at least as of 2020-10-24).^[1] Or you can download and install free and open-source versions of it and R on your local computer. R is available from The Comprehensive R Archive Network. For RStudio, go to its website > Products (top center) > RStudio > "RStudio Desktop" > "Open Source Edition" > "Download RStudio Desktop".
Start RStudio. Then File > "New File" > "R Markdown..." > Title: "Simulate number of parties to the TPNW" > PDF > OK.
Replace the default code in this new RMarkdown vignette on "Simulate number of parties to the TPNW" with the text from the section below entitled, 'RMarkdown vignette to "Simulate number of parties to the TPNW"'.
Save.
Click the "Knit" icon; see the companion image. Or read the text and run the code chunks one at a time manually. The latter makes it relatively easy to look at intermediate computations carefully and experiment with changing things in different ways.
Development of this vignette

This vignette was produced as a modification of an earlier vignette used to predict the date that the fiftieth country officially became a party to the TPNW. After that happened, that vignette was modified to Monte Carlo the plausible range of the number of new parties by a given date in the future, which is what this vignette does.
RMarkdown vignette on "Simulate number of parties to the TPNW"

---
title: "Simulate numbers of parties to the TPNW"
author: "Spencer Graves"
date: "2022-06-30"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Intro 

This vignette forecasts the date at which the last member state of the United Nations officially becomes a party to the [Treaty on the Prohibition of Nuclear Weapons (TPNW)](https://treaties.un.org/Pages/ViewDetails.aspx?src=TREATY&mtdsg_no=XXVI-9&chapter=26&clang=_en) while quantifying the statistical uncertainty in that estimate.  

At this date, we can confidently predict the following: 
# The predictions with their associated range of uncertaintly will work reasonably well for the next few years. 
# However, as time passes the predictions will become worse, as the fundamental underlying processes will increasingly diverge from the assumptions of this model.  In 1978 [George Box](https://en.wikipedia.org/wiki/George_E._P._Box) famously wrote, ["All models are wrong but some are useful."](https://en.wikipedia.org/wiki/All_models_are_wrong)
# The nonparametric methods used herein can be adjusted in different ways using, e.g., [Bayesian model averaging](https://stat.uw.edu/sites/default/files/files/reports/1998/tr335.pdf) to adapt to the evolution of reality.  For simplicity, we will only consider one model here, partly because there is no empiricaly evidence yet of a substantive departure from linearity, i.e., the assumption of a [Poisson point process](https://en.wikipedia.org/wiki/Poisson_point_process) with a constant arrival rate.  

You, dear reader, can simultaneously reduce [the risks of nuclear Armageddon](https://en.wikiversity.org/wiki/Time_to_nuclear_Armageddon) and increase human appreciation of statistical methods by sharing this vignette and its results with others.  

## Naive estimate of saturation 

A naive estimate estimate of saturation, the date at which the last member state of the United Nations officially becomes a party to the TPNW, can be computed fairy simply.  We will compute that here and use that as a naive end date for simulations we do later.  

As of `r (updateDate <- as.Date('2022-06-30'))` there are [193 member states of the United Nations](https://en.wikipedia.org/wiki/United_Nations) plus 2 non-member observer states:  the Holy See and Palestine.  Of these, 66 are parties to the TPNW.  The first three states parties to the TPNW deposited their documents on 2017-09-20.  

We can compute average days between arrivals of new state parties to the TPNW and use that to compute a naive forecast for saturation as follows:  

```{r naiveEnd}
str(Today <- as.Date(tis::today()))
str(party1Date <- as.Date('2017-09-20'))
str(daysSinceParty1 <- difftime(Today, 
          party1Date, units='days'))

nUN <- 195
nParties0 <- 66
dPartiesSince1 <- (nParties0-3)
str(daysPerParty <- daysSinceParty1 /
      dPartiesSince1)
(nParties2sat <- nUN-nParties0)
str(saturation <- Today + 
      nParties2sat * daysPerParty)
```

So far this process has averaged one new state party every `r round(daysPerParty, 1)` days.  

How close is this to, e.g., Poisson arrivals in the actual data?  

## The data

The methodology then and now begins by scraping a table of Parties from, e.g., the web page on that treaty from the web site of the [United Nations Treaty Collection](https://treaties.unoda.org/t/tpnw), which is the official documentation of the states parties.  

## URL

```{r link}
TPNWurl <- "https://treaties.un.org/Pages/ViewDetails.aspx?src=TREATY&mtdsg_no=XXVI-9&chapter=26&clang=_en"
```

## Scraping 

When I started working on this, I had not written code to scrape a web site in some time, so I started with a lit search, including `sos::findFn('scrape web page')`:

```{r eval=FALSE}
library(sos)
(scrape <- findFn('scrape web page'))
(scrapeTable <- findFn('extract a table from a web page'))
```

After studying this, doing companion web searches, and a little experimentation, I was able to progress as follows:  

```{r RCurl}
library(RCurl)
TPNWchars <- RCurl::getURL(TPNWurl)
str(TPNWchars)
length(TPNWchars)
nchar(TPNWchars)
```

This is a character vector of length 1 with, on `r updateDate`, 122353 characters.   

Let's feed this to `XML::readHTMLTable`:  

```{r}
library(XML)
TPNWtable <- XML::readHTMLTable(TPNWchars)
class(TPNWtable)
length(TPNWtable)
```

This is a "list" of 19.  

```{r class}
library(purrr)
TPNWclasses <- map_chr(TPNWtable, class)
names(TPNWclasses) <- NULL
table(TPNWclasses)
```

That's interesting:  7 `data.frame`s and 12 `NULL`s.  (A previous run of this code produced 8 `data.frame`s and 10 `NULL`s.  This suggests we may not be ready to completely automate this step.)

Let's isolate those 7 `data.frames`:  

```{r TPNWtabs}
TPNWtabs <- TPNWtable[TPNWclasses=='data.frame']
```

Let's get the dimensions of those 7:  

```{r TPNWdims}
dimnames(TPNWdims <- sapply(TPNWtabs, dim))
```

The `dimnames` here are counterproductive.  

```{r TPNWdims2}
colnames(TPNWdims) <- NULL
TPNWdims
```

The web site from which these data were scrapped says that there are currently 86 signatories and 65 parties.  The Wikipedia article giving ["List of parties to the Treaty on the Prohibition of Nuclear Weapons"](https://en.wikipedia.org/wiki/List_of_parties_to_the_Treaty_on_the_Prohibition_of_Nuclear_Weapons) notes that 3 of the 65 parties "acceeded" to the treaty without "signing" it.  That suggests that 89 is the correct number.  And in previous runs, we used the sixth `data.frame` in this list.  Let's try that again.  

```{r TPNW6}
str(TPNW6 <- TPNWtabs[[6]])
names(TPNW6)[3] <- 'Party'
head(TPNW6)
tail(TPNW6)
```

NOTE:  This seems to have worked repeatedly since I first wrote this code on or slightly before 2020-08-10.  If at any point the answers seem questionable, it would be wise to confirm that `TPNW6` matches the information available from an authoritative source.  

Can we parse the dates?  

```{r dates}
Sign <- as.Date(TPNW6$Signature, 
      format='%e %b %Y')
ARa <- as.Date(TPNW6[[3]], 
      format='%e %b %Y')
str(TPNW0 <- data.frame(Participant=TPNW6[[1]], 
    dateSigned=Sign, dateParty=ARa))
sapply(TPNW0, function(x)sum(!is.na(x)))
```

There are 89 Participants, but only 76 "Signed".  That's clearly wrong.  Let's look at those errors.  

```{r signNA}
sum(signNA <- is.na(TPNW0$dateSigned))
TPNW6[signNA, ]
```

I don't see anything strange.  Let's `table` the dates and see if we can see a difference between the dates converted and those that weren't.  

```{r evalNA}
isNA <- lapply(TPNW0[2:3], is.na)
str(goodDates <- table(c(TPNW6[[2]][!isNA[[1]]], 
                     TPNW6[[3]][!isNA[[2]]])))
str(badDates <- table(c(TPNW6[[2]][isNA[[1]]], 
                    TPNW6[[3]][isNA[[2]]])))
intersect(names(goodDates), 
          names(badDates))
```

No overlap.  That should make it easier to diagnose the problem.    

```{r compareGoodBad}
head(goodDates)
head(badDates)
```

Conclusion:  `badDates` all have a single digit day.  

After some research, I found that the problem dates had `&nbsp;` as the first character rather than a regular blank space.  I fixed that as follows:  

```{r dates2}
asDate <- function(x){
  x1 <- sub('\t', ' ', x)
  x2 <- textutils::HTMLencode(x1)
  x3 <- sub('^&nbsp;', ' ', x2)
  as.Date(x3, format='%e %b %Y')
}
Sign2 <- asDate(TPNW6$Signature)
ARa2 <- asDate(TPNW6[[3]])
str(TPNW <- data.frame(Participant=TPNW6[[1]], 
    dateSigned=Sign2, dateParty=ARa2))
(nTPNW <- sapply(TPNW, function(x)sum(!is.na(x))))
```

Looks good:  89 participating states parties.  86 signed the TPNW.  `r nParties0-3` of those had ratified it.  The other 3 had acceeded to it without having "signed" it.  

Let's confirm that's the same number we used above:  

```{r confirmN}
nParties <- nTPNW[3]
if(nParties0 != nParties){
  cat('OOPS: nParties read is different from', 
      ' nParties0 assumed.')
  stop('Please fix.')
}
```

## Time between `dateParty`

```{r DaysBetwParties}
daysSince1 <- difftime(Today, 
    min(TPNW$dateParty, na.rm=TRUE), units='days') 
(DaysBetwParties <- (daysSince1 / 
      c(dPartiesSince1 - 1, nParties0-3)))
```

Poisson probabilities for selected numbers of additional parties:

```{r nMore}
(lam <- as.numeric(DaysBetwParties[1]))
x <- round(lam+((-3):3))
cbind(x=x, px=round(dpois(x, lam), 4))
```

NOTE: This is not used in the computations below, but it seemed interesting to look at here.  

Sort: 

```{r order}
TPNWp <- TPNW[!is.na(TPNW$dateParty), ]
str(o <- do.call(order, 
    c(TPNWp[c(3:1)], na.last=FALSE)))
head(TPNWo <- TPNWp[o,])
tail(TPNWo)
```

## Number of new parties by the dates each became a party

```{r nNew}
str(nNew <- table(TPNWo$dateParty))
kNew <- length(nNew)
```

As of `r updateDate` new parties have joined on `r kNew` distinct dates. This gives us observations on `r kNew-1` interarrival times with the number of new parties with each date.    

```{r interarrivals}
str(arrivalDates <- as.Date(names(nNew)))
daysBetw <- diff(arrivalDates, units='days')
str(interArr <- data.frame(Date=arrivalDates[-1], 
        daysBetw=daysBetw, 
        nNew=as.integer(nNew)[-1]))
plot(daysBetw~Date, interArr, type='n', 
     ylab='days between new Parties', las=1)
with(interArr, text(Date, daysBetw, nNew))

nDates <- nrow(interArr)
```

There is some suggestion in the plot that the rate of new arrivals may have slowed slightly in the past year  We could try to model it, e.g., using Poisson regression with `log(lambda) ~ arrivalDates` in Bayesian model averaging.  

For the present, we will ignore that refinement.  

## Simulations 

For simplicity, we will assume that `interArr[2:3]` is a random sample of the distribution of interarrival times that will continue until all members and observers to the [United Nations](https://en.wikipedia.org/wiki/United_Nations) are parties to the TPNW.  That's probably reasonable for the next couple of years but will likely break down long before we reach `saturation`.   

* If progress stalls, the TPNW will likely become nice sounding words on paper but may not have any more impact on the future of humanity  than the [Treaty of Tlatelolco (officially known as the "Treaty for the Prohibition of Nuclear Weapons in Latin America and the Caribbean")](https://en.wikipedia.org/wiki/Treaty_of_Tlatelolco)
* Or activists in the US and elsewhere could place sufficient pressure on the US that it joins the TPNW long before the currently forecasted saturation (`r saturation`).  If that happens, progress could accelerate dramatically.   

For the present work, we will assume that the history summarized in `interArr` will replicate itself in a random order until `r nParties2sat` additions states have become parties, bringing the total number of parties to the number eligible, `r nUN`.  We expect that this simulated distribution will likely describe the uncertainty in the forecasts for the next few years reasonably well, but the reality will probably escape the range of simulated plausibility later.  However, to paraphrase George Box once again, we believe these forecasts can still be useful.  

To keep things simple, we will replicate `interArr` enough so the total additional parties is at least `nParties2sat` = `r nParties2sat` and then use random permutations of that for the simulations.  

```{r nReps}
(nReps <- ceiling(nParties2sat/nDates))
str(interAr <-with(interArr[rep(1:nDates, nReps), ], 
        data.frame(daysBetw=as.numeric(daysBetw), 
                   nNew=nNew) ))
niAr <- nrow(interAr)

sim1 <- function(o=sample(niAr)){
  iAr <- interAr[o, ]
  iAr. <- cumsum(iAr$daysBetw)
  iAr2 <- rep(iAr., iAr$nNew)
  head(iAr2, nParties2sat)
}

plot(tst <- sim1()); length(tst); min(diff(tst))
plot(tst <- sim1()); length(tst); min(diff(tst))
plot(tst <- sim1()); length(tst); min(diff(tst))
```

This looks good:  Each simulation has length = `nParties2sat` = `r nParties2sat`.  The trajectories look monotonically nondecreasing, with some numbers duplicated, as required.  

```{r sims}
set.seed(1)
(nAr <- nrow(interAr))
nSims <- 40000
# matrix nDays  x nSims 
sims <- matrix(NA, nParties2sat, nSims)
colnames(sims) <-  paste0("sim_", 1:nSims)

Start <- proc.time()
lastRpt <- Start
for(i in 1:nSims){
  sims[, i] <- sim1()
  now <- proc.time()
  d_t <- (now-lastRpt)
  if(max(d_t)>10 & (i%%10==0)){
    cat(i, now, '\n')
    lastRpt <- now
  }
}
# check
sims[1:9, 1:9]
dim(sims)
```

Looks sensible.  

Compute the mean and quantiles of these simulations:  

```{r simSum}
(xMean <- apply(sims, 1, mean))
head(xCI <- t(apply(sims, 1, quantile, 
          probs=c(.1, .2, .5, .8, .9))))
tail(xCI)
#(table(tail(sims, 1))/sum(tail(sims, 1)))
```

## plot 

Stair step plot?

NOTE: The simulated times of future ratifications are added to `Today`, NOT to the date of the most recent ratification, `tail(TPNWo[, 3])`.  This avoids the nonsense of a simulated ratification occurring before `Today`.  This effectively assumes that the history has no memory, i.e., the time between dates of ratifications follows an exponential distribution.  (An alternative could be to eliminate from `interArr` all interarrivals shorter than `Today -  tail(TPNWo[, 3])`.  However, `interArr` is rather small to start with, it does not seem sensible to reduce it.) 

```{r plot}
nRats <- nrow(TPNWo)
xlim. <- c(TPNWo[3, 3], saturation)
#ylim. <- c(0, nRats+max(xCI[days2eofct, ]))
ylim. <- c(0, nUN)

TPNWo$number <- 1:nRats
meanDateParty <- mean(TPNWo$dateParty)
trendLine <- lm(number~I(dateParty-meanDateParty), TPNWo)
summary(trendLine) # ... but observations are not independent
(nCol <- ncol(xCI))

plotParties <- function(confInts=xCI, lwd.=2, 
    cex.axis.=1.5, main.='Parties to TPNW',
    cex.main.=1, cex.=1, 
    adj50=c(.3, -0.1), adjeif=c(.5, 1.2)){
  plot(TPNWo[[3]], 1:nRats, type='s', 
     lwd=2, xlab='', ylab='', 
     main=main., ylim=ylim., 
     xlim=xlim., las=1, cex.axis=cex.axis.,
     cex.main=cex.main.)
  abline(coef=coef(trendLine), col='red', 
       lty=2, lwd=lwd.)
  abline(h=c(50, nUN), lty='dotted', col='green', 
       lwd=lwd.)
  TPNWv <- (TPNWo[50,3]+c(0, 90))
  abline(v=TPNWv, lwd=lwd., 
         col=c('green', 'blue'), lty='dotted')
  fifty <- paste('50: ', TPNWv[1])
  text(TPNWv[1], mean(ylim.), fifty, 
     srt=90, adj=adj50, cex=cex., col='green')
  eif <- paste('entry into force', TPNWv[2])
  text(TPNWv[2], mean(ylim.), eif, 
     srt=90, adj=adjeif, cex=cex., col='blue')
  newParties <- (nParties + 1:nParties2sat)
  coli <- c('red', 'orange', 'blue')[
        c(1:3, 2:1)]
  for(i in 1:nCol){
    lines(Today+confInts[,i], newParties, 
          lty='dotted', lwd=lwd., col=coli[i])
  }
}
plotParties()
```

The forecasts for the next year or maybe tow look plausible.  However, beyond that, they are too narrow to be credible.  

The `lm` fit above was silly, because the dates at successive dates when new states officially deposit their documents with the UN Treaty Collection are obviously not statistically independent.  However, the times between them might more plausibly be independent.  Let's redo that with `interArr`: 

```{r qqnorm_daysBetw}
qqnorm(interArr$daysBetw, datax=TRUE)
qqnorm(interArr$daysBetw, datax=TRUE, log='x')
```

Lognormal fits better than normal.  

```{r }
interArFit <- lm(log(as.numeric(daysBetw)) ~ Date, interArr)
summary(interArFit)
```

Clearly no discernible trend, at least with the data available 2022-06-26.  

Let's fit a constant model.  

```{r constFit}
constFit <- lm(log(as.numeric(daysBetw)) ~ 1, interArr)
summary(constFit)
```

Let's redo the simulations, adjusting `interAr$daysBetw` to have a slightly different mean consistent with `constFit`.  

```{r sim2}
(constCoef <- coef(summary(constFit))[, 1:3])
(dfconst <- df.residual(constFit))

str(simAdjMediansBetw <- exp(constCoef[2]*rt(nSims, dfconst)))

sim1a <- function(i=1, o=sample(niAr)){
  iAr <- interAr[o, ]
  iAr. <- cumsum(iAr$daysBetw)/simAdjMediansBetw[i]
  iAr2 <- rep(iAr., iAr$nNew)
  head(iAr2, nParties2sat)
}

plot(tst2 <- sim1a()); length(tst2); min(diff(tst2))
plot(tst2 <- sim1a()); length(tst2); min(diff(tst2))
plot(tst2 <- sim1a()); length(tst2); min(diff(tst2))

sims2 <- matrix(NA, nParties2sat, nSims)
colnames(sims2) <-  paste0("sim_", 1:nSims)

for(i in 1:nSims){
  sims2[, i] <- sim1a(i)
}
# check
sims2[1:9, 1:9]
dim(sims2)

(xMean2 <- apply(sims2, 1, mean))
head(xCI2 <- t(apply(sims2, 1, quantile, 
          probs=c(.01, .1, .5, .9, .99))))

CI2 <- (Today + tail(xCI2, 1))
tail(CI2)

plotParties(xCI2)

plotParties(xCI2, main.='')

svg("TPNW forecasts.svg")
op <- par(mar=c(2, 3, .5, .5)+.1)
plotParties(xCI2, main.='')
dev.off()

svg("TPNW forecasts2.svg")
op <- par(mar=c(2.1, 3.2, .5, .5))
plotParties(xCI2, main.='', cex.=2)
dev.off()
```
↑ RStudio Cloud, Wikidata Q100799903.
[1] RStudio Cloud, Wikidata Q100799903.
[1]