US Gross Domestic Product (GDP) per capita/Measuring Worth

Most of the statistical computations summarized in the Wikiversity article on "US Gross Domestic Product (GDP) per capita" can be replicated by "knitting" the R Markdown code in this vignette in Rstudio.


RStudio RMarkdown knit icon
  1. Open an instance of RStudio. If you don't already have it installed, you can go to "RStudio Cloud" and click "get started for free" (at least as of 2020-10-24).[1] Or you can download and install free and open-source versions of it and R on your local computer. R is available from The Comprehensive R Archive Network. For RStudio, go to its website > Products (top center) > RStudio > "RStudio Desktop" > "Open Source Edition" > "Download RStudio Desktop".
  2. Start RStudio. Then File > "New File" > "R Markdown..." > Title: "Measuring Worth" > HTML > OK.
  3. Replace the default code in this new RMarkdown vignette on "Measuring Worth" with the text from the section below entitled, 'RMarkdown vignette on "Measuring Worth"'.
  4. Save.
  5. File > "New File" > "Text File".
  6. Copy the text from the section below entitled "nuc-references.bib" into this new empty text file.
  7. Save as "nuc-references.bib".
  8. Click the "Knit" icon; see the companion image. Or read the text and run the code chunks one at a time manually. The latter makes it relatively easy to look at intermediate computations carefully and experiment with changing things in different ways.

RMarkdown vignette on "Measuring Worth"Edit

title: "Measuring Worth"
author: "Spencer Graves"
date: "2021-12-01"
output: html_document

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

## Purpose of this document 

This document describes the process for updating a simple analysis of real US GDP per capita from 1790 to the present using data from [Measuring Worth](  After discussing how to download the data, we read and analyze those data in various ways.  This includes computing an [autocorrelation function (ACF)](, which suggests possibly oversmoothing by Measuring Worth.  It also includes a data-driven structural break analysis using the [`breakpoints`]( function in the [`strucchange`]( package, suggested by [Achim Zeileis]( and documented in multiple publications cited in that package.  This is followed by a more naive manual analysis comparable to that of `struccchange::breakpoints`.  

## Read the data

As of 2021-11-21 []( included "DATA SETS", left middle, and "GDP - US" just below that.  Clicking there invites us to ask, ["What Was the U.S. GDP Then?"](  We entered 1790 as "Initial Year".  For the "Ending Year", we use the following:  

```{r lastYear}
(Today <- Sys.Date())
(lastYear <- (lubridate::year(Today)-1))

We entered `r lastYear` as "Ending Year", then clicked "Retrieve".  From there, we clicked "Download the Results in a Spreadsheet Format" above the table.  This put a file with a name like `r paste0('USGDP_1790-', lastYear, ".csv")` in our "Downloads" folder.  (This was on a Mac.  It will likely be the same on a Windows machine.)  We then moved that to our current working directory (`r getwd()`).  

The following will check if it's there and we can access it:

```{r datFile}
fileNm <- paste0('USGDP_1790-', lastYear, ".csv")

This should contain exactly one row.  If it includes `NA`s, either we don't have the correct file name, or the file is not in the working directory. 

If the file is there, let's read it:  

```{r read.csv}
str(USGDP0 <- read.csv(fileNm, skip=1))
Note the use of the option `skip` = 1 in `read.csv`:  On 2021-11-21 I needed that, because cell A1 of that file contained "Samuel H. Williamson, 'What Was the U.S. GDP Then?' MeasuringWorth, 2021", and the column names began in the next row.  

Note also that the column names are long and mangled.  Let's shorten them manually:

```{r newNames}
colNms <- colnames(USGDP0)
newNms <- c('Year', 'nominalGDP_M', 
  'realGDP2012_M', 'GDPdeflator', 'popK', 
  'nominalGDPperCap', 'realGDPperCap')
if(length(colNms) != length(newNms))stop('bad names')
cbind(colNms, newNms) #confirm names match

It's wise to visually compare the actual column names in the file just read (`colNms`) with the proposed new names (`newNms`):  If Measuring Worth changed the order or the number of columns, it could introduce errors when using this vignette with new data.  

Let's check the last row and remove it if it's not actually present ... after converting to numerics:  

```{r chkLast}

For the present work, we're really only interested in `realGDPperCap`.  

Secondly, most of the numbers appear as character strings that were not converted by `read.csv`.  We can convert those numbers using `Ecfun::asNumericDF`:  

```{r asNumericDF}
str(USGDP <- Ecfun::asNumericDF(USGDP0))
colnames(USGDP) <- newNms
if(tail(USGDP$realGDPperCap, 1)==0){
  USGDP <- head(USGDP, -1)
  lastYear <- tail(USGDP$Year, 1)

Good:  All columns are now numeric.  

## Plot 

```{r plot}
ylab. <- 'Real GDP / capita\n(2012 $ thousands)'
plot(realGDPperCap/1000~Year, USGDP, type='l', 
     log='y', las=1, ylab='', bty='n')
mtext(ylab., 2, 2)

## Growth increments 

Let's start by converting `realGDPperCap` into a [`ts`](

```{r ts}
realGDPperCap <- with(USGDP, ts(realGDPperCap, Year[1]))

Next compute the growth rate on the log scale:  

```{r growth}
growth <- diff(log(realGDPperCap))
USGDP$growth <- c(NA, growth)

Let's plot them:  

```{r plotGr}
plot(growth, las=1, xlab='', ylab='')
abline(v=1946:1947, col=c('grey', 'green'), lty='dotted')

sel45_47 <- with(USGDP, (1944<Year)&(Year<1948))

This is interesting:  It suggests that the standard deviation seemed to have been gradually increasing from 1791 to 1946, after which the standard deviation has been much smaller, and the growth rate has been slowing down slightly.  

We've added grey and green vertical lines to help us see the difference between 1946 and 1947:  The big drop was in 1946, as one would expect with the war ending in mid-1945.  

Let's add a scale on the right with percent change.  

```{r GrScale}
grlim <- range(growth)
pctlim <- expm1(grlim)
pctcks <- pretty(pctlim)

op <- par(mar=c(3, 4, 3, 4)+.1)
plot(growth, las=1, xlab='', ylab='')
axis(4, log1p(pctcks), pctcks, las=1)
abline(v=1946:1947, col=c('grey', 'green'), lty='dotted')

The axis on the left gives changes in `log(realGDPperCap)`.  The axis on the right gives proportion (percentage) changes in `realGDPperCap`.  The two are very similar, because by Taylor's theorem, `log(1+x) = x + O(x^2)`;  the difference shows the extent to which this `log` transformation is nonlinear over this range.  

Are these numbers normally distributed?  

## Are the growth increments normally distributed?  

```{r qqnorm2}
with(USGDP, qqnorm2(growth, Year<1947))
abline(h=c(-1.5, 1.5))
abline(h=c(-1.6, 1.4), lty='dotted', col='grey')
pnorm(-c(1.4, 1.6))

This suggests a mixture of normals with all the outliers prior to 1947, though it's not clear that the split is statistically significant nor worth considering in modeling.    

What about a plot with different lines before vs. after 1946? 

```{r qqnorm2t}
USGDP$split47 <- ordered(c('to 1946', 'since 1946')[
                      1 + (USGDP$Year>1946)], 
                      c('to 1946', 'since 1946')) 
qqnorm2t('growth', 'split47', data.=USGDP)
abline(h=c(-1.5, 1.5))
abline(h=1, lty='dotted', col='grey')
abline(h=0, lty='dashed', col='blue')
pnorm(-c(1, 1.5))

This plot supports the earlier conjecture that the observations since 1946 have had a lower standard deviation, though it's still not clear if that difference is statistically significant.  It also suggests that the period up to 1946 may be a mixture of two normal distributions with roughly 22 (the lowest 7 percent and the highest 16 percent) may have a lower mean and a higher standard deviation. And the period since 1946 has a slightly higher media.  However, as before it's not clear if these differences are statistically signicant.  We will ignore them for the moment.  

## Autocorrelation 

Most standard statistical procedures assume the observations are uncorrelated.  To check this, we compute the [autocorrelation function (`acf`)](  

```{r acf}
(acfGr <- acf(growth))[1]

The lag 1 autocorrelation is `r round(acfGr[[1]][1],3)`, and is statistically significant.  This suggests oversmoothing in producing the published numbers. Most standard statistical methods assume that the observations are uncorrelated.  We will still use some of these methods, while noting that the reported levels of statistical significance are probably overstated, because of this likely violation of the assumptions. 

## A data-driven structural break analysis 

[Achim Zeileis]( suggested the following data-driven structural break analysis, searching for changes in mean growth with at least 3 years per epoch and at most 10 breaks: 

```{r strucchange}
bp <- breakpoints(growth~1, h=3, breaks=10)
(bp_class <- class(bp))

This analysis is supported by a substantial literature documented in the help file for [`breakpoints`]( and the documentation for the [`strucchange`]( package more generally.  

Conveniently, the `breakpoints` function returns an object of class `r bp_class`, for which there are [`methods`](  

```{r bp_methods}
bp_mtds <- lapply(bp_class, function(cls)methods(class=cls))
names(bp_mtds) <- bp_class

After some experimentation, it seemed that `plot(bp)` gave a usefull summary of the analysis, combined with `print(bp)`:

```{r plot.breakpointsfull}

The minimum of the [Bayesian information criterion
(BIC)]( occurs with 4 breakpoints, 5 epocs.  For more detail on that, we consider `summary(bp)`:  

```{r sum_bp}

Zeileis also suggested the following:  

```{r AchimSaid}
(coef_bp <- coef(bp))

op <- par(mar=c(3, 4, 3, 4)+.1)
plot(growth, las=1, xlab='', ylab='')
axis(4, log1p(pctcks), pctcks, las=1)
abline(v=1946:1947, col=c('grey', 'green'), lty='dotted')
lines(fitted(bp), col=4, lwd=3)

Let's translate the change in log(GDP per capita) to actual percents:  

```{r pct}
round(100*cbind(round(coef_bp, 4), 
      percent=expm1(as.numeric(coef_bp))), 2)

In sum, the US economy averaged 1.5 percent per year growth in real GDP per capita from 1790 to 1929, then fell by over 8 percent per year following the [Wall Street Crash of 1929]( under US President [Herbert Hoover](, then grew by over 9 percent per year for eleven of [Franklin Roosevelt's]( 12 years as president, then fell by 6 percent per year during a post-World War II recession (which is not even discussed in typical histories of that period), and has grown relatively consistently at 1.9 percent per year since.  

Next we see if the declining growth rate since 1946 is statistically signficant 

## Growth slowing since 1946?

```{r growthSlowing}
grSlow <- lm(growth~Year, 
    USGDP[which(USGDP$split47=='since 1946'),])

Next we consider a manual analysis comparable to `breakpoint`.   

## Manual analysis of epocs

The following summarizes my thoughts in looking at the plot above of real GDP per capita:  
* The period 1790 - 1929 looks like one long relatively consistent growth period.  It might be split before, during, or after the US Civil War, 1861-1865.  However the case for that seems fairly weak. We won't try that here.  
* The first major abnormality in this plot is the spectacular collapse of the US economy 1929-1933 during the administration of Herbert Hoover.  
* That was followed by a spectacular growth for most years between 1933 and 1945 during the administration of Franklin Roosevelt. 
* That was followed by a noticeable drop immediately following the end of World War II until 1947.  However, this only lasted 2 years and may not be worth identifying separately.  
* There are several possibilities for how to treat the post World War II period.  The simplist is just to treat it as one period.  That gives us a 4-epoc model.  
* However, the period from 1945 when the war ended to 1947 looks like the first half of the Hoover era.  That would give us a 5-epoc model. 
* The period since 2008 suggests slower growth than the period between World War II and 2007.  That would give us a 6-epoc model.  However, the 2008-2019 period does not seem sufficiently different visually from 1947-2008 to compel us to consider it differently.  In addition, the breaks in 1929, 1933, and 1945 are clearly marked by major events in geopolitics.  The ["Great Recession"]( was clearly notable.  However, the changes, whether in this plot or in other descriptions, do not seen as large as those of 1929, 1933, and 1945.  
* An alternative 6-epoc model is suggested by reading Saez and Zucman (2020), who said the following:  

"From 1946 to 1980, average per adult national income rose 2 percent a year, one of the highest growth rates recorded over a generation in a country at the world’s technological frontier.  Moreover, this growth was widely shared, with only the income of the top 1 percent growing a bit less than average. One easily understands why many economists chose during this period to treat the US distribution of income as a constant. From 1980 to 2018, average annual growth in per adult national income falls to 1.4 percent a year."  

Emmanuel Saez and Gabriel Zucman (2020) "The Rise of Income and Wealth Inequality in America: Evidence from Distributional Macroeconomic Accounts", Journal of Economic Perspectives—Volume 34, Number 4—Fall 2020—Pages 3–26 (

Their "growth in per adult national income" is different from Measuring Worth's real GDP per capita.  The break they describe in 1980 is not visible in Figure 1.   

Alternatively, both the [incarceration rate]( and income inequality in the US started to increase during the last quarter of the twentieth century, as the mainstream commercial broadcasters in the US were firing nearly all their investigative journalists and replacing them with the police blotter.  We can look to see if the economy became more volatile after 1975 or 1980.  

We start by summarizing these epocal specifications as follows:   

```{r epocs}
epocBreak4 <- c(1790, 1929, 1933, 1945, lastYear)
epocs4 <- tibble::tibble(start=head(epocBreak4, -1), 
    end=tail(epocBreak4, -1), 
    growthPct0=c(1.5, -8, 8, 2), 
    description = c('to1929', 'Hoover', 
        'postWorldWarII') )

epocBreak5 <- c(1790, 1929, 1933, 1945, 1947, 
epocs5 <- tibble::tibble(start=head(epocBreak5, -1), 
    end=tail(epocBreak5, -1), 
    growthPct0=c(1.5, -8, 8, -8, 2), 
    description = c('to1929', 'Hoover', 
        'postWarRecession', 'postWarGrowth') )

epocBreak6 <- c(1790, 1929, 1933, 1945, 1947, 2008, 
epocs6 <- tibble::tibble(start=head(epocBreak6, -1), 
    end=tail(epocBreak6, -1), 
    growthPct0=c(1.5, -8, 8, -8, 2, 1.4), 
    description = c('to1929', 'Hoover', 
        'FranklinRoosevelt', 'postWarRecession', 
        'GreatRecession_since') )

epocBreak6s <- c(1790, 1929, 1933, 1945, 1947, 1980, 
epocs6s <- tibble::tibble(start=head(epocBreak6s, -1), 
    end=tail(epocBreak6s, -1), 
    growthPct0=c(1.5, -8, 8, -8, 2, 1.4), 
    description = c('to1929', 'Hoover', 
        'FranklinRoosevelt', 'postWarRecession', 
        'broadlySharedGrowth', 'Reagan') )

epocBreak6t <- c(1790, 1929, 1933, 1945, 1947, 1975, 
epocs6t <- tibble::tibble(start=head(epocBreak6t, -1), 
    end=tail(epocBreak6t, -1), 
    growthPct0=c(1.5, -8, 8, -8, 2, 1.4), 
    description = c('to1929', 'Hoover', 
        'FranklinRoosevelt', 'postWarRecession', 
        'broadlySharedGrowth', 'corporateMedia') )

The `growthPct0` in these tables is only an estimate based on other work.  

Next, we compare the different epocs and then annotate the basic plot consistent with this analysis.  

## Growth by epoc 

Let's write a function to compute mean and standard deviation within epoc, and save the log(likelihoods) to facilitate statistical testing.   

```{r epocStats}

(mean_dy <- mean(growth))
(sd_dy <- sd(growth))
N <- length(growth)
(sd_dy. <- sd_dy*sqrt((N-1)/N))
sum(dnorm(growth, mean_dy, sd_dy.))

epocStats <- function(epoc=epocs4,
      Year=USGDP$Year, y=USGDP$realGDPperCap){
  lgy <- log(y)
  dy <- c(NA, diff(lgy))
  nep <- nrow(epoc)
  meansd <- matrix(NA, nep, 5)
  colnames(meansd) <- c('meanlogChg', 'growthPct', 
              'sdlogChg', 'loglik1', 'loglik')
  for(i in 1:nep){
    seli <- with(epoc, (start[i]<Year) & 
               (Year <= end[i]))
    dyi <- dy[seli]
    mdyi <- mean(dyi)
    meansd[i, 'meanlogChg'] <- mdyi
    sdyi <- sd(dyi)
    ni <- sum(seli)
    sdyi. <- sdyi*sqrt((ni-1)/ni)
    meansd[i, 'sdlogChg'] <- sdyi 
    meansd[i, 'loglik1'] <- sum(dnorm(dyi, 
                    mdyi, sd_dy., log=TRUE))
    meansd[i, 'loglik'] <- sum(dnorm(dyi, 
                    mdyi, sdyi., log=TRUE))
  meansd[, 'growthPct'] <- 100*expm1(meansd[, 
  lglk1 <- sum(meansd[, 'loglik1'])
  lglk <- sum(meansd[, 'loglik'])
  ndf <- (nrow(epoc)-1)
  lgLR <- (lglk - lglk1)
  ep2 <- cbind(epoc, meansd)
  attr(ep2, 'loglik1') <- lglk1
  attr(ep2, 'loglik') <- lglk
  attr(ep2, 'logLR') <- lgLR
  p_het <- pchisq(2*lgLR, ndf, lower.tail=FALSE)
  attr(ep2, 'p_heterosc') <- p_het
  cat('p(no heteroscedascticity) = ', 
      signif(p_het, 2), '\n')

(meansd4 <- epocStats(epocs4))

# Just a check:  
i29.33 <- which(USGDP$Year %in% 1929:1933)
rGDPpcp <- USGDP$realGDPperCap[i29.33]
Hoover <- 100*((rGDPpcp[5]/rGDPpcp[1])^.25
               -1 )
drGDPpcp <- diff(log(rGDPpcp))
Hoo <- 100*expm1(mean(drGDPpcp))
if(all.equal(meansd4[2, 'growthPct'], 
             Hoo)!= TRUE){
  stop('Wrong answer for the Great Humanitarian.')
print(list(Hoover, Hoo))

(meansd5 <- epocStats(epocs5))
(meansd6 <- epocStats(epocs6))
(meansd6s <- epocStats(epocs6s))
(meansd6t <- epocStats(epocs6t))
attr(meansd4, 'p_heterosc')
attr(meansd5, 'p_heterosc')
attr(meansd6, 'p_heterosc')
attr(meansd6s, 'p_heterosc')
attr(meansd6t, 'p_heterosc')


The small values for `attr(*meansd*, 'p_heterosc')` suggest that we should assume heterscedasticity and prefer `loglik` over `loglik1`.  

```{r LRtstg}
(lglk4 <- attr(meansd4, 'loglik'))
(lglk4. <- attr(meansd4, 'loglik1'))

(lglk5 <- attr(meansd5, 'loglik'))
(lglk5. <- attr(meansd5, 'loglik1'))

(lglk6 <- attr(meansd6, 'loglik'))
(lglk6. <- attr(meansd6, 'loglik1'))

(lglk6s <- attr(meansd6s, 'loglik'))
(lglk6s. <- attr(meansd6s, 'loglik1'))

(lglk6t <- attr(meansd6t, 'loglik'))
(lglk6t. <- attr(meansd6t, 'loglik1'))
(lgLR4_5 <- (lglk5 - lglk4))
(lgLR4_5. <- (lglk5. - lglk4.))

(lgLR5_6 <- (lglk6 - lglk5))
(lgLR5_6. <- (lglk6. - lglk5.))

(lgLR5_6s <- (lglk6s - lglk5))
(lgLR5_6s. <- (lglk6s. - lglk5.))

(lgLR5_6t <- (lglk6t - lglk5))
(lgLR5_6t. <- (lglk6t. - lglk5.))

pchisq(2*lgLR4_5, 2, lower.tail = FALSE)
pchisq(2*lgLR4_5., 1, lower.tail = FALSE)
pchisq(2*lgLR5_6, 2, lower.tail = FALSE)
pchisq(2*lgLR5_6., 1, lower.tail = FALSE)
pchisq(2*lgLR5_6s, 2, lower.tail = FALSE)
pchisq(2*lgLR5_6s., 1, lower.tail = FALSE)
pchisq(2*lgLR5_6t, 2, lower.tail = FALSE)
pchisq(2*lgLR5_6t., 1, lower.tail = FALSE)

This says that the 5-epoc model is better than the 4-epoc model, but neither of the 6-epoc models are significantly better than the 5-epoc model.  

## Annotated plot 

```{r annotate}
Adj6 <- matrix(0, 6, 2)
Adj6[, 2] <- c(1, .6, 1.8, .6, 1.5, .7)
Adj6[, 1] <- c(-10, 0, -3, 7, 0, 3)

annotateGDP <- function(breaks=meansd6, 
          Adj=Adj6, mark=1.2, colb='red', 
          ltyb='dotted', Ylab=ylab., ...){
  plot(realGDPperCap/1000~Year, USGDP, type='l', 
     log='y', las=1, xlab='', ylab='', bty='n', ...)
  mtext(Ylab, 2, 2)
  nep <- nrow(breaks)
  i0 <- which(USGDP$Year %in% breaks$start)
  i1 <- which(USGDP$Year %in% breaks$end)
  yr0 <- USGDP$Year[i0]
  y0 <- (USGDP$realGDPperCap[i0]/1000)
  yr1 <- USGDP$Year[i1]
  y1 <- (USGDP$realGDPperCap[i1]/1000) 
  segments(yr0, y0, yr1, y1, 
           col=colb, lty=ltyb, ...) 
  yr.5 <- (yr0+yr1)/2
  y.5 <- (y0+y1)/2
  p.5 <- paste0(signif(breaks$growthPct, 2), '%')
  text(yr.5+Adj[1:nep, 1], y.5*Adj[1:nep, 2], 
       p.5, col=colb, xpd=NA, ... )
  segments(yr0[-1], y0[-1]/mark, 
           yr0[-1], y0[-1]*mark, 
           col=colb, lty=ltyb, ...)

Adj5 <- Adj6[-5, ]
Adj4 <- Adj5[-5, ]
annotateGDP(meansd4, Adj4)
annotateGDP(meansd5, Adj5)

## plot to svg 

[Wikimedia Commons]( prefers `svg`, `png`, and `jpg`.  [Scalable Vector Graphics (`svg`)]( is generally preferred for many standard plots, because they are routinely rescaled without losses due to pixelization.  [Portable Network Graphics (`png`)]( is preferred for drawings and diagrams not available in `SVG`.  `JPG` is preferred for photographs.  

It's often preferred to avoid English in labeling from graphics uploaded to Wikipedia Commons, because that makes it easier for the art to be used with documents in other languages.  

```{r svg}
par(mar=c(2, 2, .5, .5)+0.1, cex=2)
plot(realGDPperCap/1000~Year, USGDP, type='l', 
     log='y', las=1, xlab='', ylab='', bty='n')

op <- par(mar=c(3, 4, 3, 4)+.1)
plot(growth, las=1, xlab='', ylab='')
axis(4, log1p(pctcks), pctcks, las=1)
abline(v=1946:1947, col=c('grey', 'green'), lty='dotted')

op <- par(mar=c(3, 3, 3, 1)+.1)
qqnorm2t('growth', 'split47', data.=USGDP, 
         xlab='', ylab='')
abline(h=c(-1.5, 1.5))
abline(h=1, lty='dotted', col='grey')
abline(h=0, lty='dashed', col='blue')

op <- par(mar=c(3, 3, 3, 1)+.1)
plot(acfGr, xlab='', ylab='', main='')

op <- par(mar=c(3, 3, 3, 3)+.1)
plot(bp, xlab='', ylab='', main='')

op <- par(mar=c(3, 4, 3, 4)+.1)
plot(growth, las=1, xlab='', ylab='')
axis(4, log1p(pctcks), pctcks, las=1)
lines(fitted(bp), col=4, lwd=3)

par(mar=c(2, 2, .5, .5)+0.1, cex=2)
annotateGDP(meansd4, Adj4, Ylab='')

par(mar=c(2, 2, .5, .5)+0.1, cex=2)
annotateGDP(meansd5, Adj5, Ylab='')

par(mar=c(2, 2, .5, .5)+0.1, cex=2)
annotateGDP(meansd6, Adj6, Ylab='')

par(mar=c(2, 2, .5, .5)+0.1, cex=2)
annotateGDP(meansd6s, Adj6, Ylab='')
  1. RStudio Cloud, Wikidata Q100799903.