#Introduction

As many know, the Congressional Budget Office (CBO) analyzes the budget passed by Congress to predict the effects of that budget. These projections come out one to three times a year and look 10 years into the future. They are all publicly available on the CBO's website as Excel workbooks. I downloaded the 28 most recent reports, ranging from 2007 to 2019, and focused on the data in their baseline projection sheets. There are 18 different measures the CBO predicts for every year from the report year through 10 years into the future. All of the following analysis and exploration was done in RStudio using these four packages.

library(ggplot2)
library(dplyr)
library(plotly)
library(readxl)

Given the large amount of formatting in the Excel workbooks, I scraped out the numbers the “low-tech” way, with simple cut and paste, and saved the data to a new Excel document that I then read into RStudio.
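
A minimal sketch of that import step; the file name and sheet below are assumptions about my consolidated workbook, not something the CBO publishes.

# read the consolidated workbook into R (file and sheet names are assumed here)
CBObudgetPredictions <- read_excel("CBObudgetPredictions.xlsx", sheet = 1)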

#Exploration

##Cleaning the GDP Subset

After reading the data in, I converted the yearPredicted and reportMonth variables to factors (a sketch of that conversion is shown just below). The data also needed unique IDs for all of the reports, which is done in the second code block below.
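
A minimal sketch of the factor conversion, assuming yearPredicted and reportMonth came in as plain numeric or character columns:

# convert the two grouping variables to factors
CBObudgetPredictions <- CBObudgetPredictions %>%
  mutate(yearPredicted = factor(yearPredicted),
         reportMonth = factor(reportMonth))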

# assign a unique reportID to every (reportYear, reportMonth) combination
CBObudgetPredictions <- CBObudgetPredictions %>% 
    mutate(reportID = group_indices_(CBObudgetPredictions,
                                     .dots=c("reportYear", 
                                             "reportMonth")))

Next I isolated one measure, GDP, to get a visual sense of which combination of plotting techniques would be most effective for visualizing the entire dataset. To do this, I separated the data into two new sets, actualGDP and predictedGDP, then condensed the actualGDP data into actualCondensed. The condensed data will be used to help with plotting later.

predictedGDP <- filter(CBObudgetPredictions, 
                       `actual/predicted` == "predicted",
                       measure == "GDP")
actualGDP <- filter(CBObudgetPredictions, 
                      `actual/predicted` == "actual", 
                      measure == "GDP")

actualCondensed <- select(actualGDP, yearPredicted, 
                          `dollars (*10^9)`, 
                          reportID)

After that I further condensed the data into a lookup table, actualLookup, which I will use to generate the error statistics I am interested in.

actualLookup <- select(actualCondensed, yearPredicted, `dollars (*10^9)`)

Then I changed the name of the dollars (*10^9) column to actualValue to prepare it for the join and to better indicate what that variable represents. Following that I joined the predictedGDP and actualLookup frames by the yearPredicted variable and saved everything as a new predictedGDP frame that now has a predicted measure and an actual measure for each observation.

names(actualLookup)[2] <- "actualValue"
predictedGDP <- inner_join(predictedGDP, actualLookup, by = "yearPredicted")

Now I can finally mutate two new variables onto the predictedGDP data: a prediction error (predictError) and a percent error (percentError).

predictedGDP <- predictedGDP %>%
  mutate(predictError = actualValue - `dollars (*10^9)`, 
         percentError = (predictError / actualValue) * 100)
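
To make the sign convention concrete (with made-up numbers): if a report predicted GDP of $15,000 billion and the actual value came in at $14,400 billion, predictError would be 14,400 - 15,000 = -600 and percentError would be (-600 / 14,400) * 100 ≈ -4.2%, so negative values mean the CBO overpredicted and positive values mean it underpredicted.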

##Visualizing GDP

My initial temptation was to plot a line representing the actual GDP values with a geom_point layer showing where the CBO's predictions fall with respect to the actuals. This proved more involved than I initially expected. I started by setting predictedGDP as my global dataset, yearPredicted as my global x, dollars (*10^9) as my global y, and separated the data into colors by reportID. Then I plotted these aesthetics as a geom_point layer, faceted everything by reportMonth, and finally applied a geom_line layer with a local dataset (the actualCondensed data from earlier). This produced the following.

ggplotly(
  ggplot(predictedGDP,
         aes(yearPredicted, 
             `dollars (*10^9)`, 
             color = factor(reportID))) +
    geom_point(alpha = 0.6) +
    facet_wrap(~ reportMonth) +
    geom_point(data = actualCondensed,
               alpha = 0.3) +
    geom_line(data = actualCondensed, 
              color =  "#1001FE",
              group = 0) +
    ggtitle("Congressional Budget Office GDP predictions") +
    xlab("Year of Prediction") +
    ylab("Billions of Dollars") +
    theme(axis.text.x = element_text(angle = 90))
  )

[Plot: Congressional Budget Office GDP predictions, predicted vs. actual GDP by year, faceted by report month]

Following that, I decided to use the error statistics I generated earlier while cleaning the data to look at a density plot of percentError within each report (using the reportID variable). Since there are 28 reports in the data, the color aesthetic isn't ideal for handling them all, but given plotly's ability to remove groups by clicking on them in the legend, you can still isolate individual density curves within the plot.

ggplotly(ggplot(predictedGDP, aes(percentError, color = factor(reportID))) +
  geom_density(alpha = 0.5))

[Plot: density of percentError, colored by reportID]

After examining the plot above, it came to mind that in some years the CBO issued up to three reports. In most years, though not all, the CBO issued a report in January; however, there were quite a few reports issued as late in the year as September. It seemed possible that some of the spikes in accuracy seen in the last plot came from those late-year reports.
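
As a quick sketch of one way to check how the reports are spread across years and months, the distinct report year/month pairs can be tallied directly:

# how many reports were issued in each year
CBObudgetPredictions %>%
  distinct(reportYear, reportMonth) %>%
  count(reportYear)

# which months those reports fall in
CBObudgetPredictions %>%
  distinct(reportYear, reportMonth) %>%
  count(reportMonth)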

I decided to look at percentError again, but this time faceting by reportYear and grouping by reportMonth. I chose to use the fill aesthetic instead of color this time since there would be fewer curves per plot.

ggplotly(ggplot(predictedGDP, aes(percentError, fill = reportMonth)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ reportYear))

[Plot: density of percentError by reportMonth, faceted by reportYear]

Somewhat surprisingly, the only year where late-year reports seemed noticeably more accurate than the early ones was 2009. If you toggle the January report off in the plot above (using the plotly legend), you can see that the flat curve was the January report, while the two relative spikes came from the March and August reports. At this point I decided to check whether changing my y aesthetic from density to count would reveal anything else about the relationship between the accuracy of the GDP predictions and the year and month of the CBO reports. The only change in this code is the local y aesthetic within the geom_density call.

ggplotly(ggplot(predictedGDP, aes(percentError, fill = reportMonth)) +
  geom_density(alpha = 0.4, aes(y = ..count..)) +
  facet_wrap(~ reportYear))

[Plot: count-scaled density of percentError by reportMonth, faceted by reportYear]

#Generalizing

##Cleaning Across All Measures

At this point, I had enough information to scale my exploration up to all measures, not just GDP. To do so, I first had to clean the whole dataset (CBObudgetPredictions) in the same manner as the GDP subset. I started by creating a dataset of all the actual values (budgetActual).

budgetActual <- filter(CBObudgetPredictions, `actual/predicted` == "actual")

Next I created another condensed dataset of actuals (budgetActualCondenced) to use in the join that will be implemented later and renamed its dollars (*10^9) column.

budgetActualCondenced <- select(budgetActual, measure, yearPredicted, `dollars (*10^9)`)
names(budgetActualCondenced)[3] <- "actualValue"

Finally, I created a budgetPredicted dataset from all of the predicted values, joined this new set with budgetActualCondenced by the measure and yearPredicted variables, then mutated on the same two columns that were added to the predictedGDP data (predictionError and percentError).

budgetPredicted <- filter(CBObudgetPredictions, `actual/predicted` == "predicted")
budgetPredicted <- inner_join(budgetPredicted, budgetActualCondenced, by = c("measure", "yearPredicted"))
budgetPredicted <- budgetPredicted %>%
  mutate(predictionError = actualValue - `dollars (*10^9)`) %>%
  mutate(percentError = (predictionError / actualValue) * 100)

##Visualizing

Similarly to the subset visualization, I started here by looking at a density plot of percentError, with count on the y aesthetic, grouped by report (reportID) and faceted by measure.

The first iteration of this plot:

ggplot(budgetPredicted, aes(percentError, fill = factor(reportID))) +
    geom_density(alpha = 0.4, aes(y = ..count..)) +
    facet_wrap(~ measure)

[Plot: count-scaled density of percentError by reportID, faceted by measure]

included the measure OUTLAYStotal, which you can see has been removed for the plot below, and also used fixed axes for all facets. This revealed that OUTLAYStotal was an outlier with respect to its accuracy. Even once OUTLAYStotal was removed, though, the data was still unhelpful with the axes fixed. The plot below uses independent axes for each facet, so comparisons between measures should be made with caution. While this plot is a bit cramped in the markdown window, there are still some intriguing things to notice; for example, the middle plot on the top row, which represents DEFECIT/SURPLUSoff-budget. A quick glance at the x axis reveals that this number, while generally pretty accurate, has been underpredicted by up to 4000%. You will also notice that the y axis for the OUTLAYSon-budget plot ranges three times wider than the next closest contender. It also appears that REVENUEStotal is pretty consistently underestimated.

budgetPredicted %>%
  filter(measure != "OUTLAYStotal") %>%
  ggplot(aes(percentError, fill = factor(reportID))) +
    geom_density(alpha = 0.4, aes(y = ..count..)) +
    facet_wrap(~ measure, scales = "free") +
    theme(axis.text.x = element_text(angle = 90))

[Plot: count-scaled density of percentError by reportID, faceted by measure with free scales, OUTLAYStotal removed]
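
To sanity-check the ranges that the free axes suggest, the spread of percentError within each measure can also be summarized numerically; this is a sketch of that check rather than output I am reporting here.

# summarize the spread of percentError within each measure
budgetPredicted %>%
  group_by(measure) %>%
  summarise(minPctError = min(percentError, na.rm = TRUE),
            maxPctError = max(percentError, na.rm = TRUE)) %>%
  arrange(minPctError)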

After looking at this plot I remembered that I had removed the OUTLAYStotal measure. Given that all of my axes were now free, I decided to add that measure back in and plot the same comparison. The only other difference is that I now set both the color and fill aesthetics to reportID. There were advantages and disadvantages to this compared to just setting fill. Since the plots are restricted by the markdown window, I couldn't say which is the better option here, but when viewing in a larger window, setting both fill and color produces a more informative visual.

budgetPredicted %>%
  ggplot(aes(percentError, fill = factor(reportID), color = factor(reportID))) +
    geom_density(alpha = 0.2, aes(y = ..count..)) +
    facet_wrap(~ measure, scales = "free", strip.position = "right") +
    xlab("Percent Error") +
    ggtitle("CBO Baseline Budget Prediction Errors") +
    theme(panel.background = element_blank(),
          panel.grid = element_line(color = "grey"),
          axis.text.x = element_text(angle = 90))

[Plot: CBO Baseline Budget Prediction Errors, percentError by reportID, faceted by measure]

In this last visual, I took a step away from percentError and looked at predictionError, which is measured in billions of USD. I faceted these plots on the same variable, measure, but grouped the density curves by reportMonth this time to try to spot any accuracy gains coming from late-year reports.

budgetPredicted %>%
  ggplot(aes(predictionError, fill = factor(reportMonth), color = factor(reportMonth))) +
    geom_density(alpha = 0.3, aes(y = ..count..)) +
    facet_wrap(~ measure, scales = "free", strip.position = "right") +
    xlab("Prediction Error in Billions") +
    ggtitle("CBO Baseline Budget Prediction Errors") +
    theme(panel.background = element_blank(),
          panel.grid = element_line(color = "grey"),
          axis.text.x = element_text(angle = 90)) +
    guides(fill = guide_legend(title = "Report Month"), 
           color = guide_legend(title = "Report Month"))

[Plot: CBO Baseline Budget Prediction Errors, predictionError in billions by reportMonth, faceted by measure]