# Introduction
As many know, the Congressional Budget Office (CBO) analyzes the budget passed by Congress to predict its effects. These projections are released one to three times a year and look 10 years into the future. They are all publicly available on the CBO's website as Excel workbooks. I downloaded the 28 most recent reports, spanning 2007 to 2019, and focused on the data in their baseline projection sheets. Each report predicts 18 different measures for every year from the report year out to 10 years in the future. All of the analysis and exploration that follows was done in RStudio using these four packages.
library(ggplot2)
library(dplyr)
library(plotly)
library(readxl)
Given the heavy formatting in the Excel workbooks, I scraped out the numbers the “low-tech” way with simple copy and paste, saved the data to a new Excel document, and then read that document into RStudio.
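The read step itself is just a call to readxl; a minimal sketch, assuming the consolidated workbook was saved as "CBObudgetPredictions.xlsx" with everything on one sheet (both the file name and sheet are my own naming, not from the original files):
# read the hand-assembled workbook into a data frame
# (file name and sheet are assumptions; point these at wherever the data was saved)
CBObudgetPredictions <- read_excel("CBObudgetPredictions.xlsx", sheet = 1)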
# Exploration
## Cleaning the GDP Subset
After converting the `yearPredicted` and `reportMonth` variables to factors, the data needed a unique ID for each report; both steps are shown below.
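The factor conversion is a one-liner; a minimal sketch, assuming both columns were read in as plain character or numeric vectors:
# convert the report-timing variables to factors for grouping and plotting later
CBObudgetPredictions <- CBObudgetPredictions %>%
  mutate(yearPredicted = factor(yearPredicted),
         reportMonth = factor(reportMonth))
The report IDs are then built from each unique combination of report year and month: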
CBObudgetPredictions <- CBObudgetPredictions %>%
mutate(reportID = group_indices_(CBObudgetPredictions,
.dots=c("reportYear",
"reportMonth")))
Next I isolated one measure, GDP, to get a visual sense of which combination of plotting techniques would be most effective for visualizing the entire dataset. To do this, I separated the data into two new sets, `actualGDP` and `predictedGDP`, then condensed the `actualGDP` data into `actualCondensed`. The condensed data will be used to help with plotting later.
predictedGDP <- filter(CBObudgetPredictions,
`actual/predicted` == "predicted",
measure == "GDP")
actualGDP <- filter(CBObudgetPredictions,
`actual/predicted` == "actual",
measure == "GDP")
actualCondensed <- select(actualGDP, yearPredicted,
`dollars (*10^9)`,
reportID)
After that I further condensed the `actualGDP` data into a lookup table, `actualLookup`, to use for generating the error statistics I am interested in.
actualLookup <- select(actualCondensed, yearPredicted, `dollars (*10^9)`)
Then I renamed the `dollars (*10^9)` column to `actualValue`, both to prepare it for the join and to better indicate what that variable represents. Following that, I joined the `predictedGDP` and `actualLookup` frames by the `yearPredicted` variable and saved everything as a new `predictedGDP` frame that now has a predicted value and an actual value for each observation.
names(actualLookup)[2] <- "actualValue"
predictedGDP <- inner_join(predictedGDP, actualLookup, by = "yearPredicted")
Now I can finally mutate two new variables onto the `predictedGDP` data: a prediction error (`predictError`) and a percent error (`percentError`).
predictedGDP <- predictedGDP %>%
mutate(predictError = actualValue - `dollars (*10^9)`,
percentError = (predictError / actualValue) * 100)
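To make the sign convention concrete with made-up numbers: if a report had predicted GDP of 15,000 billion dollars and the actual value came in at 16,500, then `predictError` would be 16,500 - 15,000 = 1,500 and `percentError` would be (1,500 / 16,500) * 100, or roughly 9.1. Positive values therefore mean the CBO underpredicted, and negative values mean it overpredicted.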
## Visualizing GDP
My initial temptation was to plot a line representing the actual GDP values with a geom_point layer showing where the CBO's predictions fall with respect to the actuals. This proved more involved than I initially expected. I started by setting `predictedGDP` as my global dataset, `yearPredicted` as my global X, `dollars (*10^9)` as my global Y, and separated my data into colors by `reportID`. Then I plotted these aesthetics as a geom_point layer, faceted everything by `reportMonth`, and finally applied a geom_line layer with a local dataset (the `actualCondensed` data from earlier). This produced the following.
ggplotly(
ggplot(predictedGDP,
aes(yearPredicted,
`dollars (*10^9)`,
color = factor(reportID))) +
geom_point(alpha = 0.6) +
facet_wrap(~ reportMonth) +
geom_point(data = actualCondensed,
alpha = 0.3) +
geom_line(data = actualCondensed,
color = "#1001FE",
group = 0) +
ggtitle("Congressional Budget Office GDP predictions") +
xlab("Year of Prediction") +
ylab("Billions of Dollars") +
theme(axis.text.x = element_text(angle = 90))
)
Following that, I decided to use the error statistics generated earlier while cleaning the data to look at a density plot of `percentError` within each report (using the `reportID` variable). Since there are 28 reports in the data, the color aesthetic isn't ideal for handling them all, but given plotly's ability to remove groups by clicking on them in the legend, you can still isolate individual density curves within the plot.
ggplotly(ggplot(predictedGDP, aes(percentError, color = factor(reportID))) +
geom_density(alpha = 0.5))
After examining the plot above, it came to mind that in some years the CBO issued up to three reports. In most years, though not all, the CBO issued a report in January; however, quite a few reports were issued as late in the year as September. It seemed possible that some of the spikes in accuracy seen in the last plot came from those September reports.
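A quick way to sanity-check that timing, without showing the output here, is to tabulate the distinct report year/month combinations:
# one row per report, then a count of how many reports fall in each month
CBObudgetPredictions %>%
  distinct(reportYear, reportMonth) %>%
  count(reportMonth)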
I decided to look at the `percentError` again, but this time faceting by `reportYear` and grouping by `reportMonth`. I chose to use the fill aesthetic this time instead of color since there would be fewer curves per plot.
ggplotly(ggplot(predictedGDP, aes(percentError, fill = reportMonth)) +
geom_density(alpha = 0.4) +
facet_wrap(~ reportYear))
Somewhat surprisingly, the only year where this late-report accuracy advantage over the early reports seemed to pop up was 2009. If you remove the January reports from the plot above, you can see that the flat curve was the January report, while the two relative spikes were from the March and August reports. At this point I decided to check whether changing my Y aesthetic from density to count would reveal anything else about the relationship between GDP prediction accuracy and the year and month of the CBO reports. The only change in this code is the local y aesthetic within the `geom_density` call.
ggplotly(ggplot(predictedGDP, aes(percentError, fill = reportMonth)) +
geom_density(alpha = 0.4, aes(y = ..count..)) +
facet_wrap(~ reportYear))
# Generalizing
## Cleaning Across All Measures
At this point, I had enough information to scale my exploration up to all measures, not just GDP. To do so, I first had to clean the whole dataset (`CBObudgetPredictions`) in a manner similar to how I cleaned the GDP subset. I started by creating a dataset of all the actual values (`budgetActual`).
budgetActual <- filter(CBObudgetPredictions, `actual/predicted` == "actual")
Next I created another condensed dataset of actuals (`budgetActualCondenced`) to use in the join implemented later, and renamed its `dollars (*10^9)` column.
budgetActualCondenced <- select(budgetActual, measure, yearPredicted, `dollars (*10^9)`)
names(budgetActualCondenced)[3] <- "actualValue"
Finally, I created a `budgetPredicted` dataset from all of the predicted values, joined this new set with `budgetActualCondenced` by the `measure` and `yearPredicted` variables, then mutated on the same two columns that were added to the `predictedGDP` data (`predictionError` and `percentError`).
budgetPredicted <- filter(CBObudgetPredictions, `actual/predicted` == "predicted")
budgetPredicted <- inner_join(budgetPredicted, budgetActualCondenced, by = c("measure", "yearPredicted"))
budgetPredicted <- budgetPredicted %>%
mutate(predictionError = actualValue - `dollars (*10^9)`) %>%
mutate(percentError = (predictionError / actualValue) * 100)
## Visualizing
Similarly to the subset visualization, I started here by looking at a density plot, with count on the y aesthetic, of the `percentError` variable grouped by report and then faceted by measure.
The first iteration of this plot:
ggplot(budgetPredicted, aes(percentError, fill = factor(reportID))) +
geom_density(alpha = 0.4, aes(y = ..count..)) +
facet_wrap(~ measure)
included the measure `OUTLAYStotal`, which you can see has been removed for the plot below, and also used fixed axes for all facets. This revealed that `OUTLAYStotal` was an outlier with respect to its accuracy. Even once `OUTLAYStotal` was removed, though, the data was still unhelpful with the axes fixed. The plot below uses independent axes for each facet, so comparisons between measures should be made with caution. While this plot is a bit cramped in the markdown window, there are still some intriguing things to notice; for example, the middle plot on the top row, which represents `DEFECIT/SURPLUSoff-budget`. A quick glance at the x axis reveals that this number, while generally pretty accurate, has been underpredicted by up to 4000%. You will also notice the y axis for the `OUTLAYSon-budget` plot ranges three times wider than the next closest contender. It also appears that `REVENUEStotal` is pretty consistently underestimated.
budgetPredicted %>%
filter(measure != "OUTLAYStotal") %>%
ggplot(aes(percentError, fill = factor(reportID))) +
geom_density(alpha = 0.4, aes(y = ..count..)) +
facet_wrap(~ measure, scales = "free") +
theme(axis.text.x = element_text(angle = 90))
After looking at this plot, I remembered that I had removed the `OUTLAYStotal` measure. Given that all of my axes were now free, I decided to add that measure back in and plot the same comparison. The only other difference is that I now set both the `color` and `fill` aesthetics to `reportID`. There were advantages and disadvantages to this compared to setting only the fill. Since the plots are restricted by the markdown window, I couldn't say which is the better option here, but when viewing in a larger window, setting both `fill` and `color` produces a more informative visual.
budgetPredicted %>%
ggplot(aes(percentError, fill = factor(reportID), color = factor(reportID))) +
geom_density(alpha = 0.2, aes(y = ..count..)) +
facet_wrap(~ measure, scales = "free", strip.position = "right") +
xlab("Percent Error") +
ggtitle("CBO Baseline Budget Prediction Errors") +
theme(panel.background = element_blank(),
panel.grid = element_line(color = "grey"),
axis.text.x = element_text(angle = 90))
For this final visual, I stepped away from `percentError` and looked at `predictionError`, which is measured in billions of USD. I faceted these plots on the same variable, `measure`, but this time grouped the density curves by `reportMonth` to try to spot any accuracy gains coming from late-year reports.
budgetPredicted %>%
ggplot(aes(predictionError, fill = factor(reportMonth), color = factor(reportMonth))) +
geom_density(alpha = 0.3, aes(y = ..count..)) +
facet_wrap(~ measure, scales = "free", strip.position = "right") +
xlab("Prediction Error in Billions") +
ggtitle("CBO Baseline Budget Prediction Errors") +
theme(panel.background = element_blank(),
panel.grid = element_line(color = "grey"),
axis.text.x = element_text(angle = 90)) +
guides(fill = guide_legend(title = "Report Month"),
color = guide_legend(title = "Report Month"))