Predicting USA COVID-19 cases as if STATES were their own nations

April 12, 2020

Like you, I’ve been affected by COVID-19.

I know folks who've contracted it. I’ve experienced its economic furor. I’ve been treated differently. I was even almost trapped on a remote island in the Pacific because of it - thrice.

Expanding on that last one, as many Asian borders rapidly closed in mid March 2020, I found myself checking the news and various countries' CDC sites an excessive amount - far more than is healthy. I was basically glued to the screen at the time. Except, unlike the normal 24/7 news cycle, it was... almost worth it to be checking the news every few minutes.

(In one particular 36-hour period, the Philippines declared a lockdown of Luzon island effective the following day, which affected 50 million people, followed by Singapore implementing national quarantines for all foreigners, followed by Taiwan banning all foreign nationals, followed by Singapore banning all foreigners...)

Reflecting on it all, I found that the whole experience was not only stressful, but also pretty… ineffective.

Problem

Why ineffective? To start:

The problem and stress at the time mostly boiled down to a major lack of clarity about how the world was shaping up, and the absence of a central playbook shedding light on a cohesive global direction. The underlying desire here was to have more control over my future, as meager as that control was.

The solution

So, as always, I went to the source data and started mucking around.

Rather than sift through the conflicting messages or manually compare figures daily, I leveraged the data many agencies are basing their conclusions on. This produced two main outcomes I’ll detail below with a few examples in each.

Consolidated, Aggregated Trends that Identify Milestones and Put Them in Perspective

“USA! USA! USA!” - When it was clear we were on the path to leading the world in confirmed COVID-19 cases

The United “Nations” of America - Forecast of US National COVID-19 Cases Based on a K-Nearest Neighbor (KNN) Algorithm Fit to Global Country Data

The real value of history only comes to fruition when we:

In this case, I’ve attempted to put these lessons into action by predicting USA COVID-19 case counts a small window into the future. The benefits of this exercise are manifold: grounding personal optimism, warning friends and family in advance, and correlating the forecast with market investments.

There are a few issues to address first though. Namely, the data.

There has not been, and still is not, enough organized public data on the virus’ behavior, the conditions in each country that likely contribute to the virus’ spread (e.g., healthcare system), and the flow of humanity. Even if most of this were available, the data is fairly sparse given that most countries have only tracked data for a little over a month. Many US states only began tracking their case counts in mid-March. Combined, these factors make it challenging to implement a typical time series model and leave insufficient data for a “deep” learning model.

To resolve these gaps, my approach was to:

This solution tests the common refrain that the US states are in essence individual countries at times (to be fair, many states are often the size of individual countries).

If so, why not predict the states as if they were countries? By doing so, we also avoid the current lack of historical data available across most states and can more closely examine the subtle trends across states that are lost in an aggregate US prediction.
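As a rough illustration of the idea (a minimal sketch, not the implementation linked in this post), a KNN forecast can treat each country as a training example, find the countries whose recent trajectory most resembles a state's, and average what happened to them next. The feature choice (two trailing growth rates), the value of k, the Euclidean distance, and every number below are assumptions made up for illustration. None of it is real data.

```python
import math


def knn_predict(train_features, train_targets, query, k=3):
    """Average the targets of the k training rows closest to `query`.

    Closeness is plain Euclidean distance (an assumption for this sketch).
    """
    if not train_features or len(train_features) != len(train_targets):
        raise ValueError("features and targets must be non-empty and aligned")
    k = min(k, len(train_features))
    # Pair each country's distance to the query with its known outcome.
    by_distance = sorted(
        (math.dist(features, query), target)
        for features, target in zip(train_features, train_targets)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / k


# Toy, made-up inputs: each row is one hypothetical country described by
# [trailing 3-day growth rate, trailing 7-day growth rate].
country_features = [[0.30, 0.25], [0.10, 0.08], [0.28, 0.22], [0.05, 0.04]]
# Made-up outcome per country: the growth rate observed over the next 7 days.
country_targets = [0.20, 0.05, 0.18, 0.02]

# A hypothetical state described with the same two features.
state_features = [0.29, 0.24]
state_forecast = knn_predict(country_features, country_targets, state_features, k=2)
```

Because the states borrow their "history" from countries that are further along, this kind of lookup does not need a long time series for the state itself, which is the gap described above.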

The implementation can be found HERE with takeaways below:

I'll be the first to admit this chart could be a little confusing, so note that the X-axis reflects the number of days used in the CAGR (Compound Annual Growth Rate) calculation, applied here over windows of days rather than years. For instance, -3 days corresponds to the 3-day CAGR of COVID cases as of the date shown for each country in the legend. As another example, 7 days corresponds to the CAGR of COVID cases 7 days after the date in the legend (hence California and Alabama don't have predictions, given I finished this on 4/12/2020).
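To make the axis concrete, here is a minimal sketch of one way such a growth rate could be computed. It assumes the rate is compounded per day over the window; the post does not spell out the exact formula, so the function name, the per-day convention, and the toy case counts are all assumptions.

```python
def compound_growth_rate(cases, as_of, window):
    """Per-day compound growth rate between day `as_of` and day `as_of + window`.

    `cases` is a list of cumulative case counts, one entry per day.
    A negative `window` looks back from `as_of`; a positive one looks ahead.
    """
    if window == 0:
        raise ValueError("window must be non-zero")
    start_idx, end_idx = sorted((as_of, as_of + window))
    if start_idx < 0 or end_idx >= len(cases):
        raise IndexError("window falls outside the available data")
    start, end = cases[start_idx], cases[end_idx]
    if start <= 0:
        raise ValueError("starting case count must be positive")
    periods = end_idx - start_idx
    # Standard compound-growth formula: (end / start) ** (1 / periods) - 1
    return (end / start) ** (1 / periods) - 1


# Toy, made-up cumulative counts that double every day.
toy_cases = [100, 200, 400, 800, 1600]

trailing_3_day = compound_growth_rate(toy_cases, as_of=3, window=-3)
forward_3_day = compound_growth_rate(toy_cases, as_of=1, window=3)
```

With counts that double daily, both the trailing and the forward rate come out to roughly 1.0, that is, 100% growth per day.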

Reflection

The increasing public availability of data is what makes ventures like the one above, and beyond, possible. It’s one of the key factors that led to the rapid progression of the machine learning field, and it’s also key to how the world will contain this pandemic (understanding the virus' behavior based on commonalities in cases, political policy based on country aggregate trends, etc.).

One caveat here is that some, if not much, of the data is known to be under-reported. This is important because inaccuracy can snowball into another country or province making inaccurate decisions based on omitted data. These reporting decisions stem from serious, considered tradeoffs such as political perception (Japan for the Olympics, China for trade) or prohibitively expensive testing vs. simply shutting down the country (Philippines / Russia). (Check out FiveThirtyEight for a more detailed writeup on under/over/dramatic shifts.)

I’m a firm believer that next-level analyses will close these gaps as more data continues to come online, such as test data, imported vs. local transmission, deaths, healthcare capabilities, national policies in place, etc. Unfortunately, a lot of this is still in flux today, which means very little of the data is available to fuel nuclear engines like the ML community.

But I hope and know that we will make radical progress in these directions. Truth supply / demand curves will continue to build off each other with compounding reinforcement. After all, this is the most united the world’s ever been.

Even more than WWII. Back then, the world was split in two, neutral countries existed, and some countries were so remote that the war really had no impact on them.

Not this time. We’re all in this together.  

Methodology and Credits

All source code, training data, and project notes can be found on my GitHub.

Raw data courtesy of Johns Hopkins University, which has done much of the legwork in aggregating the data.