How Data Transparency Can Help Fight COVID-19

To help make better-informed decisions about coronavirus, governments should be publishing comprehensive, machine-readable data sets.

In the weeks since COVID-19 became a pandemic, governments and non-governmental entities have launched a plethora of data dashboards. Unfortunately, the data they provide is often incomplete and inconsistent, leaving policymakers and the public without enough actionable information. To determine which public health measures are necessary and which personal protective measures are prudent, we need comprehensive, standardized data.

The most-used data sources are the Centers for Disease Control and Prevention (CDC) and the visualizations produced by Johns Hopkins University, Worldometer and many local and state governments. These sources generally lack information on hospitalizations as well as demographic data on deaths and serious cases.

These sites commonly publish only raw counts of positive tests and deaths attributed to the virus. Test results are not comparable across jurisdictions because the criteria for being tested vary. Since we know that many people who are infected experience only minor symptoms (if any), it is likely that positive cases are greatly undercounted even in places that are testing aggressively. For the positive case number to be really meaningful, most of the population in a given area would have to be tested on a regular basis. The government’s massive testing failures since the beginning of this pandemic continue to make it harder to get a complete picture from testing totals.

Deaths provide a more reliable indicator of the coronavirus’ impact on communities, but this statistic also has some important limitations. First, it is a lagging indicator: death is the final outcome of an illness, so the count is less helpful in predicting the near-term trajectory of the pandemic. Second, there may be differences across jurisdictions and even across medical examiners about how to attribute any given death. While the presence of COVID-19 in a deceased individual can be reliably determined, whether the virus caused the death is a judgment call. It has been argued, for example, that Italy has a higher reported death rate than neighboring countries because it does not make a distinction between deaths with COVID and deaths from COVID.

Others have suggested the death count is being underreported. The New York Times reported that “hospital officials, doctors, public health experts and medical examiners say that official counts have failed to capture the true number of Americans dying in this pandemic. The undercount is a result of inconsistent protocols, limited resources and a patchwork of decision making from one state or county to the next.”

A better measure of COVID-19’s impact at this point is hospitalizations. This figure excludes asymptomatic and mild cases that may have less of a social impact, and it provides more of a real-time indicator than deaths. Admittedly, this measure is vulnerable to the same classification issue that applies to deaths.

Speaking of classification, only some jurisdictions break down reported COVID-19 totals by age group, gender, and the presence of comorbidities. These factors are known to affect how any given individual experiences the virus and so these decompositions are useful information for both the public and policymakers.

New York City provides more data than most jurisdictions, which is fortunate given the severity of its situation. It reports on hospitalizations and also provides age and gender breakdowns. Recently, the city’s dashboard showed a hospitalization rate of 0.17 percent for males and 0.11 percent for females, a significant difference, and one that has been observed elsewhere. Hospitalization rates ranged from 0.01 percent for those under 18 to 0.5 percent for individuals 75 and over.

While New York does not provide comorbidity data for hospitalizations, it does so for deaths. A recent report showed that over 97 percent of deaths that had been assessed for the existence of underlying conditions were found to have them present, but 29 percent of all deaths had yet to be assessed. Underlying conditions included “Diabetes, Lung Disease, Cancer, Immunodeficiency, Heart Disease, Hypertension, Asthma, Kidney Disease, and GI/Liver Disease.”
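Because so many deaths were still unassessed, the share of all deaths involving an underlying condition can only be bounded, not pinned down. The following Python sketch works through that arithmetic. It treats "over 97 percent" as exactly 0.97, which is an approximation, and the function name share_bounds is hypothetical, introduced only for this illustration.

```python
def share_bounds(assessed_with_condition, unassessed):
    """Bound the share of ALL deaths that involved an underlying condition.

    assessed_with_condition: fraction of assessed deaths with a condition
    unassessed: fraction of all deaths not yet assessed
    """
    assessed = 1.0 - unassessed
    # Lower bound: none of the unassessed deaths turn out to have a condition.
    lower = assessed * assessed_with_condition
    # Upper bound: every unassessed death turns out to have a condition.
    upper = lower + unassessed
    return lower, upper


# Figures from the report described above, with 0.97 as an approximation.
lower, upper = share_bounds(assessed_with_condition=0.97, unassessed=0.29)
print(f"{lower:.1%} to {upper:.1%}")
```

Under these assumptions the true share falls somewhere between roughly 69 and 98 percent of all deaths, which shows how much a backlog of unassessed cases widens the uncertainty.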

Scaling up New York City’s reporting to a national level would greatly improve the level of actionable information we have. To scale effectively, all jurisdictions should report their data in a standardized, machine-readable format and submit these data files to a single, public repository. Once again, New York City has taken an important step in this direction by producing daily comma-separated values (CSV) files and posting them on GitHub, a popular software repository that is also used for publishing public data. As of this writing, the underlying condition data was only available on the city website in PDF format. If that limitation were overcome, the New York City public data reports could be used as a template by states, counties and cities across the nation.
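To illustrate why a shared, machine-readable format matters, here is a minimal Python sketch of how an analyst might combine files from several jurisdictions once they follow one layout. The column names, the jurisdictions and every count below are invented for illustration; they are not New York City's actual file format or real figures.

```python
import csv
import io
from collections import defaultdict

# Hypothetical standardized layout. All names and counts are made up.
SAMPLE_CSV = """date,jurisdiction,sex,population,hospitalized
2020-04-01,Example City,male,1000000,1700
2020-04-01,Example City,female,1000000,1100
2020-04-01,Sample County,male,500000,400
2020-04-01,Sample County,female,500000,300
"""


def hospitalization_rates(csv_text, group_by):
    """Return hospitalizations as a percentage of population per group."""
    hospitalized = defaultdict(int)
    population = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[group_by]
        hospitalized[key] += int(row["hospitalized"])
        population[key] += int(row["population"])
    return {key: 100 * hospitalized[key] / population[key] for key in population}


# The same function answers different questions because the columns are shared.
rates_by_sex = hospitalization_rates(SAMPLE_CSV, "sex")
rates_by_jurisdiction = hospitalization_rates(SAMPLE_CSV, "jurisdiction")
```

With a common schema, adding another city or county means appending rows rather than writing a new parser, which is the practical payoff of a data standard.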

COVID-19 poses a serious threat to public health, the economy and ultimately large parts of our civilization. To attack the problem, we need to have the best possible data. It’s understandable that overwhelmed hospitals, cities and states might not view this as the ideal moment to change the way they collect, process and share data. However, establishing and implementing a data standard right now could significantly help health officials identify coronavirus trends and best practices faster and more effectively. Ultimately, the cost of a data standard is tiny compared to the many other expenditures being undertaken by governments at all levels, but its value could be enormous.