[Article] Improving CDC Data Practices: Recommendations for Improving the United States Centers for Disease Control (CDC) Data Practices for Pneumonia, Influenza, and COVID-19

This is a preprint of a new academic paper written by Tam Hunt, Josh Mitteldorf, Ph.D. and myself on the US Centers for Disease Control (CDC)’s data practices during the COVID-19 pandemic and for pneumonia and influenza prior to the pandemic. I am the corresponding author.

Abstract

During the pandemic, millions of Americans have become acquainted with the CDC because its reports and the data it collects affect their day-to-day lives. But the methodology used and even some of the data collected by CDC remain opaque to the public and to the community of academic epidemiology. In this paper, we highlight areas in which CDC methodology might be improved and where greater transparency might lead to broad collaboration. (1) “Excess” deaths are routinely reported, but not “years of life lost”, an easily-computed datum that is important for public policy. (2) What counts as an “excess death”? The method for computing the number of excess deaths does not include error bars and we show a substantial range of estimates is possible. (3) Pneumonia and influenza death data on different CDC pages is grossly contradictory. (4) The methodology for computing influenza deaths is not described in sufficient detail that an outside analyst might pursue the source of the discrepancy. (5) Guidelines for filling out death certificates have changed during the COVID-19 pandemic, preventing the comparison of 2020-21 death profiles with any previous year. We conclude with a series of explicit recommendations for greater consistency and transparency, and ultimately to make CDC data more useful to outside epidemiologists.

John F. McGowan, Ph.D., Tam Hunt, Josh Mitteldorf. Improving CDC Data Practices: Recommendations for Improving the United States Centers for Disease Control (CDC) Data Practices for Pneumonia, Influenza, and COVID-19. Authorea. July 19, 2021.
DOI: 10.22541/au.162671168.86830026/v1

Here are the key recommendations from the paper:

Recommendations

In light of the previous discussion, we make a number of recommendations to improve CDC’s data practices, including improved observance of common scientific and engineering practice – such as use of significant figures and reporting of statistical and systematic errors. Common scientific and engineering practice is designed to prevent serious errors and should be followed rigorously in a crisis such as the COVID-19 pandemic.

Note that some of these recommendations may require changes in federal or state laws, federal or state regulations, or renegotiation of contracts between the federal government and states. This is probably the case for making the Deaths Master File (DMF), with names and dates of death of persons reported as deceased to the states and federal government, freely available to the public and other government agencies.

  • All CDC numbers, where possible, should be clearly identified as estimates, adjusted counts, or raw counts, with statistical errors and systematic errors given, using consistent clear standard language in all documents. The errors should be provided as both ninety-five percent (95%) confidence level intervals and the standard deviation – at least for the statistical errors.
  • In the case of adjusted counts, the raw count should be explicitly listed immediately following the adjusted count as well as a brief description of the adjustment and a reference for the adjustment methodology. For example, if the adjusted number of deaths in the United States in 2020 is 3.4 million but the raw count of deaths was 3.3 million with 100,000 deaths added to adjust for unreported deaths of undocumented immigrants, the web pages and reports would say:

Total deaths (2020): 3.4 million (adjusted, raw count 3.3 million, unreported deaths of undocumented immigrants, adjustment methodology citation: Smith et al, MMWR Volume X, Number Y)
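
As a minimal sketch of how such a label could be generated consistently, the helper below builds the line shown above from its parts. The function name `format_count` and its parameters are hypothetical and are not taken from any CDC tool; the numbers are the illustrative ones from the example above, not real data.

```python
def format_count(label, adjusted, raw, reason, citation):
    """Build a one-line report string listing the raw count next to the adjusted count."""
    def millions(n):
        # Express a count in millions with one decimal place, e.g. 3400000 -> "3.4 million".
        return f"{n / 1_000_000:.1f} million"

    return (
        f"{label}: {millions(adjusted)} "
        f"(adjusted, raw count {millions(raw)}, {reason}, "
        f"adjustment methodology citation: {citation})"
    )


# Illustrative values only, copied from the hypothetical example in the text.
line = format_count(
    label="Total deaths (2020)",
    adjusted=3_400_000,
    raw=3_300_000,
    reason="unreported deaths of undocumented immigrants",
    citation="Smith et al, MMWR Volume X, Number Y",
)
print(line)
```

Generating the label from one function, rather than typing it by hand on each page, is one way to keep the wording consistent across documents.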

  • The distinction between the leading causes of death report “pneumonia and influenza” deaths, ~55,000 per year pre-pandemic, and the FluView website “pneumonia and influenza” deaths, ~188,000 per year pre-pandemic, should be clarified in the labels and legends for the graphics and prominently in the table of leading causes of death or immediately adjacent text. Statistical and systematic errors on these numbers should be provided in graphs and tables.
  • In general, where grossly different raw counts, adjusted counts, or estimates are presented in CDC documents and websites with the same name, semantically equivalent or nearly equivalent names such as “pneumonia and influenza” and “influenza and pneumonia,” clearly distinct names should be used instead, or the reasons for the gross difference in the values should be prominently listed in the graphs and tables or immediately adjacent text. It should be easy for the public, busy health professionals, policy makers and others to recognize and understand the differences.
  • CDC should provide results for different models of the same data with similar R² values (coefficient of determination) to give the audience a quick sense of the systematic modeling errors, since there is no generally accepted methodology for estimating the 95% confidence level for the systematic modeling errors. See Figure 7 in the paper for an example.
  • All mathematical models should be free and open source with associated data provided using commonly used free open-source scientific programming languages such as Python or R, made available on the CDC website, GitHub, and other popular sources. The models and data should be provided in a package form such that anyone with access to a standard MS Windows, Mac OS X, or Linux/Unix computer can easily download and run the analysis – similar to the package structure used by the GNU project, for example.
  • Specifically, the influenza virus deaths model should be provided to the public as code and data. The justification for the increase in the number of deaths attributed to influenza (~6,000 to ~55,000) should be presented in clear language with supporting numbers, such as the false positive and negative rates for the laboratory influenza deaths and general diagnosis of influenza in the absence of a positive lab test as well as in the code and data.
  • With respect to excess deaths tracking, include all major select causes of death, rather than just the thirteen (13) in the cause-specific excess deaths that CDC tracks, which currently account for about 2/3 of all deaths.
  • Include a Years of Life Lost (YLL) display for COVID-19 deaths and non-COVID-19 deaths, as well as excess deaths analysis, due to the higher granularity of YLL analysis when compared to excess deaths analysis. Explain the pros and cons of both analytical tools. Do the same for any future pandemics or health crises.
  • Adopt or develop a different algorithm or algorithms for tracking excess deaths, which are mostly attributed to non-infectious causes such as heart attacks, cancer, and strokes. The Farrington/Noufaily algorithms were specifically developed as an early warning for often non-lethal infectious disease outbreaks such as salmonella. A medically based model or models that incorporate population demographics, such as the aging “baby boom” generation, and evolving death rates broken down by age, sex, and possibly other factors where known would probably be better practice than simple empirical trend models such as the Noufaily algorithm.
  • Eliminate the zeroing procedure in calculating excess deaths, in which negative excess deaths in some categories are set to zero, rather than being added to the full excess deaths sum over all categories.
  • The anonymized data with causes of death as close to the actual data as possible, e.g. the actual death certificates, should be available on the CDC website in a simple accessible widely used format such as CSV (comma separated values) files. The code used to aggregate the data into summary data such as the FluView website data files should also be public.
  • The full Deaths Master File (DMF) including the actual names of the deceased persons and dates of death should be made available to the general public, independent researchers, and others. This is critical to independent verification of many numbers from the CDC, SSA, and US Census.
  • COVID-19-related deaths figures should be tracked based on year-specific age of death, rather than 10-year age ranges, as is currently the case.
  • CDC frequently changes the structure and layout of the CSV files/spreadsheets on their websites. The CDC should either (1) not do this or (2) provide easy conversion between different file formats with each new format so it is trivial for third parties to quickly adapt to the changes without writing additional code. CDC should provide a program or programs in a free and open source language like R to convert between the formats.
  • The CDC and other agencies should be required to announce and solicit public comment for changes to case definitions, data collection rules, etc. for key public policy data such as the COVID-19 case definitions, death certification guidelines, and coding rules. Other government agencies have significantly more public participation than CDC, which is appropriate in a modern democracy.
  • Any practices and policies imposed in a public emergency, such as case definitions, definitions of measured quantities, data reporting practices, etc. imposed without public comment and review, should have an expiration date (e.g. sixty days) beyond which they must be subject to public review. Public comment, reviews, and cost/benefit analyses should happen during this emergency period.
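
To make the effect of the zeroing procedure mentioned in the list above concrete, here is a minimal sketch that sums excess deaths over categories with and without setting negative values to zero. It follows only the description given in the recommendation, not CDC's actual code; the category names and numbers are invented for illustration and are not CDC data.

```python
# Hypothetical excess deaths by cause category (observed minus expected).
# Invented values for illustration only; these are not CDC data.
excess_by_category = {
    "category_a": 120.0,
    "category_b": -45.0,
    "category_c": 30.0,
    "category_d": -10.0,
}


def total_excess(excess, zero_negatives):
    """Sum excess deaths over categories, optionally clipping negatives to zero."""
    if zero_negatives:
        # Zeroing: categories with fewer deaths than expected contribute 0.
        return sum(max(value, 0.0) for value in excess.values())
    # No zeroing: negative categories offset positive ones in the total.
    return sum(excess.values())


zeroed_total = total_excess(excess_by_category, zero_negatives=True)
signed_total = total_excess(excess_by_category, zero_negatives=False)

print(f"With zeroing:    {zeroed_total:.0f}")
print(f"Without zeroing: {signed_total:.0f}")
```

Because clipping can only raise each term, the zeroed total can never be smaller than the signed total; in this made-up case it is 150 versus 95.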

Enacting these reforms should reduce the risk of serious errors; increase the quality and accuracy of CDC data and analyses, and of any policies or CDC guidelines based on them; and strengthen public confidence in the CDC and public health policies.

(C) 2021 by John F. McGowan, Ph.D.

About Me

John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing gesture recognition for touch devices, video compression and speech recognition technologies. He has extensive experience developing software in C, C++, MATLAB, Python, Visual Basic and many other programming languages. He has been a Visiting Scholar at HP Labs developing computer vision algorithms and software for mobile devices. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech).