Complex data analysis attempts to solve problems with one or more inputs and one or more outputs related by complex mathematical rules, usually a sequence of two or more non-linear functions applied iteratively to the inputs and intermediate computed values. A prominent example is determining the causes and possible treatments for poorly understood diseases such as heart disease, cancer, and autism spectrum disorders where multiple genetic and environmental factors may contribute to the disease and the disease has multiple symptoms and metrics, e.g. blood pressure, heart rate, and heart rate variability.
Another example are macroeconomic models predicting employment levels, inflation, economic growth, foreign exchange rates and other key economic variables for investment decisions, both public and private, from inputs such as government spending, budget deficits, national debt, population growth, immigration, and many other factors.
A third example is speech recognition where a complex non-linear function somehow maps from a simple sequence of audio measurements — the microphone sound pressure levels — to a simple sequence of recognized words: “I’m sorry Dave. I can’t do that.”
State-of-the-art complex data analysis is labor intensive, time consuming, and error prone — requiring highly skilled analysts, often Ph.D.’s or other highly educated professionals, using tools with large libraries of built-in statistical and data analytical methods and tests: SAS, MATLAB, the R statistical programming language and similar tools. Results often take months or even years to produce, are often difficult to reproduce, difficult to present convincingly to non-specialists, difficult to audit for regulatory compliance and investor due diligence, and sometimes simply wrong, especially where the data involves human subjects or human society.
A widely cited report from the McKinsey management consulting firm suggests that the United States may face a shortage of 140,000 to 190,000 such human analysts by 2018: http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation.
This talk discusses the current state-of-the-art in attempts to automate complex data analysis. It discusses widely used tools such as SAS and MATLAB and their current limitations. It discusses what the automation of complex data analysis may look like in the future, possible methods of automating complex data analysis, and problems and pitfalls of automating complex data analysis. The talk will include a demonstration of a prototype system for automating complex data analysis including automated generation of SAS analysis code.
(C) 2017 John F. McGowan, Ph.D.
About the author
John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing gesture recognition for touch devices, video compression and speech recognition technologies. He has extensive experience developing software in C, C++, MATLAB, Python, Visual Basic and many other programming languages. He has been a Visiting Scholar at HP Labs developing computer vision algorithms and software for mobile devices. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech).