Five reasons why the Covid-19 pandemic is a failure of Big Data Analytics


The ongoing Covid-19 pandemic was an opportunity to test the current capacity of the much-hyped Big Data Analytics. In April, a Harvard Business Review article entitled “Fighting Coronavirus with Big Data” argued that technological advances are the main difference between fighting this pandemic and the century-old Spanish flu. “In many ways, this is our most meaningful Big Data and analytics challenge so far. With will and innovation, we could rapidly forecast the spread of the virus not only at a population level but also, and necessarily, at a hyper-local, neighborhood level,” it opined.

The use of big data – the ‘Moneyball’ culture – dominates almost every bit of our present-day lifestyle. The aspiration was that efficient use of the vast data on the disease, its spread, and the mobility of people would help identify suspected cases, curb the spread of the pandemic, optimise the allocation of resources, and support appropriate and timely decisions.

South Korea, for example, leveraged Big Data to estimate the number of test kits that needed to be produced to meet demand. Also, contact tracing widely helped curb the spread of Covid-19, particularly in East Asia. In a report in the Journal of the American Medical Association in early March, Taiwan’s success in handling the Covid-19 crisis was partly attributed to Big Data analytics. However, lessons and preparedness from the past SARS epidemic, timely action, steep penalties for noncompliance with the temporary orders, and a culture of adherence to administrative directives may also be very important factors in the success stories in parts of East Asia.

The pandemic was a litmus test for big data experts all over the world. However, eight months into the pandemic, few success stories are blowing in the wind. In an article in FiveThirtyEight in April, Neil Paine opined: “The battle against COVID-19 has laid bare the limitations of modern technology in the face of a pandemic. We can’t accurately track the disease’s toll in real time, nor can we accurately predict where it’s headed.” Here are some possible reasons behind this failure.

First, people are often unsure about what to expect, and what not to expect, from Big Data analytics – the target is vaguely defined. Even big data experts sometimes forget their limitations in handling so much data, overestimate their capacity, and try to answer too many questions.

Second, ideally, data for such a purpose needs to be accumulated from around the world. Big data analytics, in turn, was supposed to identify geographical hotspots and make predictions. However, it’s almost impossible to account for all the variables necessary for this purpose. Also, there is a lack of coordination in collecting and combining the necessary data from different countries. Moreover, gathering all the required data may conflict with the strict privacy laws in many countries.

Third, there’s no denying the fact that too much useless data is collected. This is a general problem – perhaps due to increasing ambition and overestimation of statistical and technological capacity. The objective of analysing data is to identify variables behind causation, and also the relationships among the variables. However, it’s well known that the number of pairs showing significant ‘spurious’ or ‘nonsense’ correlation grows in the order of the ‘square of the number of variables’. With millions of variables, the number of pairs exhibiting such spurious correlations would be in the billions, which are almost impossible to identify. Moreover, suppose ‘age’ has a significant correlation with ‘infection rate’, while the square or some other function of ‘age’ exhibits an even higher correlation. Which function of ‘age’ should then be included in the model?
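The quadratic growth of spurious correlations is easy to see in a small simulation. The sketch below (the variable counts, sample size, and threshold are illustrative choices, not taken from any Covid-19 dataset) generates purely independent noise and counts how many variable pairs nonetheless show a ‘significant-looking’ correlation; doubling the number of variables roughly quadruples the count.

```python
import numpy as np

rng = np.random.default_rng(0)

def spurious_pairs(n_vars, n_obs=30, threshold=0.5):
    """Count variable pairs whose sample correlation exceeds the
    threshold, even though every variable is independent noise."""
    data = rng.standard_normal((n_obs, n_vars))
    corr = np.corrcoef(data, rowvar=False)
    # Look at the upper triangle only, excluding the diagonal.
    iu = np.triu_indices(n_vars, k=1)
    return int(np.sum(np.abs(corr[iu]) > threshold))

for n in (50, 100, 200):
    print(n, "variables:", spurious_pairs(n), "spurious pairs")
```

None of these correlations reflect any real relationship; they are artefacts of having many variables and limited observations, which is exactly the regime a global pandemic dataset sits in.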

Fourth, running routine software packages for analysing big data is never adequate, and is often incorrect! The added disadvantage in modelling such a pandemic is that nobody knew the exact dynamics of the disease. People mostly used existing epidemiological models from their past experience. Consequently, most of the prediction models for Covid-19 have failed miserably.
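The existing epidemiological models mentioned above are typically variants of the classic SIR compartmental model. A minimal sketch (the transmission rate beta and recovery rate gamma below are illustrative values, not fitted Covid-19 parameters) shows why unknown disease dynamics doom predictions: small changes in the assumed parameters swing the projected epidemic peak dramatically.

```python
# Minimal SIR (Susceptible-Infected-Recovered) model with Euler steps.
# beta: transmission rate, gamma: recovery rate -- illustrative only.
def sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, days=160, dt=1.0):
    s, i, r = s0, i0, 0.0
    history = [(s, i, r)]
    for _ in range(int(days / dt)):
        ds = -beta * s * i * dt     # new infections leave S
        dr = gamma * i * dt         # recoveries leave I
        s += ds
        i += -ds - dr
        r += dr
        history.append((s, i, r))
    return history

# Compare projected peaks under two plausible-looking guesses for beta.
peak_low = max(i for _, i, _ in sir(beta=0.25))
peak_high = max(i for _, i, _ in sir(beta=0.35))
print(f"peak infected fraction: {peak_low:.2f} vs {peak_high:.2f}")
```

When nobody knows the true beta and gamma – as was the case in early 2020 – no volume of data pushed through such a model can rescue the forecast.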

Fifth, current computational equipment is certainly inadequate to handle millions of variables and billions of data points. The last three points correspond to general problems in handling big data, in any case.

Statistics is still in its infancy in this context, and is not yet equipped to handle big data efficiently. Let’s be honest enough to admit that. When the much-hyped ‘Google Flu Trends’ project, launched in 2008, turned out to be a disastrous failure, people came to understand that big data might not be the holy grail! The situation is mostly unchanged even after a decade. In 2017, Gartner analyst Nick Heudecker estimated that about 5 out of 6 big data projects fail. I suspect the actual failure rate might be even higher, for, in most cases, nobody knows what would count as ‘success’. In contrast, ‘success’ in handling Covid-19 was more or less well defined. And big data, in general, failed to make a big impact in such a crisis of human civilisation – not quite surprisingly, though.

Disclaimer: The views expressed in the article above are those of the author and do not necessarily represent or reflect the views of this publishing house. Unless otherwise noted, the author is writing in his/her personal capacity. They are not intended and should not be thought to represent official ideas, attitudes, or policies of any agency or institution.