Capturing the COVID-19 Crisis through Public Health and Social Measures Data Science | Scientific Data

The achievements of PHSM trackers have not come easily. From developing data taxonomies to building organizational structures for collecting, cleaning and validating data, PHSM trackers initiated their efforts without the benefit of precedent. At the beginning of the pandemic, trackers also worked without knowledge of each other’s efforts. While a vast improvement over isolation, greater cooperation among trackers entails its own set of challenges. Our review below provides a holistic overview of both the data and organizational challenges facing PHSM trackers individually and as a group. This can be complemented with Shen et al.’s20 commentary, which provides a more in-depth discussion of the various data challenges faced by individual trackers.

Individual challenges

Data taxonomy forms the basis for comprehensible and meaningful use of PHSM data. Each tracker had different strategies for building its taxonomy, and given the peculiarities of how governments implemented COVID-19 PHSM, trackers generally developed their taxonomies inductively and inferentially. Trackers have found that the main challenge in doing so is developing a standard taxonomy that can both capture the nuances and peculiarities of a given country’s PHSM rollout and allow for cross-country comparisons. Additionally, keeping taxonomies relevant over time through periodic updates (e.g., documenting vaccination policies following the global vaccine rollout) remains an ongoing challenge.
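The tension between country-specific nuance and cross-country comparability can be illustrated with a small crosswalk. The following is a minimal sketch, not any tracker's actual method: the tracker names, labels, and shared categories are all invented for illustration. Each record keeps its original label verbatim alongside a shared category, and labels with no mapping are flagged so the taxonomy can be updated.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical crosswalk: (tracker name, tracker-specific label) -> shared category.
# All tracker names, labels, and categories are invented for illustration.
CROSSWALK: Dict[Tuple[str, str], str] = {
    ("tracker_x", "School closing"): "education_closure",
    ("tracker_y", "Closure of schools and universities"): "education_closure",
    ("tracker_x", "Stay at home requirements"): "stay_at_home",
}


@dataclass(frozen=True)
class HarmonizedPolicy:
    tracker: str
    original_label: str             # kept verbatim to preserve local nuance
    shared_category: Optional[str]  # None means the crosswalk has no mapping yet


def harmonize(tracker: str, label: str) -> HarmonizedPolicy:
    """Attach a shared category to a tracker-specific label, if one is known."""
    return HarmonizedPolicy(tracker, label, CROSSWALK.get((tracker, label)))


def unmapped(policies: List[HarmonizedPolicy]) -> List[HarmonizedPolicy]:
    """Return records whose labels are missing from the crosswalk."""
    return [p for p in policies if p.shared_category is None]


records = [
    harmonize("tracker_x", "School closing"),
    harmonize("tracker_y", "Closure of schools and universities"),
    harmonize("tracker_y", "Vaccination eligibility"),  # newer policy type, not yet mapped
]
needs_review = unmapped(records)
```

In this toy example the two school-closure labels become comparable under one shared category, while the unmapped vaccination label surfaces in `needs_review` as a prompt for a periodic taxonomy update.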

Likewise, data standardization remains a key challenge in PHSM data collection (as is true in data collection more broadly). Beyond the enormous variability in definitions of policies and interventions that PHSM trackers encountered while collecting data from around the globe, the lack of data standardization at the national, state/provincial, and local levels represents a major hindrance for data collection21. This issue affects not only COVID-19 data but also basic demographic data. Indeed, detailed demographic data are often not available to the public, and definitions as well as categories for demographic characteristics vary across countries and states22. This disarray not only makes data collection highly challenging but also makes it difficult to identify and compare the many consequences of the pandemic (e.g., socioeconomic and health consequences), especially with regard to the most vulnerable populations.

To collect, clean, and validate this enormous volume of PHSM data, most trackers rely on the tremendous contribution of volunteers. However, the corresponding recruitment, training, engagement, and organization of volunteers present enormous challenges. Most volunteers are students, and their availability thus fluctuates according to the academic calendar. The reliance on unpaid work also raises questions of research ethics and sustainability. According to our survey, only around 10% of data collectors across trackers are paid; the vast majority are volunteers serving a public good (Fig. 3a).

Fig. 3

PHSM Tracker Survey Responses. Table 1 provides information on which trackers participated in the survey. Responses from the tracker survey of PHSM Network members to the following questions: (a) What is the number of paid versus unpaid data collectors? (b) What are funding needs compared to received funds? (c) Is the tracker still actively coding new policies? (d) What governmental level of policies do trackers gather data for?

Many trackers rely on volunteers for data collection not by design, but due to lack of funding. Funding constraints are unfortunately quite severe: many policy trackers have had to stop working because of the lack of continued funding, resulting in wide evidentiary gaps. When trackers do receive funding, it is often short-term because of uncertainty about the pandemic’s duration. According to our tracker survey, only 16% of trackers’ overall funding needs are met (Fig. 3b). This has led to a 65% decline in the number of trackers that are actively collecting data (Fig. 3c). Some trackers have attempted to address this problem by harmonising their data into the few databases with more sustainable funding schemes, which underscores the importance of longer-term funding for sustained PHSM data tracking.

Collective challenges

PHSM trackers face challenges not only as individual actors, but also as a collective ecosystem. At the beginning of the pandemic, 40+ PHSM tracking projects launched with little to no knowledge of each other due to their emergency nature. These parallel data collection efforts led to the duplication of data, multiple taxonomy strategies across trackers, gaps in data coverage and variation in data quality23.

While there is significant data overlap among trackers, many trackers also have unique data coverage in specific domains, such as public health, economic policy, and human rights. Though these differences provide a diversity of perspectives on PHSM data tracking, they can lead to difficulties in data utilization. Working toward a single harmonized source might seem like an obvious solution, and indeed the World Health Organization (WHO) has done important work toward this goal24. However, this work also underscores the difficulty of data harmonization when underlying data sources are still in the process of being cleaned and organized. More to the point, we believe that there is great value in continuing to maintain diversity in tracking projects. Doing so allows (i) different datasets to be validated against each other, (ii) individual datasets to reflect a variety of research priorities, and (iii) stakeholders to find the dataset that best fits their needs.
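The idea of validating datasets against each other can be sketched in a few lines. This is an illustrative example under stated assumptions rather than a procedure used by any tracker: it assumes records from two trackers have already been harmonized to a common key of (country code, date, policy category), and all records shown are made up.

```python
from typing import Dict, Set, Tuple

# Hypothetical harmonized record key: (country code, ISO date, policy category).
Record = Tuple[str, str, str]


def compare_trackers(a: Set[Record], b: Set[Record]) -> Dict[str, object]:
    """Summarize agreement and gaps between two trackers' harmonized records."""
    shared = a & b
    union = a | b
    return {
        "shared": shared,    # recorded by both trackers (mutual corroboration)
        "only_a": a - b,     # candidate gaps in tracker B, or errors in A
        "only_b": b - a,     # candidate gaps in tracker A, or errors in B
        # Jaccard overlap: fraction of all distinct records found in both trackers.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Toy, made-up records for illustration only.
tracker_a = {
    ("AAA", "2020-03-15", "school_closure"),
    ("AAA", "2020-03-20", "mask_mandate"),
    ("BBB", "2020-04-01", "curfew"),
}
tracker_b = {
    ("AAA", "2020-03-15", "school_closure"),
    ("BBB", "2020-04-01", "curfew"),
    ("BBB", "2020-04-10", "border_closure"),
}

report = compare_trackers(tracker_a, tracker_b)
```

Records found in only one tracker are not necessarily errors; they may reflect differences in scope or research priorities, which is why they are returned for review rather than discarded.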

The benefits of diversity must be continuously balanced against the costs of data collection, completeness, and quality. With regard to data completeness, PHSM trackers have done impressive work in documenting how governments around the world have responded to the pandemic at both national and subnational levels; however, data overlaps and gaps persist. In general, across PHSM trackers, data from the “Global North” are overrepresented whereas data from the “Global South” are poor or missing. In the PHSM network, only one tracker has its main team physically based in the Global South. Due to funder interests, most data collection is focused on OECD countries and on national policies, leading to large gaps in data collection for less-developed countries and at subnational levels. While over 50% of the bigger trackers collect subnational data (Fig. 3d), systematic subnational data collection for non-OECD countries is limited to Brazil, China, India, Russia and Nigeria.

With regard to data quality, trackers have learned that local knowledge and/or language skills are essential to gathering complete and accurate information. Because many of the major trackers and their funders are based in the Global North, PHSM data quality for countries in the Global South is also more likely to suffer.

Altogether, while all trackers are united in their aim to document government responses to the COVID-19 pandemic, the scope, quality, and structure of PHSM datasets vary a great deal. This variation reflects, on the one hand, the sheer number of policies that can be collected and, on the other, the diversity of understandings of how to define a policy and of the organisational resources available to capture them. While providing a definitive guide as to which datasets may be best suited for a given analysis is still premature given the ongoing nature of the pandemic and of the attendant data collection, Table 1 provides some broad guidance for adjudicating among different datasets with regard to geographic and temporal dimensions at the time of writing of this commentary.

Ultimately, given the colossal volume and speed of government COVID-19 policy making, greater collaboration between researchers in different fields (e.g., epidemiologists, political scientists, data scientists) as well as communication with policy makers is needed to understand how to best model and analyse PHSM data. Such work would need to start with better integrating PHSM data with other relevant COVID-19 data (e.g., COVID-19 cases, deaths and hospitalizations; economic indicators; environmental indicators). In all likelihood, further work would need to be done to develop novel analytical tools for using PHSM data to assess the drivers and impacts of the pandemic. While some trackers have made more headway than others on this front (e.g., see Our World in Data’s COVID-19 dashboard or the PERISCOPE COVID Atlas), the field as a whole still lacks the coordination and resources needed to advance this work.

To address these challenges, in what follows, we outline key focus areas for PHSM data science and advocate for greater cooperation and communication among and between PHSM trackers.