Data extracted from the World Health Organization https://covid19.who.int/data
To begin, we'll import our dataset using Pandas. The file "WHO-COVID-19-global-data.csv" holds comprehensive information regarding the Covid-19 pandemic, featuring statistics for each country worldwide, categorized by date. These statistics encompass data on new cases, cumulative cases, new deaths, and cumulative deaths.
import pandas as pd
data = pd.read_csv('WHO-COVID-19-global-data.csv')
data.head()
| | Date_reported | Country_code | Country | WHO_region | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-03 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 1 | 2020-01-04 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 2 | 2020-01-05 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 3 | 2020-01-06 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 4 | 2020-01-07 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
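One caveat when loading this file is that pandas, by default, treats the string "NA" as a missing value, which is also Namibia's two-letter country code. The following is a minimal sketch on a made-up two-row sample (not the real file) showing how `keep_default_na=False` could keep the code as text. Whether this suits the full WHO file is an assumption that should be checked against its other columns.

```python
import io
import pandas as pd

# Tiny stand-in for the WHO file (hypothetical rows, only a few columns).
sample_csv = io.StringIO(
    "Date_reported,Country_code,Country,New_cases\n"
    "2020-01-03,NA,Namibia,0\n"
    "2020-01-03,AF,Afghanistan,0\n"
)

# Default behaviour: the string "NA" is parsed as a missing value (NaN).
default_df = pd.read_csv(sample_csv)

# Disable the default NA strings so "NA" stays text; empty cells are still NaN.
sample_csv.seek(0)
strict_df = pd.read_csv(sample_csv, keep_default_na=False, na_values=[""])

print(default_df["Country_code"].isna().sum())
print(strict_df["Country_code"].tolist())
```

With this option the country code would not need to be patched by hand later on.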
Next, we'll identify the dataset's earliest and latest data points by extracting the minimum and maximum values from the 'Date_reported' column. This tells us the timeframe that our data covers.
print("Earliest day: " + str(data['Date_reported'].min()))
print("Latest day: " + str(data['Date_reported'].max()))
Earliest day: 2020-01-03
Latest day: 2023-08-30
The data is reasonably current: this project was started on 2023-09-03, and the most recent data point is from 2023-08-30. According to the Centers for Disease Control and Prevention (CDC), the outbreak of Covid-19 commenced on December 12, 2019, with "a cluster of patients in China's Hubei Province, specifically in the city of Wuhan, exhibiting symptoms of an atypical pneumonia-like illness that did not respond well to standard treatments."
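As a small sketch, the gap between the project start and the latest report can be computed directly with pandas timestamps. The two dates are taken from the text and the printed output; the variable names are only illustrative.

```python
import pandas as pd

project_start = pd.Timestamp("2023-09-03")  # project start date given in the text
latest_report = pd.Timestamp("2023-08-30")  # latest 'Date_reported' value printed earlier

# Subtracting two timestamps gives a Timedelta; .days is the whole-day difference.
lag_days = (project_start - latest_report).days
print(f"The latest data is {lag_days} days older than the project start.")
```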
Our dataset begins in the early stages of the pandemic, on 2020-01-03. To see how the virus spread from there, I will create maps displaying data on new Covid-19 cases. This approach will allow us to pinpoint the pandemic's epicenter, track its global evolution since its inception, and identify the countries most profoundly impacted.
To build the maps, we'll begin by importing GeoPandas, a Python library that streamlines geospatial data handling. We'll then load a world country map stored as a JSON file, which can be downloaded from https://github.com/deldersveld/topojson/blob/master/world-countries.json. To ensure compatibility with our Covid-19 data, I adjusted the ISO2 codes for Kosovo, North Cyprus, Western Sahara, and Somaliland in the map file and aligned the column names of the two datasets.
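One way to find which country codes need this kind of adjustment is an outer merge with `indicator=True`, which labels every row by the table it came from. The sketch below uses two made-up tables with placeholder codes (AA, BB, CC, DD are hypothetical, not real ISO codes) rather than the actual map and WHO data.

```python
import pandas as pd

# Hypothetical stand-ins for the map table and the WHO table.
map_codes = pd.DataFrame({"iso2": ["AA", "BB", "CC"]})
who_codes = pd.DataFrame({"iso2": ["AA", "BB", "DD"]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
check = map_codes.merge(who_codes, on="iso2", how="outer", indicator=True)

only_in_map = check.loc[check["_merge"] == "left_only", "iso2"].tolist()
only_in_who = check.loc[check["_merge"] == "right_only", "iso2"].tolist()
print("Codes only in the map:", only_in_map)
print("Codes only in the WHO data:", only_in_who)
```

Codes that appear on only one side are the ones an inner merge would silently drop from the maps.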
Next, I've created a function called process_data_for_date(). It takes the dataset, a single date from the first year of the pandemic, and a Matplotlib axis, and draws a map of the new Covid-19 cases reported in each country on that day. Using Matplotlib, I've arranged these maps in a grid to visualize the virus's progression.
import geopandas as gpd
import matplotlib.pyplot as plt
world_countries = gpd.read_file('world-countries.json')
world_countries.rename(columns={'Alpha-2': 'iso2'}, inplace=True)
def process_data_for_date(data, date, ax):
    later_data = data[data['Date_reported'] == date].copy()
    later_data.rename(columns={'Country_code': 'iso2'}, inplace=True)
    later_data.loc[later_data['Country'] == 'Namibia', 'iso2'] = "NAM" # Pandas recognized the iso 'NA' as NaN.
    first_cases = later_data[['iso2', 'New_cases']]
    merged_df = world_countries.merge(first_cases, on='iso2', how='inner')
    merged_df.plot('New_cases', legend=True, ax=ax)
    ax.set_title(date)
dates_to_process = ['2020-01-03', '2020-02-03', '2020-03-03', '2020-04-03', '2020-05-03',
'2020-06-03', '2020-07-03', '2020-08-03', '2020-09-03', '2020-10-03',
'2020-11-03', '2020-12-03']
num_rows = 4
num_cols = 3
fig, axes = plt.subplots(num_rows, num_cols, figsize=(16, 12))
axes = axes.flatten()
for i, date in enumerate(dates_to_process):
    if i < num_rows * num_cols: # Check if there are more dates than subplots
        process_data_for_date(data, date, axes[i])
    else:
        break
plt.tight_layout()
plt.show()
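In the grid above, each map gets its own colour scale, so the same colour can mean different case counts in different panels. A possible refinement, sketched here on made-up numbers, is to compute one shared minimum and maximum across all selected dates. GeoPandas' `plot` accepts `vmin` and `vmax`, so these values could be passed through; how that looks on the real maps is untested here.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the WHO table.
toy = pd.DataFrame({
    "Date_reported": ["2020-02-03", "2020-02-03", "2020-03-03", "2020-03-03"],
    "Country_code": ["AA", "BB", "AA", "BB"],
    "New_cases": [2500, 10, 120, 900],
})
dates = ["2020-02-03", "2020-03-03"]

# One colour range shared by every panel.
selected = toy[toy["Date_reported"].isin(dates)]
vmin = selected["New_cases"].min()
vmax = selected["New_cases"].max()
print(vmin, vmax)

# Inside the plotting function the range could then be forwarded, for example:
# merged_df.plot('New_cases', legend=True, ax=ax, vmin=vmin, vmax=vmax)
```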
At the beginning of 2020, it's noticeable that the number of reported cases was close to zero in most parts of the world, even though there's a general belief that the virus had already started spreading from China in December 2019. This might be attributed to the lack of sufficient information about the virus, making it challenging to identify infections and compile data. However, come February, a major outbreak occurred in China, with case numbers skyrocketing to over 2500. As the months rolled on, a clear trend began to emerge. China responded vigorously, implementing stringent measures that effectively suppressed the virus within its borders. In contrast, other nations began to grapple with the emergence of this new pathogen.
In Europe, Italy took the lead in identifying infections, prompting a wave of alarm across the continent, leading to the implementation of rigorous measures. European countries, on the whole, managed the situation relatively effectively. Across the Atlantic, countries like the USA and Brazil faced a starkly different trajectory, witnessing a surge in cases, with numbers exceeding 50,000 during the first half of the year. The situation in the US continued to deteriorate through the latter half, partly due to resource constraints and insufficient awareness. Within Asia, Russia and India grappled with the most severe repercussions of the virus, encountering substantial difficulties in their containment efforts. In contrast, Africa and Oceania remained relatively unscathed for the better part of the initial year.
Now we can examine the pandemic's progression from 2020 through 2023, using annual averages of daily new cases rather than fixed monthly dates.
def process_data_for_year(data, year, ax):
    # Keep only the rows whose date string starts with the given year
    later_data = data[data['Date_reported'].str.startswith(year)].copy()
    later_data.rename(columns={'Country_code': 'iso2'}, inplace=True)
    later_data.loc[later_data['Country'] == 'Namibia', 'iso2'] = "NAM" # Replace 'NA' iso2 with 'NAM'
    # Mean of daily new cases per country; reset_index() returns a DataFrame with an 'iso2' column for merging
    yearly_mean_cases = later_data.groupby('iso2')['New_cases'].mean().reset_index()
    merged_df = world_countries.merge(yearly_mean_cases, on='iso2', how='inner')
    merged_df.plot('New_cases', legend=True, ax=ax)
    ax.set_title(year)
years_to_process = ['2020', '2021', '2022', '2023']
num_rows = 2
num_cols = 2
fig, axes = plt.subplots(num_rows, num_cols, figsize=(16, 12))
axes = axes.flatten()
for i, year in enumerate(years_to_process):
    if i < num_rows * num_cols: # Check if there are more dates than subplots
        process_data_for_year(data, year, axes[i])
    else:
        break
plt.tight_layout()
plt.show()
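The yearly aggregation used above can be checked on a small made-up table. This sketch groups by calendar year (taken from a datetime column rather than from string prefixes, which is a swapped-in variant) and by country, then averages the daily new cases. Note that for a year with incomplete data, such as 2023 in this dataset, the mean only covers the days that are present.

```python
import pandas as pd

# Hypothetical toy data: one country, two days in each of two years.
toy = pd.DataFrame({
    "Date_reported": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2021-01-01", "2021-01-02"]
    ),
    "Country_code": ["AA"] * 4,
    "New_cases": [10, 30, 100, 300],
})

# Group by calendar year and country, then average the daily new cases.
year_key = toy["Date_reported"].dt.year.rename("year")
yearly_mean = (
    toy.groupby([year_key, "Country_code"])["New_cases"]
    .mean()
    .reset_index()
)
print(yearly_mean)
```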
Since 2021, many countries and regions around the world have initiated extensive COVID-19 vaccination campaigns. Multiple vaccines, including Pfizer-BioNTech, Moderna, AstraZeneca, Johnson & Johnson, and others, have been developed and distributed globally. Vaccination efforts aimed to achieve herd immunity and reduce the severity of the disease. We can see that by 2023 new cases have started to slowly decrease.
However, several new variants of the SARS-CoV-2 virus have also emerged and spread since late 2020, with some of them raising concerns due to increased transmissibility or potential vaccine resistance. Notable variants include Alpha, Beta, Gamma, Delta, and Omicron. Many countries have experienced multiple waves or surges of COVID-19 infections as shown on the maps, often associated with the emergence of new variants or changes in public health measures. These waves have led to fluctuations in case numbers, hospitalizations, and deaths.
For example, COVID-19 got worse in China during 2022 and early 2023. In spring 2022, while the zero-COVID policy was still in force, the government imposed lockdowns in several cities, including Shanghai, which has a population of roughly 25 million people. The lockdowns had a significant impact on the Chinese economy, causing factories to close and businesses to lose money. In December 2022, the government abruptly ended the zero-COVID policy, which had been in place for nearly three years. This led to a sharp increase in cases, with some reported estimates suggesting that as many as 37 million people were infected in a single day in late December 2022.
Let's look at the most recent data to see the current situation.
later_data = data[data['Date_reported'] == '2023-08-30'].copy()
later_data.rename(columns={'Country_code': 'iso2'}, inplace=True)
later_data.loc[later_data['Country'] == 'Namibia', 'iso2'] = "NAM" # Pandas recognized the iso 'NA' as NaN.
first_cases = later_data[['iso2', 'New_cases']]
merged_df = world_countries.merge(first_cases, on='iso2', how='inner')
merged_df.plot('New_cases', legend=True)
Currently, it seems like new cases of COVID-19 have reduced significantly in many regions, thanks in large part to widespread vaccination efforts, the implementation of public health measures, and increased awareness and compliance with safety protocols. However, it's important to remain vigilant and continue monitoring the situation, as the virus can still pose a threat, especially with the emergence of new variants and the potential for seasonal fluctuations in cases. We can see the daily global cases over time in the following graph.
data['Date_reported'] = pd.to_datetime(data['Date_reported'])
global_cases = data.groupby('Date_reported')['New_cases'].sum().reset_index()
plt.figure(figsize=(12, 6))
plt.plot(global_cases['Date_reported'], global_cases['New_cases'], marker='o', linestyle='-')
plt.xlabel('Date Reported')
plt.ylabel('Global New Cases')
plt.title('Daily Global COVID-19 Cases Over Time')
plt.grid(True)
plt.tight_layout()
plt.show()
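Daily totals like the ones plotted above can look spiky if reporting is uneven across the week (an assumption about the data, not something verified here). A common way to smooth such a series is a 7-day rolling mean, sketched below on a made-up series where cases are only reported once a week.

```python
import pandas as pd

# Hypothetical series: reports arrive only on every seventh day.
daily = pd.Series(
    [0, 0, 0, 0, 0, 0, 700, 0, 0, 0, 0, 0, 0, 1400],
    index=pd.date_range("2023-01-01", periods=14, freq="D"),
    name="New_cases",
)

# The mean over a trailing 7-day window spreads each weekly report over the week.
# The first six values are NaN because the window is not yet full.
smoothed = daily.rolling(window=7).mean()
print(smoothed)
```

On the real data, the same call could be applied to `global_cases['New_cases']` before plotting.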
Now, let's examine the countries most impacted by analyzing the cumulative data provided by WHO.
#Two ways to find the current total cases.
total_cases = data['New_cases'].sum()
total_cases = data[data['Date_reported'] == '2023-08-30']['Cumulative_cases'].sum()
print(f'Up to 2023-08-30, there have been a total of {total_cases} cases of Covid-19 worldwide.')
Up to 2023-08-30, there have been a total of 770085713 cases of Covid-19 worldwide.
#Two ways to find the current total deaths by Covid-19.
total_deaths = data['New_deaths'].sum()
total_deaths = data[data['Date_reported'] == '2023-08-30']['Cumulative_deaths'].sum()
print(f'Up to 2023-08-30, there have been a total of {total_deaths} deaths worldwide due to Covid-19.')
Up to 2023-08-30, there have been a total of 6956173 deaths worldwide due to Covid-19.
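The two ways of computing a total shown above should agree when the cumulative column is consistent with the daily column. Here is a small sketch of that check on made-up numbers for two hypothetical countries. On the real file the two results may differ if past figures were revised, which is not verified here.

```python
import pandas as pd

# Hypothetical toy data for two countries over two days.
toy = pd.DataFrame({
    "Date_reported": ["2023-08-29", "2023-08-30", "2023-08-29", "2023-08-30"],
    "Country": ["A", "A", "B", "B"],
    "New_cases": [5, 3, 2, 1],
    "Cumulative_cases": [5, 8, 2, 3],
})

# Way 1: add up every daily count.
total_from_daily = toy["New_cases"].sum()

# Way 2: add up each country's cumulative count on the latest date.
latest = toy["Date_reported"].max()
total_from_cumulative = toy.loc[toy["Date_reported"] == latest, "Cumulative_cases"].sum()

print(total_from_daily, total_from_cumulative)
```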
# Top 10 Affected Countries
# Most COVID-19 Cases by Country (up to 2023-08-30)
cases_by_country = data[data['Date_reported'] == '2023-08-30'][['Country', 'Cumulative_cases']]
cases_by_country = cases_by_country.sort_values(by='Cumulative_cases', ascending=False)
top_countries1 = cases_by_country.head(10)
# Most COVID-19 Deaths by Country (up to 2023-08-30)
deaths_by_country = data[data['Date_reported'] == '2023-08-30'][['Country', 'Cumulative_deaths']]
deaths_by_country = deaths_by_country.sort_values(by='Cumulative_deaths', ascending=False)
top_countries2 = deaths_by_country.head(10)
# Highest COVID-19 Case-Fatality Ratios by Country (up to 2023-08-30)
data_selected = data[data['Date_reported'] == '2023-08-30'].copy()
data_selected.loc[:, 'Death_to_Case_Ratio'] = data_selected['Cumulative_deaths'] / data_selected['Cumulative_cases']
data_selected = data_selected.sort_values(by='Death_to_Case_Ratio', ascending=False)
top_countries3 = data_selected.head(10)
# Highest COVID-19 Deaths-Population Ratios by Country (up to 2023-08-30)
# Extract population data from 'world_population.csv' https://www.kaggle.com/datasets/iamsouravbanerjee/world-population-dataset?resource=download
world_population = pd.read_csv('world_population.csv')
world_population.rename(columns={'CCA3': 'iso3'}, inplace=True)
world_population = world_population[['iso3', '2022 Population']]
world_countries.rename(columns={'id': 'iso3'}, inplace=True)
merged_df = world_countries.merge(world_population, on='iso3', how='inner')
data_selected = data[data['Date_reported'] == '2023-08-30'].copy() # Create a copy to avoid the warning
data_selected.rename(columns={'Country_code': 'iso2'}, inplace=True)
merged_df2 = data_selected.merge(merged_df, on='iso2', how='inner')
merged_df2.loc[:, 'Death_to_Population_Ratio'] = merged_df2['Cumulative_deaths'] / merged_df2['2022 Population']
merged_df2 = merged_df2.sort_values(by='Death_to_Population_Ratio', ascending=False)
top_countries4 = merged_df2.head(10)
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
axes[0, 0].pie(top_countries1['Cumulative_cases'], labels=top_countries1['Country'], autopct='%1.1f%%', startangle=140, textprops={'fontsize': 9})
axes[0, 0].set_title('Most COVID-19 Cases', fontsize=12)
axes[0, 1].pie(top_countries2['Cumulative_deaths'], labels=top_countries2['Country'], autopct='%1.1f%%', startangle=140, textprops={'fontsize': 9})
axes[0, 1].set_title('Most COVID-19 Deaths', fontsize=12)
axes[1, 0].pie(top_countries3['Death_to_Case_Ratio'], labels=top_countries3['Country'], autopct='%1.1f%%', startangle=140, textprops={'fontsize': 9})
axes[1, 0].set_title('Highest COVID-19 Case-Fatality Ratios', fontsize=12)
axes[1, 1].pie(top_countries4['Death_to_Population_Ratio'], labels=top_countries4['Country'], autopct='%1.1f%%', startangle=140, textprops={'fontsize': 9})
axes[1, 1].set_title('Highest COVID-19 Deaths-Population Ratios', fontsize=12)
plt.subplots_adjust(wspace=0.9, hspace=-0.2)
plt.show()
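The top-ten selections above sort the whole table and then take the head. As a side note, pandas also offers `nlargest`, which expresses the same idea in one call. The sketch below compares both on a made-up table, using a top three to keep it short.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Cumulative_cases": [50, 10, 40, 30, 20],
})

# Sort-then-head, as in the analysis above.
top3_sorted = toy.sort_values(by="Cumulative_cases", ascending=False).head(3)

# Equivalent selection with nlargest.
top3_nlargest = toy.nlargest(3, "Cumulative_cases")

print(top3_nlargest)
```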
In the preceding set of pie charts, we examine the ten countries most profoundly affected by the COVID-19 pandemic up to 2023-08-30. We gauge the extent of the damage by considering data on infections and fatalities.
In the first chart, we observe the top ten countries with the highest number of Covid-19 cases. Notably, the United States, China, and India experienced the greatest number of infections. It's worth acknowledging that these nations are both large in terms of landmass and have substantial populations. These factors could have contributed to their higher infection and transmission rates. Additionally, factors such as a lack of preventive measures and vaccination efforts might have accelerated the spread of the virus in these regions.
The second chart, highlighting the top ten countries with the highest Covid-19 death tolls, presents a somewhat similar picture to the first chart. However, China stands out for effectively managing to reduce fatalities despite a significant infection rate. In contrast, Brazil appears to have suffered a considerable number of casualties.
The third graph showcases the top ten countries with the highest Covid-19 case-fatality ratios. This metric sheds light on where the virus proved most lethal. Yemen, Sudan, and Syria emerge as places with the highest mortality rates associated with Covid-19.
Lastly, the fourth graph outlines the top ten countries with the highest Covid-19 deaths-population ratios. This statistic offers insights into how many citizens per capita succumbed to the virus, enabling a deeper analysis of mortality rates. Notably, countries like Morocco and Peru experienced substantial losses in terms of lives.
It is crucial to recognize that these last two graphs include several developing countries, which may not have had the necessary resources to effectively combat the pandemic. Factors contributing to these disparities encompass healthcare infrastructure, economic resources, implementation of public health measures, access to vaccines, international cooperation, and the prevalence of pre-existing health conditions. While underdevelopment is a factor, the overall impact is influenced by a complex interplay of these various factors and necessitates further in-depth analysis.
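To make the two ratios discussed above concrete, here is a minimal sketch on made-up numbers. It expresses the case-fatality ratio as a percentage and the deaths-population ratio as deaths per 100,000 people. The column names `cfr_percent` and `deaths_per_100k` are illustrative and do not appear in the analysis. Keep in mind that the percentages printed on pie slices are each country's share of the ten plotted values, not the ratios themselves.

```python
import pandas as pd

# Hypothetical toy data for two countries.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Cumulative_cases": [1000, 200],
    "Cumulative_deaths": [10, 20],
    "Population": [100_000, 1_000_000],
})

# Case-fatality ratio: deaths divided by confirmed cases, as a percentage.
toy["cfr_percent"] = 100 * toy["Cumulative_deaths"] / toy["Cumulative_cases"]

# Deaths relative to population, scaled to deaths per 100,000 people.
toy["deaths_per_100k"] = 100_000 * toy["Cumulative_deaths"] / toy["Population"]

print(toy[["Country", "cfr_percent", "deaths_per_100k"]])
```

In this toy example country B has the higher case-fatality ratio while country A has more deaths per capita, which shows why the two charts can rank countries differently.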
Let's predict Covid-19 cases in the USA for the following year, between 2023-08-31 and 2024-08-31. For this, we can install the Prophet Python library by running 'pip install prophet' from the command line. Prophet is an open-source forecasting tool developed by Facebook's Core Data Science team. It is designed for time series forecasting and is particularly well-suited for datasets with daily observations and strong seasonal patterns.
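Prophet expects its input as a DataFrame with a date column named `ds` and a value column named `y`. The sketch below shows only that preparation step on a few made-up rows, using pandas alone so it can run without Prophet installed.

```python
import pandas as pd

# Hypothetical toy rows in the same shape as the WHO table.
toy = pd.DataFrame({
    "Date_reported": ["2023-08-28", "2023-08-29", "2023-08-30"],
    "Country": ["United States of America"] * 3,
    "New_cases": [100, 120, 90],
})
toy["Date_reported"] = pd.to_datetime(toy["Date_reported"])

# Filter one country, keep the date and target columns, and rename them
# to the 'ds' / 'y' names that Prophet looks for.
prophet_df = (
    toy.loc[toy["Country"] == "United States of America", ["Date_reported", "New_cases"]]
    .rename(columns={"Date_reported": "ds", "New_cases": "y"})
    .reset_index(drop=True)
)
print(prophet_df)
```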
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from prophet import Prophet
from datetime import timedelta
# Load and preprocess the dataset
data = pd.read_csv('WHO-COVID-19-global-data.csv')
data['Date_reported'] = pd.to_datetime(data['Date_reported'])
US_data = data[data['Country'] == 'United States of America'].copy()  # copy so the in-place change below does not trigger SettingWithCopyWarning
US_data.set_index('Date_reported', inplace=True)
# Forecasting the following year
forecast_period = 365
# Prepare data for Prophet
df = US_data.reset_index()[['Date_reported', 'New_cases']]
df = df.rename(columns={'Date_reported': 'ds', 'New_cases': 'y'})
# Create and fit the Prophet model
model = Prophet()
model.fit(df)
# Make future predictions
future = model.make_future_dataframe(periods=forecast_period)
forecast = model.predict(future)
# Visualize the forecast
fig = model.plot(forecast)
plt.xlabel('Year')
plt.ylabel('New Cases')
plt.title('COVID-19 Cases Forecast for the United States')
plt.show()
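When reading a forecast like the one produced above, it helps to remember that counts cannot be negative, while Prophet's predictions and uncertainty bounds (the `yhat`, `yhat_lower`, and `yhat_upper` columns) can dip below zero. One simple post-processing option, shown here on a made-up forecast table rather than real model output, is to clip those columns at zero before interpreting them.

```python
import pandas as pd

# Hypothetical forecast rows using Prophet's output column names.
toy_forecast = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=3, freq="D"),
    "yhat": [-50.0, 20.0, 300.0],
    "yhat_lower": [-400.0, -200.0, 50.0],
    "yhat_upper": [250.0, 310.0, 600.0],
})

# Replace negative predictions and bounds with zero.
value_cols = ["yhat", "yhat_lower", "yhat_upper"]
clipped = toy_forecast.copy()
clipped[value_cols] = clipped[value_cols].clip(lower=0)
print(clipped)
```

Clipping is a cosmetic fix only; it does not change the underlying model.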
The results suggest that there will be another peak at the beginning of 2024, similar to what occurred in previous years. Additionally, it's anticipated that the virus's spread will remain relatively low. Now, let's employ the Prophet model once more to forecast COVID-19-related deaths in the United States for the upcoming year.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from prophet import Prophet
from datetime import timedelta
# Load and preprocess the dataset
data = pd.read_csv('WHO-COVID-19-global-data.csv')
data['Date_reported'] = pd.to_datetime(data['Date_reported'])
US_data = data[data['Country'] == 'United States of America'].copy()  # copy so the in-place change below does not trigger SettingWithCopyWarning
US_data.set_index('Date_reported', inplace=True)
# Forecasting the following year
forecast_period = 365
# Prepare data for Prophet
df = US_data.reset_index()[['Date_reported', 'New_deaths']]
df = df.rename(columns={'Date_reported': 'ds', 'New_deaths': 'y'})
# Create and fit the Prophet model
model = Prophet()
model.fit(df)
# Make future predictions
future = model.make_future_dataframe(periods=forecast_period)
forecast = model.predict(future)
# Visualize the forecast
fig = model.plot(forecast)
plt.xlabel('Year')
plt.ylabel('New Deaths')
plt.title('COVID-19 Deaths Forecast for the United States')
plt.show()
The findings indicate that there will be another surge in deaths in the US following the projected peak in the spread at the outset of 2024, with another smaller surge expected in the subsequent months leading up to summer. To conclude the research, let's calculate the total global COVID-19-related deaths expected by the year 2030 using Prophet again.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from prophet import Prophet
# Prepare the data
data = pd.read_csv('WHO-COVID-19-global-data.csv')
data_cumulative_deaths = data.groupby('Date_reported')['Cumulative_deaths'].sum().reset_index()
data_cumulative_deaths.columns = ['ds', 'y']
# Create a Prophet model
model = Prophet()
# Fit the model to the historical data
model.fit(data_cumulative_deaths)
# Make a DataFrame for future predictions
future = model.make_future_dataframe(periods=2312) # Extend the time horizon
# Make predictions for the future
forecast = model.predict(future)
# Plot the historical data and future predictions
fig = model.plot(forecast, xlabel='Year', ylabel='Cumulative Deaths')
plt.title('Historical and Predicted Cumulative Deaths')
target_date = '2030-01-01'
closest_row = forecast.iloc[(forecast['ds'] - pd.to_datetime(target_date)).abs().argsort()[:1]]
# Get the projected cumulative deaths for the closest date
cumulative_deaths_2030 = closest_row['yhat'].values[0]
print("Projected cumulative deaths by the start of 2030:", int(cumulative_deaths_2030))
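# 6956173 is the total number of deaths reported up to 2023-08-30 (computed earlier)
print(f'Unfortunately, {int(cumulative_deaths_2030) - 6956173} more people are expected to die from Covid-19 by 2030.')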
Projected cumulative deaths by the start of 2030: 9261555
Unfortunately, 2305382 more people are expected to die from Covid-19 by 2030.
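The forecast horizon of 2312 days used above is hard-coded. As a small sketch, the number of days between the last observation and the target date can instead be derived from the dates themselves. The variable names here are illustrative.

```python
import pandas as pd

last_observation = pd.Timestamp("2023-08-30")  # latest date in the dataset
target_date = pd.Timestamp("2030-01-01")       # target used in the analysis

# Whole days between the last observation and the target date.
horizon_days = (target_date - last_observation).days
print(horizon_days)

# Where a 2312-day horizon actually ends.
end_of_2312 = last_observation + pd.Timedelta(days=2312)
print(end_of_2312.date())
```

The 2312-day horizon ends a few days before 2030-01-01, so the closest-row lookup in the analysis picks the last forecast day rather than the target date itself.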