The Human Immunodeficiency Virus on the African Continent
by John Ferrone
Abstract
This report examines HIV, primarily in Africa, the region with the largest amount of data and the highest levels of HIV in the world. Its purpose is to determine whether the proportion of the population living with HIV is correlated with factors such as access to basic drinking water, education expenditure as a percentage of GDP, or the rural population as a percentage of the total population. The prevalence of HIV in Africa was found to be much higher than anywhere else in the world. Despite this, many of the initial ideas regarding this topic proved to be incorrect.
Introduction
Background
The human immunodeficiency virus, or HIV, first rose to prevalence in the 1970s after it made the jump from chimpanzees to humans. While the virus likely existed for centuries before it was discovered in humans, it was brought to international attention only more recently and has been incorrectly associated with homosexual populations in the West. In Africa, the disease is recognized as a more widespread epidemic. Initial explorations of the data covered the entire world, but the focus shifted primarily to Africa because this region had the most consistent and complete data. Africa as a whole differs from the rest of the world when it comes to HIV: the proportion of the world's population living with HIV is estimated to be about 0.7%, while in Africa it is closer to 5%.
The collection of this data was overseen and performed by the World Bank Group, which published it on its website. It covers the majority of the world, but in each analysis numerous countries were excluded because of missing data.
Aims
The aim of this research is to determine where HIV is most prevalent, and what some of the indicators of this disease are. There are many things which one might expect to point to a lower rate of disease, such as education and health expenditure, as well as prevalence of condom use. Analysis will be done by creating a variety of scatter plots to determine any potential relationships which should be highlighted. Additionally, a correlation matrix will be created to help provide insight into the more obvious relationships.
Certain columns will also be removed in order to maximize the size of the dataset. Too many variables with missing values can lead to a much smaller subset of the data being used. Following this, the data will be tidied further before regression is performed to determine how they act as indicators for the percentage of the population with HIV. Finally, the principal components will also be separated out to further explore the relationships between variables.
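The trade-off between keeping variables and keeping countries can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for this report: the toy values, the condom_use column, the helper drop_sparse_columns, and the 50% threshold are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not World Bank figures).
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "hiv_percentage": [0.010, 0.045, 0.009, 0.270, 0.001],
    "education_exp": [0.05, 0.05, np.nan, 0.07, 0.06],
    "condom_use": [np.nan, 0.40, np.nan, np.nan, np.nan],  # hypothetical sparse column
})


def drop_sparse_columns(frame, max_missing=0.5):
    """Drop columns whose share of missing values exceeds max_missing."""
    missing_share = frame.isna().mean()  # fraction of NaN per column
    return frame.loc[:, missing_share <= max_missing]


# Complete rows available before and after removing the sparse column.
rows_before = len(df.dropna())
trimmed = drop_sparse_columns(df)
rows_after = len(trimmed.dropna())
print(rows_before, rows_after)
```

In this toy case, a single sparse column is what limits the number of complete rows, so removing it recovers most of the countries at the cost of one variable.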
Datasets
The data is made up of a series of different measurements in order to cover a multitude of potential indicators for HIV. The observational units in this situation are the countries while the primary data points this report will focus on are hiv_percentage, child_hiv, and orphan_pop. These are explained in more detail in the table below. This data is intended to be representative of the populations of each country and the world as a whole. The method through which The World Bank has collected this data is not made clear, but this can be considered census data and its scope is the surveyed populations of the world. The data was collected in the year 2018.
Table 1: Variable descriptions for each column that was analyzed
Name | Variable Description | Types | Units of Measurement |
---|---|---|---|
region | the continent on which the country is located | String | Continent |
subregion | the region in which the country is located | String | Region |
country | the country where the data is recorded | String | Country |
hiv_percentage | the percentage of adults living with HIV | Numeric | % |
child_hiv | the percentage of children with HIV | Numeric | % |
education_exp | the percentage of the GDP spent on education | Numeric | % |
basic_water | the percentage of the population that has access to basic water resources | Numeric | % |
open_defec | the percentage of the population that practices open defecation | Numeric | % |
unemployment | percentage of the population that is not employed | Numeric | % |
health_exp | the percentage of the GDP spent on health | Numeric | % |
orphan_pop | the percentage of the child population that is orphaned | Numeric | % |
rural_pop | the percentage of the population that lives in rural areas | Numeric | % |
urban_pop | the percentage of the population that lives in urban areas | Numeric | % |
Methods
Upon beginning to work with the data, it became clear that it needed to be tidied considerably before it was fit for analysis. The data had to be pivoted and given clearer names, and many of the population counts needed to be converted to percentages. In order to represent the data by continent, a United Nations data set containing each country and its respective region and subregion was merged with the original data. Initial explorations began with scatter and bar plots, as well as a correlation matrix, in order to see if any values were obviously correlated. The primary focus was to compare hiv_percentage with variables such as child_hiv, education_exp, health_exp, orphan_pop, and rural_pop.
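The tidying steps described above (pivoting, renaming, converting counts to proportions, and merging region labels) might look roughly like the sketch below. It assumes a long-format input with one row per country and indicator; the indicator labels, the intermediate column names, and the population figures are hypothetical and do not come from the World Bank files.

```python
import pandas as pd

# Long-format toy input; indicator labels and values are hypothetical.
raw = pd.DataFrame({
    "country": ["Kenya"] * 3 + ["Algeria"] * 3,
    "indicator": ["Population, total", "Rural population",
                  "Prevalence of HIV (%)"] * 2,
    "value": [50_000_000, 36_500_000, 4.5, 42_000_000, 11_500_000, 0.1],
})

# Pivot so that each indicator becomes its own column.
wide = raw.pivot(index="country", columns="indicator", values="value").reset_index()
wide.columns.name = None

# Give the columns clearer names (assumed names, for illustration only).
wide = wide.rename(columns={
    "Population, total": "population",
    "Rural population": "rural_count",
    "Prevalence of HIV (%)": "hiv_pct_raw",
})

# Convert counts and percentages to proportions of the population.
wide["rural_pop"] = wide["rural_count"] / wide["population"]
wide["hiv_percentage"] = wide["hiv_pct_raw"] / 100

# Stand-in for the United Nations region lookup table.
regions = pd.DataFrame({
    "country": ["Kenya", "Algeria"],
    "region": ["Africa", "Africa"],
    "subregion": ["Sub-Saharan Africa", "Northern Africa"],
})

# Attach region and subregion to each country.
tidy = regions.merge(
    wide[["country", "hiv_percentage", "rural_pop"]], on="country", how="inner"
)
print(tidy)
```

An inner merge is used here so that countries missing from either table are dropped rather than filled with missing values; whether that matches the original analysis is an assumption.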
Linear regression was then performed to determine the more influential variables. This was repeated for just the African continent in order to determine whether the variables factored in differently for what seemed to be the most affected continent. This yielded interesting but expected results. Finally, principal component analysis showed which clusters of variables seemed to influence the data most significantly, as well as some outliers that seemed to be strongly influenced by both clusters. This was more interesting, as it suggested that other variables may have factored more heavily into the prevalence of HIV than the other methods indicated.
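For readers unfamiliar with how regression estimates and their standard errors are obtained, the following is a minimal sketch of ordinary least squares with NumPy. It is not the code used for this report: the helper ols_fit is hypothetical, and the data is synthetic, generated from assumed coefficients so that the fit can be checked against known values.

```python
import numpy as np


def ols_fit(X, y):
    """Ordinary least squares with an intercept.

    Returns (estimates, standard_errors, error_variance); the first
    estimate is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = design.shape[0] - design.shape[1]           # n minus number of terms
    sigma2 = resid @ resid / dof                      # estimated error variance
    cov = sigma2 * np.linalg.inv(design.T @ design)   # covariance of estimates
    return beta, np.sqrt(np.diag(cov)), sigma2


# Synthetic data: the true coefficients (0.01, 5.0, 0.8) are assumptions.
rng = np.random.default_rng(0)
n = 50
child_hiv = rng.uniform(0.0, 0.02, n)
orphan_pop = rng.uniform(0.0, 0.15, n)
hiv_percentage = (0.01 + 5.0 * child_hiv + 0.8 * orphan_pop
                  + rng.normal(0.0, 0.005, n))

beta, se, sigma2 = ols_fit(np.column_stack([child_hiv, orphan_pop]), hiv_percentage)
for name, b, s in zip(["intercept", "child_hiv", "orphan_pop"], beta, se):
    print(f"{name:12s} estimate={b:.4f} std_error={s:.4f}")
```

The same function could be applied once to all countries and once to the African subset, which mirrors the two-stage comparison described above.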
Table 2: Example rows of The World Bank's health data
region | subregion | country | education_exp | hiv_percentage | basic_water | child_hiv | open_defec | health_exp | unemployment | rural_pop | urban_pop | orphan_pop |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Africa | Northern Africa | Algeria | 0.0587 | 0.001 | 0.9404 | 0.0001 | 0.0042 | 0.0616 | 0.1042 | 0.2737 | 0.7263 | 0.0001 |
Africa | Sub-Saharan Africa | Burundi | 0.0508 | 0.011 | 0.6146 | 0.0022 | 0.0261 | 0.0824 | 0.0159 | 0.8697 | 0.1303 | 0.0169 |
Africa | Sub-Saharan Africa | Djibouti | 0.0363 | 0.009 | 0.7594 | 0.0026 | 0.1650 | 0.0226 | 0.2619 | 0.2223 | 0.7778 | 0.0275 |
Africa | Sub-Saharan Africa | Ethiopia | 0.0507 | 0.01 | 0.4664 | 0.0012 | 0.2277 | 0.0331 | 0.0232 | 0.7924 | 0.2076 | 0.0081 |
Africa | Sub-Saharan Africa | Kenya | 0.0531 | 0.045 | 0.6027 | 0.0048 | 0.0939 | 0.0431 | 0.0425 | 0.7297 | 0.2703 | 0.0391 |
Results
Initial Explorations
The initial explorations of the data were focused on finding a potential relationship or region to explore further. After plotting scatter plots for an assortment of variables, it quickly became apparent that the variables were for the most part on two separate scales. There was a much lower scale throughout much of the world, compared to the drastically higher scale seen in Africa.
Figure 1: Scatter plots of variables against hiv_percentage
The scatter plots in Figure 1 show that there is not much of a trend represented in the data except with child_hiv, orphan_pop, and potentially unemployment.
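A correlation matrix gives a numeric counterpart to the scatter plots. The sketch below computes one with NumPy from the five example rows shown in Table 2. With only five countries it is an illustration of the technique and not a reproduction of the report's results; the choice of these four columns is an assumption made for brevity.

```python
import numpy as np

# Values copied from the five example rows of Table 2
# (Algeria, Burundi, Djibouti, Ethiopia, Kenya).
columns = {
    "hiv_percentage": [0.001, 0.011, 0.009, 0.010, 0.045],
    "child_hiv": [0.0001, 0.0022, 0.0026, 0.0012, 0.0048],
    "orphan_pop": [0.0001, 0.0169, 0.0275, 0.0081, 0.0391],
    "education_exp": [0.0587, 0.0508, 0.0363, 0.0507, 0.0531],
}
names = list(columns)

# np.corrcoef treats each row of its input as one variable.
corr = np.corrcoef(np.array([columns[name] for name in names]))

# Print the correlation of every variable with hiv_percentage.
for name, value in zip(names, corr[0]):
    print(f"{name:15s} {value:+.3f}")
```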
Below, the density plot of hiv_percentage in Figure 2 continues to support this, as it quickly becomes clear that Africa's density curve does not follow the hiv_percentage density of the world. The plot on the right confirms this, as it is a density plot of hiv_percentage in Africa. There is also a clear distinction between the two subregions, as seen in the density lines overlaid on top of the bar plot. Northern Africa follows the same trend as the rest of the world, while Sub-Saharan Africa seems to be what causes such a difference between Africa and the rest of the world.
Figure 2: Density distributions of the world with the density curve of Africa on top (left). Africa with the density curve of its subregions on top (right).
Africa
Following this, the focus shifted towards Africa, as it seemed that this was where the most distinctive data was located. Africa is particularly interesting when discussing HIV because of the significant portion of its population that has been infected compared to the rest of the world. The 20 countries most affected by HIV are all located within Africa, as seen in Figure 3. It is therefore likely better to focus on this region, as it is distinct from the rest of the world. While most other nations rarely see more than 1-2% of their population infected, as seen in the density plots, in Africa the average proportion of the population that has HIV is 5%, much higher than the next closest continental average, the Americas, at 0.7%. This disparity between Africa and the rest of the world when it comes to HIV is immense.
Figure 3: The 20 countries with the highest proportion of their population with HIV
Figure 3 shows that more than a quarter of Eswatini's adult population is infected with HIV. Further investigation shows that Eswatini also has the highest level of HIV amongst children (2.5% of the population between the ages of 0-14), and 14% of its child population is orphaned. Its education and health expenditures are not large, but a number of nations spend a smaller proportion of their GDP on these areas without seeing a significant increase in the presence of HIV. This leads to the next step, which is to see which variables seem to have the most significant effect on hiv_percentage.
Table 3: Linear regression with hiv_percentage as the response for the African continent
Term | Estimate | Standard Error |
---|---|---|
intercept | 0 | 0.026188 |
education_exp | 0.256025 | 0.176998 |
basic_water | 0.033237 | 0.027872 |
child_hiv | 5.76602 | 1.00586 |
open_defec | 0.000143 | 0.000137 |
health_exp | -0.110886 | 0.109393 |
unemployment | 0.035903 | 0.054558 |
rural_pop | 0.022193 | 0.022307 |
orphan_pop | 0.802298 | 0.138601 |
error variance | 8.3e-05 | nan |
Based on Table 3, the biggest indicators of a high level of HIV in the population of Africa are child_hiv and orphan_pop. Unfortunately, rather than indicators, they are likely consequences of the high prevalence of HIV. A mother can pass HIV to her baby while it is in the womb, although this is not guaranteed. There are treatments available that can prevent this, but in highly rural African regions it may be difficult to access this level of healthcare. Additionally, these values have a very linear relationship in Africa, with a definite trend between higher adult populations with HIV and orphaned children.
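One way to read Table 3 is to divide each estimate by its standard error; as a rough rule of thumb, an absolute ratio well above 2 suggests an estimate that is distinguishable from zero. The sketch below does this arithmetic for the values in Table 3. It is only an informal reading aid, since the number of countries in the fit is not given and no formal test is attempted.

```python
# Estimates and standard errors copied from Table 3 (intercept omitted).
table3 = {
    "education_exp": (0.256025, 0.176998),
    "basic_water": (0.033237, 0.027872),
    "child_hiv": (5.76602, 1.00586),
    "open_defec": (0.000143, 0.000137),
    "health_exp": (-0.110886, 0.109393),
    "unemployment": (0.035903, 0.054558),
    "rural_pop": (0.022193, 0.022307),
    "orphan_pop": (0.802298, 0.138601),
}

# Ratio of each estimate to its standard error.
ratios = {name: est / se for name, (est, se) in table3.items()}

# Rank the terms by the absolute size of that ratio.
ranked = sorted(ratios, key=lambda name: abs(ratios[name]), reverse=True)

for name in ranked:
    print(f"{name:15s} {ratios[name]:+.2f}")
```

By this measure, child_hiv and orphan_pop stand well apart from the remaining variables, whose ratios all fall below 2 in absolute value.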
Principal Component Analysis
Following linear regression, principal component analysis was performed on the variables to see if there were any clusters of correlated variables. Figure 4 supports this to some degree, as it becomes immediately clear that the variables are clustered together.
Figure 4: The center panel is a scatter plot of principal component 1 (PC1) vs. principal component 2 (PC2); the red points represent the outliers (left to right): Malawi, Mozambique, Zambia, Namibia, Zimbabwe, Botswana, South Africa, Lesotho, and Eswatini. The right panel is a loading plot of the two principal components.
The first principal component seems to capture nations with high levels of hiv_percentage, child_hiv, and orphan_pop, with low urban populations and not much access to basic drinking water. The second principal component represents the variation in the data caused by basic_water, open_defec, and urban_pop, which seems to suggest that it represents nations with lower levels of HIV but a lack of critical infrastructure. While the majority of nations vary mostly along the second principal component, with lower levels of HIV, those with high levels of HIV are what make the first principal component so prominent in this data. The levels of HIV in Eswatini, Lesotho, and South Africa are so severe, even amongst children, that they and all of the other outliers can be explained differently from the rest of the world.
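To make the terms scores and loadings concrete, here is a minimal sketch of principal component analysis using the singular value decomposition of standardized data. It is not the report's code: the helper pca is hypothetical, and the data is synthetic, built from two assumed latent factors (one HIV-related, one infrastructure-related) so that two clusters of variables appear by construction.

```python
import numpy as np


def pca(data, n_components=2):
    """PCA on standardized columns via SVD.

    Returns (scores, loadings, explained_variance_ratio).
    """
    data = np.asarray(data, dtype=float)
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)  # standardize
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # country coordinates
    loadings = vt[:n_components].T                    # variable weights
    explained = s**2 / np.sum(s**2)
    return scores, loadings, explained[:n_components]


rng = np.random.default_rng(1)
n = 40  # hypothetical number of countries


def noisy(factor):
    """Return the latent factor plus a small amount of independent noise."""
    return factor + rng.normal(scale=0.2, size=n)


hiv_factor = rng.normal(size=n)     # assumed latent HIV burden
infra_factor = rng.normal(size=n)   # assumed latent infrastructure level

names = ["hiv_percentage", "child_hiv", "orphan_pop", "basic_water", "urban_pop"]
data = np.column_stack([
    noisy(hiv_factor), noisy(hiv_factor), noisy(hiv_factor),
    noisy(infra_factor), noisy(infra_factor),
])

scores, loadings, explained = pca(data)
for name, row in zip(names, loadings):
    print(f"{name:15s} PC1={row[0]:+.2f} PC2={row[1]:+.2f}")
```

Plotting the two columns of scores against each other gives a scatter plot of the kind shown in the center panel of Figure 4, and plotting the rows of loadings gives a loading plot like the right panel. The sign of each component is arbitrary, so only the relative grouping of variables is meaningful.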
Conclusion
The data seems to suggest that there are indicators of hiv_percentage amongst the variables. Linear regression suggested that education_exp may also have some sway over hiv_percentage, although its standard error in Table 3 is large relative to its estimate. Finally, principal component analysis suggests that there is a high correlation between hiv_percentage and the indicators child_hiv and orphan_pop.
It would be interesting to find a more complete dataset with all of these values and see if there are any clear trends worldwide. A separate analysis could also be done with Africa removed, to see whether there are any trends between hiv_percentage and the variables tested above. It is clear from these results that Africa is an outlier that can obscure trends in the other data.