1 ASKING about the subject of MPCA.

1.1 Introduction and objectives.

Multipurpose Cash Assistance (MPCA) has been a central method for delivering aid to populations affected by the ongoing Ukrainian conflict, particularly in the Donbas region, during the initial phases of humanitarian response. This assistance aims to meet diverse needs by providing flexible cash support to impacted households.

The United Nations Office for the Coordination of Humanitarian Affairs (OCHA) compiles and regularly updates a dataset from field reports by humanitarian actors. This dataset records details on MPCA interventions across Ukraine, capturing information on funding allocation, regional coverage, and demographic reach.

In this notebook, we’re diving into an exploratory look at this list for three main goals:

  1. Descriptive Overview: To put together a summary of what the dataset looks like - its structure, different types of information it holds, and details about MCA interventions.

  2. Data Completeness and Quality: Identifying any missing pieces of data and thinking about how that might affect our analysis.

  3. Geographical Analysis: Looking at differences in where MCA is being offered and where there might be gaps, while also considering how geographical factors play a role in deciding where help goes in Ukraine.

By focusing on these areas, the aim is to gather insights that can make MCA targeting better, making sure it gets to the places and people who need it most.

1.2 Methodology

For this case study, I will be following the steps proposed by the Google Data Analytics Profesional Certificate: ask, prepare, process, analyze, share and act.

The question I want to ask is: what insights can we get from data collected in the field? This data was collected as a way to track and coordinate the response during the first year of hostilities in the war of Ukraine. With the perspective gained one year after, i will try to present the data in a different way as it was originally conceived.

This analysis is the product of data wrangling, analysis and my own subject expertise. During the first two years of the emergency, I lead operations for a humanitarian organization in the East of Ukraine. I also have a record of working in cash based interventions in other emergencies.

1.3 Limitations:

The primary dataset under study is made up of secondary data voluntarily shared by humanitarian actors involved in this operation. Voluntary self-reporting can unintentionally result in inaccuracies.

Initially during emergencies, when population movements were highly fluctuating, the absence of precise demographic data makes it challenging to present figures relative to the total population. Consequently, we’ll only be able to display absolute numbers in certain instances.

This analysis cannot replace necessary on-site assessments and should not be solely relied upon for making decisions. Instead, it serves as a complementary source of insights.

2 PREPARING

2.1 Source

This dataset was retrieved on May 27th, 2023 and was originally posted by OCHA Ukraine (The United Nations Office for the Coordination of Humanitarian Affairs) on the platform Humanitarian Data Exchange (HDX) which is an open platform for sharing data across crises and organisations, and managed by OCHA’s Centre for Humanitarian Data.

2.2 License:

This dataset is licensed under Creative Commons Attribution International.

Under the CC BY license, you are free to share (copy and redistribute the material in any medium or format) and or adapt (remix, transform, and build upon the material) for any purpose, even commercially. The licensor cannot revoke these freedoms as long as you follow the license terms. The license terms are that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. Additionally, you may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

3 PROCESSING.

3.1 Setting up our working environment

We are going to need a number of libraries for data wrangling, including those that will format the document appropriately.

library(tidyverse)
library(dplyr)
library(janitor)
library(readr)
library(readxl)
library(kableExtra)
library(tidyr)
library(forcats)
library(ggplot2)
library(tinytex)
library(scales)
library(tidytext)

3.2 Acquiring the data.

The dataset, which is a crucial part of this analysis, is an Excel file that has been uploaded to the Humanitarian Data Exchange. It has been maintained throughout the year 2022 by the Multipurpose Cash Assistance (MPCA) technical working group. It has 42 variables and 46,866 observations (described below).

The Excel sheet we’re working with has seen quite a bit of action over time - various people have filled it out, some external connections no longer work, and so on. As a result, we’ve run into two main problems: 1) The dates are causing numerous errors due to issues with POXit, and 2) There’s “phantom data,” or entries that appear to be filled but the actual information is missing.

To streamline our process and save time, I’ll convert this complex Excel file into a simpler CSV format. This will give us the raw data to work with, allowing us to start the cleanup process from there.

# Loading the raw data in CSV format.
library(readr)
mpca <- read_csv("data/mpc_ukr_hdx_2022_full - Copy.csv")

3.3 Describing the data.

Each record in this dataset represents a distinct action, which captures the delivery of aid or services to a specific group of beneficiaries in a particular location during a defined time period. Organizations are required to report all actions, provided they meet certain prerequisites, such as securing the necessary permissions and obtaining adequate funding. As a result, these records grow and accumulate over time, reflecting the evolving scope and scale of activities.

It is important to understand that this data was collected for coordination purposes, not for research. As such, the data contains some inconsistencies that were “tolerated” because they were not a priority for this group of information managers.

Print out of str(mpca)
Print out of str(mpca)

3.4 Standardizing and cleaning.

Several issues become apparent that require attention to ensure accurate analysis:

  1. Variable Names: The names given to our data variables are excessively long, making them cumbersome to work with and less intuitive.

  2. Date Formatting: The date information is not presented in a user-friendly format, causing difficulties in interpreting and manipulating temporal data.

  3. Empty Variables: Some of the variables within the dataset contain no actual data, which could skew analysis results if included.

  4. Missing Values: There are instances where data is missing from certain entries, necessitating strategies to address these gaps.

  5. Characterized Factors: Certain variables that should be categorical or factor types are instead recorded as character strings, potentially complicating analyses that require specific data formats.

  6. Unvalidated Factor Data: Some of the categorical data within our dataset lack the necessary validation, which could lead to inconsistencies and errors in analysis.

3.4.0.1 Fixing the variable’s names.

The dataset included in the first row, not only the name of the column, but also the translation to Ukranian language and instructions on how to report each variable. I am going to rename them in a more machine readable way.

# Let's address first the issue of the variables' names, they are too long and we will replace them for these: 

new_names <- c(
"TYPE", "ORGANIZATION", "IMPLEMENTING_PARTNER", "DONOR", "ACTIVITY",
"FIN_SERVICES", "OBLAST", "ADMIN1_ID", "RAION", "ADMIN2_ID", "HROMADA",
"ADMIN3_ID", "SETTLEMENT", "ADMIN4_ID", "RURAL_URBAN", "FINAL_RURAL_URBAN", "ADMIN_3_4", "MODALITY",
"CONDITIONALITY", "TRANSFER_VALUE_PER_HOUSEHOLD", "FREQUENCY_OF_TRANSFER",
"BENEFICIARY_TYPE", "DELIVERY_MECHANISMS", "STATUS", "DURATION_ONE_MONTH",
"START_DATE_ESTIMATED", "END_DATE_ESTIMATED", "NUMBER_OF_BENEFICIARIES_INDV",
"NUMBER_OF_BENEFICIARIES_HH", "GIRLS_0_17_YEARS", "BOYS_0_17_YEARS",
"WOMEN_18_59_YEARS", "MEN_18_59_YEARS", "ELDERLY_60", "PEOPLE_WITH_DISABILITIES",
"ACRONYM_REPORTING", "ORGANIZATION_TYEP_REPORTING", "ACRONYM_IMPLEMENTER",
"ORGANISATION_TYPE_IMPLEMENTER", "TOTAL_AMOUNT_OF_CASH_DISBURSED", "CONTACT1", "CONTACT2"
)

# Renaming columns
colnames(mpca) <- new_names

3.4.0.2 Fixing dates.

The date information in our dataset is currently presented in a complex POSIXct format, which can be challenging to interpret directly. I am going to change, in this case to make it more human readable.

# Convert START_DATE_ESTIMATED to Date format
mpca$START_DATE_ESTIMATED <- as.Date(mpca$START_DATE_ESTIMATED, format = "%d/%m/%Y")

# Convert END_DATE_ESTIMATED to Date format
mpca$END_DATE_ESTIMATED <- as.Date(mpca$END_DATE_ESTIMATED, format = "%d/%m/%Y")

3.4.0.3 Removing empty variables and duplicated registries.

I have identified a few variables that either do not contribute meaningful information or are no longer necessary:

  1. TYPE: This variable does not provide us with any valuable insights.

  2. FIN_SERVICES: It appears that this field is entirely empty and thus offers no data for analysis.

  3. RURAL_URBAN: Although initially present, this variable has been superseded by a more comprehensive final version, RURAL_URBAN_FINAL.

  4. Admin Codes: During the initial data entry process, administrative codes were undoubtedly useful for organizing information. However, given that our current analysis does not require these granular identifiers, we will remove them from the dataset to simplify our focus on more pertinent variables.

  5. CONTACT1 and CONTACT2: These fields, which likely pertain to contact information, are not relevant to the specific analytical objectives of this study.

    Additionally there is an important number of duplicated observations, we will count how many of them we are removing from our analysis.

By removing these variables and observations, we have a lighter and leaner dataset.

library(dplyr)

# Dropping the specified columns
mpca <- mpca %>%
  select(-TYPE, -FIN_SERVICES, -RURAL_URBAN, -CONTACT1, -CONTACT2)
original_rows <- nrow(mpca)

# Removing duplicate rows
mpca <- mpca[!duplicated(mpca), ]

new_rows <- nrow(mpca)

duplicates_removed <- original_rows - new_rows
cat("Number of duplicated rows removed:", duplicates_removed, "\n")
Number of duplicated rows removed: 11431 

3.4.0.4 Fixing the type of variables.

As noted earlier, we have categorical variables currently formatted as character data. This prevents us from performing classifications or conducting basic descriptive analyses. Therefore, we will proceed to transform these variables, similar to how we handled the date variables.

Additionally, we need to examine the unique values within these categorical variables. This step will help us identify and address any inconsistencies before proceeding with the dataset analysis.

# Convert all character columns to factors
mpca <- mpca %>%
  mutate(across(where(is.character), as.factor))
# Specify the columns you want to check
selected_columns <- c("ACTIVITY", "FINAL_RURAL_URBAN", "ADMIN_3_4", "MODALITY", "CONDITIONALITY",
                      "FREQUENCY_OF_TRANSFER", "BENEFICIARY_TYPE", "DELIVERY_MECHANISMS",
                      "STATUS", "DURATION_ONE_MONTH", "ACRONYM_REPORTING", 
                      "ORGANIZATION_TYEP_REPORTING", "ACRONYM_IMPLEMENTER",
                      "ORGANISATION_TYPE_IMPLEMENTER")

# Filter to only the specified columns and check if they are factors
factor_levels <- lapply(mpca[selected_columns], function(col) {
  if (is.factor(col)) levels(col) else NULL
})

# Remove any NULL values for non-factor variables
factor_levels <- factor_levels[!sapply(factor_levels, is.null)]

# Display the unique values for each selected factor variable
print(factor_levels)
$ACTIVITY
[1] "Provision of timely MPC transfers"

$FINAL_RURAL_URBAN
[1] "rural/ сільський" "urban / міський" 

$ADMIN_3_4
[1] "Admin 4" "Admin3" 

$MODALITY
[1] "Cash / Готівкою"    "None"               "Vouchers / Ваучери"

$CONDITIONALITY
 [1] "Cash For Food"               "for covering basic needs"   
 [3] "MPC"                         "MPCA"                       
 [5] "Multipurpose"                "no"                         
 [7] "No conditions"               "None"                       
 [9] "Nost families"               "Optional"                   
[11] "Unconditiional"              "Unconditiona."              
[13] "unconditional"               "Unconditional"              
[15] "Unconditional Cash Transfer"

$FREQUENCY_OF_TRANSFER
[1] "Additional categories defined per context, no open text options / Додаткові категорії визначаються відповідно до контексту, немає доступних варіантів вибору"
[2] "Biweekly"                                                                                                                                                    
[3] "Monthly / раз на місяць"                                                                                                                                     
[4] "One-Off / одноразовo"                                                                                                                                        

$BENEFICIARY_TYPE
[1] "First/Once-Off / первинний/одноразовий"
[2] "Second/Once-Off/вторинний/одноразовий" 
[3] "Second/Repeated / вторинний/повторний" 

$DELIVERY_MECHANISMS
[1] "Bank remittances / Банківські перекази"
[2] "Bank transfers / Банківські перекази"  
[3] "Other /Інше"                           
[4] "Post office / Поштове відділення"      
[5] "Western Union"                         

$STATUS
[1] "Completed / завершена" "Ongoing / в процесі"   "Planned / планується" 

$DURATION_ONE_MONTH
 [1] "1 month"               "1 Month"               "1.5 months"           
 [4] "2 months"              "2 Months"              "2nd round of 3 months"
 [7] "3"                     "3 and 6 months (mix)"  "3 momth"              
[10] "3 month"               "3 months"              "3 Months"             
[13] "4 Months"             

$ACRONYM_REPORTING
 [1] "AAI"       "ACF"       "ACTED"     "ADRAUA"    "ASB"       "CARE"     
 [7] "CaritasCz" "CaritasUA" "CFDE"      "CORE"      "DAI"       "DCA"      
[13] "DRC"       "DRK"       "ERC"       "FAO"       "GC"        "HAI"      
[19] "HEKS-EPER" "HELP e.V." "HI"        "IFRC"      "IMC"       "IOM"      
[25] "IRC"       "MC"        "Medair"    "NRC"       "PAH"       "PIN"      
[31] "SCI"       "SI"        "SOS"       "Triangle"  "UDA"       "UNHCR"    
[37] "UNICEF"    "URCS"      "WFP"       "WHH"       "WVI"       "WW"       
[43] "ZOA"      

$ORGANIZATION_TYEP_REPORTING
[1] "International NGO" "National NGO"      "RCRC Movement"    
[4] "UN Agency"        

$ACRONYM_IMPLEMENTER
 [1] "ACF"                  "ACTED"                "ADRAUA"              
 [4] "AirLight"             "ARW"                  "ASB"                 
 [7] "Blahochestia"         "Caritas Brody"        "Caritas Donetsk"     
[10] "Caritas Drohobych"    "Caritas Khmelnytsky"  "Caritas Kramatorsk"  
[13] "Caritas Kyiv"         "Caritas Odesa"        "Caritas Poltava"     
[16] "Caritas Ternopil"     "Caritas Volnovakha"   "Caritas Zaporizhzhia"
[19] "Caritas Zhovkva"      "CF LDF"               "DAI"                 
[22] "DRC"                  "DUA"                  "ERC"                 
[25] "ERC/UFF"              "GC"                   "GZ"                  
[28] "HAI"                  "HEKS-EPER"            "HI"                  
[31] "HIA"                  "IMC"                  "IRC"                 
[34] "Medair"               "NRC"                  "OPR"                 
[37] "PAH"                  "PIN"                  "Plich-o-Plich"       
[40] "Poruch"               "Posmishka"            "R2P"                 
[43] "RC"                   "SI"                   "SOSCV"               
[46] "UDA"                  "UFF"                  "UNHCR"               
[49] "URCS"                 "WCU"                  "WVI"                 
[52] "ZOA"                 

$ORGANISATION_TYPE_IMPLEMENTER
[1] "International NGO"              "International NGO/National NGO"
[3] "National NGO"                   "RCRC Movement"                 
[5] "UN Agency"                     

As anticipated, some entries in the dataset contain misspellings:

  • Conditionality is listed as [1] “Cash For Food” for covering basic needs”, [4] “Multipurpose” “No condition”, and [7] “Nost families” “Optional” [10] “Unconditiional” “Unconditional” “Unconditional Cash Transfer” “vulnerability, IDP”.

  • Duration of the transfers is noted as [1] “1 Month”, “1.5 months”, “2 Months”, “3 month”, “3 months”, “3 Months”, and “4 Months”.

There seems to be a potential issue with the modality of transfer, where “Western Union” is incorrectly placed within the “Post Office” category. It could suggest a significant number of international transfers originating from Western Union.

The category “International NGO/National NGO” in “ORGANISATION_TYPE_IMPLEMENTER” is unclear and inconsistent for comparison purposes with “ORGANIZATION_TYEP_REPORTING”. It seems to have been used to describe internationally operating NGOs. I will verify the names associated with this category and make a decision on how to address it.

Although there are libraries like stringdist designed to handle such cases, given their rarity in our dataset, we can tackle them individually. We plan to utilize the forcats library for this purpose.

# Clean and standardize the CONDITIONALITY column in cleaned_mpca
mpca <- mpca %>%
  mutate(CONDITIONALITY = tolower(CONDITIONALITY), # Convert to lowercase
         CONDITIONALITY = case_when(
           grepl("unconditional|no|none|optional", CONDITIONALITY) ~ "unconditional",
           grepl("cash for food|basic needs", CONDITIONALITY) ~ "conditional",
           grepl("mpc|mpca|multipurpose|other", CONDITIONALITY) ~ "ambiguous", # Combine other & multipurpose
           TRUE ~ "ambiguous" # Default to ambiguous for undefined cases
         ))

# Verify the changes
unique(mpca$CONDITIONALITY)
[1] "ambiguous"     "unconditional" "conditional"  
# Clean and standardize the DURATION_ONE_MONTH column in cleaned_mpca
mpca <- mpca %>%
  mutate(DURATION_ONE_MONTH = tolower(DURATION_ONE_MONTH), # Convert to lowercase
         DURATION_ONE_MONTH = case_when(
           grepl("1 month", DURATION_ONE_MONTH) ~ "1 month",
           grepl("1.5", DURATION_ONE_MONTH) ~ "1.5 months",
           grepl("^2 ", DURATION_ONE_MONTH) ~ "2 months",
           grepl("^3 ", DURATION_ONE_MONTH) ~ "3 months",
           grepl("^4 ", DURATION_ONE_MONTH) ~ "4 months",
           grepl("^6 ", DURATION_ONE_MONTH) ~ "6 months",
           grepl("2nd round of 3 months", DURATION_ONE_MONTH) ~ "3 months",
           grepl("3 and 6 months", DURATION_ONE_MONTH) ~ "multiple durations",
           TRUE ~ "other" # Catch-all for undefined cases
         ))

# Verify the changes
unique(mpca$DURATION_ONE_MONTH)
[1] "1 month"    "3 months"   "other"      "1.5 months" "4 months"  
[6] "2 months"  

Here we have another problematic variable, and the reason why a detailed financial analysis is not possible with this dataset (more information below). The variables detailing the amount of transfers were in some cases entered as text, we will change it for a numeric variable and see how much data we can recover. There is also a big number of outliers, which I will investigate below.

# Clean and convert TRANSFER_VALUE_PER_HOUSEHOLD to numeric
mpca <- mpca %>%
  mutate(
    TRANSFER_VALUE_PER_HOUSEHOLD = gsub("\\$", "", TRANSFER_VALUE_PER_HOUSEHOLD), # Remove dollar signs
    TRANSFER_VALUE_PER_HOUSEHOLD = as.numeric(TRANSFER_VALUE_PER_HOUSEHOLD)      # Convert to numeric
  )

# Check for warnings or issues
sum(is.na(mpca$TRANSFER_VALUE_PER_HOUSEHOLD)) # Count how many NAs remain
[1] 563
# Calculate basic statistics
transfer_stats <- mpca %>%
  summarise(
    Mean = mean(TRANSFER_VALUE_PER_HOUSEHOLD, na.rm = TRUE),
    Median = median(TRANSFER_VALUE_PER_HOUSEHOLD, na.rm = TRUE),
    Min = min(TRANSFER_VALUE_PER_HOUSEHOLD, na.rm = TRUE),
    Max = max(TRANSFER_VALUE_PER_HOUSEHOLD, na.rm = TRUE),
    SD = sd(TRANSFER_VALUE_PER_HOUSEHOLD, na.rm = TRUE),
    Missing = sum(is.na(TRANSFER_VALUE_PER_HOUSEHOLD))
  )

# View the statistics
transfer_stats
# A tibble: 1 × 6
   Mean Median   Min   Max    SD Missing
  <dbl>  <dbl> <dbl> <dbl> <dbl>   <int>
1  94.8     68  30.6  2220  239.     563

3.5 Handling missing data.

It is noted that several columns have significant gaps—with 20% or more entries missing. This issue is quite apparent in the administrative details like ‘admin4_id___automatic’, ‘rural’, and ‘settlement’, as well as organization-related data such as ‘org_contact_name’ and ‘organisation_type’.

Let’s consider the implications for each variable:

  • ADMIN4_ID and SETTLEMENT: This is where the major gap is, we will analyze this in depth.

  • ACRONYM_IMPLEMENTER and ORGANISATION_TYPE_IMPLEMENTER: This is ok, when the action is implemented by the same reporting organization, this goes unfilled.

  • DELIVERY_MECHANISMS: This is relevant, I will try to extrapolate from the existing observations.

  • MEN_18_59_YEARS, WOMEN_18_59_YEARS, etc.: In all of the demographics, the missing values represents zero (there was no people of these segments in that household). But, if all demographics of an action are missing, it means that the data was not collected. This is a tough one.

  • DONOR: Probably funding coming from the organization itself (private funding). No donor wants to go unmentioned :)

In many cases, missing data shall be interpreted as zero value or not applicable. Best practice is not to leave empty observations.

# Calculate the number and percentage of missing values
missing_values <- data.frame(
  Column = names(mpca), # Explicitly assign column names
  Missing_Count = colSums(is.na(mpca)),
  Missing_Percentage = sapply(mpca, function(x) mean(is.na(x)) * 100),
  row.names = NULL # Remove row names to avoid duplication
)

# Display the table using kable
missing_values %>%
  arrange(desc(Missing_Percentage)) %>% # Sort by highest missing percentage
  kable(
    caption = "Missing Values by Column in the Dataset",
    col.names = c("Column", "Missing Count", "Missing Percentage (%)"),
    digits = 2,
    format.args = list(big.mark = ",") # Add thousand separators for counts
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3") # Highlight header row
Missing Values by Column in the Dataset
Column Missing Count Missing Percentage (%)
ADMIN4_ID 34,608 97.67
SETTLEMENT 34,338 96.90
ACRONYM_IMPLEMENTER 18,961 53.51
ORGANISATION_TYPE_IMPLEMENTER 18,961 53.51
NUMBER_OF_BENEFICIARIES_HH 12,916 36.45
DELIVERY_MECHANISMS 12,817 36.17
IMPLEMENTING_PARTNER 11,242 31.73
ADMIN_3_4 9,155 25.84
HROMADA 9,153 25.83
ADMIN3_ID 9,153 25.83
FINAL_RURAL_URBAN 9,143 25.80
PEOPLE_WITH_DISABILITIES 7,972 22.50
BOYS_0_17_YEARS 6,601 18.63
GIRLS_0_17_YEARS 6,599 18.62
MEN_18_59_YEARS 6,570 18.54
WOMEN_18_59_YEARS 6,297 17.77
BENEFICIARY_TYPE 4,061 11.46
DONOR 2,922 8.25
ELDERLY_60 1,544 4.36
START_DATE_ESTIMATED 1,271 3.59
END_DATE_ESTIMATED 1,121 3.16
FREQUENCY_OF_TRANSFER 693 1.96
TRANSFER_VALUE_PER_HOUSEHOLD 563 1.59
MODALITY 481 1.36
RAION 194 0.55
ADMIN2_ID 194 0.55
NUMBER_OF_BENEFICIARIES_INDV 143 0.40
OBLAST 12 0.03
ADMIN1_ID 12 0.03
TOTAL_AMOUNT_OF_CASH_DISBURSED 6 0.02
ORGANIZATION 0 0.00
ACTIVITY 0 0.00
CONDITIONALITY 0 0.00
STATUS 0 0.00
DURATION_ONE_MONTH 0 0.00
ACRONYM_REPORTING 0 0.00
ORGANIZATION_TYEP_REPORTING 0 0.00

Here we are going to see another challenge. The demographic variables are inconsistently entered.

# Specify the demographic columns
demographic_columns <- c("PEOPLE_WITH_DISABILITIES", "BOYS_0_17_YEARS", 
                         "GIRLS_0_17_YEARS", "MEN_18_59_YEARS", 
                         "WOMEN_18_59_YEARS", "ELDERLY_60")

# Convert all demographic columns to numeric, handling non-numeric values
mpca <- mpca %>%
  mutate(across(all_of(demographic_columns), ~ parse_number(as.character(.))))
# Step 2: Handle missing data
mpca <- mpca %>%
  mutate(
    # Flag rows where all demographic columns are missing
    all_missing = rowSums(is.na(select(., all_of(demographic_columns)))) == length(demographic_columns),
    # Replace missing values conditionally
    across(
      all_of(demographic_columns),
      ~ ifelse(all_missing, NA, ifelse(is.na(.), 0, .))
    )
  ) %>%
  mutate(
    # Replace completely missing rows with "Not Collected" as a separate indicator
    across(
      all_of(demographic_columns),
      ~ ifelse(is.na(.), "Not Collected", as.character(.))
    )
  ) %>%
  select(-all_missing) # Remove temporary helper column
library(dplyr)

# List of columns where missing data should be replaced with "Not Applicable"
columns_to_replace_na <- c("ACRONYM_IMPLEMENTER", "ORGANISATION_TYPE_IMPLEMENTER")

# Add "Not Applicable" as a level and replace NAs
mpca <- mpca %>%
  mutate(across(all_of(columns_to_replace_na), 
                ~ factor(.,
                         levels = c(levels(.), "Not Applicable")))) %>%
  mutate(across(all_of(columns_to_replace_na), 
                ~ replace_na(., "Not Applicable")))
library(naniar)

# Create a missing data plot
gg_miss_var(mpca)

Investigating empty values in the geography

# Calculate percentage of missing values for the specified columns grouped by ORGANIZATION_TYEP_REPORTING
missing_percentage_by_type <- mpca %>%
  group_by(ORGANIZATION_TYEP_REPORTING) %>%
  summarise(
    OBLAST_missing = mean(is.na(OBLAST)) * 100,
    RAION_missing = mean(is.na(RAION)) * 100,
    HROMADA_missing = mean(is.na(HROMADA)) * 100,
    SETTLEMENT_missing = mean(is.na(SETTLEMENT)) * 100
  ) %>%
  arrange(desc(OBLAST_missing)) # Optional: Sort by highest missing percentage

# Create and style the table
missing_percentage_by_type %>%
  kable(
    caption = "Percentage of Missing Values by Organization Type Reporting",
    col.names = c("Organization Type", "Oblast (%)", "Raion (%)", "Hromada (%)", "Settlement (%)"),
    digits = 2
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"), # Adds visual effects
    full_width = FALSE, # Keeps table width compact
    position = "center" # Centers the table
  ) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3") # Highlights the header row
Percentage of Missing Values by Organization Type Reporting
Organization Type Oblast (%) Raion (%) Hromada (%) Settlement (%)
RCRC Movement 0.53 7.11 84.21 100.00
UN Agency 0.15 0.67 99.03 99.03
International NGO 0.02 0.48 22.30 97.59
National NGO 0.00 0.00 20.71 64.46

This is one of the most significant findings. National NGOs excel in the granularity of reporting, effectively documenting activities at the most local level. While it is true that their efforts often focus on major cities (which are easier to report on), their deep local knowledge is clearly an asset that has been well utilized.

On the other hand, UN agencies tend to report primarily at the Raion level (an administrative division comparable to a county). This lack of detailed reporting makes it challenging to identify which Hromadas (municipalities) have already been covered. This has been a frequent source of frustration for NGOs. I vividly recall team leaders expressing their frustration after long journeys to remote areas, only to discover that populations there had already been reached by UN agencies.

While most NGOs relied on a face-to-face approach to register potential beneficiaries for cash transfers, UN agencies adopted a more practical, online registration method. However, this approach, combined with their limited reporting granularity, made it nearly impossible to identify coverage gaps effectively.

4 ANALIZING

4.1 Financial

A financial analysis would be the most straightforward method to examine this data set. However, since our current exploration centers on coverage, we will only touch upon several key points:

We must take into account the nature of self-reporting, where a field manager reports on their intended actions. There may be discrepancies between what is initially reported and the actual outcomes when the transfers are executed. To perform a more precise financial analysis, it would be more appropriate to consult more reliable sources, collected a posteriori such as Post Distribution Monitoring reports. These provide data that are derived directly from field observations and transfer records.

In this analysis I will be focusing on coverage aspects, aiming to shed light on the operational intentions of these agencies without delving deeply into the complexities of their actual financial transactions or the variability in outcomes.

To prove this, let’s explore the surface of the finantial data in this dataset:

# Summarize the total cash disbursed by ORGANIZATION_TYPE_REPORTING
cash_summary <- mpca %>%
  group_by(ORGANIZATION_TYEP_REPORTING) %>%
  summarise(
    Total_Cash_Disbursed = sum(TOTAL_AMOUNT_OF_CASH_DISBURSED, na.rm = TRUE) # Summing while ignoring NA values
  ) %>%
  arrange(desc(Total_Cash_Disbursed)) # Sort by total cash disbursed in descending order

# Create and style the table
cash_summary %>%
  kable(
    caption = "Total Cash Disbursed by Organization Type Reporting",
    col.names = c("Organization Type", "Total Cash Disbursed ($)"),
    digits = 2,
    format.args = list(big.mark = ",") # Add thousand separators
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3") # Highlight header row
Total Cash Disbursed by Organization Type Reporting
Organization Type Total Cash Disbursed ($)
UN Agency 1,016,982,879
International NGO 160,018,364
National NGO 23,121,326
RCRC Movement 9,538,335

What is happeing here? Did we disboursed 1.2 Billion of dollars in cash transfers during 2022? Not really, although the amount of transfers for one emergency response was historical in terms of volume (some figures below), there is something else happening here.

The high number of outliers (1,610 records) suggest that (mot likely) some reporting has been made in UHA and other in USD/EUR. This is not unheard of these kinf of responses, the primary objective in this dataset is field coordination, so validation of right locations and covered beneficiaries has a higher priority than cash disboursment.

These are some of the publicly reported figures for the major UN agencies:

The World Food Programme (WFP) distributed roughly $375.6 million in cash-based transfers, benefiting approximately 2.3 million individuals. Source: WFP Executive Board.

The United Nations High Commissioner for Refugees (UNHCR) provided $202 million in both multi-purpose and protection-specific cash assistance to almost 500,000 refugees. Source: UNHCR Reporting.

While specific figures for the International Organization for Migration (IOM)’s cash transfers in 2022 are not detailed in available sources, IOM has been actively engaged in delivering cash-based assistance to conflict-affected individuals across Ukraine.

The United Nations Children’s Fund (UNICEF) initiated the ‘Spilno’ cash assistance program in March 2022, in partnership with Ukraine’s Ministry of Social Policy. By August 2022, this initiative had distributed approximately $125 million, supporting over 350,000 children, including 35,000 with disabilities, across 120,000 households. Source: UNICEF.

# Calculate basic statistics
transfer_stats <- mpca %>%
  summarise(
    Mean = mean(TRANSFER_VALUE_PER_HOUSEHOLD, na.rm = TRUE),
    Median = median(TRANSFER_VALUE_PER_HOUSEHOLD, na.rm = TRUE),
    Min = min(TRANSFER_VALUE_PER_HOUSEHOLD, na.rm = TRUE),
    Max = max(TRANSFER_VALUE_PER_HOUSEHOLD, na.rm = TRUE),
    SD = sd(TRANSFER_VALUE_PER_HOUSEHOLD, na.rm = TRUE),
    Missing = sum(is.na(TRANSFER_VALUE_PER_HOUSEHOLD))
  )

# View the statistics
transfer_stats
# A tibble: 1 × 6
   Mean Median   Min   Max    SD Missing
  <dbl>  <dbl> <dbl> <dbl> <dbl>   <int>
1  94.8     68  30.6  2220  239.     563

We could probably identify these outliers and infer the most likely reporting intention. But again, it would be easier and more precise to use other sources for this kind of analysis.

4.2 Demographics

One of the principles to identify gender imbalances dynamics is the collection of Sex and Age Disegregated Data (SADD). And although methodologies can slightly differ, the main principle is that we differentiate population groups in a meaningful way to allow the targeting of specific vulnerabilities.

In this case, data was disaggregated in an unusual way, and for the category of people beyond than 60 years old, there was no distinction between men and women. For this reason, I am presenting senior citizens and PwD separately.

Additionally, since total population figures are hard to come by, due to the Internal Displacement and lack of recent census information, it has not been possible to relate these figures with the total population, and important limitation to take into account when reading these tables.

# Define columns for Senior Citizens and PwD
special_categories <- c("ELDERLY_60", "PEOPLE_WITH_DISABILITIES")

# Convert these columns to numeric, handling "Not Collected"
mpca[special_categories] <- lapply(mpca[special_categories], function(x) {
  as.numeric(gsub("Not Collected", NA, x))
})

# Summarize totals (sum) for these categories
special_totals <- colSums(mpca[special_categories], na.rm = TRUE)
special_data <- data.frame(
  Category = c("Senior Citizens (+60)", "PwD"),  # Custom labels
  Total = special_totals
)

# Assign specific colors
custom_colors <- c("Senior Citizens (+60)" = "#F8F2BB", "PwD" = "#FCD1CA")

# Create the enhanced bar chart
ggplot(special_data, aes(x = Category, y = Total, fill = Category)) +
  geom_bar(stat = "identity", color = "black", width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = scales::comma(Total)), vjust = -0.5, size = 5, fontface = "bold") +  # Add bold totals
  labs(
    title = "Sum of Senior Citizens (+60) and PwD",
    subtitle = "Representation of specific demographic groups",
    x = "Category",
    y = "Total Sum"
  ) +
  theme_light(base_size = 14) +  # Use a light theme for better aesthetics
  scale_fill_manual(values = custom_colors) +
  scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.1))) +  # Add space above bars
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    plot.caption = element_text(size = 10, face = "italic", hjust = 1),
    axis.text.x = element_text(size = 12, face = "bold"),
    axis.text.y = element_text(size = 12)
  )

# Define demographic columns
demographic_columns <- c("BOYS_0_17_YEARS", "GIRLS_0_17_YEARS", 
                         "WOMEN_18_59_YEARS", "MEN_18_59_YEARS")

# Convert demographic columns to numeric, handling "Not Collected"
mpca[demographic_columns] <- lapply(mpca[demographic_columns], function(x) {
  as.numeric(gsub("Not Collected", NA, x))
})

# Summarize totals for each category
demographic_totals <- colSums(mpca[demographic_columns], na.rm = TRUE)
demographic_data <- data.frame(
  Category = c("Boys (0-17)", "Girls (0-17)", "Women (18-59)", "Men (18-59)"),
  Total = demographic_totals
)

# Create positive and negative values for pyramid
demographic_data <- demographic_data %>%
  mutate(Direction = case_when(
    grepl("Girls|Women", Category) ~ "Female",
    TRUE ~ "Male"
  ),
  Value = ifelse(Direction == "Female", Total, -Total))

# Custom colors
custom_colors <- c("Female" = "#8600fc", "Male" = "#00C4AC")

# Create the pyramid plot
ggplot(demographic_data, aes(x = Category, y = Value, fill = Direction)) +
  geom_bar(stat = "identity", color = "black", width = 0.5) +
  geom_text(aes(label = scales::comma(abs(Value))), vjust = 0.5, hjust = ifelse(demographic_data$Value > 0, -0.1, 1.1), 
            size = 5, fontface = "bold") +  # Add labels within the bars
  coord_flip() +
  scale_y_continuous(labels = abs, expand = expansion(mult = c(0.1, 0.1))) +
  scale_fill_manual(values = custom_colors) +
  labs(
    title = "Demographic Pyramid: Men, Women, Boys, and Girls",
    subtitle = "Total representation of age and gender groups",
    x = "Demographic Categories",
    y = "Population (Sum)"
  ) +
  theme_light(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    plot.caption = element_text(size = 10, face = "italic", hjust = 1),
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12, face = "bold"),
    legend.position = "top",
    legend.title = element_blank()
  )

# Assign colors for gender
row_colors <- c("#8600fc", "#00C4AC")  # Purple for Female, Green for Male

# Create and style the table
demographic_data %>%
  kable("html", align = "c") %>%
  kable_styling("striped", full_width = F, font_size = 16) %>%
  row_spec(1, background = row_colors[1], color = "white") %>%  # Girls
  row_spec(2, background = row_colors[2]) %>%  # Boys
  row_spec(3, background = row_colors[1], color = "white") %>%  # Women
  row_spec(4, background = row_colors[2])      # Men
Category Total Direction Value
BOYS_0_17_YEARS Boys (0-17) 233888.7 Male -233888.7
GIRLS_0_17_YEARS Girls (0-17) 233901.8 Female 233901.8
WOMEN_18_59_YEARS Women (18-59) 464087.0 Female 464087.0
MEN_18_59_YEARS Men (18-59) 254813.4 Male -254813.4

Although we are still observing absolute figures, there is a clear reduction in the number of adult men. This is particularly striking, considering that most of the refugee population consists of women and children. This trend might suggest voluntary underreporting by families who fear recruitment. Since the transfer amount is calculated based on the number of family members, this could mean that larger families are receiving less aid than initially intended. This would be an interesting line for further research.

4.3 Geographical analysis

# Summarize total population by oblast
oblast_summary <- mpca %>%
  group_by(OBLAST) %>%  # Group by oblast
  summarise(Total_Population = sum(across(all_of(demographic_columns)), na.rm = TRUE)) %>%
  mutate(Percentage = (Total_Population / sum(Total_Population)) * 100) %>%  # Calculate percentage
  arrange(desc(Total_Population))  # Arrange by total population in descending order

# Create and style the table
oblast_summary %>%
  mutate(
    Total_Population = scales::comma(Total_Population),  # Format totals with commas
    Percentage = scales::percent(Percentage / 100, accuracy = 0.1)  # Format percentages
  ) %>%
  kable("html", col.names = c("Oblast", "Total Population", "Percentage"), align = "c") %>%
  kable_styling("striped", full_width = F, font_size = 16)
Oblast Total Population Percentage
Dnipropetrovska 137,043 11.5%
Lvivska 119,173 10.0%
Vinnytska 86,944 7.3%
Zakarpatska 83,831 7.1%
Zaporizka 75,750 6.4%
Kharkivska 71,099 6.0%
Khmelnytska 61,492 5.2%
Chernivetska 60,782 5.1%
Poltavska 51,586 4.3%
Donetska 48,742 4.1%
Odeska 46,637 3.9%
Ivano-Frankivska 40,694 3.4%
Mykolaivska 39,621 3.3%
Kirovohradska 38,489 3.2%
Sumska 33,613 2.8%
Ternopilska 30,485 2.6%
Kyivska 30,203 2.5%
Cherkaska 27,714 2.3%
Zhytomyrska 21,159 1.8%
Chernihivska 20,773 1.8%
Kyiv 18,385 1.5%
Khersonska 16,377 1.4%
Volynska 9,503 0.8%
Luhanska 8,674 0.7%
Rivnenska 7,374 0.6%
NA 511 0.0%
Sevastopilska 24 0.0%
Avtonomna Respublika Krym 13 0.0%
# Define the desired oblast order (right-to-left)
oblast_order <- c(
  "Luhanska", "Donetska", "Kharkivska", "Zaporizka", "Dnipropetrovska", 
  "Khersonska", "Mykolaivska", "Odeska", "Poltavska", "Kirovohradska", 
  "Cherkaska", "Vinnytska", "Sumska", "Chernihivska", "Kyivska", 
  "Zhytomyrska", "Rivnenska", "Volynska", "Khmelnytska", "Ternopilska", 
  "Chernivetska", "Ivano-Frankivska", "Lvivska", "Zakarpatska"
)

# Summarize total population by oblast
oblast_summary <- mpca %>%
  group_by(OBLAST) %>%
  summarise(Total_Population = sum(across(all_of(demographic_columns)), na.rm = TRUE)) %>%
  filter(OBLAST %in% oblast_order) %>%  # Keep only oblasts in the order list
  mutate(OBLAST = factor(OBLAST, levels = oblast_order))  # Set factor levels to match the order

# Create the bar chart
ggplot(oblast_summary, aes(x = OBLAST, y = Total_Population, fill = OBLAST)) +
  geom_bar(stat = "identity", color = "black", show.legend = FALSE) +
  labs(
    title = "Population Distribution by Oblast",
    x = "Oblast (Right to Left)",
    y = "Total Population"
  ) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_discrete(limits = rev(oblast_order)) +  # Reverse order for right-to-left display
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Now let’s analyze the distribution of individual beneficiaries by Oblast, which is the highest administrative division under the State. The oblasts are arranged based on their geographical position in the country, from West to East, left to right, as they approach the frontline. The number of beneficiaries in oblasts farther from the frontline is comparable to those closer to it. Could this be because people relocated to the western border? While this might have been true at the beginning of the war, this data encompasses all of 2022. Another plausible (and somewhat problematic) explanation is that many humanitarian organizations established themselves in the West (notably in the city of Lviv), reaching more people there than the relatively few organizations that deployed to the East.

# Summarize total population by oblast and organization type
oblast_summary <- mpca %>%
  group_by(OBLAST, ORGANIZATION_TYEP_REPORTING) %>%  # Include organization type in the grouping
  summarise(Total_Population = sum(across(all_of(demographic_columns)), na.rm = TRUE)) %>%
  filter(OBLAST %in% oblast_order) %>%  # Keep only oblasts in the order list
  mutate(OBLAST = factor(OBLAST, levels = oblast_order))  # Set factor levels to match the order

# Create the stacked bar chart
ggplot(oblast_summary, aes(x = OBLAST, y = Total_Population, fill = ORGANIZATION_TYEP_REPORTING)) +
  geom_bar(stat = "identity", color = "black") +
  labs(
    title = "Population Distribution by Oblast and Organization Type",
    x = "Oblast (East to West)",
    y = "Total Population",
    fill = "Organization Type"
  ) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_discrete(limits = rev(oblast_order)) +  # Reverse order for right-to-left display
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

By type of organization, we can observe that a significant portion of those reached in the west were assisted by UN agencies. This could be attributed to the stricter security constraints faced by their staff compared to NGOs.

# Load additional libraries for arrow annotation
library(grid)

# Create the stacked bar chart
ggplot(oblast_summary, aes(x = OBLAST, y = Total_Population, fill = ORGANIZATION_TYEP_REPORTING)) +
  geom_bar(stat = "identity", color = "black") +
  labs(
    title = "Population Distribution by Oblast and Organization Type",
    subtitle = "Oblast shown in order of approximation to the frontline ",
    x = "Oblast (from West to East)",
    y = "Total Population",
    fill = "Organization Type"
  ) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_discrete(limits = rev(oblast_order)) +  # Reverse order for right-to-left display
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.margin = margin(t = 40, r = 20, b = 20, l = 20)  # Adjust margin for extra space
  ) +
  # Add an arrow annotation
  annotation_custom(
    grob = grid::linesGrob(arrow = arrow(type = "closed", length = unit(0.25, "inches")), 
                           gp = gpar(col = "black", lwd = 2)),
    ymin = max(oblast_summary$Total_Population) * 1.05, ymax = max(oblast_summary$Total_Population) * 1.05,
    xmin = 1, xmax = 24  # Control arrow position
  ) +
  # Add text for "Frontline"
  annotate("text", x = 23, y = max(oblast_summary$Total_Population) * 1.1, label = "Frontline", 
           size = 5, hjust = 0, fontface = "bold")

# Define the Oblasts of interest
oblsts_of_interest <- c("Donetska", "Lvivska", "Zakarpatska", "Kharkivska") # Replace "Dnipropetrovska" with "Donetska"

# Filter and analyze Raion coverage by Oblast
raion_coverage_by_oblast <- mpca %>%
  filter(OBLAST %in% oblsts_of_interest, !is.na(RAION)) %>% # Filter for selected Oblasts and valid Raions
  group_by(OBLAST) %>%                                     # Group by Oblast
  summarise(raions_covered = n_distinct(RAION))            # Count unique Raions per Oblast

# Print the results
print(raion_coverage_by_oblast)
# A tibble: 4 × 2
  OBLAST      raions_covered
  <fct>                <int>
1 Donetska                 8
2 Kharkivska               7
3 Lvivska                  7
4 Zakarpatska              7
# Visualization with proper Oblast order and sorting within each chart
ggplot(data = mpca %>% 
         filter(OBLAST %in% c("Lvivska", "Kharkivska", "Zakarpatska", "Odeska"), !is.na(RAION)) %>%
         mutate(
           OBLAST = factor(OBLAST, levels = c("Lvivska", "Kharkivska", "Zakarpatska", "Odeska")), # Correct Oblast order
           RAION = reorder(RAION, -NUMBER_OF_BENEFICIARIES_INDV) # Sort Raions by number of beneficiaries
         ),
       aes(x = RAION, y = NUMBER_OF_BENEFICIARIES_INDV, fill = OBLAST)) +
  geom_bar(stat = "identity", show.legend = FALSE) +        # Bar chart without labels
  facet_wrap(~ OBLAST, scales = "free", ncol = 2) +         # Separate facets by Oblast
  coord_flip() +                                            # Flip axes for better readability
  scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.1))) +
  labs(
    title = "Number of Beneficiaries Reached per Raion",
    subtitle = "Comparing coverage in Lvivska, Kharkivska, Zakarpatska, and Odeska Oblasts",
    x = "Raion",
    y = "Number of Beneficiaries"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    strip.text = element_text(face = "bold", size = 14),
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 10),
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, face = "italic")
  ) +
  scale_fill_brewer(palette = "Set2")

On the left, we see two oblasts close to the frontline, while on the right, there are two oblasts farther away. Both show a similar distribution, with more beneficiaries reached in areas less affected by the hostilities.

# Rural/Urban Labels and Formatting
rural_urban_summary <- mpca %>%
  filter(!is.na(FINAL_RURAL_URBAN)) %>%
  mutate(
    FINAL_RURAL_URBAN = case_when(
      FINAL_RURAL_URBAN == "rural/ сільський" ~ "Rural",
      FINAL_RURAL_URBAN == "urban / міський" ~ "Urban",
      TRUE ~ FINAL_RURAL_URBAN
    )
  ) %>%
  group_by(FINAL_RURAL_URBAN) %>%
  summarise(
    `# of Individuals` = sum(NUMBER_OF_BENEFICIARIES_INDV, na.rm = TRUE),
    `# of Actions` = n()
  ) %>%
  mutate(
    `# of Individuals` = scales::comma(`# of Individuals`),
    `# of Actions` = scales::comma(`# of Actions`)
  )

# Display the table with kable
rural_urban_summary %>%
  kbl(caption = "Summary of Beneficiaries by Rural/Urban Classification") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), 
                full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE) %>%  # Highlight the Rural/Urban column
  add_header_above(c(" " = 1, "Beneficiaries Data" = 2))  # Add a header above the numeric columns
Summary of Beneficiaries by Rural/Urban Classification
Beneficiaries Data
FINAL_RURAL_URBAN # of Individuals # of Actions
Rural 298,551 18,752
Urban 222,316 7,540

A significant number of actions in rural areas reflects a strong commitment to reaching the most vulnerable populations. These efforts are logistically more challenging and typically serve fewer people per action, making them more resource- and energy-intensive.

# Calculate percentages for urban and rural beneficiaries and actions
organization_rural_urban_summary <- mpca %>%
  filter(!is.na(FINAL_RURAL_URBAN), !is.na(ORGANIZATION_TYEP_REPORTING)) %>%  # Exclude missing data
  mutate(
    FINAL_RURAL_URBAN = case_when(
      FINAL_RURAL_URBAN == "rural/ сільський" ~ "Rural",
      FINAL_RURAL_URBAN == "urban / міський" ~ "Urban",
      TRUE ~ FINAL_RURAL_URBAN
    )
  ) %>%
  group_by(ORGANIZATION_TYEP_REPORTING, FINAL_RURAL_URBAN) %>%
  summarise(
    Total_Beneficiaries = sum(NUMBER_OF_BENEFICIARIES_INDV, na.rm = TRUE),
    Total_Actions = n()
  ) %>%
  group_by(ORGANIZATION_TYEP_REPORTING) %>%
  mutate(
    `% of Beneficiaries` = round(100 * Total_Beneficiaries / sum(Total_Beneficiaries), 2),
    `% of Actions` = round(100 * Total_Actions / sum(Total_Actions), 2)
  ) %>%
  ungroup()

# Prepare the table for display (only percentages)
organization_rural_urban_summary %>%
  select(
    `Organization Type` = ORGANIZATION_TYEP_REPORTING,
    `Rural/Urban` = FINAL_RURAL_URBAN,
    `% of Beneficiaries`,
    `% of Actions`
  ) %>%
  kbl(caption = "Percentage of Urban and Rural Beneficiaries and Actions per Organization Type") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), 
                full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE) %>%
  add_header_above(c(" " = 2, "Percentages" = 2))
Percentage of Urban and Rural Beneficiaries and Actions per Organization Type
Percentages
Organization Type Rural/Urban % of Beneficiaries % of Actions
International NGO Rural 59.79 71.46
International NGO Urban 40.21 28.54
National NGO Rural 37.04 64.14
National NGO Urban 62.96 35.86
RCRC Movement Rural 91.71 95.00
RCRC Movement Urban 8.29 5.00
UN Agency Rural 55.09 38.46
UN Agency Urban 44.91 61.54
# Remove NA and summarize for pie chart
rural_urban_total <- mpca %>%
  filter(!is.na(FINAL_RURAL_URBAN)) %>%
  group_by(FINAL_RURAL_URBAN) %>%
  summarise(Total_Beneficiaries = sum(NUMBER_OF_BENEFICIARIES_INDV, na.rm = TRUE))

# Pie chart with numbers displayed
ggplot(rural_urban_total, aes(x = "", y = Total_Beneficiaries, fill = FINAL_RURAL_URBAN)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar(theta = "y") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Overall Rural vs Urban Distribution",
    x = NULL,
    y = NULL,
    fill = "Rural/Urban"
  ) +
  geom_text(aes(label = scales::comma(Total_Beneficiaries)), 
            position = position_stack(vjust = 0.5), 
            size = 5) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    legend.position = "top"
  )

The depth of this analysis is intentionally made at the descriptive level. To use inferential statistics or predictive modeling (while an interesting rehearsal) would have been of very limited value, given the nature of this data collection, which was not conceived for research. Once again, let’s acknowledge the great effort put on the collection and processing of the data made by the MPCA technical working group.

5 SHARE

5.1 Main findings.

  • The lack of geographical granularity on the reporting is lead by the UN agencies, mostly reporting at the oblast level. Following by International NGOs and on the winning post are the national NGOs. With a big caveas, national NGOs reported more activity at the urban level, where is easier to pin point the names and codes of the settlements.

  • A significant clustering of activities in regions distant from conflict zones indicates that on several occasions, ease of access has taken precedence over need-based criteria.

  • The higher number of actions occurring in rural areas demonstrates a strong commitment to reaching hard-to-access regions, led by both international and national NGOs.

5.2 Dissemination strategy.

To ensure the widest reach and maximum impact of this report, we will distribute it across multiple platforms:

  1. Kaggle: The complete report will be posted here to engage with the data science community and provide access to the raw data and analysis.

  2. My Own Website (jbaena.net): This platform will host the full report for easy access and sharing among my professional network and interested readers.

  3. LinkedIn Article: A summarized version, focusing on key findings and reflections, will be shared on LinkedIn to make the insights more accessible to the broader humanitarian community and stimulate discussion.

  4. YouTube Video Report (conditional): Depending on time availability, a video report will be produced to visually present the findings in an engaging manner, suitable for sharing on YouTube.

6 ACTING

The main takeaway from this analysis is acknowledging that field-collected data has diverse applications beyond just facilitating coordination. It’s essential to consider these data’s limitations and the methods used for its collection and documentation when conducting deeper analyses.

While this work in itself holds value and has inspired additional research directions, such as employing Bayesian models to predict risk perception—a project I am currently undertaking—it also presents opportunities for enhancement, specifically concerning the dataset used here.

I praise the effort dedicated to updating this dataset; however, I will still suggest potential improvements that could facilitate more in-depth analysis in future tracking tools.

The findings from this study, which rely on open data, will be incorporated into my professional portfolio.