Case Study in Python: World Bank Development Indicators

Last updated on January 5, 2026

1 High-dimensional data

World Bank Development Indicators

import pandas as pd

# Display floats with 3 significant digits; print numerically zero values as "0"
pd.options.display.float_format = lambda x: ("0" if abs(x) < 1e-12 else f"{x:.3g}")

X = pd.read_csv("data/wdi_eu_2022.csv.gz", index_col=0)

# Replace indicator ids in the column labels with their descriptions
wdi_info = pd.read_csv("data/wdi_description.csv")
wdi_info = dict(zip(wdi_info["id"], wdi_info["value"]))
X.columns = [wdi_info[code] for code in X.columns]

# Replace country ids in the row labels with country names
wdi_countries = pd.read_csv("data/wdi_countries.csv")
wdi_countries = dict(zip(wdi_countries["id"], wdi_countries["value"]))
X.index = [wdi_countries[code] for code in X.index]

X
Fertilizer consumption (kilograms per hectare of arable land) Agricultural land (sq. km) Agricultural land (% of land area) Arable land (hectares) Arable land (hectares per person) Arable land (% of land area) Land under cereal production (hectares) Permanent cropland (% of land area) Forest area (sq. km) Forest area (% of land area) ... High-technology exports (current US$) High-technology exports (% of manufactured exports) Transport services (% of commercial service exports) Travel services (% of commercial service exports) Voice and Accountability: Estimate Voice and Accountability: Number of Sources Voice and Accountability: Percentile Rank Voice and Accountability: Percentile Rank, Lower Bound of 90% Confidence Interval Voice and Accountability: Percentile Rank, Upper Bound of 90% Confidence Interval Voice and Accountability: Standard Error
Austria 97.4 2.6e+04 31.5 1.32e+06 0.146 16 7.55e+05 0.81 3.9e+04 47.2 ... 2.35e+10 14.9 26.8 24 1.41 9 94.2 88.9 98.1 0.11
Belgium 198 1.36e+04 44.6 8.64e+05 0.0739 28.3 3.23e+05 0.787 6.89e+03 22.6 ... 6.86e+10 22.8 22.3 6.27 1.3 9 92.8 86.5 96.1 0.11
Bulgaria 128 5.02e+04 46.3 3.47e+06 0.536 31.9 1.9e+06 1.28 3.92e+04 36.1 ... 2.72e+09 10.4 20.8 25.4 0.288 11 57.5 51.7 60.9 0.103
Cyprus 128 1.23e+03 13.3 9.48e+04 0.0712 10.3 2.49e+04 2.87 1.72e+03 18.7 ... 9.4e+07 17.7 18.2 10.5 0.85 9 73.4 67.6 80.7 0.11
Czechia 152 3.53e+04 45.7 2.48e+06 0.232 32.1 1.39e+06 0.627 2.68e+04 34.7 ... 4.66e+10 21.8 25.5 16 1.04 11 81.2 74.4 88.4 0.103
Germany 117 1.66e+05 47.5 1.17e+07 0.14 33.4 6.1e+06 0.581 1.14e+05 32.7 ... 2.46e+11 17.5 26 7.25 1.41 10 94.7 88.9 98.1 0.109
Denmark 121 2.62e+04 65.6 2.36e+06 0.4 59 1.31e+06 0.775 6.3e+03 15.8 ... 1.37e+10 16 68.7 6.21 1.59 9 98.1 93.7 99.5 0.11
Spain 111 2.67e+05 53.4 1.17e+07 0.245 23.4 5.83e+06 10.2 1.86e+05 37.2 ... 3.33e+10 12.6 12.6 44 1.02 10 80.2 71.5 87.4 0.109
Estonia 85.8 9.86e+03 23.1 7.07e+05 0.524 16.5 3.62e+05 0.117 2.44e+04 57.1 ... 2.33e+09 18 25.5 10.9 1.2 10 87.9 80.7 93.2 0.103
Finland 69.5 2.27e+04 7.46 2.24e+06 0.403 7.37 9.52e+05 0.0165 2.24e+05 73.7 ... 5.31e+09 9.62 13.3 9.48 1.61 9 98.6 94.7 100 0.11
France 131 2.84e+05 52.7 1.7e+07 0.249 31.5 9.01e+06 1.91 1.74e+05 32.3 ... 1.06e+11 23 25.9 16.4 1.11 10 86.5 75.8 92.3 0.109
Greece 143 5.37e+04 41.7 1.86e+06 0.178 14.4 7.73e+05 8.03 3.9e+04 30.3 ... 2.86e+09 14.5 49.6 36.6 0.957 9 77.8 71 87.4 0.11
Croatia 189 1.45e+04 25.9 8.53e+05 0.221 15.2 5.19e+05 1.41 1.94e+04 34.7 ... 1.74e+09 12.1 7.53 66 0.616 11 66.7 60.4 71 0.103
Hungary 108 5.08e+04 55.6 4.16e+06 0.433 45.6 2.25e+06 1.61 2.05e+04 22.5 ... 2.29e+10 18.4 29.4 18.3 0.425 11 59.9 55.6 64.7 0.103
Ireland 1.06e+03 4.35e+04 63.1 4.46e+05 0.0856 6.47 2.86e+05 0.0145 7.9e+03 11.5 ... 9.22e+10 46.8 2.07 1.81 1.45 9 96.1 91.8 99 0.11
Italy 114 1.3e+05 44 7.08e+06 0.12 24 3.01e+06 8.07 9.67e+04 32.7 ... 4.52e+10 9.16 11.6 36.1 1.07 10 83.1 75.8 88.9 0.109
Lithuania 96.1 2.91e+04 46.5 2.29e+06 0.81 36.6 1.34e+06 0.573 2.2e+04 35.2 ... 3.62e+09 13 54.9 6.45 1.07 10 81.6 75.8 88.9 0.103
Luxembourg 141 1.33e+03 51.5 6.2e+04 0.0949 24.1 2.77e+04 0.592 887 34.5 ... 8.2e+08 5.99 12.8 4.12 1.54 9 97.1 93.2 99.5 0.11
Latvia 102 1.97e+04 31.7 1.36e+06 0.722 21.8 7.75e+05 0.161 3.42e+04 54.9 ... 2.06e+09 17.6 32.3 14.2 0.937 10 76.3 71 85.5 0.103
Malta 101 87.5 27.3 7.8e+03 0.0147 24.4 0 2.97 4.6 1.44 ... 1.02e+09 39 19.3 8.3 1.08 8 83.6 75.8 89.9 0.11
Netherlands 240 1.8e+04 53.6 1e+06 0.0567 29.8 1.85e+05 1.13 3.71e+03 11 ... 9.84e+10 22 19.1 6.19 1.55 9 97.6 93.2 99.5 0.11
Poland 161 1.42e+05 46.3 1.12e+07 0.303 36.5 7.2e+06 1.24 9.51e+04 31.1 ... 3.31e+10 11.6 29.6 15.2 0.604 11 65.7 59.9 71 0.103
Portugal 129 3.92e+04 42.8 9.33e+05 0.0894 10.2 1.98e+05 9.46 3.31e+04 36.2 ... 4.28e+09 6.99 20.7 47.1 1.26 10 89.4 83.1 95.2 0.109
Romania 90.3 1.27e+05 55.1 8.21e+06 0.431 35.7 5.19e+06 1.71 6.93e+04 30.1 ... 8.53e+09 11.9 27.6 12.1 0.566 11 63.3 58.9 70 0.103
Slovak Republic 119 1.85e+04 38.5 1.32e+06 0.244 27.5 7.11e+05 0.354 1.93e+04 40.1 ... 7.99e+09 8.49 39 10.3 0.89 11 75.8 69.1 81.2 0.103
Slovenia 216 6.11e+03 30.3 1.79e+05 0.0848 8.89 1.04e+05 2.68 1.23e+04 61.3 ... 3.92e+09 8.57 30.9 26.6 0.976 11 78.3 71.5 87.4 0.103
Sweden 103 3e+04 7.35 2.53e+06 0.241 6.21 9.53e+05 0.00982 2.8e+05 68.7 ... 2.35e+10 17.3 12.7 9.34 1.53 10 96.6 93.2 99.5 0.109

27 rows × 698 columns

2 Principal Component Analysis

First and second moments

pd.DataFrame([X.mean()], index=["Mean"], columns=X.columns)
Fertilizer consumption (kilograms per hectare of arable land) Agricultural land (sq. km) Agricultural land (% of land area) Arable land (hectares) Arable land (hectares per person) Arable land (% of land area) Land under cereal production (hectares) Permanent cropland (% of land area) Forest area (sq. km) Forest area (% of land area) ... High-technology exports (current US$) High-technology exports (% of manufactured exports) Transport services (% of commercial service exports) Travel services (% of commercial service exports) Voice and Accountability: Estimate Voice and Accountability: Number of Sources Voice and Accountability: Percentile Rank Voice and Accountability: Percentile Rank, Lower Bound of 90% Confidence Interval Voice and Accountability: Percentile Rank, Upper Bound of 90% Confidence Interval Voice and Accountability: Standard Error
Mean 165 6.02e+04 40.5 3.61e+06 0.265 24.3 1.91e+06 2.22 5.91e+04 35 ... 3.33e+10 16.6 25.4 18.3 1.09 9.89 82.7 76.8 87.9 0.107

1 rows × 698 columns

X.cov(ddof=0)
Fertilizer consumption (kilograms per hectare of arable land) Agricultural land (sq. km) Agricultural land (% of land area) Arable land (hectares) Arable land (hectares per person) Arable land (% of land area) Land under cereal production (hectares) Permanent cropland (% of land area) Forest area (sq. km) Forest area (% of land area) ... High-technology exports (current US$) High-technology exports (% of manufactured exports) Transport services (% of commercial service exports) Travel services (% of commercial service exports) Voice and Accountability: Estimate Voice and Accountability: Number of Sources Voice and Accountability: Percentile Rank Voice and Accountability: Percentile Rank, Lower Bound of 90% Confidence Interval Voice and Accountability: Percentile Rank, Upper Bound of 90% Confidence Interval Voice and Accountability: Standard Error
Fertilizer consumption (kilograms per hectare of arable land) 3.22e+04 -9.94e+05 902 -1.35e+08 -9.95 -616 -6.96e+07 -75 -2.82e+06 -992 ... 2.39e+12 1.04e+03 -850 -464 12.3 -27.4 458 508 378 0.0957
Agricultural land (sq. km) -9.94e+05 5.65e+09 4.48e+05 3.18e+11 8.83 2.31e+05 1.71e+11 8.52e+04 3.02e+09 -6.91e+04 ... 1.83e+15 -3.39e+04 -9.62e+04 2.3e+05 -4.05e+03 1.55e+04 -1.1e+05 -1.76e+05 -8.59e+04 28.1
Agricultural land (% of land area) 902 4.48e+05 233 2.37e+07 0.0476 127 1.37e+07 5.85 -3.56e+05 -154 ... 2.7e+11 20.7 59.5 -19.7 -0.412 1.41 -10.5 -12.1 -13.9 -0.000436
Arable land (hectares) -1.35e+08 3.18e+11 2.37e+07 1.94e+13 7.15e+04 2.01e+07 1.06e+13 2.69e+06 1.8e+11 -1.89e+06 ... 1.17e+17 -3.79e+06 -6.47e+05 5.71e+06 -3.05e+05 1.2e+06 -9.01e+06 -1.22e+07 -8.28e+06 213
Arable land (hectares per person) -9.95 8.83 0.0476 7.15e+04 0.041 0.901 5.87e+04 -0.178 1.17e+03 1.18 ... -3.09e+09 -0.431 1.3 -0.38 -0.0235 0.0645 -0.853 -0.811 -0.78 -0.000389
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Voice and Accountability: Number of Sources -27.4 1.55e+04 1.41 1.2e+06 0.0645 2.29 7.53e+05 -0.125 5.55e+03 4.72 ... -2.97e+09 -3.53 0.553 4.56 -0.213 0.765 -7.27 -7.43 -6.71 -0.00238
Voice and Accountability: Percentile Rank 458 -1.1e+05 -10.5 -9.01e+06 -0.853 -34.9 -6.89e+06 -4.36 1.75e+05 18.2 ... 2.25e+11 26.5 -13.2 -70 4.41 -7.27 150 154 136 0.0291
Voice and Accountability: Percentile Rank, Lower Bound of 90% Confidence Interval 508 -1.76e+05 -12.1 -1.22e+07 -0.811 -34.9 -8.51e+06 -6.46 1.68e+05 20.9 ... 2.16e+11 25.6 -13.5 -78.9 4.55 -7.43 154 160 138 0.0292
Voice and Accountability: Percentile Rank, Upper Bound of 90% Confidence Interval 378 -8.59e+04 -13.9 -8.28e+06 -0.78 -39.3 -6.41e+06 -1.41 1.48e+05 21.4 ... 1.9e+11 24.4 -5.33 -58.6 4 -6.71 136 138 128 0.0263
Voice and Accountability: Standard Error 0.0957 28.1 -0.000436 213 -0.000389 -0.0101 -516 0.00284 56.9 -0.0136 ... 5.3e+07 0.00811 -0.00993 -0.00643 0.000843 -0.00238 0.0291 0.0292 0.0263 1.07e-05

698 rows × 698 columns

Standardize the data matrix

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_std = scaler.fit_transform(X)
X_std = pd.DataFrame(X_std, index=X.index, columns=X.columns)
X_std.cov(ddof=0)
Fertilizer consumption (kilograms per hectare of arable land) Agricultural land (sq. km) Agricultural land (% of land area) Arable land (hectares) Arable land (hectares per person) Arable land (% of land area) Land under cereal production (hectares) Permanent cropland (% of land area) Forest area (sq. km) Forest area (% of land area) ... High-technology exports (current US$) High-technology exports (% of manufactured exports) Transport services (% of commercial service exports) Travel services (% of commercial service exports) Voice and Accountability: Estimate Voice and Accountability: Number of Sources Voice and Accountability: Percentile Rank Voice and Accountability: Percentile Rank, Lower Bound of 90% Confidence Interval Voice and Accountability: Percentile Rank, Upper Bound of 90% Confidence Interval Voice and Accountability: Standard Error
Fertilizer consumption (kilograms per hectare of arable land) 1 -0.0738 0.33 -0.171 -0.274 -0.273 -0.159 -0.142 -0.214 -0.328 ... 0.257 0.654 -0.332 -0.17 0.19 -0.174 0.208 0.224 0.186 0.163
Agricultural land (sq. km) -0.0738 1 0.391 0.961 0.00058 0.244 0.931 0.386 0.548 -0.0546 ... 0.468 -0.0508 -0.0897 0.201 -0.149 0.236 -0.12 -0.185 -0.101 0.115
Agricultural land (% of land area) 0.33 0.391 1 0.353 0.0154 0.659 0.367 0.13 -0.318 -0.597 ... 0.339 0.153 0.273 -0.0849 -0.0747 0.106 -0.0563 -0.0625 -0.0804 -0.00875
Arable land (hectares) -0.171 0.961 0.353 1 0.0802 0.364 0.989 0.208 0.558 -0.0254 ... 0.511 -0.0967 -0.0103 0.0853 -0.192 0.311 -0.167 -0.22 -0.166 0.0148
Arable land (hectares per person) -0.274 0.00058 0.0154 0.0802 1 0.353 0.119 -0.3 0.0787 0.346 ... -0.293 -0.239 0.449 -0.123 -0.32 0.364 -0.343 -0.317 -0.341 -0.588
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Voice and Accountability: Number of Sources -0.174 0.236 0.106 0.311 0.364 0.208 0.353 -0.0488 0.0864 0.32 ... -0.0652 -0.453 0.0443 0.342 -0.674 1 -0.677 -0.671 -0.679 -0.834
Voice and Accountability: Percentile Rank 0.208 -0.12 -0.0563 -0.167 -0.343 -0.226 -0.23 -0.121 0.194 0.0882 ... 0.352 0.243 -0.0753 -0.375 0.995 -0.677 1 0.992 0.984 0.727
Voice and Accountability: Percentile Rank, Lower Bound of 90% Confidence Interval 0.224 -0.185 -0.0625 -0.22 -0.317 -0.22 -0.276 -0.174 0.181 0.098 ... 0.328 0.228 -0.0749 -0.41 0.995 -0.671 0.992 1 0.966 0.708
Voice and Accountability: Percentile Rank, Upper Bound of 90% Confidence Interval 0.186 -0.101 -0.0804 -0.166 -0.341 -0.276 -0.232 -0.0426 0.179 0.112 ... 0.323 0.242 -0.033 -0.341 0.98 -0.679 0.984 0.966 1 0.713
Voice and Accountability: Standard Error 0.163 0.115 -0.00875 0.0148 -0.588 -0.245 -0.0647 0.296 0.237 -0.247 ... 0.312 0.279 -0.213 -0.129 0.713 -0.834 0.727 0.708 0.713 1

698 rows × 698 columns
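
The unit diagonal above is expected: StandardScaler divides by the population standard deviation (ddof=0), so the ddof=0 covariance of the standardized data coincides with the Pearson correlation matrix of the raw data. A minimal sketch of that check on synthetic data follows; the `toy` frame and its column scales are made up for illustration and are not the WDI data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X: 27 rows, columns on very different scales
rng = np.random.default_rng(0)
scales = np.array([1, 10, 100, 1e3, 1e6])
toy = pd.DataFrame(rng.normal(size=(27, 5)) * scales, columns=list("abcde"))

# Standardize each column to mean 0 and population variance 1
toy_std = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Covariance of the standardized data (ddof=0) vs. correlation of the raw data
cov_std = toy_std.cov(ddof=0)
corr = toy.corr()
same = np.allclose(cov_std.values, corr.values)
print(same)
```

Standardizing therefore removes the dependence on measurement units that dominates the raw covariance matrix.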

Compute the principal components

from sklearn.decomposition import PCA

pca = PCA()
X_pca = pca.fit_transform(X_std)
X_pca = pd.DataFrame(
    X_pca, index=X.index, columns=[f"PC{i+1}" for i in range(X_pca.shape[1])]
)
X_pca.cov(ddof=0)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27
PC1 151 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC2 0 120 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC3 0 0 77.5 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC4 0 0 0 50.5 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC5 0 0 0 0 45.2 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC6 0 0 0 0 0 31.4 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC7 0 0 0 0 0 0 27.7 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC8 0 0 0 0 0 0 0 22.2 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC9 0 0 0 0 0 0 0 0 20.1 0 ... 0 0 0 0 0 0 0 0 0 0
PC10 0 0 0 0 0 0 0 0 0 17.9 ... 0 0 0 0 0 0 0 0 0 0
PC11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC12 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC13 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC14 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC15 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC16 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC17 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
PC18 0 0 0 0 0 0 0 0 0 0 ... 8.65 0 0 0 0 0 0 0 0 0
PC19 0 0 0 0 0 0 0 0 0 0 ... 0 6.92 0 0 0 0 0 0 0 0
PC20 0 0 0 0 0 0 0 0 0 0 ... 0 0 6.78 0 0 0 0 0 0 0
PC21 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 6.23 0 0 0 0 0 0
PC22 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 5.51 0 0 0 0 0
PC23 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 4.63 0 0 0 0
PC24 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 4.22 0 0 0
PC25 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 4.13 0 0
PC26 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 3.81 0
PC27 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

27 rows × 27 columns
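
Two features of this table can be reproduced on synthetic data. With n = 27 rows, PCA returns at most 27 components, and centering removes one degree of freedom, so the last component has zero variance. Also, scikit-learn's `explained_variance_` uses the n - 1 denominator, while the table above uses ddof=0. The sketch below assumes a random matrix `Z` in place of `X_std`; the number of columns (100) is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic standardized matrix with more columns than rows
rng = np.random.default_rng(0)
n, p = 27, 100
Z = rng.normal(size=(n, p))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)  # population std (ddof=0)

pca_toy = PCA()
scores = pca_toy.fit_transform(Z)

# Number of components returned is min(n, p)
n_components = scores.shape[1]

# Population variance of each score column (the diagonal of the ddof=0 covariance)
var_pop = scores.var(axis=0)

# Centered data have rank at most n - 1, so the last variance is numerically zero
last_is_zero = np.isclose(var_pop[-1], 0.0, atol=1e-10)

# sklearn reports variances with the n - 1 denominator; rescale to compare
matches = np.allclose(var_pop, pca_toy.explained_variance_ * (n - 1) / n)

# Total variance of standardized data equals the number of columns
total = var_pop.sum()
print(n_components, last_is_zero, matches, round(total, 6))
```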

Scree plot

import matplotlib.pyplot as plt
import numpy as np

# Fraction of the total variance explained by each component
explained_variance = pca.explained_variance_ratio_

# Cumulative fraction, with a leading 0 for "no components retained"
cumulative_variance = explained_variance.cumsum()
cumulative_variance = np.insert(cumulative_variance, 0, 0)

plt.plot(range(len(explained_variance) + 1), cumulative_variance, marker="o")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.grid(True)
plt.show()

Cumulative Explained Variance.
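
A cumulative curve like this is often read by picking the smallest number of components that reaches a chosen share of the total variance. A minimal sketch, using made-up variance ratios rather than the WDI results; the 0.8 threshold is an arbitrary assumption.

```python
import numpy as np

# Hypothetical explained-variance ratios (sorted in decreasing order, summing to 1)
ratios = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
cumulative = ratios.cumsum()

# Smallest k such that the first k components explain at least `threshold`
threshold = 0.8
k = int(np.searchsorted(cumulative, threshold) + 1)
print(k)
```

With the fitted model, `pca.explained_variance_ratio_` would take the place of `ratios`.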


Scores

import numpy as np
import pandas as pd
import plotly.express as px

# ------------------------------------------------------------
# 2. LOADINGS
# ------------------------------------------------------------
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

load_df = pd.DataFrame(
    loadings, index=X.columns, columns=[f"PC{i+1}" for i in range(X_pca.shape[1])]
)

# ------------------------------------------------------------
# 3. COS² (quality of representation)
# ------------------------------------------------------------

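# Individuals cos²: share of each country's squared distance to the origin
# carried by each component. Summed over all components it is always 1,
# so keep only the plotted plane (PC1, PC2).
ind_cos2 = (X_pca**2).div((X_pca**2).sum(axis=1), axis=0)
X_pca["cos2"] = ind_cos2[["PC1", "PC2"]].sum(axis=1)

# Variables cos²: same idea for the loadings, restricted to PC1 and PC2
var_cos2 = (load_df**2).div((load_df**2).sum(axis=1), axis=0)
load_df["cos2"] = var_cos2[["PC1", "PC2"]].sum(axis=1)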

# ------------------------------------------------------------
# 4. SELECT TOP VARIABLES (by cos²)
# ------------------------------------------------------------
TOP_K = 20
load_top = load_df.sort_values("cos2", ascending=False).head(TOP_K)

# ------------------------------------------------------------
# 5. PCA BIPLOT (FACTOEXTRA STYLE)
# ------------------------------------------------------------
fig = px.scatter(
    X_pca,
    x="PC1",
    y="PC2",
    hover_name=X_pca.index,
    hover_data={"cos2": ":.3f"},
    title="PCA Biplot (Individuals + Variables)",
)

# Add variable arrows
for var, row in load_top.iterrows():
    fig.add_annotation(
        x=row["PC1"],
        y=row["PC2"],
        ax=0,
        ay=0,
        xref="x",
        yref="y",
        axref="x",
        ayref="y",
        showarrow=True,
        arrowhead=2,
        arrowsize=1,
        arrowwidth=2,
        opacity=0.8,
    )
    fig.add_annotation(
        x=row["PC1"], y=row["PC2"], text=var, showarrow=False, font=dict(size=10)
    )

# ------------------------------------------------------------
# 6. EQUAL AXES (CRUCIAL)
# ------------------------------------------------------------
lim = np.max(np.abs(X_pca[["PC1", "PC2"]].values)) * 1.1

fig.update_xaxes(range=[-lim, lim], zeroline=True)
fig.update_yaxes(range=[-lim, lim], zeroline=True, scaleanchor="x", scaleratio=1)

# ------------------------------------------------------------
# 7. LAYOUT CLEANUP
# ------------------------------------------------------------
fig.update_layout(
    width=800, height=800, xaxis_title="PC1", yaxis_title="PC2", showlegend=False
)

fig.show()
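
The cos² quality measure is informative only when restricted to the components being plotted: summed over all components it equals 1 for every point by construction. A minimal sketch on synthetic data, assuming a random standardized matrix `Z` in place of the WDI data:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic standardized data: 27 observations, 6 variables
rng = np.random.default_rng(1)
Z = rng.normal(size=(27, 6))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

scores = pd.DataFrame(
    PCA().fit_transform(Z), columns=[f"PC{i+1}" for i in range(6)]
)

# Share of each observation's squared norm carried by each component
cos2 = (scores**2).div((scores**2).sum(axis=1), axis=0)

cos2_all = cos2.sum(axis=1)                    # always 1
cos2_plane = cos2[["PC1", "PC2"]].sum(axis=1)  # quality on the PC1-PC2 plane

print(cos2_all.round(6).unique())
print(cos2_plane.describe())
```

Observations with `cos2_plane` close to 1 are well represented in the biplot; values near 0 indicate points whose position is mostly determined by the remaining components.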