Worksheet
1 PCA on US Crime Data
Exercise 1. In this exercise, we explore the US crime data with Principal Component Analysis (PCA). The goal is to understand data preprocessing, standardization, and interpretation of PCA results.
- Load the data `crime.csv` into a data frame `crimeus` using `read.csv`, with row names given by the first column.
- Plot `Population` vs `Index` as a scatter plot.
  - Use `plot(..., type = "n")` first and then `text(...)` to label the points.
  - What patterns or clusters do you notice?
- Perform a PCA on columns 5 to 12 (the crime variables) using `PCA`.
  - Examine the eigenvalues: why might some of them be exactly 0?
  - Comment on why the results may not be very informative.
- What does this bit of code do?

  ```r
  crimerate <- crimeus
  crimerate[2:12] <- round(crimeus[2:12] * 100000 / crimeus$Population)
  ```
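The labelled scatter plot from the second step can be sketched as follows. This is only an illustration on a made-up data frame `toy` with hypothetical region labels, since `crime.csv` is not reproduced here; the commented `read.csv` call shows how the real load step might look, assuming the file sits in the working directory.

```r
# Toy stand-in for crimeus (values and labels are invented for illustration).
toy <- data.frame(
  Population = c(600, 5400, 1200, 8900),
  Index      = c(25, 310, 48, 520),
  row.names  = c("A", "B", "C", "D")  # hypothetical region labels
)

# With the real file, the load step might look like:
# crimeus <- read.csv("crime.csv", row.names = 1)

# type = "n" sets up the axes without drawing any points ...
plot(toy$Population, toy$Index, type = "n",
     xlab = "Population", ylab = "Index")
# ... so that text() can write each row name at its (x, y) position instead.
text(toy$Population, toy$Index, labels = rownames(toy))
```

Drawing the labels in place of point symbols makes it possible to read off which regions sit apart from the bulk of the data.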
Perform PCA on `Population` and the `crimerate` data (columns 1 and 5 to 12).
- Why is it necessary to scale the data before PCA?
- Plot a scree plot and the individuals on the first two components.
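To get a feel for the scaling question, here is a minimal sketch on synthetic data. It uses base R's `prcomp` rather than the `PCA` function the worksheet asks for, and the columns `RateA` and `RateB` are hypothetical stand-ins for crime rates, so treat it as an illustration of the mechanism and not as the expected solution.

```r
set.seed(1)
n <- 40

# Synthetic data: one variable in the millions, two on much smaller scales.
toy <- data.frame(
  Population = runif(n, 5e5, 3e7),
  RateA      = runif(n, 1, 12),      # hypothetical rate per 100,000
  RateB      = runif(n, 300, 1500)   # hypothetical rate per 100,000
)

pca_raw    <- prcomp(toy, center = TRUE, scale. = FALSE)  # covariance-based
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)   # correlation-based

# Share of total variance carried by each component.
share_raw    <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
share_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)
print(round(share_raw, 3))
print(round(share_scaled, 3))

# Scree plot and individuals on the first two components (scaled PCA).
screeplot(pca_scaled, type = "lines", main = "Scree plot (scaled)")
plot(pca_scaled$x[, 1:2], type = "n")
text(pca_scaled$x[, 1:2], labels = seq_len(n))
```

Without scaling, the variable with the largest numeric range absorbs almost all of the variance, so the first component is close to that single variable. After scaling, every variable enters with unit variance.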
Perform PCA on the `crimerate` data excluding `Inmates` (columns 5 to 11).
- Plot a scree plot and the individuals on the first two components.
- What differences do you observe with the previous question?
Examine the contributions of variables to the first two components. Which variables drive the main directions of variability?
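One way to compute variable contributions by hand is sketched below with base R's `prcomp` on synthetic data; the variables `V1` to `V3` are invented. The contribution of a variable to a component is taken here as its squared loading, expressed as a percentage. Assuming the worksheet's `PCA` is the FactoMineR function, its result should expose a comparable table as `$var$contrib`.

```r
set.seed(2)
n <- 40
base <- rnorm(n)

# V1 and V2 share a common signal; V3 is independent noise.
toy <- data.frame(
  V1 = base + rnorm(n, sd = 0.3),
  V2 = base + rnorm(n, sd = 0.3),
  V3 = rnorm(n)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings are unit-length vectors, so the squared loadings of each
# component sum to 1; multiplying by 100 gives percentage contributions.
contrib <- 100 * pca$rotation^2
print(round(contrib[, 1:2], 1))
print(colSums(contrib))  # each column sums to 100, up to floating point
```

Variables with a contribution well above the uniform share (100 divided by the number of variables) are the ones that drive a component.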
Write an interpretation answering the following questions:
- Why does PCA on raw counts fail to give meaningful results?
- Why do we transform the data to rates per population before PCA?
- How does scaling (standardization) affect PCA results?
- Which regions are the most extreme according to the first principal component?
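The contrast between raw counts and rates can be explored on simulated data before writing the interpretation. The sketch below is an assumption-laden toy: populations, base rates, and the four `crime` columns are all made up, and it uses base R's `prcomp` in place of `PCA`. Counts are generated as rate times population, so region size is built into every count column by construction.

```r
set.seed(3)
n <- 50

# Hypothetical populations spanning roughly 0.5 to 30 million.
pop <- exp(runif(n, log(5e5), log(3e7)))

# Made-up base rates per 100,000, with independent regional variation.
base_rate <- c(5, 40, 400, 2500)
rate_mult <- matrix(runif(n * 4, 0.7, 1.3), nrow = n, ncol = 4)
rates  <- sweep(rate_mult, 2, base_rate, `*`)

# Counts = rate * population / 100,000 (pop is recycled down each column).
counts <- round(rates * pop / 1e5)
colnames(rates) <- colnames(counts) <- paste0("crime", 1:4)

# Share of variance carried by the first component of a scaled PCA.
first_share <- function(x) {
  p <- prcomp(x, center = TRUE, scale. = TRUE)
  p$sdev[1]^2 / sum(p$sdev^2)
}

print(first_share(counts))  # dominated by population size
print(first_share(rates))   # size effect removed
```

In this toy setting the first component of the count data mostly measures how large a region is, even after standardization, whereas the rate data spread their variance over several components. Whether the real crime data behave the same way is for the exercise to establish.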