The data set music-all.csv was constructed by taking the first 40 seconds of the wave file of each song and computing features such as the average, variance, and maximum of the frequency, along with variables related to the loudness and location of peaks in the wave.

• Apply Principal Component Analysis (PCA) to the data set. Make a scatter plot of the first two components against each other and color the points by artist.
• Apply k-means clustering to the data set and create a scatter plot similar to the one in the first task, except with the colors determined by the k-means clusters. Compare how well k-means performed.

## Include Libraries

library(ggplot2)

## Import and Scale Data

df <- read.csv("music-all.csv")
# Drop rows with missing values, then standardize the feature columns (4 onward)
df_good <- df[complete.cases(df), ]
df_good[, 4:ncol(df_good)] <- scale(df_good[, 4:ncol(df_good)], center=TRUE, scale=TRUE)
head(df_good)
head(df)
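
As a quick sanity check on the scaling step, the sketch below standardizes a small made-up data frame and confirms that each scaled column ends up with mean 0 and standard deviation 1. The data frame `toy` and its column names are hypothetical stand-ins, not the music data.

```r
# Hypothetical toy data standing in for the numeric feature columns
set.seed(1)
toy <- data.frame(
  id    = paste0("song", 1:20),
  feat1 = rnorm(20, mean = 50, sd = 10),
  feat2 = runif(20, min = 0, max = 1000)
)

# Standardize every column except the identifier
num_cols <- 2:ncol(toy)
toy[, num_cols] <- scale(toy[, num_cols], center = TRUE, scale = TRUE)

# After scaling, column means should be ~0 and standard deviations 1
col_means <- colMeans(toy[, num_cols])
col_sds   <- apply(toy[, num_cols], 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Standardizing matters here because PCA and k-means both depend on distances, so a feature measured on a large scale would otherwise dominate the result.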

## Apply and Plot PCA

Get the top two components and make a scatter plot of them, colored (or grouped) by artist.

pca <- prcomp(df_good[,4:ncol(df_good)])
df_good$PCA1 <- pca$x[,1]
df_good$PCA2 <- pca$x[,2]
plt <- ggplot(df_good, aes(PCA1,PCA2)) + geom_text(aes(label=artist, color=artist))
plt <- plt + ggtitle("PCA2 vs. PCA1", subtitle = "Colored by Artist (8 Categories)")
plt <- plt + theme(plot.title = element_text(size = rel(2.0)))
plt
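
To judge how much information a two-component plot retains, one option is to look at the proportion of variance each component explains. The sketch below does this on hypothetical simulated data (the data frame `toy` and its columns `f1` to `f5` are made up for illustration); the same calculation should apply to the `pca` object above.

```r
# Hypothetical toy data: 100 rows, 5 numeric features, three of them correlated
set.seed(42)
base <- rnorm(100)
toy <- data.frame(
  f1 = base + rnorm(100, sd = 0.3),
  f2 = base + rnorm(100, sd = 0.3),
  f3 = -base + rnorm(100, sd = 0.3),
  f4 = rnorm(100),
  f5 = rnorm(100)
)

pca_toy <- prcomp(scale(toy))

# Proportion of total variance captured by each component
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_explained <- cumsum(var_explained)

print(round(var_explained, 3))
print(round(cum_explained[2], 3))  # share captured by PC1 and PC2 together
```

If the first two components capture only a small share of the variance, groups that overlap in the scatter plot may still be separable in the remaining dimensions.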

## Apply and Plot K-means on PCA Features (The first two have the highest variance)

Pick the top two components and make a scatter plot of them, colored (or grouped) by the k-means clusters, with k = length(unique(df_good$artist)).

cluster <- kmeans(df_good[,c("PCA1","PCA2")], centers=length(unique(df_good$artist)), nstart=30, iter.max=50)
df_good$cluster <- cluster$cluster
# Treat the cluster id as a factor so ggplot uses a discrete color scale
plt2 <- ggplot(df_good, aes(PCA1,PCA2)) + geom_text(aes(label=artist, color=factor(cluster)))
plt2 <- plt2 + ggtitle("PCA2 vs. PCA1", subtitle = "Colored by Cluster (Number of Artists)")
plt2 <- plt2 + theme(plot.title = element_text(size = rel(2.0)))
plt2
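
Beyond comparing the two plots by eye, one way to quantify how well the clusters match the artists is a contingency table plus a purity score. The following is a minimal sketch on hypothetical simulated data (the artists "A", "B", "C" and the coordinates are made up); purity is one possible agreement measure, not necessarily the one intended by the assignment.

```r
# Hypothetical toy data: three well-separated groups standing in for artists
set.seed(7)
toy <- data.frame(
  artist = rep(c("A", "B", "C"), each = 30),
  PCA1   = c(rnorm(30, -5), rnorm(30, 0), rnorm(30, 5)),
  PCA2   = c(rnorm(30, 0),  rnorm(30, 5), rnorm(30, 0))
)

k  <- length(unique(toy$artist))
km <- kmeans(toy[, c("PCA1", "PCA2")], centers = k, nstart = 30, iter.max = 50)

# Rows are true labels, columns are k-means clusters
tab <- table(artist = toy$artist, cluster = km$cluster)
print(tab)

# Purity: fraction of points that carry their cluster's majority label
purity <- sum(apply(tab, 2, max)) / sum(tab)
print(purity)
```

Cluster numbers are arbitrary, so the table should be read by looking for one dominant artist per column rather than by matching row and column positions.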

## Comparative Analysis

• When applied to the same PCA data, the k-means cluster plot looked similar to the artist-colored PCA plot.

• K-means is not very predictive of the artist when we pick just two arbitrary features of the music data and plot them against each other.

• K-means behaves almost identically when the first two principal components are used as its input data.
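
The comparison between inputs can also be checked directly by clustering once on the full scaled features and once on the first two principal components, then cross-tabulating the two assignments. The sketch below uses hypothetical simulated data (three groups, six features), so it only illustrates the procedure and says nothing about the music data itself.

```r
# Hypothetical toy data: 3 groups of 40 points, 6 numeric features
set.seed(123)
group <- rep(1:3, each = 40)
shift <- c(-4, 0, 4)[group]
feats <- sapply(1:6, function(j) shift + rnorm(120))
feats <- scale(feats)

# First two principal components of the scaled features
pcs <- prcomp(feats)$x[, 1:2]

km_full <- kmeans(feats, centers = 3, nstart = 30)
km_pca  <- kmeans(pcs,   centers = 3, nstart = 30)

# Cluster ids are arbitrary, so compare the two partitions with a cross-table
agreement <- table(full = km_full$cluster, pca = km_pca$cluster)
print(agreement)

# Share of points whose full-feature cluster maps to one dominant PCA cluster
match_rate <- sum(apply(agreement, 1, max)) / sum(agreement)
print(match_rate)
```

A match rate close to 1 would suggest that the two-component projection preserves the cluster structure found in the full feature space.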