The data set music-all.csv was constructed by taking the first 40 seconds of each song's wave file and computing features such as the mean, variance, and maximum of the frequency, along with variables related to the loudness and the location of peaks in the wave.

Include Libraries

library(ggplot2)

Import and Scale Data

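df <- read.csv("music-all.csv", header=TRUE)
# Keep only rows without missing values
df_good <- df[complete.cases(df),]
# Standardize the feature columns (column 4 onward) to mean 0 and standard deviation 1
df_good[,4:ncol(df_good)] <- scale(df_good[,4:ncol(df_good)], center=TRUE, scale=TRUE)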
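
As a quick sanity check on the scaling step, the sketch below applies scale() to a small made-up data frame and confirms that each feature column ends up with mean 0 and standard deviation 1. The data frame toy and its columns f1 and f2 are hypothetical stand-ins, not values from music-all.csv.

```r
# Hypothetical toy data standing in for the music features (not from music-all.csv)
set.seed(1)
toy <- data.frame(
  artist = rep(c("A", "B"), each = 5),
  f1 = rnorm(10, mean = 100, sd = 20),   # a feature on a large scale
  f2 = rnorm(10, mean = 0.5, sd = 0.1)   # a feature on a small scale
)

# Standardize only the numeric feature columns, as in the step above
toy[, 2:3] <- scale(toy[, 2:3], center = TRUE, scale = TRUE)

# After scaling, each feature column should have mean ~0 and sd 1
col_means <- colMeans(toy[, 2:3])
col_sds <- apply(toy[, 2:3], 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Standardizing matters here because PCA is driven by variance: without it, a feature measured on a large scale would dominate the leading components.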

Apply and Plot PCA

Extract the first two principal components and draw a scatter plot of them, colored by artist.

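# Run PCA on the standardized feature columns
pca <- prcomp(df_good[,4:ncol(df_good)])
# Store the scores on the first two principal components
df_good$PCA1 <- pca$x[,1]
df_good$PCA2 <- pca$x[,2]
# Plot each song as its artist name, colored by artist
plt <- ggplot(df_good, aes(PCA1,PCA2)) + geom_text(aes(label=artist, color=artist))
plt <- plt + ggtitle("PCA2 vs. PCA1", subtitle = "Colored by Artist (8 Categories)")
plt <- plt + theme(plot.title = element_text(size = rel(2.0)))
print(plt)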
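
To see how much of the total variance the first two components actually capture, the proportion explained by each component can be computed from the sdev field that prcomp() returns. The following is a minimal sketch on synthetic data; the data frame feats and its columns are made up for illustration and are not the music features.

```r
# Synthetic features (hypothetical): a and b are strongly correlated, c and d are noise
set.seed(42)
n <- 200
x1 <- rnorm(n)
feats <- data.frame(
  a = x1,
  b = x1 + rnorm(n, sd = 0.3),
  c = rnorm(n),
  d = rnorm(n)
)

# Standardize, then run PCA
pca_toy <- prcomp(scale(feats))

# Variance of each component is sdev^2; normalize to get proportions
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_var <- cumsum(var_explained)

print(round(var_explained, 3))  # proportion of variance per component
print(round(cum_var, 3))        # cumulative proportion
```

On the real data, the same two lines applied to pca would show whether two components are enough to represent the songs, or whether a large share of the variance is left out of the scatter plot.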

Apply and Plot K-means on PCA Features (The first two have the highest variance)

Take the first two components and draw a scatter plot of them, colored by the K-means cluster assignment, with k = length(unique(df_good$artist)), i.e., one cluster per artist.

# Cluster the songs in the PCA1/PCA2 plane, using one cluster per artist
cluster <- kmeans(df_good[,c("PCA1","PCA2")], centers=length(unique(df_good$artist)), nstart=30, iter.max=50)
# Store the assignment as a factor so ggplot uses a discrete color scale
df_good$cluster <- factor(cluster$cluster)
plt2 <- ggplot(df_good, aes(PCA1,PCA2)) + geom_text(aes(label=artist, color=cluster))
plt2 <- plt2 + ggtitle("PCA2 vs. PCA1", subtitle = "Colored by Cluster (Number of Artists)")
plt2 <- plt2 + theme(plot.title = element_text(size = rel(2.0)))
print(plt2)

Comparative Analysis
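
One possible way to compare the two colorings numerically is to cross-tabulate the known labels against the K-means cluster assignments and compute cluster purity (the fraction of points that fall in their cluster's majority label). This is only a sketch on synthetic, well-separated points; the labels, centers, and points below are hypothetical and say nothing about how the actual artists separate.

```r
# Hypothetical data: three labeled groups of 2-D points around distinct centers
set.seed(7)
labels <- rep(c("A", "B", "C"), each = 30)
centers <- matrix(c(0, 0,
                    5, 5,
                   -5, 5), ncol = 2, byrow = TRUE)
pts <- centers[rep(1:3, each = 30), ] + matrix(rnorm(180, sd = 0.5), ncol = 2)

# K-means with one cluster per label, same settings as the step above
km <- kmeans(pts, centers = 3, nstart = 30, iter.max = 50)

# Rows are true labels, columns are cluster ids
tab <- table(label = labels, cluster = km$cluster)
print(tab)

# Purity: for each cluster take its most common label, then sum and normalize
purity <- sum(apply(tab, 2, max)) / sum(tab)
print(purity)
```

On the music data, the equivalent table would be table(df_good$artist, df_good$cluster). A table with one dominant cell per column would indicate that the clusters recover the artists; counts spread across rows would indicate that the first two components do not separate them well.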