Instructions


The data set music-all.csv was constructed by taking the first 40 seconds of the wave file of each song and computing features such as the average, variance, and maximum of the frequency, along with variables related to the loudness and the location of peaks in the wave.


Include Libraries

library(ggplot2)
package ‘ggplot2’ was built under R version 3.2.5

Import and Scale Data

setwd("/Users/bmc/Desktop/CSCI-49000/week_6/HW7")
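# Read the data, keep only rows with no missing values, then standardize
# every feature column (column 4 onward) to zero mean and unit variance.
df <- read.csv("music-all.csv", header=TRUE)
df_good <- df[complete.cases(df),]
df_good[,4:ncol(df_good)] <- scale(df_good[,4:ncol(df_good)], center=TRUE, scale=TRUE)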
head(df_good)
head(df)

Apply and Plot PCA

Get the top two components and do a scatter plot of them colored (or grouped) by Artist

pca <- prcomp(df_good[,4:ncol(df_good)])
df_good$PCA1 <- pca$x[,1]
df_good$PCA2 <- pca$x[,2]
plt <- ggplot(df_good, aes(PCA1,PCA2)) + geom_text(aes(label=artist, color=artist))
plt <- plt + ggtitle("PCA2 vs. PCA1", subtitle = "Colored by Artist (8 Categories)")
plt <- plt + theme(plot.title = element_text(size = rel(2.0)))
plt
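
Before reading much into a two-component scatter plot, it can help to check how much of the total variance those two components actually carry. The sketch below shows one way to do that. Because music-all.csv is not reproduced here, it uses R's built-in iris data purely as a stand-in (an assumption); the names features, pca_demo, var_explained and top_two_share are illustrative and do not appear in the original analysis.

```r
# Stand-in data: the built-in iris measurements (an assumption, not the music data).
features <- scale(iris[, 1:4], center = TRUE, scale = TRUE)
pca_demo <- prcomp(features)

# Each component's variance is the square of its standard deviation.
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)

# Share of the total variance captured by the first two components.
top_two_share <- sum(var_explained[1:2])

print(round(var_explained, 3))
print(round(top_two_share, 3))
```

With the music data, the same two lines could be applied to the pca object above by replacing pca_demo with pca. A low share would suggest that the 2-D plot hides much of the structure in the features.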


Apply and Plot K-means on PCA Features (The first two have the highest variance)

Pick the top two components and do a scatter plot of them colored (or grouped) by the Clusters in K-means {k=length(unique(df_good$artist))}

cluster <- kmeans(df_good[,c("PCA1","PCA2")], centers=length(unique(df_good$artist)), nstart=30, iter.max=50)
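# Store the cluster id as a factor so ggplot assigns one discrete color per
# cluster instead of a continuous color gradient.
df_good$cluster <- factor(cluster$cluster)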
plt2 <- ggplot(df_good, aes(PCA1,PCA2)) + geom_text(aes(label=artist, color=cluster))
plt2 <- plt2 + ggtitle("PCA2 vs. PCA1", subtitle = "Colored by Cluster (Number of Artists)")
plt2 <- plt2 + theme(plot.title = element_text(size = rel(2.0)))
plt2


Comparative Analysis
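
One possible way to compare the two colorings is to cross-tabulate the known labels against the k-means clusters and summarize the agreement with a purity score. The sketch below is only an illustration under stated assumptions: it uses the built-in iris data as a stand-in for the music features, with Species playing the role of df_good$artist, and the names scores, labels, fit, confusion and purity are hypothetical. It reports nothing about the actual music data.

```r
set.seed(1)  # k-means starts from random centers; fix the seed for repeatability

# Stand-in data (an assumption): iris features, with Species as the known label.
features <- scale(iris[, 1:4])
scores <- prcomp(features)$x[, 1:2]   # top two principal components
labels <- iris$Species                # plays the role of df_good$artist

# Same k-means settings as above, with k = number of distinct labels.
k <- length(unique(labels))
fit <- kmeans(scores, centers = k, nstart = 30, iter.max = 50)

# Rows are known labels, columns are k-means clusters.
confusion <- table(label = labels, cluster = fit$cluster)
print(confusion)

# Purity: credit each cluster with its most common label, then take the
# fraction of points that match.
purity <- sum(apply(confusion, 2, max)) / sum(confusion)
print(round(purity, 3))
```

For the music data, the equivalent table would be table(df_good$artist, df_good$cluster). A purity close to 1 would mean the clusters largely coincide with artists, while a value near 1/k would mean the clusters carry little information about the artist.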
