I investigate the difference between audio features of Iranian songs and singers on Spotify.
I am a music lover, and as with my other hobbies, I enjoy applying data science methods to it. A few months ago, I participated in the third week of the TidyTuesday project, where I made a map of Spotify songs based on their audio features and a dimensionality reduction algorithm called UMAP. Since then, I have been using Spotify’s Web API to collect data, and recently I decided to look at some of my favorite Iranian artists and their songs on Spotify. Iranian music spans different genres, and while pop and rap are very popular among the younger generation, I like the traditional style more. Nevertheless, I have always been curious about how different traditional music and pop music really are. These are a few questions that I would like to answer:
library(kableExtra)
library(tidyverse)
library(googlesheets4)
library(tidymodels)
library(gghighlight)
library(hrbrthemes)
library(ggthemes)
library(ggrepel)
library(ggalt)
library(extrafont)
library(ggtext)
I compiled a list of Persian singers manually and collected information about their available songs on Spotify using the spotifyr package, which lets us access Spotify’s API from R. This process was cumbersome, as I did not always get what I was looking for; for instance, songs that belonged to an unrelated artist were sometimes retrieved. For each singer, popularity scores are only available for the top 10 most popular songs, so the rest of the songs have no popularity score. In the end, I collected various kinds of information about more than 10,000 songs.
songs_audio_plus_pop <- read_csv('https://raw.githubusercontent.com/mcnakhaee/datasets/master/Persian_Songs_Spotify.csv')
glimpse(songs_audio_plus_pop)
Rows: 10,632
Columns: 32
$ track_id <chr> "31iPeC6I0AiRW8InOxNKzm", "4Fi46ha8teWYTw~
$ poet <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ disc_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ duration_ms <dbl> 446880, 851920, 293160, 648720, 273480, 2~
$ explicit <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,~
$ track_name <chr> "Ghazale Taze", "Ayeeneye Hosn", "Tarke E~
$ artist_name <chr> "Salar Aghili", "Salar Aghili", "Salar Ag~
$ popularity <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ track_number <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 9,~
$ album_href <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ album_id <chr> "6GcmAWrnnMb2BuVriPhBLa", "6GcmAWrnnMb2Bu~
$ album_name <chr> "Va Eshgh Amad", "Va Eshgh Amad", "Va Esh~
$ album_release_date <chr> "2/3/2020", "2/3/2020", "2/3/2020", "2/3/~
$ album_total_tracks <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ album_release_year <dbl> 2020, 2020, 2020, 2020, 2020, 2019, 2019,~
$ track_href <chr> "https://api.spotify.com/v1/tracks/31iPeC~
$ danceability <dbl> 0.437, 0.379, 0.437, 0.488, 0.301, 0.577,~
$ energy <dbl> 0.390, 0.146, 0.453, 0.138, 0.443, 0.366,~
$ key <dbl> 0, 5, 5, 2, 0, 0, 6, 9, 9, 2, 2, 5, 9, 7,~
$ loudness <dbl> -7.170, -10.008, -5.392, -12.287, -5.702,~
$ mode <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ speechiness <dbl> 0.0299, 0.0414, 0.0349, 0.0451, 0.0334, 0~
$ acousticness <dbl> 0.839, 0.970, 0.664, 0.915, 0.657, 0.834,~
$ instrumentalness <dbl> 3.51e-05, 3.60e-04, 2.07e-03, 6.58e-03, 8~
$ liveness <dbl> 0.1360, 0.0812, 0.1100, 0.2120, 0.1200, 0~
$ valence <dbl> 0.3300, 0.3460, 0.5010, 0.4450, 0.4100, 0~
$ tempo <dbl> 131.913, 105.634, 94.651, 110.967, 148.05~
$ time_signature <dbl> 3, 4, 5, 5, 1, 3, 4, 3, 4, 5, 1, 4, 4, 3,~
$ key_name <chr> "C", "F", "F", "D", "C", "C", "F#", "A", ~
$ mode_name <chr> "minor", "major", "minor", "minor", "mino~
$ key_mode <chr> "C minor", "F major", "F minor", "D minor~
$ artist_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
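Since only a handful of tracks per singer carry a popularity score, it can be useful to check how many rows actually have one before relying on that column. The following is a minimal sketch on a small made-up tibble rather than the real dataset; the names toy_songs and popularity_coverage are purely illustrative.

```r
library(dplyr)
library(tibble)

# Made-up rows that mimic the structure of the real dataset:
# only some tracks carry a popularity score.
toy_songs <- tribble(
  ~artist_name, ~track_name, ~popularity,
  "Artist A",   "Song 1",    40,
  "Artist A",   "Song 2",    NA,
  "Artist A",   "Song 3",    NA,
  "Artist B",   "Song 4",    12,
  "Artist B",   "Song 5",    30
)

# Count, per artist, how many tracks exist and how many have a score.
popularity_coverage <- toy_songs %>%
  group_by(artist_name) %>%
  summarise(
    n_tracks = n(),
    n_with_popularity = sum(!is.na(popularity)),
    .groups = "drop"
  )

popularity_coverage
```

Running the same summary on songs_audio_plus_pop should show which singers have popularity scores at all.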
Apart from variables such as the album that a song belongs to and its date of release, Spotify’s API can give us several features that capture a song’s different audio characteristics.
You can see a full list of these features in this link. However, I am only interested in some of these features, such as:
This excellent visualization inspired me to create a similar plot for some of the most well-known Persian singers and see how their audio features differ from each other.
artists <- c(
  'Sirvan Khosravi', 'Hesameddin Seraj', 'Rastak', 'Shahram Nazeri',
  'Hossein Alizadeh', 'Reza Sadeghi', 'Alireza Eftekhari', 'Mohammadreza Shajarian',
  'Salar Aghili', 'Morteza Pashaei', 'Alireza Ghorbani', 'Homayoun Shajarian',
  'Mohsen Yeganeh', 'Moein', 'Farzad Farzin', 'Babak Jahanbakhsh',
  'Ehsan Khajeh Amiri', 'Siavash Ghomayshi', 'Xaniar Khosravi', 'Tohi',
  'Mohsen Chavoshi', 'Amir Tataloo', 'Hamed Homayoun', 'Kayhan Kalhor'
)
I will plot the average, the minimum, and the maximum value of each feature for each singer. That gives us a good picture of how different their audio characteristics are from each other. However, we must make the right adjustments to the dataset before visualizing it:
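We need to transform the original dataset into a long data frame, which can be done with pivot_longer from the tidyr package.
We should rescale each audio feature to a common range; otherwise, features measured on very different scales (tempo in beats per minute, loudness in decibels) would dominate the others and the plot would not make any sense.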
order <- c(
  "valence",
  "energy",
  "tempo",
  "loudness",
  "acousticness",
  "instrumentalness",
  "danceability"
)
scaled_features_long <- songs_audio_plus_pop %>%
  # Rescale every feature listed in `order` to a common 0-7 range
  mutate_at(order, scales::rescale, to = c(0, 7)) %>%
  # Keep only the songs that have a popularity score
  filter(!is.na(popularity)) %>%
  filter(artist_name %in% artists) %>%
  mutate(artist_name = factor(artist_name)) %>%
  # One row per song and audio feature
  pivot_longer(
    cols = c(
      "valence",
      "energy",
      "tempo",
      "loudness",
      "acousticness",
      "danceability"
    ),
    names_to = 'metric',
    values_to = 'value'
  )
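To make the two adjustments concrete, here is a small sketch on made-up numbers. It is not the code used for the plots: the tibble toy_features and its values are invented for illustration, and it uses across() where the code above uses mutate_at(). The only functions involved are scales::rescale and tidyr::pivot_longer.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Three made-up songs with two features on very different scales
toy_features <- tibble(
  artist_name = c("Artist A", "Artist A", "Artist B"),
  valence = c(0.2, 0.8, 0.5),  # already between 0 and 1
  tempo = c(80, 160, 120)      # beats per minute
)

toy_long <- toy_features %>%
  # Map each feature linearly onto the same 0-7 range
  mutate(across(c(valence, tempo), ~ scales::rescale(.x, to = c(0, 7)))) %>%
  # Wide to long: one row per song and feature
  pivot_longer(
    cols = c(valence, tempo),
    names_to = "metric",
    values_to = "value"
  )

toy_long
```

After rescaling, the smallest value of each feature becomes 0 and the largest becomes 7, so valence and tempo can share one radial axis.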
Now, we can visualize the results for each artist. As mentioned before, I will compare artists by the minimum (red), the average (orange), and maximum (yellow) values of each audio feature in their songs.
ggplot() +
  ### This plots the average of each audio feature
  geom_polygon(
    data = scaled_features_long %>%
      group_by(artist_name, metric) %>%
      summarise_at(c("value"), mean) %>%
      arrange(factor(metric, levels = order)) %>%
      ungroup(),
    aes(x = metric, y = value, group = artist_name),
    alpha = .54,
    size = 1.5,
    show.legend = TRUE,
    fill = '#FF1654'
  ) +
  ### This plots the maximum of each audio feature
  geom_polygon(
    data = scaled_features_long %>%
      group_by(artist_name, metric) %>%
      summarise_at(c("value"), max) %>%
      arrange(factor(metric, levels = order)) %>%
      ungroup(),
    aes(x = metric, y = value, group = artist_name),
    alpha = .44,
    size = 1.5,
    show.legend = TRUE,
    fill = '#FFE066'
  ) +
  ### This plots the minimum of each audio feature
  geom_polygon(
    data = scaled_features_long %>%
      group_by(artist_name, metric) %>%
      summarise_at(c("value"), min) %>%
      arrange(factor(metric, levels = order)) %>%
      ungroup(),
    aes(x = metric, y = value, group = artist_name),
    alpha = .84,
    size = 1.5,
    show.legend = TRUE,
    fill = "#EF476F"
  ) +
  scale_x_discrete(
    limits = order,
    labels = c(
      "Happy",
      "Energy",
      "Fast",
      "Loud",
      "Acoustic",
      "Instrumental",
      "Danceable"
    )
  ) +
  coord_polar(clip = 'off') +
  theme_minimal() +
  labs(title = "Persian Singers and Their Audio Characteristics",
       caption = 'Source: Spotify \n Visualization: mcnakhaee') +
  ylim(0, 8) +
  facet_wrap(~ artist_name, ncol = 4) +
  theme(
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(
      family = 'Montserrat',
      size = 13.5,
      margin = ggplot2::margin(30, 0, 20, 0)
    ),
    plot.caption = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 11,
      color = 'grey80'
    ),
    text = element_text(family = 'Montserrat'),
    strip.text = element_text(family = 'Montserrat', size = 18),
    strip.text.x = element_text(margin = ggplot2::margin(1, 1, 1, 1, "cm")),
    panel.spacing = unit(3.5, "lines"),
    panel.grid = element_blank(),
    plot.title = element_text(
      family = 'Montserrat',
      hjust = .5,
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 32,
      color = 'gray10'
    )
  )
My first plot is informative, but it only gives us an overall picture of audio features. However, I would like to have a more detailed picture of singers and the audio features for each of their songs. For this reason, I will also make a separate plot for each audio feature where every song and its corresponding feature values are shown. I will also mark a few popular songs from each artist with a different color on this plot.
# Set a custom theme for our plots
theme_set(
  theme_void() +
    theme(
      text = element_text(family = 'Montserrat'),
      axis.text.x = element_text(
        family = 'Montserrat',
        margin = ggplot2::margin(30, 0, 20, 0),
        color = 'gray80',
        size = 18
      ),
      axis.text.y = element_text(
        family = 'Montserrat',
        margin = ggplot2::margin(30, 0, 20, 20),
        color = 'gray80',
        size = 20
      ),
      axis.title.x = element_text(
        family = 'Montserrat',
        margin = ggplot2::margin(30, 0, 20, 0),
        size = 22,
        color = 'gray80'
      ),
      plot.title = element_text(
        family = 'Montserrat',
        hjust = .5,
        margin = ggplot2::margin(40, 0, 40, 0),
        size = 35,
        color = 'gray80'
      ),
      plot.caption = element_text(
        family = 'Montserrat',
        margin = ggplot2::margin(30, 0, 20, 20),
        size = 20,
        color = 'gray70'
      ),
      legend.position = 'none',
      plot.background = element_rect(fill = "#516869")
    )
)
Once again, I will adjust the dataset to make it ready for visualization.
songs_audio_plus_pop_jitter <- songs_audio_plus_pop %>%
  filter(artist_name %in% artists) %>%
  # Songs with a popularity score are the artist's top tracks
  mutate(is_popular = !is.na(popularity)) %>%
  distinct(artist_name, track_name, .keep_all = TRUE) %>%
  mutate(
    is_popular_size = if_else(!is.na(popularity), popularity, 25),
    is_popular_alpha = if_else(!is.na(popularity), 0.8, 0.5)
  ) %>%
  mutate(track_name = str_wrap(track_name, width = 15)) %>%
  # Label only popular, non-explicit tracks with short names
  mutate(popular_track_name = if_else(
    !is.na(popularity) & nchar(track_name) < 20 & !explicit, track_name, ''
  ))
songs_audio_plus_pop_jitter %>%
  ggplot(aes(x = artist_name, y = valence)) +
  geom_jitter(
    aes(
      color = is_popular,
      size = is_popular_size,
      alpha = is_popular_alpha
    ),
    size = 6, # constant size; overrides the size mapping above
    width = 0.2
  ) +
  geom_text_repel(
    aes(label = popular_track_name, x = artist_name, y = valence),
    family = 'Montserrat',
    color = 'gray99',
    size = 5,
    force = 0.6,
    max.iter = 2000,
    box.padding = 0.4,
    point.padding = 0.6,
    min.segment.length = 0.15,
    nudge_y = 0.001,
    hjust = 0.5,
    segment.alpha = 0.6,
    segment.size = 0.6
  ) +
  # Orange point: the artist's average value
  stat_summary(
    fun = mean,
    geom = 'point',
    color = '#FF9F1C',
    size = 5,
    aes(group = artist_name)
  ) +
  scale_color_manual(values = c('#FFD166', '#EF476F')) +
  scale_y_continuous(sec.axis = dup_axis()) +
  coord_flip()
songs_audio_plus_pop_jitter %>%
  ggplot(aes(x = artist_name, y = energy)) +
  geom_jitter(
    aes(
      color = is_popular,
      size = is_popular_size,
      alpha = is_popular_alpha
    ),
    size = 6, # constant size; overrides the size mapping above
    width = 0.2
  ) +
  geom_text_repel(
    aes(label = popular_track_name, x = artist_name, y = energy),
    family = 'Montserrat',
    color = 'gray90',
    size = 6,
    force = 0.6,
    max.iter = 2000,
    box.padding = 0.4,
    point.padding = 0.6,
    min.segment.length = 0.15,
    nudge_y = 0.001,
    hjust = 0.5,
    segment.alpha = 0.6,
    segment.size = 0.6
  ) +
  # Orange point: the artist's average value
  stat_summary(
    fun = mean,
    geom = 'point',
    color = '#FF9F1C',
    size = 5,
    aes(group = artist_name)
  ) +
  scale_color_manual(values = c('#EF476F', '#EF476F')) +
  scale_y_continuous(sec.axis = dup_axis()) +
  coord_flip()
songs_audio_plus_pop_jitter %>%
  ggplot(aes(x = artist_name, y = acousticness)) +
  geom_jitter(
    aes(
      color = is_popular,
      size = is_popular_size,
      alpha = is_popular_alpha
    ),
    size = 6, # constant size; overrides the size mapping above
    width = 0.2
  ) +
  geom_text_repel(
    aes(label = popular_track_name, x = artist_name, y = acousticness),
    family = 'Montserrat',
    color = 'gray90',
    size = 6,
    force = 0.6,
    max.iter = 2000,
    box.padding = 0.4,
    point.padding = 0.6,
    min.segment.length = 0.15,
    nudge_y = 0.001,
    hjust = 0.5,
    segment.alpha = 0.6,
    segment.size = 0.6
  ) +
  # Orange point: the artist's average value
  stat_summary(
    fun = mean,
    geom = 'point',
    color = '#FF9F1C',
    size = 5,
    aes(group = artist_name)
  ) +
  scale_color_manual(values = c('#118AB2', '#06D6A0')) +
  scale_y_continuous(sec.axis = dup_axis()) +
  coord_flip()
songs_audio_plus_pop_jitter %>%
  ggplot(aes(x = artist_name, y = danceability)) +
  geom_jitter(
    aes(
      color = is_popular,
      size = is_popular_size,
      alpha = is_popular_alpha
    ),
    size = 6, # constant size; overrides the size mapping above
    width = 0.2
  ) +
  geom_text_repel(
    aes(label = popular_track_name, x = artist_name, y = danceability),
    family = 'Montserrat',
    color = 'gray90',
    size = 6,
    force = 0.6,
    max.iter = 2000,
    box.padding = 0.4,
    point.padding = 0.6,
    min.segment.length = 0.15,
    nudge_y = 0.001,
    hjust = 0.5,
    segment.alpha = 0.6,
    segment.size = 0.6
  ) +
  # Orange point: the artist's average value
  stat_summary(
    fun = mean,
    geom = 'point',
    color = '#FF9F1C',
    size = 5,
    aes(group = artist_name)
  ) +
  scale_color_manual(values = c('#A5668B', '#EF476F')) +
  scale_y_continuous(sec.axis = dup_axis()) +
  coord_flip()
songs_audio_plus_pop_jitter %>%
  ggplot(aes(x = artist_name, y = loudness)) +
  geom_jitter(
    aes(
      color = is_popular,
      size = is_popular_size,
      alpha = is_popular_alpha
    ),
    size = 6, # constant size; overrides the size mapping above
    width = 0.2
  ) +
  geom_text_repel(
    aes(label = popular_track_name, x = artist_name, y = loudness),
    family = 'Montserrat',
    color = 'gray90',
    size = 6,
    force = 0.6,
    max.iter = 2000,
    box.padding = 0.4,
    point.padding = 0.6,
    min.segment.length = 0.15,
    nudge_y = 0.001,
    hjust = 0.5,
    segment.alpha = 0.6,
    segment.size = 0.6
  ) +
  # Orange point: the artist's average value
  stat_summary(
    fun = mean,
    geom = 'point',
    color = '#FF9F1C',
    size = 5,
    aes(group = artist_name)
  ) +
  scale_color_manual(values = c('#06D6A0', '#EF476F')) +
  scale_y_continuous(sec.axis = dup_axis()) +
  coord_flip()
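The five plots above differ only in the audio feature on the y axis and in the color pair, so the shared part could be wrapped in a small function. The sketch below is one possible way to do that, not the code behind the figures: plot_feature and toy_jitter are hypothetical names, the data are made up, and the track labels drawn with geom_text_repel are left out to keep the example short. It relies on the .data pronoun to look up a column by its name.

```r
library(ggplot2)

# Hypothetical helper: one jitter plot for the audio feature named in
# `feature`, with points colored by whether the song has a popularity score.
plot_feature <- function(data, feature, colors = c("#FFD166", "#EF476F")) {
  ggplot(data, aes(x = artist_name, y = .data[[feature]])) +
    geom_jitter(aes(color = is_popular), size = 6, width = 0.2, alpha = 0.6) +
    # Orange point: the artist's average value for this feature
    stat_summary(
      fun = mean,
      geom = "point",
      color = "#FF9F1C",
      size = 5,
      aes(group = artist_name)
    ) +
    scale_color_manual(values = colors) +
    scale_y_continuous(sec.axis = dup_axis()) +
    labs(y = feature) +
    coord_flip()
}

# Made-up data with the columns the helper expects
toy_jitter <- data.frame(
  artist_name = rep(c("Artist A", "Artist B"), each = 3),
  valence = c(0.2, 0.5, 0.8, 0.1, 0.4, 0.9),
  energy = c(0.3, 0.6, 0.7, 0.2, 0.5, 0.8),
  is_popular = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

p_valence <- plot_feature(toy_jitter, "valence")
p_energy <- plot_feature(toy_jitter, "energy", colors = c("#118AB2", "#06D6A0"))
```

With the real data, a call such as plot_feature(songs_audio_plus_pop_jitter, "loudness") would produce the same kind of plot without repeating the layer definitions.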
As I mentioned previously, popularity scores are only available for each artist's top 10 most popular songs. The popularity of a track is a value between 0 (the least popular) and 100 (the most popular). Spotify uses an algorithm to calculate popularity scores, and the score is heavily influenced by the total number of times a song has been played recently. You can read more about it in this link.
Knowing this fact about how popularity is measured, we can visualize songs and artists that have been popular and played recently.
songs_audio_plus_pop <- songs_audio_plus_pop %>%
  filter(
    !artist_name %in% c(
      'Hatam Asgari',
      'Kaveh Deylami',
      'Nasser Abdollahi',
      'Peyman Yazdanian',
      'Abbas Ghaderi',
      'Mohammad Golriz',
      'Hamid Hami',
      'Koveyti Poor',
      'Mohsen Sharifian',
      'Soheil Nafissi'
    )
  )
songs_audio_plus_pop %>%
  filter(!is.na(popularity)) %>%
  group_by(artist_name) %>%
  # Popularity range and the tracks at both ends of it, per artist
  summarize(
    avg_pop = mean(popularity),
    min_pop = min(popularity),
    max_pop = max(popularity),
    most_popular = track_name[which.max(popularity)],
    least_popular = track_name[which.min(popularity)]
  ) %>%
  # Sort artists by their average popularity
  mutate(artist_name = fct_reorder(artist_name, avg_pop)) %>%
  ggplot(aes(x = min_pop, xend = max_pop, y = artist_name)) +
  geom_dumbbell(
    colour_x = '#ef476f',
    colour_xend = '#118ab2',
    size_x = 7,
    size_xend = 7
  ) +
  geom_text(
    aes(x = min_pop - 1, y = artist_name, label = least_popular),
    size = 7,
    family = 'Montserrat',
    hjust = 1
  ) +
  geom_text(
    aes(x = max_pop + 1, y = artist_name, label = most_popular),
    size = 7,
    family = 'Montserrat',
    hjust = 0
  ) +
  scale_x_continuous(sec.axis = dup_axis()) +
  theme_tufte() +
  theme(
    plot.title = element_text(
      family = 'Montserrat',
      hjust = .5,
      margin = ggplot2::margin(0, 0, 40, 0),
      size = 45
    ),
    plot.subtitle = element_markdown(
      family = 'Montserrat',
      size = 15,
      margin = ggplot2::margin(20, 0, 40, 0),
      hjust = 1
    ),
    axis.text.x = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 20
    ),
    axis.text.y = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 20
    ),
    axis.title.x = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 30
    ),
    plot.caption = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 20),
      size = 20,
      color = 'gray20'
    ),
    axis.title.y = element_blank(),
    plot.background = element_rect(fill = '#FCF0E1'),
    plot.margin = unit(c(1, 1, 1.5, 1.2), "cm")
  )
This plot shows the most popular and the least popular track among each artist's top 10 songs. The artists are sorted by their average popularity.
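The summary behind this plot picks, for each artist, the track name at the position of the highest and lowest popularity score. A minimal sketch of that step on made-up data (toy_popularity and popularity_range are illustrative names, and the values are invented) could look like this:

```r
library(dplyr)
library(tibble)

# Made-up popularity scores for two artists
toy_popularity <- tribble(
  ~artist_name, ~track_name, ~popularity,
  "Artist A",   "Song 1",    40,
  "Artist A",   "Song 2",    55,
  "Artist B",   "Song 3",    12,
  "Artist B",   "Song 4",    30,
  "Artist B",   "Song 5",    21
)

popularity_range <- toy_popularity %>%
  group_by(artist_name) %>%
  summarise(
    avg_pop = mean(popularity),
    # which.max()/which.min() return the row position within the group,
    # which is then used to index the track names
    most_popular = track_name[which.max(popularity)],
    least_popular = track_name[which.min(popularity)],
    .groups = "drop"
  ) %>%
  # Most popular artist (on average) first
  arrange(desc(avg_pop))

popularity_range
```

If several tracks tie for the highest or lowest score, which.max() and which.min() return the first one.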
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/mcnakhaee, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Nakhaee (2020, May 4). Muhammad Nakhaee: The Happiest, Saddest, Most Energetic and Most Popular Persian Singers on Spotify. Retrieved from https://mcnakhaee.com/posts/2020-05-04-persiansongs/
BibTeX citation
@misc{nakhaee2020the,
  author = {Nakhaee, Muhammad Chenariyan},
  title = {Muhammad Nakhaee: The Happiest, Saddest, Most Energetic and Most Popular Persian Singers on Spotify},
  url = {https://mcnakhaee.com/posts/2020-05-04-persiansongs/},
  year = {2020}
}