The Happiest, Saddest, Most Energetic and Most Popular Persian Singers on Spotify

R Visualization

I investigate the difference between audio features of Iranian songs and singers on Spotify.

Muhammad Chenariyan Nakhaee
05-04-2020

Introduction

I am a music lover, and as with my other hobbies, I enjoy applying data science methods to it. A few months ago, I participated in the third week of the TidyTuesday project, where I made a map of Spotify songs based on their audio features and a dimensionality reduction algorithm called UMAP. Since then, I have been using Spotify’s Web API to collect data, and recently I decided to look at some of my favorite Iranian artists and their songs on Spotify. Iranian music spans many genres, and while pop and rap are very popular among the younger generation, I prefer the traditional style. Nevertheless, I have always been curious about how different traditional music and pop music really are. These are a few questions that I would like to answer:

  1. How much do audio features differ among top Persian singers?
  2. What are the most danceable and least danceable Persian songs?
  3. Who is the most popular Persian singer, and what is the most popular song?

The analysis uses the following packages:
library(kableExtra)
library(tidyverse)
library(googlesheets4)
library(tidymodels)
library(gghighlight)
library(hrbrthemes)
library(ggthemes)
library(ggrepel)
library(ggalt)
library(extrafont)
library(ggtext)

Data Collection

I manually compiled a list of Persian singers and collected information about their available songs on Spotify using the spotifyr package, which lets us access Spotify’s API from R. This process was cumbersome, as I did not always get what I was looking for. For instance, the API sometimes returned songs that belonged to an unrelated artist. For each singer, we can only retrieve the 10 most popular songs, which means that the rest of the songs have no popularity scores. In the end, I collected various kinds of information about more than 10,000 songs.
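The collection step for a single singer might look roughly like the sketch below. This is not the original collection script: the helper name `collect_artist_tracks` is hypothetical, and it assumes that Spotify credentials are available in the `SPOTIFY_CLIENT_ID` and `SPOTIFY_CLIENT_SECRET` environment variables, which is where spotifyr looks for them by default.

```r
# Hypothetical helper, not the author's original script.
# Assumes SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET are set in the environment.
collect_artist_tracks <- function(artist) {
  token <- spotifyr::get_spotify_access_token()

  # Audio features for the album tracks of the artist matched by name
  features <- spotifyr::get_artist_audio_features(artist, authorization = token)

  # The lookup is a name search, so report which artist was actually matched
  message("Matched artist: ", paste(unique(features$artist_name), collapse = ", "))

  # Popularity scores are taken from the artist's top tracks only
  top <- spotifyr::get_artist_top_tracks(features$artist_id[1], authorization = token)
  top <- dplyr::select(top, track_id = id, popularity)

  # Left join keeps every track; popularity stays NA outside the top tracks
  dplyr::left_join(features, top, by = "track_id")
}
```

Calling `collect_artist_tracks("Salar Aghili")` with valid credentials would be expected to return one row per track. Comparing the reported artist name with the requested one is a cheap guard against picking up songs from the wrong artist.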

songs_audio_plus_pop <- read_csv('https://raw.githubusercontent.com/mcnakhaee/datasets/master/Persian_Songs_Spotify.csv')

glimpse(songs_audio_plus_pop) 
Rows: 10,632
Columns: 32
$ track_id           <chr> "31iPeC6I0AiRW8InOxNKzm", "4Fi46ha8teWYTw~
$ poet               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ disc_number        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ duration_ms        <dbl> 446880, 851920, 293160, 648720, 273480, 2~
$ explicit           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,~
$ track_name         <chr> "Ghazale Taze", "Ayeeneye Hosn", "Tarke E~
$ artist_name        <chr> "Salar Aghili", "Salar Aghili", "Salar Ag~
$ popularity         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ track_number       <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 9,~
$ album_href         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ album_id           <chr> "6GcmAWrnnMb2BuVriPhBLa", "6GcmAWrnnMb2Bu~
$ album_name         <chr> "Va Eshgh Amad", "Va Eshgh Amad", "Va Esh~
$ album_release_date <chr> "2/3/2020", "2/3/2020", "2/3/2020", "2/3/~
$ album_total_tracks <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ album_release_year <dbl> 2020, 2020, 2020, 2020, 2020, 2019, 2019,~
$ track_href         <chr> "https://api.spotify.com/v1/tracks/31iPeC~
$ danceability       <dbl> 0.437, 0.379, 0.437, 0.488, 0.301, 0.577,~
$ energy             <dbl> 0.390, 0.146, 0.453, 0.138, 0.443, 0.366,~
$ key                <dbl> 0, 5, 5, 2, 0, 0, 6, 9, 9, 2, 2, 5, 9, 7,~
$ loudness           <dbl> -7.170, -10.008, -5.392, -12.287, -5.702,~
$ mode               <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ speechiness        <dbl> 0.0299, 0.0414, 0.0349, 0.0451, 0.0334, 0~
$ acousticness       <dbl> 0.839, 0.970, 0.664, 0.915, 0.657, 0.834,~
$ instrumentalness   <dbl> 3.51e-05, 3.60e-04, 2.07e-03, 6.58e-03, 8~
$ liveness           <dbl> 0.1360, 0.0812, 0.1100, 0.2120, 0.1200, 0~
$ valence            <dbl> 0.3300, 0.3460, 0.5010, 0.4450, 0.4100, 0~
$ tempo              <dbl> 131.913, 105.634, 94.651, 110.967, 148.05~
$ time_signature     <dbl> 3, 4, 5, 5, 1, 3, 4, 3, 4, 5, 1, 4, 4, 3,~
$ key_name           <chr> "C", "F", "F", "D", "C", "C", "F#", "A", ~
$ mode_name          <chr> "minor", "major", "minor", "minor", "mino~
$ key_mode           <chr> "C minor", "F major", "F minor", "D minor~
$ artist_id          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~

Overall Song Features

Apart from variables such as the album that a song belongs to and its date of release, Spotify’s API can give us several features that capture a song’s different audio characteristics.

Spotify’s API documentation gives a full list of these features. However, I am only interested in some of them: valence, energy, tempo, loudness, acousticness, instrumentalness, and danceability.

This excellent visualization inspired me to create a similar plot for some of the most well-known Persian singers and see how their audio features differ from each other.

# 'Morteza Pashaei' appeared twice in the original list; the duplicate is dropped
artists <- c(
  'Sirvan Khosravi', 'Hesameddin Seraj', 'Rastak', 'Shahram Nazeri',
  'Hossein Alizadeh', 'Reza Sadeghi', 'Alireza Eftekhari', 'Mohammadreza Shajarian',
  'Salar Aghili', 'Morteza Pashaei', 'Alireza Ghorbani', 'Homayoun Shajarian',
  'Mohsen Yeganeh', 'Moein', 'Farzad Farzin', 'Babak Jahanbakhsh',
  'Ehsan Khajeh Amiri', 'Siavash Ghomayshi', 'Xaniar Khosravi', 'Tohi',
  'Mohsen Chavoshi', 'Amir Tataloo', 'Hamed Homayoun', 'Kayhan Kalhor'
)

I will plot the average, the minimum, and the maximum value of each feature for each singer. That gives us a good picture of how different their audio characteristics are from each other. However, we must make the right adjustments to the dataset before visualizing it:

  1. We need to transform the original dataset into a long format, which can be done with pivot_longer from the tidyr package.

  2. We should rescale each audio feature to a common range; otherwise, the plot would not make any sense, because features such as tempo and loudness are measured on very different scales.
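The two steps above can be sketched on a toy table before applying them to the real data. The column values below are made up for illustration, and the target range of 0 to 1 is an arbitrary choice for the example.

```r
library(dplyr)
library(tidyr)

# Toy data: values are invented and only illustrate the different scales
toy <- tibble(
  track    = c("a", "b", "c"),
  tempo    = c(60, 120, 180),
  loudness = c(-20, -10, 0)
)

toy_long <- toy %>%
  # Step 2: bring both features onto the same 0-1 range
  mutate(across(c(tempo, loudness), ~ scales::rescale(.x, to = c(0, 1)))) %>%
  # Step 1: one row per (track, feature) pair
  pivot_longer(cols = c(tempo, loudness), names_to = "metric", values_to = "value")

toy_long
```

After rescaling, a tempo of 120 and a loudness of -10 both map to 0.5, so the two features can share one axis.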

order <- c(
  "valence",
  "energy",
  "tempo",
  "loudness",
  "acousticness",
  "instrumentalness",
  "danceability"
)

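scaled_features_long <- songs_audio_plus_pop %>%
  # Rescale every feature in `order` to a common 0-7 range over the whole dataset
  mutate_at(order, scales::rescale, to = c(0, 7)) %>%
  # Keep only tracks that have a popularity score
  filter(!is.na(popularity)) %>%
  filter(artist_name %in% artists) %>%
  mutate(artist_name = factor(artist_name)) %>%
  # Note: instrumentalness is rescaled above but is not pivoted into the long table
  pivot_longer(
    names_to = 'metric',
    cols = c(
      "valence",
      "energy",
      "tempo",
      "loudness",
      "acousticness",
      "danceability"),
    values_to = 'value')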

Now, we can visualize the results for each artist. As mentioned before, I will compare artists by the minimum (red), the average (orange), and maximum (yellow) values of each audio feature in their songs.

ggplot() +
  ### This plots the average of each audio feature
  geom_polygon(
    data = scaled_features_long %>%  group_by(artist_name, metric) %>%
      summarise_at(c("value"), mean) %>%
      arrange(factor(metric, levels = order)) %>%
      ungroup(),
    aes(x = metric, y = value, group = artist_name,),
    alpha = .54,
    size = 1.5,
    show.legend = T,
    fill = '#FF1654'
  ) +
  ### This plots the maximum of each audio feature
  geom_polygon(
    data = scaled_features_long %>%  group_by(artist_name, metric) %>%
      summarise_at(c("value"), max) %>%
      arrange(factor(metric, levels = order)) %>%
      ungroup(),
    aes(x = metric, y = value, group = artist_name,),
    alpha = .44,
    size = 1.5,
    show.legend = T,
    fill = '#FFE066'
  ) +
  ### This plots the minimum of each audio feature
  geom_polygon(
    data = scaled_features_long %>%  group_by(artist_name, metric) %>%
      summarise_at(c("value"), min) %>%
      arrange(factor(metric, levels = order)) %>%
      ungroup(),
    aes(x = metric, y = value, group = artist_name,),
    alpha = .84,
    size = 1.5,
    show.legend = T,
    fill =  "#EF476F"
  ) +
  scale_x_discrete(
    limits = order,
    labels = c(
      "Happy",
      "Energy",
      "Fast",
      "Loud",
      "Acoustic",
      "Instrumental",
      "Danceable"
    )
  ) +
  coord_polar(clip = 'off') +
  theme_minimal() +
  labs(title = "Persian Singers and Their Audio Characteristics",
       caption = 'Source: Spotify \n Visualization: mcnakhaee') +
  ylim(0, 8) +
  facet_wrap( ~ artist_name, ncol = 4) +
  theme(
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(
      family =  'Montserrat',
      size = 13.5,
      margin = ggplot2::margin(30, 0, 20, 0)
    ),
    plot.caption = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 11,
      color = 'grey80'
    ) ,
    text = element_text(family =  'Montserrat'),
    strip.text = element_text(family =  'Montserrat', size = 18),
    strip.text.x = element_text(margin = ggplot2::margin(1, 1, 1, 1, "cm")),
    panel.spacing = unit(3.5, "lines"),
    panel.grid = element_blank(),
    plot.title = element_text(
      family = 'Montserrat',
      hjust = .5,
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 32,
      color = 'gray10'
    )
  )
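The three polygon layers each repeat the same group-and-summarise step. As a rough sketch, the minimum, mean, and maximum could also be computed in a single pass and inspected as a table. The toy `toy_long` table below stands in for `scaled_features_long`, and its values are invented.

```r
library(dplyr)

# Toy stand-in for scaled_features_long; values are made up
toy_long <- tibble(
  artist_name = c("A", "A", "B", "B"),
  metric      = "energy",
  value       = c(1, 3, 2, 6)
)

feature_summary <- toy_long %>%
  group_by(artist_name, metric) %>%
  summarise(
    min  = min(value),
    mean = mean(value),
    max  = max(value),
    .groups = "drop"
  )

feature_summary
```

A table like this makes it easy to check the numbers behind each polygon before plotting them.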

Looking more closely at each audio feature

My first plot is informative, but it only gives an overall picture of the audio features. I would like a more detailed picture of the singers and of the audio features of each of their songs. For this reason, I will also make a separate plot for each audio feature, in which every song and its feature value are shown. I will also mark a few popular songs from each artist with a different color on these plots.

# Set a custom theme for our plots
theme_set(theme_void() +
  theme(
    text = element_text(family =  'Montserrat'),
    axis.text.x = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 0),
      color = 'gray80',
      size = 18
    ),
    axis.text.y = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 20),
      color = 'gray80',
      size = 20
    ),
    axis.title.x = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 22,
      color = 'gray80'
    ),
    plot.title = element_text(
      family = 'Montserrat',
      hjust = .5,
      margin = ggplot2::margin(40, 0, 40, 0),
      size = 35,
      color = 'gray80'
    ),
    plot.caption = element_text(family ='Montserrat',
                                  margin = ggplot2::margin(30, 0, 20, 20),
                                      size = 20,
                                  color = 'gray70') ,
    legend.position = 'none',
    plot.background = element_rect(fill = "#516869")
  ))

As before, I will first prepare the dataset for visualization.

songs_audio_plus_pop_jitter <- songs_audio_plus_pop %>%
  filter(artist_name %in% artists) %>%
  # Only the top tracks of each artist carry a popularity score
  mutate(is_popular = !is.na(popularity)) %>%
  distinct(artist_name, track_name, .keep_all = TRUE) %>%
  mutate(is_popular_size = if_else(is_popular, popularity, 25),
         is_popular_alpha = if_else(is_popular, 0.8, 0.5)) %>%
  mutate(track_name = str_wrap(track_name, width = 15)) %>%
  # Label only popular, non-explicit tracks with short names
  mutate(popular_track_name = if_else(is_popular & nchar(track_name) < 20 & !explicit, track_name, ''))

Happiness

songs_audio_plus_pop_jitter %>%
  ggplot(aes(x = artist_name, y = valence)) +
  geom_jitter(
    aes(
      color = is_popular,
      alpha = is_popular_alpha
    ),
    # The fixed size overrode the mapped size, so the unused mapping is dropped
    size = 6,
    width = 0.2
  ) +
  geom_text_repel(
    aes(label = popular_track_name , x = artist_name , y = valence),
    family = 'Montserrat',
    color = 'gray99',
    size = 5,
    force = 0.6,
    max.iter = 2000,
    box.padding = 0.4,
    point.padding = 0.6,
    min.segment.length = 0.15,
    nudge_y      = 0.001,
    hjust = 0.5,
    segment.alpha = 0.6,
    segment.size = 0.6
  ) +
  stat_summary(
    fun = mean,
    geom = 'point',
    color = '#FF9F1C',
    size = 5,
    aes(group = artist_name)) +
  scale_color_manual(values = c('#FFD166', '#EF476F')) +
  scale_y_continuous(sec.axis = dup_axis()) +
  coord_flip()

Energy

songs_audio_plus_pop_jitter %>%
  ggplot(aes(x = artist_name, y = energy)) +
  geom_jitter(
    aes(
      color = is_popular,
      alpha = is_popular_alpha
    ),
    # The fixed size overrode the mapped size, so the unused mapping is dropped
    size = 6,
    width = 0.2
  ) +
  geom_text_repel(
    aes(label = popular_track_name , x = artist_name , y = energy),
    family = 'Montserrat',
    color = 'gray90',
    size = 6,
    force = 0.6,
    max.iter = 2000,
    box.padding = 0.4,
    point.padding = 0.6,
    min.segment.length = 0.15,
    nudge_y      = 0.001,
    hjust = 0.5,
    segment.alpha = 0.6,
    segment.size = 0.6
  ) +
  stat_summary(
    fun = mean,
    geom = 'point',
    color = '#FF9F1C',
    size = 5,
    aes(group = artist_name)
  ) +
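  # Two distinct colors so popular tracks stand out; the first one is borrowed
  # from the Happiness plot because the original listed the same color twice
  scale_color_manual(values = c('#FFD166', '#EF476F')) +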
  scale_y_continuous(sec.axis = dup_axis()) +
  coord_flip() 

Acousticness

songs_audio_plus_pop_jitter %>%
  ggplot(aes(x = artist_name, y = acousticness)) +
  geom_jitter(
    aes(
      color = is_popular,
      alpha = is_popular_alpha
    ),
    # The fixed size overrode the mapped size, so the unused mapping is dropped
    size = 6,
    width = 0.2
  ) +
  geom_text_repel(
    aes(label = popular_track_name , x = artist_name , y = acousticness),
    family = 'Montserrat',
    color = 'gray90',
    size = 6,
    force = 0.6,
    max.iter = 2000,
    box.padding = 0.4,
    point.padding = 0.6,
    min.segment.length = 0.15,
    nudge_y      = 0.001,
    hjust = 0.5,
    segment.alpha = 0.6,
    segment.size = 0.6
  ) +
  stat_summary(
    fun = mean,
    geom = 'point',
    color = '#FF9F1C',
    size = 5,
    aes(group = artist_name)
  ) +
  scale_color_manual(values = c('#118AB2', '#06D6A0')) +
  scale_y_continuous(sec.axis = dup_axis()) +
  coord_flip() 

Danceability

songs_audio_plus_pop_jitter %>%
  ggplot(aes(x = artist_name, y = danceability)) +
  geom_jitter(
    aes(
      color = is_popular,
      alpha = is_popular_alpha
    ),
    # The fixed size overrode the mapped size, so the unused mapping is dropped
    size = 6,
    width = 0.2
  ) +
  geom_text_repel(
    aes(label = popular_track_name , x = artist_name , y = danceability),
    family = 'Montserrat',
    color = 'gray90',
    size = 6,
    force = 0.6,
    max.iter = 2000,
    box.padding = 0.4,
    point.padding = 0.6,
    min.segment.length = 0.15,
    nudge_y      = 0.001,
    hjust = 0.5,
    segment.alpha = 0.6,
    segment.size = 0.6
  ) +
  stat_summary(
    fun = mean,
    geom = 'point',
    color = '#FF9F1C',
    size = 5,
    aes(group = artist_name)
  ) +
  scale_color_manual(values = c('#A5668B', '#EF476F')) +
  scale_y_continuous(sec.axis = dup_axis()) +
  coord_flip()
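To read the extremes off as a table rather than from the plot, one option is `slice_max()` and `slice_min()` from dplyr. The sketch below runs on a toy table with invented track names and values; on the real data, the same calls would be applied to `songs_audio_plus_pop_jitter`.

```r
library(dplyr)

# Toy stand-in for the song table; names and values are made up
toy_songs <- tibble(
  artist_name  = c("A", "A", "B", "B"),
  track_name   = c("t1", "t2", "t3", "t4"),
  danceability = c(0.81, 0.35, 0.12, 0.60)
)

# Two most danceable tracks, highest first
most_danceable <- toy_songs %>% slice_max(danceability, n = 2)

# Two least danceable tracks, lowest first
least_danceable <- toy_songs %>% slice_min(danceability, n = 2)
```

Both helpers keep ties by default, so the result can contain more than `n` rows when several tracks share the same score.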

Loudness

songs_audio_plus_pop_jitter %>%
  ggplot(aes(x = artist_name, y = loudness)) +
  geom_jitter(
    aes(
      color = is_popular,
      alpha = is_popular_alpha
    ),
    # The fixed size overrode the mapped size, so the unused mapping is dropped
    size = 6,
    width = 0.2
  ) +
  geom_text_repel(
    aes(label = popular_track_name , x = artist_name , y = loudness),
    family = 'Montserrat',
    color = 'gray90',
    size = 6,
    force = 0.6,
    max.iter = 2000,
    box.padding = 0.4,
    point.padding = 0.6,
    min.segment.length = 0.15,
    nudge_y      = 0.001,
    hjust = 0.5,
    segment.alpha = 0.6,
    segment.size = 0.6
  ) +
  stat_summary(
    fun = mean,
    geom = 'point',
    color = '#FF9F1C',
    size = 5,
    aes(group = artist_name)
  ) +
  scale_color_manual(values = c('#06D6A0', '#EF476F')) +
  scale_y_continuous(sec.axis = dup_axis()) +
  coord_flip() 
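The five feature plots differ only in the y variable and the color pair, so they could be generated by one small function. The sketch below is a simplified version under that assumption: `plot_feature` is a hypothetical helper, it leaves out the text labels and the custom theme, and it is demonstrated on a toy table with invented values.

```r
library(ggplot2)

# Hypothetical helper; not part of the original post
plot_feature <- function(data, feature, colors = c('#FFD166', '#EF476F')) {
  ggplot(data, aes(x = artist_name, y = .data[[feature]])) +
    # One point per song, colored by whether it has a popularity score
    geom_jitter(aes(color = is_popular), size = 6, width = 0.2) +
    # Orange point marks the artist's average
    stat_summary(fun = mean, geom = 'point', color = '#FF9F1C', size = 5) +
    scale_color_manual(values = colors) +
    coord_flip() +
    labs(y = feature)
}

# Toy data with invented values
toy <- data.frame(
  artist_name = c("A", "A", "B", "B"),
  valence     = c(0.2, 0.6, 0.4, 0.9),
  is_popular  = c(TRUE, FALSE, FALSE, TRUE)
)

p <- plot_feature(toy, "valence")
```

With the real data, a call such as `plot_feature(songs_audio_plus_pop_jitter, "energy")` would replace one of the repeated blocks, and a different `colors` pair could be passed per feature.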

As I mentioned previously, we can only retrieve the top 10 most popular songs of each artist. The popularity of a track is a value between 0 (the least popular) and 100 (the most popular). Spotify uses an algorithm to calculate popularity scores, and the result is heavily influenced by the total number of times a song has been played recently. Spotify’s API documentation describes this in more detail.
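Because only the top tracks carry a score, it can help to check how many tracks per artist actually have one before comparing artists. A minimal sketch on a toy table with invented values:

```r
library(dplyr)

# Toy stand-in for songs_audio_plus_pop; NA means no popularity score
toy_songs <- tibble(
  artist_name = c("A", "A", "A", "B", "B"),
  popularity  = c(40, NA, 12, NA, 55)
)

popularity_coverage <- toy_songs %>%
  group_by(artist_name) %>%
  summarise(
    n_tracks = n(),                      # all tracks collected for the artist
    n_scored = sum(!is.na(popularity)),  # tracks that have a popularity score
    .groups = "drop"
  )

popularity_coverage
```

On the real dataset, `n_scored` would be expected to be at most 10 per artist if the top-10 limit holds.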

Knowing this fact about how popularity is measured, we can visualize songs and artists that have been popular and played recently.

songs_audio_plus_pop <- songs_audio_plus_pop %>%
  filter(
    !artist_name %in% c(
      'Hatam Asgari',
      'Kaveh Deylami',
      'Nasser Abdollahi',
      'Peyman Yazdanian',
      'Abbas Ghaderi',
      'Mohammad Golriz',
      'Hamid Hami',
      'Koveyti Poor',
      'Mohsen Sharifian',
      'Soheil Nafissi'))
songs_audio_plus_pop %>%
  filter(!is.na(popularity)) %>%
  group_by(artist_name) %>%
  summarize(
    avg_pop = mean(popularity),
    min_pop = min(popularity),
    max_pop = max(popularity),
    most_popular = track_name[which.max(popularity)],
    least_popular = track_name[which.min(popularity)]
  ) %>%
  # Order artists by average popularity for the y-axis
  mutate(artist_name = fct_reorder(artist_name, avg_pop)) %>%
  ggplot(aes(x = min_pop , xend = max_pop, y = artist_name)) +
  geom_dumbbell(
    colour_x = '#ef476f',
    colour_xend = '#118ab2',
    size_x = 7,
    size_xend = 7
  ) +
  geom_text(
    aes(x = min_pop - 1, y = artist_name, label = least_popular),
    size = 7,
    family = 'Montserrat',
    hjust = 1
  ) +
  geom_text(
    aes(x = max_pop + 1, y = artist_name, label = most_popular),
    size = 7,
    family = 'Montserrat',
    hjust = 0
  ) +
  scale_x_continuous(sec.axis = dup_axis()) +
  theme_tufte() +
  theme(
    plot.title = element_text(
      family = 'Montserrat',
      hjust = .5,
      margin = ggplot2::margin(0, 0, 40, 0),
      size = 45
    ),
    plot.subtitle = element_markdown(
      family = 'Montserrat',
      size = 15,
      margin = ggplot2::margin(20, 0, 40, 0),
      hjust = 1
      
    ),
    axis.text.x = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 20
    ),
    
    axis.text.y = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 20
    ),
    axis.title.x = element_text(
      family = 'Montserrat',
      margin = ggplot2::margin(30, 0, 20, 0),
      size = 30
    ),
    plot.caption = element_text(family ='Montserrat',
                                margin = ggplot2::margin(30, 0, 20, 20),
                                size = 20,
                                color = 'gray20') ,
    axis.title.y = element_blank(),
    plot.background = element_rect(fill = '#FCF0E1'),
    plot.margin = unit(c(1, 1, 1.5, 1.2), "cm")
  )

This plot shows the most popular and the least popular track among each artist's top 10 songs. The artists are sorted by their average popularity.
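The third question from the introduction can also be answered as a table. The sketch below ranks artists by average popularity and picks the single most popular track. It uses a toy table with invented names and scores, and average popularity over scored tracks is only one possible definition of "most popular singer".

```r
library(dplyr)

# Toy stand-in for the scored tracks; names and scores are made up
toy_songs <- tibble(
  artist_name = c("A", "A", "B", "B"),
  track_name  = c("t1", "t2", "t3", "t4"),
  popularity  = c(40, 12, 55, 30)
)

# Artist with the highest average popularity
top_artist <- toy_songs %>%
  group_by(artist_name) %>%
  summarise(avg_pop = mean(popularity), .groups = "drop") %>%
  slice_max(avg_pop, n = 1)

# Single most popular track overall
top_song <- toy_songs %>% slice_max(popularity, n = 1)
```

On the real data, rows with a missing popularity would need to be filtered out first, as in the plotting code above.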

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/mcnakhaee, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Nakhaee (2020, May 4). Muhammad Nakhaee: The Happiest, Saddest, Most Energetic and Most Popular Persian Singers on Spotify. Retrieved from https://mcnakhaee.com/posts/2020-05-04-persiansongs/

BibTeX citation

@misc{nakhaee2020the,
  author = {Nakhaee, Muhammad Chenariyan},
  title = {Muhammad Nakhaee: The Happiest, Saddest, Most Energetic and Most Popular Persian Singers on Spotify},
  url = {https://mcnakhaee.com/posts/2020-05-04-persiansongs/},
  year = {2020}
}