Going Back to the Roots! How Much Influence Did Arabic Have on Persian Literature?

Since the conquest of Persia (now Iran) by Muslim forces in the 7th century, Arabic culture and language have had an enormous influence on Iran and Iranians. Although Iran never fully adopted Arabic as its primary language, New Persian (Farsi) is a mix of Arabic and the older Persian (Pahlavi), and it is written in almost the same alphabet as Arabic. In some parts of Iran, Arabic is also the language of daily life. Over the past 100 years, a few (narrow-minded and mostly racist) scholars have tried to erase Arabic words from Persian literature. Since I was a kid, I have always wanted to put my data science skills and tools to

I decided to start a small project to determine how much influence Arabic has had on Persian literature and poetry over time. Simply put, my goal is to look at every word used in the poems and determine whether it comes from Arabic or is originally Persian. I then count the occurrences of each group and compute their ratio.

However, this is not an easy task, for several reasons. Although determining the origin of a single word is not difficult for a well-educated person, doing so manually for every word in a large corpus is not feasible. So I tried smarter (but less accurate) ways to achieve the same goal. As in many other languages, the language of Persian poetry differs from everyday written or spoken Persian, so standard NLP methods are less effective on it.

Ideally, we would need a complete dataset of words with Arabic roots used in Persian to solve this task. However, as far as I know, no such dataset exists, so I had to use other approaches:

1. Some rules and exceptions can be used to distinguish Persian words from Arabic words. For example, unlike Persian, the Arabic alphabet has no letters for “p” (پ), “zh” as in the “s” of “measure” (ژ), “g” as in “game” (گ), and “ch” (چ). Any word that contains one of these four letters is therefore almost certainly not an Arabic word. Conversely, a few letters, such as ث (the Arabic “th”), represent sounds that Persian does not have, and they occur mostly in words borrowed from Arabic. Words that contain these letters are therefore likely to be Arabic.

# fa -> Farsi (Persian)
# ar -> Arabic
# un -> Unknown

# Letters that occur (almost) only in words of Arabic origin.
ARABIC_LETTERS = set('ثحصضظعطق')
# Letters that exist in the Persian alphabet but not in the Arabic one.
PERSIAN_LETTERS = set('ژگچپ')

def arabic_word(word):
    """Guess the origin of a word from its letters alone."""
    # Arabic letters are checked first, so a word containing letters
    # from both sets is labeled 'ar'.
    if any(ch in ARABIC_LETTERS for ch in word):
        return 'ar'
    if any(ch in PERSIAN_LETTERS for ch in word):
        return 'fa'
    # No distinguishing letter found.
    return 'un'
2. Unfortunately, the rules above are not comprehensive and cannot determine the origin of many words. So I turned to the Python port of the langdetect library for help. If the rules cannot determine the origin of a word, I ask this library to identify the language. Note that langdetect can sometimes be wrong, so the final results are not 100% accurate.
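
The two-step procedure described in this list can be sketched as follows. This is a minimal illustration, not the exact code used for the analysis: the names `rule_based_origin`, `detect_with_langdetect`, and `word_origin` are hypothetical, and the fallback assumes the `langdetect` package is installed. The detector is passed in as a parameter so that the rule-based part can be tried without it.

```python
# Letters taken from the rules above; treat both sets as heuristics.
ARABIC_LETTERS = set('ثحصضظعطق')
PERSIAN_LETTERS = set('ژگچپ')


def rule_based_origin(word):
    """Return 'ar', 'fa', or 'un' using only the letter rules."""
    if any(ch in ARABIC_LETTERS for ch in word):
        return 'ar'
    if any(ch in PERSIAN_LETTERS for ch in word):
        return 'fa'
    return 'un'


def detect_with_langdetect(word):
    """Ask langdetect for a language code; 'un' if it cannot decide."""
    # Imported lazily so the rule-based step works without the package.
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException
    try:
        return detect(word)
    except LangDetectException:
        return 'un'


def word_origin(word, detector=detect_with_langdetect):
    """Apply the letter rules first, then fall back to the detector."""
    origin = rule_based_origin(word)
    if origin != 'un':
        return origin
    guess = detector(word)
    # The detector may return other language codes; keep only ar/fa.
    return guess if guess in ('ar', 'fa') else 'un'


# A stub detector is used here, so this call is decided by the letter rules.
print(word_origin('پدر', detector=lambda w: 'un'))
```

By default langdetect is not deterministic, so setting `DetectorFactory.seed` (from the same package) before calling `detect` helps make runs reproducible. Single words give it very little context, which is one reason its guesses can be wrong.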

I should also mention that I performed a few preprocessing steps on the poetry corpus, such as removing stopwords. Other operations, such as stemming, could have been performed, but my initial assessment was that they would not significantly change the final results. After preprocessing, I stored the counts and the ratio of Arabic and Persian words for each poet in a separate dataset.
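
As a rough sketch of this step, the snippet below removes stopwords from a list of tokens and tallies the labels. The stopword list, the `classify` argument, and the function names are placeholders for illustration; the actual stopword list and tokenizer are not given in this post.

```python
from collections import Counter

# Tiny illustrative stopword list (hypothetical; a real list is much longer).
STOPWORDS = {'و', 'در', 'به', 'از', 'که', 'را'}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop stopwords and empty tokens."""
    return [t for t in tokens if t and t not in stopwords]


def tally_origins(tokens, classify):
    """Count the tokens that `classify` labels 'ar' and 'fa'."""
    counts = Counter(classify(t) for t in remove_stopwords(tokens))
    # Tokens with any other label (e.g. 'un') are left out of the tally.
    return {'ar': counts['ar'], 'fa': counts['fa']}


# Toy classifier standing in for the letter rules plus langdetect.
toy_labels = {'عشق': 'ar', 'دل': 'fa', 'جان': 'fa'}
tokens = ['عشق', 'و', 'دل', 'در', 'جان']
print(tally_origins(tokens, lambda t: toy_labels.get(t, 'un')))
```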

library(tidyverse)  # provides read_csv(), %>%, and ggplot2

lang_ratio_df <- read_csv('lang_ratio_df.csv')
head(lang_ratio_df)
## # A tibble: 6 x 6
##   poet               century    ar    fa ratio period         
##   <chr>                <dbl> <dbl> <dbl> <dbl> <chr>          
## 1 Abusaeid Abolkheir       5  3014  8277 0.364 Khorasani Style
## 2 Ahmad Shamlou           14  8232 28862 0.285 Contemporary   
## 3 Akhavan-Sales           14  3338 14937 0.223 Contemporary   
## 4 Amir Khusrow             8 10582 41997 0.252 Iraqi Style    
## 5 Anvari                   6 29430 67188 0.438 Iraqi Style    
## 6 Artimani                10  2616  7706 0.339 Indian Style
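
Judging from the rows above, the `ratio` column is the Arabic count divided by the Persian count (for example, 3014 / 8277 ≈ 0.364), not the share of Arabic words among all classified words. The short check below reproduces it and also computes that share for comparison. The function names are mine and only serve the illustration.

```python
def ar_to_fa_ratio(ar, fa):
    """Arabic count divided by Persian count, as in the `ratio` column."""
    return ar / fa


def arabic_share(ar, fa):
    """Arabic words as a fraction of all words labeled 'ar' or 'fa'."""
    return ar / (ar + fa)


# Counts copied from two rows of the table above.
rows = [('Abusaeid Abolkheir', 3014, 8277), ('Anvari', 29430, 67188)]
for poet, ar, fa in rows:
    print(poet, round(ar_to_fa_ratio(ar, fa), 3), round(arabic_share(ar, fa), 3))
```

Under this reading, a ratio of 0.364 corresponds to Arabic words making up about 27% of the classified words.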

I visualized the ratio for each poet using the ggplot2 package in R.

# scale_color_tableau() and theme_tufte() come from the ggthemes package.
# The mutate(), geom_segment(), geom_text(), and theme() calls are assumed
# from the arguments that were left in the snippet.
lang_ratio_df %>%
  mutate(
    poet = fct_reorder(poet, ratio),
    period = factor(
      period,
      levels = c('Khorasani Style', 'Iraqi Style', 'Indian Style', 'Contemporary')
    )) %>%
  ggplot(aes(x = poet, y = ratio, color = period)) +
  geom_point(size = 4) +
  # Lollipop stems from zero to each poet's ratio
  geom_segment(aes(
    y = 0, yend = ratio, x = poet, xend = poet), size = 1) +
  # Percentage label next to each point
  geom_text(
    aes(x = poet, y = ratio, label = scales::percent(ratio)), size = 5, nudge_y = .2, family = 'Montserrat') +
  labs(x = '', y = '', title = 'The Estimated Ratio of Arabic Words Used by Famous Persian Poets') +
  scale_color_tableau() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  coord_flip() +
  facet_wrap( ~ period, scales = "free_y", ncol = 2) +
  theme_tufte() +
  theme(
    text = element_text(family = 'Montserrat'),
    legend.title = element_text(size = 20),
    axis.ticks.x = element_blank(),
    legend.text = element_text(
      size = 15,
      margin = ggplot2::margin(0, 20, 0, 0)),
    plot.title = element_text(
      face = "bold",
      color = 'gray',
      size = 22,
      margin = ggplot2::margin(0, 20, 20, 0),
      hjust = 0.5,
      vjust = 0.5),
    strip.text = element_text(
      color = 'gray80',
      size = 18,
      margin = ggplot2::margin(1, 0, 1, 0)),
    legend.position = 'none',
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 12, color = 'gray'),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.background = element_rect(fill = "black", color = "black"),
    panel.border = element_rect(fill = NA, color = NA))

As you can see above, every poet used a sizable number of Arabic words in their work. Most notably, Ferdowsi's Shahname (the Book of Kings), which recounts the myths and legends of Persian kings and heroes and is the oldest piece of poetry analyzed in my experiment, also includes a considerable number of Arabic words. For other top Persian poets such as Hafez, Saadi, and Rumi, the estimated ratio is roughly 40% to 50%.

This is best shown by the following plot, made with the ggpage package in R. The plot shows the distribution of words and their origins for several top Persian poets. Note that for this plot I used only a random subset of words from each poet’s works, not their complete poetry.
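
Drawing a random subset of words per poet could look like the sketch below. The sample size, the seed, and the function name are assumptions for illustration; the post does not state how large the subsets were or how they were drawn.

```python
import random


def sample_words(words_by_poet, k, seed=0):
    """Return up to k randomly chosen words per poet (without replacement)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        poet: rng.sample(words, min(k, len(words)))
        for poet, words in words_by_poet.items()
    }


# Toy corpus with a handful of words for a single poet.
corpus = {'Rudaki': ['بام', 'اشک', 'غم', 'همی', 'برم', 'نهانی']}
print(sample_words(corpus, k=3))
```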

sample_poets_df <- read_csv('sample_poets.csv')
head(sample_poets_df)
## # A tibble: 6 x 4
##   word  lang  poet   century
##   <chr> <chr> <chr>    <dbl>
## 1 بام   fa    Rudaki       3
## 2 اشک   fa    Rudaki       3
## 3 غم    ar    Rudaki       3
## 4 همی   fa    Rudaki       3
## 5 برم   fa    Rudaki       3
## 6 نهانی fa    Rudaki       3
# ggpage_df is the ggpage layout data, presumably built from sample_poets_df
# (that step is not shown here). plotcolors is a vector of two fill colors
# defined elsewhere.
ggpage_df %>%
  mutate(poet = fct_reorder(poet, century)) %>%
  ggpage_plot(aes(fill = lang)) +
  labs(title = 'Distribution of Persian and Arabic Words Used by Top Persian Poets', fill = '') +
  scale_fill_manual(values = plotcolors,
                    guide = 'legend',
                    labels = c('Arabic', 'Persian')) +
  facet_wrap(~ poet, nrow = 3) +
  theme(
    strip.text = element_text(
      size = 15,
      face = "bold",
      margin = ggplot2::margin(1, 1, 1, 1, "cm"),
      color = 'white'
    ),
    text = element_text(family = 'Montserrat'),
    legend.position = 'top',
    legend.text = element_text(
      size = 15,
      margin = ggplot2::margin(10, 10, 10, 10)
    ),
    panel.spacing = unit(1, "points"),
    plot.title = element_text(
      face = "bold",
      size = 22,
      margin = ggplot2::margin(30, 0, 30, 0),
      hjust = 0.5
    ),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = '#000F2B'),
    panel.border = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )


How much influence Arabic has had on Persian literature has been one of my questions ever since I started to read and study literature. As far as I know, nobody had answered this question quantitatively, and I could not have answered it without the help of data science.

My analysis shows that the Arabic language has contributed significantly to our literature and culture. The golden era of Persian poetry can be seen as a result of its integration with Arabic. Persian also made its contribution to the Arabic language and Arabic poetry. So, talking about erasing one language from the other is not helpful or wise, and I hope everyone realizes that.

Muhammad Chenariyan Nakhaee
Machine Learning Researcher

I am Muhammad, a data scientist, and a machine learning enthusiast.