Going Back to the Roots! How Much Influence Did Arabic Have on Persian Literature?

Since the conquest of Persia (now Iran) by Muslim forces in the 7th century, Arabic culture and language have had a huge influence on Iran and Iranians. Although Iran never fully adopted Arabic as its main language, the new Persian (Farsi) language is a mix of Arabic and the older Persian (Pahlavi), and the two languages use almost the same alphabet for writing. Also, in some parts of Iran, Arabic is the language of daily life. Over the past 100 years, a few (narrow-minded and mostly racist) scholars have tried to erase Arabic words from Persian literature. Since I was a kid I have always wanted to I put my data science skills and tools to

I decided to start a small project to determine how much influence Arabic has had on Persian literature and poetry over time. To put it simply, my goal is to look at every word used in the poems and determine whether it comes from Arabic or is originally a Persian word. Then I count the occurrences of each group and compute their ratio.

However, this is not an easy task for several reasons. Although determining the origin of a word is not difficult for a well-educated person, there are millions of words in Persian literary works. So, labeling each word manually is not feasible, and I tried smarter (but less accurate) ways. As in many other languages, Persian poetry differs from everyday written or spoken Persian, and therefore common NLP methods are not as effective on it as they usually are.

Ideally, we need a complete dataset of words with Arabic roots that are used in Persian to solve this task. But this dataset does not exist, and I must use other approaches:

1. There are a number of rules and exceptions that can be used to distinguish Persian words from Arabic words. For example, unlike Persian, Arabic does not have the four letters representing “p”, “j” as in Japan (as pronounced in Persian), “g” as in game, and “ch” in its alphabet. It means that any word that contains one of these letters is definitely a non-Arabic word. On the other hand, Persian has no sound of its own for the Arabic ‘th’ letter (and a few other letters), so these letters show up mostly in borrowed words. Therefore, words that contain these letters are likely to be Arabic words.

# fa --> Farsi (Persian)
# ar --> Arabic
# un --> Unknown
# Letters that are likely to mark a word of Arabic origin
ARABIC_LETTERS = ('ث', 'ح', 'ص', 'ض', 'ظ', 'ع', 'ط', 'ق')
# Letters that do not exist in the Arabic alphabet
PERSIAN_LETTERS = ('ژ', 'گ', 'چ', 'پ')

def arabic_word(word):
    # The Arabic letters are checked first, as in the original chain of tests
    if any(letter in word for letter in ARABIC_LETTERS):
        return 'ar'
    if any(letter in word for letter in PERSIAN_LETTERS):
        return 'fa'
    # No rule matched: the origin is unknown
    return 'un'
2. Unfortunately, the rules mentioned above are not comprehensive, and the origin of many words cannot be determined by them. So, I turned to the Python port of the popular langdetect library for help. Here, if the origin of a word cannot be determined by the above rules, I ask this library to identify the language. I should mention that langdetect is sometimes wrong, so the final results might not be 100% accurate.
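The two steps can be combined roughly as follows. This is a minimal sketch rather than the exact code used for the analysis: the names `rule_based_origin`, `langdetect_fallback` and `word_origin` are hypothetical, the letter rules are restated compactly so the block is self-contained, and mapping every langdetect answer other than 'ar' or 'fa' to 'un' is an assumption.

```python
ARABIC_LETTERS = ('ث', 'ح', 'ص', 'ض', 'ظ', 'ع', 'ط', 'ق')
PERSIAN_LETTERS = ('ژ', 'گ', 'چ', 'پ')


def rule_based_origin(word):
    """Apply the letter rules: 'ar', 'fa', or 'un' when no rule matches."""
    if any(letter in word for letter in ARABIC_LETTERS):
        return 'ar'
    if any(letter in word for letter in PERSIAN_LETTERS):
        return 'fa'
    return 'un'


def langdetect_fallback(word):
    """Ask langdetect for a language code (requires `pip install langdetect`)."""
    # Imported lazily so the rule-based part works without the package.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is not deterministic unless seeded
    try:
        code = detect(word)
    except LangDetectException:
        return 'un'
    # Assumption: any answer other than Arabic or Persian counts as unknown.
    return code if code in ('ar', 'fa') else 'un'


def word_origin(word, fallback=None):
    """Use the letter rules first and the fallback only for unknown words."""
    label = rule_based_origin(word)
    if label == 'un' and fallback is not None:
        label = fallback(word)
    return label
```

Calling `word_origin(word, fallback=langdetect_fallback)` mirrors the order described above. Passing the detector in as an argument also makes it easy to swap in another detector or a hand-made word list later.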

I must also mention that I performed a few preprocessing steps such as removing stopwords on the poetry corpus. A few other operations such as stemming could have been performed but my initial assessment was that they might not significantly change the final results. After preprocessing, I stored all the information about the ratio of Arabic and Persian words for each poet in a separate dataset.
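As a rough illustration of the stopword-removal step, a sketch in Python could look like the one below. The regex tokenizer, the function name `preprocess` and the tiny stop list are assumptions made for the example; the actual list and tools used on the corpus are not given here.

```python
import re

# Illustrative stop list only (a handful of very common Persian function words).
STOPWORDS = {'و', 'در', 'به', 'از', 'که'}

# Runs of letters; digits, punctuation and whitespace act as separators.
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def preprocess(text):
    """Split a line of poetry into words and drop the stopwords."""
    tokens = WORD_PATTERN.findall(text)
    return [token for token in tokens if token not in STOPWORDS]


# Opening line of the Shahname
print(preprocess('به نام خداوند جان و خرد'))
```

This simple pattern also splits words at the zero-width non-joiner that Persian uses inside some compound words, so a real pipeline would probably need a tokenizer that is aware of Persian.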

lang_ratio_df <- read_csv('lang_ratio_df.csv')
## Parsed with column specification:
## cols(
##   poet = col_character(),
##   century = col_double(),
##   ar = col_double(),
##   fa = col_double(),
##   ratio = col_double(),
##   period = col_character()
## )
## # A tibble: 6 x 6
##   poet               century    ar    fa ratio period         
##   <chr>                <dbl> <dbl> <dbl> <dbl> <chr>          
## 1 Abusaeid Abolkheir       5  3014  8277 0.364 Khorasani Style
## 2 Ahmad Shamlou           14  8232 28862 0.285 Contemporary   
## 3 Akhavan-Sales           14  3338 14937 0.223 Contemporary   
## 4 Amir Khusrow             8 10582 41997 0.252 Iraqi Style    
## 5 Anvari                   6 29430 67188 0.438 Iraqi Style    
## 6 Artimani                10  2616  7706 0.339 Indian Style
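The ratio column in the table above is consistent with dividing the number of Arabic words by the number of Persian words (for example, 3014 / 8277 ≈ 0.364), not by the total number of labeled words. A minimal Python sketch of that arithmetic, using hypothetical function names, is shown below; the share of all labeled words is included only for comparison.

```python
from collections import Counter


def count_origins(labels):
    """Count how many words were labeled 'ar' and how many 'fa'."""
    counts = Counter(labels)
    return counts['ar'], counts['fa']


def arabic_to_persian_ratio(ar, fa):
    """Arabic words per Persian word, as in the ratio column."""
    if fa == 0:
        raise ValueError('no Persian words to compare against')
    return ar / fa


def arabic_share(ar, fa):
    """Arabic words as a fraction of all words labeled 'ar' or 'fa'."""
    total = ar + fa
    if total == 0:
        raise ValueError('no labeled words')
    return ar / total


# First row of the table: Abusaeid Abolkheir
print(round(arabic_to_persian_ratio(3014, 8277), 3))
print(round(arabic_share(3014, 8277), 3))
```

With these definitions a ratio of 0.364 means roughly one Arabic word for every three Persian words, which is a smaller number than it would be if it were read as a share of all words.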

I visualized the ratio of words for each poet using the ggplot2 package in R.

# Packages assumed from the functions used below:
# tidyverse (dplyr, forcats, ggplot2) and ggthemes (scale_color_tableau, theme_tufte)
library(tidyverse)
library(ggthemes)

lang_ratio_df %>%
  mutate(
    poet = fct_reorder(poet, ratio),
    period = factor(
      period,
      levels = c('Khorasani Style', 'Iraqi Style', 'Indian Style', 'Contemporary')
    )) %>%
  ggplot(aes(x = poet, y = ratio, color = period)) +
  geom_point(size = 4) +
  # Stem of each lollipop, from 0 up to the poet's ratio
  geom_segment(aes(y = 0, yend = ratio, x = poet, xend = poet), size = 1) +
  # Percentage label next to each point
  geom_text(
    aes(x = poet, y = ratio, label = scales::percent(ratio)), size = 5, nudge_y = .2, family = 'Montserrat') +
  labs(x = '', y = '', title = 'The Estimated Ratio of Arabic Words Used by Famous Persian Poets') +
  scale_color_tableau() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  coord_flip() +
  facet_wrap(~ period, scales = "free_y", ncol = 2) +
  theme_tufte() +
  theme(
    text = element_text(family = 'Montserrat'),
    legend.title = element_text(size = 20),
    axis.ticks.x = element_blank(),
    legend.text = element_text(
      size = 15,
      margin = ggplot2::margin(0, 20, 0, 0)),
    plot.title = element_text(
      face = "bold",
      color = 'gray',
      size = 22,
      margin = ggplot2::margin(0, 20, 20, 0),
      hjust = 0.5,
      vjust = 0.5),
    strip.text = element_text(
      color = 'gray80',
      size = 18,
      margin = ggplot2::margin(1, 0, 1, 0)),
    legend.position = 'none',
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 12, color = 'gray'),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.background = element_rect(fill = "black", color = "black"),
    panel.border = element_rect(fill = NA, color = NA))

As you can see above, every poet used a sizable number of Arabic words in his or her work. Most notably, Ferdowsi's Shahname (the Book of Kings), which recounts the myths and legends of Persian kings and heroes and is the oldest piece of poetry analyzed in my experiment, also includes a considerable number of Arabic words. For other top Persian poets such as Hafez, Saadi and Rumi, the estimated ratio of Arabic words is almost 40%-50%.

Nothing shows this better than the following plot, which was made with the ggpage package in R. The plot shows the distribution of words and their origins for several top Persian poets. Note that in this plot I used only a random subset of words from the works of each poet, not their complete works.

sample_poets_df <- read_csv('sample_poets.csv')
## # A tibble: 6 x 4
##   word  lang  poet   century
##   <chr> <chr> <chr>    <dbl>
## 1 بام    fa    Rudaki       3
## 2 اشک    fa    Rudaki       3
## 3 غم     ar    Rudaki       3
## 4 همی    fa    Rudaki       3
## 5 برم    fa    Rudaki       3
## 6 نهانی  fa    Rudaki       3
# ggpage_df and plotcolors are not defined in this excerpt;
# they come from earlier steps of the analysis that are not shown here.
library(ggpage)

ggpage_df %>%
  mutate(poet = fct_reorder(poet, century)) %>%
  ggpage_plot(aes(fill = lang)) +
  labs(title = 'Distribution of Persian and Arabic Words Used by Top Persian Poets', fill = '') +
  scale_fill_manual(values = plotcolors,
                    guide = 'legend',
                    labels = c('Arabic', 'Persian')) +
  facet_wrap(~ poet, nrow = 3) +
  theme(
    strip.text = element_text(
      size = 15,
      face = "bold",
      margin = ggplot2::margin(1, 1, 1, 1, "cm"),
      color = 'white'
    ),
    text = element_text(family = 'Montserrat'),
    legend.position = 'top',
    legend.text = element_text(
      size = 15,
      margin = ggplot2::margin(10, 10, 10, 10)
    ),
    panel.spacing = unit(1, "points"),
    plot.title = element_text(
      face = "bold",
      size = 22,
      margin = ggplot2::margin(30, 0, 30, 0),
      hjust = 0.5
    ),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = '#000F2B'),
    panel.border = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )


How much influence Arabic has had on Persian literature has been one of my questions since I started to read and study literature. Nobody had been able to answer this question, and I could not have answered it without the help of data science.

My analysis shows that the Arabic language has contributed significantly to our literature and culture. In fact, the golden era of Persian poetry can be seen as a result of its integration with Arabic. Persian also made its contribution to the Arabic language and Arabic poetry. So, talking about erasing one language from the other is not helpful or wise and I hope everyone realizes that.

Muhammad Chenariyan Nakhaee
Machine Learning Researcher

I am Muhammad, a data scientist and machine learning enthusiast