The Quran's Statistics
By Shadow Caster 13/02/2009
Note: It has come to my attention that the Quran text used in this article is not entirely accurate.
The analysis has been redone using a better version of the Quran text and with PHP code. Please visit
The Quran's Statistics v2.
Introduction
The Quran is the holy book of the Muslims and they consider its text to be the unparalleled, unadulterated and perfect word of God. The Quran was revealed in the Arabic language over a period spanning 23 years, written on papers as volumes and then compiled into one book. The Quran is often said to be written in classical Arabic, but classical Arabic is not so different from modern Arabic and the text is almost completely understood by a native speaker of the language.
Originally there were no harakat (diacritics) or nuqat (dots) on the letters because there was no set of definitive, standardized rules to apply these to the Arabic language. After some time and some minor evolution, the Arabic language incorporated diacritics and dots which were consequently applied to the text of the Quran. Today, when we open up a copy of the Quran we see it is ornately riddled with all manner of symbols in addition to the 28 letters of the Arabic alphabet.
Believe it or not, not all the Quranic symbols are representable on computers. The Unicode table lacks some accessory diacritics (or I couldn't find them in version 4.0), so the text of the Quran on a computer is slightly different from what is in the printed books. Nonetheless, the required characters are there, but there are further 'complications' concerning Arabic on computers.
The English alphabet is very simple: its characters lack diacritics and letter combinations, and plain 7-bit ASCII can represent all the alphabetic characters in both uppercase and lowercase, as well as numbers and punctuation symbols. Arabic characters can be encoded using Windows-1256 or ISO-8859-6, but these are limited and most Arabic text is encoded as Unicode in UTF-8. Each basic Arabic letter and each diacritic is a separate Unicode code point (stored as two bytes in UTF-8), and some combinations are also represented by a single code point. The diacritics interleaved with the letters make it difficult to analyze the letters of the text individually.
There could be a large number of variations of diacritics on a single word so you cannot do a word search or even search for a phrase like you can in English. One approach (admittedly it's not a complete solution) to solving this problem is to expunge diacritics from the text. Arabic text is next-to-impossible to process unless all the diacritics are removed and some characters are modified. In the examples below the scripts ran on diacritic-free renderings of the text of the Quran.
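To make the search problem concrete, here is a minimal sketch in present-day Perl (it is not one of the original scripts). It assumes the text has already been decoded into characters; the helper names `codepoints` and `strip_harakat` and the sample word are illustrative only. It lists the code points of one vocalised word and checks whether it equals the bare spelling before and after the harakat are stripped.

```perl
use strict;
use warnings;
use utf8;                                  # Arabic literals below are UTF-8 source text
binmode(STDOUT, ':encoding(UTF-8)');

# Return the code points of a decoded string as "U+XXXX" labels.
sub codepoints {
    my ($text) = @_;
    return map { sprintf('U+%04X', ord($_)) } split //, $text;
}

# Remove the eight harakat listed in proc.pl (U+064B fathatan .. U+0652 sukun).
sub strip_harakat {
    my ($text) = @_;
    $text =~ tr/\x{064B}-\x{0652}//d;
    return $text;
}

# kaf, kasra, ta, fatha, alef, ba: the word "kitab" written with harakat
my $vocalised = "\x{0643}\x{0650}\x{062A}\x{064E}\x{0627}\x{0628}";
my $plain     = "كتاب";                    # the same word without harakat

print join(' ', codepoints($vocalised)), "\n";
print "equal before stripping: ", ($vocalised eq $plain ? "yes" : "no"), "\n";
print "equal after stripping:  ",
      (strip_harakat($vocalised) eq $plain ? "yes" : "no"), "\n";
```

The range in `strip_harakat` covers only the eight marks that proc.pl removes; other Quranic annotation marks, if present in a text, would need to be added.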
The scripts displayed below were written in Perl 5.8 on Linux Ubuntu 8.04. The diacritic-containing text was taken from http://www.holyquran.net (which has a nice Arabic text search by the way). The diacritic removal code was originally published on http://www.islamic-dictionary.com/blog/?p=31 by the author. The code on this page and the files are released under the Apache 2.0 License. If you see a bug or logical flaw please inform me. I should point out that some things in the code may look a bit long or odd but the usual reason for this is how Arabic is treated on computers - anyone who writes Arabic on a computer will understand. If you download the development folder you can access all the UTF-8 encoded files without any problem.
General Details
This is all common knowledge:
- There are 114 surahs (chapters) in the Quran.
- There are 30 ajza' (volumes/parts) to the Quran.
- There are 6236 verses in the Quran. This total counts the Bismillah as a verse only in the first chapter (which has 7 verses); counting it as a verse in each of the other 112 chapters that open with it would give 112 + 6236 = 6348.
- The "Bismillah" opening phrase is mentioned at the beginning of 113 Surahs and once in the text of Surat al-Naml, so 114 times in total in the whole Quran.
- The most common print of the Arabic Quran contains approximately 604 pages.
- The longest chapter, Surat al-Baqarah, contains 286 verses.
- The shortest chapter, Surat al-Kawthar, contains 3 verses.
- The longest ayah (verse) is in Surat al-Baqarah, verse 2.282.
- The shortest ayat (verses) are two letters long and are present in numerous surahs like Taha (20.1)
- The shortest ayah (verse) with an actual word is in Surat al-Rahman, verse 55.1.
The things that I independently discovered are listed in the summary at the bottom of the page.
Creating the Text Files
The original Arabic text was manually copied from http://www.holyquran.net and placed into 114 individual files. A script was written to parse the raw text and stick the files into two folders. Folder 'c' contains the text without diacritics and folder 's' contains the text with diacritics. Each verse of each file occupies a line. Be careful of text editors (like Microsoft Notepad) which wrap long lines even with Word Wrap switched off! Notepad also doesn't understand the newline character (\n), so I recommend opening files with Wordpad or your favorite text editor (like Notepad++).
proc.pl
#! /usr/bin/perl -w
#
## @author: Shadow Caster
## @date: Sun 8 Feb 2009
## @version: 0.0
## @license: Apache 2.0
##
## This script goes through the text of the Arabic quran from 114 text files.
## The quran text was taken from http://www.holyquran.net
## The script removes numbering (N.) and empty lines and saves to /s/N.txt files.
## The script then removes diacritics & similar characters and saves to /c/N.txt files.
## The diacritic removal code is based on the script found on
## http://www.islamic-dictionary.com/blog/?p=31
#
#define required variables
@chars = ('َ', 'ً', 'ُ', 'ٌ', 'ِ', 'ٍ', 'ْ', 'ّ');
@achars= ('أَ', 'أُ', 'إِ', 'اً', 'إ', 'آ', 'أ');
#loops through all 114 original files
for($i = 1; $i < 115; $i++){
open(HANDLE, "<", "$i.txt") or die("error opening $i.txt: $!"); #original file
open(SHANDLE, ">", "./s/$i.txt") or die("error opening ./s/$i.txt: $!"); #sorted file
open(CHANDLE, ">", "./c/$i.txt") or die("error opening ./c/$i.txt: $!"); #cleaned file
while(<HANDLE>){
if(/^\s*$/){next;} #skip empty or whitespace-only line
if(/^\s+\d/){next;} #skip numbering
#save to s file
chomp($_);
$_ =~ s/^\s+//g;
$_ =~ s/\s+$//g;
print SHANDLE $_."\n";
#save to c file
foreach $m(@chars){ #remove harakat
$_ =~ s/$m//gi;
}
foreach $m(@achars){#replace all aliph variations
$_ =~ s/$m/ا/gi;
}
$_ =~ s/(ـ)+//gi; #remove multiple char connectors
$_ =~ s/ة/ه/gi; #replace ending taa with ending haa
print CHANDLE $_."\n";
}
close(HANDLE);
close(SHANDLE);
close(CHANDLE);
} #/for
print "All Done! \n";
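The four clean-up steps in proc.pl can also be sketched on decoded characters rather than raw bytes. The version below is an illustration only, not the script that produced the files in this article; the function name `clean_line` is made up, and it assumes each line has been decoded from UTF-8 before the call.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Apply the same four clean-up steps as proc.pl to one decoded line.
sub clean_line {
    my ($line) = @_;
    $line =~ tr/\x{064B}-\x{0652}//d;                 # remove the eight harakat
    $line =~ tr/\x{0622}\x{0623}\x{0625}/\x{0627}/;   # alef with madda/hamza -> bare alef
    $line =~ tr/\x{0640}//d;                          # remove tatweel (the char connector)
    $line =~ tr/\x{0629}/\x{0647}/;                   # taa marbuta -> haa
    return $line;
}

# alef-hamza, fatha, meem, shadda, fatha, taa marbuta
my $sample = "\x{0623}\x{064E}\x{0645}\x{0651}\x{064E}\x{0629}";
print clean_line($sample), "\n";
```

Reading a file with `open(my $in, '<:encoding(UTF-8)', $path)` would deliver lines already decoded, so they could be passed to `clean_line` directly.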
The Number of Verses in the Quran
As mind-boggling as this sounds, most Muslims have no idea how many verses there are in the Quran, and some actually debate the subject with no knowledge. Some people even claim the Quran has 6666 verses, which is a malicious lie meant to discredit Islam in the eyes of Christians, who associate 666 with the mark of the Antichrist, "The Number of the Beast"! In reality, the Quran contains 6236 verses. The script below gives the number of verses in each chapter; you can then use your preferred spreadsheet program to add them all together.
numberofverses.pl
#! /usr/bin/perl -w
#
## @author: Shadow Caster
## @date: Tuesday 10 Feb 2009
## @version: 0.0
## @license: Apache 2.0
##
## This script counts the number of verses in each chapter and saves the results
## to a CSV file
#
open(CSVHANDLE, ">./c/numberofverses.csv") or die("Couldn't open CSV file");
#loop through all the surahs
for($i = 1; $i < 115; $i++){
open(HANDLE, "./c/$i.txt") or die("Couldn't open in file");
$count = 0;
while(<HANDLE>){
$count++;
}
print CSVHANDLE "$i, $count \n";
close(HANDLE);
}
close(CSVHANDLE);
print "All Done! \n";
The results can be found at ./c/numberofverses.csv. The table below is just a preview of the results for the first ten surahs.
| Surah No. | No. of Verses |
|---|---|
| 1 | 7 |
| 2 | 286 |
| 3 | 200 |
| 4 | 176 |
| 5 | 120 |
| 6 | 165 |
| 7 | 206 |
| 8 | 75 |
| 9 | 129 |
| 10 | 109 |
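If you would rather not open a spreadsheet, the per-chapter counts can be added up in Perl as well. The following is a small sketch under the assumption that each row has the "surah, count" shape written by numberofverses.pl; the function name `total_verses` is illustrative. The sample rows are the ten preview rows from the table above.

```perl
use strict;
use warnings;

# Sum the second column of "surah, count" CSV rows.
sub total_verses {
    my (@rows) = @_;
    my $total = 0;
    for my $row (@rows) {
        my ($surah, $count) = split /\s*,\s*/, $row;
        next unless defined $count && $count =~ /^(\d+)/;   # skip malformed rows
        $total += $1;
    }
    return $total;
}

# The first ten surahs, copied from the preview table.
my @preview = ("1, 7", "2, 286", "3, 200", "4, 176", "5, 120",
               "6, 165", "7, 206", "8, 75", "9, 129", "10, 109");
print total_verses(@preview), "\n";   # prints 1473 for these ten rows
```

Feeding it every line of the CSV file instead of the preview rows should reproduce the 6236 total quoted in this article.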
The Number of each Arabic Letter in each Chapter
If you press all the keys on a normal QWERTY keyboard with Arabic enabled you will get 32 letters (including compound letters). This script tallies how many times each of the letters occurs in each surah (chapter). There are 330743 characters (letters) in the Quran.
countletters.pl
#! /usr/bin/perl -w
#
## @author: Shadow Caster
## @date: Monday 9 Feb 2009
## @version: 0.0
## @license: Apache 2.0
##
## This script goes through the text of the diacritic-free Arabic quran from
## 114 text files and counts how many letters of each type are in each surah.
## It saves the results to a CSV file.
#
#The array below lists 33 keyboard characters; ة never matches in the cleaned
#text because proc.pl rewrites it as ه, so at most 32 letters are reported
@chars = qw( ذ ص ث ق ف غ ع ه خ ح ج د ش س ي ب ل ا ت ن م ك ط ئ ء ؤ ر ى ة و ز ظ ض );
#open CSV file
open(CSVHANDLE, ">./c/lettercounts.csv") or die("Couldn't open CSV file");
#loops through all the verses
for($i = 1; $i < 115; $i++){
open(HANDLE, "./c/$i.txt") or die("Couldn't open in file");
#put file's lines into string
$m = "";
while(<HANDLE>){
chomp($_);
$m .= $_;
}#/while
#stick into CSV file
foreach $x(@chars){
$uStr = $m;
$count = 0;
$count++ while $uStr =~ /($x)/gi;
print CSVHANDLE "$i, $x, $count \n" unless $count == 0;
}#/foreach
close(HANDLE);
}#/for
close(CSVHANDLE);
print "All Done! \n";
The results can be found at ./c/lettercounts.csv. The table below is a preview showing the output for the first surah.
| Surah No. | Letter | Frequency |
|---|---|---|
| 1 | ذ | 1 |
| 1 | ص | 2 |
| 1 | ق | 1 |
| 1 | غ | 2 |
| 1 | ع | 6 |
| 1 | ه | 5 |
| 1 | ح | 5 |
| 1 | د | 4 |
| 1 | س | 3 |
| 1 | ي | 14 |
| 1 | ب | 4 |
| 1 | ل | 22 |
| 1 | ا | 26 |
| 1 | ت | 3 |
| 1 | ن | 11 |
| 1 | م | 15 |
| 1 | ك | 3 |
| 1 | ط | 2 |
| 1 | ر | 8 |
| 1 | و | 4 |
| 1 | ض | 2 |
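The same kind of tally can be built in a single pass with a hash instead of one regex scan per letter. This is only a sketch: `letter_tally` is a made-up name, the input is assumed to be decoded and already cleaned, and the character class (U+0621 to U+063A and U+0641 to U+064A, the basic Arabic letters without tatweel) is my assumption about which characters count as letters.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Count every basic Arabic letter in a decoded string in one pass.
sub letter_tally {
    my ($text) = @_;
    my %count;
    $count{$_}++ for $text =~ /[\x{0621}-\x{063A}\x{0641}-\x{064A}]/g;
    return \%count;
}

my $tally = letter_tally("بسم الله الرحمن الرحيم");

# Print letters from most to least frequent, ties in code point order.
for my $letter (sort { $tally->{$b} <=> $tally->{$a} or $a cmp $b } keys %$tally) {
    print "$letter, $tally->{$letter}\n";
}
```

Because spaces and any leftover marks fall outside the character class, they are ignored without a separate removal step.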
The Number of Letters in the whole Quran
This script does what countletters.pl does above, but for the whole Quran, so you can find out how many times each letter occurs in the whole text. The results also confirm that the number of characters in the Quran is 330743. The three most common letters are Aleph, Laam and Nun ( ا ل ن ).
countallletters.pl
#! /usr/bin/perl -w
#
## @author: Shadow Caster
## @date: Monday 9 Feb 2009
## @version: 0.0
## @license: Apache 2.0
##
## This script goes through the text of all the diacritic-free Arabic quran from
## 114 text files and counts how many of each letter there are in all the surahs.
## It saves the results to a CSV file.
#
#The array below lists 33 keyboard characters; ة never matches in the cleaned
#text because proc.pl rewrites it as ه, so at most 32 letters are reported
@chars = qw( ذ ص ث ق ف غ ع ه خ ح ج د ش س ي ب ل ا ت ن م ك ط ئ ء ؤ ر ى ة و ز ظ ض );
#open each file and stick contents into one string
$m = "";
for($i = 1; $i < 115; $i++){
open(HANDLE, "./c/$i.txt") or die("Couldn't open file $i.txt");
while(<HANDLE>){
chomp($_);
$m .= $_;
}
close(HANDLE);
}
#open CSV file to save to
open(CSVHANDLE, ">./c/countallletters.csv") or die("Couldn't open CSV file");
#stick into CSV file
foreach $x(@chars){
$uStr = $m;
$count = 0;
$count++ while $uStr =~ /($x)/gi;
print CSVHANDLE "$x, $count \n" unless $count == 0;
}#/foreach
close(CSVHANDLE);
print "All Done! \n";
The results can be found at ./c/countallletters.csv. The table below lists every letter found, sorted by frequency.
| Letter | Frequency | Frequency as a percentage |
|---|---|---|
| ا | 59290 | 17.93% |
| ل | 38190 | 11.55% |
| ن | 27270 | 8.25% |
| م | 26735 | 8.08% |
| و | 24812 | 7.50% |
| ي | 21975 | 6.64% |
| ه | 17213 | 5.20% |
| ر | 12403 | 3.75% |
| ب | 11491 | 3.47% |
| ت | 10501 | 3.17% |
| ك | 10497 | 3.17% |
| ع | 9405 | 2.84% |
| ف | 8747 | 2.64% |
| ق | 7034 | 2.13% |
| س | 6012 | 1.82% |
| د | 5991 | 1.81% |
| ذ | 4932 | 1.49% |
| ح | 4140 | 1.25% |
| ج | 3317 | 1.00% |
| ى | 2595 | 0.78% |
| خ | 2497 | 0.75% |
| ش | 2124 | 0.64% |
| ص | 2072 | 0.63% |
| ض | 1686 | 0.51% |
| ز | 1599 | 0.48% |
| ء | 1508 | 0.46% |
| ث | 1414 | 0.43% |
| ط | 1273 | 0.38% |
| غ | 1221 | 0.37% |
| ئ | 1159 | 0.35% |
| ظ | 853 | 0.26% |
| ؤ | 787 | 0.24% |
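The percentage column is each count divided by the total of 330743 letters. As a quick check, here is a small Perl sketch of that arithmetic; the function name `percentage` and the transliterated letter labels are illustrative, and the counts are copied from the first three rows of the table.

```perl
use strict;
use warnings;

# Format a count as a percentage of a total, to two decimal places.
sub percentage {
    my ($count, $total) = @_;
    die "total must be positive\n" unless $total > 0;
    return sprintf('%.2f%%', 100 * $count / $total);
}

my $total = 330743;   # total letters reported in this article
my @top   = ([alef => 59290], [lam => 38190], [nun => 27270]);

for my $row (@top) {
    my ($name, $count) = @$row;
    printf "%-5s %6d %s\n", $name, $count, percentage($count, $total);
}
```

Adding up the Frequency column of all 32 rows gives 330743, which matches the total used here.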
The Length of each Chapter in Letters
The script below counts the number of letters of the Arabic alphabet in each chapter.
countchars.pl
#! /usr/bin/perl -w
#Did you know you can install perl modules easily through synaptic package manager?!
#You may need to install this module if you want to run this script
use Unicode::String qw(utf8 latin1 utf16);
#
## @author: Shadow Caster
## @date: Monday 9 Feb 2009
## @version: 0.0
## @license: Apache 2.0
##
## This script goes through the text of the diacritic-free Arabic quran from
## 114 text files and counts how many characters are in each file. It removes
## spaces and gaps first. Only Arabic letters are counted.
## It saves the results to a CSV file.
#
#The CSV file to print results to
open(CSVHANDLE, ">./c/charcount.csv") or die("Failed to open csv file");
#loops through all the surahs
for($i = 1; $i < 115; $i++){
open(HANDLE, "./c/$i.txt") or die("Failed to open in file");
#convert array to scalar
@arr = <HANDLE>;
$str = "@arr";
chomp($str);
#remove spaces
$str =~ s/( |\s)//g;
#interpret string as utf8
$uStr = utf8($str);
#save to CSV file
print CSVHANDLE "$i, ".$uStr->length."\n";
close(HANDLE);
}
close(CSVHANDLE);
print "All Done! \n";
The results can be found at ./c/charcount.csv. A preview of the results for the first ten surahs is shown below.
| Surah No. | No. of letters | No. of letters as Percentage |
|---|---|---|
| 1 | 143 | 0.04% |
| 2 | 26252 | 7.94% |
| 3 | 14987 | 4.53% |
| 4 | 16335 | 4.94% |
| 5 | 12206 | 3.69% |
| 6 | 12727 | 3.85% |
| 7 | 14435 | 4.36% |
| 8 | 5386 | 1.63% |
| 9 | 11116 | 3.36% |
| 10 | 7590 | 2.29% |
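If Unicode::String is not installed, the counting step in countchars.pl can be approximated with the core Encode module. This is a sketch rather than the code that produced the table: `count_chars` is a made-up name, and it assumes the input is the raw UTF-8 bytes of a cleaned file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count the non-whitespace characters in a string of UTF-8 bytes.
sub count_chars {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> characters
    $text =~ s/\s+//g;                    # drop spaces and newlines
    return length($text);                 # length now counts characters, not bytes
}

# Usage against one cleaned file (path as written by proc.pl):
#   open(my $in, '<:raw', './c/1.txt') or die "cannot open: $!";
#   my $bytes = do { local $/; <$in> };
#   print count_chars($bytes), "\n";

# Three two-byte letters, a space, one more letter and a newline.
print count_chars("\xD8\xA8\xD8\xB3\xD9\x85 \xD8\xA7\n"), "\n";
```

Without the decoding step, `length` would count bytes, which is roughly double the number of Arabic letters.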
The total number of letters in the Quran is 330743, calculated as the sum of the letter counts of the 114 chapters. With 114 chapters, that is 330743/114 = 2901.3 letters per chapter on average. With 6236 verses, it is 330743/6236 = 53 letters per verse on average. With 77799 or 77784 words in the Quran (see below), it is 330743/77799 = 4.3 letters per word on average.
The Number of Words in the Quran
The script below counts the number of words in each surah (chapter) based on word-separation by spaces.
countwords.pl
#! /usr/bin/perl -w
#
## @author: Shadow Caster
## @date: Tuesday 10 Feb 2009
## @version: 0.0
## @license: Apache 2.0
##
## This script goes through the text of the diacritic-free Arabic quran from
## 114 text files and counts how many words are in each surah.
## It saves the results to a CSV file.
#
#open output CSV file
open(CSVHANDLE, ">./c/wordcount.csv") or die("Couldn't open CSV file");
#loop through all quran files
for($x = 1; $x < 115; $x++){
open(HANDLE, "./c/$x.txt") or die("Couldn't open in file");
$m = "";
while(<HANDLE>){
chomp($_);
$m .= " ".$_." "; #spaces important to stop words joining
}
close(HANDLE);
$m =~ s/^\s+//g; #remove spaces at beginning
$m =~ s/\s+$//g; #remove spaces at end
$m =~ s/\s+/ /g; #remove multiple spaces
@arr = split(/ /, $m);
print CSVHANDLE "$x, " . @arr . "\n";
}
close(CSVHANDLE);
print "All Done! \n";
The results can be found at ./c/wordcount.csv. This table just previews the first ten surahs.
| Surah No. | No. of words | No. of words as percentage |
|---|---|---|
| 1 | 29 | 0.04% |
| 2 | 6140 | 7.89% |
| 3 | 3499 | 4.50% |
| 4 | 3762 | 4.84% |
| 5 | 2837 | 3.65% |
| 6 | 3057 | 3.93% |
| 7 | 3342 | 4.30% |
| 8 | 1242 | 1.60% |
| 9 | 2505 | 3.22% |
| 10 | 1839 | 2.36% |
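The trim, collapse and split steps in countwords.pl can be folded into Perl's special `split ' '` form, which skips leading whitespace and treats any run of whitespace as one separator. A minimal sketch, with `count_words` as an illustrative name:

```perl
use strict;
use warnings;
use utf8;

# Count whitespace-separated words in a string.
sub count_words {
    my ($text) = @_;
    my @words = split ' ', $text;   # awk-style split: no empty leading field
    return scalar @words;
}

print count_words("بسم الله الرحمن الرحيم"), "\n";
```

Joining the verses of a chapter with a space (or a newline) before calling it keeps the last word of one verse from sticking to the first word of the next, as the original script also takes care to do.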
The total number of words in the Quran is 77799 or 77784, calculated by the sum of the number of words of 114 chapters. I can't work out what I missed or which mistake caused this inconsistency of 15 words. I was unsuccessful in discovering the root of the problem, so if you figure it out then please contact me.
There are 114 chapters in the Quran so there are 77799/114 = 682.4 words per chapter on average. There are 6236 verses in the Quran so there are 77799/6236 = 12.5 words per verse on average.
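The averages quoted in this article are plain divisions of the totals. The sketch below recomputes them to one decimal place; the function name `average` is illustrative, the totals are the ones reported here, and 77799 is used as the word count.

```perl
use strict;
use warnings;

# Divide a total by a number of units and round to one decimal place.
sub average {
    my ($total, $units) = @_;
    return sprintf('%.1f', $total / $units);
}

my %totals = (letters => 330743, words => 77799, verses => 6236, chapters => 114);

print "letters per chapter: ", average($totals{letters}, $totals{chapters}), "\n";
print "letters per verse:   ", average($totals{letters}, $totals{verses}),   "\n";
print "letters per word:    ", average($totals{letters}, $totals{words}),    "\n";
print "words per chapter:   ", average($totals{words},   $totals{chapters}), "\n";
print "words per verse:     ", average($totals{words},   $totals{verses}),   "\n";
```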
The Number of times each Word is mentioned in the Quran
This script finds all the unique words in the Quran and lists how many times each is repeated in the whole text. There are 14716 unique words in the Quran by alphabetic letters, so the percentage of unique words in the Quran is 14716/77799 = 18.9%, by alphabetical letters alone. The top 3 most frequent words in the Quran are min (from), Allah (God) and inna (is, if, ...etc) ( من, الله, ان ).
The script considers words with 'huruf' (grammatical letters) joined to them to be different words, so wa-allah is considered differently to Allah. An analogy in English is the pair betwixt and twixt, which would be treated as separate words even though their root is the same. The only way to calculate this value accurately is by human analysis.
Also, in the Arabic language, a word may have one form but it would be considered two different words because the diacritics ascertain the different meanings - this script does not differentiate between them. Once again, the only way to accurately find this value is by laborious human analysis.
The values discovered by this script should give a general idea of the frequency of particular words in the Quran but should not be considered definitive unless there are no variations of the specific words you are analyzing. For example, a word that occurs only once in the Quran cannot have a different meaning under a different set of harakat, since it is the only existing form of the word in this text.
I recommend using the results of this script along with the word search at http://www.holyquran.net.
wordoccurancecount.pl
#! /usr/bin/perl -w
#
## @author: Shadow Caster
## @date: Tuesday 10 Feb 2009
## @version: 0.0
## @license: Apache 2.0
##
## This script goes through the text of the diacritic-free Arabic quran from
## 114 text files and counts how many times each word is written in the whole
## text of the Quran.
## It saves the results to a CSV file. Please use a spreadsheet program to
## sort the list by the tally count.
## Note: This script takes about 2 minutes to finish on a fast computer.
#
#make a string of all quran
$m = "";
for($i = 1; $i < 115; $i++){
open(HANDLE, "./c/$i.txt") or die("Couldn't open in file $i.txt");
while(<HANDLE>){
chomp($_);
#adding spaces is important otherwise words stick together
# on each verse end
$m .= " ".$_." ";
}
close(HANDLE);
}
#make array of all words by splitting on spaces
$m =~ s/^\s+//g; #remove extra spaces at beginning
$m =~ s/\s+$//g; #remove extra spaces at end
$m =~ s/\s+/ /g; #remove multiple spaces
@arr = split(/ /, $m);
#The associative array in which the CSV data is temporarily stored
%stor = (); #maps each word to its tally
#go through each word and find how many time it occurs in the Quran
#also do the same words without Al- if they have it
foreach $x(@arr){
if(exists($stor{"$x"})){next;} #skip if already in table
my $count = 0;
while($m =~ /\s+$x\s+/g){$count++;}
$stor{"$x"} = $count if $count > 0;
if($x =~ /^ال/){
my $tnuoc = 0;
$x =~ s/^ال//;
if(exists($stor{"$x"})){next;} #skip if already in table
while($m =~ /\s+$x\s+/g){$tnuoc++;}
$stor{"$x"} = $tnuoc if $tnuoc > 0;
}#/if
}#/foreach
#save values to CSV file
open(CSVHANDLE, ">./c/wordoccurancecount.csv") or die("Couldn't open CSV file");
while(($k, $v) = each(%stor)){
print CSVHANDLE "$k, $v \n";
}
close(CSVHANDLE);
print "All Done! \n";
The results can be found at ./c/wordoccurancecount.csv. The table below shows only the top ten but the file contains all the words.
| Word | Frequency | Frequency as percentage |
|---|---|---|
| من | 2764 | 3.55% |
| الله | 2154 | 2.77% |
| ان | 1605 | 2.06% |
| في | 1185 | 1.52% |
| ما | 1011 | 1.30% |
| لا | 812 | 1.04% |
| الذين | 811 | 1.04% |
| الا | 763 | 0.98% |
| على | 670 | 0.86% |
| ولا | 658 | 0.85% |
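A hash gives the same kind of frequency list in one pass over the words, without rescanning the whole text for every word. The sketch below is an alternative for comparison, not the method that produced the table; `word_frequencies` and `regex_count` are illustrative names and the sample text uses placeholder tokens. It also shows one property of the `\s+word\s+` pattern: it cannot match a word that directly follows another copy of itself, because the space between them is consumed by the first match, and it cannot match the very first or last word of a trimmed text. Whether that has anything to do with the 15-word gap mentioned earlier is untested speculation.

```perl
use strict;
use warnings;

# Tally every whitespace-separated word in one pass.
sub word_frequencies {
    my ($text) = @_;
    my %freq;
    $freq{$_}++ for split ' ', $text;
    return \%freq;
}

# Count a word with the same pattern shape as wordoccurancecount.pl.
sub regex_count {
    my ($text, $word) = @_;
    my $count = 0;
    $count++ while $text =~ /\s+\Q$word\E\s+/g;
    return $count;
}

my $text = " x a a y ";   # "a" occurs twice, back to back
my $freq = word_frequencies($text);

print "hash tally for a:  $freq->{a}\n";
print "regex tally for a: ", regex_count($text, "a"), "\n";
```

Sorting the keys of the returned hash by value, for example with `sort { $freq->{$b} <=> $freq->{$a} } keys %$freq`, would give the ranking without a spreadsheet.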
Summary
You will find the raw original Quran text files as copied from the holyquran.net website, the sorted Quran text files, the Perl scripts, the raw result files (CSV), and the spreadsheet files (XLS/ODS/HTML) with some calculations in this zipped archive (click here). I do not claim that everything is perfect, but I have made a solid attempt to be accurate and have double-checked most things, so if you notice anything that I missed then please contact me so I may correct it.
It has not escaped our notice that the data we present here can be used to discredit the beliefs of heretical cults who base their faith on false ideas of the numeric basis of the Quran. We foresee the data being shared with all honesty and being used to ascertain facts, denounce spurious claims and to vanquish myths. If you discover something interesting then please contact us and we might publish it on this site.
- There are 6236 verses in the Quran
- There are 77799 or 77784 words in the Quran (see above)
- There are 14716 unique words in the Quran (by alphabetic analysis alone)
- The longest chapter in the Quran (2) contains 6140 words
- The shortest chapter in the Quran (108) contains 10 words
- The top 3 most frequent words in the Quran are min (from), Allah (God) and inna (is, if, ...etc) ( من, الله, ان )
- The three most common letters in the Quran are Aleph, Laam & Nun ( ا ل ن )
- There are 330743 alphabetic letters in the Quran
- The longest chapter in the Quran (2) contains 26252 letters
- The shortest chapter in the Quran (108) contains 43 letters