Repository: hci-lab/PyQuran Branch: release Commit: 24c70e4e7315 Files: 55 Total size: 1.1 MB Directory structure: gitextract_ie97inb_/ ├── .gitignore ├── CONTRIBUTING.md ├── CodeConventions/ │ ├── README.md │ └── example_google.py ├── DOCUMENTATION.md ├── LICENSE ├── QuranCorpus/ │ └── quran-uthmani.xml ├── README.md ├── __init__.py ├── core/ │ ├── __init__.py │ └── pyquran.py ├── documentation/ │ ├── TODO │ ├── __init__.py │ ├── auto_gen_docs.py │ ├── docs/ │ │ ├── Alphabetical-Systems.md │ │ ├── CONTRIBUTING.md │ │ ├── FAQ.md │ │ ├── Filtering-Special-Recitation-Symbols.md │ │ ├── Home.md │ │ ├── PyQuran-Founders.md │ │ ├── Wiki-Home.md │ │ ├── analysis_tools.md │ │ ├── arabic_tools.md │ │ ├── authors.md │ │ ├── code_conventions.md │ │ ├── dictFrec.md │ │ ├── example_google.md │ │ ├── index.md │ │ ├── maintainers.md │ │ ├── methods guide.md │ │ ├── quran_tools.md │ │ └── quran_tools_template.md │ ├── generate.sh │ ├── git-adding.sh │ ├── mkdocs.yml │ ├── sources/ │ │ ├── analysis_tools_template.md │ │ ├── arabic_tools_template.md │ │ └── quran_tools_template.md │ └── templates/ │ ├── analysis_tools_template.md │ ├── arabic_tools_template.md │ └── quran_tools_template.md ├── testing/ │ ├── run_test.sh │ ├── test_pyquran.py │ ├── test_quran.py │ ├── test_searchHelper.py │ └── test_shape_systems.py └── tools/ ├── AI.py ├── __init__.py ├── arabic.py ├── buckwalter.py ├── error.py ├── filtering.py ├── quran.py ├── searchHelper.py └── shapeHelper.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ .DS_Store umar_test.py .git .project .pydevproject .settings __pycache__ __pycache__/* tools/Umar_test2.ipynb tools/.ipynb_checkpoints/ core/.pyquran.py.swp *.pyc *.txt *.ipyb *.ipynb ================================================ FILE: CONTRIBUTING.md 
================================================
Contributing to PyQuran
=======================

We use GitHub issues for reporting bugs and for feature requests. If you want to give us a hand, you may pick one of the open issues and fix a bug, implement a feature request, or suggest a missing feature.

## Reporting issues
When reporting a bug, open a GitHub issue with the **Bug label** and include as much detail as possible about:

- your operating system.
- your Python version.
- self-contained code that reproduces and demonstrates the bug.

**The issue will be closed if the bug cannot be reproduced.**

## Feature Request
Whenever you think PyQuran is missing a feature, create a GitHub issue with the **Feature Request label**, define precisely what you want, and include enough examples to cover every aspect of the new feature. If you would like to implement it yourself, please read the [Code Contribution](#code-contribution) section.

## Code Contribution
Your code has to meet [these standards](code_conventions.md).

## Contributing Flow
First, fork the project on [GitHub](https://github.com/TahaMagdy/PyQuran/), then create a *feature branch* and start writing your changes. We **DO NOT** accept changes to the *master branch*. Once you are done, push the changes to *your feature branch*, then create a *pull request* with an expressive title and description.

## Commit Messages
**It is important to commit properly**: we expect one commit per logical change. A commit message should describe what has been changed and why, and reference the issues fixed (if any).

**Commit Message Properties**:

1. The first line is the commit title; it should be at most 50 characters long and must be expressive.
2. Keep the second line blank.
3. Wrap all other lines in the message body at 80 columns.
4. Include `Fixes #N`, where _N_ is the issue number the commit fixes, if any.
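
The first three properties can be checked mechanically before committing. The following is only a minimal sketch and is not part of PyQuran: `check_commit_message` is a hypothetical helper, and the limits it uses are the ones listed above.

```python
def check_commit_message(message):
    """Return a list of rule violations for a commit message.

    Hypothetical helper, not part of PyQuran. An empty list means the
    message satisfies the title, blank-line and wrapping rules.
    """
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing title line"]

    problems = []
    # Rule 1: the title is at most 50 characters long.
    if len(lines[0]) > 50:
        problems.append("title is longer than 50 characters")
    # Rule 2: the second line stays blank.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line is not blank")
    # Rule 3: body lines are wrapped at 80 columns.
    for number, line in enumerate(lines[2:], start=3):
        if len(line) > 80:
            problems.append("line %d is longer than 80 columns" % number)
    return problems
```

Calling `check_commit_message(text)` on a draft message returns the list of problems to fix; the `Fixes #N` reference is optional, so the sketch does not check it.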
Commits should look like the following:

```text
explain commit in one line

Body of commit message is a few lines of text, explaining things
in more detail, possibly giving some background about the issue
being fixed, etc.

The body of the commit message **can be several paragraphs**, and
please do proper word-wrap and keep columns shorter than about
80 characters.

Fixes #101
```

## Unit Tests
We write a test module for every PyQuran module under `PyQuran/testing`.

**Naming** If the module is called *X*, then its testing module is called *test_X*. *test_X* must have thorough unit tests for every single function.

**Note** You must run all testing modules before you make any pull request. Pull requests will not be accepted if any test fails, so please run them all first.

================================================
FILE: CodeConventions/README.md
================================================
Code Conventions
================
> This helps everyone read and maintain the code **even when they are maintaining someone else's code**
> *Please stick to the rules.*

## Rules:
* A line **must not** exceed *80 characters* in length.
* Use **Spaces** not **Tabs**.
* Always refer back to the `example_google.py` file.
* We disagree with `example_google.py` on variable naming ONLY,
  and **we agree with everything else in it**.

## Naming:
* **Class Name**: [PascalCase](https://en.wikipedia.org/wiki/PascalCase): initial letter is **upper case**
    * *Examples*: `Class, NewClass, ...`
* **Function**: [snake_case](https://en.wikipedia.org/wiki/Snake_case): lowercase, underscore-separated names.
    * *Examples*: `foo, foo_name, ...`
* **Variables**: [lowerCamelCase](https://en.wikipedia.org/wiki/Camel_case): initial letter is **lower case** and the rest is PascalCase.
    * *Examples*: `variable, variableName, ...`

## Function prototypes
* Functions should have a description followed by sections, as in the following example.
* You don't need to include every section, but include whatever makes the function as clear as possible.
* **Function prototypes are also used for proposed functions**.

```python
def function_with_types_in_docstring(param1, param2):
    """Here you write a rigorous description of the function

    Args:
        param1 (int): The first parameter.
        param2 (str): The second parameter.

    Returns:
        bool: The return value. True for success, False otherwise.

    Note:
        Do not include the `self` parameter in the ``Args`` section.
    """
    pass  # in case it is just a prototype (not implemented yet)
```
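
As a further illustration, the sketch below combines the three naming rules with the docstring layout. `VerseCounter` and `count_words` are hypothetical names chosen for this example only; they are not part of PyQuran.

```python
class VerseCounter:  # class name: PascalCase
    """Counts words in a list of verses (hypothetical example class)."""

    def count_words(self, verseList):  # function: snake_case
        """Return the total number of whitespace-separated words.

        Args:
            verseList (list of str): The verses to count.

        Returns:
            int: The total number of words in all verses.
        """
        wordCount = 0  # variable: lowerCamelCase
        for verse in verseList:
            wordCount += len(verse.split())
        return wordCount
```

Every line stays under 80 characters and is indented with spaces, as the rules require.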

================================================ FILE: CodeConventions/example_google.py ================================================ # -*- coding: utf-8 -*- """Example Google style docstrings. This module demonstrates documentation as specified by the `Google Python Style Guide`_. Docstrings may extend over multiple lines. Sections are created with a section header and a colon followed by a block of indented text. Example: Examples can be given using either the ``Example`` or ``Examples`` sections. Sections support any reStructuredText formatting, including literal blocks:: $ python example_google.py Section breaks are created by resuming unindented text. Section breaks are also implicitly created anytime a new section starts. Attributes: module_level_variable1 (int): Module level variables may be documented in either the ``Attributes`` section of the module docstring, or in an inline docstring immediately following the variable. Either form is acceptable, but the two should not be mixed. Choose one convention to document module level variables and be consistent with it. Todo: * For module TODOs * You have to also use ``sphinx.ext.todo`` extension .. _Google Python Style Guide: http://google.github.io/styleguide/pyguide.html """ module_level_variable1 = 12345 module_level_variable2 = 98765 """int: Module level variable documented inline. The docstring may span multiple lines. The type may optionally be specified on the first line, separated by a colon. """ def function_with_types_in_docstring(param1, param2): """Example function with types documented in the docstring. `PEP 484`_ type annotations are supported. If attribute, parameter, and return types are annotated according to `PEP 484`_, they do not need to be included in the docstring: Args: param1 (int): The first parameter. param2 (str): The second parameter. Returns: bool: The return value. True for success, False otherwise. .. 
_PEP 484: https://www.python.org/dev/peps/pep-0484/ """ def function_with_pep484_type_annotations(param1: int, param2: str) -> bool: """Example function with PEP 484 type annotations. Args: param1: The first parameter. param2: The second parameter. Returns: The return value. True for success, False otherwise. """ def module_level_function(param1, param2=None, *args, **kwargs): """This is an example of a module level function. Function parameters should be documented in the ``Args`` section. The name of each parameter is required. The type and description of each parameter is optional, but should be included if not obvious. If \*args or \*\*kwargs are accepted, they should be listed as ``*args`` and ``**kwargs``. The format for a parameter is:: name (type): description The description may span multiple lines. Following lines should be indented. The "(type)" is optional. Multiple paragraphs are supported in parameter descriptions. Args: param1 (int): The first parameter. param2 (:obj:`str`, optional): The second parameter. Defaults to None. Second line of description should be indented. *args: Variable length argument list. **kwargs: Arbitrary keyword arguments. Returns: bool: True if successful, False otherwise. The return type is optional and may be specified at the beginning of the ``Returns`` section followed by a colon. The ``Returns`` section may span multiple lines and paragraphs. Following lines should be indented to match the first line. The ``Returns`` section supports any reStructuredText formatting, including literal blocks:: { 'param1': param1, 'param2': param2 } Raises: AttributeError: The ``Raises`` section is a list of all exceptions that are relevant to the interface. ValueError: If `param2` is equal to `param1`. """ if param1 == param2: raise ValueError('param1 may not be equal to param2') return True def example_generator(n): """Generators have a ``Yields`` section instead of a ``Returns`` section. 
Args: n (int): The upper limit of the range to generate, from 0 to `n` - 1. Yields: int: The next number in the range of 0 to `n` - 1. Examples: Examples should be written in doctest format, and should illustrate how to use the function. >>> print([i for i in example_generator(4)]) [0, 1, 2, 3] """ for i in range(n): yield i class ExampleError(Exception): """Exceptions are documented in the same way as classes. The __init__ method may be documented in either the class level docstring, or as a docstring on the __init__ method itself. Either form is acceptable, but the two should not be mixed. Choose one convention to document the __init__ method and be consistent with it. Note: Do not include the `self` parameter in the ``Args`` section. Args: msg (str): Human readable string describing the exception. code (:obj:`int`, optional): Error code. Attributes: msg (str): Human readable string describing the exception. code (int): Exception error code. """ def __init__(self, msg, code): self.msg = msg self.code = code class ExampleClass(object): """The summary line for a class docstring should fit on one line. If the class has public attributes, they may be documented here in an ``Attributes`` section and follow the same formatting as a function's ``Args`` section. Alternatively, attributes may be documented inline with the attribute's declaration (see __init__ method below). Properties created with the ``@property`` decorator should be documented in the property's getter method. Attributes: attr1 (str): Description of `attr1`. attr2 (:obj:`int`, optional): Description of `attr2`. """ def __init__(self, param1, param2, param3): """Example of docstring on the __init__ method. The __init__ method may be documented in either the class level docstring, or as a docstring on the __init__ method itself. Either form is acceptable, but the two should not be mixed. Choose one convention to document the __init__ method and be consistent with it. 
Note: Do not include the `self` parameter in the ``Args`` section. Args: param1 (str): Description of `param1`. param2 (:obj:`int`, optional): Description of `param2`. Multiple lines are supported. param3 (:obj:`list` of :obj:`str`): Description of `param3`. """ self.attr1 = param1 self.attr2 = param2 self.attr3 = param3 #: Doc comment *inline* with attribute #: list of str: Doc comment *before* attribute, with type specified self.attr4 = ['attr4'] self.attr5 = None """str: Docstring *after* attribute, with type specified.""" @property def readonly_property(self): """str: Properties should be documented in their getter method.""" return 'readonly_property' @property def readwrite_property(self): """:obj:`list` of :obj:`str`: Properties with both a getter and setter should only be documented in their getter method. If the setter method contains notable behavior, it should be mentioned here. """ return ['readwrite_property'] @readwrite_property.setter def readwrite_property(self, value): value def example_method(self, param1, param2): """Class methods are similar to regular functions. Note: Do not include the `self` parameter in the ``Args`` section. Args: param1: The first parameter. param2: The second parameter. Returns: True if successful, False otherwise. """ return True def __special__(self): """By default special members with docstrings are not included. Special members are any methods or attributes that start with and end with a double underscore. Any special member with a docstring will be included in the output, if ``napoleon_include_special_with_doc`` is set to True. This behavior can be enabled by changing the following setting in Sphinx's conf.py:: napoleon_include_special_with_doc = True """ pass def __special_without_docstring__(self): pass def _private(self): """By default private members are not included. Private members are any methods or attributes that start with an underscore and are *not* special. By default they are not included in the output. 
This behavior can be changed such that private members *are* included by changing the following setting in Sphinx's conf.py:: napoleon_include_private_with_doc = True """ pass def _private_without_docstring(self): pass
================================================
FILE: DOCUMENTATION.md
================================================
# Documentation

* [Features](#features)
* [Important information](#important-information)
* [Usage](#usage)
* [Functions](#functions)
    * [Access functions](#access-functions)
        * [get_sura](#get_sura)
        * [get_verse](#get_verse)
        * [get_token](#get_token)
        * [get_sura_number](#get_sura_number)
        * [get_sura_name](#get_sura_name)
        * [get_verse_count](#get_verse_count)
    * [Manipulate functions](#manipulate-functions)
        * [separate_token_with_diacritics](#separate_token_with_diacritics)
        * [get_tashkeel_binary](#get_tashkeel_binary)
        * [unpack_alef_mad](#unpack_alef_mad)
        * [shape](#shape)
        * [check_system](#check_system)
        * [check_all_alphabet](#check_all_alphabet)
        * [buckwalter_transliteration](#buckwalter_transliteration)
        * [extract_tashkeel](#extract_tashkeel)
    * [Analysis functions](#analysis-functions)
        * [count_shape](#count_shape)
        * [count_token](#count_token)
        * [frequency_of_character](#frequency_of_character)
        * [generate_frequency_dictionary](#generate_frequency_dictionary)
        * [sort_dictionary_by_similarity](#sort_dictionary_by_similarity)
        * [check_sura_with_frequency](#check_sura_with_frequency)
        * [generate_latex_table](#generate_latex_table)
    * [Search functions](#search-functions)
        * [search_sequence](#search_sequence)
        * [search_string_with_tashkeel](#search_string_with_tashkeel)
        * [search_with_pattern](#search_with_pattern)

# Features

* Access the Holy Quran:
    - get a **Chapter** with/without diacritics.
    - get a **Verse** with/without diacritics.
    - get a **Token** (word).
    - get a **Chapter name** or **Chapter number**.
    - get the **number of verses** in a chapter.
* Manipulate the Holy Quran:
    - Separate text into **letters** with/without diacritics.
    - Apply your own alphabet **System** to the Quran.
    - get the **Binary representation** of the Holy Quran as 0's and 1's.
    - Extract **Tashkeel** from a sentence.
    - Deal with linguistic rules such as:
        - Converting Alef-mad **"آ"** to "أَأْ"
    - Convert **Unicode Arabic** text to **Buckwalter encoding** and vice versa.
    - Convert the Quran to its **Buckwalter representation** and vice versa.
* Analyze the Holy Quran:
    - get the **Frequency Matrix** of letters, depending on the applied _alphabet system_.
    - get the **Frequency dictionary** of tokens.
    - sort the **Frequency dictionary** using a similarity threshold.
* Search in the Holy Quran using:
    - **Text**, with a variety of options.
    - a **diacritics pattern**.
    - a **binary representation pattern** with a threshold.

# Important information

* Note that all verse/chapter/token numbers start at **1**, not **0**.

#### AlphaSystem:
An alphabet system is a collection of letter groups that you can apply to the Quran as needed, so that several characters are treated as one character. For example:

```python
system = [['أ','آ','إ'],['ت','ب']]
```

Here **['أ','آ','إ']** is treated as one character and **['ت','ب']** as another, while **each remaining character** is treated on its own. The system applies to all functions in **PyQuran**: counting, searching, filtering, etc. The default system used by the library treats every character separately. The library also provides some **pre-defined** groups that you can use to define your own system; import **systems** to use them.

* pre-defined:
    * withoutDotSystem (treats all characters that have a dot as one)
    * hamazatSystem (treats all characters that have a hamza as one)

```python
from pyquran import systems
system = [['ت','ب'], systems.hamazatSystem]
```

# Usage
```python
import PyQuran as pq
```

# Functions

## Access functions:

#### get_sura
**get_sura(chapter_num,with_tashkeil)**
- takes **chapter_num** (the number of the sura) and returns the **list of the chapter's verses**. **with_tashkeil (optional)** is the diacritics option: if **_True_**, the chapter is returned with diacritics; if **False**, without. The default is _False_.

```python
sura = pq.get_sura(108,True)
print(sura)
>>> ['إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ', 'فَصَلِّ لِرَبِّكَ وَانْحَرْ', 'إِنَّ شَانِئَكَ هُوَ الْأَبْتَرُ']
```

#### get_verse
**get_verse(chapter_num,verse_num,with_tashkeel)**
- takes **chapter_num** and **verse_num** and returns the **verse content**. **with_tashkeil (optional)** is the diacritics option: if **_True_**, the verse is returned with diacritics; if **False**, without. The default is _False_.

```python
ayahText=pq.get_verse(110,1,True)
print(ayahText)
>>> إِذَا جَاءَ نَصْرُ اللَّهِ وَالْفَتْحُ
```

#### get_token
**get_token(token_num , verse_num , chapter_num , with_tashkeel)**
- takes **token_num** (the position of the token), **verse_num** and **chapter_num** and returns the **token**. **with_tashkeil (optional)** is the diacritics option: if **_True_**, the token is returned with diacritics; if **False**, without. The default is _False_.

```python
tokenText = pq.get_token(4,1,114,True)
print(tokenText)
>>> النَّاسِ
```

#### get_sura_number
**get_sura_number(chapter_name)**
- takes the name of a chapter and returns its number.

```python
suraNumber = pq.get_sura_number('الملك')
print(suraNumber)
>>> 67
```

#### get_sura_name
**get_sura_name(chapter_num)**
- takes the number of a chapter and returns its name.

```python
suraName = pq.get_sura_name(67)
print(suraName)
>>> الملك
```

#### get_verse_count
**get_verse_count(chapter)**
- takes a **chapter** and returns its number of verses.

```python
numberOfAyat = pq.get_verse_count(pq.get_sura(110,True))
print(numberOfAyat)
>>> 3
```

## Manipulate functions:

#### separate_token_with_diacritics
**separate_token_with_diacritics(sentence)**
- takes a **sentence** and separates it into characters together with their diacritics.

```python
wordSeparated = pq.separate_token_with_dicrites('إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ')
print(wordSeparated)
>>> ['إِ', 'نَّ', 'ا', ' ', 'أَ', 'عْ', 'طَ', 'يْ', 'نَ', 'كَ', ' ', 'ا', 'لْ', 'كَ', 'وْ', 'ثَ', 'رَ']
```

#### get_tashkeel_binary
**get_tashkeel_binary(verse)**
- takes verse or chapter text with diacritics and returns a tuple: the mapping of the **characters with diacritics** to **0's and 1's**, where a **harakah** is represented as **1** and a **sukun** as **0**, followed by the list of diacritics.

```python
pattern = pq.get_tashkeel_binary('إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ')
print(pattern)
>>> ('1010 101011 001011', ['ِ', 'ْ', 'َ', '', '', 'َ', 'ْ', 'َ', 'ْ', 'َ', 'َ', '', '', 'ْ', 'َ', 'ْ', 'َ', 'َ'])
```

#### unpack_alef_mad
**unpack_alef_mad(ayahWithAlefMad)**
- takes **ayahWithAlefMad** (a sentence that contains Alef-mad) and returns the sentence after replacing each **Alef-mad** with **Alef-hamza-above + fatha** and **Alef-hamza-above + sukun**.

```python
unpackAlefMad = pq.unpack_alef_mad('آ')
print(unpackAlefMad)
>>> 'أْأَ'
```

#### shape
**shape(system)**
- takes a **system** (a new alphabet system). A system is "**a list of lists**" in which every "**inner list**" is treated as one character. The function returns a dictionary that assigns the same value to every letter of a group and different values to the remaining letters. See more details [here](#important-information).

```python
newSystem = [[pq.beh, pq.teh], [pq.jeem, pq.hah, pq.khah]]
updatedSystem = pq.shape(newSystem)
print(updatedSystem)
>>> {'ب': 0, 'ت': 0, 'ث': 0, 'ج': 1, 'ح': 1, 'خ': 1, 'ء': 2, 'آ': 3, 'أ': 4, 'ؤ': 5, 'إ': 6, 'ئ': 7, 'ا': 8, 'ة': 9, 'د': 10, 'ذ': 11, 'ر': 12, 'ز': 13, 'س': 14, 'ش': 15, 'ص': 16, 'ض': 17, 'ط': 18, 'ظ': 19, 'ع': 20, 'غ': 21, 'ف': 22, 'ق': 23, 'ك': 24, 'ل': 25, 'م': 26, 'ن': 27, 'ه': 28, 'و': 29, 'ى': 30, 'ي': 31, ' ': 70}
```

#### check_all_alphabet
**check_all_alphabet(system)**
- takes a **system** and returns the characters of the default alphabet that are not included in **system**.

```python
system = [[pq.beh, pq.teh], [pq.jeem, pq.hah, pq.khah]]
rest = pq.check_all_alphabet(system)
print(rest)
>>> ['ء', 'آ', 'أ', 'ؤ', 'إ', 'ئ', 'ا', 'ة', 'ث', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ى', 'ي']
```

#### check_system
**check_system(system, indx=None)**
- takes a **system** and returns the main system after the new system has been applied. It also takes an optional **index**, which returns one specific collection from the main system.

```python
# without index
system = [[pq.beh, pq.teh], [pq.jeem, pq.hah, pq.khah]]
rest = pq.check_system(system)
print(rest)
>>> [['ب', 'ت'], ['ج', 'ح', 'خ'], ['ء'], ['آ'], ['أ'], ['ؤ'], ['إ'], ['ئ'], ['ا'], ['ة'], ['ث'], ['د'], ['ذ'], ['ر'], ['ز'], ['س'], ['ش'], ['ص'], ['ض'], ['ط'], ['ظ'], ['ع'], ['غ'], ['ف'], ['ق'], ['ك'], ['ل'], ['م'], ['ن'], ['ه'], ['و'], ['ى'], ['ي']]

# with index
system = [[pq.beh, pq.teh], [pq.jeem, pq.hah, pq.khah]]
rest = pq.check_system(system,index=1)
print(rest)
>>> ['ج', 'ح', 'خ']
```

#### buckwalter_transliteration
**buckwalter_transliteration(sentence, reverse)**
- takes a **sentence** and **reverse (optional)**, the translation option: if **True**, it converts **sentence** from Arabic to Buckwalter; if **False (default)**, it converts **sentence** from Buckwalter to Arabic.

##### **note**: the encoding with **diacritics** differs from the encoding **without diacritics**.

```python
buckwalterEncode = pq.buckwalter_transliteration('إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ')
print(buckwalterEncode)
>>> aEoTayonaka Alokawovara
```

#### extract_tashkeel
**extract_tashkeel(sentence)**
- takes a **sentence** and returns only the tashkeel, without the characters.

**Coming soon (Taha Magedy's note)**

## Analysis functions:

#### count_shape
**count_shape(text, system=None)**
- takes **text** (a chapter or verse) and **system (optional)**, the character groups to count as one shape, for example [[beh, jeem]]. It returns an **n*p matrix**, where **n** is the number of verses and **p** is the number of collections in the system. If no system is passed, the default one is applied.

```python
newSystem=[[pq.beh, pq.teh, pq.theh], [pq.jeem, pq.hah, pq.khah]]
alphabetAsOneShape =pq.count_shape(pq.get_sura(110), newSystem)
print(alphabetAsOneShape)
>>> [[1 2 1 0 0 0 1 0 4 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 3 0 1 1 1 0 0]
 [1 2 0 0 2 0 0 0 5 0 2 0 1 0 1 0 0 0 0 0 0 0 2 0 0 4 0 3 1 3 1 3]
 [6 2 0 0 0 0 1 0 4 0 1 0 2 0 2 0 0 0 0 0 0 1 2 0 2 0 1 2 2 2 0 0]]
```

#### count_token
**count_token(text)**
- takes **text** (a chapter or verse) and returns the number of tokens.

###### ***note***: the harf ('و') is not counted as a token on its own

```python
numberOfToken=pq.count_token(pq.get_sura(110))
print(numberOfToken)
>>> 19
```

#### frequency_of_character
**frequency_of_character(characters,verse=None,chapterNum=0,verseNum=0,with_tashkeel=False)**
- takes the **characters** you want to count and returns a dictionary of their occurrence counts in a verse, a chapter, or the whole Quran. Each key is a character and each value is that character's number of occurrences.
- optional options:
    - **verse** (str): if passed, the count is applied to this string only.
    - **chapterNum** (int): if passed alone, the count is applied to this chapter only.
    - **verseNum** (int):
        - if passed alone, the count is applied to verse **verseNum** of **all chapters**.
        - if passed with **chapterNum**, the count is applied to verse verseNum of **chapterNum**.
    - **with_tashkeel** (bool):
        - if **True**, applied to the Quran **with** tashkeel.
        - if **False**, applied to the Quran **without** tashkeel.
- Note: if you don't pass any of the **optional options**, the count is applied to the whole **Quran**.

```python
frequencyOfChar =pq.frequency_of_character(['أ','ب'],'قل أعوذ برب الناس',114,1)
print(frequencyOfChar)
>>> {أ:1,ب:2}
```

#### generate_frequency_dictionary
**generate_frequency_dictionary(suraNumber=None)**
- takes **suraNumber (optional)**, the number of a chapter, and returns a dictionary with each **word** as key and its **frequency** as value. If **suraNumber** is not passed, it is applied to the **whole Quran**.

```python
dictionaryFrequency = pq.generate_frequency_dictionary(114)
print(dictionaryFrequency)
>>> {'الناس': 4, 'من': 2, 'قل': 1, 'أعوذ': 1, 'برب': 1, 'ملك': 1, 'إله': 1, 'شر': 1, 'الوسواس': 1, 'الخناس': 1, 'الذى': 1, 'يوسوس': 1, 'فى': 1, 'صدور': 1, 'الجنة': 1, 'والناس': 1}
```

#### sort_dictionary_by_similarity
**sort_dictionary_by_similarity(frequency_dictionary,threshold=0.8)**
- used to **cluster words by similarity**: it sorts the words within each cluster by frequency and, at the same time, sorts the clusters in descending order. It takes the frequency dictionary generated by the [generate_frequency_dictionary](#generate_frequency_dictionary) function.
This function takes a dictionary of frequencies and an optional **threshold** that specifies **the degree of similarity**.

```python
sortedDictionary = pq.sort_dictionary_by_similarity(dictionaryFrequency)
print(sortedDictionary)
>>> {'الناس': 4, 'الخناس': 1, 'والناس': 1, 'من': 2, 'قل': 1, 'أعوذ': 1, 'برب': 1, 'ملك': 1, 'إله': 1, 'شر': 1, 'الوسواس': 1, 'الذى': 1, 'يوسوس': 1, 'فى': 1, 'صدور': 1, 'الجنة': 1}
```

#### check_sura_with_frequency
**check_sura_with_frequency(sura_num,freq_dec)**
- checks whether the frequency dictionary of a **specific chapter** is compatible with the **original chapter** in the Quran. It takes **sura_num** (chapter number) and **freq_dec** (frequency dictionary) and returns **True** if they are compatible and **False** if not.

```python
dictionaryFrequency = pq.generate_frequency_dictionary(111)
matched = pq.check_sura_with_frequency(110,dictionaryFrequency)
print(matched)
>>> False
```

#### generate_latex_table
**generate_latex_table(dictionary,filename,location=".")**
- generates the LaTeX code for a frequency table. It takes **dictionary** (a frequency dictionary), **filename** and **location** (where to save the file; the default '.' is the current directory). It returns **True** if the table was generated successfully and **False** if something went wrong.

```python
latexTable = pq.generate_latex_table(dictionaryFrequency,'any_file_name')
print(latexTable)
>>> True
```

## Search functions

#### search_sequence
**search_sequence(sequancesList,verse=None,chapterNum=0,verseNum=0,mode=3)**
- takes a list of sequences and returns the matched sequences. It searches in a verse, a chapter, or the whole Quran.
- it returns for every match:
    - the matched sequence
    - the chapter number of the occurrence
    - the token number if the sequence is a word, and 0 if it is a sentence
- Note:
    - if verse != None, the search uses this verse.
    - if no verse is given but chapterNum and verseNum are, the search uses that verse.
    - if no verse and no verseNum are given but chapterNum is, the search covers that chapter.
    - if no verse, chapterNum or verseNum is given, the search covers the whole Quran.
- it has several modes:
    1. search with a decorated sequence (with tashkeel) and return the matched sequence with decoration (with tashkeel).
    2. search with an undecorated sequence (without tashkeel) and return the matched sequence without decoration (without tashkeel).
    3. search with an undecorated sequence (without tashkeel) and return the matched sequence with decoration (with tashkeel).
- optional options:
    - **verse** (str): if passed, the search is applied to this string only.
    - **chapterNum** (int): if passed alone, the search is applied to this chapter only.
    - **verseNum** (int):
        - if passed alone, the search is applied to verse **verseNum** of **all chapters**.
        - if passed with **chapterNum**, the search is applied to verse verseNum of **chapterNum**.
    - **with_tashkeel** (bool):
        - if **True**, applied to the Quran **with** tashkeel.
        - if **False**, applied to the Quran **without** tashkeel.
    - **mode** (int): the mode to use; the default is mode 3.
- Note: if you don't pass any of the **optional options**, the search is applied to the whole **Quran**.
- Returns: dict(): each key is a sequence and each value is a list of matched sequences and their positions.

```python
matchedKeyword = pq.search_sequence(['قل أعوذ برب'])
print(matchedKeyword)
>>> {'قل أعوذ برب': [('قُلْ أَعُوذُ بِرَبِّ', 0, 1, 113), ('قُلْ أَعُوذُ بِرَبِّ', 0, 1, 114)]}
```

#### search_string_with_tashkeel
**search_string_with_tashkeel(sentence,tashkeel_pattern)**
- takes a **sentence** and a **tashkeel_pattern** (composed of 0's and 1's) and returns the locations that match the diacritics pattern, each with an **inclusive** start index and an **exclusive** end index. It returns an empty list if nothing is found.

```python
sentence = 'صِفْ ذَاْ ثَنَاْ كَمْ جَاْدَ شَخْصٌ'
tashkeel_pattern = ar.fatha + ar.sukun
results = pq.search_string_with_tashkeel(sentence,tashkeel_pattern)
print(results)
>>> [(3, 5), (7, 9), (10, 12), (13, 15), (17, 19)]
```

#### search_with_pattern
**search_with_pattern(pattern,sentence=None,verseNum=None,chapterNum=None,threshold=1)**
- searches with a pattern of 0's and 1's and returns the words of the sentence whose pattern matches, depending on the threshold. It takes the **pattern** you are looking for, **sentence (optional)** (the sentence to search in), **chapterNum (optional)** and **verseNum (optional)**, and returns a list of matched words and sentences.
- Cases:
    1. if sentence is passed, alone or with other arguments, the search covers the sentence only.
    2. if sentence is not passed but verseNum and chapterNum are, the search covers only that verse of that chapter.
    3. if sentence and verseNum are not passed but chapterNum is, the search covers that chapter only.

* Note: the running time depends on your threshold and on the size of the chapter, so searching the whole Quran is not supported because it takes a very long time (more than 11 minutes).

```python
result = pq.search_with_pattern(pattern="01111",chapterNum=1,threshold=0.9)
print(result)
>>> ['الرَّحِيمِ مَلِكِ', 'نَعْبُدُ وَإِيَّاكَ', 'الْمُسْتَقِيمَ صِرَطَ']
```

================================================
FILE: LICENSE
================================================
GNU GENERAL PUBLIC LICENSE Version 2, June 1991 Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The licenses for most software are designed to take away your freedom to share and change it.
By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Lesser General Public License instead.) You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. 
Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. 
You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. 
But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. 
For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 6. 
Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. 
Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. 
Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. 
Copyright (C) This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. Also add information on how to contact you by electronic and paper mail. If the program is interactive, make it output a short notice like this when it starts in an interactive mode: Gnomovision version 69, Copyright (C) year name of author Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program. You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names: Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (which makes passes at compilers) written by James Hacker. , 1 April 1989 Ty Coon, President of Vice This General Public License does not permit incorporating your program into proprietary programs. 
If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License.

================================================ FILE: QuranCorpus/quran-uthmani.xml ================================================

================================================ FILE: README.md ================================================
# PyQuran: The Python package for Quranic Analysis

PyQuran is a package which provides tools for Quranic Analysis and Arabic texts. It is still a small package which needs a lot of your effort. We believe that it is a seed of a fundamental general package for computations on Quran with Python, even at the most basic level, which is simply retrieving Quran text.

*Before Islam*, Arabic letters were written without dots ([*rasm*](https://en.wikipedia.org/wiki/Rasm)), which resulted in ambiguity: two or three letters had the same rasm, or form. Muslims decided to remove this ambiguity by adding dots above or below each of the letters which share the same rasm. Now each letter has a unique form. By the way, the Quran was originally written using letters without dots.

To enable researchers to use the modern alphabet, the old rasm or any other, we introduce *alphabetical systems*: a dynamic construction of letters.

## Quran Corpus
We use the [tanzil](http://tanzil.net/docs/download) Quran Corpus (*Uthmani Text*), which is in `UTF-8` encoding. You can find all unique characters of the Uthmanic Corpus [here](https://hci-lab.github.io/PyQuran-Private/Filtering-Special-Recitation-Symbols/#recitation-symbols).

There are *special recitation symbols* مصطلحات الضبط in the *Uthmani Text*; they are a guide for the reciter to know the right positions to pause and the rules of tajweed.
We provide an interface to filter those symbols, *on the fly while fetching from the corpus*, we **DO NOT** change the corpus, NEVER. [For the full details about filtering *special recitation symbols* مصطلحات الضبط.](https://hci-lab.github.io/PyQuran-Private/Filtering-Special-Recitation-Symbols/#recitation-symbols) ## Current Features - [Quran Retrieving.](https://hci-lab.github.io/PyQuran-Private/quran_tools/) - Advanced Searching, by [Text](https://hci-lab.github.io/PyQuran-Private/analysis_tools/#search_sequence) and [Diacritics](https://hci-lab.github.io/PyQuran-Private/analysis_tools/#search_string_with_tashkeel) Patterns. - [Buckwalter Transliteration](https://hci-lab.github.io/PyQuran-Private/arabic_tools/#buckwalter_transliteration), back and forth. - Multiple [Alphabetical Systems](https://hci-lab.github.io/PyQuran-Private/arabic_tools/#alphabetical-systems). - Words Frequency Table المعجم الترددى للألفاظ . ## PyQuran needs and Upcoming Features. - Words Frequency Table filtered according to words meaning. - Morphology analysis of words to their roots. - Arabic tools for representing Arabic text for AI algorithms and neural networks, for more serious Arabic text processing and understanding. Those tools should take meaning, diacritics, roots and other morphology aspects in account. - Some PyQuran in-house tools and architecture enhancement will be on GitHub Issues for you contributors to make PyQuran professional and easy to use. ## Contributing To contribute and maintain PyQuran, Please read [CONTRIBUTING](https://hci-lab.github.io/PyQuran-Private/CONTRIBUTING) section. ## Dependencies - [numpy](http://www.numpy.org/) - [pyarabic](https://github.com/linuxscout/pyarabic) ## Install - From PyPI: `$ pip3 install pyquran` ## Citing ``` @MISC {PyQuran2018, author = "Waleed A. Yousef and Taha M. Madbouly and Omar M. Ibrahime and Ali H. El-Kassas and Ali O. Hassan and Abdallah R. Albohy and Moustafa A. 
Mahmoud", title = "PyQuran: The Python package for Quranic Analysis", howpublished = "https://hci-lab.github.io/PyQuran-Private", year = "2018"} ``` ## Communication [Author Page](https://hci-lab.github.io/PyQuran-Private/authors) ================================================ FILE: __init__.py ================================================ """ """ from pyquran.tools import quran from pyquran.tools import arabic from pyquran.core.pyquran import * ================================================ FILE: core/__init__.py ================================================ # Adding another searching path from sys import path import os # The current path of the current module. path_current_module = os.path.dirname(os.path.abspath(__file__)) tools_modules = '../tools/' tools_path = os.path.join(path_current_module, tools_modules) path.append(tools_path) ================================================ FILE: core/pyquran.py ================================================ """Main PyQuran Library Module * Data: Sat Nov 18 03:30:41 EET 2017 This module contains tools for `Quranic Analysis` (More expressive description later) """ # Adding another searching path from sys import path import os # The current path of the current module. 
path_current_module = os.path.dirname(os.path.abspath(__file__)) tools_modules = '../tools/' tools_path = os.path.join(path_current_module, tools_modules) path.append(tools_path) import quran import sys import error import numpy import operator import re import searchHelper import functools import difflib as dif import arabic from arabic import * from pyarabic.araby import strip_tashkeel, strip_tatweel,separate,strip_tatweel from audioop import reverse from itertools import chain from collections import Counter, defaultdict import buckwalter import sys import shapeHelper from collections import OrderedDict from xml.etree.ElementTree import ElementTree from xml.etree.ElementTree import Element import xml.etree.ElementTree as etree from xml.dom import minidom def parse_sura(n, alphabets=['ل', 'ب']): """parses the sura and returns a matrix (ndarray), the rows number equals to the ayat number, and the columns number equals to the length of alphabets What it does: it calculates number of occurrences of each on of letters in the alphabets for each aya. If `A` is a ndarray, then A[i,j] is the number of occurrences of the letter alphabets[j] in the aya i. Args: param1 (int): the ordered number of sura in The Mus'haf. param2 ([str]): a list of alphabets Returns: ndarray: with dimensions (a * m), where `a` is the number of ayat el-sura and `m` is the number of letters passed to the function through alphabets[] Issue: 1. A list of Arabic letters maybe flipped by your editor, so, the first char will be the most-right one, unlike a list of English char, the first element is the left-most one. 2. I didn't make alphabets[] 29 by default. Just try it by filling the alphabets with some letters. 
""" # getting the nth sura sura = quran.get_sura(n) # getting the ndarray dimensions a = len(sura) m = len(alphabets) # building ndarray with appropriate dimensions A = numpy.zeros((a,m), dtype=numpy.int) # Filling ndarray with alphabets[] occurrences i = 0 # number of current aya j = 0 # occurrences for aya in sura: for letter in alphabets: A[i,j] = aya.count(letter) j += 1 j = 0 i += 1 return A def get_frequency(sentence): """it take sentence that you want to compute it's words frequency. Args: sentence (string): sentece that compute it's frequency. Returns: dict: {str: int} Example: ```python q.get_frequency(quran.get_verse(1,1)) >>> {'الرحمن': 1, 'الرحيم': 1, 'الله': 1, 'بسم': 1} ``` """ if type(sentence) != str: raise TypeError('sentece should be string') # split sentence to words word_list = sentence.split() #compute count of uniqe words frequency = Counter(word_list) #sort frequency descending sorted_freq = dict(sorted(frequency.items(),key=operator.itemgetter(1),reverse=True)) return sorted_freq def generate_frequency_dictionary(suraNumber=None): """computes the frequency dictionary; wher key is a unique word and values is the its occurrence. Args: suraNumber (int): it's optional Returns: dict: key is word, str; value is its occurrences, int. 
    Example:
    ```python
    q.generate_frequency_dictionary(114)
    >>> {'أعوذ': 1, 'إله': 1, 'الجنة': 1, 'الخناس': 1, 'الذى': 1, 'الناس': 4, 'الوسواس': 1, 'برب': 1, 'شر': 1, 'صدور': 1, 'فى': 1, 'قل': 1, 'ملك': 1, 'من': 2, 'والناس': 1, 'يوسوس': 1}
    ```
    """
    if type(suraNumber) != int and suraNumber is not None:
        raise TypeError('suraNumber should be integer')
    # range-check only when a sura number was given: comparing None with an
    # int raises TypeError in Python 3, which broke the default (whole Quran) call
    if suraNumber is not None and (suraNumber <= 0 or suraNumber > arabic.swar_num):
        raise ValueError('suraNumber should be in range [1-114]')

    frequency = {}
    # get all Quran if suraNumber is None
    if suraNumber is None:
        # get all Quran as one sentence
        Quran = ' '.join([' '.join(quran.get_sura(i)) for i in range(1, 115)])
        # get all Quran frequency
        frequency = get_frequency(Quran)
    # get frequency of suraNumber
    else:
        # get sura from QuranCorpus
        sura = quran.get_sura(sura_number=suraNumber)
        ayat = ' '.join(sura)
        # get frequency of sura
        frequency = get_frequency(ayat)

    return frequency


def check_sura_with_frequency(sura_num, freq_dec):
    """Checks whether the frequency dictionary of a specific sura is
       compatible with the original sura in its count of shapes (characters).

    Args:
        sura_num (int): sura number
        freq_dec (dict): frequency dictionary of that sura

    Returns:
        Boolean: True :- if compatible
                 False :- if not

    Example:
    ```python
    frequency_dic = q.generate_frequency_dictionary(114)
    q.check_sura_with_frequency(114, frequency_dic)
    >>> True
    ```
    """
    if type(sura_num) != int:
        raise TypeError('sura_num should be integer')
    if type(freq_dec) != dict:
        raise TypeError('freq_dec should be dictionary')
    if sura_num <= 0:
        raise ValueError('sura_num should be in range [1-114]')
    # get number of chars in the frequency dictionary
    num_of_chars_in_dec = sum([len(word) * count for word, count in freq_dec.items()])
    # get number of chars in the original sura (spaces excluded)
    num_of_chars_in_sura = sum([len(aya.replace(' ', '')) for aya in quran.get_sura(sura_num)])
    # print(num_of_chars_in_dec ," ", num_of_chars_in_sura)
    return num_of_chars_in_dec == num_of_chars_in_sura


def sort_dictionary_by_similarity(frequency_dictionary, threshold=0.8):
    """this function is used to cluster words using
similarity and sort every bunch of word by most common and sort bunches descending in same time Args: frequency_dictionary: dict, frequency dictionary to be sorted. Returns: dict : {str: int} sorted dictionary Example: ```python frequency_dic = q.generate_frequency_dictionary(114) q.sort_dictionary_by_similarity(frequency_dic) # this dictionary is sorted using similarity 0.8 >>> {'أعوذ': 1, 'إذا': 2, 'العقد': 1, 'الفلق': 1, 'النفثت': 1, 'برب': 1, 'حاسد': 1, 'حسد': 1, 'خلق': 1, 'شر': 4, 'غاسق': 1, 'فى': 1, 'قل': 1, 'ما': 1, 'من': 1, 'وقب': 1, 'ومن': 3} ``` """ if type(threshold) != float: raise TypeError('threshold should be float') if type(frequency_dictionary) != dict: raise TypeError('frequency_dictionary should be dictionary') if threshold < 0 or threshold > 1: raise ValueError('threshold should be float number in range [0-1]') # list of dictionaries and every dictionary has similar words and we will call every dictionary as 'X' list_of_dics = [] # this dictionary key is a position of 'X' and value the sum of frequencies of 'X' list_of_dics_counts = dict() #counter of X's dic_num=0 #lock list used to lock word that added in 'X' occurrence_list = set() #loop on all words to cluster them for word,count in frequency_dictionary.items(): #check if word is locked from some 'X' or not if word not in occurrence_list: #this use to sum all of frequencies of this 'X' sum_of_freqs = count #create new 'X' and add the first word sub_dic = dict({word:count}) #add word in occurrence list to lock it occurrence_list.add(word) #loop in the rest word to get similar word for sub_word,sub_count in frequency_dictionary.items(): #check if word lock or not if sub_word not in occurrence_list: #compute similarity probability similarity_prob = dif.SequenceMatcher(None,word,sub_word).ratio() # check if prob of word is bigger than threshold or not if similarity_prob >= threshold: #add sub_word as a new word in this 'X' sub_dic[sub_word] = sub_count # lock this new word 
occurrence_list.add(sub_word) # add the frequency of this new word to sum_of_freqs sum_of_freqs +=sub_count #append 'X' in list of dictionaries list_of_dics.append(sub_dic) #append position and summation of this 'X' frequencies list_of_dics_counts[dic_num] = sum_of_freqs # increase number of dictionaries dic_num +=1 #sort list of dictionaries count (sort X's descending) The most frequent list_of_dics_counts = dict(sorted(list_of_dics_counts.items(),key=operator.itemgetter(1),reverse=True)) #new frequency dictionary that will return new_freq_dic =dict() #loop to make them as one dictionary after sorting for position in list_of_dics_counts.keys(): new_sub_dic = dict(sorted(list_of_dics[position].items(),key=operator.itemgetter(1),reverse=True)) for word,count in new_sub_dic.items(): new_freq_dic[word] = count return new_freq_dic def generate_latex_table(dictionary,filename,location="."): """generate latex code of table of frequency Args: dictionary (dict): frequency dictionary filename (string): file name location (string): location to save , the default location is same directory Returns: Boolean: True :- if Done Flase :- if something wrong with folder name Example: ```python frequency_dic = q.generate_frequency_dictionary(114) q.generate_latex_table(frequency_dic,'filename','../location') # it's mean Done, the file 'filename.tex' is ginerated >>> True ``` """ if type(filename) != str: raise TypeError('filename should be string') if type(dictionary) != dict: raise TypeError('dictionary should be dictionary') head_code = """\\documentclass{article} %In the preamble section include the arabtex and utf8 packages \\usepackage{arabtex} \\usepackage{utf8} \\usepackage{longtable} \\usepackage{color, colortbl} \\usepackage{supertabular} \\usepackage{multicol} \\usepackage{geometry} \\geometry{left=.1in, right=.1in, top=.1in, bottom=.1in} \\begin{document} \\begin{multicols}{6} \\setcode{utf8} \\begin{center}""" tail_code = """\\end{center} \\end{multicols} 
\\end{document}""" begin_table = """\\begin{tabular}{ P{2cm} P{1cm}} \\textbf{words} & \\textbf{\\#} \\\\ \\hline \\\\[0.01cm]""" end_table= """\\end{tabular}""" rows_num = 40 if location != '.': filename = location +"/"+ filename try: file = open(filename+'.tex', 'w', encoding='utf8') file.write(head_code+'\n') n= int(len(dictionary)/rows_num) words = [("\\<"+word+"> & "+str(frequancy)+' \\\\ \n') for word, frequancy in dictionary.items()] start=0 end=rows_num new_words = [] for i in range(n): new_words = new_words+ [begin_table+'\n'] +words[start:end] +[end_table+" \n"] start=end end+=rows_num remain_words = len(dictionary) - rows_num*n if remain_words > 0: new_words += [begin_table+" \n"]+ words[-1*remain_words:]+[end_table+" \n"] for word in new_words: file.write(word) file.write(tail_code) file.close() return True except: return False def shape(system): """shape declare a new system for alphabets ,user pass the alphabets "in a list of list" that want to count it as on shape "inner list" and returns a dictionary has the same value for each set of alphabets and diffrent values for the rest of alphabets Args: param1 ([[char]]): a list of list of alphabets , each inner list have alphabets that with be count as one shape . 
Returns: dictionary: with all alphabets, where each char "key" have a value value will be equals for alphabets that will be count as oe shape """ newSys=system alphabetMap = OrderedDict() indx = 0 newAlphabet = list(set(chain(*system))) theRestOfAlphabets = list(set(alphabet) - set(newAlphabet)) for char in alphabet: if char in theRestOfAlphabets: alphabetMap.update({char: indx}) indx = indx + 1 elif char in newAlphabet: #sublist that contain this char(give all chars the same indx) #drop this sublist from the system systemItem = shapeHelper.searcher(newSys, char) for char in newSys[systemItem]: alphabetMap.update({char: indx}) newSys=newSys[0:systemItem]+newSys[systemItem+1:] newAlphabet = list(set(chain(*newSys))) indx = indx + 1 ''' for setOfNewAlphabet in system: for char in setOfNewAlphabet: alphabetMap.update({char: indx}) indx = indx + 1 for char in theRestOfAlphabets: alphabetMap.update({char: indx}) indx = indx + 1 ''' alphabetMap.update({" ": 70}) return alphabetMap def count_rasm(text, system=None): """counts the occerences of each letter (As `system` defines) in sura. Args: text: [str], a list of strings , each inner list is ayah . system: Optional, [[char]], revise [Alphabetical Systems](#alphabetical-systems), if `system` is not passed, the normal alphabet is applied. Returns: (N * P) ndarray (Matrix A): N is the number of verses, P is the alphabet (as defined in `system`).\n `A[i][j]` is the number of the letter `j` in the verse `i`. 
    Example:
    ```python
    newSystem = [[q.beh, q.teh, q.theh], [q.jeem, q.hah, q.khah]]
    q.count_rasm(q.quran.get_sura(110), newSystem)
    >>>[[1 2 1 0 0 0 1 0 4 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 3 0 1 1 1 0 0]
    [1 2 0 0 2 0 0 0 5 0 2 0 1 0 1 0 0 0 0 0 0 0 2 0 0 4 0 3 1 3 1 3]
    [6 2 0 0 0 0 1 0 4 0 1 0 2 0 2 0 0 0 0 0 0 1 2 0 2 0 1 2 2 2 0 0]]
    ```
    """
    # "there is an intersection between subsets"
    if system == None:
        alphabetMap = dict()
        indx = 0
        for char in alphabet:
            alphabetMap.update({char: indx})
            indx = indx + 1
        alphabetMap.update({" ": 70})
        p = len(alphabet)  # +1  # the last one for space char
    else:
        for subSys in system:
            if not isinstance(subSys, list):
                raise ValueError("system must be a list of lists, not a flat list")
        if shapeHelper.check_repetation(system):
            raise ValueError("there is a repetition in your system")
        p = len(alphabet) - len(list(set(chain(*system)))) + len(system)
        alphabetMap = shape(system)
    n = len(text)
    # Builtin int: the numpy.int alias no longer exists in NumPy >= 1.24.
    A = numpy.zeros((n, p), dtype=int)
    i = 0
    j = 0
    charCount = []
    for verse in text:
        verse = shapeHelper.convert_text_to_numbers(verse, alphabetMap)
        # Column k holds how often letter (or shape) k occurs in this verse.
        for k in range(0, p, 1):
            charCount.insert(j, verse.count(k))
            j += 1
        A[i, :] = charCount
        i += 1
        charCount = []
        j = 0
    return A


def get_verse_count(surah):
    """get_verse_count takes a surah as a parameter and returns how many ayat it has.

    What it does: count the number of verses in surah

    Args:
        param1 ([str]): a list of strings, one per ayah

    Returns:
        int: the number of verses
    """
    return len(surah)


def count_token(text):
    """count_token takes a text (surah or ayah) and counts the number of tokens it has.

    What it does: count the number of tokens in text

    Args:
        param1 (str or [str]): a string or list of strings

    Returns:
        int: the number of tokens
    """
    count = 0
    if isinstance(text, list):
        for ayah in text:
            count = count + ayah.count(' ') + 1
    else:
        count = text.count(' ') + 1
    return count


def grouping_letter_diacritics(sentance):
    """Grouping each letter with its diacritics.

    Args:
        sentance: str

    Returns:
        [str]: a list of _x_, where _x_ is the letter accompanied with its diacritics.
    Example:
    ```python
    q.grouping_letter_diacritics('إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ')\n
    >>> ['إِ', 'نَّ', 'ا', ' ', 'أَ', 'عْ', 'طَ', 'يْ', 'نَ', 'كَ', ' ', 'ا', 'لْ', 'كَ', 'وْ', 'ثَ', 'رَ']
    ```
    """
    hroof_with_tashkeel = []
    for index, char in enumerate(sentance):
        # `alphabet or alefat or hamzat` short-circuits to `alphabet`, so only
        # `alphabet` is consulted here (likewise only `tashkeel` below).
        if char in alphabet or char == ' ':
            k = index
            harf_with_taskeel = char
            # Attach every diacritic that directly follows the letter.
            while (k + 1) != len(sentance) and sentance[k + 1] in tashkeel:
                harf_with_taskeel = harf_with_taskeel + sentance[k + 1]
                k = k + 1
            hroof_with_tashkeel.append(harf_with_taskeel)
    return hroof_with_tashkeel


def frequency_of_character(characters, verse=None, chapterNum=0, verseNum=0, with_tashkeel=False):
    """Counts the occurrences of the given characters in a specific verse, a sura, or the entire Quran.

    Note:
        If verse and chapterNum are not passed, the entire Quran is targeted.

    Args:
        verse: str, the verse to count in; the default is None.
        chapterNum: int, the number of the sura to count in; the default is 0.
        verseNum: int, verse number in the sura.
        characters: [], list of characters that you want to count.
        with_tashkeel: Bool, whether to count with tashkeel.

    Returns:
        {dic} : {str : int} a dictionary whose keys are the characters and whose values are the count of each character.
Example: ```python q.frequency_of_character(['أ',"ب","تُ"],verseNum=2,with_tashkeel=False) #that will count the vers number **2** in all swar >>> {'أ': 101, 'ب': 133, 'تُ': 0} q.frequency_of_character(['أ',"ب","تُ"],chapterNum=1,verseNum=2,with_tashkeel=False) #that will count the vers number **2** in chapter **1** >>> {'أ': 0, 'ب': 1, 'تُ': 0} q.frequency_of_character(['أ',"ب","تُ"],chapterNum=1,verseNum=2,with_tashkeel=False) #that will count in **all Quran** >>> {'أ': 8900, 'ب': 11491, 'تُ': 2149} ``` """ if type(characters) != list: raise TypeError('characters should be list of characters') if type(chapterNum) != int: raise TypeError('chapterNum should be integer') if type(verseNum) != int: raise TypeError('verseNum should be integer') #dectionary that have frequency frequency = dict() #check if count specific verse if verse!=None: if type(verse) != str: raise TypeError('verse should be string') if not with_tashkeel: verse = strip_tashkeel(verse) #count frequency of chars frequency = searchHelper.hellper_frequency_of_chars_in_verse(verse,characters) #check if count specific chapter elif chapterNum!=0: if chapterNum <0 or chapterNum > arabic.swar_num: raise ValueError('chapterNum should be integer number in range [1-114]') #check if count specific verse in this chapter if verseNum!=0: #check if verseNum out of range if(verseNum<0): raise ValueError('chapterNum should be positive integer ') verse = quran.get_sura(chapterNum,with_tashkeel=with_tashkeel)[verseNum-1] #count frequency of chars frequency = searchHelper.hellper_frequency_of_chars_in_verse(verse,characters) else: #count on all chapter chapter = " ".join(quran.get_sura(chapterNum,with_tashkeel=with_tashkeel)) #count frequency of chars frequency = searchHelper.hellper_frequency_of_chars_in_verse(chapter,characters) else: if verseNum!=0: if(verseNum<0): raise ValueError('chapterNum should be positive integer ') #count for specific verse in all Quran Quran = "" for i in range(swar_num): Quran = Quran +" 
"+quran.get_verse(i+1,verseNum,with_tashkeel=with_tashkeel)+" " #count frequency of chars frequency = searchHelper.hellper_frequency_of_chars_in_verse(Quran,characters) else: #count for all Quran Quran = "" for i in range(swar_num): Quran = Quran +" "+ " ".join(quran.get_sura(i+1,with_tashkeel=with_tashkeel))+" " #count frequency of chars frequency = searchHelper.hellper_frequency_of_chars_in_verse(Quran,characters) return frequency def get_token(tokenNum,verseNum,chapterNum,with_tashkeel=False): """ get token from specific verse form specific chapter Args: tokenNum (int) : position of token verseNum (int): number of verse chapterNum (int): number of chapter with_tashkeel (int) : to check if search with taskeel or not Returns: str : return verse Example: ```python q.get_token(tokenNum=4,verseNum=1,chapterNum=1,with_tashkeel=True) >>> 'الرَّحِيمِ' ``` """ if type(tokenNum) != int: raise TypeError('tokenNum should be integer') if type(chapterNum) != int: raise TypeError('chapterNum should be integer') if type(verseNum) != int: raise TypeError('verseNum should be integer') if chapterNum < 0 or chapterNum > arabic.swar_num: raise ValueError('chapterNum should be integer number in range [1-114]') if tokenNum <= 0: raise ValueError('tokenNum should be positive integer numbers and > 0') if(verseNum<0): raise ValueError('chapterNum should be positive integer ') try: tokens = quran.get_sura(chapterNum,with_tashkeel)[verseNum-1].split() if tokenNum > len(tokens): return "" else: return tokens[tokenNum-1] except: return "" def search_sequence(sequancesList,verse=None,chapterNum=0,verseNum=0,mode=3): """take list of sequances and return matched sequance, it search in verse ot chapter or All Quran , it return for every match : 1 - matched sequance 2 - chapter number of occurrence 3 - token number if word and 0 if sentence Note : - if found verse != None it will use it en search . - if no verse and found chapterNum and verseNum it will - use this verse and use it to search. 
- if no verse and no verseNum and found chapterNum it will - search in chapter. - if no verse and no chapterNum and no verseNum it will search in All Quran. it has many modes: - search with decorated sequance (with tashkeel), and return matched sequance with decorates (with tashkil). - search without decorated sequance (without tashkeel), and return matched sequance without decorates (without tashkil). - search without decorated sequance (without tashkeel), and return matched sequance with decorates (with tashkil). Args: chapterNum: int, number of chapter where function search. verseNum: int, number of verse wher function search. sequancesList: [], a list of sequances that you want to match them. mode: int, this mode that you need to use and default mode 3. Returns: dict: key is sequances and value is a list of matched_sequance and their positions. Example: ```python # search in chapter = 1 only using mode 3 (default) q.search_sequence(sequancesList=['ملك يوم الدين'],chapterNum=1) #it will return #{'sequance-1' : [ (matched_sequance , position , vers_num , chapter_num) , (....) ], # 'sequance-2' : [ (matched_sequance , position , vers_num , chapter_num) , (....) 
] } # Note : position == 0 if sequance is a sentence and == word position if sequance is a word >>> {'ملك يوم الدين': [('مَلِكِ يَوْمِ الدِّينِ', 0, 4, 1)]} # search in all Quran using mode 3 (default) q.search_sequence(sequancesList=['ملك يوم']) >>> {'ملك يوم': [('مَلِكِ يَوْمِ', 0, 4, 1), ('الْمُلْكُ يَوْمَ', 0, 73, 6), ('الْمُلْكُ يَوْمَئِذٍ', 0, 56, 22), ('الْمُلْكُ يَوْمَئِذٍ', 0, 26, 25)]} ``` """ if type(sequancesList) != list: raise TypeError('sequancesList should to be list of strings') if type(verse) != str and verse != None: raise TypeError('verse should to be string') if type(chapterNum) != int: raise ValueError('chapterNum should be integer') if type(verseNum) != int: raise ValueError('verseNum should be integer') if chapterNum < 0 or chapterNum > arabic.swar_num: raise ValueError('chapterNum should be integer number in range [1-114]') if(verseNum<0): raise ValueError('verseNumr should be positive integer and > 0') if mode <= 0 or mode > 3: raise ValueError('mode should be positive integer numbers 1,2 or 3 only') final_dict = dict() #loop on all sequances for sequance in sequancesList: #check mode 1 (taskeel to tashkeel) if mode==1: final_dict[sequance] = searchHelper.hellper_pre_search_sequance( sequance=sequance, verse=verse, chapterNum=chapterNum, verseNum=verseNum, with_tashkeel=True) # chaeck mode 2 (without taskeel to without tashkeel) elif mode==2: final_dict[sequance] = searchHelper.hellper_pre_search_sequance( sequance=sequance, verse=verse, chapterNum=chapterNum, verseNum=verseNum, with_tashkeel=False) # chaeck mode 3 (without taskeel to with tashkeel) elif mode==3: sequance = strip_tashkeel(sequance) final_dict[sequance] = searchHelper.hellper_pre_search_sequance( sequance=sequance, verse=verse, chapterNum=chapterNum, verseNum=verseNum, with_tashkeel=True, mode3=True) return final_dict def search_string_with_tashkeel(string, key): """ Args: string: str, sentence to search by key. key: str, taskeel pattern. 
Assumption: Searches tashkeel that is exciplitly included in string. Returns: find: list of pairs where x and y are the start and end index of the matched. nod-found: [] Example: ```python string = 'صِفْ ذَاْ ثَنَاْ كَمْ جَاْدَ شَخْصٌ' q.search_string_with_tashkeel(string, 'َْ') >>> [(3, 5), (7, 9), (10, 12), (13, 15), (17, 19)] ``` """ error.is_string(string, 'You must pass an string.') # tashkeel pattern string_tashkeel_only = searchHelper.get_string_taskeel(string) # searching taskeel pattern results = [] for m in re.finditer(key, string_tashkeel_only): spacesBeforeStart = \ searchHelper.count_spaces_before_index(string_tashkeel_only, m.start()) spacesBeforeEnd = \ searchHelper.count_spaces_before_index(string_tashkeel_only, m.start()) begin = m.start() * 2 - spacesBeforeStart end = m.end() * 2 - spacesBeforeEnd one_result = (m.start(), m.end()) results.append(one_result) if results == []: return [] else: return results def buckwalter_transliteration(string, reverse=False): """Back and forth Arabic-Bauckwalter transliteration. Revise [Buckwalter](https://en.wikipedia.org/wiki/Buckwalter_transliteration) Args: string: to be transliterated. reverse: Optional boolean. `False` transliterates from Arabic to Bauckwalter, `True` transliterates from Bauckwalter to Arabic. Returns: str: transliterated string. 
Example: ```python q.buckwalter_transliteration('إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ')\n >>> aEoTayonaka Alokawovara ``` """ for key, value in buckwalter.buck2uni.items(): if not reverse: string = string.replace(value, key) else: string = string.replace(key, value) return string def get_tashkeel_binary(ayah): ''' get_tashkeel_pattern is function takes the str or list(ayah or token) and converts to zero and ones What it does: take token whether ayah or sub ayah and maps it to zero for sukoon and char without diarictics and one for char with harakat and tanwin Args: param1 (str): a string or list Returns: str : zero and ones for each token ''' marksDictionary = {'ْ': 0, '': 0, 'ُ': 1, 'َ': 1, 'ِ': 1, 'ّ': 1, 'ٌ': 1, 'ً': 1, 'ٍ': 1} charWithOutTashkeelOrSukun = '' tashkeelPatternList = [] # list of zeros and ones marksList = [] # convert the List o to string without spaces ayahModified = ''.join(ayah.strip()) tashkeelPatternStringWithSpace = '' # check is there a tatweel in ayah or not if(tatweel in ayahModified): ayahModified = strip_tatweel(ayahModified) # check whether exist alef_mad in ayah if exist unpack the alef mad if (alef_mad in ayahModified): ayahModified = unpack_alef_mad(ayahModified) # separate tashkeel from the ayah ayahOrAyatWithoutTashkeel, marks = separate(ayahModified) for mark in marks: #the pyarabic returns the char of marks without tashkeel with 'ـ' so if check about this mark if not exist #append in list harakat and zero or ones in tashkeel pattern list if yes append the marks and patterns if (mark != 'ـ'): marksList.append(mark) tashkeelPatternList.append(marksDictionary[mark]) else: marksList.append(charWithOutTashkeelOrSukun) tashkeelPatternList.append(marksDictionary[charWithOutTashkeelOrSukun]) # convert list of Tashkeel pattern to String for each token in ayah separate with another token with spce for posOfCharInAyah in range(0, len(ayahOrAyatWithoutTashkeel)): if ayahOrAyatWithoutTashkeel[posOfCharInAyah] == ' ' and 
tashkeelPatternList[posOfCharInAyah] == 0: tashkeelPatternStringWithSpace += ' ' else: tashkeelPatternStringWithSpace += str(tashkeelPatternList[posOfCharInAyah]) return tashkeelPatternStringWithSpace, marksList def factor_alef_mad(sentance): '''It returns the `sentance` having alef_mad factored into alef_hamza and alef_wasel. Args: sentance: str, a string or list. Returns: str: sentance having the alef_mad factored Example: ```python q.factor_alef_mad('آ')\n >>> 'أْأَ' ``` ''' ayahWithUnpackAlefMad = '' for charOfAyah in sentance: if charOfAyah != 'آ': ayahWithUnpackAlefMad += charOfAyah else: ayahWithUnpackAlefMad += 'أَ' ayahWithUnpackAlefMad += 'أْ' return ayahWithUnpackAlefMad def check_system(system, index=None): ''' Returns the alphabet including treated-as-one letters. If you pass the index as the second optional arguement, it returns the letter of the that index only, not the hole alphabet. Args: system: [[char]], a list of letters, where each letter to be treated as one letter are in one sub-list, see [Alphabetical Systems](#alphabetical-systems). index: Optional integer, is a index of a letter in the new system. Returns: list: full sorted system or a specific index. Example: ```python q.check_system([['alef', 'beh']])\n >>> [['ء'], ['آ'], ['أ', 'ب'], ['ؤ'], ['إ'], ['ئ'], ['ا'], ['ة'], ['ت'], ['ث'], ['ج'], ['ح'], ['خ'], ['د'], ['ذ'], ['ر'], ['ز'], ['س'], ['ش'], ['ص'], ['ض'], ['ط'], ['ظ'], ['ع'], ['غ'], ['ف'], ['ق'], ['ك'], ['ل'], ['م'], ['ن'], ['ه'], ['و'], ['ى'], ['ي']] ``` The previous example prints each letter as one element in a new alphabet list, as you can see the two letters alef and beh are considered one letter. 
''' if shapeHelper.check_repetation(system) == True: raise ValueError ("there is a repetition in your system") p = len(alphabet) - len(list(set(chain(*system)))) + len(system) systemDict = shape(system) fullSys = [[key for key, value in systemDict.items() if value == i] for i in range(p)] if index==None: return fullSys else: return fullSys[index] def search_with_pattern(pattern,sentence=None,verseNum=None,chapterNum=None,threshold=1): ''' this function use to search in 0's,1's pattern and return matched words from sentence pattern dependent on the ratio to adopt threshold. Args: pattern (str): 0's,1's pattern that you need to search. sentence (str): Arabic string with tashkeel where function will search. verseNum (int): number of specific verse where will search. chapterNum (int): number of specific chapter where will search. threshold (float): threshold of similarity , if 1 it will get the similar exactly, and if not ,it will get dependant on threshold number. Cases: 1- if pass **sentece** only or with another args it will search in sentece only. 2- if not passed **sentence** , passed **verseNum** and **chapterNum**, it will search in this verseNum that exist in chapterNum only. 3- if not passed **sentence**,**verseNum** and passed **chapterNum** only, it will search in this specific chapter only. 4- if not pass any args it will search in **all Quran** (not recommended, take long time). Return: [list] : it will return list that have matched word, or matched senteces and return empty list if not found. Note : it's takes time dependent on your threshold and size of chapter, so it's not support to search on All-Quran becouse it take very long time more than 11 min. 
    Example:
    ```python
    # it will search in chapter **1** only
    q.search_with_pattern("011101",chapterNum=1)
    >>> ['لِلَّهِ رَبِّ', 'الْعَلَمِينَ', 'أَنْعَمْتَ عَلَيْهِمْ', 'الْمَغْضُوبِ عَلَيْهِمْ']
    ```
    '''
    if type(pattern) != str or len(pattern) != (pattern.count('0') + pattern.count('1')):
        raise TypeError('pattern should be a string of 0\'s and 1\'s like \'011011010\'')
    if type(sentence) != str and sentence != None:
        raise TypeError('sentence should be a string')
    if type(chapterNum) != int and chapterNum != None:
        raise TypeError('chapterNum should be integer')
    if type(verseNum) != int and verseNum != None:
        raise TypeError('verseNum should be integer')
    # chapterNum defaults to None, which cannot be compared with an int.
    if chapterNum != None and (chapterNum < 0 or chapterNum > arabic.swar_num):
        raise ValueError('chapterNum should be integer number in range [1-114]')
    if (verseNum != None and verseNum < 0):
        raise ValueError('verseNum should be positive integer and > 0')
    if threshold > 1 or threshold < 0:
        raise ValueError('Threshold should be 0 <= Threshold <= 1')
    pattern = pattern.replace(' ', '')
    if len(pattern) <= 0:
        raise ValueError('pattern was not passed')

    # check if a sentence was passed
    if sentence != None:
        # convert sentence to 0/1
        sentence_pattern, taskieel = get_tashkeel_binary(sentence)
        return searchHelper.hellper_search_with_pattern(pattern=pattern,
                                                        sentence_pattern=sentence_pattern,
                                                        sentence=sentence,
                                                        ratio=threshold)
    else:
        # check if search in specific chapter
        if chapterNum != None:
            # check if search in specific verse
            if verseNum != None:
                sentence = quran.get_verse(chapterNum=chapterNum,
                                           verseNum=verseNum,
                                           with_tashkeel=True)
            # search in the whole chapter
            else:
                sentence = " ".join(quran.get_sura(chapterNum, True))
        # searching the whole Quran is not supported
        else:
            raise ValueError('please send sentence or verseNum and chapterNum to search.')
        # convert sentence to 0/1
        sentence_pattern, taskieel = get_tashkeel_binary(sentence)
        sentence_pattern_without_spaces = sentence_pattern.replace(" ", "")
        # check if the pattern does not occur at all
        if pattern not in sentence_pattern_without_spaces:
            return []
        else:
            return searchHelper.hellper_search_with_pattern(pattern=pattern,
                                                            sentence_pattern=sentence_pattern,
                                                            sentence=sentence,
                                                            ratio=threshold)


def frequency_sura_level(suraNumber):
    """Computes the frequency dictionary for a sura

    Args:
        suraNumber: 1 <= Int <= 114.

    Return:
        [aya_frequency_dictionary]: the key of `aya_frequency_dictionary` is a unique word in aya and the corresponding value is its frequency. A list of frequency dictionaries for each verse of Sura.

    Note:
        * frequency dictionary is a python dict, which carries word frequencies for an aya.
        * Its key is (str) word, its value is (int) word frequency

    Example:
    ```python
    q.frequency_sura_level(suraNumber=1)
    >>> [{'بسم': 1, 'الله': 1, 'الرحمن': 1, 'الرحيم': 1},
         {'الحمد': 1, 'لله': 1, 'رب': 1, 'العلمين': 1},
         {'الرحمن': 1, 'الرحيم': 1},
         {'ملك': 1, 'يوم': 1, 'الدين': 1},
         {'إياك': 1, 'نعبد': 1, 'وإياك': 1, 'نستعين': 1},
         {'اهدنا': 1, 'الصرط': 1, 'المستقيم': 1},
         {'عليهم': 2, 'صرط': 1, 'الذين': 1, 'أنعمت': 1, 'غير': 1, 'المغضوب': 1, 'ولا': 1, 'الضالين': 1}]
    ```
    """
    # A list of frequency dictionaries
    frequency_ayat_list = []
    for aya in quran.get_sura(suraNumber):
        frequency_ayat_list.append(get_frequency(aya))
    return frequency_ayat_list


def get_unique_words():
    """returns a set of all unique words in Quran

    TODO: need to support suras as well.
    """
    # Unique words
    words_set = set()
    for i in range(1, 114+1):
        sura = quran.get_sura(i)
        for aya in sura:
            wordsList = aya.split(' ')
            for word in wordsList:
                words_set.add(word)
    return words_set


def get_words():
    """returns a list of all words in Quran

    TODO: need to support suras as well.
    """
    # words
    words_list = list()
    for i in range(1, 114+1):
        sura = quran.get_sura(i)
        for aya in sura:
            wordsList = aya.split(' ')
            for word in wordsList:
                words_list.append(word)
    return words_list


def frequency_quran_level():
    """Compute the word frequencies of the Quran.

    Returns:
        [sura_level_frequency_dict]: Revise the output of frequency_sura_level.
    """
    # * A list of sura level frequencies.
# * Each element is a list of ayat el-sura frequencies. quranWordsFrequences = [] for suraNumber in range(1, 114 +1): suraWordsFrequeces = frequency_sura_level(suraNumber) quranWordsFrequences.append(suraWordsFrequeces) return quranWordsFrequences def prettify(elem): """Return a pretty-printed XML string for the Element. """ rough_string = etree.tostring(elem, 'utf-8') reparsed = minidom.parseString(rough_string) return reparsed.toprettyxml(indent=" ") def quran_words_frequences_data(fileName): """Generate the entire words frequences of Quran into XML or JSON ToDo: Sould support JSONs as well. """ # Computing unique words unique_words = get_unique_words() comma_separated_unique_words = '' for word in unique_words: comma_separated_unique_words += word + ',' # Removing the extra commas comma_separated_unique_words = comma_separated_unique_words.strip(',') # * Creating quran_words_frequences_data -- the root tag root = Element('quran_words_frequences') root.set('unique_words', comma_separated_unique_words) # * Add root to the tree tree = ElementTree(root) for suraNumber in range(1, 114 +1): sura = quran.get_sura(suraNumber) # * Creating sura Tag suraTag = Element('sura') # * set number attribute suraTag.set('number', str(suraNumber)) # * set sura unique words # ??? 
update get_unique_words # suraTag.set('sura_unique_words', suraUniquewords) ayaCounter = 1 for aya in sura: # Create aya Tag ayaTag = Element('aya') ayaTag.set('number', str(ayaCounter)) # * Computes the words frequency for aya ayaWordsDict = get_frequency(aya) words_comma_separated = '' occurrence_comma_separated = '' for word in ayaWordsDict: words_comma_separated += word + ',' occurrence_comma_separated += str(ayaWordsDict[word]) + ',' # * The same order words_comma_separated = words_comma_separated.strip(',') occurrence_comma_separated = occurrence_comma_separated.strip(',') # * Add words & frequencies attributes ayaTag.set('unique_words', words_comma_separated) ayaTag.set('unique_words_frequencies', occurrence_comma_separated) # * Add aya tag to sura tag suraTag.append(ayaTag) ayaCounter += 1 # * add suraTag to the root root.append(suraTag) # print(prettify(root)) file = open(fileName, 'w') file.write(prettify(root)) file.close() ================================================ FILE: documentation/TODO ================================================ 1. Add Letter definitions (It is a one char letter or a more-than-one-letter considered as one letter) 2. Making the import q.arabic and q.analysis for arabic and analysis tools. 3. struct of tashkell instead of wirting them. ================================================ FILE: documentation/__init__.py ================================================ ================================================ FILE: documentation/auto_gen_docs.py ================================================ #!/usr/local/bin/python3 from sys import path import os # The current path of the current module. 
path_current_module = os.path.dirname(os.path.abspath(__file__)) tools_modules = '../tools/' tools_path = os.path.join(path_current_module, tools_modules) core_modules = '../core/' core_path = os.path.join(path_current_module,core_modules) path.append(tools_path) # Adding another searching path path.append(core_path) import pyquran import quran import arabic import re import inspect import shutil import sys # {{autogenerated}} '''NOTES 1 * All files MUST have {{autogenerated}} flag. 2 * foo(arg1:type) not supported yet; use foo(arg1)! ''' PAGES = [ { 'page': 'quran_tools.md', 'functions': [ quran.get_sura,# quran.get_verse,# quran.get_sura_number,# quran.get_sura_name,# ] }, { 'page': 'arabic_tools.md', 'functions': [ pyquran.check_system, # pyquran.factor_alef_mad, # pyquran.grouping_letter_diacritics,# arabic.alphabet_excluding,# arabic.strip_tashkeel,# pyquran.buckwalter_transliteration,# ]}, { 'page': 'analysis_tools.md', 'functions': [ pyquran.count_rasm,# pyquran.search_string_with_tashkeel,# pyquran.frequency_of_character,# pyquran.frequency_sura_level,# pyquran.frequency_quran_level,# pyquran.sort_dictionary_by_similarity,# pyquran.check_sura_with_frequency, pyquran.search_sequence, ]}, ] ROOT = 'https://github.com/TahaMagdy/PyQuran' def get_earliest_class_that_defined_member(member, cls): ancestors = get_classes_ancestors([cls]) result = None for ancestor in ancestors: if member in dir(ancestor): result = ancestor if not result: return cls return result def get_classes_ancestors(classes): ancestors = [] for cls in classes: ancestors += cls.__bases__ filtered_ancestors = [] for ancestor in ancestors: if ancestor.__name__ in ['object']: continue filtered_ancestors.append(ancestor) if filtered_ancestors: return filtered_ancestors + get_classes_ancestors(filtered_ancestors) else: return filtered_ancestors def get_function_signature(function, method=True): signature = inspect.getargspec(function) defaults = signature.defaults if method: args = 
signature.args[1:] else: args = signature.args if defaults: kwargs = zip(args[-len(defaults):], defaults) args = args[:-len(defaults)] else: kwargs = [] st = '%s.%s(' % (function.__module__, function.__name__) for a in args: st += str(a) + ', ' for a, v in kwargs: if isinstance(v, str): v = '\'' + v + '\'' st += str(a) + '=' + str(v) + ', ' if kwargs or args: return st[:-2] + ')' else: return st + ')' def get_class_signature(cls): try: class_signature = get_function_signature(cls.__init__) class_signature = class_signature.replace('__init__', cls.__name__) except: # in case the class inherits from object and does not # define __init__ class_signature = cls.__module__ + '.' + cls.__name__ + '()' return class_signature def class_to_docs_link(cls): module_name = cls.__module__ assert module_name[:6] == 'keras.' module_name = module_name[6:] link = ROOT + module_name.replace('.', '/') + '#' + cls.__name__.lower() return link def class_to_source_link(cls): module_name = cls.__module__ assert module_name[:6] == 'core/pyquran.' 
path = module_name.replace('.', '/') path += '.py' line = inspect.getsourcelines(cls)[-1] link = 'https://github.com/TahaMagdy/PyQuran' + path + '#L' + str(line) return '[[source]](' + link + ')' def code_snippet(snippet): result = '```python\n' result += snippet + '\n' result += '```\n' return result def process_class_docstring(docstring): docstring = re.sub(r'\n # (.*)\n', r'\n __\1__\n\n', docstring) docstring = re.sub(r' ([^\s\\]+):(.*)\n', r' - __\1__:\2\n', docstring) docstring = docstring.replace(' ' * 5, '\t\t') docstring = docstring.replace(' ' * 3, '\t') docstring = docstring.replace(' ', '') return docstring def process_function_docstring(docstring): docstring = re.sub(r' # (.*)\n', r'\n __\1__\n\n', docstring) docstring = re.sub(r'What it does\n', r'__What it does__\n\n', docstring) docstring = re.sub(r'Args:\n', r'\n__Args__\n\n', docstring) docstring = re.sub(r'Note:\n', r'\n__Note__\n\n', docstring) docstring = re.sub(r'Returns:\n', r'\n__Returns__\n\n', docstring) docstring = re.sub(r'Assumption:\n', r'\n__Assumption__\n\n', docstring) docstring = re.sub(r'Cases:\n', r'\n__Cases__\n\n', docstring) docstring = re.sub(r'Issue:\n', r'\n__Issue__\n\n', docstring) docstring = re.sub(r'Example:\n', r'\n__Example__\n\n', docstring) docstring = re.sub(r' ([^\s\\]+):(.*)\n', r'\n - __\1__:\2\n', docstring) docstring = docstring.replace(' ' * 6, '\t\t') docstring = docstring.replace(' ' * 4, '\t') docstring = docstring.replace(' ', '') return docstring print('Cleaning up existing sources directory.') if os.path.exists('sources'): shutil.rmtree('sources') print('Populating sources directory with templates.') for subdir, dirs, fnames in os.walk('templates'): for fname in fnames: new_subdir = subdir.replace('templates', 'sources') if not os.path.exists(new_subdir): os.makedirs(new_subdir) if fname[-3:] == '.md': fpath = os.path.join(subdir, fname) new_fpath = fpath.replace('templates', 'sources') shutil.copy(fpath, new_fpath) # Take care of index page. 
#readme = open('README.md').read() #index = open('index.md').read() #index = index.replace('{{autogenerated}}', readme[readme.find('##'):]) #f = open('index.md', 'w') #f.write(index) #f.close() print('Starting autogeneration.') ''' for page_data in PAGES: blocks = [] classes = page_data.get('classes', []) for module in page_data.get('all_module_classes', []): module_classes = [] for name in dir(module): if name[0] == '_' or name in EXCLUDE: continue module_member = getattr(module, name) if inspect.isclass(module_member): cls = module_member if cls.__module__ == module.__name__: if cls not in module_classes: module_classes.append(cls) module_classes.sort(key=lambda x: id(x)) classes += module_classes for cls in classes: subblocks = [] signature = get_class_signature(cls) subblocks.append('' + class_to_source_link(cls) + '') subblocks.append('### ' + cls.__name__ + '\n') subblocks.append(code_snippet(signature)) docstring = cls.__doc__ if docstring: subblocks.append(process_class_docstring(docstring)) blocks.append('\n'.join(subblocks)) ''' for page_data in PAGES: blocks = [] functions = page_data.get('functions', []) for module in page_data.get('all_module_functions', []): module_functions = [] for name in dir(module): if name[0] == '_' or name in EXCLUDE: continue module_member = getattr(module, name) if inspect.isfunction(module_member): function = module_member if module.__name__ in function.__module__: if function not in module_functions: module_functions.append(function) module_functions.sort(key=lambda x: id(x)) functions += module_functions for function in functions: subblocks = [] # TEST print(function) signature = get_function_signature(function, method=False) signature = signature.replace(function.__module__ + '.', '') subblocks.append('### ' + function.__name__ + '\n') subblocks.append(code_snippet(signature)) docstring = function.__doc__ if docstring: subblocks.append(process_function_docstring(docstring)) blocks.append('\n\n'.join(subblocks)) if not 
blocks: raise RuntimeError('Found no content for page ' + page_data['page']) mkdown = '\n----\n\n'.join(blocks) # save module page. # Either insert content into existing page, # or create page otherwise page_name = page_data['page'] path = os.path.join('docs/', page_name) # ''' if os.path.exists(path): template = open(path).read() assert '{{autogenerated}}' in template, ('Template found for ' + path + ' but missing {{autogenerated}} tag.') mkdown = template.replace('{{autogenerated}}', mkdown) print('...inserting autogenerated content into template:', path) else: print('...creating new page with autogenerated content:', path) # ''' print('...creating new page with autogenerated content:', path) subdir = os.path.dirname(path) ''' if not os.path.exists(subdir): os.makedirs(subdir) ''' open(path, 'w').write(mkdown) ================================================ FILE: documentation/docs/Alphabetical-Systems.md ================================================ What do we mean by Alphabetical Systems?! ================================================ FILE: documentation/docs/CONTRIBUTING.md ================================================ Contributing to PyQuran ======================= We use GitHub issues for reporting bugs and for feature requests. If you want to give us a hand, you may pick one of the opened issues and solve a bug, implement a feature request or to suggest a new missing feature. ## Reporting issues When reporting a bug, use GitHub issue with the **Bug label**, please include as much details as possible about: - your operating system. - your python version. - a self-contained code to reproduce and demonstrate the Bug. **Issue will be closed if the Bug cannot be reproduced.** ## Feature Request Whenever you think PyQuran is missing a feature, create a GitHub issue with **Feature Request label**, define what you want precisely and include sufficient examples to cover all the new feature aspects. 
If you would like to implement it yourself, please read the
[Code Contribution](#code-contribution) section.

## Code Contribution
Your code has to meet [these standards](code_conventions.md).

## Contributing Flow
First, fork the project on [GitHub](https://github.com/TahaMagdy/PyQuran/),
then create a *feature branch* and start writing your changes.
We **DO NOT** accept changes to the *master branch*.
Once you are done, push the changes to *your feature branch*, then create a
*pull request* with an expressive title and description.

## Commit Messages
**It is important to commit properly**: we expect one commit per logical
change. A commit message should describe what has been changed, why, and
reference the issues fixed (if any).

**Commit Message Properties**:

1. The first line is the commit title. It should be at most 50 characters long and must be expressive.
2. Keep the second line blank.
3. Wrap all other lines in the message body at 80 columns.
4. Include `Fixes #N`, where _N_ is the issue number the commit fixes, if any.

Commits should look like the following:

```text
explain commit in one line

Body of commit message is a few lines of text, explaining things
in more detail, possibly giving some background about the issue
being fixed, etc.

The body of the commit message **can be several paragraphs**, and
please do proper word-wrap and keep columns shorter than about
80 characters.

Fixes #101
```

## Unit Tests
We write a test module for every PyQuran module under `PyQuran/testing`.

**Naming** If the module is called *X*, then its testing module is called *test_X*. *test_X* must have thorough unit tests for every single function.

**Note** You must run all testing modules before you make any pull request. Pull requests will not be accepted if any test fails, so please run them all first.
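The naming rule for test modules can be sketched as follows. This is only a minimal illustration using the standard `unittest` module: the function `count_words` and the class `TestCountWords` are hypothetical stand-ins and are not part of PyQuran. A real *test_X* module would import the functions of module *X* instead of defining them inline.

```python
import unittest


def count_words(text):
    """Hypothetical function under test (not a PyQuran function)."""
    # str.split() with no arguments splits on any run of whitespace.
    return len(text.split())


class TestCountWords(unittest.TestCase):
    """One test case per function, one method per behaviour."""

    def test_empty_string(self):
        self.assertEqual(count_words(''), 0)

    def test_repeated_spaces(self):
        # Consecutive spaces must not produce empty words.
        self.assertEqual(count_words('a b  c'), 3)
```

Assuming the file is saved as `test_X.py`, it could be run with `python -m unittest test_X`.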
================================================ FILE: documentation/docs/FAQ.md ================================================ Hello! ================================================ FILE: documentation/docs/Filtering-Special-Recitation-Symbols.md ================================================ # Quran Corpus We use the *Uthmani Text* of Quran from [tanzil](http://tanzil.net/docs/download).
Its MD5 hash is ```MD5 (quran-uthmani.xml) = 6aae945d556a1b28cfe682c0ea5ab518```.

# Recitation Symbols
The Quran is written in the Arabic alphabet, but *Quran scholars* have added marks that guide reciters and readers in pronunciation, such as marks for the kind of certain letters and pause marks. The following are the unique characters in the corpus.
((Table Unicode | Symbol | Kind {letter/mark}))

# Filtering Recitation Symbols
While fetching from the corpus, we run the following method to remove all the recitation marks, because **they are NOT letters**. The only character we replace is the Alef Wasl (ٱ), which becomes a plain Alef (ا): Alef Wasl and Alef are the same letter in Arabic, but Alef Wasl carries a mark above it to indicate that it is not pronounced as a glottal stop when the reading continues from the preceding word. [Read more about Alef Wasl](https://en.wikipedia.org/wiki/Hamza#Hamzat_wa%E1%B9%A3l).

This filtering is done at run time. We **do not** change the corpus at all.

**[source](https://github.com/hci-lab/PyQuran-Private/blob/master/tools/filtering.py#L107:#L134)**

> Also feel free to report any bugs or linguistic errors; you are most welcome, just
> open an [issue](https://github.com/hci-lab/PyQuran/issues).

================================================
FILE: documentation/docs/Home.md
================================================
* [FAQ](https://github.com/TahaMagdy/PyQuran/wiki/FAQ): answers to frequently asked questions

## Documentation
This is suitable for *PyQuran* users.

## Development
This section is for *PyQuran* maintainers.

## Project Structure
*PyQuran* is organized as follows:

- **core**: contains main functions/modules.
- **tools**: contains helper functions/modules.
- **testing**: contains unit tests for each module.
- **QuranCorpus**: contains Quran corpus and corpus hashes.

```
.
│   README.md
│   setup.py
|   __init__.py
|   ...
|
└───core
│   │   pyquran.py
│   │   ...
|
└───tools
|   │   filtering.py
|   |   ...
│
└───testing
|   │   test_filtering.py
|   |   ...
│
└───QuranCorpus
    │   quran-uthmani.xml
    |   ...
```

================================================
FILE: documentation/docs/PyQuran-Founders.md
================================================
# Graduation Project

# Contacts
Waleed A. Yousef, Ph.D. [wyousef at fci dot Helwan dot edu dot eg]()
Taha Magdy: tahamagdy@fci.helwan.edu.eg
Umar Mohammed: umar.ibrahime@fci.helwan.edu.eg ================================================ FILE: documentation/docs/Wiki-Home.md ================================================ ### Package Structure *PyQuran* is organized as the following: - **core**: contains main functions/modules. - **tools**: contains helper functions/modules. - **testing**: contains unit tests for each module. - **QuranCorpus**: contains Quran corpus and corpus hashes. ``` . │ README.md │ setup.py | __init__.py | ... | └───core │ │ pyquran.py │ │ ... | └───tools | │ filtering.py | | ... │ └───testing | │ test_filtering.py | | ... │ └───QuranCorpus │ quran-uthmani.xml | ... ``` ================================================ FILE: documentation/docs/analysis_tools.md ================================================ ### count_rasm ```python count_rasm(text, system=None) ``` counts the occerences of each letter (As `system` defines) in sura. __Args__ - __text__: [str], a list of strings , each inner list is ayah . - __system__: Optional, [[char]], revise [Alphabetical Systems](#alphabetical-systems), if `system` is not passed, the normal alphabet is applied. __Returns__ (N * P) ndarray (Matrix A): N is the number of verses, P is the alphabet (as defined in `system`). `A[i][j]` is the number of the letter `j` in the verse `i`. __Example__ ```python newSystem = [[q.beh, q.teh, q.theh], [q.jeem, q.hah, q.khah]] q.count_rasm(q.quran.get_sura(110), newSystem) >>>[[1 2 1 0 0 0 1 0 4 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 3 0 1 1 1 0 0] [1 2 0 0 2 0 0 0 5 0 2 0 1 0 1 0 0 0 0 0 0 0 2 0 0 4 0 3 1 3 1 3] [6 2 0 0 0 0 1 0 4 0 1 0 2 0 2 0 0 0 0 0 0 1 2 0 2 0 1 2 2 2 0 0]] ``` ---- ### search_string_with_tashkeel ```python search_string_with_tashkeel(string, key) ``` __Args__ - __string__: str, sentence to search by key. - __key__: str, taskeel pattern. __Assumption__ Searches tashkeel that is exciplitly included in string. 
__Returns__ - __find__: list of pairs where x and y are the start and end index of the matched. - __nod-found__: [] __Example__ ```python string = 'صِفْ ذَاْ ثَنَاْ كَمْ جَاْدَ شَخْصٌ' q.search_string_with_tashkeel(string, 'َْ') >>> [(3, 5), (7, 9), (10, 12), (13, 15), (17, 19)] ``` ---- ### frequency_of_character ```python frequency_of_character(characters, verse=None, chapterNum=0, verseNum=0, with_tashkeel=False) ``` counts the number of characters in a specific verse or sura or even the entrire Quran , __Note__ If verse and chapterNum is not passed, the entire Quran is targeted __Args__ - __verse__: str, this verse that you need to count it and default is None. chapterNum, int, chapter number is a number of 'sura' that will count it , and default is 0. - __verseNum__: int, verse number in sura. - __chracters__: [], list of characters that you want to count them. - __with_tashkeel__: Bool, to check if you want to search with tashkeel. __Returns__ {dic} : {str : int} a dictionary and keys is a characters and value is count of every chracter. __Example__ ```python q.frequency_of_character(['أ',"ب","تُ"],verseNum=2,with_tashkeel=False) #that will count the vers number **2** in all swar >>> {'أ': 101, 'ب': 133, 'تُ': 0} q.frequency_of_character(['أ',"ب","تُ"],chapterNum=1,verseNum=2,with_tashkeel=False) #that will count the vers number **2** in chapter **1** >>> {'أ': 0, 'ب': 1, 'تُ': 0} q.frequency_of_character(['أ',"ب","تُ"],chapterNum=1,verseNum=2,with_tashkeel=False) #that will count in **all Quran** >>> {'أ': 8900, 'ب': 11491, 'تُ': 2149} ``` ---- ### frequency_sura_level ```python frequency_sura_level(suraNumber) ``` Computes the frequency dictionary for a sura __Args__ - __suraNumber__: 1 <= Int <= 114. - __Return__: - __[aya_frequency_dictionary]__: the key of `aya_frequency_dictionary` is a unique word in aya and the corresponding value is its frequency. A list of frequency dictionaries for each verse of Sura. 
__Note__ * frequency dictionary is a python dict, which carries word frequencies for an aya. * Its key is (str) word, its value is (int) word frequency __Example__ ```python q.frequency_sura_level(suraNumber=1) >>> [{بسم': 1, 'الله': 1, 'الرحمن': 1, 'الرحيم': 1'}, {الحمد': 1, 'لله': 1, 'رب': 1, 'العلمين': 1'}, {الرحمن': 1, 'الرحيم': 1'}, {ملك': 1, 'يوم': 1, 'الدين': 1'}, {إياك': 1, 'نعبد': 1, 'وإياك': 1, 'نستعين': 1'}, {اهدنا': 1, 'الصرط': 1, 'المستقيم': 1'}, {عليهم': 2', صرط': 1', الذين': 1', أنعمت': 1', غير': 1', المغضوب': 1', ولا': 1', الضالين': 1'}] ``` ---- ### frequency_quran_level ```python frequency_quran_level() ``` Compute the words frequences of the Quran. __Returns__ - __[sura_level_frequency_dict]__: Revise the output of frequency_sura_level. ---- ### sort_dictionary_by_similarity ```python sort_dictionary_by_similarity(frequency_dictionary, threshold=0.8) ``` this function using to cluster words using similarity and sort every bunch of word by most common and sort bunches descending in same time __Args__ - __frequency_dictionary__: dict, frequency dictionary to be sorted. 
__Returns__ dict : {str: int} sorted dictionary __Example__ ```python frequency_dic = q.generate_frequency_dictionary(114) q.sort_dictionary_by_similarity(frequency_dic) # this dictionary is sorted using similarity 0.8 >>> {'أعوذ': 1, 'إذا': 2, 'العقد': 1, 'الفلق': 1, 'النفثت': 1, 'برب': 1, 'حاسد': 1, 'حسد': 1, 'خلق': 1, 'شر': 4, 'غاسق': 1, 'فى': 1, 'قل': 1, 'ما': 1, 'من': 1, 'وقب': 1, 'ومن': 3} ``` ---- ### check_sura_with_frequency ```python check_sura_with_frequency(sura_num, freq_dec) ``` this function check if frequency dictionary of specific sura is compatible with original sura in shapes count __Args__ suraNumber (int): sura number __Returns__ - __Boolean__: True :- if compatible Flase :- if not __Example__ ```python frequency_dic = q.generate_frequency_dictionary(114) q.check_sura_with_frequency(114, frequency_dic) >>> True ``` ---- ### search_sequence ```python search_sequence(sequancesList, verse=None, chapterNum=0, verseNum=0, mode=3) ``` take list of sequances and return matched sequance, it search in verse ot chapter or All Quran , it return for every match : 1 - matched sequance 2 - chapter number of occurrence 3 - token number if word and 0 if sentence Note : - if found verse != None it will use it en search . - if no verse and found chapterNum and verseNum it will - use this verse and use it to search. - if no verse and no verseNum and found chapterNum it will - search in chapter. - if no verse and no chapterNum and no verseNum it will search in All Quran. it has many modes: - search with decorated sequance (with tashkeel), and return matched sequance with decorates (with tashkil). - search without decorated sequance (without tashkeel), and return matched sequance without decorates (without tashkil). - search without decorated sequance (without tashkeel), and return matched sequance with decorates (with tashkil). __Args__ - __chapterNum__: int, number of chapter where function search. - __verseNum__: int, number of verse wher function search. 
- __sequancesList__: [], a list of sequances that you want to match them. - __mode__: int, this mode that you need to use and default mode 3. __Returns__ - __dict__: key is sequances and value is a list of matched_sequance and their positions. __Example__ ```python # search in chapter = 1 only using mode 3 (default) q.search_sequence(sequancesList=['ملك يوم الدين'],chapterNum=1) #it will return #{'sequance-1' : [ (matched_sequance , position , vers_num , chapter_num) , (....) ], # 'sequance-2' : [ (matched_sequance , position , vers_num , chapter_num) , (....) ] } # Note : position == 0 if sequance is a sentence and == word position if sequance is a word >>> {'ملك يوم الدين': [('مَلِكِ يَوْمِ الدِّينِ', 0, 4, 1)]} # search in all Quran using mode 3 (default) q.search_sequence(sequancesList=['ملك يوم']) >>> {'ملك يوم': [('مَلِكِ يَوْمِ', 0, 4, 1), ('الْمُلْكُ يَوْمَ', 0, 73, 6), ('الْمُلْكُ يَوْمَئِذٍ', 0, 56, 22), ('الْمُلْكُ يَوْمَئِذٍ', 0, 26, 25)]} ``` ================================================ FILE: documentation/docs/arabic_tools.md ================================================ ## Alphabets We use [PyArabic](https://pypi.python.org/pypi/PyArabic/0.6.2) constants which represents letters, instead of writting Arabic in the code. 
```python
hamza = u'\u0621'
alef_mad = u'\u0622'
alef_hamza_above = u'\u0623'
waw_hamza = u'\u0624'
alef_hamza_below = u'\u0625'
yeh_hamza = u'\u0626'
alef = u'\u0627'
beh = u'\u0628'
teh_marbuta = u'\u0629'
teh = u'\u062a'
theh = u'\u062b'
jeem = u'\u062c'
hah = u'\u062d'
khah = u'\u062e'
dal = u'\u062f'
thal = u'\u0630'
reh = u'\u0631'
zain = u'\u0632'
seen = u'\u0633'
sheen = u'\u0634'
sad = u'\u0635'
dad = u'\u0636'
tah = u'\u0637'
zah = u'\u0638'
ain = u'\u0639'
ghain = u'\u063a'
feh = u'\u0641'
qaf = u'\u0642'
kaf = u'\u0643'
lam = u'\u0644'
meem = u'\u0645'
noon = u'\u0646'
heh = u'\u0647'
waw = u'\u0648'
alef_maksura = u'\u0649'
yeh = u'\u064a'
madda_above = u'\u0653'
hamza_above = u'\u0654'
hamza_below = u'\u0655'
alef_wasl = u'\u0671'
```

## Alphabetical Systems (Definitions)

[**Rasm**](https://en.wikipedia.org/wiki/Rasm): any set of letters that are written in the same form; that is, they are indistinguishable in writing but can be distinguished from the context. For example, the letters ت ث ن ى can all be written with the single rasm ىـ, without dots.

**Alphabetical System**: a set of rasm, dynamically constructed by specifying the letters that you want to treat as one rasm. The default Arabic alphabet is a special case of an **Alphabetical System** in which each letter is its own rasm.

**Predefined systems** are stored in the `systems` object.

1. **Default**: each letter is treated as a unique rasm.
2. **Without Dots**: by removing the dots some letters become indistinguishable; those letters are treated as one rasm. The following example shows the (Without Dots) system as a list of lists, where each sublist contains the letters that share the same rasm.
3. **Hamazat**: treats any letter accompanied by hamza ء as one rasm.

**NOTE**: You may go further and construct your own system by specifying which letters you want to treat as one rasm; then you can do some statistical analysis such as count, variance, average, ...
Example: ```python q.systems.withoutDots Out: [['ب', 'ت', 'ث', 'ن'], # Rasm 1 ['ح', 'خ', 'ج'], # Rasm 2 ['د', 'ذ'], # Rasm 3 ['ر', 'ز'], # Rasm 4 ['س', 'ش'], # Rasm 5 ['ص', 'ض'], # Rasm 6 ['ط', 'ظ'], # Rasm 7 ['ع', 'غ'], # Rasm 8 ['ف', 'ق']] # Rasm 9 ``` ### Constructing a user-defined system: ```python system = [[alef_hamza_above, alef], [beh, teh]] ``` The previous piece of code means "Treat *alef_hamza_above* and *alef* as the same one latter, also treat *beh* and *teh* as one letter as well". The rest of letters can be dynamically constructed using `check_system()` And then, a system can be applied to some text analysis functions like counting, filtering, etc. ### check_system ```python check_system(system, index=None) ``` Returns the alphabet including treated-as-one letters. If you pass the index as the second optional arguement, it returns the letter of the that index only, not the hole alphabet. __Args__ - __system__: [[char]], a list of letters, where each letter to be treated as one letter are in one sub-list, see [Alphabetical Systems](#alphabetical-systems). - __index__: Optional integer, is a index of a letter in the new system. __Returns__ - __list__: full sorted system or a specific index. __Example__ ```python q.check_system([['alef', 'beh']]) >>> [['ء'], ['آ'], ['أ', 'ب'], ['ؤ'], ['إ'], ['ئ'], ['ا'], ['ة'], ['ت'], ['ث'], ['ج'], ['ح'], ['خ'], ['د'], ['ذ'], ['ر'], ['ز'], ['س'], ['ش'], ['ص'], ['ض'], ['ط'], ['ظ'], ['ع'], ['غ'], ['ف'], ['ق'], ['ك'], ['ل'], ['م'], ['ن'], ['ه'], ['و'], ['ى'], ['ي']] ``` The previous example prints each letter as one element in a new alphabet list, as you can see the two letters alef and beh are considered one letter. ---- ### factor_alef_mad ```python factor_alef_mad(sentance) ``` It returns the `sentance` having alef_mad factored into alef_hamza and alef_wasel. __Args__ - __sentance__: str, a string or list. 
__Returns__ - __str__: sentance having the alef_mad factored __Example__ ```python q.factor_alef_mad('آ') >>> 'أْأَ' ``` ---- ### grouping_letter_diacritics ```python grouping_letter_diacritics(sentance) ``` Grouping each letter with its diacritics. __Args__ - __sentance__: str __Returns__ - __[str]__: a list of _x_, where _x_ is the letter accompanied with its diacritics. __Example__ ```python q.grouping_letter_diacritics('إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ') >>> ['إِ', 'نَّ', 'ا', ' ', 'أَ', 'عْ', 'طَ', 'يْ', 'نَ', 'كَ', ' ', 'ا', 'لْ', 'كَ', 'وْ', 'ثَ', 'رَ'] ``` ---- ### alphabet_excluding ```python alphabet_excluding(excludedLetters) ``` returns the alphabet excluding `excludedLetters`. __Args__ - __excludedLetters__: list[Char], letters to be excluded from the alphabet. __Returns__ - __str__: alphabet excluding `excludedLetters`. __Example__ ```python q.alphabet_excluding([q.alef, q.beh, q.qaf, q.teh, q.dal, q.yeh, q.alef_mad]) >>> ['ء', 'ٔ', 'أ', 'ؤ', 'إ', 'ئ', 'ة', 'ث', 'ج', 'ح', 'خ', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ى'] ``` ---- ### strip_tashkeel ```python strip_tashkeel(string) ``` convert any letter in the `listOfLetter` to `letter` in the given text __Args__ - __string__: str, to drop tashkeel from. __Example__ ```python x = q.quran.get_verse(12, 2, with_tashkeel=True) x >>> 'إِنَّا أَنزَلْنَهُ قُرْءَنًا عَرَبِيًّا لَّعَلَّكُمْ تَعْقِلُونَ' q.strip_tashkeel(x) >>> 'إنا أنزلنه قرءنا عربيا لعلكم تعقلون' ``` ---- ### buckwalter_transliteration ```python buckwalter_transliteration(string, reverse=False) ``` Back and forth Arabic-Bauckwalter transliteration. Revise [Buckwalter](https://en.wikipedia.org/wiki/Buckwalter_transliteration) __Args__ - __string__: to be transliterated. - __reverse__: Optional boolean. `False` transliterates from Arabic to Bauckwalter, `True` transliterates from Bauckwalter to Arabic. __Returns__ - __str__: transliterated string. 
__Example__

```python
q.buckwalter_transliteration('إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ')
>>> aEoTayonaka Alokawovara
```

================================================
FILE: documentation/docs/authors.md
================================================
Authors
=======

- [Dr. Waleed A. Yousef](https://github.com/DrWaleedAYousef), Ph.D., [Human Computer Interaction Laboratory (HCI Lab.)](http://www.hciegypt.com/main/), wyousef@fci.helwan.edu.eg.
- [Taha M. Madbouly](https://github.com/TahaMagdy), B.Sc., tahamagdy@fci.helwan.edu.eg.
- [Omar M. Ibrahime](https://github.com/moroclash), B.Sc., umar.ibrahime@fci.helwan.edu.eg
- [Ali H. El-Kassas](https://github.com/Ali-Abdelmonim), B.Sc., alihassan2@fci.helwan.edu.eg
- [Ali O. Hassan](https://github.com/AliOsamaHassan), B.Sc., ali.osama@fci.helwan.edu.eg
- [Abdallah R. Albohy](https://github.com/abdo96), B.Sc., abdoengineer2015@gmail.com

================================================
FILE: documentation/docs/code_conventions.md
================================================
Code Conventions
================

These conventions help everyone read and maintain the code, even when maintaining someone else's code.
**Please stick to the rules.**

## Rules
* A line **must not** exceed *80 characters* in length.
* Use **Spaces** not **Tabs**.
* Always refer to the `example_google.py` file.
* We disagree with `example_google.py` on variable naming ONLY,
  and **we agree with it on everything else**.

## Naming
* **Class Name**: [PascalCase](https://en.wikipedia.org/wiki/PascalCase): initial letter is **upper case**
    * *Examples*: `Class, NewClass, ...`
* **Function**: [snake_case](https://en.wikipedia.org/wiki/Snake_case): lowercase underscore-separated names.
    * *Examples*: `foo, foo_name, ...`
* **Variables**: [lowerCamelCase](https://en.wikipedia.org/wiki/Camel_case): initial letter is **lower case** and the rest is PascalCase.
    * *Examples*: `variable, variableName, ...`

## Function prototype
* Functions should have a description followed by sections as in the following example.
* You don't need to include every section, but include what makes the function as clear as possible.
* **Function prototypes are also used for proposed functions**.

```python
def function_with_types_in_docstring(param1, param2):
    """Here you write a rigorous description of the function

    Args:
        param1 (int): The first parameter.
        param2 (str): The second parameter.

    Returns:
        bool: The return value. True for success, False otherwise.

    Note:
        Do not include the `self` parameter in the ``Args`` section.
    """
    # Empty Line
    pass  # in case it is just a prototype (not implemented yet)
```

# Google Standards Example
```python
# -*- coding: utf-8 -*-
"""Example Google style docstrings.

This module demonstrates documentation as specified by the `Google Python
Style Guide`_. Docstrings may extend over multiple lines. Sections are created
with a section header and a colon followed by a block of indented text.

Example:
    Examples can be given using either the ``Example`` or ``Examples``
    sections. Sections support any reStructuredText formatting, including
    literal blocks::

        $ python example_google.py

Section breaks are created by resuming unindented text. Section breaks
are also implicitly created anytime a new section starts.
Attributes: module_level_variable1 (int): Module level variables may be documented in either the ``Attributes`` section of the module docstring, or in an inline docstring immediately following the variable. Either form is acceptable, but the two should not be mixed. Choose one convention to document module level variables and be consistent with it. Todo: * For module TODOs * You have to also use ``sphinx.ext.todo`` extension .. _Google Python Style Guide: http://google.github.io/styleguide/pyguide.html """ module_level_variable1 = 12345 module_level_variable2 = 98765 """int: Module level variable documented inline. The docstring may span multiple lines. The type may optionally be specified on the first line, separated by a colon. """ def function_with_types_in_docstring(param1, param2): """Example function with types documented in the docstring. `PEP 484`_ type annotations are supported. If attribute, parameter, and return types are annotated according to `PEP 484`_, they do not need to be included in the docstring: Args: param1 (int): The first parameter. param2 (str): The second parameter. Returns: bool: The return value. True for success, False otherwise. .. _PEP 484: https://www.python.org/dev/peps/pep-0484/ """ def function_with_pep484_type_annotations(param1: int, param2: str) -> bool: """Example function with PEP 484 type annotations. Args: param1: The first parameter. param2: The second parameter. Returns: The return value. True for success, False otherwise. """ def module_level_function(param1, param2=None, *args, **kwargs): """This is an example of a module level function. Function parameters should be documented in the ``Args`` section. The name of each parameter is required. The type and description of each parameter is optional, but should be included if not obvious. If \*args or \*\*kwargs are accepted, they should be listed as ``*args`` and ``**kwargs``. The format for a parameter is:: name (type): description The description may span multiple lines. 
Following lines should be indented. The "(type)" is optional. Multiple paragraphs are supported in parameter descriptions. Args: param1 (int): The first parameter. param2 (:obj:`str`, optional): The second parameter. Defaults to None. Second line of description should be indented. *args: Variable length argument list. **kwargs: Arbitrary keyword arguments. Returns: bool: True if successful, False otherwise. The return type is optional and may be specified at the beginning of the ``Returns`` section followed by a colon. The ``Returns`` section may span multiple lines and paragraphs. Following lines should be indented to match the first line. The ``Returns`` section supports any reStructuredText formatting, including literal blocks:: { 'param1': param1, 'param2': param2 } Raises: AttributeError: The ``Raises`` section is a list of all exceptions that are relevant to the interface. ValueError: If `param2` is equal to `param1`. """ if param1 == param2: raise ValueError('param1 may not be equal to param2') return True def example_generator(n): """Generators have a ``Yields`` section instead of a ``Returns`` section. Args: n (int): The upper limit of the range to generate, from 0 to `n` - 1. Yields: int: The next number in the range of 0 to `n` - 1. Examples: Examples should be written in doctest format, and should illustrate how to use the function. >>> print([i for i in example_generator(4)]) [0, 1, 2, 3] """ for i in range(n): yield i class ExampleError(Exception): """Exceptions are documented in the same way as classes. The __init__ method may be documented in either the class level docstring, or as a docstring on the __init__ method itself. Either form is acceptable, but the two should not be mixed. Choose one convention to document the __init__ method and be consistent with it. Note: Do not include the `self` parameter in the ``Args`` section. Args: msg (str): Human readable string describing the exception. code (:obj:`int`, optional): Error code. 
Attributes: msg (str): Human readable string describing the exception. code (int): Exception error code. """ def __init__(self, msg, code): self.msg = msg self.code = code class ExampleClass(object): """The summary line for a class docstring should fit on one line. If the class has public attributes, they may be documented here in an ``Attributes`` section and follow the same formatting as a function's ``Args`` section. Alternatively, attributes may be documented inline with the attribute's declaration (see __init__ method below). Properties created with the ``@property`` decorator should be documented in the property's getter method. Attributes: attr1 (str): Description of `attr1`. attr2 (:obj:`int`, optional): Description of `attr2`. """ def __init__(self, param1, param2, param3): """Example of docstring on the __init__ method. The __init__ method may be documented in either the class level docstring, or as a docstring on the __init__ method itself. Either form is acceptable, but the two should not be mixed. Choose one convention to document the __init__ method and be consistent with it. Note: Do not include the `self` parameter in the ``Args`` section. Args: param1 (str): Description of `param1`. param2 (:obj:`int`, optional): Description of `param2`. Multiple lines are supported. param3 (:obj:`list` of :obj:`str`): Description of `param3`. """ self.attr1 = param1 self.attr2 = param2 self.attr3 = param3 #: Doc comment *inline* with attribute #: list of str: Doc comment *before* attribute, with type specified self.attr4 = ['attr4'] self.attr5 = None """str: Docstring *after* attribute, with type specified.""" @property def readonly_property(self): """str: Properties should be documented in their getter method.""" return 'readonly_property' @property def readwrite_property(self): """:obj:`list` of :obj:`str`: Properties with both a getter and setter should only be documented in their getter method. 
If the setter method contains notable behavior, it should be mentioned here. """ return ['readwrite_property'] @readwrite_property.setter def readwrite_property(self, value): value def example_method(self, param1, param2): """Class methods are similar to regular functions. Note: Do not include the `self` parameter in the ``Args`` section. Args: param1: The first parameter. param2: The second parameter. Returns: True if successful, False otherwise. """ return True def __special__(self): """By default special members with docstrings are not included. Special members are any methods or attributes that start with and end with a double underscore. Any special member with a docstring will be included in the output, if ``napoleon_include_special_with_doc`` is set to True. This behavior can be enabled by changing the following setting in Sphinx's conf.py:: napoleon_include_special_with_doc = True """ pass def __special_without_docstring__(self): pass def _private(self): """By default private members are not included. Private members are any methods or attributes that start with an underscore and are *not* special. By default they are not included in the output. This behavior can be changed such that private members *are* included by changing the following setting in Sphinx's conf.py:: napoleon_include_private_with_doc = True """ pass def _private_without_docstring(self): pass ``` ================================================ FILE: documentation/docs/dictFrec.md ================================================ Comming soon. ================================================ FILE: documentation/docs/example_google.md ================================================ ```python # -*- coding: utf-8 -*- """Example Google style docstrings. This module demonstrates documentation as specified by the `Google Python Style Guide`_. Docstrings may extend over multiple lines. Sections are created with a section header and a colon followed by a block of indented text. 
Example: Examples can be given using either the ``Example`` or ``Examples`` sections. Sections support any reStructuredText formatting, including literal blocks:: $ python example_google.py Section breaks are created by resuming unindented text. Section breaks are also implicitly created anytime a new section starts. Attributes: module_level_variable1 (int): Module level variables may be documented in either the ``Attributes`` section of the module docstring, or in an inline docstring immediately following the variable. Either form is acceptable, but the two should not be mixed. Choose one convention to document module level variables and be consistent with it. Todo: * For module TODOs * You have to also use ``sphinx.ext.todo`` extension .. _Google Python Style Guide: http://google.github.io/styleguide/pyguide.html """ module_level_variable1 = 12345 module_level_variable2 = 98765 """int: Module level variable documented inline. The docstring may span multiple lines. The type may optionally be specified on the first line, separated by a colon. """ def function_with_types_in_docstring(param1, param2): """Example function with types documented in the docstring. `PEP 484`_ type annotations are supported. If attribute, parameter, and return types are annotated according to `PEP 484`_, they do not need to be included in the docstring: Args: param1 (int): The first parameter. param2 (str): The second parameter. Returns: bool: The return value. True for success, False otherwise. .. _PEP 484: https://www.python.org/dev/peps/pep-0484/ """ def function_with_pep484_type_annotations(param1: int, param2: str) -> bool: """Example function with PEP 484 type annotations. Args: param1: The first parameter. param2: The second parameter. Returns: The return value. True for success, False otherwise. """ def module_level_function(param1, param2=None, *args, **kwargs): """This is an example of a module level function. Function parameters should be documented in the ``Args`` section. 
The name of each parameter is required. The type and description of each parameter is optional, but should be included if not obvious. If \*args or \*\*kwargs are accepted, they should be listed as ``*args`` and ``**kwargs``. The format for a parameter is:: name (type): description The description may span multiple lines. Following lines should be indented. The "(type)" is optional. Multiple paragraphs are supported in parameter descriptions. Args: param1 (int): The first parameter. param2 (:obj:`str`, optional): The second parameter. Defaults to None. Second line of description should be indented. *args: Variable length argument list. **kwargs: Arbitrary keyword arguments. Returns: bool: True if successful, False otherwise. The return type is optional and may be specified at the beginning of the ``Returns`` section followed by a colon. The ``Returns`` section may span multiple lines and paragraphs. Following lines should be indented to match the first line. The ``Returns`` section supports any reStructuredText formatting, including literal blocks:: { 'param1': param1, 'param2': param2 } Raises: AttributeError: The ``Raises`` section is a list of all exceptions that are relevant to the interface. ValueError: If `param2` is equal to `param1`. """ if param1 == param2: raise ValueError('param1 may not be equal to param2') return True def example_generator(n): """Generators have a ``Yields`` section instead of a ``Returns`` section. Args: n (int): The upper limit of the range to generate, from 0 to `n` - 1. Yields: int: The next number in the range of 0 to `n` - 1. Examples: Examples should be written in doctest format, and should illustrate how to use the function. >>> print([i for i in example_generator(4)]) [0, 1, 2, 3] """ for i in range(n): yield i class ExampleError(Exception): """Exceptions are documented in the same way as classes. The __init__ method may be documented in either the class level docstring, or as a docstring on the __init__ method itself. 
Either form is acceptable, but the two should not be mixed. Choose one convention to document the __init__ method and be consistent with it. Note: Do not include the `self` parameter in the ``Args`` section. Args: msg (str): Human readable string describing the exception. code (:obj:`int`, optional): Error code. Attributes: msg (str): Human readable string describing the exception. code (int): Exception error code. """ def __init__(self, msg, code): self.msg = msg self.code = code class ExampleClass(object): """The summary line for a class docstring should fit on one line. If the class has public attributes, they may be documented here in an ``Attributes`` section and follow the same formatting as a function's ``Args`` section. Alternatively, attributes may be documented inline with the attribute's declaration (see __init__ method below). Properties created with the ``@property`` decorator should be documented in the property's getter method. Attributes: attr1 (str): Description of `attr1`. attr2 (:obj:`int`, optional): Description of `attr2`. """ def __init__(self, param1, param2, param3): """Example of docstring on the __init__ method. The __init__ method may be documented in either the class level docstring, or as a docstring on the __init__ method itself. Either form is acceptable, but the two should not be mixed. Choose one convention to document the __init__ method and be consistent with it. Note: Do not include the `self` parameter in the ``Args`` section. Args: param1 (str): Description of `param1`. param2 (:obj:`int`, optional): Description of `param2`. Multiple lines are supported. param3 (:obj:`list` of :obj:`str`): Description of `param3`. 
""" self.attr1 = param1 self.attr2 = param2 self.attr3 = param3 #: Doc comment *inline* with attribute #: list of str: Doc comment *before* attribute, with type specified self.attr4 = ['attr4'] self.attr5 = None """str: Docstring *after* attribute, with type specified.""" @property def readonly_property(self): """str: Properties should be documented in their getter method.""" return 'readonly_property' @property def readwrite_property(self): """:obj:`list` of :obj:`str`: Properties with both a getter and setter should only be documented in their getter method. If the setter method contains notable behavior, it should be mentioned here. """ return ['readwrite_property'] @readwrite_property.setter def readwrite_property(self, value): value def example_method(self, param1, param2): """Class methods are similar to regular functions. Note: Do not include the `self` parameter in the ``Args`` section. Args: param1: The first parameter. param2: The second parameter. Returns: True if successful, False otherwise. """ return True def __special__(self): """By default special members with docstrings are not included. Special members are any methods or attributes that start with and end with a double underscore. Any special member with a docstring will be included in the output, if ``napoleon_include_special_with_doc`` is set to True. This behavior can be enabled by changing the following setting in Sphinx's conf.py:: napoleon_include_special_with_doc = True """ pass def __special_without_docstring__(self): pass def _private(self): """By default private members are not included. Private members are any methods or attributes that start with an underscore and are *not* special. By default they are not included in the output. 
        This behavior can be changed such that private members *are* included
        by changing the following setting in Sphinx's conf.py::

            napoleon_include_private_with_doc = True

        """
        pass

    def _private_without_docstring(self):
        pass
```

================================================
FILE: documentation/docs/index.md
================================================
# PyQuran: The Python package for Quranic Analysis

PyQuran is a package that provides tools for Quranic analysis and for working with Arabic text. It is still a small package and needs a lot of your effort. We believe it is the seed of a fundamental, general-purpose package for computations on the Quran with Python, even at the most basic level, which is simply retrieving the Quran text.

*Before Islam*, Arabic letters were written without dots ([*rasm*](https://en.wikipedia.org/wiki/Rasm)), which resulted in ambiguity: two or three letters could share the same rasm, or form. Muslims later removed this ambiguity by adding dots above or below the letters that share a rasm, so that each letter now has a unique form. The Quran itself was originally written using letters without dots.

To let researchers work with the modern alphabet, the old rasm, or any other grouping of letters, we introduce *alphabetical systems*: a dynamic construction of letters.

## Quran Corpus
We use the [tanzil](http://tanzil.net/docs/download) Quran Corpus (*Uthmani Text*), which is encoded in `UTF-8`.
You can find all unique characters of the Uthmani corpus [here](https://hci-lab.github.io/PyQuran-Private/Filtering-Special-Recitation-Symbols/#recitation-symbols).

The *Uthmani Text* contains *special recitation symbols* مصطلحات الضبط, which guide the reciter to the right positions to pause and to the rules of tajweed. We provide an interface to filter those symbols *on the fly while fetching from the corpus*; we **DO NOT** change the corpus, NEVER.
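
The idea of filtering on the fly can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `RECITATION_SYMBOLS` and `strip_symbols` are hypothetical names, not PyQuran's API, and the symbol set is a small sample of Unicode Quranic annotation signs rather than the list PyQuran actually filters.

```python
# Illustrative subset of Unicode Quranic annotation signs (U+06D6..U+06DC).
# Assumption: this is NOT the list PyQuran filters; it only demonstrates the idea.
RECITATION_SYMBOLS = {chr(cp) for cp in range(0x06D6, 0x06DD)}


def strip_symbols(verse, symbols=RECITATION_SYMBOLS):
    """Return a copy of `verse` without the given symbols.

    Hypothetical helper: the input string (and the corpus it came from)
    is left untouched, because a new string is built.
    """
    return ''.join(ch for ch in verse if ch not in symbols)


raw = '\u0628\u06da\u062a'        # beh, small high jeem, teh
clean = strip_symbols(raw)
print(len(raw), len(clean))       # 3 2
```

Because the filter builds a new string on every call, the stored text never changes, which is the property the paragraph above insists on.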
[Full details about filtering *special recitation symbols* مصطلحات الضبط.](https://hci-lab.github.io/PyQuran-Private/Filtering-Special-Recitation-Symbols/#recitation-symbols)

## Current Features
- [Quran Retrieving.](https://hci-lab.github.io/PyQuran-Private/quran_tools/)
- Advanced Searching, by [Text](https://hci-lab.github.io/PyQuran-Private/analysis_tools/#search_sequence) and [Diacritics](https://hci-lab.github.io/PyQuran-Private/analysis_tools/#search_string_with_tashkeel) Patterns.
- [Buckwalter Transliteration](https://hci-lab.github.io/PyQuran-Private/arabic_tools/#buckwalter_transliteration), back and forth.
- Multiple [Alphabetical Systems](https://hci-lab.github.io/PyQuran-Private/arabic_tools/#alphabetical-systems).
- Words Frequency Table المعجم الترددى للألفاظ.

## PyQuran Needs and Upcoming Features
- Words Frequency Table filtered according to word meaning.
- Morphological analysis of words down to their roots.
- Arabic tools for representing Arabic text for AI algorithms and neural networks, for more serious Arabic text processing and understanding. Those tools should take meaning, diacritics, roots and other morphological aspects into account.
- Some in-house tools and architecture enhancements will be posted on GitHub Issues for contributors, to make PyQuran professional and easy to use.

## Contributing
To contribute to and maintain PyQuran, please read the [CONTRIBUTING](https://hci-lab.github.io/PyQuran-Private/CONTRIBUTING) section.

## Dependencies
- [numpy](http://www.numpy.org/)
- [pyarabic](https://github.com/linuxscout/pyarabic)

## Install
- From PyPI: `$ pip3 install pyquran`

## Citing
```
@MISC {PyQuran2018,
author = "Waleed A. Yousef and Taha M. Madbouly and Omar M. Ibrahime and Ali H. El-Kassas and Ali O. Hassan and Abdallah R.
Albohy",
title = "PyQuran: The Python package for Quranic Analysis",
howpublished = "https://hci-lab.github.io/PyQuran-Private",
year = "2018"}
```

## Communication
[Author Page](https://hci-lab.github.io/PyQuran-Private/authors)


================================================
FILE: documentation/docs/maintainers.md
================================================


================================================
FILE: documentation/docs/methods guide.md
================================================
```python
X
```

X

**Arguments**

- **X**: X

**Example**

```python
X
```

--------

-------------------

# Thbeed
* [Features](#features)
* [Important information](#imporatan-information)
* [Usage](#usage)
* [Functions](#functions)
    * [Access functions](#access-functions) [x] DONE
    * [Manipulate functions](#manipulate-functions) [x] DONE
    * [Analysis functions](#analysis-functions)
        * [count_shape](#count_shape)
        * [count_token](#count_token)
        * [frequency_of_character](#frequency_of_character)
        * [generate_frequancy_dictionary](#generate_frequancy_dictionary)
        * [sort_dictionary_by_similarity](#sort_dictionary_by_similarity)
        * [check_sura_with_frequency](#check_sura_with_frequency)
        * [generate_latex_table](#generate_latex_table)
    * [Search functions](#search-functions)
        * [search_sequence](#search_sequence)
        * [search_string_with_tashkeel](#search_string_with_tashkeel)
        * [search_with_pattern](#search_with_pattern)

# Features
* Access the Holy Quran:
    - get a **Chapter** with/without diacritics.
    - get a **Verse** with/without diacritics.
    - get a **Token** (word).
    - get the **Chapter name**, **Chapter number**.
    - get **Verses number** in verse.
* Manipulate the Holy Quran:
    - Separate into **letters** with/without diacritics.
    - Apply your **System** to the Quran.
    - get the **Binary representation** of the Holy Quran as 0's and 1's.
    - Extract **Tashkeel** from a sentence.
    - Deal with linguistic rules such as:
        - Transfer Alef-mad **"آ"** to "أَأْ"
    - Convert the **Unicode of Arabic** text to **Buckwalter encoding** and vice versa
    - Convert the Quran to its **Buckwalter representation** and vice versa.
* Analyze the Holy Quran:
    - get the **Frequency Matrix** of letters, depending on the applied _alphabet system_.
    - get the **Frequency dictionary** of tokens.
    - sort the **Frequency dictionary** using a similarity threshold.
* Search the Holy Quran using:
    - **Text**, with a variety of options.
    - **diacritics pattern**.
    - **binary representation pattern** using a threshold.

# Functions
## Manipulate functions:
## Analysis functions:
#### count_shape
**count_shape(text, system=None)**
- takes **text** (chapter/verse) and an optional **system**, which groups characters by shape, for example [[bah,gem]], and returns an **n*p matrix**, where **n** is the number of verses and **p** is the number of collections in the system. If no system is passed, the default one is applied.
```python
newSystem=[[beh, teh, theh], [jeem, hah, khah]]
alphabetAsOneShape = pq.count_shape(get_sura(110), newSystem)
print(alphabetAsOneShape)
>>> [[1 2 1 0 0 0 1 0 4 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 3 0 1 1 1 0 0]
 [1 2 0 0 2 0 0 0 5 0 2 0 1 0 1 0 0 0 0 0 0 0 2 0 0 4 0 3 1 3 1 3]
 [6 2 0 0 0 0 1 0 4 0 1 0 2 0 2 0 0 0 0 0 0 1 2 0 2 0 1 2 2 2 0 0]]
```
#### count_token
**count_token(text)**
- takes **text** (chapter/verse) and returns the number of tokens.
###### ***note***: the particle (harf) 'و' is not counted as a token on its own
```python
numberOfToken = pq.count_token(tools.get_sura(110))
print(numberOfToken)
>>> 19
```
#### frequency_of_character
**frequency_of_character(characters,verse=None,chapterNum=0,verseNum=0,with_tashkeel=False)**
- takes the **characters** you want to count and returns a dictionary of their occurrence counts in a verse, a chapter, or the whole Quran; each key is a character and each value is its number of occurrences.
- optional options:
    - **verse** (str): if passed, the count is applied to this string only.
    - **chapterNum** (int): if passed alone, the count is applied to this chapter only.
    - **verseNum** (int):
        - if passed alone, the count is applied to verse **verseNum** of **all chapters**.
        - if passed with **chapterNum**, the count is applied to verse verseNum of **chapterNum**.
    - **with_tashkeel** (bool):
        - if **True**, the count is applied to the Quran **with** tashkeel.
        - if **False**, the count is applied to the Quran **without** tashkeel.
- Note: if none of the **optional options** is passed, the count is applied to the whole **Quran**.
```python
frequencyOfChar = tools.frequency_of_character(['أ','ب'],'قل أعوذ برب الناس',114,1)
print(frequencyOfChar)
>>> {أ:1,ب:2}
```
#### generate_frequancy_dictionary
**generate_frequency_dictionary(suraNumber=None)**
- takes an optional **suraNumber** (the chapter number) and returns a dictionary of words, with each **word** as a key and its **frequency** as the value. If **suraNumber** is not passed, the function is applied to the **whole Quran**.
```python
dictionaryFrequency = pq.generate_frequency_dictionary(114)
print(dictionaryFrequency)
>>> {'الناس': 4, 'من': 2, 'قل': 1, 'أعوذ': 1, 'برب': 1, 'ملك': 1, 'إله': 1, 'شر': 1, 'الوسواس': 1, 'الخناس': 1, 'الذى': 1, 'يوسوس': 1, 'فى': 1, 'صدور': 1, 'الجنة': 1, 'والناس': 1}
```
#### sort_dictionary_by_similarity
**sort_dictionary_by_similarity(frequency_dictionary,threshold=0.8)**
- used to **cluster words by similarity**: it sorts the words within each cluster by frequency and, at the same time, sorts the clusters in descending order. It takes the frequency dictionary generated by the [generate_frequency_dictionary](#generate_frequancy_dictionary) function.
This function takes a dictionary of frequencies and an optional **threshold** that specifies **the degree of similarity**.
```python
sortedDictionary = pq.sort_dictionary_by_similarity(dictionaryFrequency)
print(sortedDictionary)
>>> {'الناس': 4, 'الخناس': 1, 'والناس': 1, 'من': 2, 'قل': 1, 'أعوذ': 1, 'برب': 1, 'ملك': 1, 'إله': 1, 'شر': 1, 'الوسواس': 1, 'الذى': 1, 'يوسوس': 1, 'فى': 1, 'صدور': 1, 'الجنة': 1}
```
#### check_sura_with_frequency
**check_sura_with_frequency(sura_num,freq_dec)**
- checks whether the frequency dictionary of a **specific chapter** is compatible with the **original chapter** in the Quran. It takes **sura_num** (chapter number) and **freq_dec** (frequency dictionary) and returns **True** if they are compatible and **False** if not.
```python
dictionaryFrequency = pq.generate_frequency_dictionary(111)
matched = pq.check_sura_with_frequency(110,dictionaryFrequency)
print(matched)
>>> False
```
#### generate_latex_table
**generate_latex_table(dictionary,filename,location=".")**
- generates the LaTeX code for a frequency table. It takes **dictionary** (frequency dictionary), **filename** and **location** (where to save the file; the default, '.', is the current directory). It returns **True** if the generation completed successfully and **False** if something went wrong.
```python
latexTable = pq.generate_latex_table(dictionaryFrequency,'any_file_name')
print(latexTable)
>>> True
```
## Search functions
#### search_sequence
**search_sequence(sequancesList,verse=None,chapterNum=0,verseNum=0,mode=3)**
- takes a list of sequences and returns the matched sequences; it searches in a verse, a chapter or the whole Quran.
- for every match it returns:
    - the matched sequence
    - the chapter number of the occurrence
    - the token number if the sequence is a word, and 0 if it is a sentence
- Note:
    - if **verse** is not None, the search uses it.
    - if there is no verse but both chapterNum and verseNum are given, the search uses that verse.
    - if there is no verse and no verseNum but chapterNum is given, the search covers that chapter.
    - if there is no verse, no chapterNum and no verseNum, the search covers the whole Quran.
- it has several modes:
    1. search with a decorated sequence (with tashkeel) and return the matched sequence with decoration (with tashkeel).
    2. search with an undecorated sequence (without tashkeel) and return the matched sequence without decoration (without tashkeel).
    3. search with an undecorated sequence (without tashkeel) and return the matched sequence with decoration (with tashkeel).
- optional options:
    - **verse** (str): if passed, the search is applied to this string only.
    - **chapterNum** (int): if passed alone, the search is applied to this chapter only.
    - **verseNum** (int):
        - if passed alone, the search is applied to verse **verseNum** of **all chapters**.
        - if passed with **chapterNum**, the search is applied to verse verseNum of **chapterNum**.
    - **with_tashkeel** (bool):
        - if **True**, the search is applied to the Quran **with** tashkeel.
        - if **False**, the search is applied to the Quran **without** tashkeel.
    - **mode** (int): the mode to use; the default is 3.
- Note: if none of the **optional options** is passed, the search is applied to the whole **Quran**.
- Returns: dict(): each key is a sequence and each value is a list of matched sequences and their positions.
```python
matchedKeyword = pq.search_sequence(['قل أعوذ برب'])
print(matchedKeyword)
>>> {'قل أعوذ برب': [('قُلْ أَعُوذُ بِرَبِّ', 0, 1, 113), ('قُلْ أَعُوذُ بِرَبِّ', 0, 1, 114)]}
```
#### search_string_with_tashkeel
**search_string_with_tashkeel(sentence,tashkeel_pattern)**
- takes a **sentence** and a **tashkeel_pattern** (composed of 0's and 1's) and returns the locations that match the diacritics pattern, as pairs of start index (**inclusive**) and end index (**exclusive**). It returns an empty list if nothing is found.
```python
sentence = 'صِفْ ذَاْ ثَنَاْ كَمْ جَاْدَ شَخْصٌ'
tashkeel_pattern = ar.fatha + ar.sukun
results = pq.search_string_with_tashkeel(sentence,tashkeel_pattern)
print(results)
>>> [(3, 5), (7, 9), (10, 12), (13, 15), (17, 19)]
```
#### search_with_pattern
**search_with_pattern(pattern,sentence=None,verseNum=None,chapterNum=None,threshold=1)**
- searches with a pattern of 0's and 1's and returns the matched words from the sentence, depending on the threshold. It takes the **pattern** to look for, an optional **sentence** (the sentence to search in), an optional **chapterNum** and an optional **verseNum**, and returns a list of matched words and sentences.
- Cases:
    1. if sentence is passed, alone or with other arguments, the search covers the sentence only.
    2. if sentence is not passed and both verseNum and chapterNum are passed, the search covers only that verse of that chapter.
    3. if neither sentence nor verseNum is passed and only chapterNum is passed, the search covers that specific chapter only.

* Note: the running time depends on the threshold and the size of the chapter, so searching the whole Quran is not supported because it would take a very long time (more than 11 min).
```python
result = pq.search_with_pattern(pattern="01111",chapterNum=1,threshold=0.9)
print(result)
>>>['الرَّحِيمِ مَلِكِ', 'نَعْبُدُ وَإِيَّاكَ', 'الْمُسْتَقِيمَ صِرَطَ']
```


================================================
FILE: documentation/docs/quran_tools.md
================================================
## Importing PyQuran

Note that PyQuran is imported by a **lowercase name**.

```python
import pyquran as q
```

- Quran retrieving tools are in `q.quran`.

### get_sura

```python
get_sura(sura_number, with_tashkeel=False, basmalah=False)
```

returns a sura as a list of verses.

__Args__

- __sura_number__: 1 <= Integer <= 114, the ordered number of sura in Mushaf.
- __with_tashkeel__: Boolean, if true return sura with tashkeel else return without.
- __basmalah__: Boolean, adding basmalah as aya.

__Returns__

- __[str]__: a list of the sura's ayat.

__Note__

Indexing starts at zero, so if the order number of an aya is x, it is at index (x-1) in the returned list.

__Example__

```python
q.quran.get_sura(108, with_tashkeel=True)
>>> ['إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ', 'فَصَلِّ لِرَبِّكَ وَانْحَرْ', 'إِنَّ شَانِئَكَ هُوَ الْأَبْتَرُ']
```

----

### get_verse

```python
get_verse(sura_number, verse_number, with_tashkeel=False)
```

gets a specific verse from a specific chapter.

__Args__

- __sura_number__: 1 <= Integer <= 114, the ordered number of sura in Mushaf.
- __verse_number__: Integer > 0, number of verse.
- __with_tashkeel__: Boolean, if true return the verse with tashkeel, else return it without.

__Returns__

- __str__: a verse.

__Example__

```python
q.quran.get_verse(sura_number=1, verse_number=2)
>>> 'الحمد لله رب العلمين'
```

----

### get_sura_number

```python
get_sura_number(sura_name)
```

__Args__

- __sura_name__: String, the sura name.

__Returns__

- __int__: the number of the sura named sura_name.

__Note__

Do not forget that the index of the returned list starts at zero. So if the order number of a sura is x, it is at (x-1) in the list.

__Example__

```python
q.quran.get_sura_number('الملك')
>>> 67
```

----

### get_sura_name

```python
get_sura_name(sura_number=None)
```

Returns the name of `sura_number`.

If `sura_number=None`, a list of all sura names is returned.

__Args__

- __sura_number__: Optional, 1 <= Integer <= 114, the ordered number of sura in Mushaf.

__Returns__

- __str__: the name of the sura whose number is sura_number.
- __[str]__: a list of all sura names (if the sura_number parameter is None).
__Example__ ```python q.quran.get_sura_name(2) >>> 'البقرة' ``` ================================================ FILE: documentation/docs/quran_tools_template.md ================================================ ## Importing PyQuran ```python import pyquran as q ``` {{autogenerated}} ================================================ FILE: documentation/generate.sh ================================================ #!/bin/bash # Overwite files_template.md > files.md cat templates/analysis_tools_template.md > docs/analysis_tools.md cat templates/arabic_tools_template.md > docs/arabic_tools.md cat templates/quran_tools_template.md > docs/quran_tools.md cat docs/index.md > ../README.md # For the repo; Readme cat docs/index.md > ../../README.md # for the PyPI Readme cat docs/CONTRIBUTING.md > ../CONTRIBUTING.md # Generate docs ./auto_gen_docs.py ================================================ FILE: documentation/git-adding.sh ================================================ git add docs/* git add ../CONTRIBUTING.md git add ../README.md ================================================ FILE: documentation/mkdocs.yml ================================================ site_name: PyQuran theme: readthedocs docs_dir: docs repo_url: https://github.com/hci-lab/pyquran-private # Documentation Layout pages: - Home: 'index.md' - Users Documentation: # Generated: run ./generate to update those files; according to the docs # changed inside the code - Quran Retrieving tools: 'quran_tools.md' - Arabic tools: 'arabic_tools.md' - Analysis tools: 'analysis_tools.md' - Maintainers: - Getting Started: 'CONTRIBUTING.md' - Package Strucutre: 'Wiki-Home.md' - Quran Corpus: 'Filtering-Special-Recitation-Symbols.md' - Code Conventios: 'code_conventions.md' - Related Projects: - Dictionary of Quran Words Frequency: 'dictFrec.md' - Authors: 'authors.md' ================================================ FILE: documentation/sources/analysis_tools_template.md 
================================================
{{autogenerated}}


================================================
FILE: documentation/sources/arabic_tools_template.md
================================================
## Alphabets
We use [PyArabic](https://pypi.python.org/pypi/PyArabic/0.6.2) constants, which represent letters, instead of writing Arabic in the code.
```python
hamza            = u'\u0621'
alef_mad         = u'\u0622'
alef_hamza_above = u'\u0623'
waw_hamza        = u'\u0624'
alef_hamza_below = u'\u0625'
yeh_hamza        = u'\u0626'
alef             = u'\u0627'
beh              = u'\u0628'
teh_marbuta      = u'\u0629'
teh              = u'\u062a'
theh             = u'\u062b'
jeem             = u'\u062c'
hah              = u'\u062d'
khah             = u'\u062e'
dal              = u'\u062f'
thal             = u'\u0630'
reh              = u'\u0631'
zain             = u'\u0632'
seen             = u'\u0633'
sheen            = u'\u0634'
sad              = u'\u0635'
dad              = u'\u0636'
tah              = u'\u0637'
zah              = u'\u0638'
ain              = u'\u0639'
ghain            = u'\u063a'
feh              = u'\u0641'
qaf              = u'\u0642'
kaf              = u'\u0643'
lam              = u'\u0644'
meem             = u'\u0645'
noon             = u'\u0646'
heh              = u'\u0647'
waw              = u'\u0648'
alef_maksura     = u'\u0649'
yeh              = u'\u064a'
madda_above      = u'\u0653'
hamza_above      = u'\u0654'
hamza_below      = u'\u0655'
alef_wasl        = u'\u0671'
```

## Alphabetical Systems (Definitions)

[**Rasm**](https://en.wikipedia.org/wiki/Rasm): any set of letters that are written in the same form; that is, they are indistinguishable in writing but are distinguished by context. For example, the letters ت ث ن ى can all be written with a single rasm, ىـ, without dots.

**Alphabetical System**: a set of rasm, constructed dynamically by specifying the letters that you want to treat as one rasm. The default Arabic alphabet is a special case of the **Alphabetical System** in which each letter is its own rasm.

**Predefined systems** are stored in the `systems` object.

1. **Default**: each letter is treated as a unique rasm.
2. **Without Dots**: when the dots are removed, some letters become indistinguishable; those letters are treated as one rasm.
   The following example shows the (Without Dots) system as a list of lists, where each sublist contains the letters that share the same rasm.
3. **Hamazat**: treats every letter accompanied by a hamza ء as one rasm.

**NOTE**: You may go further and construct your own system by specifying which letters you want to treat as one rasm; then you can do some statistical analysis such as count, variance, average, ...

Example:
```python
q.systems.withoutDots
Out:
[['ب', 'ت', 'ث', 'ن'],  # Rasm 1
 ['ح', 'خ', 'ج'],       # Rasm 2
 ['د', 'ذ'],            # Rasm 3
 ['ر', 'ز'],            # Rasm 4
 ['س', 'ش'],            # Rasm 5
 ['ص', 'ض'],            # Rasm 6
 ['ط', 'ظ'],            # Rasm 7
 ['ع', 'غ'],            # Rasm 8
 ['ف', 'ق']]            # Rasm 9
```

### Constructing a user-defined system:
```python
system = [[alef_hamza_above, alef],
          [beh, teh]]
```
The previous piece of code means "treat *alef_hamza_above* and *alef* as one letter, and treat *beh* and *teh* as one letter as well".
The rest of the letters can be dynamically constructed using `check_system()`.

A system can then be applied to some text analysis functions such as counting, filtering, etc.

{{autogenerated}}


================================================
FILE: documentation/sources/quran_tools_template.md
================================================
## Importing PyQuran

```python
import PyQuran as q
```

- Quran retrieving tools are in `q.quran`.

{{autogenerated}}


================================================
FILE: documentation/templates/analysis_tools_template.md
================================================
{{autogenerated}}


================================================
FILE: documentation/templates/arabic_tools_template.md
================================================
## Alphabets
We use [PyArabic](https://pypi.python.org/pypi/PyArabic/0.6.2) constants, which represent letters, instead of writing Arabic in the code.
```python
hamza            = u'\u0621'
alef_mad         = u'\u0622'
alef_hamza_above = u'\u0623'
waw_hamza        = u'\u0624'
alef_hamza_below = u'\u0625'
yeh_hamza        = u'\u0626'
alef             = u'\u0627'
beh              = u'\u0628'
teh_marbuta      = u'\u0629'
teh              = u'\u062a'
theh             = u'\u062b'
jeem             = u'\u062c'
hah              = u'\u062d'
khah             = u'\u062e'
dal              = u'\u062f'
thal             = u'\u0630'
reh              = u'\u0631'
zain             = u'\u0632'
seen             = u'\u0633'
sheen            = u'\u0634'
sad              = u'\u0635'
dad              = u'\u0636'
tah              = u'\u0637'
zah              = u'\u0638'
ain              = u'\u0639'
ghain            = u'\u063a'
feh              = u'\u0641'
qaf              = u'\u0642'
kaf              = u'\u0643'
lam              = u'\u0644'
meem             = u'\u0645'
noon             = u'\u0646'
heh              = u'\u0647'
waw              = u'\u0648'
alef_maksura     = u'\u0649'
yeh              = u'\u064a'
madda_above      = u'\u0653'
hamza_above      = u'\u0654'
hamza_below      = u'\u0655'
alef_wasl        = u'\u0671'
```

## Alphabetical Systems (Definitions)

[**Rasm**](https://en.wikipedia.org/wiki/Rasm): any set of letters that are written in the same form; that is, they are indistinguishable in writing but are distinguished by context. For example, the letters ت ث ن ى can all be written with a single rasm, ىـ, without dots.

**Alphabetical System**: a set of rasm, constructed dynamically by specifying the letters that you want to treat as one rasm. The default Arabic alphabet is a special case of the **Alphabetical System** in which each letter is its own rasm.

**Predefined systems** are stored in the `systems` object.

1. **Default**: each letter is treated as a unique rasm.
2. **Without Dots**: when the dots are removed, some letters become indistinguishable; those letters are treated as one rasm.
   The following example shows the (Without Dots) system as a list of lists, where each sublist contains the letters that share the same rasm.
3. **Hamazat**: treats every letter accompanied by a hamza ء as one rasm.

**NOTE**: You may go further and construct your own system by specifying which letters you want to treat as one rasm; then you can do some statistical analysis such as count, variance, average, ...
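
As a rough illustration of the NOTE above, the following plain-Python sketch counts occurrences per rasm for a user-defined system. It is not PyQuran's implementation: `count_by_rasm` is a hypothetical helper introduced only for this example, and the letters are spelled as Unicode escapes with the same code points as the constants listed earlier.

```python
from collections import Counter

# Same code points as the PyArabic constants listed above.
beh, teh, theh = '\u0628', '\u062a', '\u062b'
jeem, hah, khah = '\u062c', '\u062d', '\u062e'


def count_by_rasm(text, system):
    """Count how often each rasm (group of letters) occurs in `text`.

    `system` is a list of lists; letters in the same sublist are treated
    as one rasm. Letters not covered by `system` are ignored here.
    Hypothetical helper, for illustration only.
    """
    letter_counts = Counter(text)
    return [sum(letter_counts[letter] for letter in rasm) for rasm in system]


system = [[beh, teh, theh], [jeem, hah, khah]]
text = '\u0628\u062a\u062c \u062b\u062d'   # beh teh jeem, space, theh hah
print(count_by_rasm(text, system))          # [3, 2]
```

Once the counts are plain numbers per rasm, statistics such as the variance or average mentioned in the NOTE can be computed on them directly.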

Example:
```python
q.systems.withoutDots
Out:
[['ب', 'ت', 'ث', 'ن'],  # Rasm 1
 ['ح', 'خ', 'ج'],       # Rasm 2
 ['د', 'ذ'],            # Rasm 3
 ['ر', 'ز'],            # Rasm 4
 ['س', 'ش'],            # Rasm 5
 ['ص', 'ض'],            # Rasm 6
 ['ط', 'ظ'],            # Rasm 7
 ['ع', 'غ'],            # Rasm 8
 ['ف', 'ق']]            # Rasm 9
```

### Constructing a user-defined system:
```python
system = [[alef_hamza_above, alef],
          [beh, teh]]
```
The previous piece of code means "treat *alef_hamza_above* and *alef* as one letter, and treat *beh* and *teh* as one letter as well".
The rest of the letters can be dynamically constructed using `check_system()`.

A system can then be applied to some text analysis functions such as counting, filtering, etc.

{{autogenerated}}


================================================
FILE: documentation/templates/quran_tools_template.md
================================================
## Importing PyQuran

```python
import PyQuran as q
```

- Quran retrieving tools are in `q.quran`.

{{autogenerated}}


================================================
FILE: testing/run_test.sh
================================================
#!/bin/bash
# a shell script to test `PyQuran` comprehensively
#
# Usage:
#   $ ./run_test.sh
#
# ToDo:
#   * Array of file names
#   * loop to run them
#   * add command line arguments to test a single module.


python3 -B test_quran.py
python3 -B test_searchHelper.py
python3 -B test_pyquran.py


================================================
FILE: testing/test_pyquran.py
================================================
"""unittest module for pyquran.py
"""
import unittest
import numpy as np
# Adding another searching path
from sys import path
import os

# The current path of the current module.
path_current_module = os.path.dirname(os.path.abspath(__file__)) tools_modules = '../tools/' core_modules = '../core/' tools_path = os.path.join(path_current_module, tools_modules) core_path = os.path.join(path_current_module, core_modules) path.append(tools_path) path.append(core_path) from arabic import * import quran import pyquran class Testing_pyquran(unittest.TestCase): def test_search_string_with_tashkeel(self): sentence = 'ﺺِﻓْ ﺫَﺍْ ﺚَﻧَﺍْ ﻚَﻣْ ﺝَﺍْﺩَ ﺶَﺨْﺻٌ' x = pyquran.search_string_with_tashkeel(sentence, fatha + sukun) y = [(3, 5), (7, 9), (10, 12), (13, 15), (17, 19)] self.assertEqual(x, y) def test_get_tashkeel_binary(self): binaryPatternY = '0010101' subAyah = 'الْأَحْيَاءُ' binaryPatternX = pyquran.get_tashkeel_binary(subAyah)[0] self.assertEqual(binaryPatternX,binaryPatternY) binaryPatternY = '1010 101011 001011' subAyah = 'إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ' binaryPatternX = pyquran.get_tashkeel_binary(subAyah)[0] self.assertEqual(binaryPatternX,binaryPatternY) binaryPatternY = '101 00011 0001011 0001101' subAyah = 'بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ' binaryPatternX = pyquran.get_tashkeel_binary(subAyah)[0] self.assertEqual(binaryPatternX,binaryPatternY) binaryPatternY = '11011 1011 10 10 00011101 110 10 00101 00111 0010101 001101 001101' subAyah = ' يُسَبِّحُ لِلَّهِ مَا فِي السَّمَوَاتِ وَمَا فِي الْأَرْضِ الْمَلِكِ الْقُدُّوسِ الْعَزِيزِ الْحَكِيمِ' binaryPatternX = pyquran.get_tashkeel_binary(subAyah)[0] self.assertEqual(binaryPatternX,binaryPatternY) def test_get_frequency(self): ver_w_taskeel = quran.get_verse(1,1,with_tashkeel=True) fre_dec = {'الرَّحِيمِ': 1, 'الرَّحْمَنِ': 1, 'اللَّهِ': 1, 'بِسْمِ': 1} self.assertEqual(pyquran.get_frequency(ver_w_taskeel),fre_dec) fre_dec={'أُنزِلَ': 2, 'إِلَيْكَ': 1, 'بِمَا': 1, 'قَبْلِكَ': 1, 'مِن': 1, 'هُمْ': 1, 'وَالَّذِينَ': 1, 'وَبِالْءَاخِرَةِ': 1, 'وَمَا': 1, 'يُؤْمِنُونَ': 1, 'يُوقِنُونَ': 1} freq = pyquran.get_frequency(quran.get_verse(2,4,with_tashkeel=True)) self.assertEqual(freq,fre_dec) def 
test_generate_frequency_dictionary(self): fre_dec = {'أحد': 2, 'الصمد': 1, 'الله': 2, 'قل': 1, 'كفوا': 1, 'لم': 1, 'له': 1, 'هو': 1, 'ولم': 2, 'يكن': 1, 'يلد': 1, 'يولد': 1} sura = pyquran.generate_frequency_dictionary(suraNumber=112) self.assertEqual(sura,fre_dec) def test_check_sura_with_frequency(self): freq = pyquran.generate_frequency_dictionary(suraNumber=2) self.assertEqual(pyquran.check_sura_with_frequency(2,freq),True) freq = pyquran.generate_frequency_dictionary(suraNumber=95) self.assertEqual(pyquran.check_sura_with_frequency(95,freq),True) def test_sort_dictionary_by_similarity(self): freq = pyquran.generate_frequency_dictionary(suraNumber=113) fre_dec = {'أعوذ': 1, 'إذا': 2, 'العقد': 1, 'الفلق': 1, 'النفثت': 1, 'برب': 1, 'حاسد': 1, 'حسد': 1, 'خلق': 1, 'شر': 4, 'غاسق': 1, 'فى': 1, 'قل': 1, 'ما': 1, 'من': 1, 'وقب': 1, 'ومن': 3} self.assertEqual(pyquran.sort_dictionary_by_similarity(freq),fre_dec) freq = pyquran.generate_frequency_dictionary(suraNumber=112) fre_dec={'الله': 2, 'ولم': 2, 'قل': 1, 'هو': 1, 'الصمد': 1, 'لم': 1, 'يلد': 1, 'يولد': 1, 'له': 1, 'كفوا': 1, 'أحد': 2, 'يكن': 1} self.assertEqual(pyquran.sort_dictionary_by_similarity(freq,threshold=0.2),fre_dec) fre_dec={'ولم': 2, 'الصمد': 1, 'لم': 1, 'يولد': 1, 'الله': 2, 'له': 1, 'أحد': 2, 'قل': 1, 'هو': 1, 'يلد': 1, 'يكن': 1, 'كفوا': 1} self.assertEqual(pyquran.sort_dictionary_by_similarity(freq,threshold=0.45),fre_dec) def test_frequency_of_character(self): ver_w_taskeel = quran.get_verse(1,1,with_tashkeel=True) self.assertEqual(pyquran.frequency_of_character(['ا','ض',"بً"],with_tashkeel=False),{'ا': 38667, 'ض': 1686, 'بً': 0}) self.assertEqual(pyquran.frequency_of_character(['ا','ض',"بً"],with_tashkeel=True),{'ا': 38667, 'ض': 1686, 'بً': 218}) self.assertEqual(pyquran.frequency_of_character(['ا','ض',"بً"],verseNum=1,with_tashkeel=True),{'ا': 426, 'ض': 18, 'بً': 2}) self.assertEqual(pyquran.frequency_of_character(['ا','ض',"بً"],verseNum=4,chapterNum=12,with_tashkeel=True),{'ا': 4, 'ض': 0, 'بً': 
1}) self.assertEqual(pyquran.frequency_of_character(['ا','ض',"بً"],verse=ver_w_taskeel),{'ا': 3, 'ض': 0, 'بً': 0}) def test_get_token(self): self.assertEqual(pyquran.get_token(4,1,1),'الرحيم') self.assertEqual(pyquran.get_token(5,1,1),'') self.assertEqual(pyquran.get_token(20,0,5),'') with self.assertRaises(ValueError): pyquran.get_token(20,0,-5) self.assertEqual(pyquran.get_token(95,1,5),'') self.assertEqual(pyquran.get_token(4,1,1,with_tashkeel=True),'الرَّحِيمِ') def test_search_sequence(self): result=pyquran.search_sequence(['بِسْمِ اللَّهِ','الرحمن'],verseNum=1,chapterNum=1) real={'الرحمن': [('الرَّحْمَنِ', 3, 1, 1)], 'بسم الله': [('بِسْمِ اللَّهِ', 0, 1, 1)]} self.assertEqual(result,real) result=pyquran.search_sequence(['بِسْمِ اللَّهِ','الرحمن'],verseNum=1,chapterNum=1,mode=1) real={'الرحمن': [], 'بِسْمِ اللَّهِ': [('بِسْمِ اللَّهِ', 0, 1, 1)]} self.assertEqual(result,real) def test_search_with_pattern(self): result = pyquran.search_with_pattern(pattern="01101011000101",chapterNum=2) real=['ءَامِنُوا كَمَا ءَامَنَ النَّاسُ', 'وَلَتَجِدَنَّهُمْ أَحْرَصَ النَّاسِ', 'بِالْمَعْرُوفِ حَقًّا عَلَى الْمُتَّقِينَ', 'بِالْمَعْرُوفِ حَقًّا عَلَى الْمُحْسِنِينَ', 'لِلتَّقْوَى وَلَا تَنسَوُا الْفَضْلَ'] self.assertEqual(result,real) result=pyquran.search_with_pattern(pattern="0110101100111010101",chapterNum=2) self.assertEqual(result,[]) result = pyquran.search_with_pattern(pattern="01111",chapterNum=1) real = ['الرَّحِيمِ مَلِكِ', 'نَعْبُدُ وَإِيَّاكَ', 'الْمُسْتَقِيمَ صِرَطَ'] self.assertEqual(result,real) try: pyquran.search_with_pattern(pattern="01111") result=True except: result=False self.assertEqual(result,False) result=pyquran.search_with_pattern(pattern="01111",chapterNum=1,threshold=0.9) real=['الرَّحِيمِ مَلِكِ', 'نَعْبُدُ وَإِيَّاكَ', 'الْمُسْتَقِيمَ صِرَطَ'] self.assertEqual(result,real) def test_count_rasm(self): # test case 1: small surah with system system = [[beh, teh, theh], [jeem, hah, khah]] returnedNParray = pyquran.count_rasm(quran.get_sura(110), 
system) expectedFROW = [1, 0, 0, 0, 0, 1, 0, 4, 1, 0, 2, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 1, 1, 1, 0, 0] self.assertEqual(returnedNParray.shape, (3, 33)) self.assertEqual(list(returnedNParray[0]), expectedFROW) # Shuffle a subsystem "same result expected" system = [[theh, beh, teh], [jeem, hah, khah]] returnedNParray = pyquran.count_rasm(quran.get_sura(110), system) self.assertEqual(returnedNParray.shape, (3, 33)) self.assertEqual(list(returnedNParray[0]), expectedFROW) #Shuffle system "same result expected" system = [[jeem, hah, khah], [theh, beh, teh]] returnedNParray = pyquran.count_rasm(quran.get_sura(110), system) self.assertEqual(returnedNParray.shape, (3, 33)) self.assertEqual(list(returnedNParray[0]), expectedFROW) system = [[hah, jeem, khah], [theh, teh, beh]] returnedNParray = pyquran.count_rasm(quran.get_sura(110), system) self.assertEqual(returnedNParray.shape, (3, 33)) self.assertEqual(list(returnedNParray[0]), expectedFROW) #build a very strange system :"D system = [[jeem, alef_hamza_above, waw, ghain], [meem, sheen, teh_marbuta, zah], [lam, alef_maksura, dal]] returnedNParray = pyquran.count_rasm(quran.get_sura(110), system) expectedFROW = [1, 0, 0, 2, 0, 1, 0, 4, 0, 0, 1, 0, 1, 0, 3, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0] self.assertEqual(returnedNParray.shape, (3, 29)) self.assertEqual(list(returnedNParray[0]), expectedFROW) # test case 2: big surah with system system = [[beh, teh, theh], [jeem, hah, khah]] returnedNParray = pyquran.count_rasm(quran.get_sura(2), system) self.assertEqual(returnedNParray.shape, (286, 33)) # test case 3: without system returnedNParray = pyquran.count_rasm(quran.get_sura(110)) expectedFROW = [1, 0, 0, 0, 0, 1, 0, 4, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 1, 1, 1, 0, 0] self.assertEqual(returnedNParray.shape, (3, 37)) self.assertEqual(list(returnedNParray[0]), expectedFROW) # Test case 4: repeat a char in two subsystems system = [[beh, teh, theh], [jeem, hah, khah, 
beh]] self.assertRaises(ValueError, pyquran.count_rasm, quran.get_sura(110), system) # Test case 5: path a system (as a list not list of lists) self.assertRaises(ValueError, pyquran.count_rasm, quran.get_sura(110), [beh, teh, theh]) self.assertRaises(ValueError, pyquran.count_rasm, quran.get_sura(110), [[beh, teh, theh], hah]) def test_check_system(self): system = [[beh, teh, theh], [jeem, hah, khah]] actualList = pyquran.check_system(system) self.assertEqual(len(actualList), 33) indx = list(alphabet).index(beh) self.assertEqual(actualList[indx], [beh, teh, theh]) indx = list(alphabet).index(jeem) # subtract 2 because teh and theh count as beh(all of them equal 7) self.assertEqual(actualList[indx-2], [jeem, hah, khah]) def test_buckwalter_transliteration(self): # test case 1:"from arabic without tashkeel to buckwalter " self.assertEqual(pyquran.buckwalter_transliteration("مرحبا"), "mrHbA") # test case 2:"from arabic with tashkeel to buckwalter " arabicText = "يُولَدُ جَمِيعُ ٱلنّاسِ أَحْرَارًا مُتَسَاوِينَ فِي ٱلْكَرَامَةِ وَٱلْحُقُوقِ. وَقَدْ وُهِبُوا عَقْلًا وَضَمِيرًا وَعَلَيْهِمْ أَنْ يُعَامِلَ بَعْضُهُمْ بَعْضًا بِرُوحِ ٱلْإِخَاءِ" expectedTransliteration = "yuwladu jamiyEu {ln~Asi >aHoraArFA mutasaAwiyna fiy {lokaraAmapi wa{loHuquwqi. waqado wuhibuwA EaqolFA waDamiyrFA waEalayohimo >ano yuEaAmila baEoDuhumo baEoDFA biruwHi {lo alphabet without taskell then alphabet with fatha, ... ''' alphabet = [] + arabic.alphabet alphabet += ' ' arabic_alphabet_tashkeel = [''] + alphabet + arabic_alphabet_tashkeel return arabic_alphabet_tashkeel def one_hot(string, padding_length=0): ''' * Optimized for memory use. 
* encodes each letter in string with ont-hot vector * returns a list of one-hot vectors a list of (1*182) vectors * letter -> 1*182 vector ''' cleanedString = factor_shadda_tanwin(string) charCleanedString = separate_token_with_dicrites(cleanedString) # Initializing a Matrix encodedString = np.zeros( (padding_length, len(lettersTashkeelCombination)) ) letter = 0 for char in charCleanedString: one_index = lettersTashkeelCombination.index(char) # * add 1 for the current letter in one_index encodedString[letter][one_index] = 1 letter +=1 return encodedString ================================================ FILE: tools/__init__.py ================================================ # Adding another searching path from sys import path import os # The current path of the current module. path_current_module = os.path.dirname(os.path.abspath(__file__)) path.append(path_current_module) ================================================ FILE: tools/arabic.py ================================================ """This module contains Arabic tools for text analysis """ # Umar; remove this to quran and correct the spelling to `suar_num` swar_num = 114 # letters. 
hamza = u'\u0621' hamza_above = u'\u0654' # alef_mad = u'\u0622' alef_hamza_above = u'\u0623' waw_hamza = u'\u0624' alef_hamza_below = u'\u0625' yeh_hamza = u'\u0626' alef = u'\u0627' beh = u'\u0628' teh_marbuta = u'\u0629' teh = u'\u062a' theh = u'\u062b' jeem = u'\u062c' hah = u'\u062d' khah = u'\u062e' dal = u'\u062f' thal = u'\u0630' reh = u'\u0631' zain = u'\u0632' seen = u'\u0633' sheen = u'\u0634' sad = u'\u0635' dad = u'\u0636' tah = u'\u0637' zah = u'\u0638' ain = u'\u0639' ghain = u'\u063a' feh = u'\u0641' qaf = u'\u0642' kaf = u'\u0643' lam = u'\u0644' meem = u'\u0645' noon = u'\u0646' heh = u'\u0647' waw = u'\u0648' alef_maksura = u'\u0649' yeh = u'\u064a' madda_above = u'\u0653' hamza_above = u'\u0654' hamza_below = u'\u0655' alef_wasl = u'\u0671' tatweel = u'\u0640' # diacritics fathatan = u'\u064b' dammatan = u'\u064c' kasratan = u'\u064d' fatha = u'\u064e' damma = u'\u064f' kasra = u'\u0650' shadda = u'\u0651' sukun = u'\u0652' # small letters small_alef = u"\u0670" small_waw = u"\u06e5" small_yeh = u"\u06e6" #ligatures lam_alef = u'\ufefb' lam_alef_hamza_above = u'\ufef7' lam_alef_hamza_below = u'\ufef9' lam_alef_mad_above = u'\ufef5' simple_lam_alef = u'\u0644\u0627' simple_lam_alef_hamza_above = u'\u0644\u0623' simple_lam_alef_hamza_below = u'\u0644\u0625' simple_lam_alef_mad_above = u'\u0644\u0622' # Lists alphabet = [ hamza, hamza_above, alef_mad, alef_hamza_above, waw_hamza, alef_hamza_below, yeh_hamza, alef, beh, teh_marbuta, teh, theh, jeem, hah, khah, dal, thal, reh, zain, seen, sheen, sad, dad, tah, zah, ain, ghain, feh, qaf, kaf, lam, meem, noon, heh, waw, alef_maksura, yeh ] tashkeel = [fathatan, dammatan, kasratan, fatha, damma, kasra, sukun, shadda] harakat = [fathatan, dammatan, kasratan, fatha, damma, kasra, sukun] shortharakat = [ fatha, damma, kasra, sukun] shortharakatWithShadda = [ fatha, damma, kasra, sukun, shadda] tanwin = [fathatan, dammatan, kasratan] not_def_haraka = tatweel lamAlefLike = [ lam_alef, lam_alef_hamza_above, 
lam_alef_hamza_below, lam_alef_mad_above, ] hamzat = [ hamza, waw_hamza, yeh_hamza, hamza_above, alef_hamza_below, alef_hamza_above, alef_mad ] alefat = [ alef, alef_mad, alef_hamza_above, alef_hamza_below, alef_wasl, alef_maksura, small_alef, ] # wihtout dots. Groups behLike = [beh, teh, theh, noon] jeemLike = [hah, khah, jeem] dalLike = [dal, thal] rehLike = [reh, zain] seenLike = [seen, sheen] sadLike = [sad, dad] tahLike = [tah, zah] ainLike = [ain, ghain] fehLike = [feh, qaf] weak = [ alef, waw, yeh, alef_maksura] yehlike = [ yeh, yeh_hamza, alef_maksura, small_yeh ] wawLike = [ waw, waw_hamza, small_waw ] tehLike = [ teh, teh_marbuta ] small = [ small_alef, small_waw, small_yeh] moon_letters = [hamza , alef_mad , alef_hamza_above , alef_hamza_below , alef , beh , jeem , hah , khah , ain , ghain , feh , qaf , kaf , meem , heh , waw , yeh ] sun_letters = [ teh , theh , dal , thal , reh , zain , seen , sheen , sad , dad , tah , zah , lam , noon , ] # Systems class Systems: '''A container of systems. ''' def __init__(self): # self.withoutDots = [behLike, jeemLike, dalLike, rehLike, seenLike, sadLike, tahLike, ainLike, fehLike] # self.hamazat = [hamzat] # self.default = alphabet # END CLASS # Exporting object systems = Systems() """ * Some alphabet building tools """ def alphabet_excluding(excludedLetters): """returns the alphabet excluding `excludedLetters`. Args: excludedLetters: list[Char], letters to be excluded from the alphabet. Returns: str: alphabet excluding `excludedLetters`. 
Example: ```python q.alphabet_excluding([q.alef, q.beh, q.qaf, q.teh, q.dal, q.yeh, q.alef_mad]) >>> ['ء', 'ٔ', 'أ', 'ؤ', 'إ', 'ئ', 'ة', 'ث', 'ج', 'ح', 'خ', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ى'] ``` """ return [x for x in alphabet if x not in excludedLetters] def treat_as_the_same(listOfLetter, letter, text): """convert any letter in the `listOfLetter` to `letter` in the given text Args: listOfLetter (['chars'] or str) letter (char) text (str) Returns: str: a text after changing all the `listOfLetter` to that char `letter` Example: print(treat_as_the_same([alef_hamza_above], alef, line)) print(treat_as_the_same([ain], qaf, line)) """ pass def strip_tashkeel(string): """convert any letter in the `listOfLetter` to `letter` in the given text Args: string: str, to drop tashkeel from. Example: ```python x = q.quran.get_verse(12, 2, with_tashkeel=True) x >>> 'إِنَّا أَنزَلْنَهُ قُرْءَنًا عَرَبِيًّا لَّعَلَّكُمْ تَعْقِلُونَ' q.strip_tashkeel(x) >>> 'إنا أنزلنه قرءنا عربيا لعلكم تعقلون' ``` """ for char in string: if char in tashkeel: string = string.replace(char, '') return string def factor_shadda_tanwin(string): ''' * factors shadda to letter with sukun and letter * factors tanwin to ????????? # Some redundancy is simpler. 
:"D ''' factoredString = '' charsList = separate_token_with_dicrites(string) # print(charsList) for char in charsList: if len(char) < 2: factoredString += char if len(char) == 2: if char[1] in arabic.shortharakat: factoredString += char elif char[1] == arabic.dammatan: if char[0] == arabic.teh_marbuta: factoredString += arabic.teh + arabic.damma + \ arabic.noon + arabic.sukun else: # the letter factoredString += char[0] + arabic.damma + \ arabic.noon + arabic.sukun elif char[1] == arabic.kasratan: if char[0] == arabic.teh_marbuta: factoredString += char[0] + arabic.teh + \ arabic.kasra + arabic.noon + arabic.sukun else: # the letter factoredString += char[0] + arabic.kasra \ + arabic.noon + arabic.sukun elif char[1] == arabic.fathatan: if char[0] == arabic.alef: factoredString += arabic.noon + arabic.sukun elif char[0] == arabic.teh_marbuta: factoredString += arabic.teh + arabic.fatha \ + arabic.noon + arabic.sukun elif char[1] == arabic.shadda: factoredString += char[0] + arabic.sukun + char[0] if len(char) == 3: factoredString += char[0] + arabic.sukun + char[0] + char[2] return factoredString ''' print(factor_shadda_tanwin('بيتٌ')) print(factor_shadda_tanwin('ولدٍ')) print(factor_shadda_tanwin('ولدَاً')) print(factor_shadda_tanwin('مدرسةً')) print(factor_shadda_tanwin('مدرسةٍ')) print(factor_shadda_tanwin('مدرسةٌ')) print(factor_shadda_tanwin('شبّ')) print(factor_shadda_tanwin('كبَّ')) ''' ''' # Testing for i in factor_shadda_tanwin('أَشَّدونٌ'): print(i) ''' ================================================ FILE: tools/buckwalter.py ================================================ ''' Declare a dictionary with Buckwalter's ASCII symbols as the keys, and their unicode equivalents as values. 
'''

buck2uni = {
    "'": u"\u0621",  # hamza-on-the-line
    "|": u"\u0622",  # madda
    ">": u"\u0623",  # hamza-on-'alif
    "&": u"\u0624",  # hamza-on-waaw
    "<": u"\u0625",  # hamza-under-'alif
    "}": u"\u0626",  # hamza-on-yaa'
    "A": u"\u0627",  # bare 'alif
    "b": u"\u0628",  # baa'
    "p": u"\u0629",  # taa' marbuuTa
    "t": u"\u062A",  # taa'
    "v": u"\u062B",  # thaa'
    "j": u"\u062C",  # jiim
    "H": u"\u062D",  # Haa'
    "x": u"\u062E",  # khaa'
    "d": u"\u062F",  # daal
    "*": u"\u0630",  # dhaal
    "r": u"\u0631",  # raa'
    "z": u"\u0632",  # zaay
    "s": u"\u0633",  # siin
    "$": u"\u0634",  # shiin
    "S": u"\u0635",  # Saad
    "D": u"\u0636",  # Daad
    "T": u"\u0637",  # Taa'
    "Z": u"\u0638",  # Zaa' (DHaa')
    "E": u"\u0639",  # cayn
    "g": u"\u063A",  # ghayn
    "_": u"\u0640",  # taTwiil
    "f": u"\u0641",  # faa'
    "q": u"\u0642",  # qaaf
    "k": u"\u0643",  # kaaf
    "l": u"\u0644",  # laam
    "m": u"\u0645",  # miim
    "n": u"\u0646",  # nuun
    "h": u"\u0647",  # haa'
    "w": u"\u0648",  # waaw
    "Y": u"\u0649",  # 'alif maqSuura
    "y": u"\u064A",  # yaa'
    "F": u"\u064B",  # fatHatayn
    "N": u"\u064C",  # Dammatayn
    "K": u"\u064D",  # kasratayn
    "a": u"\u064E",  # fatHa
    "u": u"\u064F",  # Damma
    "i": u"\u0650",  # kasra
    "~": u"\u0651",  # shaddah
    "o": u"\u0652",  # sukuun
    "`": u"\u0670",  # dagger 'alif
    "{": u"\u0671",  # waSla
}

================================================
FILE: tools/error.py
================================================
"""standard error module
"""

def is_int(number, message):
    if type(number) is not int:
        raise ValueError(message)


def is_bool(boolean, message):
    if type(boolean) is not bool:
        raise ValueError(message)


def is_string(string, message):
    if type(string) is not str:
        raise ValueError(message)

================================================
FILE: tools/filtering.py
================================================
'''Contains Uthmanic symbols and related functions.
reference: en.wikipedia.org/wiki/Arabic_script_in_Unicode ''' import arabic import error import re hamza_above = '\u0654' # u'\u0654' small_high_meem = '\u06e2' small_low_meem = '\u06ed' small_high_seen = '\u06dc' small_low_seen = '\u06e3' small_alef = '\u0670' small_waw = '\u06e5' small_yeh = '\u06e6' small_high_noon = '\u06e8' mad_lazim_mark = '\u0653' tatweel = '\u0640' alef_wasl_with_saad_above = '\u0671' empty_centre_high_stop = '\u06eb' small_high_rounded_zero = '\u06df' empty_center_low_stop = '\u06ea' small_high_upright_rectangular_zero = '\u06e0' rounded_high_stop_with_filled_centre = '\u06ec' recitationSymbols = [ alef_wasl_with_saad_above, # Replace with alef hamza_above, # Remain small_high_meem, # Remove small_low_meem, # Remove small_high_seen, # Remove small_low_seen, # Remove small_alef, # Remove small_waw, # Remove small_yeh, # Remove small_high_noon, # Remove mad_lazim_mark, # Remove tatweel, # Remove empty_centre_high_stop, # Remove small_high_rounded_zero, # Remove empty_center_low_stop, # Remove small_high_upright_rectangular_zero, # Remove rounded_high_stop_with_filled_centre # Remove ] 'my_user_name' ''' # Cannot fide hamza_above import tools import arabic x = tools.search_sequence([hamza_above]) print(x) quran = open('QuranCorpus/quran-uthmani.txt', 'r') quran = quran.read() #print(quran) print(len(quran)) print(hamza_above in quran) import re p = re.compile(quran) print(p.search(hamza_above)) print(p.findall(hamza_above)) ''' """ problems; * 'ء' is removed from AlNsaa 92 u'\u0621' * hamza_above = '\u0654' # u'\u0654' * 1:126 الأخر what is this hamza?! is it أ or alef + hamza above? 
In [1]: u'\u0621' Out[1]: 'ء' In [2]: '\u0654' Out[2]: 'ٔ' """ def get_patterns(): patterns = [] for x in [small_yeh, small_waw] : for y in arabic.shortharakat: patterns.append(x + y) return patterns + [small_yeh, small_waw] patterns_list = get_patterns() remove_no_tashkeel_after = [ small_high_meem, # Remove small_low_meem, # Remove small_high_seen, # Remove small_low_seen, # Remove small_alef, # Remove small_high_noon, # Remove mad_lazim_mark, # Remove tatweel, # Remove empty_centre_high_stop, # Remove small_high_rounded_zero, # Remove empty_center_low_stop, # Remove small_high_upright_rectangular_zero, # Remove rounded_high_stop_with_filled_centre # Remove ] def recitation_symbols_filter(string, symbols=recitationSymbols): '''Removes the Special Recitation Symbols from `string` Args: param1(str): a string to be filtered param2([char]: a list of recitation symbols Issues: * Some small litters have diacritics when they are removed their diacritics remains. * pyarabic strip_tashkeel -> revise it. ''' error.is_string(string, 'You must pass an string') for symbol in symbols: if symbol == alef_wasl_with_saad_above: string = string.replace(alef_wasl_with_saad_above, arabic.alef) # Do not remove hamza_above elif symbol == hamza_above: continue elif symbol in remove_no_tashkeel_after: string = string.replace(symbol, '') else: for pat in patterns_list: string = re.sub( pat , '', string) return string ''' for x in recitationSymbols : print("> " + x + '\n') ''' ================================================ FILE: tools/quran.py ================================================ """This modules contains functions to retrieve from quran. """ from xml.etree import ElementTree import arabic as ar import filtering import error import os # Relative path to this modul's location in PyQuran. corpus_xml_relative_path= '../QuranCorpus/quran-uthmani.xml' # The current path of the current module. 
current_path = os.path.dirname(os.path.abspath(__file__)) # Joining this module's path with the relative path of the corpus corpus_path = os.path.join(current_path, corpus_xml_relative_path) # Parsing xml quran_tree = ElementTree.parse(corpus_path) def get_sura(sura_number, with_tashkeel=False, basmalah=False): """returns a sura as a list of verses. Args: sura_number: 1 <= Integer <= 114, the ordered number of sura in Mushaf. with_tashkeel: Boolean, if true return sura with tashkeel else return without. basmalah: Boolean, adding basmalah as aya. Returns: [str]: a list of sura's ayat. Note: Index statrts at zero. So if the order number of an aya is x, then it's at (x-1) in the returned list. Example: ```python q.quran.get_sura(108, with_tashkeel=True)\n >>> ['إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ', 'فَصَلِّ لِرَبِّكَ وَانْحَرْ', 'إِنَّ شَانِئَكَ هُوَ الْأَبْتَرُ'] ``` """ message = "Sura number must be an integer between 1 to 114, inclusive." error.is_int(sura_number, message) message = "The second parameter must be bool, it an optional False by default" error.is_bool(with_tashkeel, message) sura_number -= 1 sura = [] suras_list = quran_tree.findall('sura') ayat = suras_list[sura_number] for aya in ayat: sura.append(aya.attrib['text']) if basmalah and sura_number != 1 -1 and sura_number != 9 -1: #suras_list[0][0].attrib['text'] bismilah = [suras_list[0][0].attrib['text']] sura = bismilah + sura uthmanic_free_sura = [] for aya in sura: uthmanic_free_sura.append(filtering.recitation_symbols_filter(aya)) if not with_tashkeel: return list(map(ar.strip_tashkeel, uthmanic_free_sura)) else: return uthmanic_free_sura def fetch_aya(sura_number, aya_number): """ Args: param1 (int): the ordered number of sura in The Mus'haf. param2 (int): the ordered number of aya in The Mus'haf. Returns: str: an aya as a string """ message = "Sura number must be an integer between 1 to 114, inclusive." error.is_int(sura_number, message) message = "Aya number is a positive integer." 
    # Validate the aya number (the sura number was validated above).
    error.is_int(aya_number, message)

    aya_number -= 1
    sura = get_sura(sura_number)

    if aya_number > len(sura) - 1:
        raise ValueError('Aya number must not exceed the number of ayat in sura.')

    return sura[aya_number]


def retrieve_qruan_as_one_string():
    quran_string = ''
    for i in range(1, 115):
        for aya in get_sura(i, with_tashkeel=True):
            quran_string += aya + ' '
    return quran_string


def get_sura_number(sura_name):
    """
    Args:
        sura_name (str): the name of the sura.

    Returns:
        int: the number of the sura named `sura_name`,
             or None if no sura has that name.

    Example:
    ```python
    q.quran.get_sura_number('الملك')\n
    >>> 67
    ```
    """
    suras_list = quran_tree.findall('sura')
    suraNumber = None
    for index in range(1, 115):
        if suras_list[index-1].attrib['name'] == sura_name:
            suraNumber = index
    return suraNumber


def get_sura_name(sura_number=None):
    """Returns the name of `sura_number`.

    If `sura_number=None`, a list of all suras' names is returned.

    Args:
        sura_number: Optional, 1 <= Integer <= 114, the ordered number of sura in Mushaf.

    Returns:
        str: the name of the sura whose number is sura_number.
        [str]: list of all suras' names (if the sura_number parameter is None).

    Example:
    ```python
    q.quran.get_sura_name(2)\n
    >>> 'البقرة'
    ```
    """
    # get all suras
    suras_list = quran_tree.findall('sura')
    if sura_number is None:
        suraName = [suras_list[i].attrib['name'] for i in range(0, 114)]
    else:
        # get suraName
        suraName = suras_list[sura_number-1].attrib['name']
    # return suraName
    return suraName


# Redundant:
#
def get_verse(sura_number, verse_number, with_tashkeel=False):
    """Gets a specific verse from a specific chapter.

    Args:
        sura_number: 1 <= Integer <= 114, the ordered number of sura in Mushaf.
        verse_number: Integer > 0, number of verse.
        with_tashkeel: Boolean, if true return the verse with tashkeel, else return it without.

    Returns:
        str: a verse.
    Example:
    ```python
    q.quran.get_verse(sura_number=1, verse_number=2)\n
    >>> 'الحمد لله رب العلمين'
    ```
    """
    if sura_number > ar.swar_num or verse_number <= 0:
        return ""
    try:
        return get_sura(sura_number, with_tashkeel)[verse_number-1]
    except Exception:
        # Any lookup failure (e.g. a verse number past the end of the sura)
        # yields an empty string.
        return ""

================================================
FILE: tools/searchHelper.py
================================================
"""searchHelper: contains helper functions for searching.
"""

from arabic import *
import re
from pyarabic.araby import strip_tashkeel, strip_tatweel
import quran


def count_spaces_before_index(string, index):
    """Counts spaces before a char in string.

    Args:
        param1 (str): string
        param2 (int): char index inside string

    Returns:
        int: number of spaces before string[index]
    """
    count = 0
    for i in range(index):
        if string[i] == ' ':
            count += 1
    return count


def get_string_taskeel(string):
    """Gets the tashkeel of a string without its letters.

    Args:
        param1 (str): string

    Returns:
        str: the diacritics and spaces found in `string`, in their original order
    """
    x = ''
    for char in string:
        if char in tashkeel or char == ' ':
            x += char
    return x


def hellper_get_sequance_positions(verse,sequance):
    '''
    This function takes a verse and a sequence and returns the position of
    the matched word; if the sequence occurs in the verse more than once,
    it returns a list of the positions of the matched words.
''' verse = strip_tashkeel(verse) sequance = strip_tashkeel(sequance) sequance = sequance.split() verse = verse.split() positions = [] for n,v in enumerate(verse): if v not in sequance: continue for en,se in enumerate(sequance): if se != verse[n]: break if en == len(sequance)-1: positions.append(n) n+=1 return positions def hellper_search_function(verse,sequance,verseNum,chapterNum,mode3): #split verse to tokens tokens = re.split(r' ',verse) if mode3: verse = strip_tashkeel(verse) tashkeel_ = "|".join([fatha,fathatan,damma,dammatan ,kasra,kasratan,shadda,sukun]) pattern = r"((\w|["+tashkeel_+"]*)*"+str(sequance)+"(\w|["+tashkeel_+"]*)*)" #get match_sequance matches = re.findall(pattern,verse) matches = [j.strip() for i in matches for j in i if j !=''] #check if found or not if len(matches)!=0: try: new_tokens = verse.split() positions = dict() #get position of occuerance lst = [] if len(sequance.split())>1: for tok in matches: positions[tok] = (0,hellper_get_sequance_positions( verse,tok)) else: for tok in matches: if verse.count(tok) > 1: ls = [i for i,x in enumerate(new_tokens) if x == tok] positions[tok] = (0,ls) else: positions[tok] = (0,[new_tokens.index(tok)]) if chapterNum!=0 and len(sequance.split())==1: for token in matches: loc,ls = positions[token] index = int(ls[loc]) positions[token] = (loc+1,ls) #check if exist the same token many time lst.append((tokens[index], index+1, verseNum, chapterNum)) #if matched sequance token return lst except: pass if len(sequance.split())==1: #if matched sequance token for token in matches: loc,ls = positions[token] index = int(ls[loc]) positions[token] = (loc+1,ls) #check if exist the same token many time lst.append((tokens[index], index+1)) #if matched sequance token return lst else: #check if mode3 False if not mode3: if chapterNum!=0: #if match sequance sentence return [(token,0,verseNum,chapterNum) for token in matches] else: #if match sequance sentence return [(token,0) for token in matches] else: lst = [] #if match 
sequance sentence for token in matches: new_token = [] loc,ls = positions[token] index = int(ls[loc]) positions[token] = (loc+1,ls) new_token = " ".join([str(tokens[index- len(sequance.split())+i*1+1]) for i in range(len(token.split()))]) if chapterNum!=0: lst.append((new_token,0,verseNum,chapterNum)) else: lst.append((new_token,0)) return lst return [] def hellper_pre_search_sequance(sequance,verse=None,chapterNum=0, verseNum=0,with_tashkeel=False,mode3=False): """ search about sequance in verse or chapter or Quran and return matched seqance and his position if sequance was token or sub-token ,and 0 if sequance was sentence. -cases: * if found verse as string it will search in verse that entered * if no chapterNum and no verseNum and no verse it will search in All Quran. * if no verseNumber and no verse and found chapterNum it will search in chapter. * if found chapterNum and verseNum and no verse it will search in verse. Args: verse (str): it's a verse where function search sequances (str): a sequance that you want to match it chapterNum (int) : number of chapter verseNum (int) : number of verse with_tashkeel (int) : to check if search with taskeel or not mode3 (bool) : if true it will us mode 3 to search Returns: list of tuble : (matched_sequance , his_position , verse number , chapter number ) Note: position will 0 if matched_sequance was part of sentence, and will number if matched_sequance was token or sub-token """ if verseNum<0 or chapterNum <0 : return [] #remove extra spaces sequance = re.sub(r" +"," ",sequance) sequance = sequance.strip() #strip tashkeel if with_tashkeel flage is false if not with_tashkeel: sequance = strip_tashkeel(sequance) #search in verse that enterd if verse != None: return hellper_search_function(verse,sequance,verseNum,chapterNum,mode3) else: #chech if specific chapter if chapterNum!=0: #check if specific verse if verseNum!=0: verse = quran.get_verse(chapterNum,verseNum,with_tashkeel) return hellper_search_function(verse,sequance, 
verseNum, chapterNum, mode3) else: #search in Chapter verses = quran.get_sura(chapterNum,with_tashkeel) return sum([hellper_search_function(v,sequance, num+1,chapterNum, mode3) for num,v in enumerate(verses)], []) else: #search in all Quran final_list = [] for i in range(swar_num): verses = quran.get_sura(i+1,with_tashkeel) final_list += sum([hellper_search_function(v,sequance, num+1,i+1, mode3) for num,v in enumerate(verses)], []) return final_list def hellper_frequency_of_chars_in_verse(verse,charaters): """ this function count number of characters occurrence in verse Args: verse (str): this verse that you need to count it and default is None. chracter (list) : list of characters that you want to count them Returns: {dic} : a dictionary and keys is a characters and value is count of every chracter. """ #dectionary that have frequency frequency = dict() #count frequency of chars for char in charaters: frequency[char] = verse.count(char) return frequency def hamming_distance(s1, s2): ''' get number of different character in s1 and s2 ''' return sum(el1 != el2 for el1, el2 in zip(s1, s2)) def get_word_num(char_num,sentece): ''' take's the position of letter and return the position of word that has this letter ''' lis = [len(i) for i in sentece.split()] coun = 0 for i,l in enumerate(lis): coun +=l if char_num <= coun: return i def hellper_search_with_pattern(pattern,sentence_pattern,sentence,ratio=1): ''' this function takes 0's,1's pattern and retuen matched words from sentence pattern dependent on the ratio to adopt threshold. Args: pattern (str): 0's,1's pattern that you need to search. sentence_pattern (str): 0's,1's pattern of sentence to search inside it. sentence (str): the real sentence in text format. ratio (float): threshold of similarity , if 1 it will get the similar exactly, and if not ,it will get dependant on ratio number. Return: [[list]] : it will return list of listes that have matched word, or matched senteces and return empty list if not found. 
    '''
    sentence_pattern_sequance = sentence_pattern.replace(" ", "")
    pattern_len = len(pattern)
    if pattern_len > len(sentence):
        return []
    lis = []
    s = 0
    e = pattern_len
    i = 0
    # Slide a window of pattern_len over the space-free sentence pattern.
    while i <= len(sentence_pattern_sequance) - pattern_len:
        sen = sentence_pattern_sequance[s:e]
        dif = hamming_distance(sen, pattern) / pattern_len
        if 1 - dif >= ratio:
            matched = sentence.split()[get_word_num(s+1, sentence_pattern):get_word_num(e, sentence_pattern)+1]
            matched = " ".join(matched)
            if matched not in lis:
                lis.append(matched)
        s += 1
        e += 1
        i += 1
    return lis

================================================
FILE: tools/shapeHelper.py
================================================
'''shapeHelper: contains helper functions for shape systems.
'''
from arabic import *
from itertools import chain


def searcher(system, ch):
    # Returns the index of the first subsystem containing `ch`,
    # or None (implicitly) when no subsystem contains it.
    for i in range(0, len(system), 1):
        if ch in system[i]:
            return i


def convert_text_to_numbers(text, alphabetMap):
    """Converts a text (surah or ayah) to a list of numbers according to
    the alphabetMap dictionary.

    What it does:
        It converts each letter to the number assigned to it in the
        dictionary given as argument.

    Args:
        param1 ([str]): the text to convert; it is iterated item by item,
                        so every item must be a key of `alphabetMap`.
        param2 (dict): a dictionary mapping each letter to its
                       corresponding number.

    Returns:
        List: list of numbers, where each char in the text is converted
              to its number.
    """
    textToNumber = []
    for char in text:
        textToNumber.append(alphabetMap[char])
    return textToNumber


def check_repetation(system):
    # True when some letter appears more than once across the subsystems.
    diff = len(list(chain(*system))) - len(list(set(chain(*system))))
    return diff > 0
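
# A minimal usage sketch for `searcher` and `check_repetation`, assuming they
# behave as defined above. The two helpers are repeated here as stand-in copies
# so the snippet runs on its own; the sample systems and the result variable
# names are illustrative assumptions, not part of the library.

```python
from itertools import chain


def searcher(system, ch):
    # Stand-in copy: index of the first subsystem containing `ch`, else None.
    for i, subsystem in enumerate(system):
        if ch in subsystem:
            return i
    return None


def check_repetation(system):
    # Stand-in copy: True when a letter occurs in more than one place.
    letters = list(chain(*system))
    return len(letters) != len(set(letters))


# Letters written as code points (same values as in tools/arabic.py).
beh, teh, theh = '\u0628', '\u062a', '\u062b'
jeem, hah, khah = '\u062c', '\u062d', '\u062e'
meem = '\u0645'

# A well-formed system: every letter belongs to exactly one subsystem.
valid_system = [[beh, teh, theh], [jeem, hah, khah]]
# An ill-formed system: `beh` appears in both subsystems.
invalid_system = [[beh, teh, theh], [jeem, hah, khah, beh]]

valid_has_repeats = check_repetation(valid_system)
invalid_has_repeats = check_repetation(invalid_system)
hah_group = searcher(valid_system, hah)        # hah sits in the second subsystem
missing_group = searcher(valid_system, meem)   # meem is in neither subsystem
```

# A repeated letter is the condition that `test_count_rasm` (test case 4)
# expects `count_rasm` to reject with a ValueError.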