[
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nenv/\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\n*.egg-info/\n.installed.cfg\n*.egg\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.cache\nnosetests.xml\ncoverage.xml\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n### PyCharm ###\n# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm\n\n## Directory-based project format\n.idea/\n# if you remove the above rule, at least ignore user-specific stuff:\n# .idea/workspace.xml\n# .idea/tasks.xml\n# and these sensitive or high-churn files:\n# .idea/dataSources.ids\n# .idea/dataSources.xml\n# .idea/sqlDataSources.xml\n# .idea/dynamic.xml\n\n## File-based project format\n*.ipr\n*.iws\n*.iml\n\n## Additional for IntelliJ\nout/\n\n# generated by mpeltonen/sbt-idea plugin\n.idea_modules/\n\n"
  },
  {
    "path": ".travis.yml",
    "content": "sudo: false\n\nnotifications:\n    email: false\n\nlanguage: python\n\npython:\n  - \"2.7\"\n\nbefore_install:\n    - wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh\n    - bash miniconda.sh -b -p $HOME/miniconda\n    - export PATH=\"$HOME/miniconda/bin:$PATH\"\n    - conda update --yes conda\n\ninstall:\n    - conda install --yes python=$TRAVIS_PYTHON_VERSION pip numpy scipy\n    - pip install coverage\n    - pip install https://github.com/tariqdaouda/rabaDB/archive/stable.zip\n    - python setup.py install\n\nscript: coverage run -m unittest discover pyGeno/tests/\n\nafter_success: bash <(curl -s https://codecov.io/bash)\n"
  },
  {
    "path": "CHANGELOG.rst",
    "content": "1.3.2\n=====\n\n* Search now uses KMD by default instead of dichotomic search (massive speed gain). Many thanks to @Keija for the implementation. Go to https://github.com/tariqdaouda/pyGeno/pull/34 for details and benchmarks.\n\n1.3.1\n=====\n\n* CSVFile: fixed bug when slice start was None\n* CSVFile: Better support for string separator\n* AGN SNPs Quality cast to float by importer\n* Travis integration\n* Minor CSV parser updates\n\n1.3.0\n=====\n\n* CSVFile will now ignore empty lines and comments\n\n* Added synonymousCodonsFrequencies\n\n1.2.9\n=====\n\n* It is no longer mandatory to set the whole legend of CSV file at initialization. It can figure it out by itself\n\n* Datawraps can now be uncompressed folders\n\n* Explicit error message if there's no file name manifest.ini in datawrap\n\n\n1.2.8\n=====\n\n* Fixed BUG that prevented proper initialization and importation\n\n1.2.5\n=====\n\n* BUG FIX: Opening a lot of chromosomes caused mmap to die screaming\n\n* Removed core indexes. Sqlite sometimes chose them instead of user defined positional indexes, resulting un slow queries\n\n* Doc updates\n\n1.2.3\n=====\n\n* Added functions to retrieve the names of imported snps sets and genomes\n\n* Added remote datawraps to the boostrap module that can be downloaded from pyGeno's website or any other location\n\n* Added a field uniqueId to AgnosticSNPs\n\n* Changed all latin datawrap names to english\n\n* Removed datawrap for dbSNP GRCh37\n\n1.2.2\n=====\n\n* Updated pypi package to include bootstrap datawraps\n\n1.2.1\n=====\n\n* Fixed tests\n\n1.2.0\n=====\n* BUG FIX: get()/iterGet() now works for SNPs and Indels\n\n* BUG FIX: Default SNP filter used to return unwanted Nones for insertions\n\n* BUG FIX: Added cast of lines to str in VCF and CasavaSNP parsers. 
Sometimes unicode caracters made the translation bug  \n\n* BUG FIX: Corrected a typo that caused find in Transcript to recursively die \n\n* Added a new AgnosticSNP type of SNPs that can be easily made from the results of any snp caller. To make for the loss of support for casava by illumina. See SNP.AgnosticSNP for documentation\n\n* pyGeno now comes with the murine reference genome GRCm38.78\n\n* pyGeno now comes with the human reference genome GRCh38.78, GRCh37.75 is still shipped with pyGeno but might be in the upcoming versions\n\n* pyGeno now comes with a datawrap for common dbSNPs human SNPs (SNP_dbSNP142_human_common_all.tar.gz)\n\n* Added a dummy AgnosticSNP datawrap example for boostraping\n\n* Changed the interface of the bootstrap module\n\n* CSV Parser has now the ability to stream directly to a file\n\n\n1.1.7\n=====\n\n* BUG FIX: looping through CSV lines now works\n\n* Added tests for CSV\n\n1.1.6\n=====\n\n* BUG FIX: find in BinarySequence could not find some subsequences at the tail of sequence\n\n1.1.5\n=====\n\n* BUG FIX in default SNP filter\n\n* Updated description\n\n1.1.4\n=====\n\n* Another BUG FIX in progress bar\n\n1.1.3\n=====\n\n* Small BUG FIX in the progress bar that caused epochs to be misrepresented\n\n* 'Specie' has been changed to 'species' everywhere. That breaks the database the only way to fix it is to redo all importations\n\n1.1.2\n=====\n\n* Genome import is now much more memory efficient\n\n* BUG FIX: find in BinarySequence could not find subsequences at the tail of sequence\n\n* Added a built-in datawrap with only chr1 and y\n\n* Readme update with more infos about importation and link to doc\n \n1.1.1\n=====\n\nMuch better SNP filtering interface\n------------------------------------\nEasier and much morr coherent:\n\n* SNP filtering has now its own module\n\n* SNP Filters are now objects\n\n* SNP Filters must return SequenceSNP, SNPInsert, SNPDeletion or None objects\n\n1.0.0\n=====\nFreshly hatched\n\n"
  },
  {
    "path": "DESCRIPTION.rst",
    "content": "pyGeno: A python package for Precision Medicine, Personalized Genomics and Proteomics\n=====================================================================================\n\nShort description:\n------------------\n\npyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_).\n\n.. _Tariq Daouda: http://www.tariqdaouda.com\n.. _IRIC: http://www.iric.ca\n\nWith pyGeno you can do that:\n\n.. code:: python\n\n from pyGeno.Genome import *\n \n #load a genome \n ref = Genome(name = 'GRCh37.75')\n #load a gene\n gene = ref.get(Gene, name = 'TPST2')[0]\n #print the sequences of all the isoforms\n for prot in gene.get(Protein) :\n  print prot.sequence\n\nYou can also do it for the **specific genomes** of your subjects:\n\n.. code:: python\n\n pers = Genome(name = 'GRCh37.75', SNPs = [\"RNA_S1\"], SNPFilter = myFilter())\n\nAnd much more: https://github.com/tariqdaouda/pyGeno\n\nVerbose Description\n--------------------\n\npyGeno is a personal bioinformatic database that runs directly into python, on your laptop and does not depend\nupon any REST API. pyGeno is here to make extracting data such as gene sequences a breeze, and is designed to\nbe able cope with huge queries. 
The most exciting feature of pyGeno, is that it allows to work with seamlessly with both reference and **Presonalized Genomes**.\n\nPersonalized Genomes, are custom genomes that you create by combining a reference genome, sets of polymorphims and an optional filter.\npyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get\ndirect access to the DNA and Protein sequences of your patients.\n\nMultiple sets of of polymorphisms can also be combined together to leverage their independent benefits ex: \n\nRNA-seq and DNA-seq for the same individual to improve the coverage\nRNA-seq of an individual + dbSNP for validation\nCombine the results of RNA-seq of several individual to create a genome only containing the common polymorphisms\npyGeno is also a personal database that give you access to all the information provided by Ensembl (for both Reference and Personalized Genomes) without the need of queries to distant HTTP APIs. Allowing for much faster and reliable genome wide study pipelines.\n\nIt also comes with parsers for several file types and various other useful tools.\n\nFull Documentation\n------------------\n\nThe full documentation is available here_\n\n.. _here: http://pygeno.iric.ca/\n\nIf you like pyGeno, please let me know.\nFor the latest news, you can follow me on twitter `@tariqdaouda`_.\n\n.. _@tariqdaouda: https://www.twitter.com/tariqdaouda\n"
  },
  {
    "path": "LICENSE",
    "content": "Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      form, that is based on (or derived 
from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"{}\"\n      replaced with your own identifying information. 
(Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright {yyyy} {name of copyright owner}\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "MANIFEST.in",
    "content": "include *.rst\ninclude LICENSE\n"
  },
  {
    "path": "README.rst",
    "content": "CODE FREEZE:\n============\n\nPyGeno has long been limited due to it's backend. We are now ready to take it to the next level.\n\nWe are working on a major port of pyGeno to the open-source multi-modal database ArangoDB. PyGeno's code on both branches master and bloody is frozen until we are finished. No pull request will be merged until then, and we won't implement any new features.\n\npyGeno: A Python package for precision medicine and proteogenomics\n==================================================================\n\n.. image:: http://depsy.org/api/package/pypi/pyGeno/badge.svg\n   :alt: depsy\n   :target: http://depsy.org/package/python/pyGeno\n\n.. image:: https://pepy.tech/badge/pygeno\n   :alt: downloads\n   :target: https://pepy.tech/project/pygeno\n\n.. image:: https://pepy.tech/badge/pygeno/month\n   :alt: downloads_month\n   :target: https://pepy.tech/project/pygeno/month\n\n.. image:: https://pepy.tech/badge/pygeno/week\n   :alt: downloads_week\n   :target: https://pepy.tech/project/pygeno/week\n\n.. image:: http://bioinfo.iric.ca/~daoudat/pyGeno/_static/logo.png\n   :alt: pyGeno's logo\n   \n\npyGeno is (to our knowledge) the only tool available that will gladly build your specific genomes for you.\n\npyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_), its logo is the work of the freelance designer `Sawssan Kaddoura`_.\nFor the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.\n\n.. _Tariq Daouda: http://wwww.tariqdaouda.com\n.. _IRIC: http://www.iric.ca\n.. _Sawssan Kaddoura: http://sawssankaddoura.com\n\nClick here for The `full documentation`_.\n\n.. _full documentation: http://pygeno.iric.ca/\n\nFor the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.\n\n.. _@tariqdaouda: https://www.twitter.com/tariqdaouda\n\nCiting pyGeno:\n--------------\nPlease cite this paper_.\n\n.. 
_paper: http://f1000research.com/articles/5-381/v1\n\nInstallation:\n-------------\n\nIt is recommended to install pyGeno within a `virtual environment`_; to set one up, you can use:\n\n.. code:: shell\n\n        virtualenv ~/.pyGenoEnv\n        source ~/.pyGenoEnv/bin/activate\n\npyGeno can be installed through pip:\n\n.. code:: shell\n\t\n\tpip install pyGeno #for the latest stable version\n\nOr from GitHub, for the latest developments:\n\n.. code:: shell\n\n\tgit clone https://github.com/tariqdaouda/pyGeno.git\n\tcd pyGeno\n\tpython setup.py develop\n\n.. _`virtual environment`: http://virtualenv.readthedocs.org/\n\nA brief introduction\n--------------------\n\npyGeno is a personal bioinformatics database that runs directly in Python, on your laptop, and does not depend\non any REST API. pyGeno is here to make extracting data such as gene sequences a breeze, and is designed to\ncope with huge queries. The most exciting feature of pyGeno is that it lets you work seamlessly with both reference and **Personalized Genomes**.\n\nPersonalized Genomes are custom genomes that you create by combining a reference genome, sets of polymorphisms and an optional filter.\npyGeno takes care of applying the filter and inserting the polymorphisms at the right places, so you get\ndirect access to the DNA and protein sequences of your patients.\n\n.. 
code:: python\n\n\tfrom pyGeno.Genome import *\n\t\n\tg = Genome(name = \"GRCh37.75\")\n\tprot = g.get(Protein, id = 'ENSP00000438917')[0]\n\t#print the protein sequence\n\tprint prot.sequence\n\t#print the protein's gene biotype\n\tprint prot.gene.biotype\n\t#print the protein's transcript sequence\n\tprint prot.transcript.sequence\n\t\n\t#fancy queries\n\tfor exon in g.get(Exon, {\"CDS_start >\": x1, \"CDS_end <=\" : x2, \"chromosome.number\" : \"22\"}) :\n\t\t#print the exon's coding sequence\n\t\tprint exon.CDS\n\t\t#print the exon's transcript sequence\n\t\tprint exon.transcript.sequence\n\t\n\t#You can do the same for your subject-specific genomes\n\t#by combining a reference genome with polymorphisms\n\tg = Genome(name = \"GRCh37.75\", SNPs = [\"STY21_RNA\"], SNPFilter = MyFilter())\n\nAnd if you ever get lost, there's an online **help()** function for each object type:\n\n.. code:: python\n\n\tfrom pyGeno.Genome import *\n\t\n\tprint Exon.help()\n\nShould output:\n\n.. code::\n\t\n\tAvailable fields for Exon: CDS_start, end, chromosome, CDS_length, frame, number, CDS_end, start, genome, length, protein, gene, transcript, id, strand\n\nCreating a Personalized Genome:\n-------------------------------\nPersonalized Genomes are a powerful feature that allows you to work on the specific genomes and proteomes of your patients. You can even mix several SNP sets together.\n\n.. code:: python\n\n  from pyGeno.Genome import Genome\n  #the name of the SNP set is defined inside the datawrap's manifest.ini file\n  dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')\n  #you can also define a filter (ex: a quality filter) for the SNPs\n  dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())\n  #and even mix several SNP sets\n  dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())\n\nFiltering SNPs:\n---------------\npyGeno allows you to select the polymorphisms that end up in the final sequences. 
It supports SNPs, insertions and deletions.\n\n.. code:: python\n\t\n\tfrom pyGeno.SNPFiltering import SNPFilter, SequenceSNP\n\n\tclass QMax_gt_filter(SNPFilter) :\n\t\t\n\t\tdef __init__(self, threshold) :\n\t\t\tself.threshold = threshold\n\t\t\n\t\t#Here SNPs is a dictionary: SNPSet name => polymorphism\n\t\t#This filter ignores deletions and insertions\n\t\t#but applies all SNPs\n\t\tdef filter(self, chromosome, **SNPs) :\n\t\t\tsources = {}\n\t\t\talleles = []\n\t\t\tfor snpSet, snp in SNPs.iteritems() :\n\t\t\t\tpos = snp.start\n\t\t\t\tif snp.alt[0] == '-' :\n\t\t\t\t\tpass #deletion, ignored\n\t\t\t\telif snp.ref[0] == '-' :\n\t\t\t\t\tpass #insertion, ignored\n\t\t\t\telse :\n\t\t\t\t\tsources[snpSet] = snp\n\t\t\t\t\talleles.append(snp.alt) #if not an indel, append the polymorphism\n\t\t\t\n\t\t\t#append the reference allele to the lot\n\t\t\trefAllele = chromosome.refSequence[pos]\n\t\t\talleles.append(refAllele)\n\t\t\tsources['ref'] = refAllele\n\t\n\t\t\t#optionally, we keep a record of the polymorphisms that were used during the process\n\t\t\treturn SequenceSNP(alleles, sources = sources)\n\t\t\nThe filter function can also be made more specific by using arguments that have the same names as the SNPSets:\n\n.. code:: python\n\n\tdef filter(self, chromosome, dummySRY = None) :\n\t\tif dummySRY is not None and dummySRY.Qmax_gt > self.threshold :\n\t\t\t#other possible return values are SequenceInsert(<bases>) and SequenceDelete(<length>)\n\t\t\treturn SequenceSNP(dummySRY.alt)\n\t\treturn None #None means keep the reference allele\n\nTo apply the filter, simply pass it when loading the genome.\n\n.. code:: python\n\n\tpersGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))\n\nTo include several SNPSets, use a list.\n\n.. code:: python\n\n\tpersGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = ['ARN_P1', 'ARN_P2'], SNPFilter = myFilter())\n\nGetting an arbitrary sequence:\n------------------------------\nYou can ask for any sequence of any chromosome:\n\n.. 
code:: python\n\t\n\tchr12 = myGenome.get(Chromosome, number = \"12\")[0]\n\tprint chr12.sequence[x1:x2]\n\t#for the reference sequence\n\tprint chr12.refSequence[x1:x2]\n\nBatteries included (bootstrapping):\n-----------------------------------\n\npyGeno's database is populated by importing datawraps.\npyGeno comes with a few datawraps; to get the list you can use:\n\n.. code:: python\n\t\n\timport pyGeno.bootstrap as B\n\tB.printDatawraps()\n\n.. code::\n\n\tAvailable datawraps for boostraping\n\t\n\tSNPs\n\t~~~~|\n\t    |~~~:> Human_agnostic.dummySRY.tar.gz\n\t    |~~~:> Human.dummySRY_casava.tar.gz\n\t    |~~~:> dbSNP142_human_common_all.tar.gz\n\t\n\t\n\tGenomes\n\t~~~~~~~|\n\t       |~~~:> Human.GRCh37.75.tar.gz\n\t       |~~~:> Human.GRCh37.75_Y-Only.tar.gz\n\t       |~~~:> Human.GRCh38.78.tar.gz\n\t       |~~~:> Mouse.GRCm38.78.tar.gz\n\nTo get a list of remote datawraps that pyGeno can download for you, do:\n\n.. code:: python\n\n\tB.printRemoteDatawraps()\n\nImporting a whole genome is a demanding process that takes more than an hour and requires (according to tests)\nat least 3GB of memory. Depending on your configuration, more might be required.\n\nThat being said, importing a datawrap is a one-time operation; once the importation is complete, the datawrap\ncan be discarded without consequences.\n\nThe bootstrap module also has some handy functions for importing built-in packages.\n\nSome of them are just for playing around with pyGeno (**fast importation**, **small memory requirements**):\n\n.. code:: python\n\t\n\timport pyGeno.bootstrap as B\n\n\t#Imports only the Y chromosome from the human reference genome GRCh37.75\n\t#Very fast, requires even less memory. No download required.\n\tB.importGenome(\"Human.GRCh37.75_Y-Only.tar.gz\")\n\t\n\t#A dummy datawrap for human SNPs and indels in pyGeno's AgnosticSNP format. 
\n\t# This one has one SNP at the beginning of the gene SRY\n\tB.importSNPs(\"Human.dummySRY_casava.tar.gz\")\n\nAnd for more **serious work**, the whole reference genome:\n\n.. code:: python\n\n\t#Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.\n\tB.importGenome(\"Human.GRCh38.78.tar.gz\")\n\t\nImporting a custom datawrap:\n----------------------------\n\n.. code:: python\n\n  from pyGeno.importation.Genomes import *\n  importGenome('GRCh37.75.tar.gz')\n\nTo import a patient's specific polymorphisms:\n\n.. code:: python\n\n  from pyGeno.importation.SNPs import *\n  importSNPs('patient1.tar.gz')\n\nFor a list of datawraps available for download, please have a look here_.\n\nYou can easily make your own datawraps with any tar.gz compressor.\nFor more details on how datawraps are made, you can check the wiki_ or have a look inside the folder bootstrap_data.\n\n.. _here: http://pygeno.iric.ca/datawraps.html\n.. _wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-friendly-package-to-import-your-data%3F\n\nInstantiating a genome:\n-----------------------\n.. code:: python\n\t\n\tfrom pyGeno.Genome import Genome\n\t#the name of the genome is defined inside the package's manifest.ini file\n\tref = Genome(name = 'GRCh37.75')\n\nPrinting all the proteins of a gene:\n------------------------------------\n.. code:: python\n\n  from pyGeno.Genome import Genome\n  from pyGeno.Gene import Gene\n  from pyGeno.Protein import Protein\n\nOr simply:\n\n.. code:: python\n\n  from pyGeno.Genome import *\n\nthen:\n\n.. 
code:: python\n\n  ref = Genome(name = 'GRCh37.75')\n  #get returns a list of elements\n  gene = ref.get(Gene, name = 'TPST2')[0]\n  for prot in gene.get(Protein) :\n  \tprint prot.sequence\n\nMaking queries, get() vs iterGet():\n-----------------------------------\niterGet is a faster version of get that returns an iterator instead of a list.\n\nMaking queries, syntax:\n-----------------------\npyGeno's get function uses the expressivity of rabaDB.\n\nThese are all possible query formats:\n\n.. code:: python\n\n  ref.get(Gene, name = \"SRY\")\n  ref.get(Gene, { \"name like\" : \"HLA\"})\n  chr12.get(Exon, { \"start >=\" : 12000, \"end <\" : 12300 })\n  ref.get(Transcript, { \"gene.name\" : 'SRY' })\n\nCreating indexes to speed up queries:\n-------------------------------------\n.. code:: python\n\n  from pyGeno.Gene import Gene\n  #creating an index on gene names if it does not already exist\n  Gene.ensureGlobalIndex('name')\n  #removing the index\n  Gene.dropIndex('name')\n\nFind in sequences:\n------------------\n\nInternally, pyGeno uses a binary representation of nucleotides and amino acids to deal with polymorphisms.\nFor example, both \"AGC\" and \"ATG\" will match the following sequence \"...AT/GCCG...\".\n\n.. code:: python\n\n\t#returns the position of the first occurrence\n\ttranscript.find(\"AT/GCCG\")\n\t#returns the positions of all occurrences\n\ttranscript.findAll(\"AT/GCCG\")\n\t\n\t#similarly, you can also do\n\ttranscript.findIncDNA(\"AT/GCCG\")\n\ttranscript.findAllIncDNA(\"AT/GCCG\")\n\ttranscript.findInUTR3(\"AT/GCCG\")\n\ttranscript.findAllInUTR3(\"AT/GCCG\")\n\ttranscript.findInUTR5(\"AT/GCCG\")\n\ttranscript.findAllInUTR5(\"AT/GCCG\")\n\t\n\t#same for proteins\n\tprotein.find(\"DEV/RDEM\")\n\tprotein.findAll(\"DEV/RDEM\")\n\t\n\t#and for exons\n\texon.find(\"AT/GCCG\")\n\texon.findAll(\"AT/GCCG\")\n\texon.findInCDS(\"AT/GCCG\")\n\texon.findAllInCDS(\"AT/GCCG\")\n\t#...\n\nProgress Bar:\n-------------\n.. 
code:: python\n\n  from pyGeno.tools.ProgressBar import ProgressBar\n  pg = ProgressBar(nbEpochs = 155)\n  for i in range(155) :\n  \tpg.update(label = '%d' %i) # or simply pg.update() \n  pg.close()\n\n"
  },
  {
    "path": "pyGeno/Chromosome.py",
"content": "#import copy\n#import types\n#from tools import UsefulFunctions as uf\n\nfrom types import *\nimport configuration as conf\nfrom pyGenoObjectBases import *\n\nfrom SNP import *\nimport SNPFiltering as SF\n\nfrom rabaDB.filters import RabaQuery\nimport rabaDB.fields as rf\n\nfrom tools.SecureMmap import SecureMmap as SecureMmap\nfrom tools import SingletonManager\n\nclass ChrosomeSequence(object) :\n\t\"\"\"Represents a chromosome sequence. If 'refOnly' is set, no polymorphisms are applied and the reference sequence is always returned\"\"\"\n\n\tdef __init__(self, data, chromosome, refOnly = False) :\n\t\t\n\t\tself.data = data\n\t\tself.refOnly = refOnly\n\t\tself.chromosome = chromosome\n\t\tself.setSNPFilter(self.chromosome.genome.SNPFilter)\n\t\n\tdef setSNPFilter(self, SNPFilter) :\n\t\tself.SNPFilter = SNPFilter\n\t\n\tdef getSequenceData(self, slic) :\n\t\tdata = self.data[slic]\n\t\tSNPTypes = self.chromosome.genome.SNPTypes\n\t\tif SNPTypes is None or self.refOnly :\n\t\t\treturn data\n\t\t\n\t\titerators = []\n\t\tfor setName, SNPType in SNPTypes.iteritems() :\n\t\t\tf = RabaQuery(str(SNPType), namespace = self.chromosome._raba_namespace)\n\t\t\t\n\t\t\tchromosomeNumber = self.chromosome.number\n\n\t\t\tif chromosomeNumber == 'MT':\n\t\t\t\tchromosomeNumber = 'M'\n\t\t\t\n\t\t\tf.addFilter({'start >=' : slic.start, 'start <' : slic.stop, 'setName' : str(setName), 'chromosomeNumber' : chromosomeNumber})\n\t\t\t# conf.db.enableDebug(True)\n\t\t\titerators.append(f.iterRun(sqlTail = 'ORDER BY start'))\n\t\t\n\t\tif len(iterators) < 1 :\n\t\t\treturn data\n\t\t\n\t\tpolys = {}\n\t\tfor iterator in iterators :\n\t\t\tfor poly in iterator :\n\t\t\t\tif poly.start not in polys :\n\t\t\t\t\tpolys[poly.start] = {poly.setName : poly}\n\t\t\t\telse :\n\t\t\t\t\ttry :\n\t\t\t\t\t\tpolys[poly.start][poly.setName].append(poly)\n\t\t\t\t\texcept AttributeError :\n\t\t\t\t\t\t#the entry is still a single polymorphism, turn it into a list first\n\t\t\t\t\t\tpolys[poly.start][poly.setName] = 
[polys[poly.start][poly.setName]]\n\t\t\t\t\t\tpolys[poly.start][poly.setName].append(poly)\n\t\t\t\t\t\t\n\t\tdata = list(data)\n\t\tfor start, setPolys in polys.iteritems() :\n\t\t\t\n\t\t\tseqPos = start - slic.start\n\t\t\tsequenceModifier = self.SNPFilter.filter(self.chromosome, **setPolys)\n\t\t\t# print sequenceModifier.alleles\n\t\t\tif sequenceModifier is not None :\n\t\t\t\tif sequenceModifier.__class__ is SF.SequenceDel :\n\t\t\t\t\tseqPos = seqPos + sequenceModifier.offset\n\t\t\t\t\t#replace deleted bases with empty strings so the length of the sequence does not change, avoiding side effects\n\t\t\t\t\tdata[seqPos:(seqPos + sequenceModifier.length)] = [''] * sequenceModifier.length\n\t\t\t\telif sequenceModifier.__class__ is SF.SequenceSNP :\n\t\t\t\t\tdata[seqPos] = sequenceModifier.alleles\n\t\t\t\telif sequenceModifier.__class__ is SF.SequenceInsert :\n\t\t\t\t\tseqPos = seqPos + sequenceModifier.offset\n\t\t\t\t\tdata[seqPos] = \"%s%s\" % (data[seqPos], sequenceModifier.bases)\n\t\t\t\telse :\n\t\t\t\t\traise TypeError(\"sequenceModifier on chromosome: %s starting at: %s is of unknown type: %s\" % (self.chromosome.number, start, sequenceModifier.__class__))\n\n\t\treturn data\n\t\n\tdef _getSequence(self, slic) :\n\t\treturn ''.join(self.getSequenceData(slice(0, None, 1)))[slic]\n\n\tdef __getitem__(self, i) :\n\t\treturn self._getSequence(i)\n\n\tdef __len__(self) :\n\t\treturn self.chromosome.length\n\nclass Chromosome_Raba(pyGenoRabaObject) :\n\t\"\"\"The wrapped Raba object that really holds the data\"\"\"\n\t\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\n\theader = rf.Primitive()\n\tnumber = rf.Primitive()\n\tstart = rf.Primitive()\n\tend = rf.Primitive()\n\tlength = rf.Primitive()\n\n\tgenome = rf.RabaObject('Genome_Raba')\n\n\tdef _curate(self) :\n\t\tif  self.end != None and self.start != None :\n\t\t\tself.length = self.end-self.start\n\t\tif self.number != None :\n\t\t\tself.number =  str(self.number).upper()\n\nclass Chromosome(pyGenoRabaObjectWrapper) 
:\n\t\"\"\"The wrapper for playing with Chromosomes\"\"\"\n\t\n\t_wrapped_class = Chromosome_Raba\n\n\tdef __init__(self, *args, **kwargs) :\n\t\tpyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)\n\n\t\tpath = '%s/chromosome%s.dat'%(self.genome.getSequencePath(), self.number)\n\t\tif not SingletonManager.contains(path) :\n\t\t\tdatMap = SingletonManager.add(SecureMmap(path), path)\n\t\telse :\n\t\t\tdatMap = SingletonManager.get(path)\n\t\t\t\n\t\tself.sequence = ChrosomeSequence(datMap, self)\n\t\tself.refSequence = ChrosomeSequence(datMap, self, refOnly = True)\n\t\tself.loadSequences = False\n\n\tdef getSequenceData(self, slic) :\n\t\treturn self.sequence.getSequenceData(slic)\n\n\tdef _makeLoadQuery(self, objectType, *args, **coolArgs) :\n\t\tif issubclass(objectType, SNP_INDEL) :\n\t\t\tf = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)\n\t\t\tcoolArgs['species'] = self.genome.species\n\t\t\tcoolArgs['chromosomeNumber'] = self.number\n\n\t\t\tif len(args) > 0 and type(args[0]) is types.ListType :\n\t\t\t\tfor a in args[0] :\n\t\t\t\t\tif type(a) is types.DictType :\n\t\t\t\t\t\tf.addFilter(**a)\n\t\t\telse :\n\t\t\t\tf.addFilter(*args, **coolArgs)\n\n\t\t\treturn f\n\t\t\n\t\treturn pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)\n\n\tdef __getitem__(self, i) :\n\t\treturn self.sequence[i]\n\n\tdef __str__(self) :\n\t\treturn \"Chromosome: number %s > %s\" %(self.wrapped_object.number, str(self.wrapped_object.genome))\n"
  },
  {
    "path": "pyGeno/Exon.py",
"content": "import copy\nimport types\n\nfrom pyGenoObjectBases import *\nfrom SNP import SNP_INDEL\n\nfrom rabaDB.filters import RabaQuery\nimport rabaDB.fields as rf\nfrom tools import UsefulFunctions as uf\nfrom tools.BinarySequence import NucBinarySequence\n\nclass Exon_Raba(pyGenoRabaObject) :\n\t\"\"\"The wrapped Raba object that really holds the data\"\"\"\n\t\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\n\tid = rf.Primitive()\n\tnumber = rf.Primitive()\n\tstart = rf.Primitive()\n\tend = rf.Primitive()\n\tlength = rf.Primitive()\n\tCDS_length = rf.Primitive()\n\tCDS_start = rf.Primitive()\n\tCDS_end = rf.Primitive()\n\tframe = rf.Primitive()\n\tstrand = rf.Primitive()\n\n\tgenome = rf.RabaObject('Genome_Raba')\n\tchromosome = rf.RabaObject('Chromosome_Raba')\n\tgene = rf.RabaObject('Gene_Raba')\n\ttranscript = rf.RabaObject('Transcript_Raba')\n\tprotein = rf.RabaObject('Protein_Raba')\n\n\tdef _curate(self) :\n\t\tif self.start != None and self.end != None :\n\t\t\tif self.start > self.end :\n\t\t\t\tself.start, self.end = self.end, self.start\n\t\t\tself.length = self.end-self.start\n\n\t\tif self.CDS_start != None and self.CDS_end != None :\n\t\t\tif self.CDS_start > self.CDS_end :\n\t\t\t\tself.CDS_start, self.CDS_end = self.CDS_end, self.CDS_start\n\t\t\tself.CDS_length = self.CDS_end - self.CDS_start\n\t\t\n\t\tif self.number != None :\n\t\t\tself.number = int(self.number)\n\n\t\tif not self.frame or self.frame == '.' 
:\n\t\t\tself.frame = None\n\t\telse :\n\t\t\tself.frame = int(self.frame)\n\nclass Exon(pyGenoRabaObjectWrapper) :\n\t\"\"\"The wrapper for playing with Exons\"\"\"\n\t\t\n\t_wrapped_class = Exon_Raba\n\n\tdef __init__(self, *args, **kwargs) :\n\t\tpyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)\n\t\tself._load_sequencesTriggers = set([\"UTR5\", \"UTR3\", \"CDS\", \"sequence\", \"data\"])\n\n\tdef _makeLoadQuery(self, objectType, *args, **coolArgs) :\n\t\tif issubclass(objectType, SNP_INDEL) :\n\t\t\tf = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)\n\t\t\tcoolArgs['species'] = self.genome.species\n\t\t\tcoolArgs['chromosomeNumber'] = self.chromosome.number\n\t\t\tcoolArgs['start >='] = self.start\n\t\t\tcoolArgs['start <'] = self.end\n\t\t\n\t\t\tif len(args) > 0 and type(args[0]) is types.ListType :\n\t\t\t\tfor a in args[0] :\n\t\t\t\t\tif type(a) is types.DictType :\n\t\t\t\t\t\tf.addFilter(**a)\n\t\t\telse :\n\t\t\t\tf.addFilter(*args, **coolArgs)\n\n\t\t\treturn f\n\t\t\n\t\treturn pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)\n\t\n\tdef _load_data(self) :\n\t\tdata = self.chromosome.getSequenceData(slice(self.start,self.end))\n\n\t\tdiffLen = (self.end-self.start) - len(data)\n\t\t\n\t\tif self.strand == '+' :\n\t\t\tself.data = data\n\t\telse :\n\t\t\tself.data = uf.reverseComplementTab(data)\n\n\t\tif self.hasCDS() :\n\t\t\tstart = self.CDS_start-self.start\n\t\t\tend = self.CDS_end-self.start\n\t\t\t\n\t\t\tif self.strand == '+' :\n\t\t\t\tself.UTR5 = self.data[:start]\n\t\t\t\tself.CDS = self.data[start:end+diffLen]\n\t\t\t\tself.UTR3 = self.data[end+diffLen:]\n\t\t\telse :\n\t\t\t\tself.UTR5 = self.data[:len(self.data)-(end-diffLen)]\n\t\t\t\tself.CDS = self.data[len(self.data)-(end-diffLen):len(self.data)-start]\n\t\t\t\tself.UTR3 = self.data[len(self.data)-start:]\n\t\telse :\n\t\t\tself.UTR5 = ''\n\t\t\tself.CDS = ''\n\t\t\tself.UTR3 = ''\n\n\t\tself.sequence = 
''.join(self.data)\n\n\tdef _load_bin_sequence(self) :\n\t\tself.bin_sequence = NucBinarySequence(self.sequence)\n\t\tself.bin_UTR5 =  NucBinarySequence(self.UTR5)\n\t\tself.bin_CDS =  NucBinarySequence(self.CDS)\n\t\tself.bin_UTR3 =  NucBinarySequence(self.UTR3)\n\t\t\n\tdef hasCDS(self) :\n\t\t\"\"\"Returns True if the exon has a CDS, False otherwise\"\"\"\n\t\tif self.CDS_start != None and self.CDS_end != None:\n\t\t\treturn True\n\t\treturn False\n\n\tdef getCDSLength(self) :\n\t\t\"\"\"Returns the length of the CDS sequence\"\"\"\n\t\treturn len(self.CDS)\n\n\tdef find(self, sequence) :\n\t\t\"\"\"Returns the position of the first occurrence of sequence\"\"\"\n\t\treturn self.bin_sequence.find(sequence)\n\n\tdef findAll(self, sequence):\n\t\t\"\"\"Returns a list of all positions where sequence was found\"\"\"\n\t\treturn self.bin_sequence.findAll(sequence)\n\n\tdef findInCDS(self, sequence) :\n\t\t\"\"\"Returns the position of the first occurrence of sequence in the CDS\"\"\"\n\t\treturn self.bin_CDS.find(sequence)\n\n\tdef findAllInCDS(self, sequence):\n\t\t\"\"\"Returns a list of all positions in the CDS where sequence was found\"\"\"\n\t\treturn self.bin_CDS.findAll(sequence)\n\n\tdef pluck(self) :\n\t\t\"\"\"Returns a plucked object. Plucks the exon off the tree, sets the value of self.transcript to str(self.transcript). 
This effectively disconnects the object and\n\t\tmakes it much lighter in case you'd like to pickle it\"\"\"\n\t\te = copy.copy(self)\n\t\te.transcript = str(self.transcript)\n\t\treturn e\n\n\tdef nextExon(self) :\n\t\t\"\"\"Returns the next exon of the transcript, or None if there is none\"\"\"\n\t\ttry :\n\t\t\treturn self.transcript.exons[self.number+1]\n\t\texcept IndexError :\n\t\t\treturn None\n\t\n\tdef previousExon(self) :\n\t\t\"\"\"Returns the previous exon of the transcript, or None if there is none\"\"\"\n\t\t\n\t\tif self.number == 0 :\n\t\t\treturn None\n\t\t\n\t\ttry :\n\t\t\treturn self.transcript.exons[self.number-1]\n\t\texcept IndexError :\n\t\t\treturn None\n\t\t\n\tdef __str__(self) :\n\t\treturn \"\"\"EXON, id %s, number: %s, (start, end): (%s, %s), cds: (%s, %s) > %s\"\"\" %( self.id, self.number, self.start, self.end, self.CDS_start, self.CDS_end, str(self.transcript))\n\n\tdef __len__(self) :\n\t\treturn len(self.sequence)\n"
  },
  {
    "path": "pyGeno/Gene.py",
"content": "import types\n\nimport configuration as conf\n\nfrom pyGenoObjectBases import *\nfrom SNP import SNP_INDEL\n\nfrom rabaDB.filters import RabaQuery\nimport rabaDB.fields as rf\n\nclass Gene_Raba(pyGenoRabaObject) :\n\t\"\"\"The wrapped Raba object that really holds the data\"\"\"\n\t\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\n\tid = rf.Primitive()\n\tname = rf.Primitive()\n\tstrand = rf.Primitive()\n\tbiotype = rf.Primitive()\n\t\n\tstart = rf.Primitive()\n\tend = rf.Primitive()\n\t\n\tgenome = rf.RabaObject('Genome_Raba')\n\tchromosome = rf.RabaObject('Chromosome_Raba')\n\n\tdef _curate(self) :\n\t\tif self.name != None :\n\t\t\tself.name = self.name.upper()\n\nclass Gene(pyGenoRabaObjectWrapper) :\n\t\"\"\"The wrapper for playing with genes\"\"\"\n\t\n\t_wrapped_class = Gene_Raba\n\n\tdef __init__(self, *args, **kwargs) :\n\t\tpyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)\n\n\tdef _makeLoadQuery(self, objectType, *args, **coolArgs) :\n\t\tif issubclass(objectType, SNP_INDEL) :\n\t\t\tf = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)\n\t\t\tcoolArgs['species'] = self.genome.species\n\t\t\tcoolArgs['chromosomeNumber'] = self.chromosome.number\n\t\t\tcoolArgs['start >='] = self.start\n\t\t\tcoolArgs['start <'] = self.end\n\t\t\n\t\t\tif len(args) > 0 and type(args[0]) is types.ListType :\n\t\t\t\tfor a in args[0] :\n\t\t\t\t\tif type(a) is types.DictType :\n\t\t\t\t\t\tf.addFilter(**a)\n\t\t\telse :\n\t\t\t\tf.addFilter(*args, **coolArgs)\n\n\t\t\treturn f\n\t\t\n\t\treturn pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)\n\t\n\tdef __str__(self) :\n\t\treturn \"Gene, name: %s, id: %s, strand: '%s' > %s\" %(self.name, self.id, self.strand, str(self.chromosome))\n"
  },
  {
    "path": "pyGeno/Genome.py",
    "content": "import types\nimport configuration as conf\nimport pyGeno.tools.UsefulFunctions as uf\nfrom pyGenoObjectBases import *\n\nfrom Chromosome import Chromosome\nfrom Gene import Gene\nfrom Transcript import Transcript\nfrom Protein import Protein\nfrom Exon import Exon\nimport SNPFiltering as SF\nfrom SNP import *\n\nimport rabaDB.fields as rf\n\ndef getGenomeList() :\n\t\"\"\"Return the names of all imported genomes\"\"\"\n\timport rabaDB.filters as rfilt\n\tf = rfilt.RabaQuery(Genome_Raba)\n\tnames = []\n\tfor g in f.iterRun() :\n\t\tnames.append(g.name)\n\treturn names\n\nclass Genome_Raba(pyGenoRabaObject) :\n\t\"\"\"The wrapped Raba object that really holds the data\"\"\"\n\t\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\t#_raba_not_a_singleton = True #you can have several instances of the same genome but they all share the same location in the database\n\n\tname = rf.Primitive()\n\tspecies = rf.Primitive()\n\n\tsource = rf.Primitive()\n\tpackageInfos = rf.Primitive()\n\n\tdef _curate(self) :\n\t\tself.species = self.species.lower()\n\n\tdef getSequencePath(self) :\n\t\treturn conf.getGenomeSequencePath(self.species, self.name)\n\n\tdef getReferenceSequencePath(self) :\n\t\treturn conf.getReferenceGenomeSequencePath(self.species)\n\n\tdef __len__(self) :\n\t\t\"\"\"Size of the genome in pb\"\"\"\n\t\tl = 0\n\t\tfor c in self.chromosomes :\n\t\t\tl +=  len(c)\n\n\t\treturn l\n\nclass Genome(pyGenoRabaObjectWrapper) :\t\n\t\"\"\"\n\tThis is the entry point to pyGeno::\n\t\t\n\t\tmyGeno = Genome(name = 'GRCh37.75', SNPs = ['RNA_S1', 'DNA_S1'], SNPFilter = MyFilter)\n\t\tfor prot in myGeno.get(Protein) :\n\t\t\tprint prot.sequence\n\t\n\t\"\"\"\n\t_wrapped_class = Genome_Raba\n\n\tdef __init__(self, SNPs = None, SNPFilter = None,  *args, **kwargs) :\n\t\t\n\t\tpyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)\n\n\t\tif type(SNPs) is types.StringType :\n\t\t\tself.SNPsSets = [SNPs]\n\t\telse :\n\t\t\tself.SNPsSets = SNPs\n\n\t\t# print 
\"pifpasdf\", self.SNPsSets\n\n\t\tif SNPFilter is None :\n\t\t\tself.SNPFilter = SF.DefaultSNPFilter()\n\t\telse :\n\t\t\tif issubclass(SNPFilter.__class__, SF.SNPFilter) :\n\t\t\t\tself.SNPFilter = SNPFilter\n\t\t\telse :\n\t\t\t\traise ValueError(\"The value of 'SNPFilter' is not an object deriving from a subclass of SNPFiltering.SNPFilter. Got: '%s'\" % SNPFilter)\n\n\t\tself.SNPTypes = {}\n\t\t\n\t\tif SNPs is not None :\n\t\t\tf = RabaQuery(SNPMaster, namespace = self._raba_namespace)\n\t\t\tfor se in self.SNPsSets :\n\t\t\t\tf.addFilter(setName = se, species = self.species)\n\n\t\t\tres = f.run()\n\t\t\tif res is None or len(res) < 1 :\n\t\t\t\traise ValueError(\"There's no set of SNPs that goes by the name of %s for species %s\" % (SNPs, self.species))\n\n\t\t\tfor s in res :\n\t\t\t\t# print s.setName, s.SNPType\n\t\t\t\tself.SNPTypes[s.setName] = s.SNPType\n\n\tdef _makeLoadQuery(self, objectType, *args, **coolArgs) :\n\t\tif issubclass(objectType, SNP_INDEL) :\n\t\t\t# conf.db.enableDebug(True)\n\t\t\tf = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)\n\t\t\tcoolArgs['species'] = self.species\n\n\t\t\tif len(args) > 0 and type(args[0]) is types.ListType :\n\t\t\t\tfor a in args[0] :\n\t\t\t\t\tif type(a) is types.DictType :\n\t\t\t\t\t\tf.addFilter(**a)\n\t\t\telse :\n\t\t\t\tf.addFilter(*args, **coolArgs)\n\n\t\t\treturn f\n\t\t\n\t\treturn pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)\n\t\n\tdef __str__(self) :\n\t\treturn \"Genome: %s/%s SNPs: %s\" %(self.species, self.name, self.SNPTypes)\n"
  },
  {
    "path": "pyGeno/Protein.py",
    "content": "import configuration as conf\n\nfrom pyGenoObjectBases import *\nfrom SNP import SNP_INDEL\n\nimport rabaDB.fields as rf\n\nfrom tools import UsefulFunctions as uf\nfrom tools.BinarySequence import AABinarySequence\nimport copy\n\nclass Protein_Raba(pyGenoRabaObject) :\n\t\"\"\"The wrapped Raba object that really holds the data\"\"\"\n\t\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\n\tid = rf.Primitive()\n\tname = rf.Primitive()\n\n\tgenome = rf.RabaObject('Genome_Raba')\n\tchromosome = rf.RabaObject('Chromosome_Raba')\n\tgene = rf.RabaObject('Gene_Raba')\n\ttranscript = rf.RabaObject('Transcript_Raba')\n\n\tdef _curate(self) :\n\t\tif self.name != None :\n\t\t\tself.name = self.name.upper()\n\nclass Protein(pyGenoRabaObjectWrapper) :\n\t\"\"\"The wrapper for playing with Proteins\"\"\"\n\t\n\t_wrapped_class = Protein_Raba\n\n\tdef __init__(self, *args, **kwargs) :\n\t\tpyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)\n\t\tself._load_sequencesTriggers = set([\"sequence\"])\n\n\tdef _makeLoadQuery(self, objectType, *args, **coolArgs) :\n\t\tif issubclass(objectType, SNP_INDEL) :\n\t\t\tf = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)\n\t\t\tcoolArgs['species'] = self.genome.species\n\t\t\tcoolArgs['chromosomeNumber'] = self.chromosome.number\n\t\t\tcoolArgs['start >='] = self.transcript.start\n\t\t\tcoolArgs['start <'] = self.transcript.end\n\t\t\n\t\t\tif len(args) > 0 and type(args[0]) is types.ListType :\n\t\t\t\tfor a in args[0] :\n\t\t\t\t\tif type(a) is types.DictType :\n\t\t\t\t\t\tf.addFilter(**a)\n\t\t\telse :\n\t\t\t\tf.addFilter(*args, **coolArgs)\n\n\t\t\treturn f\n\t\t\n\t\treturn pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)\n\t\n\tdef _load_sequences(self) :\n\t\tif self.chromosome.number != 'MT':\n\t\t\tself.sequence = uf.translateDNA(self.transcript.cDNA).rstrip('*')\n\t\telse:\n\t\t\tself.sequence = uf.translateDNA(self.transcript.cDNA, 
translTable_id='mt').rstrip('*')\n\n\t\n\tdef getSequence(self):\n\t\treturn self.sequence\n\n\tdef _load_bin_sequence(self) :\n\t\tself.bin_sequence = AABinarySequence(self.sequence)\n\n\tdef getDefaultSequence(self) :\n\t\t\"\"\"Returns a string version of the sequence where only the last allele of each polymorphism is shown\"\"\"\n\t\treturn self.bin_sequence.defaultSequence\n\n\tdef getPolymorphisms(self) :\n\t\t\"\"\"Returns a list of all polymorphisms contained in the protein\"\"\"\n\t\treturn self.bin_sequence.getPolymorphisms()\n\n\tdef find(self, sequence):\n\t\t\"\"\"Returns the position of the first occurrence of sequence, taking polymorphisms into account\"\"\"\n\t\treturn self.bin_sequence.find(sequence)\n\n\tdef findAll(self, sequence):\n\t\t\"\"\"Returns all positions of the occurrences of sequence, taking polymorphisms into account\"\"\"\n\t\treturn self.bin_sequence.findAll(sequence)\n\n\tdef findString(self, sequence) :\n\t\t\"\"\"Returns the first occurrence of sequence using a simple string search that doesn't care about polymorphisms\"\"\"\n\t\treturn self.sequence.find(sequence)\n\n\tdef findStringAll(self, sequence):\n\t\t\"\"\"Returns all occurrences of sequence using a simple string search that doesn't care about polymorphisms\"\"\"\n\t\treturn uf.findAll(self.sequence, sequence)\n\n\tdef __getitem__(self, i) :\n\t\treturn self.bin_sequence.getChar(i)\n\t\t\n\tdef __len__(self) :\n\t\treturn len(self.bin_sequence)\n\n\tdef __str__(self) :\n\t\treturn \"Protein, id: %s > %s\" %(self.id, str(self.transcript))\n"
  },
  {
    "path": "pyGeno/SNP.py",
"content": "import types\n\nimport configuration as conf\n\nfrom pyGenoObjectBases import *\nimport rabaDB.fields as rf\n\n# from tools import UsefulFunctions as uf\n\n\"\"\"General guidelines for SNP classes:\n----\n-All classes must inherit from SNP_INDEL\n-All class names must end with SNP\n-A SNP_INDEL must have at least chromosomeNumber, setName, species, start and ref filled in order to be inserted into sequences\n-Users can define an alias for the alt field (snp_indel alleles) to indicate the default field from which to extract alleles\n\"\"\"\n\ndef getSNPSetsList() :\n\t\"\"\"Return the names of all imported snp sets\"\"\"\n\timport rabaDB.filters as rfilt\n\tf = rfilt.RabaQuery(SNPMaster)\n\tnames = []\n\tfor g in f.iterRun() :\n\t\tnames.append(g.setName)\n\treturn names\n\nclass SNPMaster(Raba) :\n\t'This object keeps track of SNP sets and their types'\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\tspecies = rf.Primitive()\n\tSNPType = rf.Primitive()\n\tsetName = rf.Primitive()\n\t_raba_uniques = [('setName',)]\n\n\tdef _curate(self) :\n\t\tself.species = self.species.lower()\n\t\tself.setName = self.setName.lower()\n\nclass SNP_INDEL(Raba) :\n\t\"All SNPs should inherit from me. The name of the class must end with SNP\"\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\t_raba_abstract = True # not saved in db\n\n\tspecies = rf.Primitive()\n\tsetName = rf.Primitive()\n\tchromosomeNumber = rf.Primitive()\n\n\tstart = rf.Primitive()\n\tend = rf.Primitive()\n\t\n\tref = rf.Primitive()\n\t\n\t#every SNP_INDEL must have a field alt. This variable allows you to set an alias for it. 
Changing the alias does not impact the schema\n\taltAlias = 'alt'\n\t\n\tdef __init__(self, *args, **kwargs) :\n\t\tpass\n\n\tdef __getattribute__(self, k) :\n\t\tif k == 'alt' :\n\t\t\tcls = Raba.__getattribute__(self, '__class__')\n\t\t\treturn Raba.__getattribute__(self, cls.altAlias)\n\t\t\n\t\treturn Raba.__getattribute__(self, k)\n\n\tdef __setattr__(self, k, v) :\n\t\tif k == 'alt' :\n\t\t\tcls = Raba.__getattribute__(self, '__class__')\n\t\t\tRaba.__setattr__(self, cls.altAlias, v)\n\t\telse :\n\t\t\tRaba.__setattr__(self, k, v)\n\t\n\tdef _curate(self) :\n\t\tself.species = self.species.lower()\n\n\t@classmethod\n\tdef ensureGlobalIndex(cls, fields) :\n\t\tcls.ensureIndex(fields)\n\n\tdef __repr__(self) :\n\t\treturn \"%s> chr: %s, start: %s, end: %s, alt: %s, ref: %s\" %(self.__class__.__name__, self.chromosomeNumber, self.start, self.end, self.alt, self.ref)\n\t\nclass CasavaSNP(SNP_INDEL) :\n\t\"A SNP of Casava\"\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\t\n\talleles = rf.Primitive()\n\tbcalls_used = rf.Primitive()\n\tbcalls_filt = rf.Primitive()\n\tQSNP = rf.Primitive()\n\tQmax_gt = rf.Primitive()\n\tmax_gt_poly_site = rf.Primitive()\n\tQmax_gt_poly_site = rf.Primitive()\n\tA_used = rf.Primitive()\n\tC_used = rf.Primitive()\n\tG_used = rf.Primitive()\n\tT_used = rf.Primitive()\n\t\n\taltAlias = 'alleles'\n\nclass AgnosticSNP(SNP_INDEL) :\n\t\"\"\"This is a generic SNPs/Indels format that you can easily make from the result of any SNP caller. AgnosticSNP files are tab-delimited files such as:\n\n\tchromosomeNumber\tuniqueId  start\t      end\t   ref    alleles\tquality\t caller\n\tY\t\t\t\t\t   1 \t 2655643\t2655644\t\tT\t\tAG\t     30\t\t TopHat\n\tY\t\t\t\t\t   2 \t 2655645\t2655647\t\t-\t\tAG\t     28\t\t TopHat\n\tY\t\t\t\t\t   3 \t 2655648\t2655650\t\tTT\t\t-\t     10\t\t TopHat\n\n\tAll positions must be 0-based\n\tThe '-' indicates a deletion or an insertion. 
Column order has no importance.\n\t\"\"\"\n\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\t\n\talleles = rf.Primitive()\n\tquality = rf.Primitive()\n\tcaller = rf.Primitive()\n\tuniqueId = rf.Primitive() # polymorphism id\n\t\n\taltAlias = 'alleles'\n\n\tdef __repr__(self) :\n\t\treturn \"AgnosticSNP> start: %s, end: %s, quality: %s, caller %s, alt: %s, ref: %s\" %(self.start, self.end, self.quality, self.caller, self.alleles, self.ref)\n\nclass dbSNPSNP(SNP_INDEL) :\n\t\"This class is for SNPs from dbSNP. Feel free to uncomment the fields that you need\"\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\t\n\t# To add/remove a field, comment/uncomment it. Beware, adding or removing a field results in a significant overhead the first time you relaunch pyGeno. You may have to delete and reimport some SNP sets.\n\t\n\t#####RSPOS = rf.Primitive() #Chr position reported by dbSNP, already saved into the field start\n\tRS = rf.Primitive() #dbSNP ID (i.e. rs number)\n\tALT =  rf.Primitive()\n\tRV = rf.Primitive() #RS orientation is reversed\n\t#VP = rf.Primitive() #Variation Property.  Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf\n\t#GENEINFO = rf.Primitive() #Pairs each of gene symbol:gene id.  
The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)\n\tdbSNPBuildID = rf.Primitive() #First dbSNP Build for RS\n\t#SAO = rf.Primitive() #Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both\n\t#SSR = rf.Primitive() #Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other\n\t#WGT = rf.Primitive() #Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 or more\n\tVC = rf.Primitive() #Variation Class\n\tPM = rf.Primitive() #Variant is Precious(Clinical,Pubmed Cited)\n\t#TPA = rf.Primitive() #Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will give phenotype data)\n\t#PMC = rf.Primitive() #Links exist to PubMed Central article\n\t#S3D = rf.Primitive() #Has 3D structure - SNP3D table\n\t#SLO = rf.Primitive() #Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out\n\t#NSF = rf.Primitive() #Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44\n\t#NSM = rf.Primitive() #Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42\n\t#NSN = rf.Primitive() #Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41\n\t#REF = rf.Primitive() #Has reference A coding region variation where one allele in the set is identical to the reference sequence. FxnCode = 8\n\t#SYN = rf.Primitive() #Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3\n\t#U3 = rf.Primitive() #In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53\n\t#U5 = rf.Primitive() #In 5' UTR Location is in an untranslated region (UTR). 
FxnCode = 55\n\t#ASS = rf.Primitive() #In acceptor splice site FxnCode = 73\n\t#DSS = rf.Primitive() #In donor splice-site FxnCode = 75\n\t#INT = rf.Primitive() #In Intron FxnCode = 6\n\t#R3 = rf.Primitive() #In 3' gene region FxnCode = 13\n\t#R5 = rf.Primitive() #In 5' gene region FxnCode = 15\n\t#OTH = rf.Primitive() #Has other variant with exactly the same set of mapped positions on NCBI refernce assembly.\n\t#CFL = rf.Primitive() #Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies.\n\t#ASP = rf.Primitive() #Is Assembly specific. This is set if the variant only maps to one assembly\n\tMUT = rf.Primitive() #Is mutation (journal citation, explicit fact): a low frequency variation that is cited in journal and other reputable sources\n\tVLD = rf.Primitive() #Is Validated.  This bit is set if the variant has 2+ minor allele count based on frequency or genotype data.\n\tG5A = rf.Primitive() #>5% minor allele frequency in each and all populations\n\tG5 = rf.Primitive() #>5% minor allele frequency in 1+ populations\n\t#HD = rf.Primitive() #Marker is on high density genotyping kit (50K density or greater).  The variant may have phenotype associations present in dbGaP.\n\t#GNO = rf.Primitive() #Genotypes available. The variant has individual genotype (in SubInd table).\n\tKGValidated = rf.Primitive() #1000 Genome validated\n\t#KGPhase1 = rf.Primitive() #1000 Genome phase 1 (incl. 
June Interim phase 1)\n\t#KGPilot123 = rf.Primitive() #1000 Genome discovery all pilots 2010(1,2,3)\n\t#KGPROD = rf.Primitive() #Has 1000 Genome submission\n\tOTHERKG = rf.Primitive() #non-1000 Genome submission\n\t#PH3 = rf.Primitive() #HAP_MAP Phase 3 genotyped: filtered, non-redundant\n\t#CDA = rf.Primitive() #Variation is interrogated in a clinical diagnostic assay\n\t#LSD = rf.Primitive() #Submitted from a locus-specific database\n\t#MTP = rf.Primitive() #Microattribution/third-party annotation(TPA:GWAS,PAGE)\n\t#OM = rf.Primitive() #Has OMIM/OMIA\n\t#NOC = rf.Primitive() #Contig allele not present in variant allele list. The reference sequence allele at the mapped position is not present in the variant allele list, adjusted for orientation.\n\t#WTD = rf.Primitive() #Is Withdrawn by submitter If one member ss is withdrawn by submitter, then this bit is set.  If all member ss' are withdrawn, then the rs is deleted to SNPHistory\n\t#NOV = rf.Primitive() #Rs cluster has non-overlapping allele sets. True when rs set has more than 2 alleles from different submissions and these sets share no alleles in common.\n\t#CAF = rf.Primitive() #An ordered, comma delimited list of allele frequencies based on 1000Genomes, starting with the reference allele followed by alternate alleles as ordered in the ALT column. Where a 1000Genomes alternate allele is not in the dbSNPs alternate allele set, the allele is added to the ALT column.  The minor allele is the second largest value in the list, and was previuosly reported in VCF as the GMAF.  This is the GMAF reported on the RefSNP and EntrezSNP pages and VariationReporter\n\tCOMMON = rf.Primitive() #RS is a common SNP.  A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency.\n\t#CLNHGVS = rf.Primitive() #Variant names from HGVS.    
The order of these variants corresponds to the order of the info in the other clinical  INFO tags.\n\t#CLNALLE = rf.Primitive() #Variant alleles from REF or ALT columns.  0 is REF, 1 is the first ALT allele, etc.  This is used to match alleles with other corresponding clinical (CLN) INFO tags.  A value of -1 indicates that no allele was found to match a corresponding HGVS allele name.\n\t#CLNSRC = rf.Primitive() #Variant Clinical Chanels\n\t#CLNORIGIN = rf.Primitive() #Allele Origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other\n\t#CLNSRCID = rf.Primitive() #Variant Clinical Channel IDs\n\t#CLNSIG = rf.Primitive() #Variant Clinical Significance, 0 - unknown, 1 - untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other\n\t#CLNDSDB = rf.Primitive() #Variant disease database name\n\t#CLNDSDBID = rf.Primitive() #Variant disease database ID\n\t#CLNDBN = rf.Primitive() #Variant disease name\n\t#CLNACC = rf.Primitive() #Variant Accession and Versions\n\t\n\t#FILTER_NC = rf.Primitive() #Inconsistent Genotype Submission For At Least One Sample\n\t\n\taltAlias = 'ALT'\n\nclass TopHatSNP(SNP_INDEL) :\n\t\"A SNP from Top Hat, not implemented\"\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\tpass\n"
  },
  {
    "path": "pyGeno/SNPFiltering.py",
    "content": "import types\n\nimport configuration as conf\n\nfrom tools import UsefulFunctions as uf\n\nclass Sequence_modifiers(object) :\n\t\"\"\"Abstract class. All sequence modifiers must inherit from me\"\"\"\n\tdef __init__(self, sources = {}) :\n\t\tself.sources = sources\n\n\tdef addSource(self, name, snp) :\n\t\t\"Optional, you can keep a dict that records the polymorphisms that were mixed together to make self. They are stored into self.sources\"\n\t\tself.sources[name] = snp\n\nclass SequenceSNP(Sequence_modifiers) :\n\t\"\"\"Represents a SNP to be applied to the sequence\"\"\"\n\tdef __init__(self, alleles, sources = {}) :\n\t\tSequence_modifiers.__init__(self, sources)\n\t\tif type(alleles) is types.ListType :\n\t\t\tself.alleles = uf.encodePolymorphicNucleotide(''.join(alleles))\n\t\telse :\n\t\t\tself.alleles = uf.encodePolymorphicNucleotide(alleles)\n\t\nclass SequenceInsert(Sequence_modifiers) :\n\t\"\"\"Represents an Insertion to be applied to the sequence\"\"\"\n\t\n\tdef __init__(self, bases, sources = {}, ref = '-') :\n\t\tSequence_modifiers.__init__(self, sources)\n\t\tself.bases = bases\n\t\tself.offset = 0\n\n\t\t# Allows formats like C/CCTGGAA(dbSNP) or CCT/CCTGGAA(samtools)\n\t\tif ref != '-':\n\t\t\tif ref == bases[:len(ref)]:\n\t\t\t\tself.offset = len(ref) \n\t\t\t\tself.bases = self.bases[self.offset:]\n\t\t\t\t#-1 because if the insertion is after the last nucleotide we would go out of the table\n\t\t\t\tself.offset -= 1\n\t\t\telse:\n\t\t\t\traise NotImplementedError(\"This format of Insertion is not accepted. 
Please change your format, or implement your format in pyGeno.\")\n\n\nclass SequenceDel(Sequence_modifiers) :\n\t\"\"\"Represents a Deletion to be applied to the sequence\"\"\"\n\tdef __init__(self, length, sources = {}, ref = None, alt = '-') :\n\t\tSequence_modifiers.__init__(self, sources)\n\t\tself.length = length\n\t\tself.offset = 0\n\n\t\t# Allows formats like CCTGGAA/C(dbSNP) or CCTGGAA/CCT(samtools)\n\t\tif alt != '-':\n\t\t\tif ref is not None:\n\t\t\t\tif alt == ref[:len(alt)]:\n\t\t\t\t\tself.offset = len(alt)\n\t\t\t\t\tself.length = self.length - len(alt)\n\t\t\t\telse:\n\t\t\t\t\traise NotImplementedError(\"This format of Deletion is not accepted. Please change your format, or implement your format in pyGeno.\")\n\t\t\telse:\n\t\t\t\traise Exception(\"You need to provide a ref sequence in your call of SequenceDel, or implement your format in pyGeno.\")\n\n\n\nclass SNPFilter(object) :\n\t\"\"\"Abstract class. All filters must inherit from me\"\"\"\n\t\n\tdef __init__(self) :\n\t\tpass\n\n\tdef filter(self, chromosome, **kwargs) :\n\t\traise NotImplementedError(\"Must be implemented in child\")\n\nclass DefaultSNPFilter(SNPFilter) :\n\t\"\"\"\n\tDefault filtering object, does not filter anything. Doesn't apply insertions or deletions.\n\tThis is also a template that you can use for your own filters. A prototype for a custom filter might be::\n\t\n\t\tclass MyFilter(SNPFilter) :\n\t\t\tdef __init__(self, thres) :\n\t\t\t\tself.thres = thres\n\t\t\t\n\t\t\tdef filter(self, chromosome, SNP_Set1 = None, SNP_Set2 = None) :\n\t\t\t\tif SNP_Set1.alt is not None and (SNP_Set1.alt == SNP_Set2.alt) and SNP_Set1.Qmax_gt > self.thres :\n\t\t\t\t\treturn SequenceSNP(SNP_Set1.alt)\n\t\t\t\treturn None\n\t\t\t\n\tWhere SNP_Set1 and SNP_Set2 are the actual names of the snp sets supplied to the genome. 
In the context of the function\n\tthey represent single polymorphisms, or lists of polymorphisms, derived from those sets that occur at the same position.\n\n\tWhat goes on inside the function is entirely up to you, but in the end, it must return one of the following:\n\n\t\t* SequenceSNP\n\n\t\t* SequenceInsert\n\n\t\t* SequenceDel\n\n\t\t* None\n\n\t\t\"\"\"\n\n\tdef __init__(self) :\n\t\tSNPFilter.__init__(self)\n\n\tdef filter(self, chromosome, **kwargs) :\n\t\t\"\"\"The default filter mixes all SNPs together and ignores insertions and deletions.\"\"\"\n\t\tdef appendAllele(alleles, sources, snp) :\n\t\t\tpos = snp.start\n\t\t\tif snp.alt[0] == '-' :\n\t\t\t\tpass\n\t\t\t\t# print warn % ('DELETION', snpSet, snp.start, snp.chromosomeNumber)\n\t\t\telif snp.ref[0] == '-' :\n\t\t\t\tpass\n\t\t\t\t# print warn % ('INSERTION', snpSet, snp.start, snp.chromosomeNumber)\n\t\t\telse :\n\t\t\t\tsources[snpSet] = snp\n\t\t\t\talleles.append(snp.alt) #if not an indel, append the polymorphism\n\t\t\t\n\t\t\trefAllele = chromosome.refSequence[pos]\n\t\t\talleles.append(refAllele)\n\t\t\tsources['ref'] = refAllele\n\t\t\treturn alleles, sources\n\t\t\t\t\n\t\twarn = 'Warning: the default snp filter ignores indels. IGNORED %s of SNP set: %s at pos: %s of chromosome: %s'\n\t\tsources = {}\n\t\talleles = []\n\t\tfor snpSet, data in kwargs.iteritems() :\n\t\t\tif type(data) is list :\n\t\t\t\tfor snp in data :\n\t\t\t\t\talleles, sources = appendAllele(alleles, sources, snp)\n\t\t\telse :\n\t\t\t\talleles, sources = appendAllele(alleles, sources, data)\n\n\t\t#the reference allele is appended to the lot inside appendAllele\n\n\t\t#optionally, we keep a record of the polymorphisms that were used during the process\n\t\treturn SequenceSNP(alleles, sources = sources)\n"
  },
  {
    "path": "pyGeno/Transcript.py",
    "content": "import types\n\nimport configuration as conf\n\nfrom pyGenoObjectBases import *\n\nimport rabaDB.fields as rf\n\nfrom tools import UsefulFunctions as uf\nfrom Exon import *\nfrom SNP import SNP_INDEL\n\nfrom tools.BinarySequence import NucBinarySequence\n\n\nclass Transcript_Raba(pyGenoRabaObject) :\n\t\"\"\"The wrapped Raba object that really holds the data\"\"\"\n\t\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\n\tid = rf.Primitive()\n\tname = rf.Primitive()\n\tlength = rf.Primitive()\n\tstart = rf.Primitive()\n\tend = rf.Primitive()\n\tcoding = rf.Primitive()\n\tbiotype = rf.Primitive()\n\tselenocysteine = rf.RList()\n\t\n\tgenome = rf.RabaObject('Genome_Raba')\n\tchromosome = rf.RabaObject('Chromosome_Raba')\n\tgene = rf.RabaObject('Gene_Raba')\n\tprotein = rf.RabaObject('Protein_Raba')\n\texons = rf.Relation('Exon_Raba')\n\t\n\tdef _curate(self) :\n\t\tif self.name != None :\n\t\t\tself.name = self.name.upper()\n\t\t\n\t\tself.length = abs(self.end - self.start)\n\t\thave_CDS_start = False\n\t\thave_CDS_end = False\n\t\tfor exon in self.exons :\n\t\t\tif exon.CDS_start is not None :\n\t\t\t\thave_CDS_start = True\n\t\t\tif exon.CDS_end is not None :\n\t\t\t\thave_CDS_end = True\n\n\t\tif have_CDS_start and have_CDS_end :\n\t\t\tself.coding = True\n\t\telse :\n\t\t\tself.coding = False\n\nclass Transcript(pyGenoRabaObjectWrapper) :\n\t\"\"\"The wrapper for playing with Transcripts\"\"\"\n\t\n\t_wrapped_class = Transcript_Raba\n\n\tdef __init__(self, *args, **kwargs) :\n\t\tpyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)\n\t\tself.exons = RLWrapper(self, Exon, self.wrapped_object.exons)\n\t\tself._load_sequencesTriggers = set([\"UTR5\", \"UTR3\", \"cDNA\", \"sequence\", \"data\"])\n\t\tself.exonsDict = {}\n\t\n\tdef _makeLoadQuery(self, objectType, *args, **coolArgs) :\n\t\tif issubclass(objectType, SNP_INDEL) :\n\t\t\t# conf.db.enableDebug(True)\n\t\t\tf = RabaQuery(objectType, namespace = 
self._wrapped_class._raba_namespace)\n\t\t\tcoolArgs['species'] = self.genome.species\n\t\t\tcoolArgs['chromosomeNumber'] = self.chromosome.number\n\t\t\tcoolArgs['start >='] = self.start\n\t\t\tcoolArgs['start <'] = self.end\n\t\t\n\t\t\tif len(args) > 0 and type(args[0]) is types.ListType :\n\t\t\t\tfor a in args[0] :\n\t\t\t\t\tif type(a) is types.DictType :\n\t\t\t\t\t\tf.addFilter(**a)\n\t\t\telse :\n\t\t\t\tf.addFilter(*args, **coolArgs)\n\n\t\t\treturn f\n\t\t\n\t\treturn pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)\n\t\t\n\tdef _load_data(self) :\n\t\tdef getV(k) :\n\t\t\treturn pyGenoRabaObjectWrapper.__getattribute__(self, k)\n\n\t\tdef setV(k, v) :\n\t\t\treturn pyGenoRabaObjectWrapper.__setattr__(self, k, v)\n\n\t\tself.data = []\n\t\tcDNA = []\n\t\tUTR5 = []\n\t\tUTR3 = []\n\t\texons = []\n\t\tprime5 = True\n\t\tfor ee in self.wrapped_object.exons :\n\t\t\te = pyGenoRabaObjectWrapper_metaclass._wrappers[Exon_Raba](wrapped_object_and_bag = (ee, getV('bagKey')))\n\t\t\tself.exonsDict[(e.start, e.end)] = e\n\t\t\texons.append(e)\n\t\t\tself.data.extend(e.data)\n\t\t\t\n\t\t\tif e.hasCDS() :\n\t\t\t\tUTR5.append(''.join(e.UTR5))\n\n\t\t\t\tif self.selenocysteine is not None:\n\t\t\t\t\tfor position in self.selenocysteine:\n\t\t\t\t\t\tif e.CDS_start <= position <= e.CDS_end:\n\n\t\t\t\t\t\t\tif e.strand == '+':\n\t\t\t\t\t\t\t\tadjusted_position = position - e.CDS_start\n\t\t\t\t\t\t\telse:\n\t\t\t\t\t\t\t\tadjusted_position = e.CDS_end - position - 3\n\n\t\t\t\t\t\t\tif e.CDS[adjusted_position] == 'T':\n\t\t\t\t\t\t\t\te.CDS = list(e.CDS)\n\t\t\t\t\t\t\t\te.CDS[adjusted_position] = '!'\t\t\t\n\t\t\t\t\n\t\t\t\tif len(cDNA) == 0 and e.frame != 0 :\n\t\t\t\t\te.CDS = e.CDS[e.frame:]\n\t\t\t\t\t\n\t\t\t\t\tif e.strand == '+':\n\t\t\t\t\t\te.CDS_start += e.frame\n\t\t\t\t\telse:\n\t\t\t\t\t\te.CDS_end -= e.frame\n\t\t\t\t\n\t\t\t\tif 
len(e.CDS):\n\t\t\t\t\tcDNA.append(''.join(e.CDS))\n\t\t\t\tUTR3.append(''.join(e.UTR3))\n\t\t\t\tprime5 = False\n\t\t\telse :\n\t\t\t\tif prime5 :\n\t\t\t\t\tUTR5.append(''.join(e.data))\n\t\t\t\telse :\n\t\t\t\t\tUTR3.append(''.join(e.data))\n\t\t\n\t\tsequence = ''.join(self.data)\n\t\tcDNA = ''.join(cDNA)\n\t\tUTR5 = ''.join(UTR5)\n\t\tUTR3 = ''.join(UTR3)\n\t\tsetV('exons', exons)\n\t\tsetV('sequence', sequence)\n\t\tsetV('cDNA', cDNA)\n\t\tsetV('UTR5', UTR5)\n\t\tsetV('UTR3', UTR3)\n\t\t\n\t\tif len(cDNA) > 0 and len(cDNA) % 3 != 0 :\n\t\t\tsetV('flags', {'DUBIOUS' : True, 'cDNA_LEN_NOT_MULT_3': True})\n\t\telse :\n\t\t\tsetV('flags', {'DUBIOUS' : False, 'cDNA_LEN_NOT_MULT_3': False})\n\n\tdef _load_bin_sequence(self) :\n\t\tself.bin_sequence = NucBinarySequence(self.sequence)\n\t\tself.bin_UTR5 =  NucBinarySequence(self.UTR5)\n\t\tself.bin_cDNA =  NucBinarySequence(self.cDNA)\n\t\tself.bin_UTR3 =  NucBinarySequence(self.UTR3)\n\n\tdef getNucleotideCodon(self, cdnaX1) :\n\t\t\"\"\"Returns the entire codon of the nucleotide at pos cdnaX1 in the cDNA, and the position of that nucleotide in the codon\"\"\"\n\t\treturn uf.getNucleotideCodon(self.cDNA, cdnaX1)\n\n\tdef getCodon(self, i) :\n\t\t\"\"\"returns the ith codon\"\"\"\n\t\treturn self.getNucleotideCodon(i*3)[0]\n\n\tdef iterCodons(self) :\n\t\t\"\"\"iterates through the codons\"\"\"\n\t\tfor i in range(len(self.cDNA)/3) :\n\t\t\tyield self.getCodon(i)\n\n\tdef find(self, sequence) :\n\t\t\"\"\"returns the position of the first occurrence of sequence\"\"\"\n\t\treturn self.bin_sequence.find(sequence)\n\n\tdef findAll(self, sequence):\n\t\t\"\"\"Returns a list of all positions where sequence was found\"\"\"\n\t\treturn self.bin_sequence.findAll(sequence)\n\n\tdef findIncDNA(self, sequence) :\n\t\t\"\"\"returns the position of the first occurrence of sequence in the cDNA\"\"\"\n\t\treturn self.bin_cDNA.find(sequence)\n\n\tdef findAllIncDNA(self, sequence) :\n\t\t\"\"\"Returns a list of all positions where sequence was found 
in the cDNA\"\"\"\n\t\treturn self.bin_cDNA.findAll(sequence)\n\n\tdef getcDNALength(self) :\n\t\t\"\"\"returns the length of the cDNA\"\"\"\n\t\treturn len(self.cDNA)\n\n\tdef findInUTR5(self, sequence) :\n\t\t\"\"\"returns the position of the first occurrence of sequence in the 5'UTR\"\"\"\n\t\treturn self.bin_UTR5.find(sequence)\n\n\tdef findAllInUTR5(self, sequence) :\n\t\t\"\"\"Returns a list of all positions where sequence was found in the 5'UTR\"\"\"\n\t\treturn self.bin_UTR5.findAll(sequence)\n\n\tdef getUTR5Length(self) :\n\t\t\"\"\"returns the length of the 5'UTR\"\"\"\n\t\treturn len(self.bin_UTR5)\n\n\tdef findInUTR3(self, sequence) :\n\t\t\"\"\"returns the position of the first occurrence of sequence in the 3'UTR\"\"\"\n\t\treturn self.bin_UTR3.find(sequence)\n\n\tdef findAllInUTR3(self, sequence) :\n\t\t\"\"\"Returns a list of all positions where sequence was found in the 3'UTR\"\"\"\n\t\treturn self.bin_UTR3.findAll(sequence)\n\n\tdef getUTR3Length(self) :\n\t\t\"\"\"returns the length of the 3'UTR\"\"\"\n\t\treturn len(self.bin_UTR3)\n\n\tdef getNbCodons(self) :\n\t\t\"\"\"returns the number of codons in the transcript\"\"\"\n\t\treturn len(self.cDNA)/3\n\t\n\tdef __getattribute__(self, name) :\n\t\treturn pyGenoRabaObjectWrapper.__getattribute__(self, name)\n\n\tdef __getitem__(self, i) :\n\t\treturn self.sequence[i]\n\n\tdef __len__(self) :\n\t\treturn len(self.sequence)\n\n\tdef __str__(self) :\n\t\treturn \"\"\"Transcript, id: %s, name: %s > %s\"\"\" %(self.id, self.name, str(self.gene))\n"
  },
  {
    "path": "pyGeno/__init__.py",
    "content": "__all__ = ['Genome', 'Chromosome', 'Gene', 'Transcript', 'Exon', 'Protein', 'SNP']\n\nfrom configuration import pyGeno_init\npyGeno_init()\n"
  },
  {
    "path": "pyGeno/bootstrap.py",
    "content": "import pyGeno.importation.Genomes as PG\nimport pyGeno.importation.SNPs as PS\nfrom pyGeno.tools.io import printf\nimport os, tempfile, urllib, urllib2, json\nimport pyGeno.configuration as conf\n\nthis_dir, this_filename = os.path.split(__file__)\n\n\n\ndef listRemoteDatawraps(location = conf.pyGeno_REMOTE_LOCATION) :\n\t\"\"\"Lists all the datawraps available from a remote location.\"\"\"\n\tloc = location + \"/datawraps.json\"\n\tresponse = urllib2.urlopen(loc)\n\tjs = json.loads(response.read())\n\t\n\treturn js\n\ndef printRemoteDatawraps(location = conf.pyGeno_REMOTE_LOCATION) :\n\t\"\"\"\n\t\tPrints all available datawraps from a remote location. The location must have a datawraps.json in the following format::\n\n\t\t\t{\n\t\t\t\"Ordered\": {\n\t\t\t\t\"Reference genomes\": {\n\t\t\t\t\t\"Human\" :\t[\"GRCh37.75\", \"GRCh38.78\"],\n\t\t\t\t\t\"Mouse\" : [\"GRCm38.78\"],\n\t\t\t\t},\n\t\t\t\t\"SNPs\":{\n\t\t\t\t\t}\n\t\t\t},\n\t\t\t\"Flat\":{\n\t\t\t\t\"Reference genomes\": {\n\t\t\t\t\t\"GRCh37.75\": \"Human.GRCh37.75.tar.gz\",\n\t\t\t\t\t\"GRCh38.78\": \"Human.GRCh38.78.tar.gz\",\n\t\t\t\t\t\"GRCm38.78\": \"Mouse.GRCm38.78.tar.gz\"\n\t\t\t\t},\n\t\t\t\t\"SNPs\":{\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\t\n\t\"\"\"\n\t\n\tl = listRemoteDatawraps(location)\n\tprintf(\"Available datawraps for bootstrapping\\n\")\n\tprint json.dumps(l[\"Ordered\"], sort_keys=True, indent=4, separators=(',', ': '))\n\ndef _DW(name, url) :\n\tpackageDir = tempfile.mkdtemp(prefix = \"pyGeno_remote_\")\n\t\n\tprintf(\"~~~:>\\n\\tDownloading datawrap: %s...\" % name)\n\tfinalFile = os.path.normpath('%s/%s' %(packageDir, name))\n\turllib.urlretrieve(url, finalFile)\n\tprintf('\\tdone.\\n~~~:>')\n\treturn finalFile\n\ndef importRemoteGenome(name, batchSize = 100) :\n\t\"\"\"Import a genome available from http://pygeno.iric.ca (might work).\"\"\"\n\ttry :\n\t\tdw = listRemoteDatawraps()[\"Flat\"][\"Reference genomes\"][name]\n\texcept KeyError :\n\t\traise 
KeyError(\"There's no remote genome datawrap by the name of: '%s'\" % name)\n\n\tfinalFile = _DW(name, dw[\"url\"])\n\tPG.importGenome(finalFile, batchSize)\n\ndef importRemoteSNPs(name) :\n\t\"\"\"Import a SNP set available from http://pygeno.iric.ca (might work).\"\"\"\n\ttry :\n\t\tdw = listRemoteDatawraps()[\"Flat\"][\"SNPs\"][name]\n\texcept KeyError :\n\t\traise KeyError(\"There's no remote SNP datawrap by the name of: '%s'\" % name)\n\n\tfinalFile = _DW(name, dw[\"url\"])\n\tPS.importSNPs(finalFile)\n\ndef listDatawraps() :\n\t\"\"\"Lists all the datawraps pyGeno comes with\"\"\"\n\tl = {\"Genomes\" : [], \"SNPs\" : []}\n\tfor f in os.listdir(os.path.join(this_dir, \"bootstrap_data/genomes\")) :\n\t\tif f.find(\".tar.gz\") > -1 :\n\t\t\tl[\"Genomes\"].append(f)\n\t\n\tfor f in os.listdir(os.path.join(this_dir, \"bootstrap_data/SNPs\")) :\n\t\tif f.find(\".tar.gz\") > -1 :\n\t\t\tl[\"SNPs\"].append(f)\n\n\treturn l\n\ndef printDatawraps() :\n\t\"\"\"Prints all available datawraps for bootstrapping\"\"\"\n\tl = listDatawraps()\n\tprintf(\"Available datawraps for bootstrapping\\n\")\n\tfor k, v in l.iteritems() :\n\t\tprintf(k)\n\t\tprintf(\"~\"*len(k) + \"|\")\n\t\tfor vv in v :\n\t\t\tprintf(\" \"*len(k) + \"|\" + \"~~~:> \" + vv)\n\t\tprintf('\\n')\n\ndef importGenome(name, batchSize = 100) :\n\t\"\"\"Import a genome shipped with pyGeno. Most of the datawraps only contain URLs towards data provided by third parties.\"\"\"\n\tpath = os.path.join(this_dir, \"bootstrap_data\", \"genomes/\" + name)\n\tPG.importGenome(path, batchSize)\n\ndef importSNPs(name) :\n\t\"\"\"Import a SNP set shipped with pyGeno. Most of the datawraps only contain URLs towards data provided by third parties.\"\"\"\n\tpath = os.path.join(this_dir, \"bootstrap_data\", \"SNPs/\" + name)\n\tPS.importSNPs(path)"
  },
  {
    "path": "pyGeno/bootstrap_data/SNPs/Human_agnostic.dummySRY_indels/manifest.ini",
    "content": "[package_infos]\ndescription = For testing purposes. All polymorphisms at the same position\nmaintainer = Tariq Daouda\nmaintainer_contact = tariq.daouda@umontreal.ca\nversion = 1\n\n[set_infos]\nspecies = human\nname = dummySRY_AGN_indels\ntype = Agnostic\nsource = my place at IRIC\n\n[snps]\nfilename = snps.txt\n"
  },
  {
    "path": "pyGeno/bootstrap_data/SNPs/Human_agnostic.dummySRY_indels/snps.txt",
    "content": "chromosomeNumber\tuniqueId\tstart\tend\tref\talleles\tquality\tcaller\nY\t1\t2655643\t2655646\tT\tAG\t30\ttest\nY\t2\t2655643\t2655647\t-\tAG\t30\ttest\nY\t3\t2655643\t2655650\tTT\t-\t30\ttest\n"
  },
  {
    "path": "pyGeno/bootstrap_data/SNPs/__init__.py",
    "content": ""
  },
  {
    "path": "pyGeno/bootstrap_data/__init__.py",
    "content": ""
  },
  {
    "path": "pyGeno/bootstrap_data/genomes/__init__.py",
    "content": ""
  },
  {
    "path": "pyGeno/configuration.py",
    "content": "import sys, os, time\nfrom ConfigParser import SafeConfigParser\nimport rabaDB.rabaSetup\nimport rabaDB.Raba\n\nclass PythonVersionError(Exception) :\n\tpass\n\npyGeno_FACE = \"~-~-:>\"\npyGeno_BRANCH = \"V2\"\n\npyGeno_VERSION_NAME = 'Lean Viper!'\npyGeno_VERSION_RELEASE_LEVEL = 'Release'\npyGeno_VERSION_NUMBER = 14.09\npyGeno_VERSION_BUILD_TIME = time.ctime(os.path.getmtime(__file__))\n\npyGeno_RABA_NAMESPACE = 'pyGenoRaba'\n\npyGeno_SETTINGS_DIR = os.path.normpath(os.path.expanduser('~/.pyGeno/'))\npyGeno_SETTINGS_PATH = None\npyGeno_RABA_DBFILE = None\npyGeno_DATA_PATH = None\npyGeno_REMOTE_LOCATION = 'http://bioinfo.iric.ca/~daoudat/pyGeno_datawraps'\n\ndb = None #proxy for the raba database\ndbConf = None #proxy for the raba database configuration\n\ndef version() :\n\t\"\"\"returns a tuple describing pyGeno's current version\"\"\"\n\treturn (pyGeno_FACE, pyGeno_BRANCH, pyGeno_VERSION_NAME, pyGeno_VERSION_RELEASE_LEVEL, pyGeno_VERSION_NUMBER, pyGeno_VERSION_BUILD_TIME )\n\ndef prettyVersion() :\n\t\"\"\"returns pyGeno's current version in a pretty human readable way\"\"\"\n\treturn \"pyGeno %s Branch: %s, Name: %s, Release Level: %s, Version: %s, Build time: %s\" % version()\n\ndef checkPythonVersion() :\n\t\"\"\"pyGeno needs python 2.7+\"\"\"\n\t\n\tif sys.version_info < (2, 7) :\n\t\treturn False\n\treturn True\n\ndef getGenomeSequencePath(specie, name) :\n\treturn os.path.normpath(pyGeno_DATA_PATH+'/%s/%s' % (specie.lower(), name))\n\ndef createDefaultConfigFile() :\n\t\"\"\"Creates a default configuration file\"\"\"\n\ts = \"[pyGeno_config]\\nsettings_dir=%s\\nremote_location=%s\" % (pyGeno_SETTINGS_DIR, pyGeno_REMOTE_LOCATION)\n\tf = open('%s/config.ini' % pyGeno_SETTINGS_DIR, 'w')\n\tf.write(s)\n\tf.close()\n\ndef getSettingsPath() :\n\t\"\"\"Returns the path where the settings are stored\"\"\"\n\tparser = SafeConfigParser()\n\ttry 
:\n\t\tparser.read(os.path.normpath(pyGeno_SETTINGS_DIR+'/config.ini'))\n\t\treturn parser.get('pyGeno_config', 'settings_dir')\n\texcept :\n\t\tcreateDefaultConfigFile()\n\t\treturn getSettingsPath()\n\ndef removeFromDBRegistery(obj) :\n\t\"\"\"rabaDB keeps a record of loaded objects to ensure consistency between different queries.\n\tThis function removes an object from the registry\"\"\"\n\trabaDB.Raba.removeFromRegistery(obj)\n\ndef freeDBRegistery() :\n\t\"\"\"rabaDB keeps a record of loaded objects to ensure consistency between different queries. This function empties the registry\"\"\"\n\trabaDB.Raba.freeRegistery()\n\ndef reload() :\n\t\"\"\"reinitializes pyGeno\"\"\"\n\tpyGeno_init()\n\ndef pyGeno_init() :\n\t\"\"\"This function is automatically called at launch\"\"\"\n\t\n\tglobal db, dbConf\n\t\n\tglobal pyGeno_SETTINGS_PATH\n\tglobal pyGeno_RABA_DBFILE\n\tglobal pyGeno_DATA_PATH\n\t\n\tif not checkPythonVersion() :\n\t\traise PythonVersionError(\"==> FATAL: pyGeno only works with python 2.7 and above, please upgrade your python version\")\n\n\tif not os.path.exists(pyGeno_SETTINGS_DIR) :\n\t\tos.makedirs(pyGeno_SETTINGS_DIR)\n\t\n\tpyGeno_SETTINGS_PATH = getSettingsPath()\n\tpyGeno_RABA_DBFILE = os.path.normpath( os.path.join(pyGeno_SETTINGS_PATH, \"pyGenoRaba.db\") )\n\tpyGeno_DATA_PATH = os.path.normpath( os.path.join(pyGeno_SETTINGS_PATH, \"data\") )\n\t\n\tif not os.path.exists(pyGeno_SETTINGS_PATH) :\n\t\tos.makedirs(pyGeno_SETTINGS_PATH)\n\n\tif not os.path.exists(pyGeno_DATA_PATH) :\n\t\tos.makedirs(pyGeno_DATA_PATH)\n\n\t#launching the db\n\trabaDB.rabaSetup.RabaConfiguration(pyGeno_RABA_NAMESPACE, pyGeno_RABA_DBFILE)\n\tdb = rabaDB.rabaSetup.RabaConnection(pyGeno_RABA_NAMESPACE)\n\tdbConf = rabaDB.rabaSetup.RabaConfiguration(pyGeno_RABA_NAMESPACE)"
  },
  {
    "path": "pyGeno/doc/Makefile",
    "content": "# Makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS    =\nSPHINXBUILD   = sphinx-build\nPAPER         =\nBUILDDIR      = build\n\n# User-friendly check for sphinx-build\nifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)\n$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)\nendif\n\n# Internal variables.\nPAPEROPT_a4     = -D latex_paper_size=a4\nPAPEROPT_letter = -D latex_paper_size=letter\nALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source\n# the i18n builder cannot share the environment and doctrees with the others\nI18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source\n\n.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest coverage gettext\n\nhelp:\n\t@echo \"Please use \\`make <target>' where <target> is one of\"\n\t@echo \"  html       to make standalone HTML files\"\n\t@echo \"  dirhtml    to make HTML files named index.html in directories\"\n\t@echo \"  singlehtml to make a single large HTML file\"\n\t@echo \"  pickle     to make pickle files\"\n\t@echo \"  json       to make JSON files\"\n\t@echo \"  htmlhelp   to make HTML files and a HTML help project\"\n\t@echo \"  qthelp     to make HTML files and a qthelp project\"\n\t@echo \"  devhelp    to make HTML files and a Devhelp project\"\n\t@echo \"  epub       to make an epub\"\n\t@echo \"  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter\"\n\t@echo \"  latexpdf   to make LaTeX files and run them through pdflatex\"\n\t@echo \"  latexpdfja to make LaTeX files and run them through 
platex/dvipdfmx\"\n\t@echo \"  text       to make text files\"\n\t@echo \"  man        to make manual pages\"\n\t@echo \"  texinfo    to make Texinfo files\"\n\t@echo \"  info       to make Texinfo files and run them through makeinfo\"\n\t@echo \"  gettext    to make PO message catalogs\"\n\t@echo \"  changes    to make an overview of all changed/added/deprecated items\"\n\t@echo \"  xml        to make Docutils-native XML files\"\n\t@echo \"  pseudoxml  to make pseudoxml-XML files for display purposes\"\n\t@echo \"  linkcheck  to check all external links for integrity\"\n\t@echo \"  doctest    to run all doctests embedded in the documentation (if enabled)\"\n\t@echo \"  coverage   to run coverage check of the documentation (if enabled)\"\n\nclean:\n\trm -rf $(BUILDDIR)/*\n\nhtml:\n\t$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html\n\t@echo\n\t@echo \"Build finished. The HTML pages are in $(BUILDDIR)/html.\"\n\ndirhtml:\n\t$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml\n\t@echo\n\t@echo \"Build finished. The HTML pages are in $(BUILDDIR)/dirhtml.\"\n\nsinglehtml:\n\t$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml\n\t@echo\n\t@echo \"Build finished. 
The HTML page is in $(BUILDDIR)/singlehtml.\"\n\npickle:\n\t$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle\n\t@echo\n\t@echo \"Build finished; now you can process the pickle files.\"\n\njson:\n\t$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json\n\t@echo\n\t@echo \"Build finished; now you can process the JSON files.\"\n\nhtmlhelp:\n\t$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp\n\t@echo\n\t@echo \"Build finished; now you can run HTML Help Workshop with the\" \\\n\t      \".hhp project file in $(BUILDDIR)/htmlhelp.\"\n\nqthelp:\n\t$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp\n\t@echo\n\t@echo \"Build finished; now you can run \"qcollectiongenerator\" with the\" \\\n\t      \".qhcp project file in $(BUILDDIR)/qthelp, like this:\"\n\t@echo \"# qcollectiongenerator $(BUILDDIR)/qthelp/pyGeno.qhcp\"\n\t@echo \"To view the help file:\"\n\t@echo \"# assistant -collectionFile $(BUILDDIR)/qthelp/pyGeno.qhc\"\n\ndevhelp:\n\t$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp\n\t@echo\n\t@echo \"Build finished.\"\n\t@echo \"To view the help file:\"\n\t@echo \"# mkdir -p $$HOME/.local/share/devhelp/pyGeno\"\n\t@echo \"# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/pyGeno\"\n\t@echo \"# devhelp\"\n\nepub:\n\t$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub\n\t@echo\n\t@echo \"Build finished. 
The epub file is in $(BUILDDIR)/epub.\"\n\nlatex:\n\t$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex\n\t@echo\n\t@echo \"Build finished; the LaTeX files are in $(BUILDDIR)/latex.\"\n\t@echo \"Run \\`make' in that directory to run these through (pdf)latex\" \\\n\t      \"(use \\`make latexpdf' here to do that automatically).\"\n\nlatexpdf:\n\t$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex\n\t@echo \"Running LaTeX files through pdflatex...\"\n\t$(MAKE) -C $(BUILDDIR)/latex all-pdf\n\t@echo \"pdflatex finished; the PDF files are in $(BUILDDIR)/latex.\"\n\nlatexpdfja:\n\t$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex\n\t@echo \"Running LaTeX files through platex and dvipdfmx...\"\n\t$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja\n\t@echo \"pdflatex finished; the PDF files are in $(BUILDDIR)/latex.\"\n\ntext:\n\t$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text\n\t@echo\n\t@echo \"Build finished. The text files are in $(BUILDDIR)/text.\"\n\nman:\n\t$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man\n\t@echo\n\t@echo \"Build finished. The manual pages are in $(BUILDDIR)/man.\"\n\ntexinfo:\n\t$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo\n\t@echo\n\t@echo \"Build finished. The Texinfo files are in $(BUILDDIR)/texinfo.\"\n\t@echo \"Run \\`make' in that directory to run these through makeinfo\" \\\n\t      \"(use \\`make info' here to do that automatically).\"\n\ninfo:\n\t$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo\n\t@echo \"Running Texinfo files through makeinfo...\"\n\tmake -C $(BUILDDIR)/texinfo info\n\t@echo \"makeinfo finished; the Info files are in $(BUILDDIR)/texinfo.\"\n\ngettext:\n\t$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale\n\t@echo\n\t@echo \"Build finished. 
The message catalogs are in $(BUILDDIR)/locale.\"\n\nchanges:\n\t$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes\n\t@echo\n\t@echo \"The overview file is in $(BUILDDIR)/changes.\"\n\nlinkcheck:\n\t$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck\n\t@echo\n\t@echo \"Link check complete; look for any errors in the above output \" \\\n\t      \"or in $(BUILDDIR)/linkcheck/output.txt.\"\n\ndoctest:\n\t$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest\n\t@echo \"Testing of doctests in the sources finished, look at the \" \\\n\t      \"results in $(BUILDDIR)/doctest/output.txt.\"\n\ncoverage:\n\t$(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage\n\t@echo \"Testing of coverage in the sources finished, look at the \" \\\n\t      \"results in $(BUILDDIR)/coverage/python.txt.\"\n\nxml:\n\t$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml\n\t@echo\n\t@echo \"Build finished. The XML files are in $(BUILDDIR)/xml.\"\n\npseudoxml:\n\t$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml\n\t@echo\n\t@echo \"Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml.\"\n"
  },
  {
    "path": "pyGeno/doc/make.bat",
    "content": "@ECHO OFF\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sphinx-build\r\n)\r\nset BUILDDIR=build\r\nset ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source\r\nset I18NSPHINXOPTS=%SPHINXOPTS% source\r\nif NOT \"%PAPER%\" == \"\" (\r\n\tset ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%\r\n\tset I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%\r\n)\r\n\r\nif \"%1\" == \"\" goto help\r\n\r\nif \"%1\" == \"help\" (\r\n\t:help\r\n\techo.Please use `make ^<target^>` where ^<target^> is one of\r\n\techo.  html       to make standalone HTML files\r\n\techo.  dirhtml    to make HTML files named index.html in directories\r\n\techo.  singlehtml to make a single large HTML file\r\n\techo.  pickle     to make pickle files\r\n\techo.  json       to make JSON files\r\n\techo.  htmlhelp   to make HTML files and a HTML help project\r\n\techo.  qthelp     to make HTML files and a qthelp project\r\n\techo.  devhelp    to make HTML files and a Devhelp project\r\n\techo.  epub       to make an epub\r\n\techo.  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter\r\n\techo.  text       to make text files\r\n\techo.  man        to make manual pages\r\n\techo.  texinfo    to make Texinfo files\r\n\techo.  gettext    to make PO message catalogs\r\n\techo.  changes    to make an overview over all changed/added/deprecated items\r\n\techo.  xml        to make Docutils-native XML files\r\n\techo.  pseudoxml  to make pseudoxml-XML files for display purposes\r\n\techo.  linkcheck  to check all external links for integrity\r\n\techo.  doctest    to run all doctests embedded in the documentation if enabled\r\n\techo.  
coverage   to run coverage check of the documentation if enabled\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"clean\" (\r\n\tfor /d %%i in (%BUILDDIR%\\*) do rmdir /q /s %%i\r\n\tdel /q /s %BUILDDIR%\\*\r\n\tgoto end\r\n)\r\n\r\n\r\nREM Check if sphinx-build is available and fallback to Python version if any\r\n%SPHINXBUILD% 2> nul\r\nif errorlevel 9009 goto sphinx_python\r\ngoto sphinx_ok\r\n\r\n:sphinx_python\r\n\r\nset SPHINXBUILD=python -m sphinx.__init__\r\n%SPHINXBUILD% 2> nul\r\nif errorlevel 9009 (\r\n\techo.\r\n\techo.The 'sphinx-build' command was not found. Make sure you have Sphinx\r\n\techo.installed, then set the SPHINXBUILD environment variable to point\r\n\techo.to the full path of the 'sphinx-build' executable. Alternatively you\r\n\techo.may add the Sphinx directory to PATH.\r\n\techo.\r\n\techo.If you don't have Sphinx installed, grab it from\r\n\techo.http://sphinx-doc.org/\r\n\texit /b 1\r\n)\r\n\r\n:sphinx_ok\r\n\r\n\r\nif \"%1\" == \"html\" (\r\n\t%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The HTML pages are in %BUILDDIR%/html.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"dirhtml\" (\r\n\t%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"singlehtml\" (\r\n\t%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. 
The HTML pages are in %BUILDDIR%/singlehtml.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"pickle\" (\r\n\t%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can process the pickle files.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"json\" (\r\n\t%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can process the JSON files.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"htmlhelp\" (\r\n\t%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can run HTML Help Workshop with the ^\r\n.hhp project file in %BUILDDIR%/htmlhelp.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"qthelp\" (\r\n\t%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can run \"qcollectiongenerator\" with the ^\r\n.qhcp project file in %BUILDDIR%/qthelp, like this:\r\n\techo.^> qcollectiongenerator %BUILDDIR%\\qthelp\\pyGeno.qhcp\r\n\techo.To view the help file:\r\n\techo.^> assistant -collectionFile %BUILDDIR%\\qthelp\\pyGeno.ghc\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"devhelp\" (\r\n\t%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"epub\" (\r\n\t%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. 
The epub file is in %BUILDDIR%/epub.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"latex\" (\r\n\t%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; the LaTeX files are in %BUILDDIR%/latex.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"latexpdf\" (\r\n\t%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex\r\n\tcd %BUILDDIR%/latex\r\n\tmake all-pdf\r\n\tcd %~dp0\r\n\techo.\r\n\techo.Build finished; the PDF files are in %BUILDDIR%/latex.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"latexpdfja\" (\r\n\t%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex\r\n\tcd %BUILDDIR%/latex\r\n\tmake all-pdf-ja\r\n\tcd %~dp0\r\n\techo.\r\n\techo.Build finished; the PDF files are in %BUILDDIR%/latex.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"text\" (\r\n\t%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The text files are in %BUILDDIR%/text.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"man\" (\r\n\t%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The manual pages are in %BUILDDIR%/man.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"texinfo\" (\r\n\t%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"gettext\" (\r\n\t%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. 
The message catalogs are in %BUILDDIR%/locale.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"changes\" (\r\n\t%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.The overview file is in %BUILDDIR%/changes.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"linkcheck\" (\r\n\t%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Link check complete; look for any errors in the above output ^\r\nor in %BUILDDIR%/linkcheck/output.txt.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"doctest\" (\r\n\t%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Testing of doctests in the sources finished, look at the ^\r\nresults in %BUILDDIR%/doctest/output.txt.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"coverage\" (\r\n\t%SPHINXBUILD% -b coverage %ALLSPHINXOPTS% %BUILDDIR%/coverage\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Testing of coverage in the sources finished, look at the ^\r\nresults in %BUILDDIR%/coverage/python.txt.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"xml\" (\r\n\t%SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The XML files are in %BUILDDIR%/xml.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"pseudoxml\" (\r\n\t%SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml.\r\n\tgoto end\r\n)\r\n\r\n:end\r\n"
  },
  {
    "path": "pyGeno/doc/source/bootstraping.rst",
    "content": "Bootstraping\n==================================\npyGeno can be quick-started by importing these built-in data wraps.\n\n.. automodule:: bootstrap\n   :members:\n\n"
  },
  {
    "path": "pyGeno/doc/source/citing.rst",
    "content": "Citing\n=========\n\nIf you are using pyGeno please mention it to the rest of the universe by including a link to: https://github.com/tariqdaouda/pyGeno"
  },
  {
    "path": "pyGeno/doc/source/conf.py",
    "content": "# -*- coding: utf-8 -*-\n#\n# pyGeno documentation build configuration file, created by\n# sphinx-quickstart on Thu Nov  6 16:45:34 2014.\n#\n# This file is execfile()d with the current directory set to its\n# containing dir.\n#\n# Note that not all possible configuration values are present in this\n# autogenerated file.\n#\n# All configuration values have a default; values that are commented out\n# serve to show the default.\n\nimport sys\nimport os\n\nimport sphinx_rtd_theme\n\n# If extensions (or modules to document with autodoc) are in another directory,\n# add these directories to sys.path here. If the directory is relative to the\n# documentation root, use os.path.abspath to make it absolute, like shown here.\nsys.path.insert(0, os.path.abspath('../..'))\n\n# -- General configuration ------------------------------------------------\n\n# If your documentation needs a minimal Sphinx version, state it here.\n#needs_sphinx = '1.0'\n\n# Add any Sphinx extension module names here, as strings. 
They can be\n# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom\n# ones.\nextensions = [\n    'sphinx.ext.autodoc',\n    'sphinx.ext.doctest',\n    'sphinx.ext.intersphinx',\n    'sphinx.ext.todo',\n    'sphinx.ext.coverage',\n    'sphinx.ext.mathjax',\n    'sphinx.ext.ifconfig',\n    'sphinx.ext.viewcode',\n]\n\n# Add any paths that contain templates here, relative to this directory.\ntemplates_path = ['_templates']\n\n# The suffix of source filenames.\nsource_suffix = '.rst'\n\n# The encoding of source files.\n#source_encoding = 'utf-8-sig'\n\n# The master toctree document.\nmaster_doc = 'index'\n\n# General information about the project.\nproject = u'pyGeno'\ncopyright = u'2014, Tariq Daouda'\n\n# The version info for the project you're documenting, acts as replacement for\n# |version| and |release|, also used in various other places throughout the\n# built documents.\n#\n# The short X.Y version.\nversion = '1.x'\n# The full version, including alpha/beta/rc tags.\nrelease = '1.2.x'\n\n# The language for content autogenerated by Sphinx. Refer to documentation\n# for a list of supported languages.\n#\n# This is also used if you do content translation via gettext catalogs.\n# Usually you set \"language\" from the command line for these cases.\nlanguage = None\n\n# There are two options for replacing |today|: either, you set today to some\n# non-false value, then it is used:\n#today = ''\n# Else, today_fmt is used as the format for a strftime call.\n#today_fmt = '%B %d, %Y'\n\n# List of patterns, relative to source directory, that match files and\n# directories to ignore when looking for source files.\nexclude_patterns = []\n\n# The reST default role (used for this markup: `text`) to use for all\n# documents.\n#default_role = None\n\n# If true, '()' will be appended to :func: etc. cross-reference text.\n#add_function_parentheses = True\n\n# If true, the current module name will be prepended to all description\n# unit titles (such as .. 
function::).\n#add_module_names = True\n\n# If true, sectionauthor and moduleauthor directives will be shown in the\n# output. They are ignored by default.\n#show_authors = False\n\n# The name of the Pygments (syntax highlighting) style to use.\npygments_style = 'sphinx'\n\n# A list of ignored prefixes for module index sorting.\n#modindex_common_prefix = []\n\n# If true, keep warnings as \"system message\" paragraphs in the built documents.\n#keep_warnings = False\n\n\n# -- Options for HTML output ----------------------------------------------\n\n# The theme to use for HTML and HTML Help pages.  See the documentation for\n# a list of builtin themes.\n# html_theme = \"default\"\n# html_theme_options = {\n# \t\"rightsidebar\":\"true\",\n# \t\"stickysidebar\" : \"false\",\n# }\nhtml_theme = \"sphinx_rtd_theme\"\n\n# Theme options are theme-specific and customize the look and feel of a theme\n# further.  For a list of options available for each theme, see the\n# documentation.\n#html_theme_options = {}\n\n# Add any paths that contain custom themes here, relative to this directory.\n#html_theme_path = []\nhtml_theme_path = [sphinx_rtd_theme.get_html_theme_path()]\n\n# The name for this set of Sphinx documents.  If None, it defaults to\n# \"<project> v<release> documentation\".\n#html_title = None\n\n# A shorter title for the navigation bar.  Default is the same as html_title.\n#html_short_title = None\n\n# The name of an image file (relative to this directory) to place at the top\n# of the sidebar.\nhtml_logo = \"logo.png\"\n\n# The name of an image file (within the static path) to use as favicon of the\n# docs.  This file should be a Windows icon file (.ico) being 16x16 or 32x32\n# pixels large.\n#html_favicon = None\n\n# Add any paths that contain custom static files (such as style sheets) here,\n# relative to this directory. 
They are copied after the builtin static files,\n# so a file named \"default.css\" will overwrite the builtin \"default.css\".\nhtml_static_path = ['_static']\n\n# Add any extra paths that contain custom files (such as robots.txt or\n# .htaccess) here, relative to this directory. These files are copied\n# directly to the root of the documentation.\n#html_extra_path = []\n\n# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,\n# using the given strftime format.\n#html_last_updated_fmt = '%b %d, %Y'\n\n# If true, SmartyPants will be used to convert quotes and dashes to\n# typographically correct entities.\n#html_use_smartypants = True\n\n# Custom sidebar templates, maps document names to template names.\n#html_sidebars = {}\n\n# Additional templates that should be rendered to pages, maps page names to\n# template names.\n#html_additional_pages = {}\n\n# If false, no module index is generated.\n#html_domain_indices = True\n\n# If false, no index is generated.\n#html_use_index = True\n\n# If true, the index is split into individual pages for each letter.\n#html_split_index = False\n\n# If true, links to the reST sources are added to the pages.\n#html_show_sourcelink = True\n\n# If true, \"Created using Sphinx\" is shown in the HTML footer. Default is True.\n#html_show_sphinx = True\n\n# If true, \"(C) Copyright ...\" is shown in the HTML footer. Default is True.\n#html_show_copyright = True\n\n# If true, an OpenSearch description file will be output, and all pages will\n# contain a <link> tag referring to it.  The value of this option must be the\n# base URL from which the finished HTML is served.\n#html_use_opensearch = ''\n\n# This is the file name suffix for HTML files (e.g. 
\".xhtml\").\n#html_file_suffix = None\n\n# Language to be used for generating the HTML full-text search index.\n# Sphinx supports the following languages:\n#   'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'\n#   'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr'\n#html_search_language = 'en'\n\n# A dictionary with options for the search language support, empty by default.\n# Now only 'ja' uses this config value\n#html_search_options = {'type': 'default'}\n\n# The name of a javascript file (relative to the configuration directory) that\n# implements a search results scorer. If empty, the default will be used.\n#html_search_scorer = 'scorer.js'\n\n# Output file base name for HTML help builder.\nhtmlhelp_basename = 'pyGenodoc'\n\n# -- Options for LaTeX output ---------------------------------------------\n\nlatex_elements = {\n# The paper size ('letterpaper' or 'a4paper').\n'papersize': 'letterpaper',\n\n# The font size ('10pt', '11pt' or '12pt').\n'pointsize': '10pt',\n\n# Additional stuff for the LaTeX preamble.\n#'preamble': '',\n\n# Latex figure (float) alignment\n#'figure_align': 'htbp',\n}\n\n# Grouping the document tree into LaTeX files. 
List of tuples\n# (source start file, target name, title,\n#  author, documentclass [howto, manual, or own class]).\nlatex_documents = [\n  ('index', 'pyGeno.tex', u'pyGeno Documentation',\n   u'Tariq Daouda', 'manual'),\n]\n\n# The name of an image file (relative to this directory) to place at the top of\n# the title page.\n# latex_logo = None\n\n# For \"manual\" documents, if this is true, then toplevel headings are parts,\n# not chapters.\n#latex_use_parts = False\n\n# If true, show page references after internal links.\n#latex_show_pagerefs = False\n\n# If true, show URL addresses after external links.\n#latex_show_urls = False\n\n# Documents to append as an appendix to all manuals.\n#latex_appendices = []\n\n# If false, no module index is generated.\n#latex_domain_indices = True\n\n\n# -- Options for manual page output ---------------------------------------\n\n# One entry per manual page. List of tuples\n# (source start file, name, description, authors, manual section).\nman_pages = [\n    ('index', 'pygeno', u'pyGeno Documentation',\n     [u'Tariq Daouda'], 1)\n]\n\n# If true, show URL addresses after external links.\n#man_show_urls = False\n\n\n# -- Options for Texinfo output -------------------------------------------\n\n# Grouping the document tree into Texinfo files. 
List of tuples\n# (source start file, target name, title, author,\n#  dir menu entry, description, category)\ntexinfo_documents = [\n  ('index', 'pyGeno', u'pyGeno Documentation',\n   u'Tariq Daouda', 'pyGeno', 'One line description of project.',\n   'Miscellaneous'),\n]\n\n# Documents to append as an appendix to all manuals.\n#texinfo_appendices = []\n\n# If false, no module index is generated.\n#texinfo_domain_indices = True\n\n# How to display URL addresses: 'footnote', 'no', or 'inline'.\n#texinfo_show_urls = 'footnote'\n\n# If true, do not generate a @detailmenu in the \"Top\" node's menu.\n#texinfo_no_detailmenu = False\n\n\n# -- Options for Epub output ----------------------------------------------\n\n# Bibliographic Dublin Core info.\nepub_title = u'pyGeno'\nepub_author = u'Tariq Daouda'\nepub_publisher = u'Tariq Daouda'\nepub_copyright = u'2014, Tariq Daouda'\n\n# The basename for the epub file. It defaults to the project name.\n#epub_basename = u'pyGeno'\n\n# The HTML theme for the epub output. Since the default themes are not optimized\n# for small screen space, using the same theme for HTML and epub output is\n# usually not wise. This defaults to 'epub', a theme designed to save visual\n# space.\n#epub_theme = 'epub'\n\n# The language of the text. It defaults to the language option\n# or 'en' if the language is not set.\n#epub_language = ''\n\n# The scheme of the identifier. Typical schemes are ISBN or URL.\n#epub_scheme = ''\n\n# The unique identifier of the text. 
This can be a ISBN number\n# or the project homepage.\n#epub_identifier = ''\n\n# A unique identification for the text.\n#epub_uid = ''\n\n# A tuple containing the cover image and cover page html template filenames.\n#epub_cover = ()\n\n# A sequence of (type, uri, title) tuples for the guide element of content.opf.\n#epub_guide = ()\n\n# HTML files that should be inserted before the pages created by sphinx.\n# The format is a list of tuples containing the path and title.\n#epub_pre_files = []\n\n# HTML files shat should be inserted after the pages created by sphinx.\n# The format is a list of tuples containing the path and title.\n#epub_post_files = []\n\n# A list of files that should not be packed into the epub file.\nepub_exclude_files = ['search.html']\n\n# The depth of the table of contents in toc.ncx.\n#epub_tocdepth = 3\n\n# Allow duplicate toc entries.\n#epub_tocdup = True\n\n# Choose between 'default' and 'includehidden'.\n#epub_tocscope = 'default'\n\n# Fix unsupported image types using the PIL.\n#epub_fix_images = False\n\n# Scale large images.\n#epub_max_image_width = 0\n\n# How to display URL addresses: 'footnote', 'no', or 'inline'.\n#epub_show_urls = 'inline'\n\n# If false, no index is generated.\n#epub_use_index = True\n\n\n# Example configuration for intersphinx: refer to the Python standard library.\nintersphinx_mapping = {'http://docs.python.org/': None}\n"
  },
  {
    "path": "pyGeno/doc/source/datawraps.rst",
    "content": "Datawraps\n=========\n\nDatawraps are used by pyGeno to import data into it's database. All reference genomes are downloaded from Ensembl, dbSNP data from dbSNP.\nThe :doc:`/bootstraping` module has functions to import datawraps shipped with pyGeno.\nDatawraps can either be tar.gz.archives or folders.\n\nImportation\n-----------\n\nHere's how you import a reference genome datawrap::\n\n\tfrom pyGeno.importation.Genomes import *\n\timportGenome(\"my_datawrap.tar.gz\")\n\n\nAnd a SNP set datawrap::\n\t\n\tfrom pyGeno.importation.SNPs import *\n\timportSNPs(\"my_datawrap.tar.gz\")\n\n\nCreating you own datawraps\n--------------------------\n\nFor polymorphims, create a file called **manifest.ini** with the following format::\n\n\t[package_infos]\n\tdescription = SNPs for testing purposes\n\tmaintainer = Tariq Daouda\n\tmaintainer_contact = tariq.daouda [at] umontreal\n\tversion = 1\n\n\t[set_infos]\n\tspecies = human\n\tname = mySNPSET\n\ttype = Agnostic # or CasavaSNP or dbSNPSNP\n\tsource = Where do these snps come from?\n\n\t[snps]\n\tfilename = snps.txt # or http://www.example.com/snps.txt or ftp://www.example.com/snps.txt if you chose not to include the file in the archive\n\nAnd compress the **manifest.ini** file along sith the snps.txt (if you chose to include it and not to specify an url) into a tar.gz archive.\n\n\nNatively pyGeno supports dbSNP and casava(snp.txt), but it also has its own polymorphism file format (AgnosticSNP) wich is simply a tab delemited file in the following format::\n\n\tchromosomeNumber uniqueId   start        end      ref    alleles   quality       caller\n\t        Y          1       2655643      2655644\t   T       AG        30          TopHat\n\t        Y          2       2655645      2655647    -       AG        28          TopHat\n\t        Y          3       2655648      2655650    TT      -         10          TopHat\n\nEven tough all field are mandatory, the only ones that are critical for pyGeno to be able 
insert polymorphisms at the right places are: *chromosomeNumber* and *start*. All the other fields are non critical and can follow any convention you wish to apply to them, including the *end* field. You can choose the convention that suits best the query that you are planning to make on SNPs through .get(), or the way you intend to filter them using filtering objtecs.\n\nFor genomes, the manifet.ini file looks like this::\n\n\t[package_infos]\n\tdescription = Test package. This package installs only chromosome Y of mus musculus\n\tmaintainer = Tariq Daouda\n\tmaintainer_contact = tariq.daouda [at] umontreal\n\tversion = GRCm38.73\n\n\t[genome]\n\tspecies = Mus_musculus\n\tname = GRCm38_test\n\tsource = http://useast.ensembl.org/info/data/ftp/index.html\n\n\t[chromosome_files]\n\tY = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz # or an url such as ftp://... or http://\n\n\t[gene_set]\n\tgtf = Mus_musculus.GRCm38.73_Y-only.gtf.gz # or an url such as ftp://... or http://\n\nFile URLs for refercence genomes can be found on `Ensembl: http://useast.ensembl.org/info/data/ftp/index.html`_\n\nTo learn more about datawraps and how to make your own you can have a look at :doc:`/importation`, and the Wiki_.\n\n.. _Wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-datawrap-to-import-your-data\n.. _`Ensembl: http://useast.ensembl.org/info/data/ftp/index.html`: http://useast.ensembl.org/info/data/ftp/index.html\n"
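Before packaging an AgnosticSNP file, it can be handy to sanity-check it with a few lines of standard Python. The sketch below is illustrative only (it is not part of pyGeno); the column names follow the tab-delimited layout described above, and the helper name `parse_agnostic_snps` is hypothetical:

```python
import csv
import io

# Column layout of pyGeno's AgnosticSNP tab-delimited format, as described above.
FIELDS = ["chromosomeNumber", "uniqueId", "start", "end",
          "ref", "alleles", "quality", "caller"]

def parse_agnostic_snps(text):
    """Parse AgnosticSNP rows; only chromosomeNumber and start are critical."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter="\t")
    next(reader)  # skip the header line
    rows = []
    for row in reader:
        row["start"] = int(row["start"])  # pyGeno needs a usable start position
        rows.append(row)
    return rows

example = (
    "chromosomeNumber\tuniqueId\tstart\tend\tref\talleles\tquality\tcaller\n"
    "Y\t1\t2655643\t2655644\tT\tAG\t30\tTopHat\n"
    "Y\t2\t2655645\t2655647\t-\tAG\t28\tTopHat\n"
)
snps = parse_agnostic_snps(example)
```

Only *chromosomeNumber* and *start* are validated here, mirroring the fact that they are the only fields pyGeno strictly depends on.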
  },
  {
    "path": "pyGeno/doc/source/importation.rst",
    "content": "Importation\n===========\npyGeno's database is populated by importing tar.gz compressed archives called *datawraps*. An importation is a one time step and once a datawrap has been imported it can be discarded with no concequences to the database. \n\nHere's how you import a reference genome datawrap::\n\t\n\tfrom pyGeno.importation.Genomes import *\n\timportGenome(\"my_genome_datawrap.tar.gz\")\n\n\nAnd a SNP set datawrap::\n\t\n\tfrom pyGeno.importation.SNPs import *\n\timportSNPs(\"my_snp_datawrap.tar.gz\")\n\npyGeno comes with a few datawraps that you can quickly import using the :doc:`/bootstraping` module.\n\nYou can find a list of datawraps to import here: :doc:`/datawraps`\n\nYou can also easily create your own by simply putting a bunch of URLs into a *manifest.ini* file and compressing int into a *tar.gz archive* (as explained below or on the Wiki_).\n\n.. _Wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-datawrap-to-import-your-data\n\nGenomes\n-------\n.. automodule:: importation.Genomes\n   :members:\n\nPolymorphisms (SNPs and Indels)\n-------------------------------\n\n.. automodule:: importation.SNPs\n   :members:\n"
  },
  {
    "path": "pyGeno/doc/source/index.rst",
    "content": ".. pyGeno documentation master file, created by\n   sphinx-quickstart on Thu Nov  6 16:45:34 2014.\n   You can adapt this file completely to your liking, but it should at least\n   contain the root `toctree` directive.\n\n.. image:: http://bioinfo.iric.ca/~daoudat/pyGeno/_static/logo.png\n   :alt: pyGeno's logo\n\npyGeno: A Python package for precision medicine and proteogenomics\n===================================================================\n.. image:: http://depsy.org/api/package/pypi/pyGeno/badge.svg\n   :alt: depsy\n   :target: http://depsy.org/package/python/pyGeno\n.. image:: https://img.shields.io/badge/License-Apache%202.0-blue.svg\n    :target: https://opensource.org/licenses/Apache-2.0\n.. image:: https://img.shields.io/badge/python-2.7-blue.svg \n\npyGeno's `lair is on Github`_.\n\n.. _lair is on Github: http://www.github.com/tariqdaouda/pyGeno\n\nCiting pyGeno:\n--------------\n\nPlease cite this paper_.\n\n.. _paper: http://f1000research.com/articles/5-381/v1\n\nA Quick Intro:\n-----------------\n\nEven tough more and more research focuses on Personalized/Precision Medicine, treatments that are specically tailored to the patient, pyGeno is (to our knowlege) the only tool available that will gladly build your specific genomes for you and you give an easy access to them.\n\npyGeno allows you to create and work on **Personalized Genomes**: custom genomes that you create by combining a reference genome, sets of polymorphims and an optional filter.\npyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get\ndirect access to the DNA and Protein sequences of your patients/subjects.\nTo know more about how to create a Personalized Genome, have a look at the :doc:`/quickstart` section.\n\npyGeno can also function as a personal bioinformatic database for Ensembl, that runs directly into python, on your laptop, making faster and more reliable than any REST API. 
pyGeno makes extracting data such as gene sequences a breeze, and is designed to\nbe able cope with huge queries.\n\n\n.. code::\n\n\tfrom pyGeno.Genome import *\n\n\tg = Genome(name = \"GRCh37.75\")\n\tprot = g.get(Protein, id = 'ENSP00000438917')[0]\n\t#print the protein sequence\n\tprint prot.sequence\n\t#print the protein's gene biotype\n\tprint prot.gene.biotype\n\t#print protein's transcript sequence\n\tprint prot.transcript.sequence\n\n\t#fancy queries\n\tfor exons in g.get(Exons, {\"CDS_start >\": x1, \"CDS_end <=\" : x2, \"chromosome.number\" : \"22\"}) :\n\t\t#print the exon's coding sequence\n\t\tprint exon.CDS\n\t\t#print the exon's transcript sequence\n\t\tprint exon.transcript.sequence\n\n\t#You can do the same for your subject specific genomes\n\t#by combining a reference genome with polymorphisms \n\tg = Genome(name = \"GRCh37.75\", SNPs = [\"STY21_RNA\"], SNPFilter = MyFilter())\n\n\nVerbose Introduction\n---------------------\n\npyGeno integrates:\n\n* **Reference sequences** and annotations from **Ensembl**\n\n* Genomic polymorphisms from the **dbSNP** database\n\n* SNPs from **next-gen** sequencing\n\npyGeno is a python package that  was designed to be:\n\n* Fast to install. It has no dependencies but its own backend: `rabaDB`_.\n* Fast to run and memory efficient, so you can use it on your laptop.\n* Fast to use. No queries to foreign APIs all the data rests on your computer, so it is readily accessible when you need it.\n* Fast to learn. One single function **get()** can do the job of several other tools at once. 
\n\nIt also comes with:\n\n* Parsers for: FASTA, FASTQ, GTF, VCF, CSV.\n* Useful tools for translation etc...\n* Optimised genome indexation with *Segment Trees*.\n* A funky *Progress Bar*.\n\nOne of the the coolest things about pyGeno is that it also allows to quickly create **personalized genomes**.\nGenomes that you design yourself by combining a reference genome and SNP sets derived from dbSNP or next-gen sequencing.\n\npyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_), its logo is the work of the freelance designer `Sawssan Kaddoura`_.\nFor the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.\n\n.. _rabaDB: https://github.com/tariqdaouda/rabaDB\n.. _Tariq Daouda: http://www.tariqdaouda.com\n.. _IRIC: http://www.iric.ca\n.. _Sawssan Kaddoura: http://www.sawssankaddoura.com\n.. _@tariqdaouda: https://www.twitter.com/tariqdaouda\n\nContents:\n----------\n\n.. toctree::\n   :maxdepth: 2\n   \n   publications\n   quickstart\n   installation\n   bootstraping\n   querying\n   importation\n   datawraps\n   objects\n   snp_filter\n   tools\n   parsers\n\nIndices and tables\n==================\n\n* :ref:`genindex`\n* :ref:`modindex`\n* :ref:`search`\n\n"
  },
  {
    "path": "pyGeno/doc/source/installation.rst",
    "content": "Installation\n=============\n\nUnix (MacOS, Linux)\n-------------------\n\nThe latest stable version is available from pypi::\n\t\n\tpip install pyGeno\n\n**Upgrade**::\n\n\tpip install pyGeno --upgrade\n\nIf you're more adventurous, the bleeding edge version is available from github (look for the 'bloody' branch)::\n\n\tgit clone https://github.com/tariqdaouda/pyGeno.git\n\tcd pyGeno\n\tpython setup.py develop\n\n**Upgrade**::\n\n\tgit pull\n\nWindows\n-------\n\n* Goto: https://www.python.org/downloads/ and download the installer for the lastest version of python 2.7\n* Double click on the installer to begin installation\n* Click on the windows start menu\n* Type *\"cmd\"* and click on it to launch the command line interface\n* In the command line interface type::\n\t\n\t cd C:\\Python27\\Scripts\n\n* Now type: pip install pyGeno\n* Now click on the windows start menu. In the python 2.7 menu you can either launch *Python (Command line)* or *IDLE (Python GUI)*\n* You can now go to: http://pygeno.iric.ca/quickstart.html and type the commands into either one of them\n\n**UPGRADE:** to upgrade pyGeno to the latest version, launch *cmd* and type::\n\n\tcd C:\\Python27\\Scripts\n\nfollowed by::\n\t\n\tpip install pyGeno --upgrade\n\n"
  },
  {
    "path": "pyGeno/doc/source/objects.rst",
    "content": "Objects\n=======\nWith pyGeno you can manipulate familiar object in intuituive way. All the following classes except SNP inherit from pyGenoObjectWrapper and have therefor access to functions sur as get(), count(), ensureIndex()... \n\nBase classes\n-------------\nBase classes are abstract and are not meant to be instanciated, they nonetheless implement most of the functions that classes such as Genome possess.\n\n.. automodule:: pyGenoObjectBases\n   :members:\n\nGenome\n-------\n.. automodule:: Genome\n   :members:\n\nChromosome\n----------\n.. automodule:: Chromosome\n   :members:\n\nGene\n----\n.. automodule:: Gene\n   :members:\n\nTranscript\n----------\n.. automodule:: Transcript\n   :members:\n\nExon\n----\n.. automodule:: Exon\n   :members:\n\nProtein\n-------\n.. automodule:: Protein\n   :members:\n\nSNP\n---\n.. automodule:: SNP\n   :members:\n\n\n"
  },
  {
    "path": "pyGeno/doc/source/parsers.rst",
    "content": "Parsers\n=======\n\nPyGeno comes with a set of parsers that you can use independentely.\n\nCSV\n---\nTo read and write CSV files.\n\n.. automodule:: tools.parsers.CSVTools\n   :members:\n\nFASTA\n-----\nTo read and write FASTA files.\n\n.. automodule:: tools.parsers.FastaTools\n   :members:\n\nFASTQ\n-----\nTo read and write FASTQ files.\n\n.. automodule:: tools.parsers.FastqTools\n   :members:\n\nGTF\n---\nTo read GTF files.\n\n.. automodule:: tools.parsers.GTFTools\n   :members:\n\nVCF\n---\nTo read VCF files.\n\n.. automodule:: tools.parsers.VCFTools\n   :members:\n \nCasava\n-------\nTo read casava files.\n\n.. automodule:: tools.parsers.CasavaTools\n   :members:\n"
  },
  {
    "path": "pyGeno/doc/source/publications.rst",
    "content": "Publications\n============\n\nPlease cite this one:\n---------------------\n\n`pyGeno: A Python package for precision medicine and proteogenomics. F1000Research, 2016`_\n\n.. _`pyGeno: A Python package for precision medicine and proteogenomics. F1000Research, 2016`: http://f1000research.com/articles/5-381/v2\n\npyGeno was also used in the following studies:\n----------------------------------------------\n\n`MHC class I–associated peptides derive from selective regions of the human genome. J Clin Invest. 2016`_\n\n.. _`MHC class I–associated peptides derive from selective regions of the human genome. J Clin Invest. 2016`: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5127664/\n\n\n`Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames. Nat. Comm. 2015`_\n\n.. _Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames. Nat. Comm. 2015: http://dx.doi.org/10.1038/ncomms10238\n\n`Impact of genomic polymorphisms on the repertoire of human MHC class I-associated peptides. Nat. Comm. 2014`_\n\n.. _Impact of genomic polymorphisms on the repertoire of human MHC class I-associated peptides. Nat. Comm. 2014: http://www.ncbi.nlm.nih.gov/pubmed/24714562\n\n`MHC I-associated peptides preferentially derive from transcripts bearing miRNA response elements. Blood. 2012`_\n\n.. _MHC I-associated peptides preferentially derive from transcripts bearing miRNA response elements. Blood. 2012: http://www.ncbi.nlm.nih.gov/pubmed/22438248\n"
  },
  {
    "path": "pyGeno/doc/source/querying.rst",
    "content": "Querying\n=========\n\npyGeno is a personal database that you can query in many ways. Special emphasis has been placed upon ease of use, and you only need to remember two functions::\n\n\t* get()\n\t* help()\n\n**get()** can be called from any pyGeno object to get any objects.\n\n**help()** is you best friend when you get lost using **get()**. When called, it will give the list of all field that you can use in get queries. You can call it either of the class::\n\n\tGene.help()\n\nOr on the object::\n\n\tref = Genome(name = \"GRCh37.75\")\n\tg = ref.get(Gene, name = \"TPST2\")[0]\n\n\tg.help()\n\nBoth will print::\n\n\t'Available fields for Gene: end, name, chromosome, start, biotype, id, strand, genome'\n"
  },
  {
    "path": "pyGeno/doc/source/quickstart.rst",
    "content": "Quickstart\n==========\n\nQuick importation\n-----------------\npyGeno's database is populated by importing data wraps.\npyGeno comes with a few datawraps, to get the list you can use:\n\n.. code:: python\n\t\n\timport pyGeno.bootstrap as B\n\tB.printDatawraps()\n\n.. code::\n\n\tAvailable datawraps for bootstraping\n\t\n\tSNPs\n\t~~~~|\n\t    |~~~:> Human_agnostic.dummySRY.tar.gz\n\t    |~~~:> Human.dummySRY_casava.tar.gz\n\t    |~~~:> dbSNP142_human_GRCh37_common_all.tar.gz\n\t    |~~~:> dbSNP142_human_common_all.tar.gz\n\t\n\t\n\tGenomes\n\t~~~~~~~|\n\t       |~~~:> Human.GRCh37.75.tar.gz\n\t       |~~~:> Human.GRCh37.75_Y-Only.tar.gz\n\t       |~~~:> Human.GRCh38.78.tar.gz\n\t       |~~~:> Mouse.GRCm38.78.tar.gz\n\nTo get a list of remote datawraps that pyGeno can download for you, do:\n\n.. code:: python\n\n\tB.printRemoteDatawraps()\n\n\nImporting whole genomes is a demanding process that take more than an hour and requires (according to tests) \nat least 3GB of memory. Depending on your configuration, more might be required.\n\nThat being said importating a data wrap is a one time operation and once the importation is complete the datawrap\ncan be discarded without consequences.\n\nThe bootstrap module also has some handy functions for importing built-in packages.\n\nSome of them just for playing around with pyGeno (**Fast importation** and **Small memory requirements**):\n\n.. code:: python\n\t\n\timport pyGeno.bootstrap as B\n\n\t#Imports only the Y chromosome from the human reference genome GRCh37.75\n\t#Very fast, requires even less memory. No download required.\n\tB.importGenome(\"Human.GRCh37.75_Y-Only.tar.gz\")\n\t\n\t#A dummy datawrap for humans SNPs and Indels in pyGeno's AgnosticSNP  format. \n\t# This one has one SNP at the begining of the gene SRY\n\tB.importSNPs(\"Human.dummySRY_casava.tar.gz\")\n\nAnd for more serious work, the whole reference genome.\n\n.. 
code:: python\n\n\t#Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.\n\tB.importGenome(\"Human.GRCh38.78.tar.gz\")\n\nThat's it, you can now print the sequences of all the proteins that a gene can produce::\n\n\tfrom pyGeno.Genome import Genome\n\tfrom pyGeno.Gene import Gene\n\tfrom pyGeno.Protein import Protein\n\n\t#the name of the genome is defined inside the package's manifest.ini file\n\tref = Genome(name = 'GRCh37.75')\n\t#get returns a list of elements\n\tgene = ref.get(Gene, name = 'SRY')[0]\n\tfor prot in gene.get(Protein) :\n\t\t  print prot.sequence\n\nYou can see pyGeno achitecture as a graph where everything is connected to everything. For instance you can do things such as::\n\n\tgene = aProt.gene\n\ttrans = aProt.transcript\n\tprot = anExon.protein\n\tgenome = anExon.genome\n\nQueries\n-------\n\nPyGeno allows for several kinds of queries, here are some snippets::\n\n\t#in this case both queries will yield the same result\n\tmyGene.get(Protein, id = \"ENSID...\")\n\tmyGenome.get(Protein, id = \"ENSID...\")\n\t\n\t#even complex stuff\n\texons = myChromosome.get(Exons, {'start >=' : x1, 'stop <' : x2})\n\thlaGenes = myGenome.get(Gene, {'name like' : 'HLA'})\n\n\tsry = myGenome.get(Transcript, { \"gene.name\" : 'SRY' })\n\nTo know the available fields for queries, there's a \"help()\" function::\n\n\tGene.help()\n\n\nFaster queries\n---------------\n\nTo speed up loops use iterGet()::\n\t\n\tfor prot in gene.iterGet(Protein) :\n\t  print prot.sequence\n\nFor more speed create indexes on the fields you need the most::\n\t\n\tGene.ensureGlobalIndex('name')\n\n\nGetting sequences\n-------------------\n\nAnything that has a sequence can be indexed using the usual python list syntax::\n\n\tprotein[34] # for the 34th amino acid\n\tprotein[34:40] # for amino acids in [34, 40[\n\n\ttranscript[23] #for the 23rd nucleotide of the transcript\n\ttranscript[23:30] #for nucletotides in [23, 30[\n\n\ttranscript.cDNA[23:30] #the 
same but for the protein coding DNA (without the UTRs)\n\nTranscripts, Proteins, Exons also have a *.sequence* attribute. This attribute is the string rendered sequence, it is perfect for printing but it  may contain '/'s \nin case of polymorphic sequence that you must\ntake into account in the indexing. On the other hand if you use indexes directly on the object (as shown in the snippet above) pyGeno will use a binary representaion\nof the sequences thus the indexing is independent of the polymorphisms present in the sequences.\n\nPersonalized Genomes\n--------------------\n\nPersonalized Genomes are a powerful feature that allow to work on the specific genomes and proteomes of your patients. You can even mix several SNPs together::\n\t\n\tfrom pyGeno.Genome import Genome\n\t#the name of the snp set is defined inside the datawraps's manifest.ini file\n\tdummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')\n\t#you can also define a filter (ex: a quality filter) for the SNPs\n\tdummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())\n\t#and even mix several snp sets\n\tdummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())\n\npyGeno allows you to customize the Polymorphisms that end up into the final sequences. It supports SNPs, Inserts and Deletions::\n\t\n\tfrom pyGeno.SNPFiltering import SNPFilter\n\tfrom pyGeno.SNPFiltering import SequenceSNP\n\n\tclass QMax_gt_filter(SNPFilter) :\n\n\t\tdef __init__(self, threshold) :\n\t\t\tself.threshold = threshold\n\n\t\tdef filter(self, chromosome, dummySRY = None) :\n\t\t\tif dummySRY.Qmax_gt > self.threshold :\n\t\t\t\t#other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)\n\t\t\t\treturn SequenceSNP(dummySRY.alt)\n\t\t\treturn None #None means keep the reference allele\n\n\tpersGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))\n\n"
  },
  {
    "path": "pyGeno/doc/source/snp_filter.rst",
    "content": "Filtering Polymorphisms (SNPs and Indels)\n=========================================\n\nPolymorphism filtering is an important part of personalized genomes. By creating your own filters you can easily taylor personalized genomes to your needs. The importaant thing to understand about the filtering process, is that it gives you complete freedom about should be inserted. \nOnce pyGeno finds a polymorphism, it automatically triggers the filter to know what should be inserted at this position, and that can be anything you choose.\n\n.. automodule:: SNPFiltering\n   :members:\n"
  },
  {
    "path": "pyGeno/doc/source/tools.rst",
    "content": "Tools\n=====\n\npyGeno provides a set of tools that can be used independentely. Here you'll find goodies for translation, indexation, and more.\n\nProgress Bar\n-------------\npyGeno's awesome progress bar, with logging capabilities and remaining time estimation.\n\n.. automodule:: tools.ProgressBar\n   :members:\n\nUseful functions\n-----------------\nThis module is a bunch of handy bioinfo functions.\n\n.. automodule:: tools.UsefulFunctions\n   :members:\n\nBinary sequences\n-----------------\nTo encode sequence in binary formats\n\n.. automodule:: tools.BinarySequence\n   :members:\n\nSegment tree\n-------------\nSegment trees are an optimised way to index a genome.\n\n.. automodule:: tools.SegmentTree\n   :members:\n\nSecure memory map\n------------------\nA write protected memory map.\n\n.. automodule:: tools.SecureMmap\n   :members:\n"
  },
  {
    "path": "pyGeno/examples/__init__.py",
    "content": ""
  },
  {
    "path": "pyGeno/examples/bootstraping.py",
    "content": "import pyGeno.bootstrap as B\n\n#~ imports the whole human reference genome\n#~ B.importHumanReference()\nB.importHumanReferenceYOnly()\nB.importDummySRY()\n"
  },
  {
    "path": "pyGeno/examples/snps_queries.py",
    "content": "from pyGeno.Genome import Genome\nfrom pyGeno.Gene import Gene\nfrom pyGeno.Transcript import Transcript\nfrom pyGeno.Protein import Protein\nfrom pyGeno.Exon import Exon\n\nfrom pyGeno.SNPFiltering import SNPFilter\nfrom pyGeno.SNPFiltering import SequenceSNP\n\ndef printing(gene) :\n\tprint \"printing reference sequences\\n-------\"\n\tfor trans in gene.get(Transcript) :\n\t\tprint \"\\t-Transcript name:\", trans.name\n\t\tprint \"\\t-Protein:\", trans.protein.sequence\n\t\tprint \"\\t-Exons:\"\n\t\tfor e in trans.exons :\n\t\t\tprint \"\\t\\t Number:\", e.number\n\t\t\tprint \"\\t\\t-CDS:\", e.CDS\n\t\t\tprint \"\\t\\t-Strand:\", e.strand\n\t\t\tprint \"\\t\\t-CDS_start:\", e.CDS_start\n\t\t\tprint \"\\t\\t-CDS_end:\", e.CDS_end\n\ndef printVs(refGene, presGene) :\n\tprint \"Vs personalized sequences\\n------\"\n\n\t#iterGet returns an iterator instead of a list (faster)\n\tfor trans in presGene.iterGet(Transcript) :\n\t\trefProt = refGene.get(Protein, id = trans.protein.id)[0]\n\t\tpersProt = trans.protein\n\t\tprint persProt.id\n\t\tprint \"\\tref:\" + refProt.sequence[:20] + \"...\"\n\t\tprint \"\\tper:\" + persProt.sequence[:20] + \"...\"\n\t\tprint\n\ndef fancyExonQuery(gene) :\n\te = gene.get(Exon, {'CDS_start >' : 2655029, 'CDS_end <' : 2655200})[0]\n\tprint \"An exon with a CDS in ']2655029, 2655200[':\", e.id\n\t\nclass QMax_gt_filter(SNPFilter) :\n\t\n\tdef __init__(self, threshold) :\n\t\tself.threshold = threshold\n\t\t\n\tdef filter(self, chromosome, dummySRY) :\n\t\tif dummySRY.Qmax_gt > self.threshold :\n\t\t\t#other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)\n\t\t\treturn SequenceSNP(dummySRY.alt)\n\t\treturn None #None means keep the reference allele\n\nif __name__ == \"__main__\" :\n\trefGenome = Genome(name = 'GRCh37.75_Y-Only')\n\tpersGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))\n\t\n\tgeneRef = refGenome.get(Gene, name = 'SRY')[0]\n\tpersGene 
= persGenome.get(Gene, name = 'SRY')[0]\n\t\n\tprinting(geneRef)\n\tprint \"\\n\"\n\tprintVs(geneRef, persGene)\n\tprint \"\\n\"\n\tfancyExonQuery(geneRef)\n"
  },
  {
    "path": "pyGeno/importation/Genomes.py",
    "content": "import os, glob, gzip, tarfile, shutil, time, sys, gc, cPickle, tempfile, urllib2\nfrom contextlib import closing\nfrom ConfigParser import SafeConfigParser\n\nfrom pyGeno.tools.ProgressBar import ProgressBar\nimport pyGeno.configuration as conf\n\nfrom pyGeno.Genome import *\nfrom pyGeno.Chromosome import *\nfrom pyGeno.Gene import *\nfrom pyGeno.Transcript import *\nfrom pyGeno.Exon import *\nfrom pyGeno.Protein import *\n\nfrom pyGeno.tools.parsers.GTFTools import GTFFile\nfrom pyGeno.tools.ProgressBar import ProgressBar\nfrom pyGeno.tools.io import printf\n\nimport gc\n#~ import objgraph\n\ndef backUpDB() :\n    \"\"\"backup the current database version. automatically called by importGenome(). Returns the filename of the backup\"\"\"\n    st = time.ctime().replace(' ', '_')\n    fn = conf.pyGeno_RABA_DBFILE.replace('.db', '_%s-bck.db' % st)\n    shutil.copy2(conf.pyGeno_RABA_DBFILE, fn)\n\n    return fn\n\ndef _decompressPackage(packageFile) :\n    pFile = tarfile.open(packageFile)\n    \n    packageDir = tempfile.mkdtemp(prefix = \"pyGeno_import_\")\n    if os.path.isdir(packageDir) :\n        shutil.rmtree(packageDir)\n    os.makedirs(packageDir)\n\n    for mem in pFile :\n        pFile.extract(mem, packageDir)\n\n    return packageDir\n\ndef _getFile(fil, directory) :\n    if fil.find(\"http://\") == 0 or fil.find(\"ftp://\") == 0 :\n        printf(\"Downloading file: %s...\" % fil)\n        finalFile = os.path.normpath('%s/%s' %(directory, fil.split('/')[-1]))\n        # urllib.urlretrieve (fil, finalFile)\n        with closing(urllib2.urlopen(fil)) as r:\n            with open(finalFile, 'wb') as f:\n                shutil.copyfileobj(r, f)\n        \n        printf('done.')\n    else :\n        finalFile = os.path.normpath('%s/%s' %(directory, fil))\n    \n    return finalFile\n\ndef deleteGenome(species, name) :\n    \"\"\"Removes a genome from the database\"\"\"\n\n    printf('deleting genome (%s, %s)...' 
% (species, name))\n\n    conf.db.beginTransaction()\n    objs = []\n    allGood = True\n    try :\n        genome = Genome_Raba(name = name, species = species.lower())\n        objs.append(genome)\n        pBar = ProgressBar(label = 'preparing')\n        for typ in (Chromosome_Raba, Gene_Raba, Transcript_Raba, Exon_Raba, Protein_Raba) :\n            pBar.update()\n            f = RabaQuery(typ, namespace = genome._raba_namespace)\n            f.addFilter({'genome' : genome})\n            for e in f.iterRun() :\n                objs.append(e)\n        pBar.close()\n        \n        pBar = ProgressBar(nbEpochs = len(objs), label = 'deleting objects')\n        for e in objs :\n            pBar.update()\n            e.delete()\n        pBar.close()\n        \n    except KeyError as e :\n        printf(\"\\tWARNING, couldn't remove genome from db, maybe it's not there: %s\" % str(e))\n        allGood = False\n    printf('\\tdeleting folder')\n    try :\n        shutil.rmtree(conf.getGenomeSequencePath(species, name))\n    except OSError as e:\n        printf('\\tWARNING, Unable to delete folder: %s' % str(e))\n        allGood = False\n        \n    conf.db.endTransaction()\n    return allGood\n\ndef importGenome(packageFile, batchSize = 50, verbose = 0) :\n    \"\"\"Import a pyGeno genome package. A genome package is a folder or a tar.gz ball that contains at its root:\n\n    * gzipped fasta files for all chromosomes, or URLs from where they must be downloaded\n    \n    * a gzipped GTF gene_set file from Ensembl, or an URL from where it must be downloaded\n    \n    * a manifest.ini file such as::\n    \n        [package_infos]\n        description = Test package. 
This package installs only chromosome Y of mus musculus\n        maintainer = Tariq Daouda\n        maintainer_contact = tariq.daouda [at] umontreal\n        version = GRCm38.73\n\n        [genome]\n        species = Mus_musculus\n        name = GRCm38_test\n        source = http://useast.ensembl.org/info/data/ftp/index.html\n\n        [chromosome_files]\n        Y = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz / or an url such as ftp://... or http://\n\n        [gene_set]\n        gtf = Mus_musculus.GRCm38.73_Y-only.gtf.gz / or an url such as ftp://... or http://\n\n    All files except the manifest can be downloaded from: http://useast.ensembl.org/info/data/ftp/index.html\n    \n    A rollback is performed if an exception is caught during importation\n    \n    batchSize sets the number of genes to parse before performing a database save. PCs with little ram like\n    small values, while those endowed with more memory may perform faster with higher ones.\n    \n    Verbose must be an int [0, 4] for various levels of verbosity\n    \"\"\"\n\n    def reformatItems(items) :\n        s = str(items)\n        s = s.replace('[', '').replace(']', '').replace(\"',\", ': ').replace('), ', '\\n').replace(\"'\", '').replace('(', '').replace(')', '')\n        return s\n\n    printf('Importing genome package: %s... 
(This may take a while)' % packageFile)\n\n    isDir = False\n    if not os.path.isdir(packageFile) :\n        packageDir = _decompressPackage(packageFile)\n    else :\n        isDir = True\n        packageDir = packageFile\n\n    parser = SafeConfigParser()\n    parser.read(os.path.normpath(packageDir+'/manifest.ini'))\n    packageInfos = parser.items('package_infos')\n\n    genomeName = parser.get('genome', 'name')\n    species = parser.get('genome', 'species')\n    genomeSource = parser.get('genome', 'source')\n    \n    seqTargetDir = conf.getGenomeSequencePath(species.lower(), genomeName)\n    if os.path.isdir(seqTargetDir) :\n        raise KeyError(\"The directory %s already exists, Please call deleteGenome() first if you want to reinstall\" % seqTargetDir)\n        \n    gtfFile = _getFile(parser.get('gene_set', 'gtf'), packageDir)\n    \n    chromosomesFiles = {}\n    chromosomeSet = set()\n    for key, fil in parser.items('chromosome_files') :\n        chromosomesFiles[key] = _getFile(fil, packageDir)\n        chromosomeSet.add(key)\n\n    try :\n        genome = Genome(name = genomeName, species = species)\n    except KeyError:\n        pass\n    else :\n        raise KeyError(\"There seems to be already a genome (%s, %s), please call deleteGenome() first if you want to reinstall it\" % (genomeName, species))\n\n    genome = Genome_Raba()\n    genome.set(name = genomeName, species = species, source = genomeSource, packageInfos = packageInfos)\n\n    printf(\"Importing:\\n\\t%s\\nGenome:\\n\\t%s\\n...\"  % (reformatItems(packageInfos).replace('\\n', '\\n\\t'), reformatItems(parser.items('genome')).replace('\\n', '\\n\\t')))\n\n    chros = _importGenomeObjects(gtfFile, chromosomeSet, genome, batchSize, verbose)\n    os.makedirs(seqTargetDir)\n    startChro = 0\n    pBar = ProgressBar(nbEpochs = len(chros))\n    for chro in chros :\n        pBar.update(label = \"Importing DNA, chro %s\" % chro.number)\n        length = _importSequence(chro, 
chromosomesFiles[chro.number.lower()], seqTargetDir)\n        chro.start = startChro\n        chro.end = startChro+length\n        startChro = chro.end\n        chro.save()\n    pBar.close()\n    \n    if not isDir :\n        shutil.rmtree(packageDir)\n    \n    #~ objgraph.show_most_common_types(limit=20)\n    return True\n\n#~ @profile\ndef _importGenomeObjects(gtfFilePath, chroSet, genome, batchSize, verbose = 0) :\n    \"\"\"verbose must be an int [0, 4] for various levels of verbosity\"\"\"\n\n    class Store(object) :\n        \n        def __init__(self, conf) :\n            self.conf = conf\n            self.chromosomes = {}\n            \n            self.genes = {}\n            self.transcripts = {}\n            self.proteins = {}\n            self.exons = {}\n\n        def batch_save(self) :\n            self.conf.db.beginTransaction()\n            \n            for c in self.genes.itervalues() :\n                c.save()\n                conf.removeFromDBRegistery(c)\n                \n            for c in self.transcripts.itervalues() :\n                c.save()\n                conf.removeFromDBRegistery(c.exons)\n                conf.removeFromDBRegistery(c)\n            \n            for c in self.proteins.itervalues() :\n                c.save()\n                conf.removeFromDBRegistery(c)\n            \n            self.conf.db.endTransaction()\n            \n            del(self.genes)\n            del(self.transcripts)\n            del(self.proteins)\n            del(self.exons)\n            \n            self.genes = {}\n            self.transcripts = {}\n            self.proteins = {}\n            self.exons = {}\n\n            gc.collect()\n\n        def save_chros(self) :\n            pBar = ProgressBar(nbEpochs = len(self.chromosomes))\n            for c in self.chromosomes.itervalues() :\n                pBar.update(label = 'Chr %s' % c.number)\n                c.save()\n            pBar.close()\n        \n    printf('Importing gene set 
infos from %s...' % gtfFilePath)\n    \n    printf('Backing up indexes...')\n    indexes = conf.db.getIndexes()\n    printf(\"Dropping all your indexes (don't worry, I'll restore them later)...\")\n    Genome_Raba.flushIndexes()\n    Chromosome_Raba.flushIndexes()\n    Gene_Raba.flushIndexes()\n    Transcript_Raba.flushIndexes()\n    Protein_Raba.flushIndexes()\n    Exon_Raba.flushIndexes()\n    \n    printf(\"Parsing gene set...\")\n    gtf = GTFFile(gtfFilePath, gziped = True)\n    printf('Done. Importation begins!')\n    \n    store = Store(conf)\n    chroNumber = None\n    pBar = ProgressBar(nbEpochs = len(gtf))\n    for line in gtf :\n        chroN = line['seqname']\n        pBar.update(label = \"Chr %s\" % chroN)\n        \n        if (chroN.upper() in chroSet or chroN.lower() in chroSet):\n            strand = line['strand']\n            gene_biotype = line['gene_biotype']\n            regionType = line['feature']\n            frame = line['frame']\n\n            start = int(line['start']) - 1\n            end = int(line['end'])\n            if start > end :\n                start, end = end, start\n\n            chroNumber = chroN.upper()\n            if chroNumber not in store.chromosomes :\n                store.chromosomes[chroNumber] = Chromosome_Raba()\n                store.chromosomes[chroNumber].set(genome = genome, number = chroNumber)\n            \n            try :\n                geneId = line['gene_id']\n                geneName = line['gene_name']\n            except KeyError :\n                geneId = None\n                geneName = None\n                if verbose :\n                    printf('Warning: no gene_id/name found in line %s' % line)\n            \n            if geneId is not None :\n                if geneId not in store.genes :\n                    if len(store.genes) > batchSize :\n                        store.batch_save()\n                    \n                    if verbose > 0 :\n                        printf('\\tGene %s, %s...' % (geneId, geneName))\n                    store.genes[geneId] = Gene_Raba()\n                    store.genes[geneId].set(genome = genome, id = geneId, chromosome = store.chromosomes[chroNumber], name = geneName, strand = strand, biotype = gene_biotype)\n                if start < store.genes[geneId].start or store.genes[geneId].start is None :\n                    store.genes[geneId].start = start\n                if end > store.genes[geneId].end or store.genes[geneId].end is None :\n                    store.genes[geneId].end = end\n            try :\n                transId = line['transcript_id']\n                transName = line['transcript_name']\n                try :\n                    transcript_biotype = line['transcript_biotype']\n                except KeyError :\n                    transcript_biotype = None\n            except KeyError :\n                transId = None\n                transName = None\n                if verbose > 2 :\n                    printf('\\t\\tWarning: no transcript_id/name found in line %s' % line)\n            \n            if transId is not None :\n                if transId not in store.transcripts :\n                    if verbose > 1 :\n                        printf('\\t\\tTranscript %s, %s...' 
% (transId, transName))\n                    store.transcripts[transId] = Transcript_Raba()\n                    store.transcripts[transId].set(genome = genome, id = transId, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), name = transName, biotype=transcript_biotype)\n                if start < store.transcripts[transId].start or store.transcripts[transId].start is None:\n                    store.transcripts[transId].start = start\n                if end > store.transcripts[transId].end or store.transcripts[transId].end is None:\n                    store.transcripts[transId].end = end\n            \n                try :\n                    protId = line['protein_id']\n                except KeyError :\n                    protId = None\n                    if verbose > 2 :\n                        printf('Warning: no protein_id found in line %s' % line)\n\n                # Store selenocysteine positions in transcript\n                if regionType == 'Selenocysteine':\n                    store.transcripts[transId].selenocysteine.append(start)\n                        \n                if protId is not None and protId not in store.proteins :\n                    if verbose > 1 :\n                        printf('\\t\\tProtein %s...' % (protId))\n                    store.proteins[protId] = Protein_Raba()\n                    store.proteins[protId].set(genome = genome, id = protId, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), transcript = store.transcripts.get(transId, None), name = transName)\n                    store.transcripts[transId].protein = store.proteins[protId]\n\n                try :\n                    exonNumber = int(line['exon_number']) - 1\n                    exonKey = (transId, exonNumber)\n                except KeyError :\n                    exonNumber = None\n                    exonKey = None\n                    if verbose > 2 :\n                        printf('Warning: no exon number or id found in line %s' % line)\n                \n                if exonKey is not None :\n                    if verbose > 3 :\n                        printf('\\t\\t\\texon %s...' % (exonNumber))\n                    \n                    if exonKey not in store.exons and regionType == 'exon' :\n                        store.exons[exonKey] = Exon_Raba()\n                        store.exons[exonKey].set(genome = genome, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), transcript = store.transcripts.get(transId, None), protein = store.proteins.get(protId, None), strand = strand, number = exonNumber, start = start, end = end)\n                        store.transcripts[transId].exons.append(store.exons[exonKey])\n                    \n                    try :\n                        store.exons[exonKey].id = line['exon_id']\n                    except KeyError :\n                        pass\n                    \n                    if regionType == 'exon' :\n                        if start < store.exons[exonKey].start or store.exons[exonKey].start is None:\n                            store.exons[exonKey].start = start\n                        if end > store.exons[exonKey].end or store.exons[exonKey].end is None:\n                            store.exons[exonKey].end = end\n                    elif regionType == 'CDS' :\n                        store.exons[exonKey].CDS_start = start\n                        store.exons[exonKey].CDS_end = end\n                        store.exons[exonKey].frame = frame\n                    elif regionType == 'stop_codon' :\n                        if strand == '+' :\n                            if store.exons[exonKey].CDS_end != None :\n                                store.exons[exonKey].CDS_end += 3\n                                if store.exons[exonKey].end < store.exons[exonKey].CDS_end :\n                                    store.exons[exonKey].end = store.exons[exonKey].CDS_end\n                                if store.transcripts[transId].end < store.exons[exonKey].CDS_end :\n                                    store.transcripts[transId].end = store.exons[exonKey].CDS_end\n                                if store.genes[geneId].end < store.exons[exonKey].CDS_end :\n                                    store.genes[geneId].end = store.exons[exonKey].CDS_end\n                        if strand == '-' :\n                            if store.exons[exonKey].CDS_start != None :\n                                store.exons[exonKey].CDS_start -= 3\n                                if store.exons[exonKey].start > store.exons[exonKey].CDS_start :\n                                    store.exons[exonKey].start = store.exons[exonKey].CDS_start\n                                if store.transcripts[transId].start > store.exons[exonKey].CDS_start :\n                                    store.transcripts[transId].start = store.exons[exonKey].CDS_start\n                                if store.genes[geneId].start > store.exons[exonKey].CDS_start :\n                                    store.genes[geneId].start = store.exons[exonKey].CDS_start\n    pBar.close()\n    \n    store.batch_save()\n    \n    conf.db.beginTransaction()\n    printf('almost done 
saving chromosomes...')\n    store.save_chros()\n    \n    printf('saving genome object...')\n    genome.save()\n    conf.db.endTransaction()\n    \n    conf.db.beginTransaction()\n    printf('restoring core indexes...')\n    # Genome.ensureGlobalIndex(('name', 'species'))\n    # Chromosome.ensureGlobalIndex('genome')\n    # Gene.ensureGlobalIndex('genome')\n    # Transcript.ensureGlobalIndex('genome')\n    # Protein.ensureGlobalIndex('genome')\n    # Exon.ensureGlobalIndex('genome')\n    Transcript.ensureGlobalIndex('exons')\n    \n    printf('commiting changes...')\n    conf.db.endTransaction()\n    \n    conf.db.beginTransaction()\n    printf('restoring user indexes')\n    pBar = ProgressBar(label = \"restoring\", nbEpochs = len(indexes))\n    for idx in indexes :\n        pBar.update()\n        conf.db.execute(idx[-1].replace('CREATE INDEX', 'CREATE INDEX IF NOT EXISTS'))\n    pBar.close()\n    \n    printf('commiting changes...')\n    conf.db.endTransaction()\n    \n    return store.chromosomes.values()\n\n#~ @profile\ndef _importSequence(chromosome, fastaFile, targetDir) :\n    \"Serializes fastas into .dat files\"\n\n    f = gzip.open(fastaFile)\n    header = f.readline()\n    strRes = f.read().upper().replace('\\n', '').replace('\\r', '')\n    f.close()\n\n    fn = '%s/chromosome%s.dat' % (targetDir, chromosome.number)\n    f = open(fn, 'w')\n    f.write(strRes)\n    f.close()\n    chromosome.dataFile = fn\n    chromosome.header = header\n    return len(strRes)\n"
  },
  {
    "path": "pyGeno/importation/SNPs.py",
    "content": "import urllib, shutil\n\nfrom ConfigParser import SafeConfigParser\nimport pyGeno.configuration as conf\nfrom pyGeno.SNP import *\nfrom pyGeno.tools.ProgressBar import ProgressBar\nfrom pyGeno.tools.io import printf\nfrom Genomes import _decompressPackage, _getFile\n\nfrom pyGeno.tools.parsers.CasavaTools import SNPsTxtFile\nfrom pyGeno.tools.parsers.VCFTools import VCFFile\nfrom pyGeno.tools.parsers.CSVTools import CSVFile\n\ndef importSNPs(packageFile) :\n\t\"\"\"The big wrapper, this function should detect the SNP type by the package manifest and then launch the corresponding function.\n\tHere's an example of a SNP manifest file for Casava SNPs::\n\n\t\t[package_infos]\n\t\tdescription = Casava SNPs for testing purposes\n\t\tmaintainer = Tariq Daouda\n\t\tmaintainer_contact = tariq.daouda [at] umontreal\n\t\tversion = 1\n\n\t\t[set_infos]\n\t\tspecies = human\n\t\tname = dummySRY\n\t\ttype = Agnostic\n\t\tsource = my place at the IRIC\n\n\t\t[snps]\n\t\tfilename = snps.txt # as with genomes you can either include de file at the root of the package or specify an URL from where it must be downloaded\n\t\"\"\"\n\tprintf(\"Importing polymorphism set: %s... (This may take a while)\" % packageFile)\n\t\n\tisDir = False\n\tif not os.path.isdir(packageFile) :\n\t\tpackageDir = _decompressPackage(packageFile)\n\telse :\n\t\tisDir = True\n\t\tpackageDir = packageFile\n\n\tfpMan = os.path.normpath(packageDir+'/manifest.ini')\n\tif not os.path.isfile(fpMan) :\n\t\traise ValueError(\"Not file named manifest.ini! 
What a SCANDAL!!!!\")\n\n\tparser = SafeConfigParser()\n\tparser.read(os.path.normpath(packageDir+'/manifest.ini'))\n\tpackageInfos = parser.items('package_infos')\n\n\tsetName = parser.get('set_infos', 'name')\n\ttyp = parser.get('set_infos', 'type')\n\t\n\tif typ.lower()[-3:] != 'snp' :\n\t\ttyp += 'SNP'\n\n\tspecies = parser.get('set_infos', 'species').lower()\n\tgenomeSource = parser.get('set_infos', 'source')\n\tsnpsFileTmp = parser.get('snps', 'filename').strip()\n\tsnpsFile = _getFile(parser.get('snps', 'filename'), packageDir)\n\t\n\treturn_value = None\n\n\ttry :\n\t\tSMaster = SNPMaster(setName = setName)\n\texcept KeyError :\n\t\tif typ.lower() == 'casavasnp' :\n\t\t\treturn_value = _importSNPs_CasavaSNP(setName, species, genomeSource, snpsFile)\n\t\telif typ.lower() == 'dbsnpsnp' :\n\t\t\treturn_value = _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile)\n\t\telif typ.lower() == 'dbsnp' :\n\t\t\treturn_value = _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile)\n\t\telif typ.lower() == 'tophatsnp' :\n\t\t\treturn_value = _importSNPs_TopHatSNP(setName, species, genomeSource, snpsFile)\n\t\telif typ.lower() == 'agnosticsnp' :\n\t\t\treturn_value = _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile)\n\t\telse :\n\t\t\traise FutureWarning('Unknown SNP type in manifest %s' % typ)\n\telse :\n\t\traise KeyError(\"There's already a SNP set by the name %s. 
Use deleteSNPs() to remove it first\" % setName)\n\n\tif not isDir :\n\t\tshutil.rmtree(packageDir)\n\n\treturn return_value\n\ndef deleteSNPs(setName) :\n\t\"\"\"deletes a set of polymorphisms\"\"\"\n\tcon = conf.db\n\ttry :\n\t\tSMaster = SNPMaster(setName = setName)\n\t\tcon.beginTransaction()\n\t\tSNPType = SMaster.SNPType\n\t\tcon.delete(SNPType, 'setName = ?', (setName,))\n\t\tSMaster.delete()\n\t\tcon.endTransaction()\n\texcept KeyError :\n\t\traise KeyError(\"Can't delete the set %s because I can't find it in SNPMaster; maybe there's no set by that name\" % setName)\n\treturn True\n\ndef _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile) :\n\t\"This function will also create an index on start->chromosomeNumber->setName. Warning: pyGeno will interpret all positions as 0-based\"\n\tprintf('importing SNP set %s for species %s...' % (setName, species))\n\n\tsnpData = CSVFile()\n\tsnpData.parse(snpsFile, separator = \"\\t\")\n\n\tAgnosticSNP.dropIndex(('start', 'chromosomeNumber', 'setName'))\n\tconf.db.beginTransaction()\n\t\n\tpBar = ProgressBar(len(snpData))\n\tpLabel = ''\n\tcurrChrNumber = None\n\tfor snpEntry in snpData :\n\t\ttmpChr = snpEntry['chromosomeNumber']\n\t\tif tmpChr != currChrNumber :\n\t\t\tcurrChrNumber = tmpChr\n\t\t\tpLabel = 'Chr %s...' 
% currChrNumber\n\n\t\tsnp = AgnosticSNP()\n\t\tsnp.species = species\n\t\tsnp.setName = setName\n\t\tfor f in snp.getFields() :\n\t\t\ttry :\n\t\t\t\tsetattr(snp, f, snpEntry[f])\n\t\t\texcept KeyError :\n\t\t\t\tif f != 'species' and f != 'setName' :\n\t\t\t\t\tprintf(\"Warning: filetype has no key %s\", f)\n\t\tsnp.quality = float(snp.quality)\n\t\tsnp.start = int(snp.start)\n\t\tsnp.end = int(snp.end)\n\t\tsnp.save()\n\t\tpBar.update(label = pLabel)\n\n\tpBar.close()\n\t\n\tsnpMaster = SNPMaster()\n\tsnpMaster.set(setName = setName, SNPType = 'AgnosticSNP', species = species)\n\tsnpMaster.save()\n\n\tprintf('saving...')\n\tconf.db.endTransaction()\n\tprintf('creating indexes...')\n\tAgnosticSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName'))\n\tprintf('importation of SNP set %s for species %s done.' % (setName, species))\n\t\n\treturn True\n\ndef _importSNPs_CasavaSNP(setName, species, genomeSource, snpsFile) :\n\t\"This function will also create an index on start->chromosomeNumber->setName. Warning: pyGeno positions are 0-based\"\n\tprintf('importing SNP set %s for species %s...' % (setName, species))\n\n\tsnpData = SNPsTxtFile(snpsFile)\n\t\n\tCasavaSNP.dropIndex(('start', 'chromosomeNumber', 'setName'))\n\tconf.db.beginTransaction()\n\t\n\tpBar = ProgressBar(len(snpData))\n\tpLabel = ''\n\tcurrChrNumber = None\n\tfor snpEntry in snpData :\n\t\ttmpChr = snpEntry['chromosomeNumber']\n\t\tif tmpChr != currChrNumber :\n\t\t\tcurrChrNumber = tmpChr\n\t\t\tpLabel = 'Chr %s...' 
% currChrNumber\n\n\t\tsnp = CasavaSNP()\n\t\tsnp.species = species\n\t\tsnp.setName = setName\n\t\t\n\t\tfor f in snp.getFields() :\n\t\t\ttry :\n\t\t\t\tsetattr(snp, f, snpEntry[f])\n\t\t\texcept KeyError :\n\t\t\t\tif f != 'species' and f != 'setName' :\n\t\t\t\t\tprintf(\"Warning: filetype has no key %s\", f)\n\t\tsnp.start -= 1\n\t\tsnp.end -= 1\n\t\tsnp.save()\n\t\tpBar.update(label = pLabel)\n\n\tpBar.close()\n\t\n\tsnpMaster = SNPMaster()\n\tsnpMaster.set(setName = setName, SNPType = 'CasavaSNP', species = species)\n\tsnpMaster.save()\n\n\tprintf('saving...')\n\tconf.db.endTransaction()\n\tprintf('creating indexes...')\n\tCasavaSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName'))\n\tprintf('importation of SNP set %s for species %s done.' % (setName, species))\n\t\n\treturn True\n\ndef _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile) :\n\t\"This function will also create an index on start->chromosomeNumber->setName. Warning: pyGeno positions are 0-based\"\n\tsnpData = VCFFile(snpsFile, gziped = True, stream = True)\n\tdbSNPSNP.dropIndex(('start', 'chromosomeNumber', 'setName'))\n\tconf.db.beginTransaction()\n\tpBar = ProgressBar()\n\tpLabel = ''\n\tfor snpEntry in snpData :\n\t\tpBar.update(label = 'Chr %s, %s...' 
%  (snpEntry['#CHROM'], snpEntry['ID']))\n\t\t\n\t\tsnp = dbSNPSNP()\n\t\tfor f in snp.getFields() :\n\t\t\ttry :\n\t\t\t\tsetattr(snp, f, snpEntry[f])\n\t\t\texcept KeyError :\n\t\t\t\tpass\n\t\tsnp.chromosomeNumber = snpEntry['#CHROM']\n\t\tsnp.species = species\n\t\tsnp.setName = setName\n\t\tsnp.start = snpEntry['POS']-1\n\t\tsnp.alt = snpEntry['ALT']\n\t\tsnp.ref = snpEntry['REF']\n\t\tsnp.end = snp.start+len(snp.alt)\n\t\tsnp.save()\n\t\n\tpBar.close()\n\t\n\tsnpMaster = SNPMaster()\n\tsnpMaster.set(setName = setName, SNPType = 'dbSNPSNP', species = species)\n\tsnpMaster.save()\n\t\n\tprintf('saving...')\n\tconf.db.endTransaction()\n\tprintf('creating indexes...')\n\tdbSNPSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName'))\n\tprintf('importation of SNP set %s for species %s done.' %(setName, species))\n\n\treturn True\n\t\ndef _importSNPs_TopHatSNP(setName, species, genomeSource, snpsFile) :\n\traise FutureWarning('Not implemented yet')\n"
  },
  {
    "path": "pyGeno/importation/__init__.py",
    "content": "__all__ = ['Genomes', 'SNPs']\n"
  },
  {
    "path": "pyGeno/pyGenoObjectBases.py",
    "content": "import time, types, string\nimport configuration as conf\nfrom rabaDB.rabaSetup import *\nfrom rabaDB.Raba import *\nfrom rabaDB.filters import RabaQuery\n\ndef nosave() :\n\traise ValueError('You can only save object that are linked to reference genomes')\n\nclass pyGenoRabaObject(Raba) :\n\t\"\"\"pyGeno uses rabaDB to persistenly store data. Most persistent \n\tobjects have classes that inherit from this one (Genome_Raba, \n\tChromosome_Raba, Gene_Raba, Protein_Raba, Exon_Raba). Theses classes \n\tare not mean to be accessed directly. Users manipulate wrappers \n\tsuch as : Genome, Chromosome etc... pyGenoRabaObject extends \n\tthe Raba class by adding a function _curate that is called just \n\tbefore saving. This class is to be considered abstract, and is not \n\tmeant to be instanciated\"\"\"\n\n\t_raba_namespace = conf.pyGeno_RABA_NAMESPACE\n\t_raba_abstract = True # not saved in db by default\n\n\tdef __init__(self) :\n\t\tif self is pyGenoRabaObject :\n\t\t\traise TypeError(\"This class is abstract\")\n\t\n\tdef _curate(self) :\n\t\t\"Last operations performed before saving, must be implemented in child\"\n\t\traise TypeError(\"This method is abstract and should be implemented in child\")\n\n\tdef save(self) :\n\t\t\"\"\"Calls _curate() before performing a normal rabaDB lazy save \n\t\t(saving only occurs if the object has been modified)\"\"\"\n\t\t\n\t\tif self.mutated() :\n\t\t\tself._curate()\n\t\tRaba.save(self)\n\nclass pyGenoRabaObjectWrapper_metaclass(type) :\n\t\"\"\"This metaclass keeps track of the relationship between wrapped \n\tclasses and wrappers \"\"\"\n\t_wrappers = {}\n\n\tdef __new__(cls, name, bases, dct) :\n\t\tclsObj = type.__new__(cls, name, bases, dct)\n\t\tcls._wrappers[dct['_wrapped_class']] = clsObj\n\t\treturn clsObj\n\nclass RLWrapper(object) :\n\t\"\"\"A wrapper for rabalists that replaces raba objects by pyGeno Object\"\"\"\n\tdef __init__(self, rabaObj, listObjectType, rl) :\n\t\tself.rabaObj = 
rabaObj\n\t\tself.rl = rl\n\t\tself.listObjectType = listObjectType\n\n\tdef __getitem__(self, i) :\n\t\treturn self.listObjectType(wrapped_object_and_bag = (self.rl[i], self.rabaObj.bagKey))\n\t\n\tdef __getattr__(self, name) :\n\t\trl = object.__getattribute__(self, 'rl')\n\t\treturn getattr(rl, name)\n\nclass pyGenoRabaObjectWrapper(object) :\n\t\"\"\"All the wrapper classes such as Genome and Chromosome inherit \n\tfrom this class. It has most of the methods that make pyGeno useful, such as \n\tget(), count(), ensureIndex(). This class is to be considered \n\tabstract, and is not meant to be instantiated\"\"\"\n\t__metaclass__ = pyGenoRabaObjectWrapper_metaclass\n\n\t_wrapped_class = None\n\t_bags = {}\n\n\tdef __init__(self, wrapped_object_and_bag = (), *args, **kwargs) :\n\t\tif self is pyGenoRabaObjectWrapper :\n\t\t\traise TypeError(\"This class is abstract\")\n\t\tif wrapped_object_and_bag != () :\n\t\t\tassert wrapped_object_and_bag[0]._rabaClass is self._wrapped_class\n\t\t\tself.wrapped_object = wrapped_object_and_bag[0]\n\t\t\tself.bagKey = wrapped_object_and_bag[1]\n\t\t\tpyGenoRabaObjectWrapper._bags[self.bagKey][self._getObjBagKey(self.wrapped_object)] = self\n\t\telse :\n\t\t\tself.wrapped_object = self._wrapped_class(*args, **kwargs)\n\t\t\tself.bagKey = time.time()\n\t\t\tpyGenoRabaObjectWrapper._bags[self.bagKey] = {}\n\t\t\tpyGenoRabaObjectWrapper._bags[self.bagKey][self._getObjBagKey(self.wrapped_object)] = self\n\n\t\tself._load_sequencesTriggers = set()\n\t\tself.loadSequences = True\n\t\tself.loadData = True\n\t\tself.loadBinarySequences = True\n\n\tdef _getObjBagKey(self, obj) :\n\t\t\"\"\"pyGeno objects are kept in bags to ensure that reference \n\t\tobjects are loaded only once. 
This function returns the bag key \n\t\tof the current object\"\"\"\n\t\treturn (obj._rabaClass.__name__, obj.raba_id)\n\n\tdef _makeLoadQuery(self, objectType, *args, **coolArgs) :\n\t\t# conf.db.enableDebug(True)\n\t\tf = RabaQuery(objectType._wrapped_class, namespace = self._wrapped_class._raba_namespace)\n\t\tcoolArgs[self._wrapped_class.__name__[:-5]] = self.wrapped_object #[:-5] removes _Raba from class name\n\n\t\tif len(args) > 0 and type(args[0]) is types.ListType :\n\t\t\tfor a in args[0] :\n\t\t\t\tif type(a) is types.DictType :\n\t\t\t\t\tf.addFilter(**a)\n\t\telse :\n\t\t\tf.addFilter(*args, **coolArgs)\n\n\t\treturn f\n\n\tdef count(self, objectType, *args, **coolArgs) :\n\t\t\"\"\"Returns the number of elements satisfying the query\"\"\"\n\t\treturn self._makeLoadQuery(objectType, *args, **coolArgs).count()\n\n\tdef get(self, objectType, *args, **coolArgs) :\n\t\t\"\"\"Raba Magic inside. This is the function that you use for \n\t\tquerying pyGeno's DB.\n\t\t\n\t\tUsage examples:\n\t\t\n\t\t\t* myGenome.get(\"Gene\", name = 'TPST2')\n\t\t\n\t\t\t* myGene.get(Protein, id = 'ENSID...')\n\t\t\n\t\t\t* myGenome.get(Transcript, {'start >' : x, 'end <' : y})\"\"\"\n\t\t\n\t\tret = []\n\t\tfor e in self._makeLoadQuery(objectType, *args, **coolArgs).iterRun() :\n\t\t\tif issubclass(objectType, pyGenoRabaObjectWrapper) :\n\t\t\t\tret.append(objectType(wrapped_object_and_bag = (e, self.bagKey)))\n\t\t\telse :\n\t\t\t\tret.append(e)\n\n\t\treturn ret\n\n\tdef iterGet(self, objectType, *args, **coolArgs) :\n\t\t\"\"\"Same as get. 
But returns the elements one by one, much more efficient for large outputs\"\"\"\n\n\t\tfor e in self._makeLoadQuery(objectType, *args, **coolArgs).iterRun() :\n\t\t\tif issubclass(objectType, pyGenoRabaObjectWrapper) :\n\t\t\t\tyield objectType(wrapped_object_and_bag = (e, self.bagKey))\n\t\t\telse :\n\t\t\t\tyield e\n\n\t#~ def ensureIndex(self, fields) :\n\t\t#~ \"\"\"\n\t\t#~ Warning: May not work on some systems, see ensureGlobalIndex\n\t\t#~ \n\t\t#~ Creates a partial index on self (if it does not exist). \n\t\t#~ Ex: myTranscript.ensureIndex('name')\"\"\"\n\t\t#~ \n\t\t#~ where, whereValues = '%s=?' %(self._wrapped_class.__name__[:-5]), self.wrapped_object\n\t\t#~ self._wrapped_class.ensureIndex(fields, where, (whereValues,))\n\n\t#~ def dropIndex(self, fields) :\n\t\t#~ \"\"\"Warning: May not work on some systems, see dropGlobalIndex\n\t\t#~ \n\t\t#~ Drops a partial index on self. Ex: myTranscript.dropIndex('name')\"\"\"\n\n\t\t#~ where, whereValues = '%s=?' %(self._wrapped_class.__name__[:-5]), self.wrapped_object\n\t\t#~ self._wrapped_class.dropIndex(fields, where, (whereValues,))\n\t\n\tdef __getattr__(self, name) :\n\t\t\"\"\"If a wrapper does not have a specific field, pyGeno will \n\t\tlook for it in the wrapped_object\"\"\"\n\t\t# print \"pyGenoObjectBases __getattr__ : \" + name + \" from \" + str(type(self))\n\n\t\t\n\t\tif name == 'save' or name == 'delete' :\n\t\t\traise AttributeError(\"You can't delete or save an object from a wrapper, try .wrapped_object.delete()/save()\")\n\t\t\n\t\tif name in self._load_sequencesTriggers and self.loadSequences :\n\t\t\tself.loadSequences = False\n\t\t\tself._load_sequences()\n\t\t\treturn getattr(self, name)\n\n\t\tif name in self._load_sequencesTriggers and self.loadData :\n\t\t\tself.loadData = False\n\t\t\tself._load_data()\n\t\t\treturn getattr(self, name)\n\t\t\t\n\t\tif name[:4] == 'bin_' and self.loadBinarySequences :\n\t\t\tself.updateBinarySequence = False\n\t\t\tself._load_bin_sequence()\n\t\t\treturn 
getattr(self, name)\n\t\t\n\t\tattr = getattr(self.wrapped_object, name)\n\t\tif isRabaObject(attr) :\n\t\t\tattrKey = self._getObjBagKey(attr)\n\t\t\tif attrKey in pyGenoRabaObjectWrapper._bags[self.bagKey] :\n\t\t\t\tretObj = pyGenoRabaObjectWrapper._bags[self.bagKey][attrKey]\n\t\t\telse :\n\t\t\t\twCls = pyGenoRabaObjectWrapper_metaclass._wrappers[attr._rabaClass]\n\t\t\t\tretObj = wCls(wrapped_object_and_bag = (attr, self.bagKey))\n\t\t\treturn retObj\n\t\treturn attr\n\n\t@classmethod\n\tdef getIndexes(cls) :\n\t\t\"\"\"Returns a list of indexes attached to the object's class. Ex \n\t\tTranscript.getIndexes()\"\"\"\n\t\treturn cls._wrapped_class.getIndexes()\n\n\t@classmethod\n\tdef flushIndexes(cls) :\n\t\t\"\"\"Drops all the indexes attached to the object's class. Ex \n\t\tTranscript.flushIndexes()\"\"\"\n\t\treturn cls._wrapped_class.flushIndexes()\n\t\n\t@classmethod\n\tdef help(cls) :\n\t\t\"\"\"Returns a list of available fields for queries. Ex \n\t\tTranscript.help()\"\"\"\n\t\treturn cls._wrapped_class.help().replace(\"_Raba\", \"\")\n\n\t@classmethod\n\tdef ensureGlobalIndex(cls, fields) :\n\t\t\"\"\"Adds a GLOBAL index to the db to speed up lookups. Fields can be a \n\t\tlist of fields for multi-column indices or simply the name of a \n\t\tsingle field. 
A global index is an index on the entire type.\n\t\tA global index on 'Transcript' on field 'name', will index the names for all the transcripts in the database\"\"\"\n\t\tcls._wrapped_class.ensureIndex(fields)\n\n\t@classmethod\n\tdef dropGlobalIndex(cls, fields) :\n\t\t\"\"\"Drops an index, the opposite of ensureGlobalIndex()\"\"\"\n\t\tcls._wrapped_class.dropIndex(fields)\n\n\tdef getSequencesData(self) :\n\t\t\"\"\"This lazy abstract function is only called if the object \n\t\tsequences need to be loaded\"\"\"\n\t\traise NotImplementedError(\"This fct loads non binary sequences and should be implemented in child if needed\")\n\n\tdef _load_sequences(self) :\n\t\t\"\"\"This lazy abstract function is only called if the object \n\t\tsequences need to be loaded\"\"\"\n\t\tself._load_data()\n\n\tdef _load_data(self) :\n\t\t\"\"\"This lazy abstract function is only called if the object \n\t\tsequences need to be loaded\"\"\"\n\t\traise NotImplementedError(\"This fct loads non binary sequences and should be implemented in child if needed\")\n\t\n\tdef _load_bin_sequence(self) :\n\t\t\"\"\"Same as _load_sequences(), but loads binary sequences\"\"\"\n\t\traise NotImplementedError(\"This fct loads binary sequences and should be implemented in child if needed\")\n"
  },
  {
    "path": "pyGeno/tests/__init__.py",
    "content": ""
  },
  {
    "path": "pyGeno/tests/test_csv.py",
    "content": "import unittest\nfrom pyGeno.tools.parsers.CSVTools import *\n\nclass CSVTests(unittest.TestCase):\n\t\t\n\tdef setUp(self):\n\t\tpass\n\n\tdef tearDown(self):\n\t\tpass\n\n\tdef test_createParse(self) :\n\t\ttestVals = [\"test\", \"test2\"]\n\t\tc = CSVFile(legend = [\"col1\", \"col2\"], separator = \"\\t\")\n\t\tl = c.newLine()\n\t\tl[\"col1\"] = testVals[0]\n\t\tl = c.newLine()\n\t\tl[\"col1\"] = testVals[1]\n\t\tc.save(\"test.csv\")\n\t\t# print \"----\", l\n\t\tc2 = CSVFile()\n\t\tc2.parse(\"test.csv\", separator = \"\\t\")\n\t\ti = 0\n\t\tfor l in c2 :\n\t\t\tself.assertEqual(l[\"col1\"], testVals[i])\n\t\t\ti += 1\n\ndef runTests() :\n\tunittest.main()\n\nif __name__ == \"__main__\" :\n\trunTests()\n"
  },
  {
    "path": "pyGeno/tests/test_genome.py",
    "content": "import unittest\nfrom pyGeno.Genome import *\n\nimport pyGeno.bootstrap as B\nfrom pyGeno.importation.Genomes import *\nfrom pyGeno.importation.SNPs import *\n\nclass pyGenoSNPTests(unittest.TestCase):\n\n\tdef setUp(self):\n\t\t# try :\n\t\t# \tB.importGenome(\"Human.GRCh37.75_Y-Only.tar.gz\")\n\t\t# except KeyError :\n\t\t# \tdeleteGenome(\"human\", \"GRCh37.75_Y-Only\")\n\t\t# \tB.importGenome(\"Human.GRCh37.75_Y-Only.tar.gz\")\n\t\t# \tprint \"--> Seems to already exist in db\"\n     \n\t\t# try :\n\t\t# \tB.importSNPs(\"Human_agnostic.dummySRY.tar.gz\")\n\t\t# except KeyError :\n\t\t# \tdeleteSNPs(\"dummySRY_AGN\")\n\t\t# \tB.importSNPs(\"Human_agnostic.dummySRY.tar.gz\")\n\t\t# \tprint \"--> Seems to already exist in db\"\n\t\t\n\t\t# try :\n\t\t# \tB.importSNPs(\"Human_agnostic.dummySRY_indels\")\n\t\t# except KeyError :\n\t\t# \tdeleteSNPs(\"dummySRY_AGN_indels\")\n\t\t# \tB.importSNPs(\"Human_agnostic.dummySRY_indels\")\n\t\t# \tprint \"--> Seems to already exist in db\"\n\t\tself.ref = Genome(name = 'GRCh37.75_Y-Only')\n\n\tdef tearDown(self):\n\t\tpass\n\n\t# @unittest.skip(\"skipping\")\n\tdef test_vanilla(self) :\n\t\tdummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN')\n\t\tpersProt = dummy.get(Protein, id = 'ENSP00000438917')[0]\n\t\trefProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]\n\n\t\tself.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20])\n\t\tself.assertEqual('HTGCAATCATATGCTTCTGC', persProt.transcript.cDNA[:20])\n\n\t# @unittest.skip(\"skipping\")\n\tdef test_noModif(self) :\n\t\tfrom pyGeno.SNPFiltering import SNPFilter\n\n\t\tclass MyFilter(SNPFilter) :\n\t\t\tdef __init__(self) :\n\t\t\t\tSNPFilter.__init__(self)\n\n\t\t\tdef filter(self, chromosome, dummySRY_AGN) :\n\t\t\t\treturn None\n\n\t\tdummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())\n\t\tpersProt = dummy.get(Protein, id = 'ENSP00000438917')[0]\n\t\trefProt = self.ref.get(Protein, id = 
'ENSP00000438917')[0]\n\n\t\tself.assertEqual(persProt.transcript.cDNA[:20], refProt.transcript.cDNA[:20])\n\n\t# @unittest.skip(\"skipping\")\n\tdef test_insert(self) :\n\t\tfrom pyGeno.SNPFiltering import SNPFilter\n\n\t\tclass MyFilter(SNPFilter) :\n\t\t\tdef __init__(self) :\n\t\t\t\tSNPFilter.__init__(self)\n\n\t\t\tdef filter(self, chromosome, dummySRY_AGN) :\n\t\t\t\tfrom pyGeno.SNPFiltering import  SequenceInsert\n\t\t\n\t\t\t\trefAllele = chromosome.refSequence[dummySRY_AGN.start]\n\t\t\t\treturn SequenceInsert('XXX')\n\n\t\tdummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())\n\t\tpersProt = dummy.get(Protein, id = 'ENSP00000438917')[0]\n\t\trefProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]\n\t\tself.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20])\n\t\tself.assertEqual('XXXATGCAATCATATGCTTC', persProt.transcript.cDNA[:20])\n\n\t# @unittest.skip(\"skipping\")\n\tdef test_SNP(self) :\n\t\tfrom pyGeno.SNPFiltering import SNPFilter\n\n\t\tclass MyFilter(SNPFilter) :\n\t\t\tdef __init__(self) :\n\t\t\t\tSNPFilter.__init__(self)\n\n\t\t\tdef filter(self, chromosome, dummySRY_AGN) :\n\t\t\t\tfrom pyGeno.SNPFiltering import SequenceSNP\n\t\n\t\t\t\treturn SequenceSNP(dummySRY_AGN.alt)\n\n\t\tdummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())\n\t\tpersProt = dummy.get(Protein, id = 'ENSP00000438917')[0]\n\n\t\trefProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]\n\t\tself.assertEqual('M', refProt.sequence[0])\n\t\tself.assertEqual('L', persProt.sequence[0])\n\n\t# @unittest.skip(\"skipping\")\n\tdef test_deletion(self) :\n\t\tfrom pyGeno.SNPFiltering import SNPFilter\n\n\t\tclass MyFilter(SNPFilter) :\n\t\t\tdef __init__(self) :\n\t\t\t\tSNPFilter.__init__(self)\n\n\t\t\tdef filter(self, chromosome, dummySRY_AGN) :\n\t\t\t\tfrom pyGeno.SNPFiltering import SequenceDel\n\t\t\n\t\t\t\trefAllele = chromosome.refSequence[dummySRY_AGN.start]\n\t\t\t\treturn 
SequenceDel(1)\n\n\t\tdummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())\n\t\tpersProt = dummy.get(Protein, id = 'ENSP00000438917')[0]\n\t\trefProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]\n\t\tself.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20])\n\t\tself.assertEqual('TGCAATCATATGCTTCTGCT', persProt.transcript.cDNA[:20])\n\n\t# @unittest.skip(\"skipping\")\n\tdef test_indels(self) :\n\t\tfrom pyGeno.SNPFiltering import SNPFilter\n\n\t\tclass MyFilter(SNPFilter) :\n\t\t\tdef __init__(self) :\n\t\t\t\tSNPFilter.__init__(self)\n\n\t\t\tdef filter(self, chromosome, dummySRY_AGN_indels) :\n\t\t\t\tfrom pyGeno.SNPFiltering import  SequenceInsert\n\t\t\t\tret = \"\"\n\t\t\t\tfor s in dummySRY_AGN_indels :\n\t\t\t\t\tret += \"X\"\n\t\t\t\treturn SequenceInsert(ret)\n\n\t\tdummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN_indels', SNPFilter = MyFilter())\n\t\tpersProt = dummy.get(Protein, id = 'ENSP00000438917')[0]\n\t\trefProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]\n\t\tself.assertEqual('XXXATGCAATCATATGCTTC', persProt.transcript.cDNA[:20])\n\n\t# @unittest.skip(\"skipping\")\n\tdef test_bags(self) :\n\t\tdummy = Genome(name = 'GRCh37.75_Y-Only')\n\t\tself.assertEqual(dummy.wrapped_object, self.ref.wrapped_object)\n\n\t# @unittest.skip(\"skipping\")\n\tdef test_prot_find(self) :\n\t\tprot = self.ref.get(Protein, id = 'ENSP00000438917')[0]\n\t\tneedle = prot.sequence[:10]\n\t\tself.assertEqual(0, prot.find(needle))\n\t\tneedle = prot.sequence[-10:]\n\t\tself.assertEqual(len(prot)-10, prot.find(needle))\n\n\t# @unittest.skip(\"skipping\")\n\tdef test_trans_find(self) :\n\t\ttrans = self.ref.get(Transcript, name = \"SRY-001\")[0]\n\t\tself.assertEqual(0, trans.find(trans[:5]))\n\n\t# @unittest.skip(\"remote server down\")\n\t# def test_import_remote_genome(self) :\n\t\t# self.assertRaises(KeyError, B.importRemoteGenome, \"Human.GRCh37.75_Y-Only.tar.gz\")\n\n\t# @unittest.skip(\"remote 
server down\")\n\t# def test_import_remote_snps(self) :\n\t\t# self.assertRaises(KeyError, B.importRemoteSNPs, \"Human_agnostic.dummySRY.tar.gz\")\n\ndef runTests() :\n\ttry :\n\t\tB.importGenome(\"Human.GRCh37.75_Y-Only.tar.gz\")\n\texcept KeyError :\n\t\tdeleteGenome(\"human\", \"GRCh37.75_Y-Only\")\n\t\tB.importGenome(\"Human.GRCh37.75_Y-Only.tar.gz\")\n\t\tprint \"--> Seems to already exist in db\"\n \n\ttry :\n\t\tB.importSNPs(\"Human_agnostic.dummySRY.tar.gz\")\n\texcept KeyError :\n\t\tdeleteSNPs(\"dummySRY_AGN\")\n\t\tB.importSNPs(\"Human_agnostic.dummySRY.tar.gz\")\n\t\tprint \"--> Seems to already exist in db\"\n\t\n\ttry :\n\t\tB.importSNPs(\"Human_agnostic.dummySRY_indels\")\n\texcept KeyError :\n\t\tdeleteSNPs(\"dummySRY_AGN_indels\")\n\t\tB.importSNPs(\"Human_agnostic.dummySRY_indels\")\n\t\tprint \"--> Seems to already exist in db\"\n\t# import time\n\t# time.sleep(10)\n\tunittest.main()\n\nif __name__ == \"__main__\" :\n\trunTests()\n"
  },
  {
    "path": "pyGeno/tools/BinarySequence.py",
    "content": "import array, copy\nimport UsefulFunctions as uf\n\nclass BinarySequence :\n\t\"\"\"A class for representing sequences in a binary format\"\"\"\n\n        ALPHABETA_SIZE = 32\n        ALPHABETA_KMP = range(ALPHABETA_SIZE)\n        \n\tdef __init__(self, sequence, arrayForma, charToBinDict) :\n\t\n\t\tself.forma = arrayForma\n\t\tself.charToBin = charToBinDict\n\t\tself.sequence = sequence\n\t\t\n\t\tself.binSequence, self.defaultSequence, self.polymorphisms = self.encode(sequence)\n\t\tself.itemsize = self.binSequence.itemsize\n\t\tself.typecode = self.binSequence.typecode\n\t\t#print 'bin', len(self.sequence), len(self.binSequence)\n\n\tdef encode(self, sequence):\n\t\t\"\"\"Returns a tuple (binary reprensentation, default sequence, polymorphisms list)\"\"\"\n\t\t\n\t\tpolymorphisms = []\n\t\tdefaultSequence = '' \n\t\tbinSequence = array.array(self.forma.typecode)\n\t\tb = 0\n\t\ti = 0\n\t\ttrueI = 0 #not inc in case if poly\n\t\tpoly = set()\n\t\twhile i < len(sequence)-1:\n\t\t\tb = b | self.forma[self.charToBin[sequence[i]]]\n\t\t\tif sequence[i+1] == '/' :\n\t\t\t\tpoly.add(sequence[i])\n\t\t\t\ti += 2\n\t\t\telse :\n\t\t\t\tbinSequence.append(b)\n\t\t\t\tif len(poly) > 0 :\n\t\t\t\t\tpoly.add(sequence[i])\n\t\t\t\t\tpolymorphisms.append((trueI, poly))\n\t\t\t\t\tpoly = set()\n\t\t\t\t\n\t\t\t\tbb = 0\n\t\t\t\twhile b % 2 != 0 :\n\t\t\t\t\tb = b/2\n\t\t\t\t\t\n\t\t\t\tdefaultSequence += sequence[i]\n\t\t\t\tb = 0\n\t\t\t\ti += 1\n\t\t\t\ttrueI += 1\n\t\t\n\t\tif i < len(sequence) :\n\t\t\tb = b | self.forma[self.charToBin[sequence[i]]]\n\t\t\tbinSequence.append(b)\n\t\t\tif len(poly) > 0 :\n\t\t\t\tif sequence[i] not in poly :\n\t\t\t\t\tpoly.add(sequence[i])\n\t\t\t\tpolymorphisms.append((trueI, poly))\n\t\t\tdefaultSequence += sequence[i]\n\t\t\n\t\treturn (binSequence, defaultSequence, polymorphisms)\n\n\tdef __testFind(self, arr) :\n\t\tif len(arr)  == 0:\n\t\t\traise TypeError ('binary find, needle is empty')\n\t\tif arr.itemsize != 
self.itemsize :\n\t\t\traise TypeError('binary find, both arrays must have same item size, arr: %d, self: %d' % (arr.itemsize, self.itemsize))\n\t\n\tdef __testBinary(self, arr) :\n\t\tif len(arr) != len(self) :\n\t\t\traise TypeError('bin operator, both arrays must be of same length')\n\t\tif arr.itemsize != self.itemsize :\n\t\t\traise TypeError('bin operator, both arrays must have same item size, arr: %d, self: %d' % (arr.itemsize, self.itemsize))\n\t\n\tdef findPolymorphisms(self, strSeq, strict = False):\n\t\t\"\"\"\n\t\tCompares strSeq with self.sequence.\n\t\tIf not 'strict', this function ignores the cases of matching heterozygosity (ex: for a given position i, strSeq[i] = A and self.sequence[i] = 'A/G'). If 'strict', it returns all positions where strSeq differs from self.sequence\n\t\t\"\"\"\n\t\tarr = self.encode(strSeq)[0]\n\t\tres = []\n\t\tif not strict :\n\t\t\tfor i in range(len(arr)+len(self)) :\n\t\t\t\tif i >= len(arr) or i > len(self) :\n\t\t\t\t\tbreak\n\t\t\t\tif arr[i] & self[i] == 0:\n\t\t\t\t\tres.append(i)\n\t\telse :\n\t\t\tfor i in range(len(arr)+len(self)) :\n\t\t\t\tif i >= len(arr) or i > len(self) :\n\t\t\t\t\tbreak\n\t\t\t\tif arr[i] != self[i] :\n\t\t\t\t\tres.append(i)\n\t\treturn res\n\t\t\n\tdef getPolymorphisms(self):\n\t\t\"\"\"returns all polymorphisms as a list of (position, alleles) couples\"\"\"\n\t\treturn self.polymorphisms\n\t\n\tdef getDefaultSequence(self) :\n\t\t\"\"\"returns a default version of the sequence where only the last allele of each polymorphism is shown\"\"\"\n\t\treturn self.defaultSequence\n\t\n\tdef __getSequenceVariants(self, x1, polyStart, polyStop, listSequence) :\n\t\t\"\"\"polyStop is the index of the polymorphism at which the computation of combinations stops\"\"\"\n\t\tif polyStart < len(self.polymorphisms) and polyStart < polyStop: \n\t\t\tsequence = copy.copy(listSequence)\n\t\t\tret = []\n\t\t\t\n\t\t\tpk = self.polymorphisms[polyStart]\n\t\t\tposInSequence = pk[0]-x1\n\t\t\tif posInSequence < 
len(listSequence) : \n\t\t\t\tfor allele in pk[1] :\n\t\t\t\t\tsequence[posInSequence] = allele\n\t\t\t\t\tret.extend(self.__getSequenceVariants(x1, polyStart+1, polyStop, sequence))\n\t\t\t\n\t\t\treturn ret\n\t\telse :\n\t\t\treturn [''.join(listSequence)]\n\n\tdef getSequenceVariants(self, x1 = 0, x2 = -1, maxVariantNumber = 128) :\n\t\t\"\"\"returns the sequences resulting from all combinations of all polymorphisms between x1 and x2. The result is a couple (bool, list of sequence variants between x1 and x2); the bool is True if there are more combinations than maxVariantNumber\"\"\"\n\t\tif x2 == -1 :\n\t\t\txx2 = len(self.defaultSequence)\n\t\telse :\n\t\t\txx2 = x2\n\t\t\n\t\tpolyStart = None\n\t\tnbP = 1\n\t\tstopped = False\n\t\tj = 0\n\t\tfor p in self.polymorphisms :\n\t\t\tif p[0] >= xx2 :\n\t\t\t\tbreak\n\t\t\t\n\t\t\tif x1 <= p[0] :\n\t\t\t\tif polyStart is None :\n\t\t\t\t\tpolyStart = j\n\t\t\t\t\n\t\t\t\tnbP *= len(p[1])\n\t\t\t\t\n\t\t\t\tif nbP > maxVariantNumber :\n\t\t\t\t\tstopped = True\n\t\t\t\t\tbreak\n\t\t\tj += 1\n\t\t\n\t\tif polyStart is None :\n\t\t\treturn (stopped, [self.defaultSequence[x1:xx2]])\n\t\t\n\t\treturn (stopped, self.__getSequenceVariants(x1, polyStart, j, list(self.defaultSequence[x1:xx2])))\n\n\tdef getNbVariants(self, x1, x2 = -1) :\n\t\t\"\"\"returns the number of sequence variants between x1 and x2\"\"\"\n\t\tif x2 == -1 :\n\t\t\txx2 = len(self.defaultSequence)\n\t\telse :\n\t\t\txx2 = x2\n\t\t\n\t\tnbP = 1\n\t\tfor p in self.polymorphisms:\n\t\t\tif x1 <= p[0] and p[0] <= xx2 :\n\t\t\t\tnbP *= len(p[1])\n\t\t\n\t\treturn nbP\n\n\tdef _dichFind(self, needle, currHaystack, offset, lst = None) :\n\t\t\"\"\"dichotomic search. If lst is None, will return the first position found. If it's a list, will return a list of all positions in lst. 
returns -1 or [] if no match found\"\"\"\n\t\t\n\t\tif len(currHaystack) == 1 :\n\t\t\tif (offset <= (len(self) - len(needle))) and (currHaystack[0] & needle[0]) > 0 and (self[offset+len(needle)-1] & needle[-1]) > 0 :\n\t\t\t\tfound = True\n\t\t\t\tfor i in xrange(1, len(needle)-1) :\n\t\t\t\t\tif self[offset + i] & needle[i] == 0 :\n\t\t\t\t\t\tfound = False\n\t\t\t\t\t\tbreak\n\t\t\t\tif found :\n\t\t\t\t\tif lst is not None :\n\t\t\t\t\t\tlst.append(offset)\n\t\t\t\t\telse :\n\t\t\t\t\t\treturn offset\n\t\t\t\telse :\n\t\t\t\t\tif lst is None :\n\t\t\t\t\t\treturn -1\n\t\telse :\n\t\t\tif (offset <= (len(self) - len(needle))) :\n\t\t\t\tif lst is not None :\n\t\t\t\t\tself._dichFind(needle, currHaystack[:len(currHaystack)/2], offset, lst)\n\t\t\t\t\tself._dichFind(needle, currHaystack[len(currHaystack)/2:], offset + len(currHaystack)/2, lst)\n\t\t\t\telse :\n\t\t\t\t\tv1 = self._dichFind(needle, currHaystack[:len(currHaystack)/2], offset, lst)\n\t\t\t\t\tif v1 > -1 :\n\t\t\t\t\t\treturn v1\n\t\t\t\t\t\n\t\t\t\t\treturn self._dichFind(needle, currHaystack[len(currHaystack)/2:], offset + len(currHaystack)/2, lst)\n\t\t\treturn -1\n\n        def _kmp_construct_next(self, pattern):\n                \"\"\"the helper function for KMP-string-searching is to construct the DFA. pattern should be an integer array. 
return a 2D array representing the DFA for moving the pattern.\"\"\"\n                next = [[0 for state in pattern] for input_token in self.ALPHABETA_KMP]\n                next[pattern[0]][0] = 1\n                restart_state = 0\n                for state in range(1, len(pattern)):\n                        for input_token in self.ALPHABETA_KMP:\n                                next[input_token][state] = next[input_token][restart_state]\n                        next[pattern[state]][state] = state + 1\n                        restart_state = next[pattern[state]][restart_state]\n                return next\n\n        def _kmp_search_first(self, pInput_sequence, pPattern):\n                \"\"\"use KMP algorithm to search the first occurrence in the input_sequence of the pattern. both arguments are integer arrays. return the position of the occurence if found; otherwise, -1.\"\"\"\n                input_sequence, pattern = pInput_sequence, [len(bin(e)) for e in pPattern]\n                n, m = len(input_sequence), len(pattern)\n                d = p = 0\n                next = self._kmp_construct_next(pattern)\n                while d < n and p < m:\n                        p = next[len(bin(input_sequence[d]))][p]\n                        d += 1\n                if p == m: return d - p\n                else: return -1\n\n        def _kmp_search_all(self, pInput_sequence, pPattern):\n                \"\"\"use KMP algorithm to search all occurrence in the input_sequence of the pattern. both arguments are integer arrays. 
return a list of the positions of the occurrences if found; otherwise, [].\"\"\"\n                r = []\n                input_sequence, pattern = [len(bin(e)) for e in pInput_sequence], [len(bin(e)) for e in pPattern]\n                n, m = len(input_sequence), len(pattern)\n                d = p = 0\n                next = self._kmp_construct_next(pattern)\n                while d < n:\n                        p = next[input_sequence[d]][p]\n                        d += 1\n                        if p == m:\n                                r.append(d - m)\n                                p = 0\n                return r\n\n\tdef _kmp_find(self, needle, haystack, lst = None):\n\t\t\"\"\"find with KMP searching. needle is an integer array representing a pattern. haystack is an integer array representing the input sequence. If lst is None, returns the first position found, or -1 if there is no match. If lst is a list, returns a list of all positions, or [] if there is no match.\"\"\"\n\t\tif lst is not None: return self._kmp_search_all(haystack, needle)\n\t\telse: return self._kmp_search_first(haystack, needle)\n\t\t\n\tdef findByBiSearch(self, strSeq) :\n\t\t\"\"\"returns the first occurrence of strSeq in self. Takes polymorphisms into account\"\"\"\n\t\tarr = self.encode(strSeq)\n\t\treturn self._dichFind(arr[0], self, 0, lst = None)\n\n\tdef findAllByBiSearch(self, strSeq) :\n\t\t\"\"\"Same as find but returns a list of all occurrences\"\"\"\n\t\tarr = self.encode(strSeq)\n\t\tlst = []\n\t\tself._dichFind(arr[0], self, 0, lst)\n\t\treturn lst\n\n\tdef find(self, strSeq) :\n\t\t\"\"\"returns the first occurrence of strSeq in self. 
Takes polymorphisms into account\"\"\"\n\t\tarr = self.encode(strSeq)\n                return self._kmp_find(arr[0], self)\n\n\tdef findAll(self, strSeq) :\n\t\t\"\"\"Same as find but returns a list of all occurrences\"\"\"\n\t\tarr = self.encode(strSeq)\n\t\tlst = []\n                lst = self._kmp_find(arr[0], self, lst)\n\t\treturn lst\n\t\t\n\tdef __and__(self, arr) :\n\t\tself.__testBinary(arr)\n\t\t\n\t\tret = BinarySequence(self.typecode, self.forma, self.charToBin)\n\t\tfor i in range(len(arr)) :\n\t\t\tret.append(self[i] & arr[i])\n\t\t\n\t\treturn ret\n\t\n\tdef __xor__(self, arr) :\n\t\tself.__testBinary(arr)\n\t\t\n\t\tret = BinarySequence(self.typecode, self.forma, self.charToBin)\n\t\tfor i in range(len(arr)) :\n\t\t\tret.append(self[i] ^ arr[i])\n\t\t\n\t\treturn ret\n\n\tdef __eq__(self, seq) :\n\t\tself.__testBinary(seq)\n\t\t\n\t\tif len(seq) != len(self) :\n\t\t\treturn False\n\n\t\treturn all( self[i] == seq[i] for i in range(len(self)) )\n\n\t\t\n\tdef append(self, arr) :\n\t\tself.binSequence.append(arr)\n\n\tdef extend(self, arr) :\n\t\tself.binSequence.extend(arr)\n\n\tdef decode(self, binSequence):\n\t\t\"\"\"decodes a binary sequence to return a string\"\"\"\n\t\ttry:\n\t\t\tbinSeq = iter(binSequence[0])\n\t\texcept TypeError, te:\n\t\t\tbinSeq = binSequence\n    \n\t\tret = ''\n\t\tfor b in binSeq :\n\t\t\tch = ''\n\t\t\tfor c in self.charToBin :\n\t\t\t\tif b & self.forma[self.charToBin[c]] > 0 :\n\t\t\t\t\tch += c +'/'\n\t\t\tif ch == '' :\n\t\t\t\traise KeyError('Key %d unknown, bad format' % b)\n\t\t\t\n\t\t\tret += ch[:-1]\n\t\treturn ret\n\t\t\n\tdef getChar(self, i):\n\t\treturn self.decode([self.binSequence[i]])\n\t\t\n\tdef __len__(self):\n\t\treturn len(self.binSequence)\n\n\tdef __getitem__(self,i):\n\t\treturn self.binSequence[i]\n\t\n\tdef __setitem__(self, i, v):\n\t\tself.binSequence[i] = v\n\nclass AABinarySequence(BinarySequence) :\n\t\"\"\"A binary sequence of amino acids\"\"\"\n\t\n\tdef __init__(self, sequence):\n\t\tf 
= array.array('I', [1L, 2L, 4L, 8L, 16L, 32L, 64L, 128L, 256L, 512L, 1024L, 2048L, 4096L, 8192L, 16384L, 32768L, 65536L, 131072L, 262144L, 524288L, 1048576L, 2097152L])\n\t\tc = {'A': 17, 'C': 14, 'E': 19, 'D': 15, 'G': 13, 'F': 16, 'I': 3, 'H': 9, 'K': 8, '*': 1, 'M': 20, 'L': 0, 'N': 4, 'Q': 11, 'P': 6, 'S': 7, 'R': 5, 'T': 2, 'W': 10, 'V': 18, 'Y': 12, 'U': 21}\n\t\tBinarySequence.__init__(self, sequence, f, c)\n\t\nclass NucBinarySequence(BinarySequence) :\n\t\"\"\"A binary sequence of nucleotides\"\"\"\n\t\n\tdef __init__(self, sequence):\n\t\tf = array.array('B', [1, 2, 4, 8])\n\t\tc = {'A': 0, 'T': 1, 'C': 2, 'G': 3}\n\t\tce = {\n\t\t\t'R' : 'A/G', 'Y' : 'C/T', 'M': 'A/C',\n\t\t\t'K' : 'T/G', 'W' : 'A/T', 'S' : 'C/G', 'B': 'C/G/T',\n\t\t\t'D' : 'A/G/T', 'H' : 'A/C/T', 'V' : 'A/C/G', 'N': 'A/C/G/T'\n\t\t\t}\n\t\tlstSeq = list(sequence)\n\t\tfor i in xrange(len(lstSeq)) :\n\t\t\tif lstSeq[i] in ce :\n\t\t\t\tlstSeq[i] = ce[lstSeq[i]]\n\t\tlstSeq = ''.join(lstSeq)\n\t\tBinarySequence.__init__(self, lstSeq, f, c)\n\nif __name__==\"__main__\":\n\t\n\tdef test0() :\n\t\t#seq = 'ACED/E/GFIHK/MLMQPS/RTWVY'\n\t\tseq = 'ACED/E/GFIHK/MLMQPS/RTWVY/A/R'\n\t\tbSeq = AABinarySequence(seq)\n\t\tstart = 0\n\t\tstop = 4\n\t\trB = bSeq.getSequenceVariants_bck(start, stop)\n\t\tr = bSeq.getSequenceVariants(start, stop)\n\t\t\n\t\t#print start, stop, 'nb_comb_r', len(r[1]), set(rB[1])==set(r[1]) \n\t\tprint start, stop#, 'nb_comb_r', len(r[1]), set(rB[1])==set(r[1]) \n\t\t\n\t\t#if set(rB[1])!=set(r[1]) :\n\t\tprint '-AV-'\n\t\tprint start, stop, 'nb_comb_r', len(rB[1])\n\t\tprint '\\n'.join(rB[1])\n\t\tprint '=AP========'\n\t\tprint start, stop, 'nb_comb_r', len(r[1]) \n\t\tprint '\\n'.join(r[1])\n\t\n\tdef testVariants() :\n\t\tseq = 'ATGAGTTTGCCGCGCN'\n\t\tbSeq = NucBinarySequence(seq)\n\t\tprint bSeq.getSequenceVariants() \n\n\ttestVariants()\n\n        from random import randint\n        alphabeta = ['A', 'C', 'G', 'T']\n        seq = ''\n        for _ in range(8192):\n     
           seq += alphabeta[randint(0, 3)]\n        seq += 'ATGAGTTTGCCGCGCN'\n        bSeq = NucBinarySequence(seq)\n\n        ROUND = 512\n        PATTERN = 'GCGC'\n\n        def testFind():\n                for i in range(ROUND):\n                        bSeq.find(PATTERN)\n\n        def testFindByBiSearch():\n                for i in range(ROUND):\n                        bSeq.findByBiSearch(PATTERN)\n                        \n        def testFindAll():\n                for i in range(ROUND):\n                        bSeq.findAll(PATTERN)\n\n        def testFindAllByBiSearch():\n                for i in range(ROUND):\n                        bSeq.findAllByBiSearch(PATTERN)\n\n        import cProfile\n        print('find:\\n')\n        cProfile.run('testFind()')\n        print('findAll:\\n')\n        cProfile.run('testFindAll()')\n        print('findByBiSearch:\\n')\n        cProfile.run('testFindByBiSearch()')\n        print('findAllByBiSearch:\\n')\n        cProfile.run('testFindAllByBiSearch()')\n"
  },
  {
    "path": "pyGeno/tools/ProgressBar.py",
    "content": "import sys, time, cPickle\n\nclass ProgressBar :\n\t\"\"\"A very simple unthreaded progress bar. This progress bar also logs stats in .logs.\n\tUsage example::\n\n\t\tp = ProgressBar(nbEpochs = -1)\n\t\tfor i in range(200000) :\n\t\t\tp.update(label = 'value of i %d' % i)\n\t\tp.close()\n\t\n\tIf you don't know the maximum number of epochs, you can pass nbEpochs < 1\n\t\"\"\"\n\t\n\tdef __init__(self, nbEpochs = -1, width = 25, label = \"progress\", minRefeshTime = 1) :\n\t\tself.width = width\n\t\tself.currEpoch = 0\n\t\tself.nbEpochs = float(nbEpochs)\n\t\tself.bar = ''\n\n\t\tself.label = label\n\t\tself.wheel = [\"-\", \"\\\\\", \"|\", \"/\"]\n\t\tself.startTime = time.time()\n\t\tself.lastPrintTime = -1\n\t\tself.minRefeshTime = minRefeshTime\n\t\t\n\t\tself.runtime = -1\n\t\tself.runtime_hr = -1\n\t\tself.avg = -1\n\t\tself.remtime = -1\n\t\tself.remtime_hr = -1\n\t\tself.currTime = time.time()\n\t\tself.lastEpochDuration = -1 \n\t\t\n\t\tself.bars = []\n\t\tself.miniSnake = '~-~-~-?:>' \n\t\tself.logs = {'epochDuration' : [], 'avg' : [], 'runtime' : [], 'remtime' : []}\n\t\t\n\tdef formatTime(self, val) :\n\t\tif val < 60 :\n\t\t\treturn '%.3fsc' % val\n\t\telif val < 3600 :\n\t\t\treturn '%.3fmin' % (val/60)\n\t\telse :\n\t\t\treturn '%dh %dmin' % (int(val)/3600, int(val/60)%60)\n\n\tdef _update(self) :\n\t\ttim = time.time()\n\t\tif self.nbEpochs > 1 :\n\t\t\tif self.currTime > 0 :\n\t\t\t\tself.lastEpochDuration = tim - self.currTime\n\t\t\tself.currTime = tim\n\t\t\tself.runtime = tim - self.startTime\n\t\t\tself.runtime_hr = self.formatTime(self.runtime)\n\t\t\tself.avg = self.runtime/self.currEpoch\n\t\t\t\n\t\t\tself.remtime = self.avg * (self.nbEpochs-self.currEpoch)\n\t\t\tself.remtime_hr = self.formatTime(self.remtime)\n\t\n\tdef log(self) :\n\t\t\"\"\"logs stats about the progression, without printing anything on 
screen\"\"\"\n\t\t\n\t\tself.logs['epochDuration'].append(self.lastEpochDuration)\n\t\tself.logs['avg'].append(self.avg)\n\t\tself.logs['runtime'].append(self.runtime)\n\t\tself.logs['remtime'].append(self.remtime)\n\t\n\tdef saveLogs(self, filename) :\n\t\t\"\"\"dumps logs into a nice pickle\"\"\"\n\t\tf = open(filename, 'wb')\n\t\tcPickle.dump(self.logs, f)\n\t\tf.close()\n\n\tdef update(self, label = '', forceRefresh = False, log = False) :\n\t\t\"\"\"the function to be called at each iteration. Setting log = True is the same as calling log() just after update()\"\"\"\n\t\tself.currEpoch += 1\n\t\ttim = time.time()\n\t\tif (tim - self.lastPrintTime > self.minRefeshTime) or forceRefresh :\n\t\t\tself._update()\n\t\t\t\n\t\t\twheelState = self.wheel[self.currEpoch%len(self.wheel)]\n\t\t\t\n\t\t\tif label == '' :\n\t\t\t\tslabel = self.label\n\t\t\telse :\n\t\t\t\tslabel = label\n\t\t\t\n\t\t\tif self.nbEpochs > 1 :\n\t\t\t\tratio = self.currEpoch/self.nbEpochs\n\t\t\t\tsnakeLen = int(self.width*ratio)\n\t\t\t\tvoidLen = int(self.width - (self.width*ratio))\n\n\t\t\t\tif snakeLen + voidLen < self.width :\n\t\t\t\t\tsnakeLen = self.width - voidLen\n\t\t\t\t\n\t\t\t\tself.bar = \"%s %s[%s:>%s] %.2f%% (%d/%d) runtime: %s, remaining: %s, avg: %s\" %(wheelState, slabel, \"~-\" * snakeLen, \"  \" * voidLen, ratio*100, self.currEpoch, self.nbEpochs, self.runtime_hr, self.remtime_hr, self.formatTime(self.avg))\n\t\t\t\tif self.currEpoch == self.nbEpochs :\n\t\t\t\t\tself.close()\n\t\t\t\t\n\t\t\telse :\n\t\t\t\tw = self.width - len(self.miniSnake)\n\t\t\t\tv = self.currEpoch%(w+1)\n\t\t\t\tsnake = \"%s%s%s\" %(\"  \" * (v), self.miniSnake, \"  \" * (w-v))\n\t\t\t\tself.bar = \"%s %s[%s] %s%% (%d/%s) runtime: %s, remaining: %s, avg: %s\" %(wheelState, slabel, snake, '?', self.currEpoch, '?', self.runtime_hr, '?', self.formatTime(self.avg))\n\t\t\t\n\t\t\tsys.stdout.write(\"\\b\" * (len(self.bar)+1))\n\t\t\tsys.stdout.write(\" \" * 
(len(self.bar)+1))\n\t\t\tsys.stdout.write(\"\\b\" * (len(self.bar)+1))\n\t\t\tsys.stdout.write(self.bar)\n\t\t\tsys.stdout.flush()\n\t\t\tself.lastPrintTime = time.time()\n\t\t\t\n\t\t\tif log :\n\t\t\t\tself.log()\n\n\tdef close(self) :\n\t\t\"\"\"Closes the bar so your next print will be on another line\"\"\"\n\t\tself.update(forceRefresh = True)\n\t\tprint '\\n'\n\t\t\nif __name__ == \"__main__\" :\n\tp = ProgressBar(nbEpochs = 100000000000)\n\tfor i in xrange(100000000000) :\n\t\tp.update()\n\t\t#time.sleep(3)\n\tp.close()\n\t\n"
  },
  {
    "path": "pyGeno/tools/SecureMmap.py",
    "content": "import mmap\n\nclass SecureMmap:\n\t\"\"\"In a normal mmap, modifying the string modifies the file. This is a mmap with write protection\"\"\"\n\t\n\tdef __init__(self, filename, enableWrite = False) :\n\t\t\n\t\tself.enableWrite = enableWrite\n\t\tself.filename = filename\n\t\tself.name = filename\n\t\t\n\t\tf = open(filename, 'r+b')\n\t\tself.data = mmap.mmap(f.fileno(), 0)\n\t\n\tdef forceSet(self, x1, v) :\n\t\t\"\"\"Forces modification even if the mmap is write protected\"\"\"\n\t\tself.data[x1] = v\n\t\n\tdef __getitem__(self, i):\n\t\treturn self.data[i]\n\t\n\tdef __setitem__(self, i, v) :\n\t\tif self.enableWrite :\n\t\t\tself.data[i] = v\n\t\telse :\n\t\t\traise IOError(\"Secure mmap is write protected\")\n\n\tdef __str__(self) :\n\t\treturn \"secure mmap, file: %s, writing enabled : %s\" % (self.filename, str(self.enableWrite))\n\n\tdef __len__(self) :\n\t\treturn len(self.data)\n"
  },
  {
    "path": "pyGeno/tools/SegmentTree.py",
    "content": "import random, copy, types\n\ndef aux_insertTree(childTree, parentTree):\n\t\"\"\"This is a private (You shouldn't have to call it) recursive function that inserts a child tree into a parent tree.\"\"\"\n\tif childTree.x1 != None and childTree.x2 != None :\n\t\tparentTree.insert(childTree.x1, childTree.x2, childTree.name, childTree.referedObject)\n\n\tfor c in childTree.children:\n\t\taux_insertTree(c, parentTree)\n\t\t\ndef aux_moveTree(offset, tree):\n\t\"\"\"This is a private (You shouldn't have to call it) recursive function that translates a tree (and its children) by a given offset\"\"\"\n\tif tree.x1 != None and tree.x2 != None :\n\t\ttree.x1, tree.x2 = tree.x1+offset, tree.x2+offset\n\t\t\n\tfor c in tree.children:\n\t\taux_moveTree(offset, c)\n\t\t\nclass SegmentTree :\n\t\"\"\" Optimised genome annotations.\n\tA segment tree is an arborescence of segments. The first position is inclusive, the second exclusive; they are referred to as x1 and x2.\n\tA segment tree has the following properties :\n\t\n\t* The root has no x1 or x2 (both set to None).\n\t\n\t* Segments are arranged in ascending order\n\t\n\t* For two segments S1 and S2 : [S2.x1, S2.x2[ C [S1.x1, S1.x2[ <=> S2 is a child of S1\n\t\n\tHere's an example of a tree :\n\t\n\t* Root : 0-15\n\t\n\t* ---->Segment : 0-12\n\t\n\t* ------->Segment : 1-6\n\t\n\t* ---------->Segment : 2-3\n\t\n\t* ---------->Segment : 4-5\n\t\n\t* ------->Segment : 7-8\n\t\n\t* ------->Segment : 9-10\n\t\n\t* ---->Segment : 11-14\n\t\n\t* ------->Segment : 12-14\n\t\n\t* ---->Segment : 13-15\n\t\n\tEach segment can have a 'name' and a 'referedObject'. referedObjects are stored within the tree for future usage.\n\tThese objects are always stored in lists. 
If referedObject is already a list it will be stored as is.\n\t\"\"\"\n\t\n\tdef __init__(self, x1 = None, x2 = None, name = '', referedObject = [], father = None, level = 0) :\n\t\tif x1 > x2 :\n\t\t\tself.x1, self.x2 = x2, x1\n\t\telse :\n\t\t\tself.x1, self.x2 = x1, x2\n\t\t\n\t\tself.father = father\n\t\tself.level = level\n\t\tself.id = random.randint(0, 10**8)\n\t\tself.name = name\n\t\t\n\t\tself.children = []\n\t\tself.referedObject = referedObject\n\n\tdef __addChild(self, segmentTree, index = -1) :\n\t\tsegmentTree.level = self.level + 1\n\t\tif index < 0 :\n\t\t\tself.children.append(segmentTree)\n\t\telse :\n\t\t\tself.children.insert(index, segmentTree)\n\t\n\tdef insert(self, x1, x2, name = '', referedObject = []) :\n\t\t\"\"\"Insert the segment in it's right place and returns it. \n\t\tIf there's already a segment S as S.x1 == x1 and S.x2 == x2. S.name will be changed to 'S.name U name' and the\n\t\treferedObject will be appended to the already existing list\"\"\"\n\t\t\n\t\tif x1 > x2 :\n\t\t\txx1, xx2 = x2, x1\n\t\telse :\n\t\t\txx1, xx2 = x1, x2\n\n\t\trt = None\n\t\tinsertId = None\n\t\tchildrenToRemove = []\n\t\tfor i in range(len(self.children)) :\n\t\t\tif self.children[i].x1 == xx1 and xx2 == self.children[i].x2 :\n\t\t\t\tself.children[i].name = self.children[i].name + ' U ' + name\n\t\t\t\tself.children[i].referedObject.append(referedObject)\n\t\t\t\treturn self.children[i]\n\t\t\t\n\t\t\tif self.children[i].x1 <= xx1 and xx2 <= self.children[i].x2 :\n\t\t\t\treturn self.children[i].insert(x1, x2, name, referedObject)\n\t\t\t\n\t\t\telif xx1 <= self.children[i].x1 and self.children[i].x2 <= xx2 :\n\t\t\t\tif rt == None :\n\t\t\t\t\tif type(referedObject) is types.ListType :\n\t\t\t\t\t\trt = SegmentTree(xx1, xx2, name, referedObject, self, self.level+1)\n\t\t\t\t\telse :\n\t\t\t\t\t\trt = SegmentTree(xx1, xx2, name, [referedObject], self, self.level+1)\n\t\t\t\t\t\n\t\t\t\t\tinsertId = 
i\n\t\t\t\t\t\n\t\t\t\trt.__addChild(self.children[i])\n\t\t\t\tself.children[i].father = rt\n\t\t\t\tchildrenToRemove.append(self.children[i])\n\t\t\n\t\t\telif xx1 <= self.children[i].x1 and xx2 <= self.children[i].x2 :\n\t\t\t\tinsertId = i\n\t\t\t\tbreak\n\t\t\t\t\n\t\tif rt != None :\n\t\t\tself.__addChild(rt, insertId)\n\t\t\tfor c in childrenToRemove :\n\t\t\t\tself.children.remove(c)\n\t\telse :\n\t\t\tif type(referedObject) is types.ListType :\n\t\t\t\trt = SegmentTree(xx1, xx2, name, referedObject, self, self.level+1)\n\t\t\telse :\n\t\t\t\trt = SegmentTree(xx1, xx2, name, [referedObject], self, self.level+1)\n\t\t\t\n\t\t\tif insertId != None :\n\t\t\t\tself.__addChild(rt, insertId)\n\t\t\telse :\n\t\t\t\tself.__addChild(rt)\n\t\t\n\t\treturn rt\n\t\n\tdef insertTree(self, childTree):\n\t\t\"\"\"inserts childTree in the right position (regions will be rearanged to fit the organisation of self)\"\"\"\n\t\taux_insertTree(childTree, self)\n\t\t\n\t#~ def included_todo(self, x1, x2=None) :\n\t\t#~ \"Returns all the segments where [x1, x2] is included\"\"\"\n\t\t#~ pass\n\t\t\n\tdef intersect(self, x1, x2 = None) :\n\t\t\"\"\"Returns a list of all segments intersected by [x1, x2]\"\"\"\n\t\t\n\t\tdef condition(x1, x2, tree) :\n\t\t\t#print self.id, tree.x1, tree.x2, x1, x2\n\t\t\tif (tree.x1 != None and tree.x2 != None) and (tree.x1 <= x1 and x1 < tree.x2 or tree.x1 <= x2 and x2 < tree.x2) :\n\t\t\t\treturn True\n\t\t\treturn False\n\t\t\t\n\t\tif x2 == None :\n\t\t\txx1, xx2 = x1, x1\n\t\telif x1 > x2 :\n\t\t\txx1, xx2 = x2, x1\n\t\telse :\n\t\t\txx1, xx2 = x1, x2\n\t\t\t\n\t\tc1 = self.__dichotomicSearch(xx1)\n\t\tc2 = self.__dichotomicSearch(xx2)\n\t\t\n\t\tif c1 == -1 or c2 == -1 :\n\t\t\treturn []\n\t\t\t\n\t\tif xx1 < self.children[c1].x1 :\n\t\t\tc1 -= 1\n\t\t\t\n\t\tinter = self.__radiateDown(x1, x2, c1, condition)\n\t\tif self.children[c1].id == self.children[c2].id :\n\t\t\tinter.extend(self.__radiateUp(x1, x2, c2+1, condition))\n\t\telse 
:\n\t\t\tinter.extend(self.__radiateUp(x1, x2, c2, condition))\n\t\t\n\t\tret = []\n\t\tfor c in inter :\n\t\t\tret.extend(c.intersect(x1, x2))\n\t\t\n\t\tinter.extend(ret)\n\t\treturn inter\n\t\t\n\tdef __dichotomicSearch(self, x1) :\n\t\tr1 = 0\n\t\tr2 = len(self.children)-1\n\t\tpos = -1\n\t\twhile (r1 <= r2) :\n\t\t\tpos = (r1+r2)/2\n\t\t\tval = self.children[pos].x1\n\n\t\t\tif val == x1 :\n\t\t\t\treturn pos\n\t\t\telif x1 < val :\n\t\t\t\tr2 = pos -1\n\t\t\telse :\n\t\t\t\tr1 = pos +1\n\t\t\n\t\treturn pos\n\t\n\tdef __radiateDown(self, x1, x2, childId, condition) :\n\t\t\"Radiates down: walks self.children downward until the condition is no longer verified or there are no children left \"\n\t\tret = []\n\t\ti = childId\n\t\twhile 0 <= i :\n\t\t\tif condition(x1, x2, self.children[i]) :\n\t\t\t\tret.append(self.children[i])\n\t\t\telse :\n\t\t\t\tbreak\n\t\t\ti -= 1\n\t\treturn ret\n\t\n\tdef __radiateUp(self, x1, x2, childId, condition) :\n\t\t\"Radiates up: walks self.children upward until the condition is no longer verified or there are no children left \"\n\t\tret = []\n\t\ti = childId\n\t\twhile i < len(self.children):\n\t\t\tif condition(x1, x2, self.children[i]) :\n\t\t\t\tret.append(self.children[i])\n\t\t\telse :\n\t\t\t\tbreak\n\t\t\ti += 1\n\t\treturn ret\n\t\n\tdef emptyChildren(self) :\n\t\t\"\"\"Removes all children\"\"\"\n\t\tself.children = []\n\t\n\tdef removeGaps(self) :\n\t\t\"\"\"Removes all gaps between regions\"\"\"\n\t\t\n\t\tfor i in range(1, len(self.children)) :\n\t\t\tif self.children[i].x1 > self.children[i-1].x2:\t\t\t\t\n\t\t\t\taux_moveTree(self.children[i-1].x2-self.children[i].x1, self.children[i])\n\t\t\n\tdef getX1(self) :\n\t\t\"\"\"Returns the starting position of the tree\"\"\"\n\t\tif self.x1 != None :\n\t\t\treturn self.x1\n\t\treturn self.children[0].x1\n\n\tdef getX2(self) :\n\t\t\"\"\"Returns the ending position of the tree\"\"\"\n\t\tif self.x2 != None :\n\t\t\treturn self.x2\n\t\treturn self.children[-1].x2\n\t\n\tdef 
getIndexedLength(self) :\n\t\t\"\"\"Returns the total length of indexed regions\"\"\"\n\t\tif self.x1 != None and self.x2 != None:\n\t\t\treturn self.x2 - self.x1\n\t\telse :\n\t\t\tif len(self.children) == 0 :\n\t\t\t\treturn 0\n\t\t\telse :\n\t\t\t\tl = self.children[0].x2 - self.children[0].x1\n\t\t\t\tfor i in range(1, len(self.children)) :\n\t\t\t\t\tl += self.children[i].x2 - self.children[i].x1 - max(0, self.children[i-1].x2 - self.children[i].x1)\n\t\t\t\treturn l\n\t\n\tdef getFirstLevel(self) :\n\t\t\"\"\"returns a list of couples (x1, x2) of all the first level indexed regions\"\"\"\n\t\tres = []\n\t\tif len(self.children) > 0 :\n\t\t\tfor c in self.children:\n\t\t\t\tres.append((c.x1, c.x2)) \n\t\telse :\n\t\t\tif self.x1 != None :\n\t\t\t\tres = [(self.x1, self.x2)]\n\t\t\telse :\n\t\t\t\tres = None\n\t\treturn res\n\t\t\n\tdef flatten(self) :\n\t\t\"\"\"Flattens the tree. The tree become a tree of depth 1 where overlapping regions have been merged together\"\"\"\n\t\tif len(self.children) > 1 :\n\t\t\tchildren = self.children\n\t\t\tself.emptyChildren()\n\t\t\t\n\t\t\tchildren[0].emptyChildren()\n\t\t\tx1 = children[0].x1\n\t\t\tx2 = children[0].x2\n\t\t\trefObjs = [children[0].referedObject]\n\t\t\tname = children[0].name\n\t\t\t\n\t\t\tfor i in range(1, len(children)) :\n\t\t\t\tchildren[i].emptyChildren()\n\t\t\t\tif children[i-1] >= children[i] :\n\t\t\t\t\tx2 = children[i].x2\n\t\t\t\t\trefObjs.append(children[i].referedObject)\n\t\t\t\t\tname += \" U \" + children[i].name\n\t\t\t\telse :\n\t\t\t\t\tif len(refObjs) == 1 :\n\t\t\t\t\t\trefObjs = refObjs[0]\n\t\t\n\t\t\t\t\tself.insert(x1, x2, name, refObjs)\n\t\t\t\t\tx1 = children[i].x1\n\t\t\t\t\tx2 = children[i].x2\n\t\t\t\t\trefObjs = [children[i].referedObject]\n\t\t\t\t\tname = children[i].name\n\t\t\t\n\t\t\tif len(refObjs) == 1 :\n\t\t\t\trefObjs = refObjs[0]\n\t\t\n\t\t\tself.insert(x1, x2, name, refObjs)\n\n\tdef move(self, newX1) :\n\t\t\"\"\"Moves tree to a new starting position, 
updates x1s of children\"\"\"\n\t\tif self.x1 != None and self.x2 != None :\n\t\t\toffset = newX1-self.x1\n\t\t\taux_moveTree(offset, self)\n\t\telif len(self.children) > 0 :\n\t\t\toffset = newX1-self.children[0].x1\n\t\t\taux_moveTree(offset, self)\n\n\tdef __str__(self) :\n\t\tstrRes = self.__str()\n\t\t\n\t\toffset = ''\n\t\tfor i in range(self.level+1) :\n\t\t\toffset += '\\t'\n\t\t\t\n\t\tfor c in self.children :\n\t\t\tstrRes += '\\n'+offset+'-->'+str(c)\n\t\t\n\t\treturn strRes\n\t\n\tdef __str(self) :\n\t\tif self.x1 == None :\n\t\t\tif len(self.children) > 0 :\n\t\t\t\treturn \"Root: %d-%d, name: %s, id: %d, obj: %s\" %(self.children[0].x1, self.children[-1].x2, self.name, self.id, repr(self.referedObject))\n\t\t\telse :\n\t\t\t\treturn \"Root: EMPTY , name: %s, id: %d, obj: %s\" %(self.name, self.id, repr(self.referedObject))\n\t\telse :\n\t\t\treturn \"Segment: %d-%d, name: %s, id: %d, father id: %d, obj: %s\" %(self.x1, self.x2, self.name, self.id, self.father.id, repr(self.referedObject))\n\t\t\t\n\t\n\tdef __len__(self) :\n\t\t\"returns the size of the complete indexed region\"\n\t\tif self.x1 != None and self.x2 != None :\n\t\t\treturn self.x2-self.x1\n\t\telse :\n\t\t\treturn self.children[-1].x2 - self.children[0].x1\n\t\n\tdef __repr__(self):\n\t\treturn 'Segment Tree, id:%s, father id:%s, (x1, x2): (%s, %s)' %(self.id, self.father.id if self.father != None else None, self.x1, self.x2)\n\n\nif __name__== \"__main__\" :\n\ts = SegmentTree()\n\ts.insert(5, 10, 'region 1')\n\ts.insert(8, 12, 'region 2')\n\ts.insert(5, 8, 'region 3')\n\ts.insert(34, 40, 'region 4')\n\ts.insert(35, 38, 'region 5')\n\ts.insert(36, 37, 'region 6', 'aaa')\n\ts.insert(36, 37, 'region 6', 'aaa2')\n\tprint \"Tree:\"\n\tprint s\n\tprint \"indexed length\", s.getIndexedLength()\n\tprint \"removing gaps and adding region 7 : [13-37[\"\n\ts.removeGaps()\n\t#s.insert(13, 37, 'region 7')\n\tprint s\n\tprint \"indexed length\", s.getIndexedLength()\n\t#print \"intersections\"\n\t#for c in [6, 10, 14, 1000] 
:\n\t#\tprint c, s.intersect(c)\n\t\n\tprint \"Move\"\n\ts.move(0)\n\tprint s\n\tprint \"indexed length\", s.getIndexedLength()\n"
  },
  {
    "path": "pyGeno/tools/SingletonManager.py",
    "content": "#This thing is wonderful\n\nobjects = {}\ndef add(obj, objName='') :\n\t\n\tif objName == '' :\n\t\tkey = obj.name\n\telse :\n\t\tkey = objName\n\t\t\n\tif key not in objects :\n\t\tobjects[key] = obj\n\n\treturn objects[key]\n\ndef contains(k) :\n\treturn k in objects\n\t\ndef get(objName) :\n\ttry :\n\t\treturn objects[objName]\n\texcept KeyError :\n\t\treturn None\n"
  },
  {
    "path": "pyGeno/tools/Stats.py",
    "content": "import numpy as np\n\ndef kullback_leibler(p, q) :\n\t\"\"\"Discrete Kullback-Leibler divergence D(P||Q)\"\"\"\n\tp = np.asarray(p, dtype=np.float)\n\tq = np.asarray(q, dtype=np.float)\n\n\tif p.shape != q.shape :\n\t\traise ValueError(\"p and q must be of the same dimensions\")\n\t\n\treturn np.sum(np.where(p > 0, np.log(p / q) * p, 0))\n\ndef squaredError_log10(p, q) :\n\tp = np.asarray(p, dtype=np.float)\n\tq = np.asarray(q, dtype=np.float)\n\t\n\tif p.shape != q.shape :\n\t\traise ValueError(\"p and q must be of the same dimensions\")\n\t\t\n\treturn np.log10(np.sum((p-q)**2)) - np.log10(len(p))\n\t\ndef fisherExactTest(table) :\n\t\"\"\"Fisher's exact test\n\t----------\n\ttable: contingency table\n\t\"\"\"\n\traise NotImplementedError\n"
  },
  {
    "path": "pyGeno/tools/UsefulFunctions.py",
    "content": "import string, os, copy, types\n\nclass UnknownNucleotide(Exception) :\n\tdef __init__(self, nuc) :\n\t\tself.msg =  'Unknown nucleotides %s' % str(nuc)\n\n\tdef __str__(self) :\n\t\treturn self.msg\n\n#This will probably be moved somewhere else in the futur\ndef saveResults(directoryName, fileName, strResults, log = '', args = ''):\n\n\tif not os.path.exists(directoryName):\n\t\tos.makedirs(directoryName)\n\n\tresPath = \"%s/%s\"%(directoryName, fileName)\n\tresFile = open(resPath, 'w')\n\tprint \"Saving results :\\n\\t%s...\"%resPath\n\tresFile.write(strResults)\n\tresFile.close()\n\n\tif log != '' :\n\t\terrPath = \"%s.err.txt\"%(resPath)\n\t\terrFile = open(errPath, 'w')\n\n\t\tprint \"Saving log :\\n\\t%s...\" %errPath\n\t\terrFile.write(log)\n\t\terrFile.close()\n\n\tif args != '' :\n\t\tparamPath = \"%s.args.txt\"%(resPath)\n\t\tparamFile = open(paramPath, 'w')\n\n\t\tprint \"Saving arguments :\\n\\t%s...\" %paramPath\n\t\tparamFile.write(args)\n\t\tparamFile.close()\n\n\treturn \"%s/\"%(directoryName)\n\nnucleotides = ['A', 'T', 'C', 'G']\npolymorphicNucleotides = {\n\t\t\t'R' : ['A','G'], 'Y' : ['C','T'], 'M': ['A','C'],\n\t\t\t'K' : ['T','G'], 'W' : ['A','T'], 'S' : ['C','G'], 'B': ['C','G','T'],\n\t\t\t'D' : ['A','G','T'], 'H' : ['A','C','T'], 'V' : ['A','C','G'], 'N': ['A','C','G','T']\n\t\t\t}\n\n#<7iyed>\n#from Molecular Systems Biology 8; Article number 572; doi:10.1038/msb.2012.3\ncodonAffinity = {'CTT': 'low', 'ACC': 'high', 'ACA': 'low', 'ACG': 'high', 'ATC': 'high', 'AAC': 'high', 'ATA': 'low', 'AGG': 'high', 'CCT': 'low', 'ACT': 'low', 'AGC': 'high', 'AAG': 'high', 'AGA': 'low', 'CAT': 'low', 'AAT': 'low', 'ATT': 'low', 'CTG': 'high', 'CTA': 'low', 'CTC': 'high', 'CAC': 'high', 'AAA': 'low', 'CCG': 'high', 'AGT': 'low', 'CCA': 'low', 'CAA': 'low', 'CCC': 'high', 'TAT': 'low', 'GGT': 'low', 'TGT': 'low', 'CGA': 'low', 'CAG': 'high', 'TCT': 'low', 'GAT': 'low', 'CGG': 'high', 'TTT': 'low', 'TGC': 'high', 'GGG': 'high', 'TAG': 
'high', 'GGA': 'low', 'TGG': 'high', 'GGC': 'high', 'TAC': 'high', 'TTC': 'high', 'TCG': 'high', 'TTA': 'low', 'TTG': 'high', 'TCC': 'high', 'GAA': 'low', 'TAA': 'high', 'GCA': 'low', 'GTA': 'low', 'GCC': 'high', 'GTC': 'high', 'GCG': 'high', 'GTG': 'high', 'GAG': 'high', 'GTT': 'low', 'GCT': 'low', 'TGA': 'high', 'GAC': 'high', 'CGT': 'low', 'TCA': 'low', 'ATG': 'high', 'CGC': 'high'}\n\nlowAffinityCodons = set(['CTT', 'ACA', 'AAA', 'ATA', 'CCT', 'AGA', 'CAT', 'AAT', 'ATT', 'CTA', 'ACT', 'CAA', 'AGT', 'CCA', 'TAT', 'GGT', 'TGT', 'CGA', 'TCT', 'GAT', 'TTT', 'GGA', 'TTA', 'CGT', 'GAA', 'TCA', 'GCA', 'GTA', 'GTT', 'GCT'])\nhighAffinityCodons = set(['ACC', 'ATG', 'AAG', 'ACG', 'ATC', 'AAC', 'AGG', 'AGC', 'CTG', 'CTC', 'CAC', 'CCG', 'CAG', 'CCC', 'CGC', 'CGG', 'TGC', 'GGG', 'TAG', 'TGG', 'GGC', 'TAC', 'TTC', 'TCG', 'TTG', 'TCC', 'TAA', 'GCC', 'GTC', 'GCG', 'GTG', 'GAG', 'TGA', 'GAC'])\n\n#</7iyed>\n\n#Empiraclly calculated using genome GRCh37.74 and Ensembl annotations\nhumanCodonCounts = {'CTT': 588990, 'ACC': 760250, 'ACA': 671093, 'ACG': 248588, 'ATC': 819539, 'AAC': 777291, 'ATA': 326568, 'AGG': 520514, 'CCT': 784233, 'ACT': 581281, 'AGC': 826157, 'AAG': 1373474, 'AGA': 560614, 'CAT': 487348, 'AAT': 745200, 'ATT': 685951, 'CTG': 1579105, 'CTA': 311963, 'CTC': 772503, 'CAC': 618558, 'AAA': 1111269, 'CCG': 285345, 'AGT': 558788, 'CCA': 771391, 'CAA': 572531, 'CCC': 809928, 'TAT': 507376, 'GGT': 459267, 'TGT': 443487, 'CGA': 276584, 'CAG': 1483627, 'TCT': 675336, 'GAT': 982540, 'CGG': 477748, 'TTT': 721642, 'TGC': 495033, 'GGG': 661842, 'TAG': 28685, 'GGA': 731598, 'TGG': 535340, 'GGC': 877641, 'TAC': 588108, 'TTC': 774303, 'TCG': 185384, 'TTA': 348372, 'TTG': 563764, 'TCC': 729893, 'GAA': 1355256, 'TAA': 37503, 'GCA': 718158, 'GTA': 316640, 'GCC': 1120424, 'GTC': 576027, 'GCG': 289438, 'GTG': 1119171, 'GAG': 1685297, 'GTT': 486471, 'GCT': 806491, 'TGA': 82954, 'GAC': 1033108, 'CGT': 200762, 'TCA': 569093, 'ATG': 935789, 'CGC': 404889}\n\nhumanCodonCount = 
42433513\n\nhumanCodonRatios = {'CTT': 0.013880302580651288, 'ACC': 0.017916263496731935, 'ACA': 0.01581516477318293, 'ACG': 0.005858294127097137, 'ATC': 0.019313484603549088, 'AAC': 0.018317856454637634, 'ATA': 0.007695992551924702, 'AGG': 0.012266578070026868, 'CCT': 0.018481453562423644, 'ACT': 0.0136986301369863, 'AGC': 0.01946944623698726, 'AAG': 0.03236767127906662, 'AGA': 0.013211585851965638, 'CAT': 0.011484978865643295, 'AAT': 0.017561591000019253, 'ATT': 0.016165312544356155, 'CTG': 0.0372136287655467, 'CTA': 0.007351807049300867, 'CTC': 0.018205021111497414, 'CAC': 0.014577110313727737, 'AAA': 0.02618847513285077, 'CCG': 0.006724519838835875, 'AGT': 0.01316855382678309, 'CCA': 0.018178815409414725, 'CAA': 0.013492425197037068, 'CCC': 0.01908698909750885, 'TAT': 0.011956964298477951, 'GGT': 0.010823214189218791, 'TGT': 0.010451338308944631, 'CGA': 0.006518055669819277, 'CAG': 0.034963567593378375, 'TCT': 0.015915156494349172, 'GAT': 0.023154811622596506, 'CGG': 0.011258742588670422, 'TTT': 0.017006416602839365, 'TGC': 0.01166608571861585, 'GGG': 0.015597153127529177, 'TAG': 0.0006759987088507143, 'GGA': 0.017241042475083315, 'TGG': 0.012615971720276849, 'GGC': 0.020682732537369696, 'TAC': 0.013859517122704406, 'TTC': 0.01824744041342983, 'TCG': 0.004368811038576985, 'TTA': 0.008209831695999339, 'TTG': 0.013285819630347362, 'TCC': 0.017200861969641778, 'GAA': 0.03193834081095289, 'TAA': 0.0008838061557618385, 'GCA': 0.01692431168732129, 'GTA': 0.007462026535488589, 'GCC': 0.026404224415734798, 'GTC': 0.013574812907901357, 'GCG': 0.006820976618174413, 'GTG': 0.026374695868334068, 'GAG': 0.039716179049328296, 'GTT': 0.011464311239090669, 'GCT': 0.01900599179709679, 'TGA': 0.0019549170958341345, 'GAC': 0.024346511211551115, 'CGT': 0.004731213274752906, 'TCA': 0.013411404330346158, 'ATG': 0.022053064520017467, 'CGC': 0.009541727077840574}\n\nsynonymousCodonsFrequencies = {'A': {'GCA': 0.24472833804337418, 'GCC': 0.38180943946027124, 'GCT': 0.2748297757275403, 
'GCG': 0.09863244676881429}, 'C': {'TGC': 0.5274613220815753, 'TGT': 0.47253867791842474}, 'E': {'GAG': 0.5542731864894314, 'GAA': 0.44572681351056864}, 'D': {'GAT': 0.48745614313610314, 'GAC': 0.5125438568638969}, 'G': {'GGT': 0.1682082284016543, 'GGG': 0.24240206742876733, 'GGA': 0.26795045906236126, 'GGC': 0.3214392451072171}, 'F': {'TTC': 0.51760124870901, 'TTT': 0.48239875129098997}, 'I': {'ATT': 0.3744155479793762, 'ATC': 0.4473324534485262, 'ATA': 0.17825199857209761}, 'H': {'CAC': 0.5593224017231121, 'CAT': 0.4406775982768879}, 'K': {'AAG': 0.552763002048904, 'AAA': 0.44723699795109595}, '*': {'TAG': 0.19233348084375965, 'TGA': 0.5562081774416328, 'TAA': 0.25145834171460757}, 'M': {'ATG': 1.0}, 'L': {'CTT': 0.14142445416797428, 'CTG': 0.37916443861342136, 'CTA': 0.07490652981477404, 'CTC': 0.18548840407837594, 'TTA': 0.08364882247135866, 'TTG': 0.13536735085409574}, 'N': {'AAT': 0.48946102144446174, 'AAC': 0.5105389785555383}, 'Q': {'CAA': 0.27844698705060605, 'CAG': 0.721553012949394}, 'P': {'CCT': 0.29583684315158226, 'CCG': 0.1076409230535928, 'CCA': 0.2909924451987384, 'CCC': 0.30552978859608654}, 'S': {'TCT': 0.19052256484488883, 'AGC': 0.23307146458142142, 'TCG': 0.05229964811768493, 'AGT': 0.15764260007543762, 'TCC': 0.2059139249534016, 'TCA': 0.1605497974271656}, 'R': {'AGG': 0.21322832103906786, 'CGC': 0.16586259289315397, 'CGG': 0.19570924878057572, 'CGA': 0.11330250857089251, 'AGA': 0.2296552676219967, 'CGT': 0.0822420610943132}, 'T': {'ACC': 0.33621349966301256, 'ACA': 0.2967846446949689, 'ACG': 0.10993573358004469, 'ACT': 0.2570661220619738}, 'W': {'TGG': 1.0}, 'V': {'GTA': 0.12674172810489015, 'GTC': 0.230566755353321, 'GTT': 0.19472010868151218, 'GTG': 0.44797140786027667}, 'Y': {'TAT': 0.46315236005272553, 'TAC': 0.5368476399472745}}\n\n\n# Translation tables\n# Ref: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi\ntranslTable = dict()\n\n# Standard Code (NCBI transl_table=1)\ntranslTable['default'] = {\n'TTT' : 'F', 'TCT' : 'S', 
'TAT' : 'Y', 'TGT' : 'C',\n'TTC' : 'F', 'TCC' : 'S', 'TAC' : 'Y', 'TGC' : 'C',\n'TTA' : 'L', 'TCA' : 'S', 'TAA' : '*', 'TGA' : '*',\n'TTG' : 'L', 'TCG' : 'S', 'TAG' : '*', 'TGG' : 'W',\n\n'CTT' : 'L', 'CTC' : 'L', 'CTA' : 'L', 'CTG' : 'L',\n'CCT' : 'P', 'CCC' : 'P', 'CCA' : 'P', 'CCG' : 'P',\n'CAT' : 'H', 'CAC' : 'H', 'CAA' : 'Q', 'CAG' : 'Q',\n'CGT' : 'R', 'CGC' : 'R', 'CGA' : 'R', 'CGG' : 'R',\n\n'ATT' : 'I', 'ATC' : 'I', 'ATA' : 'I', 'ATG' : 'M',\n'ACT' : 'T', 'ACC' : 'T', 'ACA' : 'T', 'ACG' : 'T',\n'AAT' : 'N', 'AAC' : 'N', 'AAA' : 'K', 'AAG' : 'K',\n'AGT' : 'S', 'AGC' : 'S', 'AGA' : 'R', 'AGG' : 'R',\n\n'GTT' : 'V', 'GTC' : 'V', 'GTA' : 'V', 'GTG' : 'V',\n'GCT' : 'A', 'GCC' : 'A', 'GCA' : 'A', 'GCG' : 'A',\n'GAT' : 'D', 'GAC' : 'D', 'GAA' : 'E', 'GAG' : 'E',\n'GGT' : 'G', 'GGC' : 'G', 'GGA' : 'G', 'GGG' : 'G',\n\n'!GA' : 'U'\n\n}\ncodonTable = translTable['default']\n\n\n# The Vertebrate Mitochondrial Code (NCBI transl_table=2)\ntranslTable['mt'] = {\n'TTT' : 'F', 'TCT' : 'S', 'TAT' : 'Y', 'TGT' : 'C',\n'TTC' : 'F', 'TCC' : 'S', 'TAC' : 'Y', 'TGC' : 'C',\n'TTA' : 'L', 'TCA' : 'S', 'TAA' : '*', 'TGA' : 'W',\n'TTG' : 'L', 'TCG' : 'S', 'TAG' : '*', 'TGG' : 'W',\n\n'CTT' : 'L', 'CTC' : 'L', 'CTA' : 'L', 'CTG' : 'L',\n'CCT' : 'P', 'CCC' : 'P', 'CCA' : 'P', 'CCG' : 'P',\n'CAT' : 'H', 'CAC' : 'H', 'CAA' : 'Q', 'CAG' : 'Q',\n'CGT' : 'R', 'CGC' : 'R', 'CGA' : 'R', 'CGG' : 'R',\n\n'ATT' : 'I', 'ATC' : 'I', 'ATA' : 'M', 'ATG' : 'M',\n'ACT' : 'T', 'ACC' : 'T', 'ACA' : 'T', 'ACG' : 'T',\n'AAT' : 'N', 'AAC' : 'N', 'AAA' : 'K', 'AAG' : 'K',\n'AGT' : 'S', 'AGC' : 'S', 'AGA' : '*', 'AGG' : '*',\n\n'GTT' : 'V', 'GTC' : 'V', 'GTA' : 'V', 'GTG' : 'V',\n'GCT' : 'A', 'GCC' : 'A', 'GCA' : 'A', 'GCG' : 'A',\n'GAT' : 'D', 'GAC' : 'D', 'GAA' : 'E', 'GAG' : 'E',\n'GGT' : 'G', 'GGC' : 'G', 'GGA' : 'G', 'GGG' : 'G'\n}\n\n\n\nAATable = {'A': ['GCA', 'GCC', 'GCG', 'GCT'], 'C': ['TGT', 'TGC'], 'E': ['GAG', 'GAA'], 'D': ['GAT', 'GAC'], 'G': ['GGT', 'GGG', 'GGA', 'GGC'], 'F': ['TTT', 'TTC'], 
'I': ['ATC', 'ATA', 'ATT'], 'H': ['CAT', 'CAC'], 'K': ['AAG', 'AAA'], '*': ['TAG', 'TGA', 'TAA'], 'M': ['ATG'], 'L': ['CTT', 'CTG', 'CTA', 'CTC', 'TTA', 'TTG'], 'N': ['AAC', 'AAT'], 'Q': ['CAA', 'CAG'], 'P': ['CCT', 'CCG', 'CCA', 'CCC'], 'S': ['AGC', 'AGT', 'TCT', 'TCG', 'TCC', 'TCA'], 'R': ['AGG', 'AGA', 'CGA', 'CGG', 'CGT', 'CGC'], 'T': ['ACA', 'ACG', 'ACT', 'ACC'], 'W': ['TGG'], 'V': ['GTA', 'GTC', 'GTG', 'GTT'], 'Y': ['TAT', 'TAC']}\n\nAAs = ['A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', '*', 'M', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y']\n\ntoFloat = lambda x: float(x)\ntoInt = lambda x: int(x)\nfloatToStr = lambda x:\"%f\"%(x)\nintToStr = lambda x:\"%d\"%(x)\n\nsplitStr = lambda x: x.split(';')\nstripSplitStr = lambda x: x.strip().split(';')\n\n\ndef findAll(haystack, needle) :\n\t\"\"\"returns a list of all occurances of needle in haystack\"\"\"\n\t\n\th = haystack\n\tres = []\n\tf = haystack.find(needle)\n\toffset = 0\n\twhile (f >= 0) :\n\t\t#print h, needle, f, offset\n\t\tres.append(f+offset)\n\t\toffset += f+len(needle)\n\t\th = h[f+len(needle):]\n\t\tf = h.find(needle)\n\n\treturn res\n\n\ndef complementTab(seq=[]):\n    \"\"\"returns a list of complementary sequence without inversing it\"\"\"\n    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'R': 'Y', 'Y': 'R', 'M': 'K', 'K': 'M',\n                  'W': 'W', 'S': 'S', 'B': 'V', 'D': 'H', 'H': 'D', 'V': 'B', 'N': 'N', 'a': 't',\n                  'c': 'g', 'g': 'c', 't': 'a', 'r': 'y', 'y': 'r', 'm': 'k', 'k': 'm', 'w': 'w',\n                  's': 's', 'b': 'v', 'd': 'h', 'h': 'd', 'v': 'b', 'n': 'n'}\n    seq_tmp = []\n    for bps in seq:\n        if len(bps) == 0:\n        \t#Need manage '' for deletion\n            seq_tmp.append('') \n        elif len(bps) == 1:\n            seq_tmp.append(complement[bps])\n        else:\n        \t#Need manage 'ACT' for insertion\n        \t#The insertion need to be reverse complement (like seq)\n            
seq_tmp.append(reverseComplement(bps))\n            \n    #Doesn't work in the second for when bps==''\n    #seq = [complement[bp] if bp != '' else '' for bps in seq for bp in bps]\n\n    return seq_tmp\n\ndef reverseComplementTab(seq):\n    '''\n    Complements a DNA sequence, returning the reverse complement in a list to manage INDELs.\n    '''\n    return complementTab(seq[::-1])\n\ndef reverseComplement(seq):\n\t'''\n\tComplements a DNA sequence, returning the reverse complement.\n\t'''\n\treturn complement(seq)[::-1]\n\ndef complement(seq) :\n\t\"\"\"returns the complementary sequence without reversing it\"\"\"\n\ttb = string.maketrans(\"ACGTRYMKWSBDHVNacgtrymkwsbdhvn\",\n\t\t\t\t\t\t  \"TGCAYRKMWSVHDBNtgcayrkmwsvhdbn\")\n\t\n\t#just to be sure that seq isn't unicode\n\treturn str(seq).translate(tb)\n\ndef translateDNA_6Frames(sequence) :\n\t\"\"\"returns the 6 translations of sequence, one for each reading frame\"\"\"\n\ttrans = (\n\t\t\t\ttranslateDNA(sequence, 'f1'),\n\t\t\t\ttranslateDNA(sequence, 'f2'),\n\t\t\t\ttranslateDNA(sequence, 'f3'),\n\n\t\t\t\ttranslateDNA(sequence, 'r1'),\n\t\t\t\ttranslateDNA(sequence, 'r2'),\n\t\t\t\ttranslateDNA(sequence, 'r3'),\n\t\t\t)\n\n\treturn trans\n\ndef translateDNA(sequence, frame = 'f1', translTable_id='default') :\n\t\"\"\"Translates DNA code, frame must be one of: f1, f2, f3, r1, r2, r3\"\"\"\n\n\tprotein = \"\"\n\n\tif frame == 'f1' :\n\t\tdna = sequence\n\telif frame == 'f2' :\n\t\tdna = sequence[1:]\n\telif frame == 'f3' :\n\t\tdna = sequence[2:]\n\telif frame == 'r1' :\n\t\tdna = reverseComplement(sequence)\n\telif frame == 'r2' :\n\t\tdna = reverseComplement(sequence)\n\t\tdna = dna[1:]\n\telif frame == 'r3' :\n\t\tdna = reverseComplement(sequence)\n\t\tdna = dna[2:]\n\telse :\n\t\traise ValueError('unknown reading frame: %s, should be one of the following: f1, f2, f3, r1, r2, r3' % frame)\n\n\tfor i in range(0, len(dna), 3) :\n\t\tcodon = dna[i:i+3]\n\n\t\t# Check if variant messed with selenocysteine 
codon\n\t\tif '!' in codon and codon != '!GA':\n\t\t\tcodon = codon.replace('!', 'T')\n\n\t\tif (len(codon) == 3) :\n\t\t\ttry :\n\t\t\t\tprotein += translTable[translTable_id][codon]\n\t\t\texcept KeyError :\n\t\t\t\tcombinaisons = polymorphicCodonCombinaisons(list(codon))\n\t\t\t\ttranslations = set()\n\t\t\t\tfor ci in range(len(combinaisons)):\n\t\t\t\t\ttranslations.add(translTable[translTable_id][combinaisons[ci]])\n\t\t\t\tprotein += '/'.join(translations)\n\n\treturn protein\n\ndef getSequenceCombinaisons(polymorphicDnaSeq, pos = 0) :\n\t\"\"\"Takes a DNA sequence with polymorphisms and returns all the possible sequences it can yield\"\"\"\n\n\tif type(polymorphicDnaSeq) is not types.ListType :\n\t\tseq = list(polymorphicDnaSeq)\n\telse :\n\t\tseq = polymorphicDnaSeq\n\n\tif pos >= len(seq) :\n\t\treturn [''.join(seq)]\n\n\tvariants = []\n\tif seq[pos] in polymorphicNucleotides :\n\t\tchars = decodePolymorphicNucleotide(seq[pos])\n\telse :\n\t\tchars = seq[pos]\n\n\tfor c in chars :\n\t\trseq = copy.copy(seq)\n\t\trseq[pos] = c\n\t\tvariants.extend(getSequenceCombinaisons(rseq, pos + 1))\n\n\treturn variants\n\ndef polymorphicCodonCombinaisons(codon) :\n\t\"\"\"Returns all the possible codons that a polymorphic codon can yield\"\"\"\n\treturn getSequenceCombinaisons(codon, 0)\n\ndef encodePolymorphicNucleotide(polySeq) :\n\t\"\"\"returns a single character encoding all the nucleotides of polySeq. 
PolySeq must have one of the following forms: \n\t['A', 'T', 'G'], 'ATG', 'A/T/G'\"\"\"\n\t\n\tif type(polySeq) is types.StringType :\n\t\tif polySeq.find(\"/\") < 0 :\n\t\t\tsseq = list(polySeq)\n\t\telse :\n\t\t\tsseq = polySeq.split('/')\n\t\t\t\n\telse :\n\t\tsseq = polySeq\n\t\n\tseq = []\n\tfor n in sseq :\n\t\ttry :\n\t\t\tfor n2 in polymorphicNucleotides[n] :\n\t\t\t\tseq.append(n2)\n\t\texcept KeyError :\n\t\t\tseq.append(n)\n\t\n\tseq = set(seq)\n\t\n\tif len(seq) == 4:\n\t\treturn 'N'\n\telif len(seq) == 3 :\n\t\tif 'T' not in seq :\n\t\t\treturn 'V'\n\t\telif 'G' not in seq :\n\t\t\treturn 'H'\n\t\telif 'C' not in seq :\n\t\t\treturn 'D'\n\t\telif 'A' not in seq :\n\t\t\treturn 'B'\n\telif len(seq) == 2 :\n\t\tif 'A' in seq and 'G' in seq :\n\t\t\treturn 'R'\n\t\telif 'C' in seq and 'T' in seq :\n\t\t\treturn 'Y'\n\t\telif 'A' in seq and 'C' in seq :\n\t\t\treturn 'M'\n\t\telif 'T' in seq and 'G' in seq :\n\t\t\treturn 'K'\n\t\telif 'A' in seq and 'T' in seq :\n\t\t\treturn 'W'\n\t\telif 'C' in seq and 'G' in seq :\n\t\t\treturn 'S'\n\telif polySeq[0] in nucleotides :\n\t\treturn polySeq[0]\n\telse :\n\t\traise UnknownNucleotide(polySeq)\n\ndef decodePolymorphicNucleotide(nuc) :\n\t\"\"\"the opposite of encodePolymorphicNucleotide, from 'R' to ['A', 'G']\"\"\"\n\tif nuc in polymorphicNucleotides :\n\t\treturn polymorphicNucleotides[nuc]\n\n\tif nuc in nucleotides :\n\t\treturn nuc\n\n\traise ValueError('nuc: %s, is not a valid nucleotide' % nuc)\n\ndef decodePolymorphicNucleotide_str(nuc) :\n\t\"\"\"same as decodePolymorphicNucleotide but returns a string instead \n\tof a list, from 'R' to 'A/G'\"\"\"\n\treturn '/'.join(decodePolymorphicNucleotide(nuc))\n\ndef getNucleotideCodon(sequence, x1) :\n\t\"\"\"Returns the entire codon of the nucleotide at pos x1 in sequence, \n\tand the position of that nucleotide in the codon in a tuple\"\"\"\n\n\tif x1 < 0 or x1 >= len(sequence) :\n\t\treturn None\n\n\tp = x1%3\n\tif p == 0 :\n\t\treturn (sequence[x1: x1+3], 
0)\n\telif p == 1 :\n\t\treturn (sequence[x1-1: x1+2], 1)\n\telif p == 2 :\n\t\treturn (sequence[x1-2: x1+1], 2)\n\ndef showDifferences(seq1, seq2) :\n\t\"\"\"Returns a string highlighting the differences between seq1 and seq2:\n\t\n\t* Matches by '-'\n\t\n\t* Differences : 'A|T'\n\t\n\t* Exceeded length : '#'\n\t\n\t\"\"\"\n\tret = []\n\tfor i in range(max(len(seq1), len(seq2))) :\n\n\t\tif i >= len(seq1) :\n\t\t\tc1 = '#'\n\t\telse :\n\t\t\tc1 = seq1[i]\n\t\tif i >= len(seq2) :\n\t\t\tc2 = '#'\n\t\telse :\n\t\t\tc2 = seq2[i]\n\n\t\tif c1 != c2 :\n\t\t\tret.append('%s|%s' % (c1, c2))\n\t\telse :\n\t\t\tret.append('-')\n\n\treturn ''.join(ret)\n\ndef highlightSubsequence(sequence, x1, x2, start=' [', stop = '] ') :\n\t\"\"\"returns a sequence where the subsequence in [x1, x2[ is placed \n\tbetween 'start' and 'stop'\"\"\"\n\n\tseq = list(sequence)\n\tseq[x1] = start + seq[x1]\n\tseq[x2-1] = seq[x2-1] + stop\n\treturn ''.join(seq)\n\n# def highlightSubsequence(sequence, x1, x2, start=' [', stop = '] ') :\n# \t\"\"\"returns a sequence where the subsequence in [x1, x2[ is placed \n# \tbetween 'start' and 'stop'\"\"\"\n\n# \tseq = list(sequence)\n# \tii = 0\n# \tacc = True\n# \tfor i in range(len(seq)) :\n# \t\tif ii == x1 :\n# \t\t\tseq[i] = start+seq[i]\n# \t\tif ii == x2-1 :\n# \t\t\tseq[i] = seq[i] + stop\n\n# \t\tif i < len(seq) - 1 :\n# \t\t\tif seq[i+1] == '/':\n# \t\t\t\tacc = False\n# \t\t\telse :\n# \t\t\t\tacc = True\n\n# \t\tif acc :\n# \t\t\tii += 1\n\n# \treturn ''.join(seq)\n"
  },
  {
    "path": "pyGeno/tools/__init__.py",
    "content": "__all__ = ['BinarySequence', 'CSVTools', 'FastaTools', 'FastqTools', 'GTFTools', 'io', 'ProgressBar', 'SecureMmap', 'SegmentTree', 'SingletonManager', 'UsefulFunctions']\n"
  },
  {
    "path": "pyGeno/tools/io.py",
    "content": "import sys\n\ndef printf(*s) :\n\t'print + sys.stdout.flush()'\n\tfor e in s[:-1] :\n\t\tprint e,\n\tprint s[-1]\n\n\tsys.stdout.flush()\n\ndef enterConfirm_prompt(enterMsg) :\n\tstopi = False\n\twhile not stopi :\n\t\tprint \"====\\n At any time you can quit by entering 'quit'\\n====\"\n\t\tvali = raw_input(enterMsg)\n\t\tif vali.lower() == 'quit' :\n\t\t\tvali = None\n\t\t\tstopi = True\n\t\telse :\n\t\t\tprint \"You've entered:\\n\\t%s\" % vali\n\t\t\tvalj = confirm_prompt(\"\")\n\t\t\tif valj == 'yes' :\n\t\t\t\tstopi = True\n\t\t\tif valj == 'quit' :\n\t\t\t\tvali = None\n\t\t\t\tstopi = True\n\t\t\t\t\n\treturn vali\n\ndef confirm_prompt(preMsg) :\n\twhile True :\n\t\tval = raw_input('%splease confirm (\"yes\", \"no\", \"quit\"): ' % preMsg)\n\t\tif val.lower() == 'yes' or val.lower() == 'no' or val.lower() == 'quit':\n\t\t\treturn val.lower()\n\t\t\n"
  },
  {
    "path": "pyGeno/tools/parsers/CSVTools.py",
    "content": "import os, types, collections\n\nclass EmptyLine(Exception) :\n\t\"\"\"Raised when an empty or comment line is found (dealt with internally)\"\"\"\n\n\tdef __init__(self, lineNumber) :\n\t\tmessage = \"Empty line: #%d\" % lineNumber \n\t\tException.__init__(self, message)\n\t\tself.message = message\n\n\tdef __str__(self) :\n\t\treturn self.message\n\ndef removeDuplicates(inFileName, outFileName) :\n\t\"\"\"removes duplicated lines from a 'inFileName' CSV file, the results are witten in 'outFileName'\"\"\"\n\tf = open(inFileName)\n\tlegend = f.readline()\n\t\n\tdata = ''\n\th = {}\n\th[legend] = 0\n\t\n\tlines = f.readlines()\n\tfor l in lines :\n\t\tif not h.has_key(l) :\n\t\t\th[l] = 0\n\t\t\tdata += l\n\t\t\t\n\tf.flush()\n\tf.close()\n\tf = open(outFileName, 'w')\n\tf.write(legend+data)\n\tf.flush()\n\tf.close()\n\ndef catCSVs(folder, ouputFileName, removeDups = False) :\n\t\"\"\"Concatenates all csv in 'folder' and wites the results in 'ouputFileName'. My not work on non Unix systems\"\"\"\n\tstrCmd = r\"\"\"cat %s/*.csv > %s\"\"\" %(folder, ouputFileName)\n\tos.system(strCmd)\n\n\tif removeDups :\n\t\tremoveDuplicates(ouputFileName, ouputFileName)\n\ndef joinCSVs(csvFilePaths, column, ouputFileName, separator = ',') :\n\t\"\"\"csvFilePaths should be an iterable. Joins all CSVs according to the values in the column 'column'. 
Write the results in a new file 'ouputFileName' \"\"\"\n\t\n\tres = ''\n\tlegend = []\n\tcsvs = []\n\t\n\tfor f in csvFilePaths :\n\t\tc = CSVFile()\n\t\tc.parse(f)\n\t\tcsvs.append(c)\n\t\tlegend.append(separator.join(c.legend.keys()))\n\tlegend = separator.join(legend)\n\t\n\tlines = []\n\tfor i in range(len(csvs[0])) :\n\t\tval = csvs[0].get(i, column)\n\t\tline = separator.join(csvs[0][i])\n\t\tfor c in csvs[1:] :\n\t\t\tfor j in range(len(c)) :\n\t\t\t\tif val == c.get(j, column) :\n\t\t\t\t\tline += separator + separator.join(c[j])\n\t\t\t\t\t\n\t\tlines.append( line )\n\t\n\tres = legend + '\\n' + '\\n'.join(lines)\n\t\n\tf = open(ouputFileName, 'w')\n\tf.write(res)\n\tf.flush()\n\tf.close()\n\t\n\treturn res\n\nclass CSVEntry(object) :\n\t\"\"\"A single entry in a CSV file\"\"\"\n\t\n\tdef __init__(self, csvFile, lineNumber = None) :\n\t\t\n\t\tself.csvFile = csvFile\n\t\tself.data = []\n\t\tif lineNumber != None :\n\t\t\tself.lineNumber = lineNumber\n\t\t\t\n\t\t\ttmpL = csvFile.lines[lineNumber].replace('\\r', '\\n').replace('\\n', '')\n\t\t\tif len(tmpL) == 0 or tmpL[0] in [\"#\", \"\\r\", \"\\n\", csvFile.lineSeparator] :\n\t\t\t\traise EmptyLine(lineNumber)\n\n\t\t\ttmpData = tmpL.split(csvFile.separator)\n\n\t\t\t# tmpDatum = []\n\t\t\ti = 0\n\t\t\twhile i < len(tmpData) :\n\t\t\t# for d in tmpData :\n\t\t\t\td = tmpData[i]\n\t\t\t\tsd = d.strip()\n\t\t\t\t# print sd, tmpData, i\n\t\t\t\tif len(sd) > 0 and sd[0] == csvFile.stringSeparator :\n\t\t\t\t\tmore = []\t\n\t\t\t\t\tfor i in xrange(i, len(tmpData)) :\n\t\t\t\t\t\tmore.append(tmpData[i])\n\t\t\t\t\t\ti+=1\n\t\t\t\t\t\tif more[-1][-1] == csvFile.stringSeparator :\n\t\t\t\t\t\t\tbreak\n\t\t\t\t\tself.data.append(\",\".join(more)[1:-1])\n\t\t\t\t\t\n\t\t\t\t# if len(tmpDatum) > 0 or (len(sd) > 0 and sd[0] == csvFile.stringSeparator) :\n\t\t\t\t# \ttmpDatum.append(sd)\n\t\t\t\t\t\n\t\t\t\t# \tif len(sd) > 0 and sd[-1] == csvFile.stringSeparator :\n\t\t\t\t# 
\t\tself.data.append(csvFile.separator.join(tmpDatum))\n\t\t\t\t# \t\ttmpDatum = []\n\t\t\t\telse :\n\t\t\t\t\tself.data.append(sd)\n\t\t\t\t\ti += 1\n\t\telse :\n\t\t\tself.lineNumber = len(csvFile)\n\t\t\tfor i in range(len(self.csvFile.legend)) :\n\t\t\t\tself.data.append('')\n\n\tdef commit(self) :\n\t\t\"\"\"commits the line so it is added to a file stream\"\"\"\n\t\tself.csvFile.commitLine(self)\n\n\tdef __iter__(self) :\n\t\tself.currentField = -1\n\t\treturn self\n\t\n\tdef next(self) :\n\t\tself.currentField += 1\n\t\tif self.currentField >= len(self.csvFile.legend) :\n\t\t\traise StopIteration\n\t\t\t\n\t\tk = self.csvFile.legend.keys()[self.currentField]\n\t\tv = self.data[self.currentField]\n\t\treturn k, v\n\n\tdef __getitem__(self, key) :\n\t\t\"\"\"Returns the value of field 'key'\"\"\"\n\t\ttry :\n\t\t\tindice = self.csvFile.legend[key.lower()]\n\t\texcept KeyError :\n\t\t\traise KeyError(\"CSV File has no column: '%s'\" % key)\n\t\t\n\t\treturn self.data[indice]\n\n\tdef __setitem__(self, key, value) :\n\t\t\"\"\"Sets the value of field 'key' to 'value' \"\"\"\n\t\ttry :\n\t\t\tfield = self.csvFile.legend[key.lower()]\n\t\texcept KeyError :\n\t\t\tself.csvFile.addField(key)\n\t\t\tfield = self.csvFile.legend[key.lower()]\n\t\t\tself.data.append(str(value))\n\t\telse :\n\t\t\ttry:\n\t\t\t\tself.data[field] = str(value)\n\t\t\texcept Exception as e:\n\t\t\t\tfor i in xrange(field-len(self.data)+1) :\n\t\t\t\t\tself.data.append(\"\")\n\t\t\t\tself.data[field] = str(value)\n\n\tdef toStr(self) :\n\t\treturn self.csvFile.separator.join(self.data)\n\t\t\n\tdef __repr__(self) :\n\t\tr = {}\n\t\tfor k, v in self.csvFile.legend.iteritems() :\n\t\t\tr[k] = self.data[v]\n\n\t\treturn \"<line %d: %s>\" %(self.lineNumber, str(r))\n\t\t\n\tdef __str__(self) :\n\t\treturn repr(self)\n\t\nclass CSVFile(object) :\n\t\"\"\"\n\tRepresents a whole CSV file::\n\t\t\n\t\t#reading\n\t\tf = CSVFile()\n\t\tf.parse('hop.csv')\n\t\tfor line in f :\n\t\t\tprint 
line['ref']\n\n\t\t#writing, legend can either be a list or a dict {field : column number}\n\t\tf = CSVFile(legend = ['name', 'email'])\n\t\tl = f.newLine()\n\t\tl['name'] = 'toto'\n\t\tl['email'] = \"hop@gmail.com\"\n\t\t\n\t\tfor field, value in l :\n\t\t\tprint field, value\n\n\t\tf.save('myCSV.csv')\t\t\n\t\"\"\"\n\t\n\tdef __init__(self, legend = [], separator = ',', lineSeparator = '\\n') :\n\t\t\n\t\tself.legend = collections.OrderedDict()\n\t\tfor i in range(len(legend)) :\n\t\t\tif legend[i].lower() in self.legend :\n\t\t\t\traise ValueError(\"%s is already in the legend\" % legend[i].lower())\n\t\t\tself.legend[legend[i].lower()] = i\n\t\tself.strLegend = separator.join(legend)\n\n\t\tself.filename = \"\"\n\t\tself.lines = []\t\n\t\tself.separator = separator\n\t\tself.lineSeparator = lineSeparator\n\t\tself.currentPos = -1\n\t\n\t\tself.streamFile = None\n\t\tself.writeRate = None\n\t\tself.streamBuffer = None\n\t\tself.keepInMemory = True\n\n\tdef addField(self, field) :\n\t\t\"\"\"adds a field to the legend\"\"\"\n\t\tif field.lower() in self.legend :\n\t\t\traise ValueError(\"%s is already in the legend\" % field.lower())\n\t\tself.legend[field.lower()] = len(self.legend)\n\t\tif len(self.strLegend) > 0 :\n\t\t\tself.strLegend += self.separator + field\n\t\telse :\n\t\t\tself.strLegend += field\n\t\t\t\n\tdef parse(self, filePath, skipLines=0, separator = ',', stringSeparator = '\"', lineSeparator = '\\n') :\n\t\t\"\"\"Loads a CSV file\"\"\"\n\t\t\n\t\tself.filename = filePath\n\t\tf = open(filePath)\n\t\tif lineSeparator == '\\n' :\n\t\t\tlines = f.readlines()\n\t\telse :\n\t\t\tlines = f.read().split(lineSeparator)\n\t\tf.close()\n\t\t\n\t\tlines = lines[skipLines:]\n\t\tself.lines = []\n\t\tself.comments = []\n\t\tfor l in lines :\n\t\t\tif len(l) == 0 :\n\t\t\t\tcontinue\n\t\t\tif l[0] != \"#\" :\n\t\t\t\tself.lines.append(l)\n\t\t\telse :\n\t\t\t\tself.comments.append(l)\n\n\t\tself.separator = 
separator\n\t\tself.lineSeparator = lineSeparator\n\t\tself.stringSeparator = stringSeparator\n\t\tself.legend = collections.OrderedDict()\n\t\t\n\t\ti = 0\n\t\tfor c in self.lines[0].lower().replace(stringSeparator, '').split(separator) :\n\t\t\tlegendElement = c.strip()\n\t\t\tif legendElement not in self.legend :\n\t\t\t\tself.legend[legendElement] = i\n\t\t\ti+=1\n\t\n\t\tself.strLegend = self.lines[0].replace('\\r', '\\n').replace('\\n', '')\n\t\tself.lines = self.lines[1:]\n\t\t# sk = skipLines+1\n\t\t# for l in self.lines :\n\t\t# \tif l[0] == \"#\" :\n\t\t# \t\tsk += 1\n\t\t# \telse :\n\t\t# \t\tbreak\n\n\t\t# self.header = self.lines[:sk]\n\t\t# self.lines = self.lines[sk:]\n\t\n\n\tdef streamToFile(self, filename, keepInMemory = False, writeRate = 1) :\n\t\t\"\"\"Starts a stream to a file. Every line must be committed (l.commit()) to be appended in to the file.\n\n\t\tIf keepInMemory is set to True, the parser will keep a version of the whole CSV in memory, writeRate is the number\n\t\tof lines that must be committed before an automatic save is triggered.\n\t\t\"\"\"\n\t\tif len(self.legend) < 1 :\n\t\t\traise ValueError(\"There's no legend defined\")\n\n\t\ttry :\n\t\t\tos.remove(filename)\n\t\texcept :\n\t\t\tpass\n\t\t\n\t\tself.streamFile = open(filename, \"a\")\n\t\tself.writeRate = writeRate\n\t\tself.streamBuffer = []\n\t\tself.keepInMemory = keepInMemory\n\n\t\tself.streamFile.write(self.strLegend + \"\\n\")\n\n\tdef commitLine(self, line) :\n\t\t\"\"\"Commits a line making it ready to be streamed to a file and saves the current buffer if needed. 
If no stream is active, raises a ValueError\"\"\"\n\t\tif self.streamBuffer is None :\n\t\t\traise ValueError(\"Committing lines is only possible when streaming to a file\")\n\n\t\tself.streamBuffer.append(line)\n\t\tif len(self.streamBuffer) % self.writeRate == 0 :\n\t\t\tfor i in xrange(len(self.streamBuffer)) :\n\t\t\t\tself.streamBuffer[i] = str(self.streamBuffer[i])\n\t\t\tself.streamFile.write(\"%s\\n\" % ('\\n'.join(self.streamBuffer)))\n\t\t\tself.streamFile.flush()\n\t\t\tself.streamBuffer = []\n\n\tdef closeStreamToFile(self) :\n\t\t\"\"\"Appends the remaining committed lines and closes the stream. If no stream is active, raises a ValueError\"\"\"\n\t\tif self.streamBuffer is None :\n\t\t\traise ValueError(\"Closing the stream is only possible when streaming to a file\")\n\n\t\tfor i in xrange(len(self.streamBuffer)) :\n\t\t\tself.streamBuffer[i] = str(self.streamBuffer[i])\n\t\tself.streamFile.write('\\n'.join(self.streamBuffer))\n\t\tself.streamFile.close()\n\n\t\tself.streamFile = None\n\t\tself.writeRate = None\n\t\tself.streamBuffer = None\n\t\tself.keepInMemory = True\n\n\tdef _developLine(self, line) :\n\t\tstop = False\n\t\twhile not stop :\n\t\t\ttry :\n\t\t\t\tif self.lines[line].__class__ is not CSVEntry :\n\t\t\t\t\tdevL = CSVEntry(self, line)\n\t\t\t\t\tstop = True\n\t\t\t\telse :\n\t\t\t\t\tdevL = self.lines[line]\n\t\t\t\t\tstop = True\n\t\t\texcept EmptyLine :\n\t\t\t\tdel(self.lines[line])\n\t\t\n\t\tself.lines[line] = devL\n\t\n\tdef get(self, line, key) :\n\t\tself._developLine(line)\n\t\treturn self.lines[line][key]\n\n\tdef set(self, line, key, val) :\n\t\tself._developLine(line)\n\t\tself.lines[line][key] = val\n\t\n\tdef newLine(self) :\n\t\t\"\"\"Appends an empty line at the end of the CSV and returns it\"\"\"\n\t\tl = CSVEntry(self)\n\t\tif self.keepInMemory :\n\t\t\tself.lines.append(l)\n\t\treturn l\n\t\n\tdef insertLine(self, i) :\n\t\t\"\"\"Inserts an empty line at position i and returns it\"\"\"\n\t\tself.lines.insert(i, 
CSVEntry(self))\n\t\treturn self.lines[i]\n\t\n\tdef save(self, filePath) :\n\t\t\"\"\"save the CSV to a file\"\"\"\n\t\tself.filename = filePath\n\t\tf = open(filePath, 'w')\n\t\tf.write(self.toStr())\n\t\tf.flush()\n\t\tf.close()\n\n\tdef toStr(self) :\n\t\t\"\"\"returns a string version of the CSV\"\"\"\n\t\ts = [self.strLegend]\n\t\tfor l in self.lines :\n\t\t\ts.append(l.toStr())\n\t\treturn self.lineSeparator.join(s)\n\t\n\tdef __iter__(self) :\n\t\tself.currentPos = -1\n\t\treturn self\n\t\n\tdef next(self) :\n\t\tself.currentPos += 1\n\t\tif self.currentPos >= len(self) :\n\t\t\traise StopIteration\n\t\t\t\n\t\tself._developLine(self.currentPos)\n\t\treturn self.lines[self.currentPos]\n\t\n\tdef __getitem__(self, line) :\n\t\ttry :\n\t\t\tif self.lines[line].__class__ is not CSVEntry :\n\t\t\t\tself._developLine(line)\n\t\texcept AttributeError :\n\t\t\tstart = line.start\n\t\t\tif start is None :\n\t\t\t\tstart = 0\n\n\t\t\tfor l in xrange(len(self.lines[line])) :\n\t\t\t\tself._developLine(l + start)\n\n\t\t\t# start, stop = line.start, line.stop\n\t\t\t# if start is None :\n\t\t\t# \tstart = 0\n\t\t\t\n\t\t\t# if stop is None :\n\t\t\t# \tstop = 0\n\n\t\t\t# for l in xrange(start, stop) :\n\t\t\t# \tself._developLine(l)\n\n\t\treturn self.lines[line]\n\n\tdef __len__(self) :\n\t\treturn len(self.lines)\n"
  },
  {
    "path": "pyGeno/tools/parsers/CasavaTools.py",
    "content": "import gzip\nimport pyGeno.tools.UsefulFunctions as uf\n\nclass SNPsTxtEntry(object) :\n\t\"\"\"A single entry in the casavas snps.txt file\"\"\"\n\t\n\tdef __init__(self, lineNumber, snpsTxtFile) :\n\t\tself.snpsTxtFile = snpsTxtFile\n\t\tself.lineNumber = lineNumber\n\t\tself.values = {}\n\t\tsl = str(snpsTxtFile.data[lineNumber]).replace('\\t\\t', '\\t').split('\\t')\n\t\t\n\t\tself.values['chromosomeNumber'] = sl[0].upper().replace('CHR', '')\n\t\t#first column: chro, second first of range (identical to second column)\n\t\tself.values['start'] = int(sl[2])\n\t\tself.values['end'] = int(sl[2])+1\n\t\tself.values['bcalls_used'] = sl[3]\n\t\tself.values['bcalls_filt'] = sl[4]\n\t\tself.values['ref'] = sl[5]\n\t\tself.values['QSNP'] = int(sl[6])\n\t\tself.values['alleles'] = uf.encodePolymorphicNucleotide(sl[7]) #max_gt\n\t\tself.values['Qmax_gt'] = int(sl[8])\n\t\tself.values['max_gt_poly_site'] = sl[9]\n\t\tself.values['Qmax_gt_poly_site'] = int(sl[10])\n\t\tself.values['A_used'] = int(sl[11])\n\t\tself.values['C_used'] = int(sl[12])\n\t\tself.values['G_used'] = int(sl[13])\n\t\tself.values['T_used'] = int(sl[14])\n\t\n\tdef __getitem__(self, fieldName):\n\t\t\"\"\"Returns the value of field 'fieldName'\"\"\"\n\t\treturn self.values[fieldName]\n\t\n\tdef __setitem__(self, fieldName, value) :\n\t\t\"\"\"Sets the value of field 'fieldName' to 'value' \"\"\"\n\t\tself.values[fieldName] = value\n\t\t\n\tdef __str__(self):\n\t\treturn str(self.values)\n\t\nclass SNPsTxtFile(object) :\n\t\"\"\"\n\tRepresents a whole casava's snps.txt file::\n\t\t\n\t\tf = SNPsTxtFile('snps.txt')\n\t\tfor line in f :\n\t\t\tprint line['ref']\n\t\n\t\"\"\"\n\tdef __init__(self, fil, gziped = False) :\n\t\tself.reset()\n\t\tif not gziped :\n\t\t\tf = open(fil)\n\t\telse :\n\t\t\tf = gzip.open(fil)\n\t\t\n\t\tfor l in f :\n\t\t\tif l[0] != '#' :\n\t\t\t\tself.data.append(l)\n\n\t\tf.close()\n\n\tdef reset(self) :\n\t\t\"\"\"Frees the file\"\"\"\n\t\tself.data = 
[]\n\t\tself.currentPos = 0\n\t\n\tdef __iter__(self) :\n\t\tself.currentPos = 0\n\t\treturn self\n\t\n\tdef next(self) :\n\t\tif self.currentPos >= len(self) :\n\t\t\traise StopIteration()\n\t\tv = self[self.currentPos]\n\t\tself.currentPos += 1\n\t\treturn v\n\n\tdef __getitem__(self, i) :\n\t\tif self.data[i].__class__ is not SNPsTxtEntry :\n\t\t\tself.data[i] = SNPsTxtEntry(i, self)\n\t\treturn self.data[i]\n\t\n\tdef __len__(self) :\n\t\treturn len(self.data)\n"
  },
  {
    "path": "pyGeno/tools/parsers/FastaTools.py",
    "content": "import os\n\nclass FastaFile(object) :\n\t\"\"\"\n\tRepresents a whole Fasta file::\n\t\t\n\t\t#reading\n\t\tf = FastaFile()\n\t\tf.parseFile('hop.fasta')\n\t\tfor line in f :\n\t\t\tprint line\n\t\t\n\t\t#writing\n\t\tf = FastaFile()\n\t\tf.add(\">prot1\", \"MLPADEV\")\n\t\tf.save('myFasta.fasta')\n\t\"\"\"\n\tdef __init__(self, fil = None) :\n\t\tself.reset()\n\t\tif fil != None :\n\t\t\tself.parseFile(fil)\n\t\n\tdef reset(self) :\n\t\t\"\"\"Erases everything\"\"\"\n\t\tself.data = []\n\t\tself.currentPos = 0\n\t\n\tdef parseStr(self, st) :\n\t\t\"\"\"Parses a string\"\"\"\n\t\tself.data = st.split('>')[1:]\n\n\tdef parseFile(self, fil) :\n\t\t\"\"\"Opens a file and parses it\"\"\"\n\t\tf = open(fil)\n\t\tself.parseStr(f.read())\n\t\tf.close()\n\n\tdef __splitLine(self, li) :\n\t\tif len(self.data[li]) != 2 :\n\t\t\tself.data[li] = self.data[li].replace('\\r', '\\n')\n\t\t\tself.data[li] = self.data[li].replace('\\n\\n', '\\n')\n\t\t\tl = self.data[li].split('\\n')\n\t\t\theader = '>'+l[0]\n\t\t\tdata = ''.join(l[1:])\n\t\t\tself.data[li] = (header, data)\n\n\tdef get(self, i) :\n\t\t\"\"\"returns the ith entry\"\"\"\n\t\tself.__splitLine(i)\n\t\treturn self.data[i]\n\t\t\n\tdef add(self, header, data) :\n\t\t\"\"\"appends a new entry to the file\"\"\"\n\t\tif header[0] != '>' :\n\t\t\tself.data.append(('>'+header, data))\n\t\telse :\n\t\t\tself.data.append((header, data))\n\t\n\tdef save(self, filePath) :\n\t\t\"\"\"saves the file into filePath\"\"\"\n\t\tf = open(filePath, 'w')\n\t\tf.write(self.toStr())\n\t\tf.close()\n\t\n\tdef toStr(self) :\n\t\t\"\"\"returns a string version of self\"\"\"\n\t\tst = \"\"\n\t\tfor d in self.data :\n\t\t\tst += \"%s\\n%s\\n\" % (d[0], d[1]) \n\t\n\t\treturn st\n\t\t\n\tdef __iter__(self) :\n\t\tself.currentPos = 0\n\t\treturn self\n\t\n\tdef next(self) :\n\t\t#self to call getitem, and split he line if necessary\n\t\ti = self.currentPos +1\n\t\t#print i-1, self.currentPos\n\t\tif i > len(self) :\n\t\t\traise 
StopIteration()\n\t\t\t\n\t\tself.currentPos = i\n\t\treturn self[self.currentPos-1]\n\n\tdef __getitem__(self, i) :\n\t\t\"\"\"returns the ith entry\"\"\"\n\t\treturn self.get(i)\n\t\n\tdef __setitem__(self, i, v) :\n\t\t\"\"\"sets the value of the ith entry\"\"\"\n\t\tif len(v) != 2: \n\t\t\traise TypeError(\"v must have a len of 2 : (header, data)\")\n\t\t\t\n\t\tself.data[i] = v\n\t\t\n\tdef __len__(self) :\n\t\t\"\"\"returns the number of entries\"\"\"\n\t\treturn len(self.data)\n"
  },
  {
    "path": "pyGeno/tools/parsers/FastqTools.py",
    "content": "import os\n\nclass FastqEntry(object) :\n\t\"\"\"A single entry in the FastqEntry file\"\"\"\n\t\n\tdef __init__(self, ident = \"\", seq = \"\", plus = \"\", qual = \"\") :\n\t\tself.values = {}\n\t\tself.values['identifier'] = ident\n\t\tself.values['sequence'] = seq\n\t\tself.values['+'] = plus\n\t\tself.values['qualities'] = qual\n\t\n\tdef __getitem__(self, i):\n\t\treturn self.values[i]\n\t\n\tdef __setitem__(self, i, v) :\n\t\tself.values[i] = v\n\t\t\n\tdef __str__(self):\n\t\treturn \"%s\\n%s\\n%s\\n%s\" %(self.values['identifier'], self.values['sequence'], self.values['+'], self.values['qualities'])\n\t\nclass FastqFile(object) :\n\t\"\"\"\n\tRepresents a whole Fastq file::\n\t\t\n\t\t#reading\n\t\tf = FastqFile()\n\t\tf.parse('hop.fastq')\n\t\tfor line in f :\n\t\t\tprint line['sequence']\n\t\t\n\t\t#writing, legend can either be a list of a dict {field : column number}\n\t\tf = CSVFile(legend = ['name', 'email'])\n\t\tl = f.newLine()\n\t\tl['name'] = 'toto'\n\t\tl['email'] = \"hop@gmail.com\"\n\t\tf.save('myCSV.csv')\n\t\t\n\t\"\"\"\n\t\n\tdef __init__(self, fil = None) :\n\t\tself.reset()\n\t\tif fil != None :\n\t\t\tself.parseFile(fil)\n\t\n\tdef reset(self) :\n\t\t\"\"\"Frees the file\"\"\"\n\t\tself.data = []\n\t\tself.currentPos = 0\n\t\n\tdef parseStr(self, st) :\n\t\t\"\"\"Parses a string\"\"\"\n\t\tself.data = st.replace('\\r', '\\n')\n\t\tself.data = self.data.replace('\\n\\n', '\\n')\n\t\tself.data = self.data.split('\\n')\n\n\tdef parseFile(self, fil) :\n\t\t\"\"\"Parses a file on disc\"\"\"\n\t\tf = open(fil)\n\t\tself.parseStr(f.read())\n\t\tf.close()\t\t\n\t\t\n\tdef __splitEntry(self, li) :\n\t\ttry :\n\t\t\tself.data['+'] \n\t\t\treturn self.data\n\t\texcept :\n\t\t\tself.data[li] = FastqEntry(self.data[li], self.data[li+1], self.data[li+2], self.data[li+3])\n\t\t\t\n\tdef get(self, li) :\n\t\t\"\"\"returns the ith entry\"\"\"\n\t\ti = li*4\n\t\tself.__splitEntry(i)\n\t\treturn self.data[i]\n\t\n\tdef newEntry(self, ident 
= \"\", seq = \"\", plus = \"\", qual = \"\") :\n\t\t\"\"\"Appends an empty entry at the end of the CSV and returns it\"\"\"\n\t\te = FastqEntry()\n\t\tself.data.append(e)\n\t\treturn e\n\t\n\tdef add(self, fastqEntry) :\n\t\t\"\"\"appends an entry to self\"\"\"\n\t\tself.data.append(fastqEntry)\n\t\t\n\tdef save(self, filePath) :\n\t\tf = open(filePath, 'w')\n\t\tf.write(self.make())\n\t\tf.close()\n\t\t\t\n\tdef toStr(self) :\n\t\tst = \"\"\n\t\tfor d in self.data :\n\t\t\tst += \"%s\\n%s\" % (d[0], d[1]) \n\t\n\t\treturn st\n\t\t\n\tdef __iter__(self) :\n\t\tself.currentPos = 0\n\t\treturn self\n\t\n\tdef next(self) :\n\t\t#self to call getitem, and split he line if necessary\n\t\ti = self.currentPos +1\n\t\t#print i-1, self.currentPos\n\t\tif i > len(self) :\n\t\t\traise StopIteration()\n\t\t\t\n\t\tself.currentPos = i\n\t\treturn self[self.currentPos-1]\n\n\tdef __getitem__(self, i) :\n\t\t\"\"\"returns the ith entry\"\"\"\n\t\treturn self.get(i)\n\t\n\tdef __setitem__(self, i, v) :\n\t\t\"\"\"sets the ith entry\"\"\"\n\t\tif len(v) != 2: \n\t\t\traise TypeError(\"v must have a len of 2 : (header, data)\")\n\t\t\t\n\t\tself.data[i] = v\n\t\t\n\tdef __len__(self) :\n\t\treturn len(self.data)/4\n"
  },
  {
    "path": "pyGeno/tools/parsers/GTFTools.py",
    "content": "import gzip\n\nclass GTFEntry(object) :\n\tdef __init__(self, gtfFile, lineNumber) :\n\t\t\"\"\"A single entry in a GTF file\"\"\"\n\t\t\n\t\tself.lineNumber = lineNumber\n\t\tself.gtfFile = gtfFile\n\t\tself.data = gtfFile.lines[lineNumber][:-2].split('\\t') #-2 remove ';\\n'\n\t\tproto_atts = self.data[gtfFile.legend['attributes']].strip().split('; ')\n\t\tatts = {}\n\t\tfor a in proto_atts :\n\t\t\tsa = a.split(' ')\n\t\t\tatts[sa[0]] = sa[1].replace('\"', '')\n\t\tself.data[gtfFile.legend['attributes']] = atts\n\t\n\tdef __getitem__(self, k) :\n\t\ttry :\n\t\t\treturn self.data[self.gtfFile.legend[k]]\n\t\texcept KeyError :\n\t\t\ttry :\n\t\t\t\treturn self.data[self.gtfFile.legend['attributes']][k]\n\t\t\texcept KeyError :\n\t\t\t\t#return None\n\t\t\t\traise KeyError(\"Line %d does not have an element %s.\\nline:%s\" %(self.lineNumber, k, self.gtfFile.lines[self.lineNumber]))\n\t\n\tdef __repr__(self) :\n\t\treturn \"<GTFEntry line: %d>\" % self.lineNumber\n\t\n\tdef __str__(self) :\n\t\treturn  \"<GTFEntry line: %d, %s>\" % (self.lineNumber, str(self.data))\n\nclass GTFFile(object) :\n\t\"\"\"This is a simple GTF2.2 (Revised Ensembl GTF) parser, see http://mblab.wustl.edu/GTF22.html for more infos\"\"\"\n\tdef __init__(self, filename, gziped = False) :\n\t\t\n\t\tself.filename = filename\n\t\tself.legend = {'seqname' : 0, 'source' : 1, 'feature' : 2, 'start' : 3, 'end' : 4, 'score' : 5, 'strand' : 6, 'frame' : 7, 'attributes' : 8}\n\n\t\tif gziped : \n\t\t\tf = gzip.open(filename)\n\t\telse :\n\t\t\tf = open(filename)\n\t\t\n\t\tself.lines = []\n\t\tfor l in f :\n\t\t\tif l[0] != '#' and l != '' :\n\t\t\t\tself.lines.append(l)\n\t\tf.close()\n\t\t\n\t\tself.currentIt = -1\n\n\tdef get(self, line, elmt) :\n\t\t\"\"\"returns the value of the field 'elmt' of line 'line'\"\"\"\n\t\treturn self[line][elmt]\n\n\tdef __iter__(self) :\n\t\tself.currentPos = -1\n\t\treturn self\n\n\tdef next(self) :\n\t\tself.currentPos += 1\n\t\ttry :\n\t\t\treturn 
self[self.currentPos]\n\t\texcept IndexError :\n\t\t\traise StopIteration\n\n\tdef __getitem__(self, i) :\n\t\t\"\"\"returns the ith entry\"\"\"\n\t\tif self.lines[i].__class__ is not GTFEntry :\n\t\t\tself.lines[i] = GTFEntry(self, i)\n\t\treturn self.lines[i]\n\n\tdef __repr__(self) :\n\t\treturn \"<GTFFile: %s>\" % (os.path.basename(self.filename))\n\n\tdef __str__(self) :\n\t\treturn \"<GTFFile: %s, gziped: %s, len: %d>\" % (os.path.basename(self.filename), self.gziped, len(self))\n\t\n\tdef __len__(self) :\n\t\treturn len(self.lines)\n"
  },
  {
    "path": "pyGeno/tools/parsers/VCFTools.py",
    "content": "import os, gzip\n\nclass VCFEntry(object) :\n\t\"\"\"A single entry in a VCF file\"\"\"\n\t\n\tdef __init__(self, vcfFile, line, lineNumber) :\n\t\t#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003\n\t\t#20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.\n\t\tself.vcfFile = vcfFile\n\t\tself.lineNumber = lineNumber\n\t\tself.data = {}\n\t\t\n\t\ttmpL = line.replace('\\r', '\\n').replace('\\n', '')\n\t\ttmpData = str(tmpL).split('\\t')\n\t\tfor i in range(6) :\n\t\t\tself.data[vcfFile.dnegel[i]] = tmpData[i]\n\t\tself.data['POS'] = int(self.data['POS'])\n\t\t\n\t\tfilters = tmpData[6].split(';')\n\t\tif len(filters) == 1 :\n\t\t\tself.data['FILTER'] = filters\n\t\telse :\n\t\t\tself.data['FILTER'] = {}\n\t\t\tfor filter_value in filters :\n\t\t\t\tfilt, value = filter_value.split('=')\n\t\t\t\tself.data['FILTER'][filt] = value\n\t\t\n\t\tself.data['INFO'] = {}\n\t\tfor s in tmpData[7].split(';') :\n\t\t\tinfo_value = s.split('=')\n\t\t\ttry :\n\t\t\t\ttyp = self.vcfFile.meta['INFO'][info_value[0]]['Type']\n\t\t\texcept KeyError :\n\t\t\t\ttyp = None\n\t\t\t\t\n\t\t\tif len(info_value) == 1 :\n\t\t\t\tif typ == 'Flag' or typ is None :\n\t\t\t\t\tself.data['INFO'][info_value[0]] = True\n\t\t\t\telse :\n\t\t\t\t\traise ValueError('%s is not a flag and has no value' % info_value[0])\n\t\t\telse :\n\t\t\t\tif typ == 'Integer' :\n\t\t\t\t\tself.data['INFO'][info_value[0]] = int(info_value[1])\n\t\t\t\telif typ == 'Float' :\n\t\t\t\t\tself.data['INFO'][info_value[0]] = float(info_value[1])\n\t\t\t\telse :\n\t\t\t\t\tself.data['INFO'][info_value[0]] = info_value[1]\n\t\t\t\t\n\tdef __getitem__(self, key) :\n\t\t\"With the VCF file format some fields are not present in all entries; therefore this function never raises an exception but returns None, or False if the field is defined as a Flag in 
Meta\"\n\t\ttry :\n\t\t\treturn self.data[key]\n\t\texcept KeyError :\n\t\t\ttry :\n\t\t\t\treturn self.data['INFO'][key]\n\t\t\texcept KeyError :\n\t\t\t\ttry :\n\t\t\t\t\tif self.vcfFile.meta['INFO'][key]['Type'] == 'Flag' :\n\t\t\t\t\t\tself.data['INFO'][key] = False\n\t\t\t\t\t\treturn self.data['INFO'][key]\n\t\t\t\t\telse :\n\t\t\t\t\t\treturn None\n\t\t\t\texcept KeyError :\n\t\t\t\t\treturn None\n\t\n\tdef __repr__(self) :\n\t\treturn \"<VCFEntry line: %d>\" % self.lineNumber\n\t\n\tdef __str__(self) :\n\t\treturn \"<VCFEntry line: %d, %s>\" % (self.lineNumber, str(self.data))\n\nclass VCFFile(object) :\n\t\"\"\"\n\tThis is a small parser for VCF files. It should work with any VCF file but has only been tested on dbSNP138 files.\n\tRepresents a whole VCF file::\n\t\t\n\t\t#reading\n\t\tf = VCFFile()\n\t\tf.parse('hop.vcf')\n\t\tfor line in f :\n\t\t\tprint line['POS']\n\t\"\"\"\n\t\n\tdef __init__(self, filename = None, gziped = False, stream = False) :\n\t\tself.legend = {}\n\t\tself.dnegel = {}\n\t\tself.meta = {}\n\t\tself.lines = None\n\t\tif filename :\n\t\t\tself.parse(filename, gziped, stream)\n\t\t\n\tdef parse(self, filename, gziped = False, stream = False) :\n\t\t\"\"\"opens a file\"\"\"\n\t\tself.stream = stream\n\t\t\n\t\tif gziped :\n\t\t\tself.f = gzip.open(filename)\n\t\telse :\n\t\t\tself.f = open(filename)\n\t\t\n\t\tself.filename = filename\n\t\tself.gziped = gziped\n\t\t\n\t\tlineId = 0\n\t\tinLegend = True\n\t\twhile inLegend :\n\t\t\tll = self.f.readline()\n\t\t\tl = ll.replace('\\r', '\\n').replace('\\n', '')\n\t\t\tif l[:2] == '##' :\n\t\t\t\teqPos = l.find('=')\n\t\t\t\tkey = l[2:eqPos]\n\t\t\t\tvalues = l[eqPos+1:].strip()\n\t\t\t\t\n\t\t\t\tif l[eqPos+1] != '<' :\n\t\t\t\t\tself.meta[key] = values\n\t\t\t\telse :\n\t\t\t\t\tif key not in self.meta :\n\t\t\t\t\t\tself.meta[key] = {}\n\t\t\t\t\tsvalues = l[eqPos+2:-1].split(',') #remove the < and > that surround the string\n\t\t\t\t\tidKey = 
svalues[0].split('=')[1]\n\t\t\t\t\tself.meta[key][idKey] = {}\n\t\t\t\t\ti = 1\n\t\t\t\t\tfor v in svalues[1:] :\n\t\t\t\t\t\tsv = v.split(\"=\")\n\t\t\t\t\t\tfield = sv[0]\n\t\t\t\t\t\tvalue = sv[1]\n\t\t\t\t\t\tif field.lower() == 'description' :\n\t\t\t\t\t\t\tself.meta[key][idKey][field] = ','.join(svalues[i:])[len(field)+2:-1]\n\t\t\t\t\t\t\tbreak\n\t\t\t\t\t\tself.meta[key][idKey][field] = value\n\t\t\t\t\t\ti += 1\n\t\t\telif l[:6] == '#CHROM' : #we are in the legend\n\t\t\t\tsl = l.split('\\t')\n\t\t\t\tfor i in range(len(sl)) :\n\t\t\t\t\tself.legend[sl[i]] = i\n\t\t\t\t\tself.dnegel[i] = sl[i]\n\t\t\t\tbreak\n\t\t\t\n\t\t\tlineId += 1\n\t\t\n\t\tif not stream :\n\t\t\tself.lines = self.f.readlines()\n\t\t\tself.f.close()\n\t\n\tdef close(self) :\n\t\t\"\"\"closes the file\"\"\"\n\t\tself.f.close()\n\t\t\n\tdef _developLine(self, lineNumber) :\n\t\tif self.lines[lineNumber].__class__ is not VCFEntry :\n\t\t\tself.lines[lineNumber] = VCFEntry(self, self.lines[lineNumber], lineNumber)\n\t\t\n\tdef __iter__(self) :\n\t\tself.currentPos = -1\n\t\treturn self\n\t\n\tdef next(self) :\n\t\tself.currentPos += 1\n\t\tif not self.stream :\n\t\t\ttry :\n\t\t\t\treturn self[self.currentPos]\n\t\t\texcept IndexError :\n\t\t\t\traise StopIteration\n\t\telse :\n\t\t\tmidfile_header = True\n\t\t\twhile midfile_header :\n\t\t\t\tline = self.f.readline()\n\t\t\t\tif not line :\n\t\t\t\t\traise StopIteration\n\t\t\t\tif not line.startswith('#') :\n\t\t\t\t\tmidfile_header = False\n\t\t\treturn VCFEntry(self, line, self.currentPos)\n\t\n\tdef __getitem__(self, line) :\n\t\t\"\"\"returns the lineth element\"\"\"\n\t\tif self.stream :\n\t\t\traise KeyError(\"When the file is opened as a stream it is impossible to ask for a specific item\")\n\t\t\n\t\tif self.lines[line].__class__ is not VCFEntry :\n\t\t\tself._developLine(line)\n\t\treturn self.lines[line]\n\n\tdef __len__(self) :\n\t\t\"\"\"returns the number of entries\"\"\"\n\t\treturn len(self.lines)\n\t\n\tdef __repr__(self) 
:\n\t\treturn \"<VCFFile: %s>\" % (os.path.basename(self.filename))\n\t\n\tdef __str__(self) :\n\t\tif self.stream :\n\t\t\treturn \"<VCFFile: %s, gziped: %s, stream: %s, len: undef>\" % (os.path.basename(self.filename), self.gziped, self.stream)\n\t\telse :\n\t\t\treturn \"<VCFFile: %s, gziped: %s, stream: %s, len: %d>\" % (os.path.basename(self.filename), self.gziped, self.stream, len(self))\n\t\t\nif __name__ == '__main__' :\n\timport time\n\tfrom pyGeno.tools.ProgressBar import ProgressBar\n\t\n\t#v = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/00-All.vcf.gz', gziped = True, stream = True)\n\tv = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/test/00-All-test.vcf.gz', gziped = True, stream = True)\n\tstartTime = time.time()\n\ti = 0\n\tpBar = ProgressBar()\n\tfor f in v :\n\t\t#print f\n\t\tpBar.update('%s' % i)\n\t\tif i > 1000000 :\n\t\t\tbreak\n\t\ti += 1\n\tpBar.close()\n\t\n\t#v = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/00-All.vcf', gziped = False, stream = True)\n\tv = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/test/00-All-test.vcf', gziped = False, stream = False)\n\tstartTime = time.time()\n\ti = 0\n\tpBar = ProgressBar()\n\tfor f in v :\n\t\tpBar.update('%s' % i)\n\t\tif i > 1000000 :\n\t\t\tbreak\n\t\ti += 1\n\t\t#print f\n\tpBar.close()\n\t#print v.lines\n\n"
  },
  {
    "path": "pyGeno/tools/parsers/__init__.py",
    "content": "#Parsers for different types of files\n__all__ = ['CSVTools', 'FastaTools', 'FastqTools', 'GTFTools', 'VCFTools', 'CasavaTools']\n"
  },
  {
    "path": "setup.py",
    "content": "from setuptools import setup, find_packages\nfrom codecs import open\nfrom os import path\n\nhere = path.abspath(path.dirname(__file__))\n\n# Get the long description from the relevant file\nwith open(path.join(here, 'DESCRIPTION.rst'), encoding='utf-8') as f:\n    long_description = f.read()\n\nsetup(\n    name='pyGeno',\n\n    version='1.3.2',\n\n    description='A python package for Personalized Genomics and Proteomics',\n    long_description=long_description,\n    \n    url='http://pyGeno.iric.ca',\n\n    author='Tariq Daouda',\n    author_email='tariq.daouda@umontreal.ca',\n\n    test_suite=\"pyGeno.tests\",\n\n    license='ApacheV2.0',\n\n    # See https://pypi.python.org/pypi?%3Aaction=list_classifiers\n    classifiers=[\n        # How mature is this project? Common values are\n        #   3 - Alpha\n        #   4 - Beta\n        #   5 - Production/Stable\n        'Development Status :: 5 - Production/Stable',\n\n        'Intended Audience :: Science/Research',\n        'Intended Audience :: Healthcare Industry',\n        'Topic :: Scientific/Engineering :: Bio-Informatics',\n        'Topic :: Software Development :: Libraries',\n        'Topic :: Software Development :: Libraries :: Application Frameworks',\n\n        'License :: OSI Approved :: Apache Software License',\n\n        'Programming Language :: Python :: 2.7',\n    ],\n\n    keywords='proteogenomics genomics proteomics annotations medicine research personalized gene sequence protein',\n\n    packages=find_packages(exclude=[]),\n\n    # List run-time dependencies here.  These will be installed by pip when your\n    # project is installed. For an analysis of \"install_requires\" vs pip's\n    # requirements files see:\n    # https://packaging.python.org/en/latest/technical.html#install-requires-vs-requirements-files\n    install_requires=['rabaDB >= 1.0.5'],\n\n    # If there are data files included in your packages that need to be\n    # installed, specify them here.  
If using Python 2.6 or less, then these\n    # have to be included in MANIFEST.in as well.\n    package_data={\n        '': ['*.txt', '*.rst', '*.tar.gz'],\n    },\n\n    # Although 'package_data' is the preferred approach, in some case you may\n    # need to place data files outside of your packages.\n    # see http://docs.python.org/3.4/distutils/setupscript.html#installing-additional-files\n    # In this case, 'data_file' will be installed into '<sys.prefix>/my_data'\n    #~ data_files=[('my_data', ['data/data_file'])],\n\n    # To provide executable scripts, use entry points in preference to the\n    # \"scripts\" keyword. Entry points provide cross-platform support and allow\n    # pip to create the appropriate form of executable for the target platform.\n    entry_points={\n        'console_scripts': [\n            'sample=sample:main',\n        ],\n    },\n)\n"
  }
]