Repository: tariqdaouda/pyGeno
Branch: bloody
Commit: 6311c9cd9444
Files: 66
Total size: 243.4 KB

Directory structure:
gitextract_0q2fbl_h/
├── .gitignore
├── .travis.yml
├── CHANGELOG.rst
├── DESCRIPTION.rst
├── LICENSE
├── MANIFEST.in
├── README.rst
├── pyGeno/
│   ├── Chromosome.py
│   ├── Exon.py
│   ├── Gene.py
│   ├── Genome.py
│   ├── Protein.py
│   ├── SNP.py
│   ├── SNPFiltering.py
│   ├── Transcript.py
│   ├── __init__.py
│   ├── bootstrap.py
│   ├── bootstrap_data/
│   │   ├── SNPs/
│   │   │   ├── Human_agnostic.dummySRY_indels/
│   │   │   │   ├── manifest.ini
│   │   │   │   └── snps.txt
│   │   │   └── __init__.py
│   │   ├── __init__.py
│   │   └── genomes/
│   │       └── __init__.py
│   ├── configuration.py
│   ├── doc/
│   │   ├── Makefile
│   │   ├── make.bat
│   │   └── source/
│   │       ├── bootstraping.rst
│   │       ├── citing.rst
│   │       ├── conf.py
│   │       ├── datawraps.rst
│   │       ├── importation.rst
│   │       ├── index.rst
│   │       ├── installation.rst
│   │       ├── objects.rst
│   │       ├── parsers.rst
│   │       ├── publications.rst
│   │       ├── querying.rst
│   │       ├── quickstart.rst
│   │       ├── snp_filter.rst
│   │       └── tools.rst
│   ├── examples/
│   │   ├── __init__.py
│   │   ├── bootstraping.py
│   │   └── snps_queries.py
│   ├── importation/
│   │   ├── Genomes.py
│   │   ├── SNPs.py
│   │   └── __init__.py
│   ├── pyGenoObjectBases.py
│   ├── tests/
│   │   ├── __init__.py
│   │   ├── test_csv.py
│   │   └── test_genome.py
│   └── tools/
│       ├── BinarySequence.py
│       ├── ProgressBar.py
│       ├── SecureMmap.py
│       ├── SegmentTree.py
│       ├── SingletonManager.py
│       ├── Stats.py
│       ├── UsefulFunctions.py
│       ├── __init__.py
│       ├── io.py
│       └── parsers/
│           ├── CSVTools.py
│           ├── CasavaTools.py
│           ├── FastaTools.py
│           ├── FastqTools.py
│           ├── GTFTools.py
│           ├── VCFTools.py
│           └── __init__.py
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/

### PyCharm ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm

## Directory-based project format
.idea/
# if you remove the above rule, at least ignore user-specific stuff:
# .idea/workspace.xml
# .idea/tasks.xml
# and these sensitive or high-churn files:
# .idea/dataSources.ids
# .idea/dataSources.xml
# .idea/sqlDataSources.xml
# .idea/dynamic.xml

## File-based project format
*.ipr
*.iws
*.iml

## Additional for IntelliJ
out/

# generated by mpeltonen/sbt-idea plugin
.idea_modules/

================================================
FILE: .travis.yml
================================================
sudo: false

notifications:
  email: false

language: python
python:
  - "2.7"

before_install:
  - wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
  - bash miniconda.sh -b -p $HOME/miniconda
  - export PATH="$HOME/miniconda/bin:$PATH"
  - conda update --yes conda

install:
  - conda install --yes python=$TRAVIS_PYTHON_VERSION pip numpy scipy
  - pip install coverage
  - pip install https://github.com/tariqdaouda/rabaDB/archive/stable.zip
  - python setup.py install

script: coverage run -m unittest discover pyGeno/tests/

after_success: bash <(curl -s https://codecov.io/bash)

================================================
FILE: CHANGELOG.rst
================================================
1.3.2
=====

* Search now uses KMD by default instead of dichotomic search (massive speed gain).
  Many thanks to @Keija for the implementation.
  Go to https://github.com/tariqdaouda/pyGeno/pull/34 for details and benchmarks.

1.3.1
=====

* CSVFile: fixed bug when slice start was None
* CSVFile: Better support for string separator
* AGN SNPs Quality cast to float by importer
* Travis integration
* Minor CSV parser updates

1.3.0
=====

* CSVFile will now ignore empty lines and comments
* Added synonymousCodonsFrequencies

1.2.9
=====

* It is no longer mandatory to set the whole legend of a CSV file at initialization. It can figure it out by itself
* Datawraps can now be uncompressed folders
* Explicit error message if there's no file named manifest.ini in the datawrap

1.2.8
=====

* Fixed BUG that prevented proper initialization and importation

1.2.5
=====

* BUG FIX: Opening a lot of chromosomes caused mmap to die screaming
* Removed core indexes. Sqlite sometimes chose them instead of user defined positional indexes, resulting in slow queries
* Doc updates

1.2.3
=====

* Added functions to retrieve the names of imported snps sets and genomes
* Added remote datawraps to the bootstrap module that can be downloaded from pyGeno's website or any other location
* Added a field uniqueId to AgnosticSNPs
* Changed all latin datawrap names to english
* Removed datawrap for dbSNP GRCh37

1.2.2
=====

* Updated pypi package to include bootstrap datawraps

1.2.1
=====

* Fixed tests

1.2.0
=====

* BUG FIX: get()/iterGet() now works for SNPs and Indels
* BUG FIX: Default SNP filter used to return unwanted Nones for insertions
* BUG FIX: Added cast of lines to str in VCF and CasavaSNP parsers. Sometimes unicode characters made the translation fail
* BUG FIX: Corrected a typo that caused find in Transcript to recursively die
* Added a new AgnosticSNP type of SNPs that can be easily made from the results of any snp caller, to make up for the loss of support for casava by illumina.
  See SNP.AgnosticSNP for documentation
* pyGeno now comes with the murine reference genome GRCm38.78
* pyGeno now comes with the human reference genome GRCh38.78, GRCh37.75 is still shipped with pyGeno but might be removed in the upcoming versions
* pyGeno now comes with a datawrap for common dbSNPs human SNPs (SNP_dbSNP142_human_common_all.tar.gz)
* Added a dummy AgnosticSNP datawrap example for bootstrapping
* Changed the interface of the bootstrap module
* CSV Parser now has the ability to stream directly to a file

1.1.7
=====

* BUG FIX: looping through CSV lines now works
* Added tests for CSV

1.1.6
=====

* BUG FIX: find in BinarySequence could not find some subsequences at the tail of sequence

1.1.5
=====

* BUG FIX in default SNP filter
* Updated description

1.1.4
=====

* Another BUG FIX in progress bar

1.1.3
=====

* Small BUG FIX in the progress bar that caused epochs to be misrepresented
* 'Specie' has been changed to 'species' everywhere. That breaks the database; the only way to fix it is to redo all importations

1.1.2
=====

* Genome import is now much more memory efficient
* BUG FIX: find in BinarySequence could not find subsequences at the tail of sequence
* Added a built-in datawrap with only chr1 and y
* Readme update with more infos about importation and link to doc

1.1.1
=====

Much better SNP filtering interface
-----------------------------------

Easier and much more coherent:

* SNP filtering now has its own module
* SNP Filters are now objects
* SNP Filters must return SequenceSNP, SNPInsert, SNPDeletion or None objects

1.0.0
=====

Freshly hatched

================================================
FILE: DESCRIPTION.rst
================================================
pyGeno: A python package for Precision Medicine, Personalized Genomics and Proteomics
=====================================================================================

Short description:
------------------

pyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_).

.. _Tariq Daouda: http://www.tariqdaouda.com
.. _IRIC: http://www.iric.ca

With pyGeno you can do this:

.. code:: python

    from pyGeno.Genome import *

    #load a genome
    ref = Genome(name = 'GRCh37.75')
    #load a gene
    gene = ref.get(Gene, name = 'TPST2')[0]
    #print the sequences of all the isoforms
    for prot in gene.get(Protein) :
        print prot.sequence

You can also do it for the **specific genomes** of your subjects:

.. code:: python

    pers = Genome(name = 'GRCh37.75', SNPs = ["RNA_S1"], SNPFilter = myFilter())

And much more: https://github.com/tariqdaouda/pyGeno

Verbose Description
-------------------

pyGeno is a personal bioinformatic database that runs directly in Python, on your laptop, and does not depend upon any REST API.
pyGeno is here to make extracting data such as gene sequences a breeze, and is designed to be able to cope with huge queries.
The most exciting feature of pyGeno is that it lets you work seamlessly with both reference and **Personalized Genomes**.

Personalized Genomes are custom genomes that you create by combining a reference genome, sets of polymorphisms and an optional filter.
pyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get direct access to the DNA and Protein sequences of your patients.

Multiple sets of polymorphisms can also be combined together to leverage their independent benefits, for example:

* RNA-seq and DNA-seq for the same individual to improve the coverage
* RNA-seq of an individual + dbSNP for validation
* Combine the results of RNA-seq of several individuals to create a genome only containing the common polymorphisms

pyGeno is also a personal database that gives you access to all the information provided by Ensembl (for both Reference and Personalized Genomes) without the need for queries to distant HTTP APIs. This allows for much faster and more reliable genome-wide study pipelines.

It also comes with parsers for several file types and various other useful tools.
Full Documentation ------------------ The full documentation is available here_ .. _here: http://pygeno.iric.ca/ If you like pyGeno, please let me know. For the latest news, you can follow me on twitter `@tariqdaouda`_. .. _@tariqdaouda: https://www.twitter.com/tariqdaouda ================================================ FILE: LICENSE ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). 
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the 
Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. 
Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. 
To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "{}" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright {yyyy} {name of copyright owner}

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================
FILE: MANIFEST.in
================================================
include *.rst
include LICENSE

================================================
FILE: README.rst
================================================
CODE FREEZE:
============

PyGeno has long been limited due to its backend. We are now ready to take it to the next level.

We are working on a major port of pyGeno to the open-source multi-model database ArangoDB. PyGeno's code on both branches master and bloody is frozen until we are finished. No pull requests will be merged until then, and we won't implement any new features.

pyGeno: A Python package for precision medicine and proteogenomics
==================================================================

.. image:: http://depsy.org/api/package/pypi/pyGeno/badge.svg
   :alt: depsy
   :target: http://depsy.org/package/python/pyGeno

.. image:: https://pepy.tech/badge/pygeno
   :alt: downloads
   :target: https://pepy.tech/project/pygeno

.. image:: https://pepy.tech/badge/pygeno/month
   :alt: downloads_month
   :target: https://pepy.tech/project/pygeno/month

.. image:: https://pepy.tech/badge/pygeno/week
   :alt: downloads_week
   :target: https://pepy.tech/project/pygeno/week

.. image:: http://bioinfo.iric.ca/~daoudat/pyGeno/_static/logo.png
   :alt: pyGeno's logo

pyGeno is (to our knowledge) the only tool available that will gladly build your specific genomes for you.

pyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_). Its logo is the work of the freelance designer `Sawssan Kaddoura`_.

For the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.

.. _Tariq Daouda: http://wwww.tariqdaouda.com
.. _IRIC: http://www.iric.ca
.. _Sawssan Kaddoura: http://sawssankaddoura.com

Click here for the `full documentation`_.

.. _full documentation: http://pygeno.iric.ca/

.. _@tariqdaouda: https://www.twitter.com/tariqdaouda

Citing pyGeno:
--------------

Please cite this paper_.

.. _paper: http://f1000research.com/articles/5-381/v1

Installation:
-------------

It is recommended to install pyGeno within a `virtual environment`_. To set one up you can use:

.. code:: shell

    virtualenv ~/.pyGenoEnv
    source ~/.pyGenoEnv/bin/activate

pyGeno can be installed through pip:

.. code:: shell

    pip install pyGeno #for the latest stable version

Or from GitHub, for the latest developments:

.. code:: shell

    git clone https://github.com/tariqdaouda/pyGeno.git
    cd pyGeno
    python setup.py develop

.. _`virtual environment`: http://virtualenv.readthedocs.org/

A brief introduction
--------------------

pyGeno is a personal bioinformatic database that runs directly in Python, on your laptop, and does not depend upon any REST API.
pyGeno is here to make extracting data such as gene sequences a breeze, and is designed to be able to cope with huge queries.
The most exciting feature of pyGeno is that it lets you work seamlessly with both reference and **Personalized Genomes**.

Personalized Genomes are custom genomes that you create by combining a reference genome, sets of polymorphisms and an optional filter.
pyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get direct access to the DNA and Protein sequences of your patients.

.. code:: python

    from pyGeno.Genome import *

    g = Genome(name = "GRCh37.75")
    prot = g.get(Protein, id = 'ENSP00000438917')[0]
    #print the protein sequence
    print prot.sequence
    #print the protein's gene biotype
    print prot.gene.biotype
    #print protein's transcript sequence
    print prot.transcript.sequence

    #fancy queries (x1 and x2 are positions that you define)
    for exon in g.get(Exon, {"CDS_start >": x1, "CDS_end <=" : x2, "chromosome.number" : "22"}) :
        #print the exon's coding sequence
        print exon.CDS
        #print the exon's transcript sequence
        print exon.transcript.sequence

    #You can do the same for your subject specific genomes
    #by combining a reference genome with polymorphisms
    g = Genome(name = "GRCh37.75", SNPs = ["STY21_RNA"], SNPFilter = MyFilter())

And if you ever get lost, there's an online **help()** function for each object type:

.. code:: python

    from pyGeno.Genome import *

    print Exon.help()

Should output:

.. code::

    Available fields for Exon: CDS_start, end, chromosome, CDS_length, frame, number, CDS_end, start, genome, length, protein, gene, transcript, id, strand

Creating a Personalized Genome:
-------------------------------

Personalized Genomes are a powerful feature that allows you to work on the specific genomes and proteomes of your patients. You can even mix several SNP sets together.

.. code:: python

    from pyGeno.Genome import Genome
    #the name of the snp set is defined inside the datawrap's manifest.ini file
    dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')
    #you can also define a filter (ex: a quality filter) for the SNPs
    dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())
    #and even mix several snp sets
    dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())

Filtering SNPs:
---------------

pyGeno allows you to select the polymorphisms that end up in the final sequences. It supports SNPs, Inserts and Deletions.

.. code:: python

    from pyGeno.SNPFiltering import SNPFilter, SequenceSNP

    class QMax_gt_filter(SNPFilter) :

        def __init__(self, threshold) :
            self.threshold = threshold

        #Here SNPs is a dictionary: SNPSet Name => polymorphism
        #This filter ignores deletions and insertions
        #but applies all SNPs
        def filter(self, chromosome, **SNPs) :
            sources = {}
            alleles = []
            for snpSet, snp in SNPs.iteritems() :
                pos = snp.start
                if snp.alt[0] == '-' :
                    pass
                elif snp.ref[0] == '-' :
                    pass
                else :
                    sources[snpSet] = snp
                    alleles.append(snp.alt) #if not an indel append the polymorphism

            #appends the reference allele to the lot
            refAllele = chromosome.refSequence[pos]
            alleles.append(refAllele)
            sources['ref'] = refAllele

            #optional: we keep a record of the polymorphisms that were used during the process
            return SequenceSNP(alleles, sources = sources)

The filter function can also be made more specific by using arguments that have the same names as the SNPSets:

.. code:: python

    def filter(self, chromosome, dummySRY = None) :
        if dummySRY.Qmax_gt > self.threshold :
            #other possible return types are SequenceInsert() and SequenceDel()
            return SequenceSNP(dummySRY.alt)
        return None #None means keep the reference allele

To apply the filter, simply specify it while loading the genome.

.. code:: python

    persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))

To include several SNPSets use a list.

.. code:: python

    persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = ['ARN_P1', 'ARN_P2'], SNPFilter = myFilter())

Getting an arbitrary sequence:
------------------------------

You can ask for any sequence of any chromosome:

.. code:: python

    chr12 = myGenome.get(Chromosome, number = "12")[0]
    print chr12.sequence[x1:x2]
    # for the reference sequence
    print chr12.refSequence[x1:x2]

Batteries included (bootstraping):
----------------------------------

pyGeno's database is populated by importing datawraps.
pyGeno comes with a few datawraps. To get the list you can use:

.. code:: python

    import pyGeno.bootstrap as B
    B.printDatawraps()

.. code::

    Available datawraps for boostraping

    SNPs
    ~~~~|
        |~~~:> Human_agnostic.dummySRY.tar.gz
        |~~~:> Human.dummySRY_casava.tar.gz
        |~~~:> dbSNP142_human_common_all.tar.gz


    Genomes
    ~~~~~~~|
           |~~~:> Human.GRCh37.75.tar.gz
           |~~~:> Human.GRCh37.75_Y-Only.tar.gz
           |~~~:> Human.GRCh38.78.tar.gz
           |~~~:> Mouse.GRCm38.78.tar.gz

To get a list of remote datawraps that pyGeno can download for you, do:

.. code:: python

    B.printRemoteDatawraps()

Importing whole genomes is a demanding process that takes more than an hour and requires (according to tests) at least 3GB of memory. Depending on your configuration, more might be required.

That being said, importing a datawrap is a one-time operation, and once the importation is complete the datawrap can be discarded without consequences.

The bootstrap module also has some handy functions for importing built-in packages.

Some of them are just for playing around with pyGeno (**Fast importation** and **Small memory requirements**):

.. code:: python

    import pyGeno.bootstrap as B

    #Imports only the Y chromosome from the human reference genome GRCh37.75
    #Very fast, requires even less memory. No download required.
    B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")

    #A dummy datawrap for humans SNPs and Indels in pyGeno's AgnosticSNP format.
    # This one has one SNP at the beginning of the gene SRY
    B.importSNPs("Human.dummySRY_casava.tar.gz")

And for more **Serious Work**, the whole reference genome.

.. code:: python

    #Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.
    B.importGenome("Human.GRCh38.78.tar.gz")

Importing a custom datawrap:
----------------------------

.. code:: python

    from pyGeno.importation.Genomes import *
    importGenome('GRCh37.75.tar.gz')

To import a patient's specific polymorphisms:

.. code:: python

    from pyGeno.importation.SNPs import *
    importSNPs('patient1.tar.gz')

For a list of datawraps available for download, please have a look here_.

You can easily make your own datawraps with any tar.gz compressor.
For more details on how datawraps are made you can check the wiki_ or have a look inside the folder bootstrap_data.

.. _here: http://pygeno.iric.ca/datawraps.html
.. _wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-friendly-package-to-import-your-data%3F

Instantiating a genome:
-----------------------

.. code:: python

    from pyGeno.Genome import Genome
    #the name of the genome is defined inside the package's manifest.ini file
    ref = Genome(name = 'GRCh37.75')

Printing all the proteins of a gene:
------------------------------------

.. code:: python

    from pyGeno.Genome import Genome
    from pyGeno.Gene import Gene
    from pyGeno.Protein import Protein

Or simply:

.. code:: python

    from pyGeno.Genome import *

then:

.. code:: python

    ref = Genome(name = 'GRCh37.75')
    #get returns a list of elements
    gene = ref.get(Gene, name = 'TPST2')[0]
    for prot in gene.get(Protein) :
        print prot.sequence

Making queries, get() vs iterGet():
-----------------------------------

iterGet is a faster version of get that returns an iterator instead of a list.
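
The difference between a list and an iterator matters most when a query matches many objects and you only need to walk through them once. Below is a minimal, self-contained sketch of that contract. ``ToyGenome`` and its ``nbLoaded`` counter are hypothetical stand-ins written for this illustration; they are not part of pyGeno and say nothing about how pyGeno implements either method.

```python
# Toy stand-in, NOT pyGeno code: it only mimics the list-vs-iterator contract.
class ToyGenome(object):

    def __init__(self, geneNames):
        self.geneNames = geneNames
        self.nbLoaded = 0  # counts how many records were actually built

    def _load(self, name):
        # pretend that building a record is the expensive step
        self.nbLoaded += 1
        return name

    def get(self, prefix):
        # like get(): every match is built before anything is returned
        return [self._load(n) for n in self.geneNames if n.startswith(prefix)]

    def iterGet(self, prefix):
        # like iterGet(): matches are built one at a time, on demand
        for n in self.geneNames:
            if n.startswith(prefix):
                yield self._load(n)


names = ['HLA-A', 'HLA-B', 'SRY', 'HLA-C']

eager = ToyGenome(names)
allHLA = eager.get('HLA')             # builds every match at once
eagerLoaded = eager.nbLoaded

lazy = ToyGenome(names)
firstHLA = next(lazy.iterGet('HLA'))  # builds only the first match
lazyLoaded = lazy.nbLoaded
```

With pyGeno itself, and assuming iterGet accepts the same arguments as get, the second form would be written as a loop such as ``for gene in ref.iterGet(Gene, {"name like" : "HLA"}) :``.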
Making queries, syntax: ---------------------- pyGeno's get function uses the expressivity of rabaDB. These are all possible query formats: .. code:: python ref.get(Gene, name = "SRY") ref.get(Gene, { "name like" : "HLA"}) chr12.get(Exon, { "start >=" : 12000, "end <" : 12300 }) ref.get(Transcript, { "gene.name" : 'SRY' }) Creating indexes to speed up queries: ------------------------------------ .. code:: python from pyGeno.Gene import Gene #creating an index on gene names if it does not already exist Gene.ensureGlobalIndex('name') #removing the index Gene.dropIndex('name') Find in sequences: ------------------ Internally pyGeno uses a binary representation for nucleotides and amino acids to deal with polymorphisms. For example,both "AGC" and "ATG" will match the following sequence "...AT/GCCG...". .. code:: python #returns the position of the first occurence transcript.find("AT/GCCG") #returns the positions of all occurences transcript.findAll("AT/GCCG") #similarly, you can also do transcript.findIncDNA("AT/GCCG") transcript.findAllIncDNA("AT/GCCG") transcript.findInUTR3("AT/GCCG") transcript.findAllInUTR3("AT/GCCG") transcript.findInUTR5("AT/GCCG") transcript.findAllInUTR5("AT/GCCG") #same for proteins protein.find("DEV/RDEM") protein.findAll("DEV/RDEM") #and for exons exon.find("AT/GCCG") exon.findAll("AT/GCCG") exon.findInCDS("AT/GCCG") exon.findAllInCDS("AT/GCCG") #... Progress Bar: ------------- .. 
code:: python from pyGeno.tools.ProgressBar import ProgressBar pg = ProgressBar(nbEpochs = 155) for i in range(155) : pg.update(label = '%d' %i) # or simply pg.update() pg.close() ================================================ FILE: pyGeno/Chromosome.py ================================================ #import copy #import types #from tools import UsefulFunctions as uf from types import * import configuration as conf from pyGenoObjectBases import * from SNP import * import SNPFiltering as SF from rabaDB.filters import RabaQuery import rabaDB.fields as rf from tools.SecureMmap import SecureMmap as SecureMmap from tools import SingletonManager import pyGeno.configuration as conf class ChrosomeSequence(object) : """Represents a chromosome sequence. If 'refOnly' is True, no polymorphisms are applied and the reference sequence is always returned""" def __init__(self, data, chromosome, refOnly = False) : self.data = data self.refOnly = refOnly self.chromosome = chromosome self.setSNPFilter(self.chromosome.genome.SNPFilter) def setSNPFilter(self, SNPFilter) : self.SNPFilter = SNPFilter def getSequenceData(self, slic) : data = self.data[slic] SNPTypes = self.chromosome.genome.SNPTypes if SNPTypes is None or self.refOnly : return data iterators = [] for setName, SNPType in SNPTypes.iteritems() : f = RabaQuery(str(SNPType), namespace = self.chromosome._raba_namespace) chromosomeNumber = self.chromosome.number if chromosomeNumber == 'MT': chromosomeNumber = 'M' f.addFilter({'start >=' : slic.start, 'start <' : slic.stop, 'setName' : str(setName), 'chromosomeNumber' : chromosomeNumber}) # conf.db.enableDebug(True) iterators.append(f.iterRun(sqlTail = 'ORDER BY start')) if len(iterators) < 1 : return data polys = {} for iterator in iterators : for poly in iterator : if poly.start not in polys : polys[poly.start] = {poly.setName : poly} else : try : polys[poly.start][poly.setName].append(poly) except : polys[poly.start][poly.setName] = [polys[poly.start][poly.setName]] 
polys[poly.start][poly.setName].append(poly) data = list(data) for start, setPolys in polys.iteritems() : seqPos = start - slic.start sequenceModifier = self.SNPFilter.filter(self.chromosome, **setPolys) # print sequenceModifier.alleles if sequenceModifier is not None : if sequenceModifier.__class__ is SF.SequenceDel : seqPos = seqPos + sequenceModifier.offset #Blank out the deleted positions instead of removing them, so the sequence keeps its length and later positions stay aligned data[seqPos:(seqPos + sequenceModifier.length)] = [''] * sequenceModifier.length elif sequenceModifier.__class__ is SF.SequenceSNP : data[seqPos] = sequenceModifier.alleles elif sequenceModifier.__class__ is SF.SequenceInsert : seqPos = seqPos + sequenceModifier.offset data[seqPos] = "%s%s" % (data[seqPos], sequenceModifier.bases) else : raise TypeError("sequenceModifier on chromosome: %s starting at: %s is of unknown type: %s" % (self.chromosome.number, start, sequenceModifier.__class__)) return data def _getSequence(self, slic) : return ''.join(self.getSequenceData(slice(0, None, 1)))[slic] def __getitem__(self, i) : return self._getSequence(i) def __len__(self) : return self.chromosome.length class Chromosome_Raba(pyGenoRabaObject) : """The wrapped Raba object that really holds the data""" _raba_namespace = conf.pyGeno_RABA_NAMESPACE header = rf.Primitive() number = rf.Primitive() start = rf.Primitive() end = rf.Primitive() length = rf.Primitive() genome = rf.RabaObject('Genome_Raba') def _curate(self) : if self.end != None and self.start != None : self.length = self.end-self.start if self.number != None : self.number = str(self.number).upper() class Chromosome(pyGenoRabaObjectWrapper) : """The wrapper for playing with Chromosomes""" _wrapped_class = Chromosome_Raba def __init__(self, *args, **kwargs) : pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs) path = '%s/chromosome%s.dat'%(self.genome.getSequencePath(), self.number) if not SingletonManager.contains(path) : datMap = SingletonManager.add(SecureMmap(path), 
path) else : datMap = SingletonManager.get(path) self.sequence = ChrosomeSequence(datMap, self) self.refSequence = ChrosomeSequence(datMap, self, refOnly = True) self.loadSequences = False def getSequenceData(self, slic) : return self.sequence.getSequenceData(slic) def _makeLoadQuery(self, objectType, *args, **coolArgs) : if issubclass(objectType, SNP_INDEL) : f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace) coolArgs['species'] = self.genome.species coolArgs['chromosomeNumber'] = self.number if len(args) > 0 and type(args[0]) is types.ListType : for a in args[0] : if type(a) is types.DictType : f.addFilter(**a) else : f.addFilter(*args, **coolArgs) return f return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs) def __getitem__(self, i) : return self.sequence[i] def __str__(self) : return "Chromosome: number %s > %s" %(self.wrapped_object.number, str(self.wrapped_object.genome)) ================================================ FILE: pyGeno/Exon.py ================================================ from pyGenoObjectBases import * from SNP import SNP_INDEL import rabaDB.fields as rf from tools import UsefulFunctions as uf from tools.BinarySequence import NucBinarySequence class Exon_Raba(pyGenoRabaObject) : """The wrapped Raba object that really holds the data""" _raba_namespace = conf.pyGeno_RABA_NAMESPACE id = rf.Primitive() number = rf.Primitive() start = rf.Primitive() end = rf.Primitive() length = rf.Primitive() CDS_length = rf.Primitive() CDS_start = rf.Primitive() CDS_end = rf.Primitive() frame = rf.Primitive() strand = rf.Primitive() genome = rf.RabaObject('Genome_Raba') chromosome = rf.RabaObject('Chromosome_Raba') gene = rf.RabaObject('Gene_Raba') transcript = rf.RabaObject('Transcript_Raba') protein = rf.RabaObject('Protein_Raba') def _curate(self) : if self.start != None and self.end != None : if self.start > self.end : self.start, self.end = self.end, self.start self.length = self.end-self.start if 
self.CDS_start != None and self.CDS_end != None : if self.CDS_start > self.CDS_end : self.CDS_start, self.CDS_end = self.CDS_end, self.CDS_start self.CDS_length = self.CDS_end - self.CDS_start if self.number != None : self.number = int(self.number) if not self.frame or self.frame == '.' : self.frame = None else : self.frame = int(self.frame) class Exon(pyGenoRabaObjectWrapper) : """The wrapper for playing with Exons""" _wrapped_class = Exon_Raba def __init__(self, *args, **kwargs) : pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs) self._load_sequencesTriggers = set(["UTR5", "UTR3", "CDS", "sequence", "data"]) def _makeLoadQuery(self, objectType, *args, **coolArgs) : if issubclass(objectType, SNP_INDEL) : f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace) coolArgs['species'] = self.genome.species coolArgs['chromosomeNumber'] = self.chromosome.number coolArgs['start >='] = self.start coolArgs['start <'] = self.end if len(args) > 0 and type(args[0]) is types.ListType : for a in args[0] : if type(a) is types.DictType : f.addFilter(**a) else : f.addFilter(*args, **coolArgs) return f return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs) def _load_data(self) : data = self.chromosome.getSequenceData(slice(self.start,self.end)) diffLen = (self.end-self.start) - len(data) if self.strand == '+' : self.data = data else : self.data = uf.reverseComplementTab(data) if self.hasCDS() : start = self.CDS_start-self.start end = self.CDS_end-self.start if self.strand == '+' : self.UTR5 = self.data[:start] self.CDS = self.data[start:end+diffLen] self.UTR3 = self.data[end+diffLen:] else : self.UTR5 = self.data[:len(self.data)-(end-diffLen)] self.CDS = self.data[len(self.data)-(end-diffLen):len(self.data)-start] self.UTR3 = self.data[len(self.data)-start:] else : self.UTR5 = '' self.CDS = '' self.UTR3 = '' self.sequence = ''.join(self.data) def _load_bin_sequence(self) : self.bin_sequence = NucBinarySequence(self.sequence) 
self.bin_UTR5 = NucBinarySequence(self.UTR5) self.bin_CDS = NucBinarySequence(self.CDS) self.bin_UTR3 = NucBinarySequence(self.UTR3) def hasCDS(self) : """Returns True or False depending on whether the exon has a CDS""" if self.CDS_start != None and self.CDS_end != None: return True return False def getCDSLength(self) : """Returns the length of the CDS sequence""" return len(self.CDS) def find(self, sequence) : """Returns the position of the first occurrence of sequence""" return self.bin_sequence.find(sequence) def findAll(self, sequence): """Returns a list of all positions where sequence was found""" return self.bin_sequence.findAll(sequence) def findInCDS(self, sequence) : """Returns the position of the first occurrence of sequence in the CDS""" return self.bin_CDS.find(sequence) def findAllInCDS(self, sequence): """Returns a list of all positions where sequence was found in the CDS""" return self.bin_CDS.findAll(sequence) def pluck(self) : """Returns a plucked object. Plucks the exon off the tree, set the value of self.transcript into str(self.transcript). 
This effectively disconnects the object and makes it much more lighter in case you'd like to pickle it""" e = copy.copy(self) e.transcript = str(self.transcript) return e def nextExon(self) : """Returns the next exon of the transcript, or None if there is none""" try : return self.transcript.exons[self.number+1] except IndexError : return None def previousExon(self) : """Returns the previous exon of the transcript, or None if there is none""" if self.number == 0 : return None try : return self.transcript.exons[self.number-1] except IndexError : return None def __str__(self) : return """EXON, id %s, number: %s, (start, end): (%s, %s), cds: (%s, %s) > %s""" %( self.id, self.number, self.start, self.end, self.CDS_start, self.CDS_end, str(self.transcript)) def __len__(self) : return len(self.sequence) ================================================ FILE: pyGeno/Gene.py ================================================ import configuration as conf from pyGenoObjectBases import * from SNP import SNP_INDEL import rabaDB.fields as rf class Gene_Raba(pyGenoRabaObject) : """The wrapped Raba object that really holds the data""" _raba_namespace = conf.pyGeno_RABA_NAMESPACE id = rf.Primitive() name = rf.Primitive() strand = rf.Primitive() biotype = rf.Primitive() start = rf.Primitive() end = rf.Primitive() genome = rf.RabaObject('Genome_Raba') chromosome = rf.RabaObject('Chromosome_Raba') def _curate(self) : self.name = self.name.upper() class Gene(pyGenoRabaObjectWrapper) : """The wrapper for playing with genes""" _wrapped_class = Gene_Raba def __init__(self, *args, **kwargs) : pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs) def _makeLoadQuery(self, objectType, *args, **coolArgs) : if issubclass(objectType, SNP_INDEL) : f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace) coolArgs['species'] = self.genome.species coolArgs['chromosomeNumber'] = self.chromosome.number coolArgs['start >='] = self.start coolArgs['start <'] = self.end if len(args) > 0 
and type(args[0]) is types.ListType : for a in args[0] : if type(a) is types.DictType : f.addFilter(**a) else : f.addFilter(*args, **coolArgs) return f return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs) def __str__(self) : return "Gene, name: %s, id: %s, strand: '%s' > %s" %(self.name, self.id, self.strand, str(self.chromosome)) ================================================ FILE: pyGeno/Genome.py ================================================ import types import configuration as conf import pyGeno.tools.UsefulFunctions as uf from pyGenoObjectBases import * from Chromosome import Chromosome from Gene import Gene from Transcript import Transcript from Protein import Protein from Exon import Exon import SNPFiltering as SF from SNP import * import rabaDB.fields as rf def getGenomeList() : """Return the names of all imported genomes""" import rabaDB.filters as rfilt f = rfilt.RabaQuery(Genome_Raba) names = [] for g in f.iterRun() : names.append(g.name) return names class Genome_Raba(pyGenoRabaObject) : """The wrapped Raba object that really holds the data""" _raba_namespace = conf.pyGeno_RABA_NAMESPACE #_raba_not_a_singleton = True #you can have several instances of the same genome but they all share the same location in the database name = rf.Primitive() species = rf.Primitive() source = rf.Primitive() packageInfos = rf.Primitive() def _curate(self) : self.species = self.species.lower() def getSequencePath(self) : return conf.getGenomeSequencePath(self.species, self.name) def getReferenceSequencePath(self) : return conf.getReferenceGenomeSequencePath(self.species) def __len__(self) : """Size of the genome in pb""" l = 0 for c in self.chromosomes : l += len(c) return l class Genome(pyGenoRabaObjectWrapper) : """ This is the entry point to pyGeno:: myGeno = Genome(name = 'GRCh37.75', SNPs = ['RNA_S1', 'DNA_S1'], SNPFilter = MyFilter) for prot in myGeno.get(Protein) : print prot.sequence """ _wrapped_class = Genome_Raba def 
__init__(self, SNPs = None, SNPFilter = None, *args, **kwargs) : pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs) if type(SNPs) is types.StringType : self.SNPsSets = [SNPs] else : self.SNPsSets = SNPs # print "pifpasdf", self.SNPsSets if SNPFilter is None : self.SNPFilter = SF.DefaultSNPFilter() else : if issubclass(SNPFilter.__class__, SF.SNPFilter) : self.SNPFilter = SNPFilter else : raise ValueError("The value of 'SNPFilter' is not an object deriving from a subclass of SNPFiltering.SNPFilter. Got: '%s'" % SNPFilter) self.SNPTypes = {} if SNPs is not None : f = RabaQuery(SNPMaster, namespace = self._raba_namespace) for se in self.SNPsSets : f.addFilter(setName = se, species = self.species) res = f.run() if res is None or len(res) < 1 : raise ValueError("There's no set of SNPs that goes by the name of %s for species %s" % (SNPs, self.species)) for s in res : # print s.setName, s.SNPType self.SNPTypes[s.setName] = s.SNPType def _makeLoadQuery(self, objectType, *args, **coolArgs) : if issubclass(objectType, SNP_INDEL) : # conf.db.enableDebug(True) f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace) coolArgs['species'] = self.species if len(args) > 0 and type(args[0]) is types.ListType : for a in args[0] : if type(a) is types.DictType : f.addFilter(**a) else : f.addFilter(*args, **coolArgs) return f return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs) def __str__(self) : return "Genome: %s/%s SNPs: %s" %(self.species, self.name, self.SNPTypes) ================================================ FILE: pyGeno/Protein.py ================================================ import configuration as conf from pyGenoObjectBases import * from SNP import SNP_INDEL import rabaDB.fields as rf from tools import UsefulFunctions as uf from tools.BinarySequence import AABinarySequence import copy class Protein_Raba(pyGenoRabaObject) : """The wrapped Raba object that really holds the data""" _raba_namespace = 
conf.pyGeno_RABA_NAMESPACE id = rf.Primitive() name = rf.Primitive() genome = rf.RabaObject('Genome_Raba') chromosome = rf.RabaObject('Chromosome_Raba') gene = rf.RabaObject('Gene_Raba') transcript = rf.RabaObject('Transcript_Raba') def _curate(self) : if self.name != None : self.name = self.name.upper() class Protein(pyGenoRabaObjectWrapper) : """The wrapper for playing with Proteins""" _wrapped_class = Protein_Raba def __init__(self, *args, **kwargs) : pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs) self._load_sequencesTriggers = set(["sequence"]) def _makeLoadQuery(self, objectType, *args, **coolArgs) : if issubclass(objectType, SNP_INDEL) : f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace) coolArgs['species'] = self.genome.species coolArgs['chromosomeNumber'] = self.chromosome.number coolArgs['start >='] = self.transcript.start coolArgs['start <'] = self.transcript.end if len(args) > 0 and type(args[0]) is types.ListType : for a in args[0] : if type(a) is types.DictType : f.addFilter(**a) else : f.addFilter(*args, **coolArgs) return f return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs) def _load_sequences(self) : if self.chromosome.number != 'MT': self.sequence = uf.translateDNA(self.transcript.cDNA).rstrip('*') else: self.sequence = uf.translateDNA(self.transcript.cDNA, translTable_id='mt').rstrip('*') def getSequence(self): return self.sequence def _load_bin_sequence(self) : self.bin_sequence = AABinarySequence(self.sequence) def getDefaultSequence(self) : """Returns a version str sequence where only the last allele of each polymorphisms is shown""" return self.bin_sequence.defaultSequence def getPolymorphisms(self) : """Returns a list of all polymorphisms contained in the protein""" return self.bin_sequence.getPolymorphisms() def find(self, sequence): """Returns the position of the first occurence of sequence taking polymorphisms into account""" return self.bin_sequence.find(sequence) def 
findAll(self, sequence): """Returns all the positions of the occurrences of sequence, taking polymorphisms into account""" return self.bin_sequence.findAll(sequence) def findString(self, sequence) : """Returns the first occurrence of sequence using a simple string search that ignores polymorphisms""" return self.sequence.find(sequence) def findStringAll(self, sequence): """Returns all occurrences of sequence using a simple string search that ignores polymorphisms""" return uf.findAll(self.sequence, sequence) def __getitem__(self, i) : return self.bin_sequence.getChar(i) def __len__(self) : return len(self.bin_sequence) def __str__(self) : return "Protein, id: %s > %s" %(self.id, str(self.transcript)) ================================================ FILE: pyGeno/SNP.py ================================================ import types import configuration as conf from pyGenoObjectBases import * import rabaDB.fields as rf # from tools import UsefulFunctions as uf """General guidelines for SNP classes: ---- -All classes must inherit from SNP_INDEL -All class names must end with SNP -A SNP_INDEL must have at least chromosomeNumber, setName, species, start and ref filled in, in order to be inserted into sequences -Users can define an alias for the alt field (snp_indel alleles) to indicate the default field from which to extract alleles """ def getSNPSetsList() : """Return the names of all imported snp sets""" import rabaDB.filters as rfilt f = rfilt.RabaQuery(SNPMaster) names = [] for g in f.iterRun() : names.append(g.setName) return names class SNPMaster(Raba) : 'This object keeps track of SNP sets and their types' _raba_namespace = conf.pyGeno_RABA_NAMESPACE species = rf.Primitive() SNPType = rf.Primitive() setName = rf.Primitive() _raba_uniques = [('setName',)] def _curate(self) : self.species = self.species.lower() self.setName = self.setName.lower() class SNP_INDEL(Raba) : "All SNPs should inherit from me. 
The name of the class must end with SNP" _raba_namespace = conf.pyGeno_RABA_NAMESPACE _raba_abstract = True # not saved in db species = rf.Primitive() setName = rf.Primitive() chromosomeNumber = rf.Primitive() start = rf.Primitive() end = rf.Primitive() ref = rf.Primitive() #every SNP_INDEL must have a field alt. This variable allows you to set an alias for it. Changing the alias does not impact the schema altAlias = 'alt' def __init__(self, *args, **kwargs) : pass def __getattribute__(self, k) : if k == 'alt' : cls = Raba.__getattribute__(self, '__class__') return Raba.__getattribute__(self, cls.altAlias) return Raba.__getattribute__(self, k) def __setattr__(self, k, v) : if k == 'alt' : cls = Raba.__getattribute__(self, '__class__') Raba.__setattr__(self, cls.altAlias, v) Raba.__setattr__(self, k, v) def _curate(self) : self.species = self.species.lower() @classmethod def ensureGlobalIndex(cls, fields) : cls.ensureIndex(fields) def __repr__(self) : return "%s> chr: %s, start: %s, end: %s, alt: %s, ref: %s" %(self.__class__.__name__, self.chromosomeNumber, self.start, self.end, self.alt, self.ref) class CasavaSNP(SNP_INDEL) : "A SNP of Casava" _raba_namespace = conf.pyGeno_RABA_NAMESPACE alleles = rf.Primitive() bcalls_used = rf.Primitive() bcalls_filt = rf.Primitive() QSNP = rf.Primitive() Qmax_gt = rf.Primitive() max_gt_poly_site = rf.Primitive() Qmax_gt_poly_site = rf.Primitive() A_used = rf.Primitive() C_used = rf.Primitive() G_used = rf.Primitive() T_used = rf.Primitive() altAlias = 'alleles' class AgnosticSNP(SNP_INDEL) : """This is a generic SNPs/Indels format that you can easily make from the result of any SNP caller. AgnosticSNP files are tab delimited files such as: chromosomeNumber uniqueId start end ref alleles quality caller Y 1 2655643 2655644 T AG 30 TopHat Y 2 2655645 2655647 - AG 28 TopHat Y 3 2655648 2655650 TT - 10 TopHat All positions must be 0 based. The '-' indicates a deletion or an insertion. Column order does not matter. 
""" _raba_namespace = conf.pyGeno_RABA_NAMESPACE alleles = rf.Primitive() quality = rf.Primitive() caller = rf.Primitive() uniqueId = rf.Primitive() # polymorphism id altAlias = 'alleles' def __repr__(self) : return "AgnosticSNP> start: %s, end: %s, quality: %s, caller %s, alt: %s, ref: %s" %(self.start, self.end, self.quality, self.caller, self.alleles, self.ref) class dbSNPSNP(SNP_INDEL) : "This class is for SNPs from dbSNP. Feel free to uncomment the fields that you need" _raba_namespace = conf.pyGeno_RABA_NAMESPACE # To add/remove a field comment/uncomentd it. Beware, adding or removing a field results in a significant overhead the first time you relaunch pyGeno. You may have to delete and reimport some snps sets. #####RSPOS = rf.Primitive() #Chr position reported in already saved into field start RS = rf.Primitive() #dbSNP ID (i.e. rs number) ALT = rf.Primitive() RV = rf.Primitive() #RS orientation is reversed #VP = rf.Primitive() #Variation Property. Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf #GENEINFO = rf.Primitive() #Pairs each of gene symbol:gene id. 
The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|) dbSNPBuildID = rf.Primitive() #First dbSNP Build for RS #SAO = rf.Primitive() #Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both #SSR = rf.Primitive() #Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other #WGT = rf.Primitive() #Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 or more VC = rf.Primitive() #Variation Class PM = rf.Primitive() #Variant is Precious(Clinical,Pubmed Cited) #TPA = rf.Primitive() #Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will give phenotype data) #PMC = rf.Primitive() #Links exist to PubMed Central article #S3D = rf.Primitive() #Has 3D structure - SNP3D table #SLO = rf.Primitive() #Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out #NSF = rf.Primitive() #Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44 #NSM = rf.Primitive() #Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42 #NSN = rf.Primitive() #Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41 #REF = rf.Primitive() #Has reference A coding region variation where one allele in the set is identical to the reference sequence. FxnCode = 8 #SYN = rf.Primitive() #Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3 #U3 = rf.Primitive() #In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53 #U5 = rf.Primitive() #In 5' UTR Location is in an untranslated region (UTR). 
FxnCode = 55 #ASS = rf.Primitive() #In acceptor splice site FxnCode = 73 #DSS = rf.Primitive() #In donor splice-site FxnCode = 75 #INT = rf.Primitive() #In Intron FxnCode = 6 #R3 = rf.Primitive() #In 3' gene region FxnCode = 13 #R5 = rf.Primitive() #In 5' gene region FxnCode = 15 #OTH = rf.Primitive() #Has other variant with exactly the same set of mapped positions on NCBI refernce assembly. #CFL = rf.Primitive() #Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies. #ASP = rf.Primitive() #Is Assembly specific. This is set if the variant only maps to one assembly MUT = rf.Primitive() #Is mutation (journal citation, explicit fact): a low frequency variation that is cited in journal and other reputable sources VLD = rf.Primitive() #Is Validated. This bit is set if the variant has 2+ minor allele count based on frequency or genotype data. G5A = rf.Primitive() #>5% minor allele frequency in each and all populations G5 = rf.Primitive() #>5% minor allele frequency in 1+ populations #HD = rf.Primitive() #Marker is on high density genotyping kit (50K density or greater). The variant may have phenotype associations present in dbGaP. #GNO = rf.Primitive() #Genotypes available. The variant has individual genotype (in SubInd table). KGValidated = rf.Primitive() #1000 Genome validated #KGPhase1 = rf.Primitive() #1000 Genome phase 1 (incl. 
June Interim phase 1) #KGPilot123 = rf.Primitive() #1000 Genome discovery all pilots 2010(1,2,3) #KGPROD = rf.Primitive() #Has 1000 Genome submission OTHERKG = rf.Primitive() #non-1000 Genome submission #PH3 = rf.Primitive() #HAP_MAP Phase 3 genotyped: filtered, non-redundant #CDA = rf.Primitive() #Variation is interrogated in a clinical diagnostic assay #LSD = rf.Primitive() #Submitted from a locus-specific database #MTP = rf.Primitive() #Microattribution/third-party annotation(TPA:GWAS,PAGE) #OM = rf.Primitive() #Has OMIM/OMIA #NOC = rf.Primitive() #Contig allele not present in variant allele list. The reference sequence allele at the mapped position is not present in the variant allele list, adjusted for orientation. #WTD = rf.Primitive() #Is Withdrawn by submitter If one member ss is withdrawn by submitter, then this bit is set. If all member ss' are withdrawn, then the rs is deleted to SNPHistory #NOV = rf.Primitive() #Rs cluster has non-overlapping allele sets. True when rs set has more than 2 alleles from different submissions and these sets share no alleles in common. #CAF = rf.Primitive() #An ordered, comma delimited list of allele frequencies based on 1000Genomes, starting with the reference allele followed by alternate alleles as ordered in the ALT column. Where a 1000Genomes alternate allele is not in the dbSNPs alternate allele set, the allele is added to the ALT column. The minor allele is the second largest value in the list, and was previuosly reported in VCF as the GMAF. This is the GMAF reported on the RefSNP and EntrezSNP pages and VariationReporter COMMON = rf.Primitive() #RS is a common SNP. A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency. #CLNHGVS = rf.Primitive() #Variant names from HGVS. The order of these variants corresponds to the order of the info in the other clinical INFO tags. 
#CLNALLE = rf.Primitive() #Variant alleles from REF or ALT columns. 0 is REF, 1 is the first ALT allele, etc. This is used to match alleles with other corresponding clinical (CLN) INFO tags. A value of -1 indicates that no allele was found to match a corresponding HGVS allele name. #CLNSRC = rf.Primitive() #Variant Clinical Chanels #CLNORIGIN = rf.Primitive() #Allele Origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other #CLNSRCID = rf.Primitive() #Variant Clinical Channel IDs #CLNSIG = rf.Primitive() #Variant Clinical Significance, 0 - unknown, 1 - untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other #CLNDSDB = rf.Primitive() #Variant disease database name #CLNDSDBID = rf.Primitive() #Variant disease database ID #CLNDBN = rf.Primitive() #Variant disease name #CLNACC = rf.Primitive() #Variant Accession and Versions #FILTER_NC = rf.Primitive() #Inconsistent Genotype Submission For At Least One Sample altAlias = 'ALT' class TopHatSNP(SNP_INDEL) : "A SNP from Top Hat, not implemented" _raba_namespace = conf.pyGeno_RABA_NAMESPACE pass ================================================ FILE: pyGeno/SNPFiltering.py ================================================ import types import configuration as conf from tools import UsefulFunctions as uf class Sequence_modifiers(object) : """Abtract Class. All sequence must inherit from me""" def __init__(self, sources = {}) : self.sources = sources def addSource(self, name, snp) : "Optional, you can keep a dict that records the polymorphims that were mixed together to make self. 
They are stored into self.sources" self.sources[name] = snp class SequenceSNP(Sequence_modifiers) : """Represents a SNP to be applied to the sequence""" def __init__(self, alleles, sources = {}) : Sequence_modifiers.__init__(self, sources) if type(alleles) is types.ListType : self.alleles = uf.encodePolymorphicNucleotide(''.join(alleles)) else : self.alleles = uf.encodePolymorphicNucleotide(alleles) class SequenceInsert(Sequence_modifiers) : """Represents an Insertion to be applied to the sequence""" def __init__(self, bases, sources = {}, ref = '-') : Sequence_modifiers.__init__(self, sources) self.bases = bases self.offset = 0 # Allows formats like C/CCTGGAA(dbSNP) or CCT/CCTGGAA(samtools) if ref != '-': if ref == bases[:len(ref)]: self.offset = len(ref) self.bases = self.bases[self.offset:] #-1 because an insertion after the last nucleotide would otherwise index past the end of the sequence self.offset -= 1 else: raise NotImplementedError("This format of Insertion is not accepted. Please change your format, or implement your format in pyGeno.") class SequenceDel(Sequence_modifiers) : """Represents a Deletion to be applied to the sequence""" def __init__(self, length, sources = {}, ref = None, alt = '-') : Sequence_modifiers.__init__(self, sources) self.length = length self.offset = 0 # Allows formats like CCTGGAA/C(dbSNP) or CCTGGAA/CCT(samtools) if alt != '-': if ref is not None: if alt == ref[:len(alt)]: self.offset = len(alt) self.length = self.length - len(alt) else: raise NotImplementedError("This format of Deletion is not accepted. Please change your format, or implement your format in pyGeno.") else: raise Exception("You need to add a ref sequence in your call of SequenceDel. Or implement your format in pyGeno.") class SNPFilter(object) : """Abstract Class. 
All filters must inherit from me""" def __init__(self) : pass def filter(self, chromosome, **kwargs) : raise NotImplementedError("Must be implemented in child") class DefaultSNPFilter(SNPFilter) : """ Default filtering object, does not filter anything. Doesn't apply insertions or deletions. This is also a template that you can use for your own filters. A prototype for a custom filter might be:: class MyFilter(SNPFilter) : def __init__(self, thres) : self.thres = thres def filter(self, chromosome, SNP_Set1 = None, SNP_Set2 = None ) : if SNP_Set1.alt is not None and (SNP_Set1.alt == SNP_Set2.alt) and SNP_Set1.Qmax_gt > self.thres : return SequenceSNP(SNP_Set1.alt) return None Where SNP_Set1 and SNP_Set2 are the actual names of the snp sets supplied to the genome. In the context of the function they represent single polymorphisms, or lists of polymorphisms, derived from those sets that occur at the same position. Whatever goes on inside the function is absolutely up to you, but in the end, it must return an object of one of the following classes: * SequenceSNP * SequenceInsert * SequenceDel * None """ def __init__(self) : SNPFilter.__init__(self) def filter(self, chromosome, **kwargs) : """The default filter applies all SNPs and ignores Insertions and Deletions.""" def appendAllele(alleles, sources, snp) : pos = snp.start if snp.alt[0] == '-' : pass # print warn % ('DELETION', snpSet, snp.start, snp.chromosomeNumber) elif snp.ref[0] == '-' : pass # print warn % ('INSERTION', snpSet, snp.start, snp.chromosomeNumber) else : sources[snpSet] = snp alleles.append(snp.alt) #if not an indel append the polymorphism refAllele = chromosome.refSequence[pos] alleles.append(refAllele) sources['ref'] = refAllele return alleles, sources warn = 'Warning: the default snp filter ignores indels. 
IGNORED %s of SNP set: %s at pos: %s of chromosome: %s' sources = {} alleles = [] for snpSet, data in kwargs.iteritems() : if type(data) is list : for snp in data : alleles, sources = appendAllele(alleles, sources, snp) else : alleles, sources = appendAllele(alleles, sources, data) #appends the reference allele to the lot #optionally, we keep a record of the polymorphisms that were used during the process return SequenceSNP(alleles, sources = sources) ================================================ FILE: pyGeno/Transcript.py ================================================ import configuration as conf from pyGenoObjectBases import * import rabaDB.fields as rf from tools import UsefulFunctions as uf from Exon import * from SNP import SNP_INDEL from tools.BinarySequence import NucBinarySequence class Transcript_Raba(pyGenoRabaObject) : """The wrapped Raba object that really holds the data""" _raba_namespace = conf.pyGeno_RABA_NAMESPACE id = rf.Primitive() name = rf.Primitive() length = rf.Primitive() start = rf.Primitive() end = rf.Primitive() coding = rf.Primitive() biotype = rf.Primitive() selenocysteine = rf.RList() genome = rf.RabaObject('Genome_Raba') chromosome = rf.RabaObject('Chromosome_Raba') gene = rf.RabaObject('Gene_Raba') protein = rf.RabaObject('Protein_Raba') exons = rf.Relation('Exon_Raba') def _curate(self) : if self.name is not None : self.name = self.name.upper() self.length = abs(self.end - self.start) have_CDS_start = False have_CDS_end = False for exon in self.exons : if exon.CDS_start is not None : have_CDS_start = True if exon.CDS_end is not None : have_CDS_end = True if have_CDS_start and have_CDS_end : self.coding = True else : self.coding = False class Transcript(pyGenoRabaObjectWrapper) : """The wrapper for playing with Transcripts""" _wrapped_class = Transcript_Raba def __init__(self, *args, **kwargs) : pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs) self.exons = RLWrapper(self, Exon, self.wrapped_object.exons) self._load_sequencesTriggers
= set(["UTR5", "UTR3", "cDNA", "sequence", "data"]) self.exonsDict = {} def _makeLoadQuery(self, objectType, *args, **coolArgs) : if issubclass(objectType, SNP_INDEL) : # conf.db.enableDebug(True) f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace) coolArgs['species'] = self.genome.species coolArgs['chromosomeNumber'] = self.chromosome.number coolArgs['start >='] = self.start coolArgs['start <'] = self.end if len(args) > 0 and type(args[0]) is types.ListType : for a in args[0] : if type(a) is types.DictType : f.addFilter(**a) else : f.addFilter(*args, **coolArgs) return f return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs) def _load_data(self) : def getV(k) : return pyGenoRabaObjectWrapper.__getattribute__(self, k) def setV(k, v) : return pyGenoRabaObjectWrapper.__setattr__(self, k, v) self.data = [] cDNA = [] UTR5 = [] UTR3 = [] exons = [] prime5 = True for ee in self.wrapped_object.exons : e = pyGenoRabaObjectWrapper_metaclass._wrappers[Exon_Raba](wrapped_object_and_bag = (ee, getV('bagKey'))) self.exonsDict[(e.start, e.end)] = e exons.append(e) self.data.extend(e.data) if e.hasCDS() : UTR5.append(''.join(e.UTR5)) if self.selenocysteine is not None: for position in self.selenocysteine: if e.CDS_start <= position <= e.CDS_end: if e.strand == '+': ajusted_position = position - e.CDS_start else: ajusted_position = e.CDS_end - position - 3 if e.CDS[ajusted_position] == 'T': e.CDS = list(e.CDS) e.CDS[ajusted_position] = '!' 
if len(cDNA) == 0 and e.frame != 0 : e.CDS = e.CDS[e.frame:] if e.strand == '+': e.CDS_start += e.frame else: e.CDS_end -= e.frame if len(e.CDS): cDNA.append(''.join(e.CDS)) UTR3.append(''.join(e.UTR3)) prime5 = False else : if prime5 : UTR5.append(''.join(e.data)) else : UTR3.append(''.join(e.data)) sequence = ''.join(self.data) cDNA = ''.join(cDNA) UTR5 = ''.join(UTR5) UTR3 = ''.join(UTR3) setV('exons', exons) setV('sequence', sequence) setV('cDNA', cDNA) setV('UTR5', UTR5) setV('UTR3', UTR3) if len(cDNA) > 0 and len(cDNA) % 3 != 0 : setV('flags', {'DUBIOUS' : True, 'cDNA_LEN_NOT_MULT_3': True}) else : setV('flags', {'DUBIOUS' : False, 'cDNA_LEN_NOT_MULT_3': False}) def _load_bin_sequence(self) : self.bin_sequence = NucBinarySequence(self.sequence) self.bin_UTR5 = NucBinarySequence(self.UTR5) self.bin_cDNA = NucBinarySequence(self.cDNA) self.bin_UTR3 = NucBinarySequence(self.UTR3) def getNucleotideCodon(self, cdnaX1) : """Returns the entire codon of the nucleotide at pos cdnaX1 in the cDNA, and the position of that nucleotide in the codon""" return uf.getNucleotideCodon(self.cDNA, cdnaX1) def getCodon(self, i) : """returns the ith codon""" return self.getNucleotideCodon(i*3)[0] def iterCodons(self) : """iterates through the codons""" for i in range(len(self.cDNA)//3) : yield self.getCodon(i) def find(self, sequence) : """return the position of the first occurrence of sequence""" return self.bin_sequence.find(sequence) def findAll(self, sequence): """Returns a list of all positions where sequence was found""" return self.bin_sequence.findAll(sequence) def findIncDNA(self, sequence) : """return the position of the first occurrence of sequence in the cDNA""" return self.bin_cDNA.find(sequence) def findAllIncDNA(self, sequence) : """Returns a list of all positions where sequence was found in the cDNA""" return self.bin_cDNA.findAll(sequence) def getcDNALength(self) : """returns the length of the cDNA""" return len(self.cDNA) def findInUTR5(self, sequence) : """return the position
of the first occurrence of sequence in the 5'UTR""" return self.bin_UTR5.find(sequence) def findAllInUTR5(self, sequence) : """Returns a list of all positions where sequence was found in the 5'UTR""" return self.bin_UTR5.findAll(sequence) def getUTR5Length(self) : """returns the length of the 5'UTR""" return len(self.bin_UTR5) def findInUTR3(self, sequence) : """return the position of the first occurrence of sequence in the 3'UTR""" return self.bin_UTR3.find(sequence) def findAllInUTR3(self, sequence) : """Returns a list of all positions where sequence was found in the 3'UTR""" return self.bin_UTR3.findAll(sequence) def getUTR3Length(self) : """returns the length of the 3'UTR""" return len(self.bin_UTR3) def getNbCodons(self) : """returns the number of codons in the transcript""" return len(self.cDNA)//3 def __getattribute__(self, name) : return pyGenoRabaObjectWrapper.__getattribute__(self, name) def __getitem__(self, i) : return self.sequence[i] def __len__(self) : return len(self.sequence) def __str__(self) : return """Transcript, id: %s, name: %s > %s""" %(self.id, self.name, str(self.gene)) ================================================ FILE: pyGeno/__init__.py ================================================ __all__ = ['Genome', 'Chromosome', 'Gene', 'Transcript', 'Exon', 'Protein', 'SNP'] from configuration import pyGeno_init pyGeno_init() ================================================ FILE: pyGeno/bootstrap.py ================================================ import pyGeno.importation.Genomes as PG import pyGeno.importation.SNPs as PS from pyGeno.tools.io import printf import os, tempfile, urllib, urllib2, json import pyGeno.configuration as conf this_dir, this_filename = os.path.split(__file__) def listRemoteDatawraps(location = conf.pyGeno_REMOTE_LOCATION) : """Lists all the datawraps available from a remote location.""" loc = location + "/datawraps.json" response = urllib2.urlopen(loc) js = json.loads(response.read()) return js def
printRemoteDatawraps(location = conf.pyGeno_REMOTE_LOCATION) : """ Prints all available datawraps from a remote location. The location must have a datawraps.json in the following format:: { "Ordered": { "Reference genomes": { "Human" : ["GRCh37.75", "GRCh38.78"], "Mouse" : ["GRCm38.78"] }, "SNPs":{ } }, "Flat":{ "Reference genomes": { "GRCh37.75": "Human.GRCh37.75.tar.gz", "GRCh38.78": "Human.GRCh38.78.tar.gz", "GRCm38.78": "Mouse.GRCm38.78.tar.gz" }, "SNPs":{ } } } """ l = listRemoteDatawraps(location) printf("Available datawraps for bootstrapping\n") print json.dumps(l["Ordered"], sort_keys=True, indent=4, separators=(',', ': ')) def _DW(name, url) : packageDir = tempfile.mkdtemp(prefix = "pyGeno_remote_") printf("~~~:>\n\tDownloading datawrap: %s..." % name) finalFile = os.path.normpath('%s/%s' %(packageDir, name)) urllib.urlretrieve (url, finalFile) printf('\tdone.\n~~~:>') return finalFile def importRemoteGenome(name, batchSize = 100) : """Import a genome available from http://pygeno.iric.ca (might work).""" try : dw = listRemoteDatawraps()["Flat"]["Reference genomes"][name] except KeyError : raise KeyError("There's no remote genome datawrap by the name of: '%s'" % name) finalFile = _DW(name, dw["url"]) PG.importGenome(finalFile, batchSize) def importRemoteSNPs(name) : """Import a SNP set available from http://pygeno.iric.ca (might work).""" try : dw = listRemoteDatawraps()["Flat"]["SNPs"][name] except KeyError : raise KeyError("There's no remote SNP datawrap by the name of: '%s'" % name) finalFile = _DW(name, dw["url"]) PS.importSNPs(finalFile) def listDatawraps() : """Lists all the datawraps pyGeno comes with""" l = {"Genomes" : [], "SNPs" : []} for f in os.listdir(os.path.join(this_dir, "bootstrap_data/genomes")) : if f.find(".tar.gz") > -1 : l["Genomes"].append(f) for f in os.listdir(os.path.join(this_dir, "bootstrap_data/SNPs")) : if f.find(".tar.gz") > -1 : l["SNPs"].append(f) return l def printDatawraps() : """print all available
datawraps for bootstrapping""" l = listDatawraps() printf("Available datawraps for bootstrapping\n") for k, v in l.iteritems() : printf(k) printf("~"*len(k) + "|") for vv in v : printf(" "*len(k) + "|" + "~~~:> " + vv) printf('\n') def importGenome(name, batchSize = 100) : """Import a genome shipped with pyGeno. Most of the datawraps only contain URLs pointing to data provided by third parties.""" path = os.path.join(this_dir, "bootstrap_data", "genomes/" + name) PG.importGenome(path, batchSize) def importSNPs(name) : """Import a SNP set shipped with pyGeno. Most of the datawraps only contain URLs pointing to data provided by third parties.""" path = os.path.join(this_dir, "bootstrap_data", "SNPs/" + name) PS.importSNPs(path) ================================================ FILE: pyGeno/bootstrap_data/SNPs/Human_agnostic.dummySRY_indels/manifest.ini ================================================ [package_infos] description = For testing purposes. All polymorphisms at the same position maintainer = Tariq Daouda maintainer_contact = tariq.daouda@umontreal.ca version = 1 [set_infos] species = human name = dummySRY_AGN_indels type = Agnostic source = my place at IRIC [snps] filename = snps.txt ================================================ FILE: pyGeno/bootstrap_data/SNPs/Human_agnostic.dummySRY_indels/snps.txt ================================================ chromosomeNumber uniqueId start end ref alleles quality caller Y 1 2655643 2655646 T AG 30 test Y 2 2655643 2655647 - AG 30 test Y 3 2655643 2655650 TT - 30 test ================================================ FILE: pyGeno/bootstrap_data/SNPs/__init__.py ================================================ ================================================ FILE: pyGeno/bootstrap_data/__init__.py ================================================ ================================================ FILE: pyGeno/bootstrap_data/genomes/__init__.py ================================================
================================================ FILE: pyGeno/configuration.py ================================================ import sys, os, time from ConfigParser import SafeConfigParser import rabaDB.rabaSetup import rabaDB.Raba class PythonVersionError(Exception) : pass pyGeno_FACE = "~-~-:>" pyGeno_BRANCH = "V2" pyGeno_VERSION_NAME = 'Lean Viper!' pyGeno_VERSION_RELEASE_LEVEL = 'Release' pyGeno_VERSION_NUMBER = 14.09 pyGeno_VERSION_BUILD_TIME = time.ctime(os.path.getmtime(__file__)) pyGeno_RABA_NAMESPACE = 'pyGenoRaba' pyGeno_SETTINGS_DIR = os.path.normpath(os.path.expanduser('~/.pyGeno/')) pyGeno_SETTINGS_PATH = None pyGeno_RABA_DBFILE = None pyGeno_DATA_PATH = None pyGeno_REMOTE_LOCATION = 'http://bioinfo.iric.ca/~daoudat/pyGeno_datawraps' db = None #proxy for the raba database dbConf = None #proxy for the raba database configuration def version() : """returns a tuple describing pyGeno's current version""" return (pyGeno_FACE, pyGeno_BRANCH, pyGeno_VERSION_NAME, pyGeno_VERSION_RELEASE_LEVEL, pyGeno_VERSION_NUMBER, pyGeno_VERSION_BUILD_TIME ) def prettyVersion() : """returns pyGeno's current version in a pretty human readable way""" return "pyGeno %s Branch: %s, Name: %s, Release Level: %s, Version: %s, Build time: %s" % version() def checkPythonVersion() : """pyGeno needs python 2.7+""" if sys.version_info[0] < 2 or (sys.version_info[0] == 2 and sys.version_info[1] < 7) : return False return True def getGenomeSequencePath(specie, name) : return os.path.normpath(pyGeno_DATA_PATH+'/%s/%s' % (specie.lower(), name)) def createDefaultConfigFile() : """Creates a default configuration file""" s = "[pyGeno_config]\nsettings_dir=%s\nremote_location=%s" % (pyGeno_SETTINGS_DIR, pyGeno_REMOTE_LOCATION) f = open('%s/config.ini' % pyGeno_SETTINGS_DIR, 'w') f.write(s) f.close() def getSettingsPath() : """Returns the path where the settings are stored""" parser = SafeConfigParser() try : parser.read(os.path.normpath(pyGeno_SETTINGS_DIR+'/config.ini')) return
parser.get('pyGeno_config', 'settings_dir') except : createDefaultConfigFile() return getSettingsPath() def removeFromDBRegistery(obj) : """rabaDB keeps a record of loaded objects to ensure consistency between different queries. This function removes an object from the registery""" rabaDB.Raba.removeFromRegistery(obj) def freeDBRegistery() : """rabaDB keeps a record of loaded objects to ensure consistency between different queries. This function empties the registery""" rabaDB.Raba.freeRegistery() def reload() : """reinitialize pyGeno""" pyGeno_init() def pyGeno_init() : """This function is automatically called at launch""" global db, dbConf global pyGeno_SETTINGS_PATH global pyGeno_RABA_DBFILE global pyGeno_DATA_PATH if not checkPythonVersion() : raise PythonVersionError("==> FATAL: pyGeno only works with python 2.7 and above, please upgrade your python version") if not os.path.exists(pyGeno_SETTINGS_DIR) : os.makedirs(pyGeno_SETTINGS_DIR) pyGeno_SETTINGS_PATH = getSettingsPath() pyGeno_RABA_DBFILE = os.path.normpath( os.path.join(pyGeno_SETTINGS_PATH, "pyGenoRaba.db") ) pyGeno_DATA_PATH = os.path.normpath( os.path.join(pyGeno_SETTINGS_PATH, "data") ) if not os.path.exists(pyGeno_SETTINGS_PATH) : os.makedirs(pyGeno_SETTINGS_PATH) if not os.path.exists(pyGeno_DATA_PATH) : os.makedirs(pyGeno_DATA_PATH) #launching the db rabaDB.rabaSetup.RabaConfiguration(pyGeno_RABA_NAMESPACE, pyGeno_RABA_DBFILE) db = rabaDB.rabaSetup.RabaConnection(pyGeno_RABA_NAMESPACE) dbConf = rabaDB.rabaSetup.RabaConfiguration(pyGeno_RABA_NAMESPACE) ================================================ FILE: pyGeno/doc/Makefile ================================================ # Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = build # User-friendly check for sphinx-build ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) $(error The '$(SPHINXBUILD)' command was not found. 
Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) endif # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest coverage gettext help: @echo "Please use \`make <target>' where <target> is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " devhelp to make HTML files and a Devhelp project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " xml to make Docutils-native XML files" @echo " pseudoxml to make pseudoxml-XML files for display purposes" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the
documentation (if enabled)" @echo " coverage to run coverage check of the documentation (if enabled)" clean: rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/pyGeno.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/pyGeno.qhc" devhelp: $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp @echo @echo "Build finished." @echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/pyGeno" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/pyGeno" @echo "# devhelp" epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 
@echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." latexpdfja: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through platex and dvipdfmx..." $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." 
doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." coverage: $(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage @echo "Testing of coverage in the sources finished, look at the " \ "results in $(BUILDDIR)/coverage/python.txt." xml: $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml @echo @echo "Build finished. The XML files are in $(BUILDDIR)/xml." pseudoxml: $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml @echo @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." ================================================ FILE: pyGeno/doc/make.bat ================================================ @ECHO OFF REM Command file for Sphinx documentation if "%SPHINXBUILD%" == "" ( set SPHINXBUILD=sphinx-build ) set BUILDDIR=build set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source set I18NSPHINXOPTS=%SPHINXOPTS% source if NOT "%PAPER%" == "" ( set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% ) if "%1" == "" goto help if "%1" == "help" ( :help echo.Please use `make ^<target^>` where ^<target^> is one of echo. html to make standalone HTML files echo. dirhtml to make HTML files named index.html in directories echo. singlehtml to make a single large HTML file echo. pickle to make pickle files echo. json to make JSON files echo. htmlhelp to make HTML files and a HTML help project echo. qthelp to make HTML files and a qthelp project echo. devhelp to make HTML files and a Devhelp project echo. epub to make an epub echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter echo. text to make text files echo. man to make manual pages echo. texinfo to make Texinfo files echo. gettext to make PO message catalogs echo. changes to make an overview over all changed/added/deprecated items echo.
xml to make Docutils-native XML files echo. pseudoxml to make pseudoxml-XML files for display purposes echo. linkcheck to check all external links for integrity echo. doctest to run all doctests embedded in the documentation if enabled echo. coverage to run coverage check of the documentation if enabled goto end ) if "%1" == "clean" ( for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i del /q /s %BUILDDIR%\* goto end ) REM Check if sphinx-build is available and fallback to Python version if any %SPHINXBUILD% 2> nul if errorlevel 9009 goto sphinx_python goto sphinx_ok :sphinx_python set SPHINXBUILD=python -m sphinx.__init__ %SPHINXBUILD% 2> nul if errorlevel 9009 ( echo. echo.The 'sphinx-build' command was not found. Make sure you have Sphinx echo.installed, then set the SPHINXBUILD environment variable to point echo.to the full path of the 'sphinx-build' executable. Alternatively you echo.may add the Sphinx directory to PATH. echo. echo.If you don't have Sphinx installed, grab it from echo.http://sphinx-doc.org/ exit /b 1 ) :sphinx_ok if "%1" == "html" ( %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/html. goto end ) if "%1" == "dirhtml" ( %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. goto end ) if "%1" == "singlehtml" ( %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml. goto end ) if "%1" == "pickle" ( %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the pickle files. goto end ) if "%1" == "json" ( %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the JSON files. 
goto end ) if "%1" == "htmlhelp" ( %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run HTML Help Workshop with the ^ .hhp project file in %BUILDDIR%/htmlhelp. goto end ) if "%1" == "qthelp" ( %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run "qcollectiongenerator" with the ^ .qhcp project file in %BUILDDIR%/qthelp, like this: echo.^> qcollectiongenerator %BUILDDIR%\qthelp\pyGeno.qhcp echo.To view the help file: echo.^> assistant -collectionFile %BUILDDIR%\qthelp\pyGeno.ghc goto end ) if "%1" == "devhelp" ( %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp if errorlevel 1 exit /b 1 echo. echo.Build finished. goto end ) if "%1" == "epub" ( %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub if errorlevel 1 exit /b 1 echo. echo.Build finished. The epub file is in %BUILDDIR%/epub. goto end ) if "%1" == "latex" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex if errorlevel 1 exit /b 1 echo. echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. goto end ) if "%1" == "latexpdf" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex cd %BUILDDIR%/latex make all-pdf cd %~dp0 echo. echo.Build finished; the PDF files are in %BUILDDIR%/latex. goto end ) if "%1" == "latexpdfja" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex cd %BUILDDIR%/latex make all-pdf-ja cd %~dp0 echo. echo.Build finished; the PDF files are in %BUILDDIR%/latex. goto end ) if "%1" == "text" ( %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text if errorlevel 1 exit /b 1 echo. echo.Build finished. The text files are in %BUILDDIR%/text. goto end ) if "%1" == "man" ( %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man if errorlevel 1 exit /b 1 echo. echo.Build finished. The manual pages are in %BUILDDIR%/man. 
goto end ) if "%1" == "texinfo" ( %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo if errorlevel 1 exit /b 1 echo. echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo. goto end ) if "%1" == "gettext" ( %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale if errorlevel 1 exit /b 1 echo. echo.Build finished. The message catalogs are in %BUILDDIR%/locale. goto end ) if "%1" == "changes" ( %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes if errorlevel 1 exit /b 1 echo. echo.The overview file is in %BUILDDIR%/changes. goto end ) if "%1" == "linkcheck" ( %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck if errorlevel 1 exit /b 1 echo. echo.Link check complete; look for any errors in the above output ^ or in %BUILDDIR%/linkcheck/output.txt. goto end ) if "%1" == "doctest" ( %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest if errorlevel 1 exit /b 1 echo. echo.Testing of doctests in the sources finished, look at the ^ results in %BUILDDIR%/doctest/output.txt. goto end ) if "%1" == "coverage" ( %SPHINXBUILD% -b coverage %ALLSPHINXOPTS% %BUILDDIR%/coverage if errorlevel 1 exit /b 1 echo. echo.Testing of coverage in the sources finished, look at the ^ results in %BUILDDIR%/coverage/python.txt. goto end ) if "%1" == "xml" ( %SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml if errorlevel 1 exit /b 1 echo. echo.Build finished. The XML files are in %BUILDDIR%/xml. goto end ) if "%1" == "pseudoxml" ( %SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml if errorlevel 1 exit /b 1 echo. echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml. goto end ) :end ================================================ FILE: pyGeno/doc/source/bootstraping.rst ================================================ Bootstraping ================================== pyGeno can be quick-started by importing these built-in data wraps. .. 
automodule:: bootstrap :members: ================================================ FILE: pyGeno/doc/source/citing.rst ================================================ Citing ========= If you are using pyGeno please mention it to the rest of the universe by including a link to: https://github.com/tariqdaouda/pyGeno ================================================ FILE: pyGeno/doc/source/conf.py ================================================ # -*- coding: utf-8 -*- # # pyGeno documentation build configuration file, created by # sphinx-quickstart on Thu Nov 6 16:45:34 2014. # # This file is execfile()d with the current directory set to its # containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. import sys import os import sphinx_rtd_theme # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. sys.path.insert(0, os.path.abspath('../..')) # -- General configuration ------------------------------------------------ # If your documentation needs a minimal Sphinx version, state it here. #needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ 'sphinx.ext.autodoc', 'sphinx.ext.doctest', 'sphinx.ext.intersphinx', 'sphinx.ext.todo', 'sphinx.ext.coverage', 'sphinx.ext.mathjax', 'sphinx.ext.ifconfig', 'sphinx.ext.viewcode', ] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix of source filenames. source_suffix = '.rst' # The encoding of source files. #source_encoding = 'utf-8-sig' # The master toctree document. 
master_doc = 'index' # General information about the project. project = u'pyGeno' copyright = u'2014, Tariq Daouda' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = '1.x' # The full version, including alpha/beta/rc tags. release = '1.2.x' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. # # This is also used if you do content translation via gettext catalogs. # Usually you set "language" from the command line for these cases. language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. exclude_patterns = [] # The reST default role (used for this markup: `text`) to use for all # documents. #default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. #add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). #add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. #show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. #modindex_common_prefix = [] # If true, keep warnings as "system message" paragraphs in the built documents. #keep_warnings = False # -- Options for HTML output ---------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. 
# html_theme = "default" # html_theme_options = { # "rightsidebar":"true", # "stickysidebar" : "false", # } html_theme = "sphinx_rtd_theme" # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. #html_theme_options = {} # Add any paths that contain custom themes here, relative to this directory. #html_theme_path = [] html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. #html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. html_logo = "logo.png" # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. #html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['_static'] # Add any extra paths that contain custom files (such as robots.txt or # .htaccess) here, relative to this directory. These files are copied # directly to the root of the documentation. #html_extra_path = [] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. 
#html_additional_pages = {} # If false, no module index is generated. #html_domain_indices = True # If false, no index is generated. #html_use_index = True # If true, the index is split into individual pages for each letter. #html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. #html_show_sphinx = True # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. #html_show_copyright = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # This is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = None # Language to be used for generating the HTML full-text search index. # Sphinx supports the following languages: # 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja' # 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr' #html_search_language = 'en' # A dictionary with options for the search language support, empty by default. # Now only 'ja' uses this config value #html_search_options = {'type': 'default'} # The name of a javascript file (relative to the configuration directory) that # implements a search results scorer. If empty, the default will be used. #html_search_scorer = 'scorer.js' # Output file base name for HTML help builder. htmlhelp_basename = 'pyGenodoc' # -- Options for LaTeX output --------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). 'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). 'pointsize': '10pt', # Additional stuff for the LaTeX preamble. #'preamble': '', # Latex figure (float) alignment #'figure_align': 'htbp', } # Grouping the document tree into LaTeX files. 
List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). latex_documents = [ ('index', 'pyGeno.tex', u'pyGeno Documentation', u'Tariq Daouda', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. # latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. #latex_use_parts = False # If true, show page references after internal links. #latex_show_pagerefs = False # If true, show URL addresses after external links. #latex_show_urls = False # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. #latex_domain_indices = True # -- Options for manual page output --------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [ ('index', 'pygeno', u'pyGeno Documentation', [u'Tariq Daouda'], 1) ] # If true, show URL addresses after external links. #man_show_urls = False # -- Options for Texinfo output ------------------------------------------- # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ ('index', 'pyGeno', u'pyGeno Documentation', u'Tariq Daouda', 'pyGeno', 'One line description of project.', 'Miscellaneous'), ] # Documents to append as an appendix to all manuals. #texinfo_appendices = [] # If false, no module index is generated. #texinfo_domain_indices = True # How to display URL addresses: 'footnote', 'no', or 'inline'. #texinfo_show_urls = 'footnote' # If true, do not generate a @detailmenu in the "Top" node's menu. #texinfo_no_detailmenu = False # -- Options for Epub output ---------------------------------------------- # Bibliographic Dublin Core info. 
epub_title = u'pyGeno' epub_author = u'Tariq Daouda' epub_publisher = u'Tariq Daouda' epub_copyright = u'2014, Tariq Daouda' # The basename for the epub file. It defaults to the project name. #epub_basename = u'pyGeno' # The HTML theme for the epub output. Since the default themes are not optimized # for small screen space, using the same theme for HTML and epub output is # usually not wise. This defaults to 'epub', a theme designed to save visual # space. #epub_theme = 'epub' # The language of the text. It defaults to the language option # or 'en' if the language is not set. #epub_language = '' # The scheme of the identifier. Typical schemes are ISBN or URL. #epub_scheme = '' # The unique identifier of the text. This can be a ISBN number # or the project homepage. #epub_identifier = '' # A unique identification for the text. #epub_uid = '' # A tuple containing the cover image and cover page html template filenames. #epub_cover = () # A sequence of (type, uri, title) tuples for the guide element of content.opf. #epub_guide = () # HTML files that should be inserted before the pages created by sphinx. # The format is a list of tuples containing the path and title. #epub_pre_files = [] # HTML files shat should be inserted after the pages created by sphinx. # The format is a list of tuples containing the path and title. #epub_post_files = [] # A list of files that should not be packed into the epub file. epub_exclude_files = ['search.html'] # The depth of the table of contents in toc.ncx. #epub_tocdepth = 3 # Allow duplicate toc entries. #epub_tocdup = True # Choose between 'default' and 'includehidden'. #epub_tocscope = 'default' # Fix unsupported image types using the PIL. #epub_fix_images = False # Scale large images. #epub_max_image_width = 0 # How to display URL addresses: 'footnote', 'no', or 'inline'. #epub_show_urls = 'inline' # If false, no index is generated. #epub_use_index = True # Example configuration for intersphinx: refer to the Python standard library. 
intersphinx_mapping = {'http://docs.python.org/': None}

================================================
FILE: pyGeno/doc/source/datawraps.rst
================================================
Datawraps
=========

Datawraps are used by pyGeno to import data into its database. All reference genomes are downloaded from Ensembl, and dbSNP data from dbSNP. The :doc:`/bootstraping` module has functions to import the datawraps shipped with pyGeno.

Datawraps can either be tar.gz archives or folders.

Importation
-----------

Here's how you import a reference genome datawrap::

    from pyGeno.importation.Genomes import *
    importGenome("my_datawrap.tar.gz")

And a SNP set datawrap::

    from pyGeno.importation.SNPs import *
    importSNPs("my_datawrap.tar.gz")

Creating your own datawraps
---------------------------

For polymorphisms, create a file called **manifest.ini** with the following format::

    [package_infos]
    description = SNPs for testing purposes
    maintainer = Tariq Daouda
    maintainer_contact = tariq.daouda [at] umontreal
    version = 1

    [set_infos]
    species = human
    name = mySNPSET
    type = Agnostic # or CasavaSNP or dbSNPSNP
    source = Where do these snps come from?

    [snps]
    filename = snps.txt # or http://www.example.com/snps.txt or ftp://www.example.com/snps.txt if you choose not to include the file in the archive

Then compress the **manifest.ini** file along with the snps.txt (if you choose to include it rather than specify a URL) into a tar.gz archive.

Natively pyGeno supports dbSNP and casava (snp.txt), but it also has its own polymorphism file format (AgnosticSNP), which is simply a tab-delimited file in the following format::

    chromosomeNumber	uniqueId	start	end	ref	alleles	quality	caller
    Y	1	2655643	2655644	T	AG	30	TopHat
    Y	2	2655645	2655647	-	AG	28	TopHat
    Y	3	2655648	2655650	TT	-	10	TopHat

Even though all fields are mandatory, the only ones that are critical for pyGeno to be able to insert polymorphisms at the right places are: *chromosomeNumber* and *start*.
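As a rough illustration of the AgnosticSNP layout described above, the following sketch writes a small file with Python's standard `csv` module. The helper name `write_agnostic_snps` is hypothetical and not part of pyGeno; the column names and rows are copied from the example above, and the code assumes a Python 3 interpreter.

```python
import csv
import io

# Column names taken from the AgnosticSNP example above.
HEADER = ["chromosomeNumber", "uniqueId", "start", "end",
          "ref", "alleles", "quality", "caller"]


def write_agnostic_snps(rows, handle):
    """Hypothetical helper (not part of pyGeno): write rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        # Every field is mandatory, so reject rows of the wrong width.
        if len(row) != len(HEADER):
            raise ValueError("expected %d fields, got %d" % (len(HEADER), len(row)))
        writer.writerow(row)


rows = [
    ("Y", 1, 2655643, 2655644, "T", "AG", 30, "TopHat"),
    ("Y", 2, 2655645, 2655647, "-", "AG", 28, "TopHat"),
    ("Y", 3, 2655648, 2655650, "TT", "-", 10, "TopHat"),
]

# An in-memory buffer keeps the sketch self-contained; a real datawrap
# would write to a file such as snps.txt instead.
buffer = io.StringIO()
write_agnostic_snps(rows, buffer)
snps_text = buffer.getvalue()
```

Passing an open file handle instead of the in-memory buffer would produce a file suitable for inclusion in a datawrap.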
All the other fields are non-critical and can follow any convention you wish to apply to them, including the *end* field. You can choose the convention that best suits the queries you are planning to make on SNPs through .get(), or the way you intend to filter them using filtering objects.

For genomes, the manifest.ini file looks like this::

    [package_infos]
    description = Test package. This package installs only chromosome Y of mus musculus
    maintainer = Tariq Daouda
    maintainer_contact = tariq.daouda [at] umontreal
    version = GRCm38.73

    [genome]
    species = Mus_musculus
    name = GRCm38_test
    source = http://useast.ensembl.org/info/data/ftp/index.html

    [chromosome_files]
    Y = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz # or an url such as ftp://... or http://

    [gene_set]
    gtf = Mus_musculus.GRCm38.73_Y-only.gtf.gz # or an url such as ftp://... or http://

File URLs for reference genomes can be found on `Ensembl: http://useast.ensembl.org/info/data/ftp/index.html`_

To learn more about datawraps and how to make your own you can have a look at :doc:`/importation`, and the Wiki_.

.. _Wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-datawrap-to-import-your-data
.. _`Ensembl: http://useast.ensembl.org/info/data/ftp/index.html`: http://useast.ensembl.org/info/data/ftp/index.html

================================================
FILE: pyGeno/doc/source/importation.rst
================================================
Importation
===========

pyGeno's database is populated by importing tar.gz compressed archives called *datawraps*. An importation is a one-time step, and once a datawrap has been imported it can be discarded with no consequences to the database.
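Since a datawrap is just a tar.gz archive holding a *manifest.ini* and, optionally, the data files themselves, one can be assembled with the Python standard library alone. The sketch below is only an illustration: `build_snp_datawrap` is a hypothetical helper that is not part of pyGeno, the manifest values are placeholders, placing both files at the root of the archive is an assumption, and the code targets Python 3.

```python
import configparser
import os
import tarfile
import tempfile


def build_snp_datawrap(snps_path, archive_path, set_name,
                       species="human", snp_type="Agnostic"):
    """Hypothetical helper (not part of pyGeno): pack a manifest.ini and a SNP file."""
    manifest = configparser.ConfigParser()
    manifest["package_infos"] = {
        "description": "SNPs for testing purposes",
        "maintainer": "Your Name",                     # placeholder
        "maintainer_contact": "you [at] example.com",  # placeholder
        "version": "1",
    }
    manifest["set_infos"] = {
        "species": species,
        "name": set_name,
        "type": snp_type,
        "source": "Where do these snps come from?",
    }
    manifest["snps"] = {"filename": os.path.basename(snps_path)}

    with tempfile.TemporaryDirectory(prefix="datawrap_") as work_dir:
        manifest_path = os.path.join(work_dir, "manifest.ini")
        with open(manifest_path, "w") as handle:
            manifest.write(handle)
        # Assumption: both files are stored at the root of the archive.
        with tarfile.open(archive_path, "w:gz") as archive:
            archive.add(manifest_path, arcname="manifest.ini")
            archive.add(snps_path, arcname=os.path.basename(snps_path))
    return archive_path


# Usage: create a one-SNP file, pack it, and list the archive members.
demo_dir = tempfile.mkdtemp(prefix="datawrap_demo_")
snps_file = os.path.join(demo_dir, "snps.txt")
with open(snps_file, "w") as handle:
    handle.write("chromosomeNumber\tuniqueId\tstart\tend\tref\talleles\tquality\tcaller\n")
    handle.write("Y\t1\t2655643\t2655644\tT\tAG\t30\tTopHat\n")

archive_file = build_snp_datawrap(
    snps_file, os.path.join(demo_dir, "my_snp_datawrap.tar.gz"), "mySNPSET")
with tarfile.open(archive_file) as archive:
    member_names = sorted(archive.getnames())
```

The resulting archive is the kind of file that would be handed to `importSNPs`.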
Here's how you import a reference genome datawrap::

    from pyGeno.importation.Genomes import *
    importGenome("my_genome_datawrap.tar.gz")

And a SNP set datawrap::

    from pyGeno.importation.SNPs import *
    importSNPs("my_snp_datawrap.tar.gz")

pyGeno comes with a few datawraps that you can quickly import using the :doc:`/bootstraping` module.

You can find a list of datawraps to import here: :doc:`/datawraps`

You can also easily create your own by simply putting a bunch of URLs into a *manifest.ini* file and compressing it into a *tar.gz archive* (as explained below or on the Wiki_).

.. _Wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-datawrap-to-import-your-data

Genomes
-------

.. automodule:: importation.Genomes
   :members:

Polymorphisms (SNPs and Indels)
-------------------------------

.. automodule:: importation.SNPs
   :members:

================================================
FILE: pyGeno/doc/source/index.rst
================================================
.. pyGeno documentation master file, created by
   sphinx-quickstart on Thu Nov 6 16:45:34 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. image:: http://bioinfo.iric.ca/~daoudat/pyGeno/_static/logo.png
   :alt: pyGeno's logo

pyGeno: A Python package for precision medicine and proteogenomics
===================================================================

.. image:: http://depsy.org/api/package/pypi/pyGeno/badge.svg
   :alt: depsy
   :target: http://depsy.org/package/python/pyGeno

.. image:: https://img.shields.io/badge/License-Apache%202.0-blue.svg
   :target: https://opensource.org/licenses/Apache-2.0

.. image:: https://img.shields.io/badge/python-2.7-blue.svg

pyGeno's `lair is on Github`_.

.. _lair is on Github: http://www.github.com/tariqdaouda/pyGeno

Citing pyGeno:
--------------

Please cite this paper_.

..
_paper: http://f1000research.com/articles/5-381/v1

A Quick Intro:
-----------------

Even though more and more research focuses on Personalized/Precision Medicine, treatments that are specifically tailored to the patient, pyGeno is (to our knowledge) the only tool available that will gladly build your specific genomes for you and give you easy access to them.

pyGeno allows you to create and work on **Personalized Genomes**: custom genomes that you create by combining a reference genome, sets of polymorphisms and an optional filter. pyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get direct access to the DNA and Protein sequences of your patients/subjects. To know more about how to create a Personalized Genome, have a look at the :doc:`/quickstart` section.

pyGeno can also function as a personal bioinformatic database for Ensembl that runs directly in Python, on your laptop, making it faster and more reliable than any REST API. pyGeno makes extracting data such as gene sequences a breeze, and is designed to be able to cope with huge queries.

..
code::

    from pyGeno.Genome import *

    g = Genome(name = "GRCh37.75")
    prot = g.get(Protein, id = 'ENSP00000438917')[0]
    #print the protein sequence
    print prot.sequence
    #print the protein's gene biotype
    print prot.gene.biotype
    #print protein's transcript sequence
    print prot.transcript.sequence

    #fancy queries (x1 and x2 stand for coordinates of your choice)
    for exon in g.get(Exon, {"CDS_start >": x1, "CDS_end <=" : x2, "chromosome.number" : "22"}) :
        #print the exon's coding sequence
        print exon.CDS
        #print the exon's transcript sequence
        print exon.transcript.sequence

    #You can do the same for your subject specific genomes
    #by combining a reference genome with polymorphisms
    g = Genome(name = "GRCh37.75", SNPs = ["STY21_RNA"], SNPFilter = MyFilter())

Verbose Introduction
---------------------

pyGeno integrates:

* **Reference sequences** and annotations from **Ensembl**
* Genomic polymorphisms from the **dbSNP** database
* SNPs from **next-gen** sequencing

pyGeno is a python package that was designed to be:

* Fast to install. It has no dependencies but its own backend: `rabaDB`_.
* Fast to run and memory efficient, so you can use it on your laptop.
* Fast to use. No queries to foreign APIs: all the data rests on your computer, so it is readily accessible when you need it.
* Fast to learn. One single function **get()** can do the job of several other tools at once.

It also comes with:

* Parsers for: FASTA, FASTQ, GTF, VCF, CSV.
* Useful tools for translation etc...
* Optimised genome indexation with *Segment Trees*.
* A funky *Progress Bar*.

One of the coolest things about pyGeno is that it also allows you to quickly create **personalized genomes**: genomes that you design yourself by combining a reference genome and SNP sets derived from dbSNP or next-gen sequencing.

pyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_), its logo is the work of the freelance designer `Sawssan Kaddoura`_. For the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.

..
_rabaDB: https://github.com/tariqdaouda/rabaDB
.. _Tariq Daouda: http://www.tariqdaouda.com
.. _IRIC: http://www.iric.ca
.. _Sawssan Kaddoura: http://www.sawssankaddoura.com
.. _@tariqdaouda: https://www.twitter.com/tariqdaouda

Contents:
----------

.. toctree::
   :maxdepth: 2

   publications
   quickstart
   installation
   bootstraping
   querying
   importation
   datawraps
   objects
   snp_filter
   tools
   parsers

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

================================================
FILE: pyGeno/doc/source/installation.rst
================================================
Installation
=============

Unix (MacOS, Linux)
-------------------

The latest stable version is available from pypi::

    pip install pyGeno

**Upgrade**::

    pip install pyGeno --upgrade

If you're more adventurous, the bleeding edge version is available from github (look for the 'bloody' branch)::

    git clone https://github.com/tariqdaouda/pyGeno.git
    cd pyGeno
    python setup.py develop

**Upgrade**::

    git pull

Windows
-------

* Go to: https://www.python.org/downloads/ and download the installer for the latest version of python 2.7
* Double click on the installer to begin installation
* Click on the windows start menu
* Type *"cmd"* and click on it to launch the command line interface
* In the command line interface type::

    cd C:\Python27\Scripts

* Now type::

    pip install pyGeno

* Now click on the windows start menu. In the python 2.7 menu you can either launch *Python (Command line)* or *IDLE (Python GUI)*
* You can now go to: http://pygeno.iric.ca/quickstart.html and type the commands into either one of them

**UPGRADE:** to upgrade pyGeno to the latest version, launch *cmd* and type::

    cd C:\Python27\Scripts

followed by::

    pip install pyGeno --upgrade

================================================
FILE: pyGeno/doc/source/objects.rst
================================================
Objects
=======

With pyGeno you can manipulate familiar objects in an intuitive way.
All the following classes except SNP inherit from pyGenoObjectWrapper and therefore have access to functions such as get(), count(), ensureIndex()...

Base classes
-------------

Base classes are abstract and are not meant to be instantiated; they nonetheless implement most of the functions that classes such as Genome possess.

.. automodule:: pyGenoObjectBases
   :members:

Genome
-------

.. automodule:: Genome
   :members:

Chromosome
----------

.. automodule:: Chromosome
   :members:

Gene
----

.. automodule:: Gene
   :members:

Transcript
----------

.. automodule:: Transcript
   :members:

Exon
----

.. automodule:: Exon
   :members:

Protein
-------

.. automodule:: Protein
   :members:

SNP
---

.. automodule:: SNP
   :members:

================================================
FILE: pyGeno/doc/source/parsers.rst
================================================
Parsers
=======

PyGeno comes with a set of parsers that you can use independently.

CSV
---

To read and write CSV files.

.. automodule:: tools.parsers.CSVTools
   :members:

FASTA
-----

To read and write FASTA files.

.. automodule:: tools.parsers.FastaTools
   :members:

FASTQ
-----

To read and write FASTQ files.

.. automodule:: tools.parsers.FastqTools
   :members:

GTF
---

To read GTF files.

.. automodule:: tools.parsers.GTFTools
   :members:

VCF
---

To read VCF files.

.. automodule:: tools.parsers.VCFTools
   :members:

Casava
-------

To read casava files.

.. automodule:: tools.parsers.CasavaTools
   :members:

================================================
FILE: pyGeno/doc/source/publications.rst
================================================
Publications
============

Please cite this one:
---------------------

`pyGeno: A Python package for precision medicine and proteogenomics. F1000Research, 2016`_

.. _`pyGeno: A Python package for precision medicine and proteogenomics.
F1000Research, 2016`: http://f1000research.com/articles/5-381/v2

pyGeno was also used in the following studies:
----------------------------------------------

`MHC class I–associated peptides derive from selective regions of the human genome. J Clin Invest. 2016`_

.. _`MHC class I–associated peptides derive from selective regions of the human genome. J Clin Invest. 2016`: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5127664/

`Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames. Nat. Comm. 2015`_

.. _Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames. Nat. Comm. 2015: http://dx.doi.org/10.1038/ncomms10238

`Impact of genomic polymorphisms on the repertoire of human MHC class I-associated peptides. Nat. Comm. 2014`_

.. _Impact of genomic polymorphisms on the repertoire of human MHC class I-associated peptides. Nat. Comm. 2014: http://www.ncbi.nlm.nih.gov/pubmed/24714562

`MHC I-associated peptides preferentially derive from transcripts bearing miRNA response elements. Blood. 2012`_

.. _MHC I-associated peptides preferentially derive from transcripts bearing miRNA response elements. Blood. 2012: http://www.ncbi.nlm.nih.gov/pubmed/22438248

================================================
FILE: pyGeno/doc/source/querying.rst
================================================
Querying
=========

pyGeno is a personal database that you can query in many ways. Special emphasis has been placed upon ease of use, and you only need to remember two functions::

    * get()
    * help()

**get()** can be called from any pyGeno object to get any objects.

**help()** is your best friend when you get lost using **get()**. When called, it will give the list of all fields that you can use in get queries.
You can call it either on the class::

    Gene.help()

Or on the object::

    ref = Genome(name = "GRCh37.75")
    g = ref.get(Gene, name = "TPST2")[0]
    g.help()

Both will print::

    'Available fields for Gene: end, name, chromosome, start, biotype, id, strand, genome'

================================================
FILE: pyGeno/doc/source/quickstart.rst
================================================
Quickstart
==========

Quick importation
-----------------

pyGeno's database is populated by importing datawraps. pyGeno comes with a few datawraps; to get the list you can use:

.. code:: python

    import pyGeno.bootstrap as B
    B.printDatawraps()

.. code::

    Available datawraps for bootstraping

    SNPs
    ~~~~|
        |~~~:> Human_agnostic.dummySRY.tar.gz
        |~~~:> Human.dummySRY_casava.tar.gz
        |~~~:> dbSNP142_human_GRCh37_common_all.tar.gz
        |~~~:> dbSNP142_human_common_all.tar.gz

    Genomes
    ~~~~~~~|
           |~~~:> Human.GRCh37.75.tar.gz
           |~~~:> Human.GRCh37.75_Y-Only.tar.gz
           |~~~:> Human.GRCh38.78.tar.gz
           |~~~:> Mouse.GRCm38.78.tar.gz

To get a list of remote datawraps that pyGeno can download for you, do:

.. code:: python

    B.printRemoteDatawraps()

Importing whole genomes is a demanding process that takes more than an hour and requires (according to tests) at least 3GB of memory. Depending on your configuration, more might be required.

That being said, importing a datawrap is a one-time operation, and once the importation is complete the datawrap can be discarded without consequences.

The bootstrap module also has some handy functions for importing built-in packages.

Some of them are just for playing around with pyGeno (**Fast importation** and **Small memory requirements**):

.. code:: python

    import pyGeno.bootstrap as B

    #Imports only the Y chromosome from the human reference genome GRCh37.75
    #Very fast, requires even less memory. No download required.
    B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")

    #A dummy datawrap for humans SNPs and Indels in pyGeno's AgnosticSNP format.
    # This one has one SNP at the beginning of the gene SRY
    B.importSNPs("Human.dummySRY_casava.tar.gz")

And for more serious work, the whole reference genome.

.. code:: python

    #Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.
    B.importGenome("Human.GRCh38.78.tar.gz")

That's it, you can now print the sequences of all the proteins that a gene can produce::

    from pyGeno.Genome import Genome
    from pyGeno.Gene import Gene
    from pyGeno.Protein import Protein

    #the name of the genome is defined inside the package's manifest.ini file
    ref = Genome(name = 'GRCh37.75')
    #get returns a list of elements
    gene = ref.get(Gene, name = 'SRY')[0]
    for prot in gene.get(Protein) :
        print prot.sequence

You can see pyGeno's architecture as a graph where everything is connected to everything. For instance you can do things such as::

    gene = aProt.gene
    trans = aProt.transcript
    prot = anExon.protein
    genome = anExon.genome

Queries
-------

PyGeno allows for several kinds of queries, here are some snippets::

    #in this case both queries will yield the same result
    myGene.get(Protein, id = "ENSID...")
    myGenome.get(Protein, id = "ENSID...")

    #even complex stuff
    exons = myChromosome.get(Exon, {'start >=' : x1, 'stop <' : x2})
    hlaGenes = myGenome.get(Gene, {'name like' : 'HLA'})

    sry = myGenome.get(Transcript, { "gene.name" : 'SRY' })

To know the available fields for queries, there's a "help()" function::

    Gene.help()

Faster queries
---------------

To speed up loops use iterGet()::

    for prot in gene.iterGet(Protein) :
        print prot.sequence

For more speed create indexes on the fields you need the most::

    Gene.ensureGlobalIndex('name')

Getting sequences
-------------------

Anything that has a sequence can be indexed using the usual python list syntax::

    protein[34] # for the 34th amino acid
    protein[34:40] # for amino acids in [34, 40[

    transcript[23] #for the 23rd nucleotide of the transcript
    transcript[23:30] #for nucleotides in [23, 30[

    transcript.cDNA[23:30] #the same but for the
    protein coding DNA (without the UTRs)

Transcripts, Proteins, Exons also have a *.sequence* attribute. This attribute is the string-rendered sequence. It is perfect for printing, but it may contain '/' characters where the sequence is polymorphic, which you must take into account when indexing. On the other hand, if you use indexes directly on the object (as shown in the snippet above), pyGeno will use a binary representation of the sequences, so the indexing is independent of the polymorphisms present in the sequences.

Personalized Genomes
--------------------

Personalized Genomes are a powerful feature that allows you to work on the specific genomes and proteomes of your patients. You can even mix several SNP sets together::

    from pyGeno.Genome import Genome
    #the name of the snp set is defined inside the datawraps's manifest.ini file
    dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')
    #you can also define a filter (ex: a quality filter) for the SNPs
    dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())
    #and even mix several snp sets
    dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())

pyGeno allows you to customize the Polymorphisms that end up in the final sequences.
It supports SNPs, Inserts and Deletions::

    from pyGeno.SNPFiltering import SNPFilter
    from pyGeno.SNPFiltering import SequenceSNP

    class QMax_gt_filter(SNPFilter) :

        def __init__(self, threshold) :
            self.threshold = threshold

        def filter(self, chromosome, dummySRY = None) :
            if dummySRY.Qmax_gt > self.threshold :
                #other possibilities of return are SequenceInsert(), SequenceDelete()
                return SequenceSNP(dummySRY.alt)
            return None #None means keep the reference allele

    persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))

================================================
FILE: pyGeno/doc/source/snp_filter.rst
================================================
Filtering Polymorphisms (SNPs and Indels)
=========================================

Polymorphism filtering is an important part of personalized genomes. By creating your own filters you can easily tailor personalized genomes to your needs. The important thing to understand about the filtering process is that it gives you complete freedom about what should be inserted. Once pyGeno finds a polymorphism, it automatically triggers the filter to know what should be inserted at this position, and that can be anything you choose.

.. automodule:: SNPFiltering
   :members:

================================================
FILE: pyGeno/doc/source/tools.rst
================================================
Tools
=====

pyGeno provides a set of tools that can be used independently. Here you'll find goodies for translation, indexation, and more.

Progress Bar
-------------

pyGeno's awesome progress bar, with logging capabilities and remaining time estimation.

.. automodule:: tools.ProgressBar
   :members:

Useful functions
-----------------

This module is a bunch of handy bioinfo functions.

.. automodule:: tools.UsefulFunctions
   :members:

Binary sequences
-----------------

To encode sequences in binary formats

..
automodule:: tools.BinarySequence :members: Segment tree ------------- Segment trees are an optimised way to index a genome. .. automodule:: tools.SegmentTree :members: Secure memory map ------------------ A write protected memory map. .. automodule:: tools.SecureMmap :members: ================================================ FILE: pyGeno/examples/__init__.py ================================================ ================================================ FILE: pyGeno/examples/bootstraping.py ================================================ import pyGeno.bootstrap as B #~ imports the whole human reference genome #~ B.importHumanReference() B.importHumanReferenceYOnly() B.importDummySRY() ================================================ FILE: pyGeno/examples/snps_queries.py ================================================ from pyGeno.Genome import Genome from pyGeno.Gene import Gene from pyGeno.Transcript import Transcript from pyGeno.Protein import Protein from pyGeno.Exon import Exon from pyGeno.SNPFiltering import SNPFilter from pyGeno.SNPFiltering import SequenceSNP def printing(gene) : print "printing reference sequences\n-------" for trans in gene.get(Transcript) : print "\t-Transcript name:", trans.name print "\t-Protein:", trans.protein.sequence print "\t-Exons:" for e in trans.exons : print "\t\t Number:", e.number print "\t\t-CDS:", e.CDS print "\t\t-Strand:", e.strand print "\t\t-CDS_start:", e.CDS_start print "\t\t-CDS_end:", e.CDS_end def printVs(refGene, presGene) : print "Vs personalized sequences\n------" #iterGet returns an iterator instead of a list (faster) for trans in presGene.iterGet(Transcript) : refProt = refGene.get(Protein, id = trans.protein.id)[0] persProt = trans.protein print persProt.id print "\tref:" + refProt.sequence[:20] + "..." print "\tper:" + persProt.sequence[:20] + "..." 
print def fancyExonQuery(gene) : e = gene.get(Exon, {'CDS_start >' : 2655029, 'CDS_end <' : 2655200})[0] print "An exon with a CDS in ']2655029, 2655200[':", e.id class QMax_gt_filter(SNPFilter) : def __init__(self, threshold) : self.threshold = threshold def filter(self, chromosome, dummySRY) : if dummySRY.Qmax_gt > self.threshold : #other possibilities of return are SequenceInsert(), SequenceDelete() return SequenceSNP(dummySRY.alt) return None #None means keep the reference allele if __name__ == "__main__" : refGenome = Genome(name = 'GRCh37.75_Y-Only') persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10)) geneRef = refGenome.get(Gene, name = 'SRY')[0] persGene = persGenome.get(Gene, name = 'SRY')[0] printing(geneRef) print "\n" printVs(geneRef, persGene) print "\n" fancyExonQuery(geneRef) ================================================ FILE: pyGeno/importation/Genomes.py ================================================ import os, glob, gzip, tarfile, shutil, time, sys, gc, cPickle, tempfile, urllib2 from contextlib import closing from ConfigParser import SafeConfigParser from pyGeno.tools.ProgressBar import ProgressBar import pyGeno.configuration as conf from pyGeno.Genome import * from pyGeno.Chromosome import * from pyGeno.Gene import * from pyGeno.Transcript import * from pyGeno.Exon import * from pyGeno.Protein import * from pyGeno.tools.parsers.GTFTools import GTFFile from pyGeno.tools.ProgressBar import ProgressBar from pyGeno.tools.io import printf import gc #~ import objgraph def backUpDB() : """backup the current database version. automatically called by importGenome(). 
Returns the filename of the backup""" st = time.ctime().replace(' ', '_') fn = conf.pyGeno_RABA_DBFILE.replace('.db', '_%s-bck.db' % st) shutil.copy2(conf.pyGeno_RABA_DBFILE, fn) return fn def _decompressPackage(packageFile) : pFile = tarfile.open(packageFile) packageDir = tempfile.mkdtemp(prefix = "pyGeno_import_") if os.path.isdir(packageDir) : shutil.rmtree(packageDir) os.makedirs(packageDir) for mem in pFile : pFile.extract(mem, packageDir) return packageDir def _getFile(fil, directory) : if fil.find("http://") == 0 or fil.find("ftp://") == 0 : printf("Downloading file: %s..." % fil) finalFile = os.path.normpath('%s/%s' %(directory, fil.split('/')[-1])) # urllib.urlretrieve (fil, finalFile) with closing(urllib2.urlopen(fil)) as r: with open(finalFile, 'wb') as f: shutil.copyfileobj(r, f) printf('done.') else : finalFile = os.path.normpath('%s/%s' %(directory, fil)) return finalFile def deleteGenome(species, name) : """Removes a genome from the database""" printf('deleting genome (%s, %s)...' 
% (species, name)) conf.db.beginTransaction() objs = [] allGood = True try : genome = Genome_Raba(name = name, species = species.lower()) objs.append(genome) pBar = ProgressBar(label = 'preparing') for typ in (Chromosome_Raba, Gene_Raba, Transcript_Raba, Exon_Raba, Protein_Raba) : pBar.update() f = RabaQuery(typ, namespace = genome._raba_namespace) f.addFilter({'genome' : genome}) for e in f.iterRun() : objs.append(e) pBar.close() pBar = ProgressBar(nbEpochs = len(objs), label = 'deleting objects') for e in objs : pBar.update() e.delete() pBar.close() except KeyError as e : raise KeyError("\tWARNING, couldn't remove genome from db, maybe it's not there: ", e) printf('\tdeleting folder') try : shutil.rmtree(conf.getGenomeSequencePath(species, name)) except OSError as e: printf('\tWARNING, Unable to delete folder: %s' % e) allGood = False conf.db.endTransaction() return allGood def importGenome(packageFile, batchSize = 50, verbose = 0) : """Import a pyGeno genome package. A genome package is a folder or a tar.gz archive that contains at its root: * gzipped fasta files for all chromosomes, or URLs from which they must be downloaded * a gzipped GTF gene_set file from Ensembl, or a URL from which it must be downloaded * a manifest.ini file such as:: [package_infos] description = Test package. This package installs only chromosome Y of mus musculus maintainer = Tariq Daouda maintainer_contact = tariq.daouda [at] umontreal version = GRCm38.73 [genome] species = Mus_musculus name = GRCm38_test source = http://useast.ensembl.org/info/data/ftp/index.html [chromosome_files] Y = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz / or an url such as ftp://... or http:// [gene_set] gtf = Mus_musculus.GRCm38.73_Y-only.gtf.gz / or an url such as ftp://...
or http:// All files except the manifest can be downloaded from: http://useast.ensembl.org/info/data/ftp/index.html A rollback is performed if an exception is caught during importation batchSize sets the number of genes to parse before performing a database save. PCs with little ram like small values, while those endowed with more memory may perform faster with higher ones. Verbose must be an int [0, 4] for various levels of verbosity """ def reformatItems(items) : s = str(items) s = s.replace('[', '').replace(']', '').replace("',", ': ').replace('), ', '\n').replace("'", '').replace('(', '').replace(')', '') return s printf('Importing genome package: %s... (This may take a while)' % packageFile) isDir = False if not os.path.isdir(packageFile) : packageDir = _decompressPackage(packageFile) else : isDir = True packageDir = packageFile parser = SafeConfigParser() parser.read(os.path.normpath(packageDir+'/manifest.ini')) packageInfos = parser.items('package_infos') genomeName = parser.get('genome', 'name') species = parser.get('genome', 'species') genomeSource = parser.get('genome', 'source') seqTargetDir = conf.getGenomeSequencePath(species.lower(), genomeName) if os.path.isdir(seqTargetDir) : raise KeyError("The directory %s already exists, Please call deleteGenome() first if you want to reinstall" % seqTargetDir) gtfFile = _getFile(parser.get('gene_set', 'gtf'), packageDir) chromosomesFiles = {} chromosomeSet = set() for key, fil in parser.items('chromosome_files') : chromosomesFiles[key] = _getFile(fil, packageDir) chromosomeSet.add(key) try : genome = Genome(name = genomeName, species = species) except KeyError: pass else : raise KeyError("There seems to be already a genome (%s, %s), please call deleteGenome() first if you want to reinstall it" % (genomeName, species)) genome = Genome_Raba() genome.set(name = genomeName, species = species, source = genomeSource, packageInfos = packageInfos) printf("Importing:\n\t%s\nGenome:\n\t%s\n..." 
% (reformatItems(packageInfos).replace('\n', '\n\t'), reformatItems(parser.items('genome')).replace('\n', '\n\t'))) chros = _importGenomeObjects(gtfFile, chromosomeSet, genome, batchSize, verbose) os.makedirs(seqTargetDir) startChro = 0 pBar = ProgressBar(nbEpochs = len(chros)) for chro in chros : pBar.update(label = "Importing DNA, chro %s" % chro.number) length = _importSequence(chro, chromosomesFiles[chro.number.lower()], seqTargetDir) chro.start = startChro chro.end = startChro+length startChro = chro.end chro.save() pBar.close() if not isDir : shutil.rmtree(packageDir) #~ objgraph.show_most_common_types(limit=20) return True #~ @profile def _importGenomeObjects(gtfFilePath, chroSet, genome, batchSize, verbose = 0) : """verbose must be an int [0, 4] for various levels of verbosity""" class Store(object) : def __init__(self, conf) : self.conf = conf self.chromosomes = {} self.genes = {} self.transcripts = {} self.proteins = {} self.exons = {} def batch_save(self) : self.conf.db.beginTransaction() for c in self.genes.itervalues() : c.save() conf.removeFromDBRegistery(c) for c in self.transcripts.itervalues() : c.save() conf.removeFromDBRegistery(c.exons) conf.removeFromDBRegistery(c) for c in self.proteins.itervalues() : c.save() conf.removeFromDBRegistery(c) self.conf.db.endTransaction() del(self.genes) del(self.transcripts) del(self.proteins) del(self.exons) self.genes = {} self.transcripts = {} self.proteins = {} self.exons = {} gc.collect() def save_chros(self) : pBar = ProgressBar(nbEpochs = len(self.chromosomes)) for c in self.chromosomes.itervalues() : pBar.update(label = 'Chr %s' % c.number) c.save() pBar.close() printf('Importing gene set infos from %s...' 
% gtfFilePath) printf('Backing up indexes...') indexes = conf.db.getIndexes() printf("Dropping all your indexes (don't worry, I'll restore them later)...") Genome_Raba.flushIndexes() Chromosome_Raba.flushIndexes() Gene_Raba.flushIndexes() Transcript_Raba.flushIndexes() Protein_Raba.flushIndexes() Exon_Raba.flushIndexes() printf("Parsing gene set...") gtf = GTFFile(gtfFilePath, gziped = True) printf('Done. Importation begins!') store = Store(conf) chroNumber = None pBar = ProgressBar(nbEpochs = len(gtf)) for line in gtf : chroN = line['seqname'] pBar.update(label = "Chr %s" % chroN) if (chroN.upper() in chroSet or chroN.lower() in chroSet): strand = line['strand'] gene_biotype = line['gene_biotype'] regionType = line['feature'] frame = line['frame'] start = int(line['start']) - 1 end = int(line['end']) if start > end : start, end = end, start chroNumber = chroN.upper() if chroNumber not in store.chromosomes : store.chromosomes[chroNumber] = Chromosome_Raba() store.chromosomes[chroNumber].set(genome = genome, number = chroNumber) try : geneId = line['gene_id'] geneName = line['gene_name'] except KeyError : geneId = None geneName = None if verbose : printf('Warning: no gene_id/name found in line %s' % (line,)) if geneId is not None : if geneId not in store.genes : if len(store.genes) > batchSize : store.batch_save() if verbose > 0 : printf('\tGene %s, %s...'
% (geneId, geneName)) store.genes[geneId] = Gene_Raba() store.genes[geneId].set(genome = genome, id = geneId, chromosome = store.chromosomes[chroNumber], name = geneName, strand = strand, biotype = gene_biotype) if start < store.genes[geneId].start or store.genes[geneId].start is None : store.genes[geneId].start = start if end > store.genes[geneId].end or store.genes[geneId].end is None : store.genes[geneId].end = end try : transId = line['transcript_id'] transName = line['transcript_name'] try : transcript_biotype = line['transcript_biotype'] except KeyError : transcript_biotype = None except KeyError : transId = None transName = None if verbose > 2 : printf('\t\tWarning: no transcript_id, name found in line %s' % (line,)) if transId is not None : if transId not in store.transcripts : if verbose > 1 : printf('\t\tTranscript %s, %s...' % (transId, transName)) store.transcripts[transId] = Transcript_Raba() store.transcripts[transId].set(genome = genome, id = transId, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), name = transName, biotype=transcript_biotype) if start < store.transcripts[transId].start or store.transcripts[transId].start is None: store.transcripts[transId].start = start if end > store.transcripts[transId].end or store.transcripts[transId].end is None: store.transcripts[transId].end = end try : protId = line['protein_id'] except KeyError : protId = None if verbose > 2 : printf('Warning: no protein_id found in line %s' % (line,)) # Store selenocysteine positions in transcript if regionType == 'Selenocysteine': store.transcripts[transId].selenocysteine.append(start) if protId is not None and protId not in store.proteins : if verbose > 1 : printf('\t\tProtein %s...'
% (protId)) store.proteins[protId] = Protein_Raba() store.proteins[protId].set(genome = genome, id = protId, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), transcript = store.transcripts.get(transId, None), name = transName) store.transcripts[transId].protein = store.proteins[protId] try : exonNumber = int(line['exon_number']) - 1 exonKey = (transId, exonNumber) except KeyError : exonNumber = None exonKey = None if verbose > 2 : printf('Warning: no exon number or id found in line %s' % (line,)) if exonKey is not None : if verbose > 3 : printf('\t\t\texon %s...' % (exonKey,)) if exonKey not in store.exons and regionType == 'exon' : store.exons[exonKey] = Exon_Raba() store.exons[exonKey].set(genome = genome, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), transcript = store.transcripts.get(transId, None), protein = store.proteins.get(protId, None), strand = strand, number = exonNumber, start = start, end = end) store.transcripts[transId].exons.append(store.exons[exonKey]) try : store.exons[exonKey].id = line['exon_id'] except KeyError : pass if regionType == 'exon' : if start < store.exons[exonKey].start or store.exons[exonKey].start is None: store.exons[exonKey].start = start if end > store.exons[exonKey].end or store.exons[exonKey].end is None: store.exons[exonKey].end = end elif regionType == 'CDS' : store.exons[exonKey].CDS_start = start store.exons[exonKey].CDS_end = end store.exons[exonKey].frame = frame elif regionType == 'stop_codon' : if strand == '+' : if store.exons[exonKey].CDS_end != None : store.exons[exonKey].CDS_end += 3 if store.exons[exonKey].end < store.exons[exonKey].CDS_end : store.exons[exonKey].end = store.exons[exonKey].CDS_end if store.transcripts[transId].end < store.exons[exonKey].CDS_end : store.transcripts[transId].end = store.exons[exonKey].CDS_end if store.genes[geneId].end < store.exons[exonKey].CDS_end : store.genes[geneId].end = store.exons[exonKey].CDS_end if
strand == '-' : if store.exons[exonKey].CDS_start != None : store.exons[exonKey].CDS_start -= 3 if store.exons[exonKey].start > store.exons[exonKey].CDS_start : store.exons[exonKey].start = store.exons[exonKey].CDS_start if store.transcripts[transId].start > store.exons[exonKey].CDS_start : store.transcripts[transId].start = store.exons[exonKey].CDS_start if store.genes[geneId].start > store.exons[exonKey].CDS_start : store.genes[geneId].start = store.exons[exonKey].CDS_start pBar.close() store.batch_save() conf.db.beginTransaction() printf('almost done saving chromosomes...') store.save_chros() printf('saving genome object...') genome.save() conf.db.endTransaction() conf.db.beginTransaction() printf('restoring core indexes...') # Genome.ensureGlobalIndex(('name', 'species')) # Chromosome.ensureGlobalIndex('genome') # Gene.ensureGlobalIndex('genome') # Transcript.ensureGlobalIndex('genome') # Protein.ensureGlobalIndex('genome') # Exon.ensureGlobalIndex('genome') Transcript.ensureGlobalIndex('exons') printf('committing changes...') conf.db.endTransaction() conf.db.beginTransaction() printf('restoring user indexes') pBar = ProgressBar(label = "restoring", nbEpochs = len(indexes)) for idx in indexes : pBar.update() conf.db.execute(idx[-1].replace('CREATE INDEX', 'CREATE INDEX IF NOT EXISTS')) pBar.close() printf('committing changes...') conf.db.endTransaction() return store.chromosomes.values() #~ @profile def _importSequence(chromosome, fastaFile, targetDir) : "Serializes fastas into .dat files" f = gzip.open(fastaFile) header = f.readline() strRes = f.read().upper().replace('\n', '').replace('\r', '') f.close() fn = '%s/chromosome%s.dat' % (targetDir, chromosome.number) f = open(fn, 'w') f.write(strRes) f.close() chromosome.dataFile = fn chromosome.header = header return len(strRes) ================================================ FILE: pyGeno/importation/SNPs.py ================================================ import os, urllib, shutil from ConfigParser import
SafeConfigParser import pyGeno.configuration as conf from pyGeno.SNP import * from pyGeno.tools.ProgressBar import ProgressBar from pyGeno.tools.io import printf from Genomes import _decompressPackage, _getFile from pyGeno.tools.parsers.CasavaTools import SNPsTxtFile from pyGeno.tools.parsers.VCFTools import VCFFile from pyGeno.tools.parsers.CSVTools import CSVFile def importSNPs(packageFile) : """The big wrapper, this function should detect the SNP type by the package manifest and then launch the corresponding function. Here's an example of a SNP manifest file for Casava SNPs:: [package_infos] description = Casava SNPs for testing purposes maintainer = Tariq Daouda maintainer_contact = tariq.daouda [at] umontreal version = 1 [set_infos] species = human name = dummySRY type = Agnostic source = my place at the IRIC [snps] filename = snps.txt # as with genomes you can either include the file at the root of the package or specify a URL from which it must be downloaded """ printf("Importing polymorphism set: %s... (This may take a while)" % packageFile) isDir = False if not os.path.isdir(packageFile) : packageDir = _decompressPackage(packageFile) else : isDir = True packageDir = packageFile fpMan = os.path.normpath(packageDir+'/manifest.ini') if not os.path.isfile(fpMan) : raise ValueError("No file named manifest.ini!
Mais quel SCANDALE!!!!") parser = SafeConfigParser() parser.read(os.path.normpath(packageDir+'/manifest.ini')) packageInfos = parser.items('package_infos') setName = parser.get('set_infos', 'name') typ = parser.get('set_infos', 'type') if typ.lower()[-3:] != 'snp' : typ += 'SNP' species = parser.get('set_infos', 'species').lower() genomeSource = parser.get('set_infos', 'source') snpsFileTmp = parser.get('snps', 'filename').strip() snpsFile = _getFile(parser.get('snps', 'filename'), packageDir) return_value = None try : SMaster = SNPMaster(setName = setName) except KeyError : if typ.lower() == 'casavasnp' : return_value = _importSNPs_CasavaSNP(setName, species, genomeSource, snpsFile) elif typ.lower() == 'dbsnpsnp' : return_value = _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile) elif typ.lower() == 'dbsnp' : return_value = _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile) elif typ.lower() == 'tophatsnp' : return_value = _importSNPs_TopHatSNP(setName, species, genomeSource, snpsFile) elif typ.lower() == 'agnosticsnp' : return_value = _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile) else : raise FutureWarning('Unknown SNP type in manifest %s' % typ) else : raise KeyError("There's already a SNP set by the name %s. 
Use deleteSNPs() to remove it first" %setName) if not isDir : shutil.rmtree(packageDir) return return_value def deleteSNPs(setName) : """deletes a set of polymorphisms""" con = conf.db try : SMaster = SNPMaster(setName = setName) con.beginTransaction() SNPType = SMaster.SNPType con.delete(SNPType, 'setName = ?', (setName,)) SMaster.delete() con.endTransaction() except KeyError : raise KeyError("Can't delete the setName %s because I can't find it in SNPMaster, maybe there's no set by that name" % setName) return True def _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile) : "This function will also create an index on start->chromosomeNumber->setName. Warning : pyGeno will interpret all positions as 0 based" printf('importing SNP set %s for species %s...' % (setName, species)) snpData = CSVFile() snpData.parse(snpsFile, separator = "\t") AgnosticSNP.dropIndex(('start', 'chromosomeNumber', 'setName')) conf.db.beginTransaction() pBar = ProgressBar(len(snpData)) pLabel = '' currChrNumber = None for snpEntry in snpData : tmpChr = snpEntry['chromosomeNumber'] if tmpChr != currChrNumber : currChrNumber = tmpChr pLabel = 'Chr %s...' % currChrNumber snp = AgnosticSNP() snp.species = species snp.setName = setName for f in snp.getFields() : try : setattr(snp, f, snpEntry[f]) except KeyError : if f != 'species' and f != 'setName' : printf("Warning: file type has no key %s" % f) snp.quality = float(snp.quality) snp.start = int(snp.start) snp.end = int(snp.end) snp.save() pBar.update(label = pLabel) pBar.close() snpMaster = SNPMaster() snpMaster.set(setName = setName, SNPType = 'AgnosticSNP', species = species) snpMaster.save() printf('saving...') conf.db.endTransaction() printf('creating indexes...') AgnosticSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName')) printf('importation of SNP set %s for species %s done.'
%(setName, species)) return True def _importSNPs_CasavaSNP(setName, species, genomeSource, snpsFile) : "This function will also create an index on start->chromosomeNumber->setName. Warning : pyGeno positions are 0 based" printf('importing SNP set %s for species %s...' % (setName, species)) snpData = SNPsTxtFile(snpsFile) CasavaSNP.dropIndex(('start', 'chromosomeNumber', 'setName')) conf.db.beginTransaction() pBar = ProgressBar(len(snpData)) pLabel = '' currChrNumber = None for snpEntry in snpData : tmpChr = snpEntry['chromosomeNumber'] if tmpChr != currChrNumber : currChrNumber = tmpChr pLabel = 'Chr %s...' % currChrNumber snp = CasavaSNP() snp.species = species snp.setName = setName for f in snp.getFields() : try : setattr(snp, f, snpEntry[f]) except KeyError : if f != 'species' and f != 'setName' : printf("Warning: file type has no key %s" % f) snp.start -= 1 snp.end -= 1 snp.save() pBar.update(label = pLabel) pBar.close() snpMaster = SNPMaster() snpMaster.set(setName = setName, SNPType = 'CasavaSNP', species = species) snpMaster.save() printf('saving...') conf.db.endTransaction() printf('creating indexes...') CasavaSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName')) printf('importation of SNP set %s for species %s done.' %(setName, species)) return True def _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile) : "This function will also create an index on start->chromosomeNumber->setName. Warning : pyGeno positions are 0 based" snpData = VCFFile(snpsFile, gziped = True, stream = True) dbSNPSNP.dropIndex(('start', 'chromosomeNumber', 'setName')) conf.db.beginTransaction() pBar = ProgressBar() pLabel = '' for snpEntry in snpData : pBar.update(label = 'Chr %s, %s...'
% (snpEntry['#CHROM'], snpEntry['ID'])) snp = dbSNPSNP() for f in snp.getFields() : try : setattr(snp, f, snpEntry[f]) except KeyError : pass snp.chromosomeNumber = snpEntry['#CHROM'] snp.species = species snp.setName = setName snp.start = snpEntry['POS']-1 snp.alt = snpEntry['ALT'] snp.ref = snpEntry['REF'] snp.end = snp.start+len(snp.alt) snp.save() pBar.close() snpMaster = SNPMaster() snpMaster.set(setName = setName, SNPType = 'dbSNPSNP', species = species) snpMaster.save() printf('saving...') conf.db.endTransaction() printf('creating indexes...') dbSNPSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName')) printf('importation of SNP set %s for species %s done.' %(setName, species)) return True def _importSNPs_TopHatSNP(setName, species, genomeSource, snpsFile) : raise FutureWarning('Not implemented yet') ================================================ FILE: pyGeno/importation/__init__.py ================================================ __all__ = ['Genomes', 'SNPs'] ================================================ FILE: pyGeno/pyGenoObjectBases.py ================================================ import time, types, string import configuration as conf from rabaDB.rabaSetup import * from rabaDB.Raba import * from rabaDB.filters import RabaQuery def nosave() : raise ValueError('You can only save objects that are linked to reference genomes') class pyGenoRabaObject(Raba) : """pyGeno uses rabaDB to persistently store data. Most persistent objects have classes that inherit from this one (Genome_Raba, Chromosome_Raba, Gene_Raba, Protein_Raba, Exon_Raba). These classes are not meant to be accessed directly. Users manipulate wrappers such as : Genome, Chromosome etc... pyGenoRabaObject extends the Raba class by adding a function _curate that is called just before saving.
This class is to be considered abstract, and is not meant to be instantiated""" _raba_namespace = conf.pyGeno_RABA_NAMESPACE _raba_abstract = True # not saved in db by default def __init__(self) : if self.__class__ is pyGenoRabaObject : raise TypeError("This class is abstract") def _curate(self) : "Last operations performed before saving, must be implemented in child" raise TypeError("This method is abstract and should be implemented in child") def save(self) : """Calls _curate() before performing a normal rabaDB lazy save (saving only occurs if the object has been modified)""" if self.mutated() : self._curate() Raba.save(self) class pyGenoRabaObjectWrapper_metaclass(type) : """This metaclass keeps track of the relationship between wrapped classes and wrappers """ _wrappers = {} def __new__(cls, name, bases, dct) : clsObj = type.__new__(cls, name, bases, dct) cls._wrappers[dct['_wrapped_class']] = clsObj return clsObj class RLWrapper(object) : """A wrapper for rabalists that replaces raba objects with pyGeno objects""" def __init__(self, rabaObj, listObjectType, rl) : self.rabaObj = rabaObj self.rl = rl self.listObjectType = listObjectType def __getitem__(self, i) : return self.listObjectType(wrapped_object_and_bag = (self.rl[i], self.rabaObj.bagKey)) def __getattr__(self, name) : rl = object.__getattribute__(self, 'rl') return getattr(rl, name) class pyGenoRabaObjectWrapper(object) : """All the wrapper classes such as Genome and Chromosome inherit from this class. It provides most of the functions that make pyGeno useful, such as get(), count(), ensureGlobalIndex().
This class is to be considered abstract, and is not meant to be instantiated""" __metaclass__ = pyGenoRabaObjectWrapper_metaclass _wrapped_class = None _bags = {} def __init__(self, wrapped_object_and_bag = (), *args, **kwargs) : if self.__class__ is pyGenoRabaObjectWrapper : raise TypeError("This class is abstract") if wrapped_object_and_bag != () : assert wrapped_object_and_bag[0]._rabaClass is self._wrapped_class self.wrapped_object = wrapped_object_and_bag[0] self.bagKey = wrapped_object_and_bag[1] pyGenoRabaObjectWrapper._bags[self.bagKey][self._getObjBagKey(self.wrapped_object)] = self else : self.wrapped_object = self._wrapped_class(*args, **kwargs) self.bagKey = time.time() pyGenoRabaObjectWrapper._bags[self.bagKey] = {} pyGenoRabaObjectWrapper._bags[self.bagKey][self._getObjBagKey(self.wrapped_object)] = self self._load_sequencesTriggers = set() self.loadSequences = True self.loadData = True self.loadBinarySequences = True def _getObjBagKey(self, obj) : """pyGeno objects are kept in bags to ensure that reference objects are loaded only once. This function returns the bag key of the current object""" return (obj._rabaClass.__name__, obj.raba_id) def _makeLoadQuery(self, objectType, *args, **coolArgs) : # conf.db.enableDebug(True) f = RabaQuery(objectType._wrapped_class, namespace = self._wrapped_class._raba_namespace) coolArgs[self._wrapped_class.__name__[:-5]] = self.wrapped_object #[:-5] removes _Raba from class name if len(args) > 0 and type(args[0]) is types.ListType : for a in args[0] : if type(a) is types.DictType : f.addFilter(**a) else : f.addFilter(*args, **coolArgs) return f def count(self, objectType, *args, **coolArgs) : """Returns the number of elements satisfying the query""" return self._makeLoadQuery(objectType, *args, **coolArgs).count() def get(self, objectType, *args, **coolArgs) : """Raba Magic inside. This is the function that you use for querying pyGeno's DB.
Usage examples: * myGenome.get(Gene, name = 'TPST2') * myGene.get(Protein, id = 'ENSID...') * myGenome.get(Transcript, {'start >' : x, 'end <' : y})""" ret = [] for e in self._makeLoadQuery(objectType, *args, **coolArgs).iterRun() : if issubclass(objectType, pyGenoRabaObjectWrapper) : ret.append(objectType(wrapped_object_and_bag = (e, self.bagKey))) else : ret.append(e) return ret def iterGet(self, objectType, *args, **coolArgs) : """Same as get(), but returns the elements one by one, which is much more efficient for large outputs""" for e in self._makeLoadQuery(objectType, *args, **coolArgs).iterRun() : if issubclass(objectType, pyGenoRabaObjectWrapper) : yield objectType(wrapped_object_and_bag = (e, self.bagKey)) else : yield e #~ def ensureIndex(self, fields) : #~ """ #~ Warning: May not work on some systems, see ensureGlobalIndex #~ #~ Creates a partial index on self (if it does not exist). #~ Ex: myTranscript.ensureIndex('name')""" #~ #~ where, whereValues = '%s=?' %(self._wrapped_class.__name__[:-5]), self.wrapped_object #~ self._wrapped_class.ensureIndex(fields, where, (whereValues,)) #~ def dropIndex(self, fields) : #~ """Warning: May not work on some systems, see dropGlobalIndex #~ #~ Drops a partial index on self. Ex: myTranscript.dropIndex('name')""" #~ where, whereValues = '%s=?'
%(self._wrapped_class.__name__[:-5]), self.wrapped_object #~ self._wrapped_class.dropIndex(fields, where, (whereValues,)) def __getattr__(self, name) : """If a wrapper does not have a specific field, pyGeno will look for it in the wrapped_object""" # print "pyGenoObjectBases __getattr__ : " + name + " from " + str(type(self)) if name == 'save' or name == 'delete' : raise AttributeError("You can't delete or save an object from wrapper, try .wrapped_object.delete()/save()") if name in self._load_sequencesTriggers and self.loadSequences : self.loadSequences = False self._load_sequences() return getattr(self, name) if name in self._load_sequencesTriggers and self.loadData : self.loadData = False self._load_data() return getattr(self, name) if name[:4] == 'bin_' and self.loadBinarySequences : self.loadBinarySequences = False self._load_bin_sequence() return getattr(self, name) attr = getattr(self.wrapped_object, name) if isRabaObject(attr) : attrKey = self._getObjBagKey(attr) if attrKey in pyGenoRabaObjectWrapper._bags[self.bagKey] : retObj = pyGenoRabaObjectWrapper._bags[self.bagKey][attrKey] else : wCls = pyGenoRabaObjectWrapper_metaclass._wrappers[attr._rabaClass] retObj = wCls(wrapped_object_and_bag = (attr, self.bagKey)) return retObj return attr @classmethod def getIndexes(cls) : """Returns a list of indexes attached to the object's class. Ex Transcript.getIndexes()""" return cls._wrapped_class.getIndexes() @classmethod def flushIndexes(cls) : """Drops all the indexes attached to the object's class. Ex Transcript.flushIndexes()""" return cls._wrapped_class.flushIndexes() @classmethod def help(cls) : """Returns a list of available fields for queries. Ex Transcript.help()""" return cls._wrapped_class.help().replace("_Raba", "") @classmethod def ensureGlobalIndex(cls, fields) : """Add a GLOBAL index to the db to speed up lookups. Fields can be a list of fields for Multi-Column Indices or simply the name of a single field.
A global index is an index on the entire type. A global index on 'Transcript' on field 'name', will index the names for all the transcripts in the database""" cls._wrapped_class.ensureIndex(fields) @classmethod def dropGlobalIndex(cls, fields) : """Drops an index, the opposite of ensureGlobalIndex()""" cls._wrapped_class.dropIndex(fields) def getSequencesData(self) : """This lazy abstract function is only called if the object sequences need to be loaded""" raise NotImplementedError("This fct loads non binary sequences and should be implemented in child if needed") def _load_sequences(self) : """This lazy abstract function is only called if the object sequences need to be loaded""" self._load_data() def _load_data(self) : """This lazy abstract function is only called if the object sequences need to be loaded""" raise NotImplementedError("This fct loads non binary sequences and should be implemented in child if needed") def _load_bin_sequence(self) : """Same as _load_sequences(), but loads binary sequences""" raise NotImplementedError("This fct loads binary sequences and should be implemented in child if needed") ================================================ FILE: pyGeno/tests/__init__.py ================================================ ================================================ FILE: pyGeno/tests/test_csv.py ================================================ import unittest from pyGeno.tools.parsers.CSVTools import * class CSVTests(unittest.TestCase): def setUp(self): pass def tearDown(self): pass def test_createParse(self) : testVals = ["test", "test2"] c = CSVFile(legend = ["col1", "col2"], separator = "\t") l = c.newLine() l["col1"] = testVals[0] l = c.newLine() l["col1"] = testVals[1] c.save("test.csv") # print "----", l c2 = CSVFile() c2.parse("test.csv", separator = "\t") i = 0 for l in c2 : self.assertEqual(l["col1"], testVals[i]) i += 1 def runTests() : unittest.main() if __name__ == "__main__" : runTests() 
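Editorial sketch, not a file of the repository: the round trip exercised by test_createParse in pyGeno/tests/test_csv.py (write rows under a legend, save, parse the file back, compare one column) can be reproduced with only the standard library. The code below is a stand-in for illustration and does not show how pyGeno's CSVFile is implemented. The helper names write_rows, read_column and round_trip are hypothetical, and the sketch assumes Python 3, whereas the repository code is Python 2.

```python
import csv
import os
import tempfile


def write_rows(path, legend, rows, separator="\t"):
    """Write a header line (the legend) followed by one line per row dict."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=legend, delimiter=separator)
        writer.writeheader()
        for row in rows:
            # Columns missing from a row are written as empty strings.
            writer.writerow(row)


def read_column(path, column, separator="\t"):
    """Return the values of one column, in file order."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=separator)
        return [line[column] for line in reader]


def round_trip(values):
    """Save values under 'col1' (leaving 'col2' empty) and read 'col1' back."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        write_rows(path, ["col1", "col2"], [{"col1": v} for v in values])
        return read_column(path, "col1")
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(round_trip(["test", "test2"]))
```

Writing to a temporary file and removing it afterwards keeps the check from leaving a test.csv behind in the working directory, which the original test does.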
================================================ FILE: pyGeno/tests/test_genome.py ================================================ import unittest from pyGeno.Genome import * import pyGeno.bootstrap as B from pyGeno.importation.Genomes import * from pyGeno.importation.SNPs import * class pyGenoSNPTests(unittest.TestCase): def setUp(self): # try : # B.importGenome("Human.GRCh37.75_Y-Only.tar.gz") # except KeyError : # deleteGenome("human", "GRCh37.75_Y-Only") # B.importGenome("Human.GRCh37.75_Y-Only.tar.gz") # print "--> Seems to already exist in db" # try : # B.importSNPs("Human_agnostic.dummySRY.tar.gz") # except KeyError : # deleteSNPs("dummySRY_AGN") # B.importSNPs("Human_agnostic.dummySRY.tar.gz") # print "--> Seems to already exist in db" # try : # B.importSNPs("Human_agnostic.dummySRY_indels") # except KeyError : # deleteSNPs("dummySRY_AGN_indels") # B.importSNPs("Human_agnostic.dummySRY_indels") # print "--> Seems to already exist in db" self.ref = Genome(name = 'GRCh37.75_Y-Only') def tearDown(self): pass # @unittest.skip("skipping") def test_vanilla(self) : dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN') persProt = dummy.get(Protein, id = 'ENSP00000438917')[0] refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0] self.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20]) self.assertEqual('HTGCAATCATATGCTTCTGC', persProt.transcript.cDNA[:20]) # @unittest.skip("skipping") def test_noModif(self) : from pyGeno.SNPFiltering import SNPFilter class MyFilter(SNPFilter) : def __init__(self) : SNPFilter.__init__(self) def filter(self, chromosome, dummySRY_AGN) : return None dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter()) persProt = dummy.get(Protein, id = 'ENSP00000438917')[0] refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0] self.assertEqual(persProt.transcript.cDNA[:20], refProt.transcript.cDNA[:20]) # @unittest.skip("skipping") def test_insert(self) : from pyGeno.SNPFiltering 
import SNPFilter class MyFilter(SNPFilter) : def __init__(self) : SNPFilter.__init__(self) def filter(self, chromosome, dummySRY_AGN) : from pyGeno.SNPFiltering import SequenceInsert refAllele = chromosome.refSequence[dummySRY_AGN.start] return SequenceInsert('XXX') dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter()) persProt = dummy.get(Protein, id = 'ENSP00000438917')[0] refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0] self.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20]) self.assertEqual('XXXATGCAATCATATGCTTC', persProt.transcript.cDNA[:20]) # @unittest.skip("skipping") def test_SNP(self) : from pyGeno.SNPFiltering import SNPFilter class MyFilter(SNPFilter) : def __init__(self) : SNPFilter.__init__(self) def filter(self, chromosome, dummySRY_AGN) : from pyGeno.SNPFiltering import SequenceSNP return SequenceSNP(dummySRY_AGN.alt) dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter()) persProt = dummy.get(Protein, id = 'ENSP00000438917')[0] refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0] self.assertEqual('M', refProt.sequence[0]) self.assertEqual('L', persProt.sequence[0]) # @unittest.skip("skipping") def test_deletion(self) : from pyGeno.SNPFiltering import SNPFilter class MyFilter(SNPFilter) : def __init__(self) : SNPFilter.__init__(self) def filter(self, chromosome, dummySRY_AGN) : from pyGeno.SNPFiltering import SequenceDel refAllele = chromosome.refSequence[dummySRY_AGN.start] return SequenceDel(1) dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter()) persProt = dummy.get(Protein, id = 'ENSP00000438917')[0] refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0] self.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20]) self.assertEqual('TGCAATCATATGCTTCTGCT', persProt.transcript.cDNA[:20]) # @unittest.skip("skipping") def test_indels(self) : from pyGeno.SNPFiltering import SNPFilter class 
MyFilter(SNPFilter) : def __init__(self) : SNPFilter.__init__(self) def filter(self, chromosome, dummySRY_AGN_indels) : from pyGeno.SNPFiltering import SequenceInsert ret = "" for s in dummySRY_AGN_indels : ret += "X" return SequenceInsert(ret) dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN_indels', SNPFilter = MyFilter()) persProt = dummy.get(Protein, id = 'ENSP00000438917')[0] refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0] self.assertEqual('XXXATGCAATCATATGCTTC', persProt.transcript.cDNA[:20]) # @unittest.skip("skipping") def test_bags(self) : dummy = Genome(name = 'GRCh37.75_Y-Only') self.assertEqual(dummy.wrapped_object, self.ref.wrapped_object) # @unittest.skip("skipping") def test_prot_find(self) : prot = self.ref.get(Protein, id = 'ENSP00000438917')[0] needle = prot.sequence[:10] self.assertEqual(0, prot.find(needle)) needle = prot.sequence[-10:] self.assertEqual(len(prot)-10, prot.find(needle)) # @unittest.skip("skipping") def test_trans_find(self) : trans = self.ref.get(Transcript, name = "SRY-001")[0] self.assertEqual(0, trans.find(trans[:5])) # @unittest.skip("remote server down") # def test_import_remote_genome(self) : # self.assertRaises(KeyError, B.importRemoteGenome, "Human.GRCh37.75_Y-Only.tar.gz") # @unittest.skip("remote server down") # def test_import_remote_snps(self) : # self.assertRaises(KeyError, B.importRemoteSNPs, "Human_agnostic.dummySRY.tar.gz") def runTests() : try : B.importGenome("Human.GRCh37.75_Y-Only.tar.gz") except KeyError : deleteGenome("human", "GRCh37.75_Y-Only") B.importGenome("Human.GRCh37.75_Y-Only.tar.gz") print "--> Seems to already exist in db" try : B.importSNPs("Human_agnostic.dummySRY.tar.gz") except KeyError : deleteSNPs("dummySRY_AGN") B.importSNPs("Human_agnostic.dummySRY.tar.gz") print "--> Seems to already exist in db" try : B.importSNPs("Human_agnostic.dummySRY_indels") except KeyError : deleteSNPs("dummySRY_AGN_indels") B.importSNPs("Human_agnostic.dummySRY_indels") print "--> Seems 
to already exist in db" # import time # time.sleep(10) unittest.main() if __name__ == "__main__" : runTests() ================================================ FILE: pyGeno/tools/BinarySequence.py ================================================ import array, copy import UsefulFunctions as uf class BinarySequence : """A class for representing sequences in a binary format""" ALPHABETA_SIZE = 32 ALPHABETA_KMP = range(ALPHABETA_SIZE) def __init__(self, sequence, arrayForma, charToBinDict) : self.forma = arrayForma self.charToBin = charToBinDict self.sequence = sequence self.binSequence, self.defaultSequence, self.polymorphisms = self.encode(sequence) self.itemsize = self.binSequence.itemsize self.typecode = self.binSequence.typecode #print 'bin', len(self.sequence), len(self.binSequence) def encode(self, sequence): """Returns a tuple (binary reprensentation, default sequence, polymorphisms list)""" polymorphisms = [] defaultSequence = '' binSequence = array.array(self.forma.typecode) b = 0 i = 0 trueI = 0 #not inc in case if poly poly = set() while i < len(sequence)-1: b = b | self.forma[self.charToBin[sequence[i]]] if sequence[i+1] == '/' : poly.add(sequence[i]) i += 2 else : binSequence.append(b) if len(poly) > 0 : poly.add(sequence[i]) polymorphisms.append((trueI, poly)) poly = set() bb = 0 while b % 2 != 0 : b = b/2 defaultSequence += sequence[i] b = 0 i += 1 trueI += 1 if i < len(sequence) : b = b | self.forma[self.charToBin[sequence[i]]] binSequence.append(b) if len(poly) > 0 : if sequence[i] not in poly : poly.add(sequence[i]) polymorphisms.append((trueI, poly)) defaultSequence += sequence[i] return (binSequence, defaultSequence, polymorphisms) def __testFind(self, arr) : if len(arr) == 0: raise TypeError ('binary find, needle is empty') if arr.itemsize != self.itemsize : raise TypeError ('binary find, both arrays must have same item size, arr: %d, self: %d' % (arr.itemsize, self.itemsize)) def __testBinary(self, arr) : if len(arr) != len(self) : raise TypeError 
('bin operator, both arrays must be of same length') if arr.itemsize != self.itemsize : raise TypeError ('bin operator, both arrays must have same item size, arr: %d, self: %d' % (arr.itemsize, self.itemsize)) def findPolymorphisms(self, strSeq, strict = False): """ Compares strSeq with self.sequence. If not 'strict', this function ignores the cases of matching heterozygosity (ex: for a given position i, strSeq[i] = A and self.sequence[i] = 'A/G'). If 'strict' it returns all positions where strSeq differs from self.sequence """ arr = self.encode(strSeq)[0] res = [] if not strict : for i in range(len(arr)+len(self)) : if i >= len(arr) or i >= len(self) : break if arr[i] & self[i] == 0: res.append(i) else : for i in range(len(arr)+len(self)) : if i >= len(arr) or i >= len(self) : break if arr[i] != self[i] : res.append(i) return res def getPolymorphisms(self): """returns all polymorphisms as a list of (position, set of alleles) tuples""" return self.polymorphisms def getDefaultSequence(self) : """returns a default version of sequence where only the last allele of each polymorphism is shown""" return self.defaultSequence def __getSequenceVariants(self, x1, polyStart, polyStop, listSequence) : """polyStop is the index of the polymorphism at which the computation of combinations stops""" if polyStart < len(self.polymorphisms) and polyStart < polyStop: sequence = copy.copy(listSequence) ret = [] pk = self.polymorphisms[polyStart] posInSequence = pk[0]-x1 if posInSequence < len(listSequence) : for allele in pk[1] : sequence[posInSequence] = allele ret.extend(self.__getSequenceVariants(x1, polyStart+1, polyStop, sequence)) return ret else : return [''.join(listSequence)] def getSequenceVariants(self, x1 = 0, x2 = -1, maxVariantNumber = 128) : """returns the sequences resulting from all combinations of all polymorphisms between x1 and x2. 
The results is a couple (bool, variants of sequence between x1 and x2), bool == true if there's more combinaisons than maxVariantNumber""" if x2 == -1 : xx2 = len(self.defaultSequence) else : xx2 = x2 polyStart = None nbP = 1 stopped = False j = 0 for p in self.polymorphisms : if p[0] >= xx2 : break if x1 <= p[0] : if polyStart == None : polyStart = j nbP *= len(p[1]) if nbP > maxVariantNumber : stopped = True break j += 1 if polyStart == None : return (stopped, [self.defaultSequence[x1:xx2]]) return (stopped, self.__getSequenceVariants(x1, polyStart, j, list(self.defaultSequence[x1:xx2]))) def getNbVariants(self, x1, x2 = -1) : """returns the nb of variants of sequences between x1 and x2""" if x2 == -1 : xx2 = len(self.defaultSequence) else : xx2 = x2 nbP = 1 for p in self.polymorphisms: if x1 <= p[0] and p[0] <= xx2 : nbP *= len(p[1]) return nbP def _dichFind(self, needle, currHaystack, offset, lst = None) : """dichotomic search, if lst is None, will return the first position found. If it's a list, will return a list of all positions in lst. 
returns -1 or [] if no match found""" if len(currHaystack) == 1 : if (offset <= (len(self) - len(needle))) and (currHaystack[0] & needle[0]) > 0 and (self[offset+len(needle)-1] & needle[-1]) > 0 : found = True for i in xrange(1, len(needle)-1) : if self[offset + i] & needle[i] == 0 : found = False break if found : if lst is not None : lst.append(offset) else : return offset else : if lst is None : return -1 else : if (offset <= (len(self) - len(needle))) : if lst is not None : self._dichFind(needle, currHaystack[:len(currHaystack)/2], offset, lst) self._dichFind(needle, currHaystack[len(currHaystack)/2:], offset + len(currHaystack)/2, lst) else : v1 = self._dichFind(needle, currHaystack[:len(currHaystack)/2], offset, lst) if v1 > -1 : return v1 return self._dichFind(needle, currHaystack[len(currHaystack)/2:], offset + len(currHaystack)/2, lst) return -1 def _kmp_construct_next(self, pattern): """the helper function for KMP-string-searching is to construct the DFA. pattern should be an integer array. return a 2D array representing the DFA for moving the pattern.""" next = [[0 for state in pattern] for input_token in self.ALPHABETA_KMP] next[pattern[0]][0] = 1 restart_state = 0 for state in range(1, len(pattern)): for input_token in self.ALPHABETA_KMP: next[input_token][state] = next[input_token][restart_state] next[pattern[state]][state] = state + 1 restart_state = next[pattern[state]][restart_state] return next def _kmp_search_first(self, pInput_sequence, pPattern): """use KMP algorithm to search the first occurrence in the input_sequence of the pattern. both arguments are integer arrays. 
return the position of the occurence if found; otherwise, -1.""" input_sequence, pattern = pInput_sequence, [len(bin(e)) for e in pPattern] n, m = len(input_sequence), len(pattern) d = p = 0 next = self._kmp_construct_next(pattern) while d < n and p < m: p = next[len(bin(input_sequence[d]))][p] d += 1 if p == m: return d - p else: return -1 def _kmp_search_all(self, pInput_sequence, pPattern): """use KMP algorithm to search all occurrence in the input_sequence of the pattern. both arguments are integer arrays. return a list of the positions of the occurences if found; otherwise, [].""" r = [] input_sequence, pattern = [len(bin(e)) for e in pInput_sequence], [len(bin(e)) for e in pPattern] n, m = len(input_sequence), len(pattern) d = p = 0 next = self._kmp_construct_next(pattern) while d < n: p = next[input_sequence[d]][p] d += 1 if p == m: r.append(d - m) p = 0 return r def _kmp_find(self, needle, haystack, lst = None): """find with KMP-searching. needle is an integer array, reprensenting a pattern. haystack is an integer array, reprensenting the input sequence. if lst is None, return the first position found or -1 if no match found. If it's a list, will return a list of all positions in lst. returns -1 or [] if no match found.""" if lst != None: return self._kmp_search_all(haystack, needle) else: return self._kmp_search_first(haystack, needle) def findByBiSearch(self, strSeq) : """returns the first occurence of strSeq in self. Takes polymorphisms into account""" arr = self.encode(strSeq) return self._dichFind(arr[0], self, 0, lst = None) def findAllByBiSearch(self, strSeq) : """Same as find but returns a list of all occurences""" arr = self.encode(strSeq) lst = [] self._dichFind(arr[0], self, 0, lst) return lst def find(self, strSeq) : """returns the first occurence of strSeq in self. 
Takes polymorphisms into account""" arr = self.encode(strSeq) return self._kmp_find(arr[0], self) def findAll(self, strSeq) : """Same as find but returns a list of all occurrences""" arr = self.encode(strSeq) lst = [] lst = self._kmp_find(arr[0], self, lst) return lst def __and__(self, arr) : self.__testBinary(arr) ret = BinarySequence('', self.forma, self.charToBin) for i in range(len(arr)) : ret.append(self[i] & arr[i]) return ret def __xor__(self, arr) : self.__testBinary(arr) ret = BinarySequence('', self.forma, self.charToBin) for i in range(len(arr)) : ret.append(self[i] ^ arr[i]) return ret def __eq__(self, seq) : self.__testBinary(seq) if len(seq) != len(self) : return False return all( self[i] == seq[i] for i in range(len(self)) ) def append(self, arr) : self.binSequence.append(arr) def extend(self, arr) : self.binSequence.extend(arr) def decode(self, binSequence): """decodes a binary sequence to return a string""" try: binSeq = iter(binSequence[0]) except TypeError, te: binSeq = binSequence ret = '' for b in binSeq : ch = '' for c in self.charToBin : if b & self.forma[self.charToBin[c]] > 0 : ch += c +'/' if ch == '' : raise KeyError('Key %d unknown, bad format' % b) ret += ch[:-1] return ret def getChar(self, i): return self.decode([self.binSequence[i]]) def __len__(self): return len(self.binSequence) def __getitem__(self,i): return self.binSequence[i] def __setitem__(self, i, v): self.binSequence[i] = v class AABinarySequence(BinarySequence) : """A binary sequence of amino acids""" def __init__(self, sequence): f = array.array('I', [1L, 2L, 4L, 8L, 16L, 32L, 64L, 128L, 256L, 512L, 1024L, 2048L, 4096L, 8192L, 16384L, 32768L, 65536L, 131072L, 262144L, 524288L, 1048576L, 2097152L]) c = {'A': 17, 'C': 14, 'E': 19, 'D': 15, 'G': 13, 'F': 16, 'I': 3, 'H': 9, 'K': 8, '*': 1, 'M': 20, 'L': 0, 'N': 4, 'Q': 11, 'P': 6, 'S': 7, 'R': 5, 'T': 2, 'W': 10, 'V': 18, 'Y': 12, 'U': 21} BinarySequence.__init__(self, sequence, f, c) class 
NucBinarySequence(BinarySequence) : """A binary sequence of nucleotides""" def __init__(self, sequence): f = array.array('B', [1, 2, 4, 8]) c = {'A': 0, 'T': 1, 'C': 2, 'G': 3} ce = { 'R' : 'A/G', 'Y' : 'C/T', 'M': 'A/C', 'K' : 'T/G', 'W' : 'A/T', 'S' : 'C/G', 'B': 'C/G/T', 'D' : 'A/G/T', 'H' : 'A/C/T', 'V' : 'A/C/G', 'N': 'A/C/G/T' } lstSeq = list(sequence) for i in xrange(len(lstSeq)) : if lstSeq[i] in ce : lstSeq[i] = ce[lstSeq[i]] lstSeq = ''.join(lstSeq) BinarySequence.__init__(self, lstSeq, f, c) if __name__=="__main__": def test0() : #seq = 'ACED/E/GFIHK/MLMQPS/RTWVY' seq = 'ACED/E/GFIHK/MLMQPS/RTWVY/A/R' bSeq = AABinarySequence(seq) start = 0 stop = 4 rB = bSeq.getSequenceVariants_bck(start, stop) r = bSeq.getSequenceVariants(start, stop) #print start, stop, 'nb_comb_r', len(r[1]), set(rB[1])==set(r[1]) print start, stop#, 'nb_comb_r', len(r[1]), set(rB[1])==set(r[1]) #if set(rB[1])!=set(r[1]) : print '-AV-' print start, stop, 'nb_comb_r', len(rB[1]) print '\n'.join(rB[1]) print '=AP========' print start, stop, 'nb_comb_r', len(r[1]) print '\n'.join(r[1]) def testVariants() : seq = 'ATGAGTTTGCCGCGCN' bSeq = NucBinarySequence(seq) print bSeq.getSequenceVariants() testVariants() from random import randint alphabeta = ['A', 'C', 'G', 'T'] seq = '' for _ in range(8192): seq += alphabeta[randint(0, 3)] seq += 'ATGAGTTTGCCGCGCN' bSeq = NucBinarySequence(seq) ROUND = 512 PATTERN = 'GCGC' def testFind(): for i in range(ROUND): bSeq.find(PATTERN) def testFindByBiSearch(): for i in range(ROUND): bSeq.findByBiSearch(PATTERN) def testFindAll(): for i in range(ROUND): bSeq.findAll(PATTERN) def testFindAllByBiSearch(): for i in range(ROUND): bSeq.findAllByBiSearch(PATTERN) import cProfile print('find:\n') cProfile.run('testFind()') print('findAll:\n') cProfile.run('testFindAll()') print('findByBiSearch:\n') cProfile.run('testFindByBiSearch()') print('findAllByBiSearch:\n') cProfile.run('testFindAllByBiSearch()') ================================================ FILE: 
pyGeno/tools/ProgressBar.py ================================================ import sys, time, cPickle class ProgressBar : """A very simple unthreaded progress bar. This progress bar also logs stats in .logs. Usage example:: p = ProgressBar(nbEpochs = -1) for i in range(200000) : p.update(label = 'value of i %d' % i) p.close() If you don't know the maximum number of epochs you can enter nbEpochs < 1 """ def __init__(self, nbEpochs = -1, width = 25, label = "progress", minRefeshTime = 1) : self.width = width self.currEpoch = 0 self.nbEpochs = float(nbEpochs) self.bar = '' self.label = label self.wheel = ["-", "\\", "|", "/"] self.startTime = time.time() self.lastPrintTime = -1 self.minRefeshTime = minRefeshTime self.runtime = -1 self.runtime_hr = -1 self.avg = -1 self.remtime = -1 self.remtime_hr = -1 self.currTime = time.time() self.lastEpochDuration = -1 self.bars = [] self.miniSnake = '~-~-~-?:>' self.logs = {'epochDuration' : [], 'avg' : [], 'runtime' : [], 'remtime' : []} def formatTime(self, val) : if val < 60 : return '%.3fsc' % val elif val < 3600 : return '%.3fmin' % (val/60) else : return '%dh %dmin' % (int(val)/3600, int(val/60)%60) def _update(self) : tim = time.time() if self.nbEpochs > 1 : if self.currTime > 0 : self.lastEpochDuration = tim - self.currTime self.currTime = tim self.runtime = tim - self.startTime self.runtime_hr = self.formatTime(self.runtime) self.avg = self.runtime/self.currEpoch self.remtime = self.avg * (self.nbEpochs-self.currEpoch) self.remtime_hr = self.formatTime(self.remtime) def log(self) : """logs stats about the progression, without printing anything on screen""" self.logs['epochDuration'].append(self.lastEpochDuration) self.logs['avg'].append(self.avg) self.logs['runtime'].append(self.runtime) self.logs['remtime'].append(self.remtime) def saveLogs(self, filename) : """dumps logs into a nice pickle""" f = open(filename, 'wb') cPickle.dump(self.logs, f) f.close() def update(self, label = '', forceRefresh = False, log = False) 
: """the function to be called at each iteration. Setting log = True is the same as calling log() just after update()""" self.currEpoch += 1 tim = time.time() if (tim - self.lastPrintTime > self.minRefeshTime) or forceRefresh : self._update() wheelState = self.wheel[self.currEpoch%len(self.wheel)] if label == '' : slabel = self.label else : slabel = label if self.nbEpochs > 1 : ratio = self.currEpoch/self.nbEpochs snakeLen = int(self.width*ratio) voidLen = int(self.width - (self.width*ratio)) if snakeLen + voidLen < self.width : snakeLen = self.width - voidLen self.bar = "%s %s[%s:>%s] %.2f%% (%d/%d) runtime: %s, remaining: %s, avg: %s" %(wheelState, slabel, "~-" * snakeLen, " " * voidLen, ratio*100, self.currEpoch, self.nbEpochs, self.runtime_hr, self.remtime_hr, self.formatTime(self.avg)) if self.currEpoch == self.nbEpochs : self.close() else : w = self.width - len(self.miniSnake) v = self.currEpoch%(w+1) snake = "%s%s%s" %(" " * (v), self.miniSnake, " " * (w-v)) self.bar = "%s %s[%s] %s%% (%d/%s) runtime: %s, remaining: %s, avg: %s" %(wheelState, slabel, snake, '?', self.currEpoch, '?', self.runtime_hr, '?', self.formatTime(self.avg)) sys.stdout.write("\b" * (len(self.bar)+1)) sys.stdout.write(" " * (len(self.bar)+1)) sys.stdout.write("\b" * (len(self.bar)+1)) sys.stdout.write(self.bar) sys.stdout.flush() self.lastPrintTime = time.time() if log : self.log() def close(self) : """Closes the bar so your next print will be on another line""" self.update(forceRefresh = True) print '\n' if __name__ == "__main__" : p = ProgressBar(nbEpochs = 100000000000) for i in xrange(100000000000) : p.update() #time.sleep(3) p.close() ================================================ FILE: pyGeno/tools/SecureMmap.py ================================================ import mmap class SecureMmap: """In a normal mmap, modifying the string modifies the file. 
This is a mmap with write protection""" def __init__(self, filename, enableWrite = False) : self.enableWrite = enableWrite self.filename = filename self.name = filename f = open(filename, 'r+b') self.data = mmap.mmap(f.fileno(), 0) def forceSet(self, x1, v) : """Forces modification even if the mmap is write protected""" self.data[x1] = v def __getitem__(self, i): return self.data[i] def __setitem__(self, i, v) : if not self.enableWrite : raise IOError("Secure mmap is write protected") else : self.data[i] = v def __str__(self) : return "secure mmap, file: %s, writing enabled : %s" % (self.filename, str(self.enableWrite)) def __len__(self) : return len(self.data) ================================================ FILE: pyGeno/tools/SegmentTree.py ================================================ import random, copy, types def aux_insertTree(childTree, parentTree): """This is a private (You shouldn't have to call it) recursive function that inserts a child tree into a parent tree.""" if childTree.x1 != None and childTree.x2 != None : parentTree.insert(childTree.x1, childTree.x2, childTree.name, childTree.referedObject) for c in childTree.children: aux_insertTree(c, parentTree) def aux_moveTree(offset, tree): """This is a private recursive (You shouldn't have to call it) function that translates tree (and its children) by a given offset""" if tree.x1 != None and tree.x2 != None : tree.x1, tree.x2 = tree.x1+offset, tree.x2+offset for c in tree.children: aux_moveTree(offset, c) class SegmentTree : """ Optimised genome annotations. A segment tree is an arborescence of segments. First position is inclusive, second exclusive, respectively referred to as x1 and x2. A segment tree has the following properties : * The root has no x1 or x2 (both set to None). 
* Segments are arranged in ascending order * For two segments S1 and S2 : [S2.x1, S2.x2[ C [S1.x1, S1.x2[ <=> S2 is a child of S1 Here's an example of a tree : * Root : 0-15 * ---->Segment : 0-12 * ------->Segment : 1-6 * ---------->Segment : 2-3 * ---------->Segment : 4-5 * ------->Segment : 7-8 * ------->Segment : 9-10 * ---->Segment : 11-14 * ------->Segment : 12-14 * ---->Segment : 13-15 Each segment can have a 'name' and a 'referedObject'. Referred objects are objects stored within the tree for future use. These objects are always stored in lists. If referedObject is already a list it will be stored as is. """ def __init__(self, x1 = None, x2 = None, name = '', referedObject = [], father = None, level = 0) : if x1 > x2 : self.x1, self.x2 = x2, x1 else : self.x1, self.x2 = x1, x2 self.father = father self.level = level self.id = random.randint(0, 10**8) self.name = name self.children = [] self.referedObject = referedObject def __addChild(self, segmentTree, index = -1) : segmentTree.level = self.level + 1 if index < 0 : self.children.append(segmentTree) else : self.children.insert(index, segmentTree) def insert(self, x1, x2, name = '', referedObject = []) : """Inserts the segment in its right place and returns it. If there's already a segment S such that S.x1 == x1 and S.x2 == x2, 
S.name will be changed to 'S.name U name' and the referedObject will be appended to the already existing list""" if x1 > x2 : xx1, xx2 = x2, x1 else : xx1, xx2 = x1, x2 rt = None insertId = None childrenToRemove = [] for i in range(len(self.children)) : if self.children[i].x1 == xx1 and xx2 == self.children[i].x2 : self.children[i].name = self.children[i].name + ' U ' + name self.children[i].referedObject.append(referedObject) return self.children[i] if self.children[i].x1 <= xx1 and xx2 <= self.children[i].x2 : return self.children[i].insert(x1, x2, name, referedObject) elif xx1 <= self.children[i].x1 and self.children[i].x2 <= xx2 : if rt == None : if type(referedObject) is types.ListType : rt = SegmentTree(xx1, xx2, name, referedObject, self, self.level+1) else : rt = SegmentTree(xx1, xx2, name, [referedObject], self, self.level+1) insertId = i rt.__addChild(self.children[i]) self.children[i].father = rt childrenToRemove.append(self.children[i]) elif xx1 <= self.children[i].x1 and xx2 <= self.children[i].x2 : insertId = i break if rt != None : self.__addChild(rt, insertId) for c in childrenToRemove : self.children.remove(c) else : if type(referedObject) is types.ListType : rt = SegmentTree(xx1, xx2, name, referedObject, self, self.level+1) else : rt = SegmentTree(xx1, xx2, name, [referedObject], self, self.level+1) if insertId != None : self.__addChild(rt, insertId) else : self.__addChild(rt) return rt def insertTree(self, childTree): """inserts childTree in the right position (regions will be rearanged to fit the organisation of self)""" aux_insertTree(childTree, self) #~ def included_todo(self, x1, x2=None) : #~ "Returns all the segments where [x1, x2] is included""" #~ pass def intersect(self, x1, x2 = None) : """Returns a list of all segments intersected by [x1, x2]""" def condition(x1, x2, tree) : #print self.id, tree.x1, tree.x2, x1, x2 if (tree.x1 != None and tree.x2 != None) and (tree.x1 <= x1 and x1 < tree.x2 or tree.x1 <= x2 and x2 < tree.x2) : return 
True return False if x2 == None : xx1, xx2 = x1, x1 elif x1 > x2 : xx1, xx2 = x2, x1 else : xx1, xx2 = x1, x2 c1 = self.__dichotomicSearch(xx1) c2 = self.__dichotomicSearch(xx2) if c1 == -1 or c2 == -1 : return [] if xx1 < self.children[c1].x1 : c1 -= 1 inter = self.__radiateDown(x1, x2, c1, condition) if self.children[c1].id == self.children[c2].id : inter.extend(self.__radiateUp(x1, x2, c2+1, condition)) else : inter.extend(self.__radiateUp(x1, x2, c2, condition)) ret = [] for c in inter : ret.extend(c.intersect(x1, x2)) inter.extend(ret) return inter def __dichotomicSearch(self, x1) : r1 = 0 r2 = len(self.children)-1 pos = -1 while (r1 <= r2) : pos = (r1+r2)/2 val = self.children[pos].x1 if val == x1 : return pos elif x1 < val : r2 = pos -1 else : r1 = pos +1 return pos def __radiateDown(self, x1, x2, childId, condition) : "Radiates down: walks self.children downward until condition is no longer verifed or there's no childrens left " ret = [] i = childId while 0 <= i : if condition(x1, x2, self.children[i]) : ret.append(self.children[i]) else : break i -= 1 return ret def __radiateUp(self, x1, x2, childId, condition) : "Radiates uo: walks self.children upward until condition is no longer verifed or there's no childrens left " ret = [] i = childId while i < len(self.children): if condition(x1, x2, self.children[i]) : ret.append(self.children[i]) else : break i += 1 return ret def emptyChildren(self) : """Kills of children""" self.children = [] def removeGaps(self) : """Remove all gaps between regions""" for i in range(1, len(self.children)) : if self.children[i].x1 > self.children[i-1].x2: aux_moveTree(self.children[i-1].x2-self.children[i].x1, self.children[i]) def getX1(self) : """Returns the starting position of the tree""" if self.x1 != None : return self.x1 return self.children[0].x1 def getX2(self) : """Returns the ending position of the tree""" if self.x2 != None : return self.x2 return self.children[-1].x2 def getIndexedLength(self) : """Returns the total 
length of indexed regions""" if self.x1 != None and self.x2 != None: return self.x2 - self.x1 else : if len(self.children) == 0 : return 0 else : l = self.children[0].x2 - self.children[0].x1 for i in range(1, len(self.children)) : l += self.children[i].x2 - self.children[i].x1 - max(0, self.children[i-1].x2 - self.children[i].x1) return l def getFirstLevel(self) : """returns a list of couples (x1, x2) of all the first level indexed regions""" res = [] if len(self.children) > 0 : for c in self.children: res.append((c.x1, c.x2)) else : if self.x1 != None : res = [(self.x1, self.x2)] else : res = None return res def flatten(self) : """Flattens the tree. The tree becomes a tree of depth 1 where overlapping regions have been merged together""" if len(self.children) > 1 : children = self.children self.emptyChildren() children[0].emptyChildren() x1 = children[0].x1 x2 = children[0].x2 refObjs = [children[0].referedObject] name = children[0].name for i in range(1, len(children)) : children[i].emptyChildren() if children[i-1].x2 >= children[i].x1 : x2 = children[i].x2 refObjs.append(children[i].referedObject) name += " U " + children[i].name else : if len(refObjs) == 1 : refObjs = refObjs[0] self.insert(x1, x2, name, refObjs) x1 = children[i].x1 x2 = children[i].x2 refObjs = [children[i].referedObject] name = children[i].name if len(refObjs) == 1 : refObjs = refObjs[0] self.insert(x1, x2, name, refObjs) def move(self, newX1) : """Moves tree to a new starting position, updates x1s of children""" if self.x1 != None and self.x2 != None : offset = newX1-self.x1 aux_moveTree(offset, self) elif len(self.children) > 0 : offset = newX1-self.children[0].x1 aux_moveTree(offset, self) def __str__(self) : strRes = self.__str() offset = '' for i in range(self.level+1) : offset += '\t' for c in self.children : strRes += '\n'+offset+'-->'+str(c) return strRes def __str(self) : if self.x1 == None : if len(self.children) > 0 : return "Root: %d-%d, name: %s, id: %d, obj: %s" %(self.children[0].x1, 
self.children[-1].x2, self.name, self.id, repr(self.referedObject)) else : return "Root: EMPTY , name: %s, id: %d, obj: %s" %(self.name, self.id, repr(self.referedObject)) else : return "Segment: %d-%d, name: %s, id: %d, father id: %d, obj: %s" %(self.x1, self.x2, self.name, self.id, self.father.id, repr(self.referedObject)) def __len__(self) : "returns the size of the complete indexed region" if self.x1 != None and self.x2 != None : return self.x2-self.x1 else : return self.children[-1].x2 - self.children[0].x1 def __repr__(self): return 'Segment Tree, id:%s, father id:%s, (x1, x2): (%s, %s)' %(self.id, self.father.id, self.x1, self.x2) if __name__== "__main__" : s = SegmentTree() s.insert(5, 10, 'region 1') s.insert(8, 12, 'region 2') s.insert(5, 8, 'region 3') s.insert(34, 40, 'region 4') s.insert(35, 38, 'region 5') s.insert(36, 37, 'region 6', 'aaa') s.insert(36, 37, 'region 6', 'aaa2') print "Tree:" print s print "indexed length", s.getIndexedLength() print "removing gaps and adding region 7 : [13-37[" s.removeGaps() #s.insert(13, 37, 'region 7') print s print "indexed length", s.getIndexedLength() #print "intersections" #for c in [6, 10, 14, 1000] : # print c, s.intersect(c) print "Move" s.move(0) print s print "indexed length", s.getIndexedLength() ================================================ FILE: pyGeno/tools/SingletonManager.py ================================================ #This thing is wonderful objects = {} def add(obj, objName='') : if objName == '' : key = obj.name else : key = objName if key not in objects : objects[key] = obj return obj def contains(k) : return k in objects def get(objName) : try : return objects[objName] except : return None ================================================ FILE: pyGeno/tools/Stats.py ================================================ import numpy as np def kullback_leibler(p, q) : """Discrete Kullback-Leibler divergence D(P||Q)""" p = np.asarray(p, dtype=np.float) q = np.asarray(q, dtype=np.float) if p.shape 
!= q.shape : raise ValueError("p and q must be of the same dimensions") return np.sum(np.where(p > 0, np.log(p / q) * p, 0)) def squaredError_log10(p, q) : p = np.asarray(p, dtype=np.float) q = np.asarray(q, dtype=np.float) if p.shape != q.shape : raise ValueError("p and q must be of the same dimensions") return np.log10(sum((p-q)**2)) - np.log10(len(p)) def fisherExactTest(table) : """Fisher's exact test ---------- table: contingency table """ raise NotImplementedError ================================================ FILE: pyGeno/tools/UsefulFunctions.py ================================================ import string, os, copy, types class UnknownNucleotide(Exception) : def __init__(self, nuc) : self.msg = 'Unknown nucleotides %s' % str(nuc) def __str__(self) : return self.msg #This will probably be moved somewhere else in the future def saveResults(directoryName, fileName, strResults, log = '', args = ''): if not os.path.exists(directoryName): os.makedirs(directoryName) resPath = "%s/%s"%(directoryName, fileName) resFile = open(resPath, 'w') print "Saving results :\n\t%s..."%resPath resFile.write(strResults) resFile.close() if log != '' : errPath = "%s.err.txt"%(resPath) errFile = open(errPath, 'w') print "Saving log :\n\t%s..." %errPath errFile.write(log) errFile.close() if args != '' : paramPath = "%s.args.txt"%(resPath) paramFile = open(paramPath, 'w') print "Saving arguments :\n\t%s..." 
%paramPath paramFile.write(args) paramFile.close() return "%s/"%(directoryName) nucleotides = ['A', 'T', 'C', 'G'] polymorphicNucleotides = { 'R' : ['A','G'], 'Y' : ['C','T'], 'M': ['A','C'], 'K' : ['T','G'], 'W' : ['A','T'], 'S' : ['C','G'], 'B': ['C','G','T'], 'D' : ['A','G','T'], 'H' : ['A','C','T'], 'V' : ['A','C','G'], 'N': ['A','C','G','T'] } #<7iyed> #from Molecular Systems Biology 8; Article number 572; doi:10.1038/msb.2012.3 codonAffinity = {'CTT': 'low', 'ACC': 'high', 'ACA': 'low', 'ACG': 'high', 'ATC': 'high', 'AAC': 'high', 'ATA': 'low', 'AGG': 'high', 'CCT': 'low', 'ACT': 'low', 'AGC': 'high', 'AAG': 'high', 'AGA': 'low', 'CAT': 'low', 'AAT': 'low', 'ATT': 'low', 'CTG': 'high', 'CTA': 'low', 'CTC': 'high', 'CAC': 'high', 'AAA': 'low', 'CCG': 'high', 'AGT': 'low', 'CCA': 'low', 'CAA': 'low', 'CCC': 'high', 'TAT': 'low', 'GGT': 'low', 'TGT': 'low', 'CGA': 'low', 'CAG': 'high', 'TCT': 'low', 'GAT': 'low', 'CGG': 'high', 'TTT': 'low', 'TGC': 'high', 'GGG': 'high', 'TAG': 'high', 'GGA': 'low', 'TGG': 'high', 'GGC': 'high', 'TAC': 'high', 'TTC': 'high', 'TCG': 'high', 'TTA': 'low', 'TTG': 'high', 'TCC': 'high', 'GAA': 'low', 'TAA': 'high', 'GCA': 'low', 'GTA': 'low', 'GCC': 'high', 'GTC': 'high', 'GCG': 'high', 'GTG': 'high', 'GAG': 'high', 'GTT': 'low', 'GCT': 'low', 'TGA': 'high', 'GAC': 'high', 'CGT': 'low', 'TCA': 'low', 'ATG': 'high', 'CGC': 'high'} lowAffinityCodons = set(['CTT', 'ACA', 'AAA', 'ATA', 'CCT', 'AGA', 'CAT', 'AAT', 'ATT', 'CTA', 'ACT', 'CAA', 'AGT', 'CCA', 'TAT', 'GGT', 'TGT', 'CGA', 'TCT', 'GAT', 'TTT', 'GGA', 'TTA', 'CGT', 'GAA', 'TCA', 'GCA', 'GTA', 'GTT', 'GCT']) highAffinityCodons = set(['ACC', 'ATG', 'AAG', 'ACG', 'ATC', 'AAC', 'AGG', 'AGC', 'CTG', 'CTC', 'CAC', 'CCG', 'CAG', 'CCC', 'CGC', 'CGG', 'TGC', 'GGG', 'TAG', 'TGG', 'GGC', 'TAC', 'TTC', 'TCG', 'TTG', 'TCC', 'TAA', 'GCC', 'GTC', 'GCG', 'GTG', 'GAG', 'TGA', 'GAC']) # #Empiraclly calculated using genome GRCh37.74 and Ensembl annotations humanCodonCounts = {'CTT': 588990, 'ACC': 
760250, 'ACA': 671093, 'ACG': 248588, 'ATC': 819539, 'AAC': 777291, 'ATA': 326568, 'AGG': 520514, 'CCT': 784233, 'ACT': 581281, 'AGC': 826157, 'AAG': 1373474, 'AGA': 560614, 'CAT': 487348, 'AAT': 745200, 'ATT': 685951, 'CTG': 1579105, 'CTA': 311963, 'CTC': 772503, 'CAC': 618558, 'AAA': 1111269, 'CCG': 285345, 'AGT': 558788, 'CCA': 771391, 'CAA': 572531, 'CCC': 809928, 'TAT': 507376, 'GGT': 459267, 'TGT': 443487, 'CGA': 276584, 'CAG': 1483627, 'TCT': 675336, 'GAT': 982540, 'CGG': 477748, 'TTT': 721642, 'TGC': 495033, 'GGG': 661842, 'TAG': 28685, 'GGA': 731598, 'TGG': 535340, 'GGC': 877641, 'TAC': 588108, 'TTC': 774303, 'TCG': 185384, 'TTA': 348372, 'TTG': 563764, 'TCC': 729893, 'GAA': 1355256, 'TAA': 37503, 'GCA': 718158, 'GTA': 316640, 'GCC': 1120424, 'GTC': 576027, 'GCG': 289438, 'GTG': 1119171, 'GAG': 1685297, 'GTT': 486471, 'GCT': 806491, 'TGA': 82954, 'GAC': 1033108, 'CGT': 200762, 'TCA': 569093, 'ATG': 935789, 'CGC': 404889} humanCodonCount = 42433513 humanCodonRatios = {'CTT': 0.013880302580651288, 'ACC': 0.017916263496731935, 'ACA': 0.01581516477318293, 'ACG': 0.005858294127097137, 'ATC': 0.019313484603549088, 'AAC': 0.018317856454637634, 'ATA': 0.007695992551924702, 'AGG': 0.012266578070026868, 'CCT': 0.018481453562423644, 'ACT': 0.0136986301369863, 'AGC': 0.01946944623698726, 'AAG': 0.03236767127906662, 'AGA': 0.013211585851965638, 'CAT': 0.011484978865643295, 'AAT': 0.017561591000019253, 'ATT': 0.016165312544356155, 'CTG': 0.0372136287655467, 'CTA': 0.007351807049300867, 'CTC': 0.018205021111497414, 'CAC': 0.014577110313727737, 'AAA': 0.02618847513285077, 'CCG': 0.006724519838835875, 'AGT': 0.01316855382678309, 'CCA': 0.018178815409414725, 'CAA': 0.013492425197037068, 'CCC': 0.01908698909750885, 'TAT': 0.011956964298477951, 'GGT': 0.010823214189218791, 'TGT': 0.010451338308944631, 'CGA': 0.006518055669819277, 'CAG': 0.034963567593378375, 'TCT': 0.015915156494349172, 'GAT': 0.023154811622596506, 'CGG': 0.011258742588670422, 'TTT': 0.017006416602839365, 
'TGC': 0.01166608571861585, 'GGG': 0.015597153127529177, 'TAG': 0.0006759987088507143, 'GGA': 0.017241042475083315, 'TGG': 0.012615971720276849, 'GGC': 0.020682732537369696, 'TAC': 0.013859517122704406, 'TTC': 0.01824744041342983, 'TCG': 0.004368811038576985, 'TTA': 0.008209831695999339, 'TTG': 0.013285819630347362, 'TCC': 0.017200861969641778, 'GAA': 0.03193834081095289, 'TAA': 0.0008838061557618385, 'GCA': 0.01692431168732129, 'GTA': 0.007462026535488589, 'GCC': 0.026404224415734798, 'GTC': 0.013574812907901357, 'GCG': 0.006820976618174413, 'GTG': 0.026374695868334068, 'GAG': 0.039716179049328296, 'GTT': 0.011464311239090669, 'GCT': 0.01900599179709679, 'TGA': 0.0019549170958341345, 'GAC': 0.024346511211551115, 'CGT': 0.004731213274752906, 'TCA': 0.013411404330346158, 'ATG': 0.022053064520017467, 'CGC': 0.009541727077840574} synonymousCodonsFrequencies = {'A': {'GCA': 0.24472833804337418, 'GCC': 0.38180943946027124, 'GCT': 0.2748297757275403, 'GCG': 0.09863244676881429}, 'C': {'TGC': 0.5274613220815753, 'TGT': 0.47253867791842474}, 'E': {'GAG': 0.5542731864894314, 'GAA': 0.44572681351056864}, 'D': {'GAT': 0.48745614313610314, 'GAC': 0.5125438568638969}, 'G': {'GGT': 0.1682082284016543, 'GGG': 0.24240206742876733, 'GGA': 0.26795045906236126, 'GGC': 0.3214392451072171}, 'F': {'TTC': 0.51760124870901, 'TTT': 0.48239875129098997}, 'I': {'ATT': 0.3744155479793762, 'ATC': 0.4473324534485262, 'ATA': 0.17825199857209761}, 'H': {'CAC': 0.5593224017231121, 'CAT': 0.4406775982768879}, 'K': {'AAG': 0.552763002048904, 'AAA': 0.44723699795109595}, '*': {'TAG': 0.19233348084375965, 'TGA': 0.5562081774416328, 'TAA': 0.25145834171460757}, 'M': {'ATG': 1.0}, 'L': {'CTT': 0.14142445416797428, 'CTG': 0.37916443861342136, 'CTA': 0.07490652981477404, 'CTC': 0.18548840407837594, 'TTA': 0.08364882247135866, 'TTG': 0.13536735085409574}, 'N': {'AAT': 0.48946102144446174, 'AAC': 0.5105389785555383}, 'Q': {'CAA': 0.27844698705060605, 'CAG': 0.721553012949394}, 'P': {'CCT': 
0.29583684315158226, 'CCG': 0.1076409230535928, 'CCA': 0.2909924451987384, 'CCC': 0.30552978859608654}, 'S': {'TCT': 0.19052256484488883, 'AGC': 0.23307146458142142, 'TCG': 0.05229964811768493, 'AGT': 0.15764260007543762, 'TCC': 0.2059139249534016, 'TCA': 0.1605497974271656}, 'R': {'AGG': 0.21322832103906786, 'CGC': 0.16586259289315397, 'CGG': 0.19570924878057572, 'CGA': 0.11330250857089251, 'AGA': 0.2296552676219967, 'CGT': 0.0822420610943132}, 'T': {'ACC': 0.33621349966301256, 'ACA': 0.2967846446949689, 'ACG': 0.10993573358004469, 'ACT': 0.2570661220619738}, 'W': {'TGG': 1.0}, 'V': {'GTA': 0.12674172810489015, 'GTC': 0.230566755353321, 'GTT': 0.19472010868151218, 'GTG': 0.44797140786027667}, 'Y': {'TAT': 0.46315236005272553, 'TAC': 0.5368476399472745}} # Translation tables # Ref: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi translTable = dict() # Standard Code (NCBI transl_table=1) translTable['default'] = { 'TTT' : 'F', 'TCT' : 'S', 'TAT' : 'Y', 'TGT' : 'C', 'TTC' : 'F', 'TCC' : 'S', 'TAC' : 'Y', 'TGC' : 'C', 'TTA' : 'L', 'TCA' : 'S', 'TAA' : '*', 'TGA' : '*', 'TTG' : 'L', 'TCG' : 'S', 'TAG' : '*', 'TGG' : 'W', 'CTT' : 'L', 'CTC' : 'L', 'CTA' : 'L', 'CTG' : 'L', 'CCT' : 'P', 'CCC' : 'P', 'CCA' : 'P', 'CCG' : 'P', 'CAT' : 'H', 'CAC' : 'H', 'CAA' : 'Q', 'CAG' : 'Q', 'CGT' : 'R', 'CGC' : 'R', 'CGA' : 'R', 'CGG' : 'R', 'ATT' : 'I', 'ATC' : 'I', 'ATA' : 'I', 'ATG' : 'M', 'ACT' : 'T', 'ACC' : 'T', 'ACA' : 'T', 'ACG' : 'T', 'AAT' : 'N', 'AAC' : 'N', 'AAA' : 'K', 'AAG' : 'K', 'AGT' : 'S', 'AGC' : 'S', 'AGA' : 'R', 'AGG' : 'R', 'GTT' : 'V', 'GTC' : 'V', 'GTA' : 'V', 'GTG' : 'V', 'GCT' : 'A', 'GCC' : 'A', 'GCA' : 'A', 'GCG' : 'A', 'GAT' : 'D', 'GAC' : 'D', 'GAA' : 'E', 'GAG' : 'E', 'GGT' : 'G', 'GGC' : 'G', 'GGA' : 'G', 'GGG' : 'G', '!GA' : 'U' } codonTable = translTable['default'] # The Vertebrate Mitochondrial Code (NCBI transl_table=2) translTable['mt'] = { 'TTT' : 'F', 'TCT' : 'S', 'TAT' : 'Y', 'TGT' : 'C', 'TTC' : 'F', 'TCC' : 'S', 'TAC' : 'Y', 'TGC' : 
'C', 'TTA' : 'L', 'TCA' : 'S', 'TAA' : '*', 'TGA' : 'W', 'TTG' : 'L', 'TCG' : 'S', 'TAG' : '*', 'TGG' : 'W', 'CTT' : 'L', 'CTC' : 'L', 'CTA' : 'L', 'CTG' : 'L', 'CCT' : 'P', 'CCC' : 'P', 'CCA' : 'P', 'CCG' : 'P', 'CAT' : 'H', 'CAC' : 'H', 'CAA' : 'Q', 'CAG' : 'Q', 'CGT' : 'R', 'CGC' : 'R', 'CGA' : 'R', 'CGG' : 'R', 'ATT' : 'I', 'ATC' : 'I', 'ATA' : 'M', 'ATG' : 'M', 'ACT' : 'T', 'ACC' : 'T', 'ACA' : 'T', 'ACG' : 'T', 'AAT' : 'N', 'AAC' : 'N', 'AAA' : 'K', 'AAG' : 'K', 'AGT' : 'S', 'AGC' : 'S', 'AGA' : '*', 'AGG' : '*', 'GTT' : 'V', 'GTC' : 'V', 'GTA' : 'V', 'GTG' : 'V', 'GCT' : 'A', 'GCC' : 'A', 'GCA' : 'A', 'GCG' : 'A', 'GAT' : 'D', 'GAC' : 'D', 'GAA' : 'E', 'GAG' : 'E', 'GGT' : 'G', 'GGC' : 'G', 'GGA' : 'G', 'GGG' : 'G' } AATable = {'A': ['GCA', 'GCC', 'GCG', 'GCT'], 'C': ['TGT', 'TGC'], 'E': ['GAG', 'GAA'], 'D': ['GAT', 'GAC'], 'G': ['GGT', 'GGG', 'GGA', 'GGC'], 'F': ['TTT', 'TTC'], 'I': ['ATC', 'ATA', 'ATT'], 'H': ['CAT', 'CAC'], 'K': ['AAG', 'AAA'], '*': ['TAG', 'TGA', 'TAA'], 'M': ['ATG'], 'L': ['CTT', 'CTG', 'CTA', 'CTC', 'TTA', 'TTG'], 'N': ['AAC', 'AAT'], 'Q': ['CAA', 'CAG'], 'P': ['CCT', 'CCG', 'CCA', 'CCC'], 'S': ['AGC', 'AGT', 'TCT', 'TCG', 'TCC', 'TCA'], 'R': ['AGG', 'AGA', 'CGA', 'CGG', 'CGT', 'CGC'], 'T': ['ACA', 'ACG', 'ACT', 'ACC'], 'W': ['TGG'], 'V': ['GTA', 'GTC', 'GTG', 'GTT'], 'Y': ['TAT', 'TAC']} AAs = ['A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', '*', 'M', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y'] toFloat = lambda x: float(x) toInt = lambda x: int(x) floatToStr = lambda x:"%f"%(x) intToStr = lambda x:"%d"%(x) splitStr = lambda x: x.split(';') stripSplitStr = lambda x: x.strip().split(';') def findAll(haystack, needle) : """returns a list of all occurances of needle in haystack""" h = haystack res = [] f = haystack.find(needle) offset = 0 while (f >= 0) : #print h, needle, f, offset res.append(f+offset) offset += f+len(needle) h = h[f+len(needle):] f = h.find(needle) return res def complementTab(seq=[]): """returns a list of 
complementary sequences without reversing it""" complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'R': 'Y', 'Y': 'R', 'M': 'K', 'K': 'M', 'W': 'W', 'S': 'S', 'B': 'V', 'D': 'H', 'H': 'D', 'V': 'B', 'N': 'N', 'a': 't', 'c': 'g', 'g': 'c', 't': 'a', 'r': 'y', 'y': 'r', 'm': 'k', 'k': 'm', 'w': 'w', 's': 's', 'b': 'v', 'd': 'h', 'h': 'd', 'v': 'b', 'n': 'n'} seq_tmp = [] for bps in seq: if len(bps) == 0: #'' stands for a deletion and is kept as is seq_tmp.append('') elif len(bps) == 1: seq_tmp.append(complement[bps]) else: #Multi-base elements such as 'ACT' are insertions #An insertion must be reverse complemented (like seq) seq_tmp.append(reverseComplement(bps)) #Doesn't work in the second for when bps=='' #seq = [complement[bp] if bp != '' else '' for bps in seq for bp in bps] return seq_tmp def reverseComplementTab(seq): ''' Complements a DNA sequence, returning the reverse complement in a list to manage INDELs. ''' return complementTab(seq[::-1]) def reverseComplement(seq): ''' Complements a DNA sequence, returning the reverse complement. ''' return complement(seq)[::-1] def complement(seq) : """returns the complementary sequence without reversing it""" tb = string.maketrans("ACGTRYMKWSBDHVNacgtrymkwsbdhvn", "TGCAYRKMWSVHDBNtgcayrkmwsvhdbn") #just to be sure that seq isn't unicode return str(seq).translate(tb) def translateDNA_6Frames(sequence) : """returns the 6 translations of sequence.
One for each reading frame""" trans = ( translateDNA(sequence, 'f1'), translateDNA(sequence, 'f2'), translateDNA(sequence, 'f3'), translateDNA(sequence, 'r1'), translateDNA(sequence, 'r2'), translateDNA(sequence, 'r3'), ) return trans def translateDNA(sequence, frame = 'f1', translTable_id='default') : """Translates DNA code, frame : f1, f2, f3 (forward), r1, r2, r3 (reverse complement)""" protein = "" if frame == 'f1' : dna = sequence elif frame == 'f2': dna = sequence[1:] elif frame == 'f3' : dna = sequence[2:] elif frame == 'r1' : dna = reverseComplement(sequence) elif frame == 'r2' : dna = reverseComplement(sequence) dna = dna[1:] elif frame == 'r3' : dna = reverseComplement(sequence) dna = dna[2:] else : raise ValueError('unknown reading frame: %s, should be one of the following: f1, f2, f3, r1, r2, r3' % frame) for i in range(0, len(dna), 3) : codon = dna[i:i+3] # Check if variant messed with selenocysteine codon if '!' in codon and codon != '!GA': codon = codon.replace('!', 'T') if (len(codon) == 3) : try : # MC protein += translTable[translTable_id][codon] except KeyError : combinaisons = polymorphicCodonCombinaisons(list(codon)) translations = set() for ci in range(len(combinaisons)): translations.add(translTable[translTable_id][combinaisons[ci]]) protein += '/'.join(translations) return protein def getSequenceCombinaisons(polymorphipolymorphicDnaSeqSeq, pos = 0) : """Takes a DNA sequence with polymorphisms and returns all the possible sequences that it can yield""" if type(polymorphipolymorphicDnaSeqSeq) is not types.ListType : seq = list(polymorphipolymorphicDnaSeqSeq) else : seq = polymorphipolymorphicDnaSeqSeq if pos >= len(seq) : return [''.join(seq)] variants = [] if seq[pos] in polymorphicNucleotides : chars = decodePolymorphicNucleotide(seq[pos]) else : chars = seq[pos]#.split('/') for c in chars : rseq = copy.copy(seq) rseq[pos] = c variants.extend(getSequenceCombinaisons(rseq, pos + 1)) return variants def polymorphicCodonCombinaisons(codon) :
"""Returns all the possible amino acids encoded by codon""" return getSequenceCombinaisons(codon, 0) def encodePolymorphicNucleotide(polySeq) : """returns a single character encoding all nucletides of polySeq in a single character. PolySeq must have one of the following forms: ['A', 'T', 'G'], 'ATG', 'A/T/G'""" if type(polySeq) is types.StringType : if polySeq.find("/") < 0 : sseq = list(polySeq) else : sseq = polySeq.split('/') else : sseq = polySeq seq = [] for n in sseq : try : for n2 in polymorphicNucleotides[n] : seq.append(n2) except KeyError : seq.append(n) seq = set(seq) if len(seq) == 4: return 'N' elif len(seq) == 3 : if 'T' not in seq : return 'V' elif 'G' not in seq : return 'H' elif 'C' not in seq : return 'D' elif 'A' not in seq : return 'B' elif len(seq) == 2 : if 'A' in seq and 'G' in seq : return 'R' elif 'C' in seq and 'T' in seq : return 'Y' elif 'A' in seq and 'C' in seq : return 'M' elif 'T' in seq and 'G' in seq : return 'K' elif 'A' in seq and 'T' in seq : return 'W' elif 'C' in seq and 'G' in seq : return 'S' elif polySeq[0] in nucleotides : return polySeq[0] else : raise UnknownNucleotide(polySeq) def decodePolymorphicNucleotide(nuc) : """the opposite of encodePolymorphicNucleotide, from 'R' to ['A', 'G']""" if nuc in polymorphicNucleotides : return polymorphicNucleotides[nuc] if nuc in nucleotides : return nuc raise ValueError('nuc: %s, is not a valid nucleotide' % nuc) def decodePolymorphicNucleotide_str(nuc) : """same as decodePolymorphicNucleotide but returns a string instead of a list, from 'R' to 'A/G""" return '/'.join(decodePolymorphicNucleotide(nuc)) def getNucleotideCodon(sequence, x1) : """Returns the entire codon of the nucleotide at pos x1 in sequence, and the position of that nocleotide in the codon in a tuple""" if x1 < 0 or x1 >= len(sequence) : return None p = x1%3 if p == 0 : return (sequence[x1: x1+3], 0) elif p ==1 : return (sequence[x1-1: x1+2], 1) elif p == 2 : return (sequence[x1-2: x1+1], 2) def showDifferences(seq1, 
seq2) : """Returns a string highlighting differences between seq1 and seq2: * Matches by '-' * Differences : 'A|T' * Exceeded length : '#' """ ret = [] for i in range(max(len(seq1), len(seq2))) : if i >= len(seq1) : c1 = '#' else : c1 = seq1[i] if i >= len(seq2) : c2 = '#' else : c2 = seq2[i] if c1 != c2 : ret.append('%s|%s' % (c1, c2)) else : ret.append('-') return ''.join(ret) def highlightSubsequence(sequence, x1, x2, start=' [', stop = '] ') : """returns a sequence where the subsequence in [x1, x2[ is placed in between 'start' and 'stop'""" seq = list(sequence) seq[x1] = start + seq[x1] seq[x2-1] = seq[x2-1] + stop return ''.join(seq) # def highlightSubsequence(sequence, x1, x2, start=' [', stop = '] ') : # """returns a sequence where the subsequence in [x1, x2[ is placed # in between 'start' and 'stop'""" # seq = list(sequence) # ii = 0 # acc = True # for i in range(len(seq)) : # if ii == x1 : # seq[i] = start+seq[i] # if ii == x2-1 : # seq[i] = seq[i] + stop # if i < len(seq) - 1 : # if seq[i+1] == '/': # acc = False # else : # acc = True # if acc : # ii += 1 # return ''.join(seq) ================================================ FILE: pyGeno/tools/__init__.py ================================================ __all__ = ['BinarySequence', 'CSVTools', 'FastaTools', 'FastqTools', 'GTFTools', 'io', 'ProgressBar', 'SecureMmap', 'SegmentTree', 'SingletonManager', 'UsefulFunctions'] ================================================ FILE: pyGeno/tools/io.py ================================================ import sys def printf(*s) : 'print + sys.stdout.flush()' for e in s[:-1] : print e, print s[-1] sys.stdout.flush() def enterConfirm_prompt(enterMsg) : stopi = False while not stopi : print "====\n At any time you can quit by entering 'quit'\n====" vali = raw_input(enterMsg) if vali.lower() == 'quit' : vali = None stopi = True else : print "You've entered:\n\t%s" % vali valj = confirm_prompt("") if valj == 'yes' : stopi = True if valj == 'quit'
: vali = None stopi = True return vali def confirm_prompt(preMsg) : while True : val = raw_input('%splease confirm ("yes", "no", "quit"): ' % preMsg) if val.lower() == 'yes' or val.lower() == 'no' or val.lower() == 'quit': return val.lower() ================================================ FILE: pyGeno/tools/parsers/CSVTools.py ================================================ import os, types, collections class EmptyLine(Exception) : """Raised when an empty or comment line is found (dealt with internally)""" def __init__(self, lineNumber) : message = "Empty line: #%d" % lineNumber Exception.__init__(self, message) self.message = message def __str__(self) : return self.message def removeDuplicates(inFileName, outFileName) : """removes duplicated lines from the 'inFileName' CSV file, the results are written in 'outFileName'""" f = open(inFileName) legend = f.readline() data = '' h = {} h[legend] = 0 lines = f.readlines() for l in lines : if not h.has_key(l) : h[l] = 0 data += l f.flush() f.close() f = open(outFileName, 'w') f.write(legend+data) f.flush() f.close() def catCSVs(folder, ouputFileName, removeDups = False) : """Concatenates all csv in 'folder' and writes the results in 'ouputFileName'. May not work on non Unix systems""" strCmd = r"""cat %s/*.csv > %s""" %(folder, ouputFileName) os.system(strCmd) if removeDups : removeDuplicates(ouputFileName, ouputFileName) def joinCSVs(csvFilePaths, column, ouputFileName, separator = ',') : """csvFilePaths should be an iterable. Joins all CSVs according to the values in the column 'column'.
Write the results in a new file 'ouputFileName' """ res = '' legend = [] csvs = [] for f in csvFilePaths : c = CSVFile() c.parse(f) csvs.append(c) legend.append(separator.join(c.legend.keys())) legend = separator.join(legend) lines = [] for i in range(len(csvs[0])) : val = csvs[0].get(i, column) line = separator.join(csvs[0][i]) for c in csvs[1:] : for j in range(len(c)) : if val == c.get(j, column) : line += separator + separator.join(c[j]) lines.append( line ) res = legend + '\n' + '\n'.join(lines) f = open(ouputFileName, 'w') f.write(res) f.flush() f.close() return res class CSVEntry(object) : """A single entry in a CSV file""" def __init__(self, csvFile, lineNumber = None) : self.csvFile = csvFile self.data = [] if lineNumber != None : self.lineNumber = lineNumber tmpL = csvFile.lines[lineNumber].replace('\r', '\n').replace('\n', '') if len(tmpL) == 0 or tmpL[0] in ["#", "\r", "\n", csvFile.lineSeparator] : raise EmptyLine(lineNumber) tmpData = tmpL.split(csvFile.separator) # tmpDatum = [] i = 0 while i < len(tmpData) : # for d in tmpData : d = tmpData[i] sd = d.strip() # print sd, tmpData, i if len(sd) > 0 and sd[0] == csvFile.stringSeparator : more = [] for i in xrange(i, len(tmpData)) : more.append(tmpData[i]) i+=1 if more[-1][-1] == csvFile.stringSeparator : break self.data.append(",".join(more)[1:-1]) # if len(tmpDatum) > 0 or (len(sd) > 0 and sd[0] == csvFile.stringSeparator) : # tmpDatum.append(sd) # if len(sd) > 0 and sd[-1] == csvFile.stringSeparator : # self.data.append(csvFile.separator.join(tmpDatum)) # tmpDatum = [] else : self.data.append(sd) i += 1 else : self.lineNumber = len(csvFile) for i in range(len(self.csvFile.legend)) : self.data.append('') def commit(self) : """commits the line so it is added to a file stream""" self.csvFile.commitLine(self) def __iter__(self) : self.currentField = -1 return self def next(self) : self.currentField += 1 if self.currentField >= len(self.csvFile.legend) : raise StopIteration k = 
self.csvFile.legend.keys()[self.currentField] v = self.data[self.currentField] return k, v def __getitem__(self, key) : """Returns the value of field 'key'""" try : indice = self.csvFile.legend[key.lower()] except KeyError : raise KeyError("CSV File has no column: '%s'" % key) return self.data[indice] def __setitem__(self, key, value) : """Sets the value of field 'key' to 'value' """ try : field = self.csvFile.legend[key.lower()] except KeyError : self.csvFile.addField(key) field = self.csvFile.legend[key.lower()] self.data.append(str(value)) else : try: self.data[field] = str(value) except Exception as e: for i in xrange(field-len(self.data)+1) : self.data.append("") self.data[field] = str(value) def toStr(self) : return self.csvFile.separator.join(self.data) def __repr__(self) : r = {} for k, v in self.csvFile.legend.iteritems() : r[k] = self.data[v] return "<line %d: %s>" %(self.lineNumber, str(r)) def __str__(self) : return repr(self) class CSVFile(object) : """ Represents a whole CSV file:: #reading f = CSVFile() f.parse('hop.csv') for line in f : print line['ref'] #writing, legend can either be a list or a dict {field : column number} f = CSVFile(legend = ['name', 'email']) l = f.newLine() l['name'] = 'toto' l['email'] = "hop@gmail.com" for field, value in l : print field, value f.save('myCSV.csv') """ def __init__(self, legend = [], separator = ',', lineSeparator = '\n') : self.legend = collections.OrderedDict() for i in range(len(legend)) : if legend[i].lower() in self.legend : raise ValueError("%s is already in the legend" % legend[i].lower()) self.legend[legend[i].lower()] = i self.strLegend = separator.join(legend) self.filename = "" self.lines = [] self.separator = separator self.lineSeparator = lineSeparator self.currentPos = -1 self.streamFile = None self.writeRate = None self.streamBuffer = None self.keepInMemory = True def addField(self, field) : """adds a field to the legend""" if field.lower() in self.legend : raise ValueError("%s is already in the legend" %
field.lower()) self.legend[field.lower()] = len(self.legend) if len(self.strLegend) > 0 : self.strLegend += self.separator + field else : self.strLegend += field def parse(self, filePath, skipLines=0, separator = ',', stringSeparator = '"', lineSeparator = '\n') : """Loads a CSV file""" self.filename = filePath f = open(filePath) if lineSeparator == '\n' : lines = f.readlines() else : lines = f.read().split(lineSeparator) f.flush() f.close() lines = lines[skipLines:] self.lines = [] self.comments = [] for l in lines : # print l if len(l) != 0 and l[0] != "#" : self.lines.append(l) elif len(l) != 0 and l[0] == "#" : self.comments.append(l) self.separator = separator self.lineSeparator = lineSeparator self.stringSeparator = stringSeparator self.legend = collections.OrderedDict() i = 0 for c in self.lines[0].lower().replace(stringSeparator, '').split(separator) : legendElement = c.strip() if legendElement not in self.legend : self.legend[legendElement] = i i+=1 self.strLegend = self.lines[0].replace('\r', '\n').replace('\n', '') self.lines = self.lines[1:] # sk = skipLines+1 # for l in self.lines : # if l[0] == "#" : # sk += 1 # else : # break # self.header = self.lines[:sk] # self.lines = self.lines[sk:] def streamToFile(self, filename, keepInMemory = False, writeRate = 1) : """Starts a stream to a file. Every line must be committed (l.commit()) to be appended to the file. If keepInMemory is set to True, the parser will keep a version of the whole CSV in memory, writeRate is the number of lines that must be committed before an automatic save is triggered. """ if len(self.legend) < 1 : raise ValueError("There's no legend defined") try : os.remove(filename) except : pass self.streamFile = open(filename, "a") self.writeRate = writeRate self.streamBuffer = [] self.keepInMemory = keepInMemory self.streamFile.write(self.strLegend + "\n") def commitLine(self, line) : """Commits a line making it ready to be streamed to a file and saves the current buffer if needed.
If no stream is active, raises a ValueError""" if self.streamBuffer is None : raise ValueError("Commit lines is only for when you are streaming to a file") self.streamBuffer.append(line) if len(self.streamBuffer) % self.writeRate == 0 : for i in xrange(len(self.streamBuffer)) : self.streamBuffer[i] = str(self.streamBuffer[i]) self.streamFile.write("%s\n" % ('\n'.join(self.streamBuffer))) self.streamFile.flush() self.streamBuffer = [] def closeStreamToFile(self) : """Appends the remaining committed lines and closes the stream. If no stream is active, raises a ValueError""" if self.streamBuffer is None : raise ValueError("Commit lines is only for when you are streaming to a file") for i in xrange(len(self.streamBuffer)) : self.streamBuffer[i] = str(self.streamBuffer[i]) self.streamFile.write('\n'.join(self.streamBuffer)) self.streamFile.close() self.streamFile = None self.writeRate = None self.streamBuffer = None self.keepInMemory = True def _developLine(self, line) : stop = False while not stop : try : if self.lines[line].__class__ is not CSVEntry : devL = CSVEntry(self, line) stop = True else : devL = self.lines[line] stop = True except EmptyLine as e : del(self.lines[line]) self.lines[line] = devL def get(self, line, key) : self._developLine(line) return self.lines[line][key] def set(self, line, key, val) : self._developLine(line) self.lines[line][key] = val def newLine(self) : """Appends an empty line at the end of the CSV and returns it""" l = CSVEntry(self) if self.keepInMemory : self.lines.append(l) return l def insertLine(self, i) : """Inserts an empty line at position i and returns it""" self.lines.insert(i, CSVEntry(self)) return self.lines[i] def save(self, filePath) : """saves the CSV to a file""" self.filename = filePath f = open(filePath, 'w') f.write(self.toStr()) f.flush() f.close() def toStr(self) : """returns a string version of the CSV""" s = [self.strLegend] for l in self.lines : s.append(l.toStr()) return self.lineSeparator.join(s) def
__iter__(self) : self.currentPos = -1 return self def next(self) : self.currentPos += 1 if self.currentPos >= len(self) : raise StopIteration self._developLine(self.currentPos) return self.lines[self.currentPos] def __getitem__(self, line) : try : if self.lines[line].__class__ is not CSVEntry : self._developLine(line) except AttributeError : start = line.start if start is None : start = 0 for l in xrange(len(self.lines[line])) : self._developLine(l + start) # start, stop = line.start, line.stop # if start is None : # start = 0 # if stop is None : # stop = 0 # for l in xrange(start, stop) : # self._developLine(l) return self.lines[line] def __len__(self) : return len(self.lines) ================================================ FILE: pyGeno/tools/parsers/CasavaTools.py ================================================ import gzip import pyGeno.tools.UsefulFunctions as uf class SNPsTxtEntry(object) : """A single entry in the casavas snps.txt file""" def __init__(self, lineNumber, snpsTxtFile) : self.snpsTxtFile = snpsTxtFile self.lineNumber = lineNumber self.values = {} sl = str(snpsTxtFile.data[lineNumber]).replace('\t\t', '\t').split('\t') self.values['chromosomeNumber'] = sl[0].upper().replace('CHR', '') #first column: chro, second first of range (identical to second column) self.values['start'] = int(sl[2]) self.values['end'] = int(sl[2])+1 self.values['bcalls_used'] = sl[3] self.values['bcalls_filt'] = sl[4] self.values['ref'] = sl[5] self.values['QSNP'] = int(sl[6]) self.values['alleles'] = uf.encodePolymorphicNucleotide(sl[7]) #max_gt self.values['Qmax_gt'] = int(sl[8]) self.values['max_gt_poly_site'] = sl[9] self.values['Qmax_gt_poly_site'] = int(sl[10]) self.values['A_used'] = int(sl[11]) self.values['C_used'] = int(sl[12]) self.values['G_used'] = int(sl[13]) self.values['T_used'] = int(sl[14]) def __getitem__(self, fieldName): """Returns the value of field 'fieldName'""" return self.values[fieldName] def __setitem__(self, fieldName, value) : """Sets the value 
of field 'fieldName' to 'value' """ self.values[fieldName] = value def __str__(self): return str(self.values) class SNPsTxtFile(object) : """ Represents a whole casava's snps.txt file:: f = SNPsTxtFile('snps.txt') for line in f : print line['ref'] """ def __init__(self, fil, gziped = False) : self.reset() if not gziped : f = open(fil) else : f = gzip.open(fil) for l in f : if l[0] != '#' : self.data.append(l) f.close() def reset(self) : """Frees the file""" self.data = [] self.currentPos = 0 def __iter__(self) : self.currentPos = 0 return self def next(self) : if self.currentPos >= len(self) : raise StopIteration() v = self[self.currentPos] self.currentPos += 1 return v def __getitem__(self, i) : if self.data[i].__class__ is not SNPsTxtEntry : self.data[i] = SNPsTxtEntry(i, self) return self.data[i] def __len__(self) : return len(self.data) ================================================ FILE: pyGeno/tools/parsers/FastaTools.py ================================================ import os class FastaFile(object) : """ Represents a whole Fasta file:: #reading f = FastaFile() f.parseFile('hop.fasta') for line in f : print line #writing f = FastaFile() f.add(">prot1", "MLPADEV") f.save('myFasta.fasta') """ def __init__(self, fil = None) : self.reset() if fil != None : self.parseFile(fil) def reset(self) : """Erases everything""" self.data = [] self.currentPos = 0 def parseStr(self, st) : """Parses a string""" self.data = st.split('>')[1:] def parseFile(self, fil) : """Opens a file and parses it""" f = open(fil) self.parseStr(f.read()) f.close() def __splitLine(self, li) : if len(self.data[li]) != 2 : self.data[li] = self.data[li].replace('\r', '\n') self.data[li] = self.data[li].replace('\n\n', '\n') l = self.data[li].split('\n') header = '>'+l[0] data = ''.join(l[1:]) self.data[li] = (header, data) def get(self, i) : """returns the ith entry""" self.__splitLine(i) return self.data[i] def add(self, header, data) : """appends a new entry to the file""" if header[0] != 
'>' : self.data.append(('>'+header, data)) else : self.data.append((header, data)) def save(self, filePath) : """saves the file into filePath""" f = open(filePath, 'w') f.write(self.toStr()) f.close() def toStr(self) : """returns a string version of self""" st = "" for d in self.data : st += "%s\n%s\n" % (d[0], d[1]) return st def __iter__(self) : self.currentPos = 0 return self def next(self) : #self to call getitem, and split the line if necessary i = self.currentPos +1 #print i-1, self.currentPos if i > len(self) : raise StopIteration() self.currentPos = i return self[self.currentPos-1] def __getitem__(self, i) : """returns the ith entry""" return self.get(i) def __setitem__(self, i, v) : """sets the value of the ith entry""" if len(v) != 2: raise TypeError("v must have a len of 2 : (header, data)") self.data[i] = v def __len__(self) : """returns the number of entries""" return len(self.data) ================================================ FILE: pyGeno/tools/parsers/FastqTools.py ================================================ import os class FastqEntry(object) : """A single entry in a Fastq file""" def __init__(self, ident = "", seq = "", plus = "", qual = "") : self.values = {} self.values['identifier'] = ident self.values['sequence'] = seq self.values['+'] = plus self.values['qualities'] = qual def __getitem__(self, i): return self.values[i] def __setitem__(self, i, v) : self.values[i] = v def __str__(self): return "%s\n%s\n%s\n%s" %(self.values['identifier'], self.values['sequence'], self.values['+'], self.values['qualities']) class FastqFile(object) : """ Represents a whole Fastq file:: #reading f = FastqFile() f.parseFile('hop.fastq') for entry in f : print entry['sequence'] #writing f = FastqFile() f.add(FastqEntry('@read1', 'ACGT', '+', 'IIII')) f.save('myFastq.fastq') """ def __init__(self, fil = None) : self.reset() if fil != None : self.parseFile(fil)

    def reset(self) :
        """Frees the file"""
        self.data = []
        self.currentPos = 0

    def parseStr(self, st) :
        """Parses a string"""
        lines = st.replace('\r', '\n')
        lines = lines.replace('\n\n', '\n')
        lines = lines.strip('\n').split('\n')
        if lines == [''] :
            lines = []
        #one item per entry: the 4 raw lines of the entry, turned into a FastqEntry on first access
        self.data = [lines[i:i+4] for i in range(0, len(lines), 4)]

    def parseFile(self, fil) :
        """Parses a file on disk"""
        f = open(fil)
        self.parseStr(f.read())
        f.close()

    def __splitEntry(self, li) :
        if self.data[li].__class__ is not FastqEntry :
            self.data[li] = FastqEntry(*self.data[li])

    def get(self, li) :
        """returns the ith entry"""
        self.__splitEntry(li)
        return self.data[li]

    def newEntry(self, ident = "", seq = "", plus = "", qual = "") :
        """Appends a new entry at the end of the file and returns it"""
        e = FastqEntry(ident, seq, plus, qual)
        self.data.append(e)
        return e

    def add(self, fastqEntry) :
        """appends an entry to self"""
        self.data.append(fastqEntry)

    def save(self, filePath) :
        """saves the file into filePath"""
        f = open(filePath, 'w')
        f.write(self.toStr())
        f.close()

    def toStr(self) :
        """returns a string version of self"""
        st = ""
        for i in range(len(self)) :
            st += "%s\n" % str(self.get(i))
        return st

    def __iter__(self) :
        self.currentPos = 0
        return self

    def next(self) :
        #uses self[...] so that __getitem__ is called and the entry is built if necessary
        i = self.currentPos + 1
        if i > len(self) :
            raise StopIteration()
        self.currentPos = i
        return self[self.currentPos-1]

    def __getitem__(self, i) :
        """returns the ith entry"""
        return self.get(i)

    def __setitem__(self, i, v) :
        """sets the ith entry"""
        if v.__class__ is not FastqEntry :
            raise TypeError("v must be a FastqEntry")
        self.data[i] = v

    def __len__(self) :
        """returns the number of entries"""
        return len(self.data)

================================================
FILE: pyGeno/tools/parsers/GTFTools.py
================================================
import gzip, os

class GTFEntry(object) :
    def __init__(self, gtfFile, lineNumber) :
        """A single entry in a GTF file"""
        self.lineNumber = lineNumber
        self.gtfFile = gtfFile
        self.data = gtfFile.lines[lineNumber][:-2].split('\t') #-2 remove ';\n'
        proto_atts = self.data[gtfFile.legend['attributes']].strip().split('; ')
        atts = {}
        for a in proto_atts :
            sa = a.split(' ')
            atts[sa[0]] = sa[1].replace('"', '')
        self.data[gtfFile.legend['attributes']] = atts

    def __getitem__(self, k) :
        try :
            return self.data[self.gtfFile.legend[k]]
        except KeyError :
            try :
                return self.data[self.gtfFile.legend['attributes']][k]
            except KeyError :
                #return None
                raise KeyError("Line %d does not have an element %s.\nline:%s" %(self.lineNumber, k, self.gtfFile.lines[self.lineNumber]))

    def __repr__(self) :
        return "<GTFEntry line: %d>" % self.lineNumber

    def __str__(self) :
        return "<GTFEntry line: %d, %s>" % (self.lineNumber, str(self.data))

class GTFFile(object) :
    """This is a simple GTF2.2 (Revised Ensembl GTF) parser, see http://mblab.wustl.edu/GTF22.html for more information"""
    def __init__(self, filename, gziped = False) :
        self.filename = filename
        self.gziped = gziped
        self.legend = {'seqname' : 0, 'source' : 1, 'feature' : 2, 'start' : 3, 'end' : 4, 'score' : 5, 'strand' : 6, 'frame' : 7, 'attributes' : 8}

        if gziped :
            f = gzip.open(filename)
        else :
            f = open(filename)

        self.lines = []
        for l in f :
            if l != '' and l[0] != '#' :
                self.lines.append(l)
        f.close()
        self.currentPos = -1

    def get(self, line, elmt) :
        """returns the value of the field 'elmt' of line 'line'"""
        return self[line][elmt]

    def __iter__(self) :
        self.currentPos = -1
        return self

    def next(self) :
        self.currentPos += 1
        try :
            #goes through __getitem__ so that entries that were already built are reused
            return self[self.currentPos]
        except IndexError:
            raise StopIteration

    def __getitem__(self, i) :
        """returns the ith entry"""
        if self.lines[i].__class__ is not GTFEntry :
            self.lines[i] = GTFEntry(self, i)
        return self.lines[i]

    def __repr__(self) :
        return "<GTFFile: %s>" % (os.path.basename(self.filename))

    def __str__(self) :
        return "<GTFFile: %s, gziped: %s, len: %d>" % (os.path.basename(self.filename), self.gziped, len(self))

    def __len__(self) :
        return len(self.lines)

================================================
FILE: pyGeno/tools/parsers/VCFTools.py
================================================
import os, types, gzip

class VCFEntry(object) :
    """A single entry in a VCF file"""

    def __init__(self, vcfFile, line, lineNumber) :
        #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
        #20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
        self.vcfFile = vcfFile
        self.lineNumber = lineNumber
        self.data = {}

        tmpL = line.replace('\r', '\n').replace('\n', '')
        tmpData = str(tmpL).split('\t')
        for i in range(6) :
            self.data[vcfFile.dnegel[i]] = tmpData[i]
        self.data['POS'] = int(self.data['POS'])

        filters = tmpData[6].split(';')
        if len(filters) == 1 :
            self.data['FILTER'] = filters
        else :
            self.data['FILTER'] = {}
            for filter_value in filters :
                #standard VCF filters are bare names, 'name=value' pairs are also accepted
                if '=' in filter_value :
                    filt, value = filter_value.split('=', 1)
                else :
                    filt, value = filter_value, True
                self.data['FILTER'][filt] = value

        self.data['INFO'] = {}
        for s in tmpData[7].split(';') :
            info_value = s.split('=')
            try :
                typ = self.vcfFile.meta['INFO'][info_value[0]]['Type']
            except KeyError :
                typ = None

            if len(info_value) == 1 :
                if typ == 'Flag' or typ is None :
                    self.data['INFO'][info_value[0]] = True
                else :
                    raise ValueError('%s is not a flag and has no value' % info_value[0])
            else :
                if typ == 'Integer' :
                    self.data['INFO'][info_value[0]] = int(info_value[1])
                elif typ == 'Float' :
                    self.data['INFO'][info_value[0]] = float(info_value[1])
                else :
                    self.data['INFO'][info_value[0]] = info_value[1]

    def __getitem__(self, key) :
        "With the VCF file format some fields are not present in all elements, therefore this function never raises an exception but returns None, or False if the field is defined as a Flag in Meta"
        try :
            return self.data[key]
        except KeyError:
            try :
                return self.data['INFO'][key]
            except KeyError:
                try :
                    if self.vcfFile.meta['INFO'][key]['Type'] == 'Flag' :
                        self.data['INFO'][key] = False
                        return self.data['INFO'][key]
                    else :
                        return None
                except KeyError:
                    return None

    def __repr__(self) :
        return "<VCFEntry line: %d>" % self.lineNumber

    def __str__(self) :
        return " that surround the string
                    idKey = svalues[0].split('=')[1]
                    self.meta[key][idKey] = {}
                    i = 1
                    for v in svalues[1:] :
                        sv = v.split("=")
                        field = sv[0]
                        value = sv[1]
                        if field.lower() == 'description' :
                            self.meta[key][idKey][field] = ','.join(svalues[i:])[len(field)+2:-1]
                            break
                        self.meta[key][idKey][field] = value
                        i += 1
            elif l[:6] == '#CHROM': #we are in legend
                sl = l.split('\t')
                for i in range(len(sl)) :
                    self.legend[sl[i]] = i
                    self.dnegel[i] = sl[i]
                break

            lineId += 1

        if not stream :
            self.lines = self.f.readlines()
            self.f.close()

    def close(self) :
        """closes the file"""
        self.f.close()

    def _developLine(self, lineNumber) :
        if self.lines[lineNumber].__class__ is not VCFEntry :
            self.lines[lineNumber] = VCFEntry(self, self.lines[lineNumber], lineNumber)

    def __iter__(self) :
        self.currentPos = -1
        return self

    def next(self) :
        self.currentPos += 1
        if not self.stream :
            try :
                #currentPos starts at -1, so the first call returns entry 0
                return self[self.currentPos]
            except IndexError:
                raise StopIteration
        else :
            #skip any header line found in the middle of the file
            midfile_header = True
            while midfile_header:
                line = self.f.readline()
                if not line :
                    raise StopIteration
                if not line.startswith('#'):
                    midfile_header = False
            return VCFEntry(self, line, self.currentPos)

    def __getitem__(self, line) :
        """returns the entry at position 'line'"""
        if self.stream :
            raise KeyError("When the file is opened as a stream it's impossible to ask for a specific item")
        if self.lines[line].__class__ is not VCFEntry :
            self._developLine(line)
        return self.lines[line]

    def __len__(self) :
        """returns the number of entries"""
        return len(self.lines)

    def __repr__(self) :
        return "<VCFFile: %s>" % (os.path.basename(self.filename))

    def __str__(self) :
        if self.stream :
            return "<VCFFile: %s, gziped: %s, stream: %s>" % (os.path.basename(self.filename), self.gziped, self.stream)
        else :
            return "<VCFFile: %s, gziped: %s, stream: %s, len: %d>" % (os.path.basename(self.filename), self.gziped, self.stream, len(self))

if __name__ == '__main__' :
    import time
    from pyGeno.tools.ProgressBar import ProgressBar

    #v = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/00-All.vcf.gz', gziped = True, stream = True)
    v = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/test/00-All-test.vcf.gz', gziped = True, stream = True)
    startTime = time.time()
    i = 0
    pBar = ProgressBar()
    for f in v :
        #print f
        pBar.update('%s' % i)
        if i > 1000000 :
            break
        i += 1
    pBar.close()

    #v = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/00-All.vcf', gziped = False, stream = True)
    v = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/test/00-All-test.vcf', gziped = False, stream = False)
    startTime = time.time()
    i = 0
    pBar = ProgressBar()
    for f in v :
        pBar.update('%s' % i)
        if i > 1000000 :
            break
        i += 1
        #print f
    pBar.close()
    #print v.lines

================================================
FILE: pyGeno/tools/parsers/__init__.py
================================================
#Parsers for different types of files
__all__ = ['CSVTools', 'FastaTools', 'FastqTools', 'GTFTools', 'CasavaTools', 'VCFTools']

================================================
FILE: setup.py
================================================
from setuptools import setup, find_packages
from codecs import open
from os import path

here = path.abspath(path.dirname(__file__))

# Get the long description from the relevant file
with open(path.join(here, 'DESCRIPTION.rst'), encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='pyGeno',

    version='1.3.2',

    description='A python package for Personalized Genomics and Proteomics',
    long_description=long_description,

    url='http://pyGeno.iric.ca',

    author='Tariq Daouda',
    author_email='tariq.daouda@umontreal.ca',

    test_suite="pyGeno.tests",

    license='ApacheV2.0',

    # See https://pypi.python.org/pypi?%3Aaction=list_classifiers
    classifiers=[
        # How mature is this project? Common values are
        #   3 - Alpha
        #   4 - Beta
        #   5 - Production/Stable
        'Development Status :: 5 - Production/Stable',

        'Intended Audience :: Science/Research',
        'Intended Audience :: Healthcare Industry',
        'Topic :: Scientific/Engineering :: Bio-Informatics',
        'Topic :: Software Development :: Libraries',
        'Topic :: Software Development :: Libraries :: Application Frameworks',

        'License :: OSI Approved :: Apache Software License',

        'Programming Language :: Python :: 2.7',
    ],

    keywords='proteogenomics genomics proteomics annotations medicine research personalized gene sequence protein',

    packages=find_packages(exclude=[]),

    # List run-time dependencies here. These will be installed by pip when your
    # project is installed. For an analysis of "install_requires" vs pip's
    # requirements files see:
    # https://packaging.python.org/en/latest/technical.html#install-requires-vs-requirements-files
    install_requires=['rabaDB >= 1.0.5'],

    # If there are data files included in your packages that need to be
    # installed, specify them here. If using Python 2.6 or less, then these
    # have to be included in MANIFEST.in as well.
    package_data={
        '': ['*.txt', '*.rst', '*.tar.gz'],
    },

    # Although 'package_data' is the preferred approach, in some cases you may
    # need to place data files outside of your packages.
    # see http://docs.python.org/3.4/distutils/setupscript.html#installing-additional-files
    # In this case, 'data_file' will be installed into '/my_data'
    #~ data_files=[('my_data', ['data/data_file'])],

    # To provide executable scripts, use entry points in preference to the
    # "scripts" keyword. Entry points provide cross-platform support and allow
    # pip to create the appropriate form of executable for the target platform.
    entry_points={
        'console_scripts': [
            'sample=sample:main',
        ],
    },
)