Showing preview only (261K chars total). Download the full file or copy to clipboard to get everything.
Repository: tariqdaouda/pyGeno
Branch: bloody
Commit: 6311c9cd9444
Files: 66
Total size: 243.4 KB
Directory structure:
gitextract_0q2fbl_h/
├── .gitignore
├── .travis.yml
├── CHANGELOG.rst
├── DESCRIPTION.rst
├── LICENSE
├── MANIFEST.in
├── README.rst
├── pyGeno/
│ ├── Chromosome.py
│ ├── Exon.py
│ ├── Gene.py
│ ├── Genome.py
│ ├── Protein.py
│ ├── SNP.py
│ ├── SNPFiltering.py
│ ├── Transcript.py
│ ├── __init__.py
│ ├── bootstrap.py
│ ├── bootstrap_data/
│ │ ├── SNPs/
│ │ │ ├── Human_agnostic.dummySRY_indels/
│ │ │ │ ├── manifest.ini
│ │ │ │ └── snps.txt
│ │ │ └── __init__.py
│ │ ├── __init__.py
│ │ └── genomes/
│ │ └── __init__.py
│ ├── configuration.py
│ ├── doc/
│ │ ├── Makefile
│ │ ├── make.bat
│ │ └── source/
│ │ ├── bootstraping.rst
│ │ ├── citing.rst
│ │ ├── conf.py
│ │ ├── datawraps.rst
│ │ ├── importation.rst
│ │ ├── index.rst
│ │ ├── installation.rst
│ │ ├── objects.rst
│ │ ├── parsers.rst
│ │ ├── publications.rst
│ │ ├── querying.rst
│ │ ├── quickstart.rst
│ │ ├── snp_filter.rst
│ │ └── tools.rst
│ ├── examples/
│ │ ├── __init__.py
│ │ ├── bootstraping.py
│ │ └── snps_queries.py
│ ├── importation/
│ │ ├── Genomes.py
│ │ ├── SNPs.py
│ │ └── __init__.py
│ ├── pyGenoObjectBases.py
│ ├── tests/
│ │ ├── __init__.py
│ │ ├── test_csv.py
│ │ └── test_genome.py
│ └── tools/
│ ├── BinarySequence.py
│ ├── ProgressBar.py
│ ├── SecureMmap.py
│ ├── SegmentTree.py
│ ├── SingletonManager.py
│ ├── Stats.py
│ ├── UsefulFunctions.py
│ ├── __init__.py
│ ├── io.py
│ └── parsers/
│ ├── CSVTools.py
│ ├── CasavaTools.py
│ ├── FastaTools.py
│ ├── FastqTools.py
│ ├── GTFTools.py
│ ├── VCFTools.py
│ └── __init__.py
└── setup.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml
# Translations
*.mo
*.pot
# Django stuff:
*.log
# Sphinx documentation
docs/_build/
# PyBuilder
target/
### PyCharm ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm
## Directory-based project format
.idea/
# if you remove the above rule, at least ignore user-specific stuff:
# .idea/workspace.xml
# .idea/tasks.xml
# and these sensitive or high-churn files:
# .idea/dataSources.ids
# .idea/dataSources.xml
# .idea/sqlDataSources.xml
# .idea/dynamic.xml
## File-based project format
*.ipr
*.iws
*.iml
## Additional for IntelliJ
out/
# generated by mpeltonen/sbt-idea plugin
.idea_modules/
================================================
FILE: .travis.yml
================================================
sudo: false
notifications:
email: false
language: python
python:
- "2.7"
before_install:
- wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
- bash miniconda.sh -b -p $HOME/miniconda
- export PATH="$HOME/miniconda/bin:$PATH"
- conda update --yes conda
install:
- conda install --yes python=$TRAVIS_PYTHON_VERSION pip numpy scipy
- pip install coverage
- pip install https://github.com/tariqdaouda/rabaDB/archive/stable.zip
- python setup.py install
script: coverage run -m unittest discover pyGeno/tests/
after_success: bash <(curl -s https://codecov.io/bash)
================================================
FILE: CHANGELOG.rst
================================================
1.3.2
=====
* Search now uses KMD by default instead of dichotomic search (massive speed gain). Many thanks to @Keija for the implementation. Go to https://github.com/tariqdaouda/pyGeno/pull/34 for details and benchmarks.
1.3.1
=====
* CSVFile: fixed bug when slice start was None
* CSVFile: Better support for string separator
* AGN SNPs Quality cast to float by importer
* Travis integration
* Minor CSV parser updates
1.3.0
=====
* CSVFile will now ignore empty lines and comments
* Added synonymousCodonsFrequencies
1.2.9
=====
* It is no longer mandatory to set the whole legend of CSV file at initialization. It can figure it out by itself
* Datawraps can now be uncompressed folders
* Explicit error message if there's no file name manifest.ini in datawrap
1.2.8
=====
* Fixed BUG that prevented proper initialization and importation
1.2.5
=====
* BUG FIX: Opening a lot of chromosomes caused mmap to die screaming
* Removed core indexes. Sqlite sometimes chose them instead of user defined positional indexes, resulting un slow queries
* Doc updates
1.2.3
=====
* Added functions to retrieve the names of imported snps sets and genomes
* Added remote datawraps to the boostrap module that can be downloaded from pyGeno's website or any other location
* Added a field uniqueId to AgnosticSNPs
* Changed all latin datawrap names to english
* Removed datawrap for dbSNP GRCh37
1.2.2
=====
* Updated pypi package to include bootstrap datawraps
1.2.1
=====
* Fixed tests
1.2.0
=====
* BUG FIX: get()/iterGet() now works for SNPs and Indels
* BUG FIX: Default SNP filter used to return unwanted Nones for insertions
* BUG FIX: Added cast of lines to str in VCF and CasavaSNP parsers. Sometimes unicode caracters made the translation bug
* BUG FIX: Corrected a typo that caused find in Transcript to recursively die
* Added a new AgnosticSNP type of SNPs that can be easily made from the results of any snp caller. To make for the loss of support for casava by illumina. See SNP.AgnosticSNP for documentation
* pyGeno now comes with the murine reference genome GRCm38.78
* pyGeno now comes with the human reference genome GRCh38.78, GRCh37.75 is still shipped with pyGeno but might be in the upcoming versions
* pyGeno now comes with a datawrap for common dbSNPs human SNPs (SNP_dbSNP142_human_common_all.tar.gz)
* Added a dummy AgnosticSNP datawrap example for boostraping
* Changed the interface of the bootstrap module
* CSV Parser has now the ability to stream directly to a file
1.1.7
=====
* BUG FIX: looping through CSV lines now works
* Added tests for CSV
1.1.6
=====
* BUG FIX: find in BinarySequence could not find some subsequences at the tail of sequence
1.1.5
=====
* BUG FIX in default SNP filter
* Updated description
1.1.4
=====
* Another BUG FIX in progress bar
1.1.3
=====
* Small BUG FIX in the progress bar that caused epochs to be misrepresented
* 'Specie' has been changed to 'species' everywhere. That breaks the database the only way to fix it is to redo all importations
1.1.2
=====
* Genome import is now much more memory efficient
* BUG FIX: find in BinarySequence could not find subsequences at the tail of sequence
* Added a built-in datawrap with only chr1 and y
* Readme update with more infos about importation and link to doc
1.1.1
=====
Much better SNP filtering interface
------------------------------------
Easier and much morr coherent:
* SNP filtering has now its own module
* SNP Filters are now objects
* SNP Filters must return SequenceSNP, SNPInsert, SNPDeletion or None objects
1.0.0
=====
Freshly hatched
================================================
FILE: DESCRIPTION.rst
================================================
pyGeno: A python package for Precision Medicine, Personalized Genomics and Proteomics
=====================================================================================
Short description:
------------------
pyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_).
.. _Tariq Daouda: http://www.tariqdaouda.com
.. _IRIC: http://www.iric.ca
With pyGeno you can do that:
.. code:: python
from pyGeno.Genome import *
#load a genome
ref = Genome(name = 'GRCh37.75')
#load a gene
gene = ref.get(Gene, name = 'TPST2')[0]
#print the sequences of all the isoforms
for prot in gene.get(Protein) :
print prot.sequence
You can also do it for the **specific genomes** of your subjects:
.. code:: python
pers = Genome(name = 'GRCh37.75', SNPs = ["RNA_S1"], SNPFilter = myFilter())
And much more: https://github.com/tariqdaouda/pyGeno
Verbose Description
--------------------
pyGeno is a personal bioinformatic database that runs directly into python, on your laptop and does not depend
upon any REST API. pyGeno is here to make extracting data such as gene sequences a breeze, and is designed to
be able cope with huge queries. The most exciting feature of pyGeno, is that it allows to work with seamlessly with both reference and **Presonalized Genomes**.
Personalized Genomes, are custom genomes that you create by combining a reference genome, sets of polymorphims and an optional filter.
pyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get
direct access to the DNA and Protein sequences of your patients.
Multiple sets of of polymorphisms can also be combined together to leverage their independent benefits ex:
RNA-seq and DNA-seq for the same individual to improve the coverage
RNA-seq of an individual + dbSNP for validation
Combine the results of RNA-seq of several individual to create a genome only containing the common polymorphisms
pyGeno is also a personal database that give you access to all the information provided by Ensembl (for both Reference and Personalized Genomes) without the need of queries to distant HTTP APIs. Allowing for much faster and reliable genome wide study pipelines.
It also comes with parsers for several file types and various other useful tools.
Full Documentation
------------------
The full documentation is available here_
.. _here: http://pygeno.iric.ca/
If you like pyGeno, please let me know.
For the latest news, you can follow me on twitter `@tariqdaouda`_.
.. _@tariqdaouda: https://www.twitter.com/tariqdaouda
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: MANIFEST.in
================================================
include *.rst
include LICENSE
================================================
FILE: README.rst
================================================
CODE FREEZE:
============
PyGeno has long been limited due to it's backend. We are now ready to take it to the next level.
We are working on a major port of pyGeno to the open-source multi-modal database ArangoDB. PyGeno's code on both branches master and bloody is frozen until we are finished. No pull request will be merged until then, and we won't implement any new features.
pyGeno: A Python package for precision medicine and proteogenomics
==================================================================
.. image:: http://depsy.org/api/package/pypi/pyGeno/badge.svg
:alt: depsy
:target: http://depsy.org/package/python/pyGeno
.. image:: https://pepy.tech/badge/pygeno
:alt: downloads
:target: https://pepy.tech/project/pygeno
.. image:: https://pepy.tech/badge/pygeno/month
:alt: downloads_month
:target: https://pepy.tech/project/pygeno/month
.. image:: https://pepy.tech/badge/pygeno/week
:alt: downloads_week
:target: https://pepy.tech/project/pygeno/week
.. image:: http://bioinfo.iric.ca/~daoudat/pyGeno/_static/logo.png
:alt: pyGeno's logo
pyGeno is (to our knowledge) the only tool available that will gladly build your specific genomes for you.
pyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_), its logo is the work of the freelance designer `Sawssan Kaddoura`_.
For the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.
.. _Tariq Daouda: http://wwww.tariqdaouda.com
.. _IRIC: http://www.iric.ca
.. _Sawssan Kaddoura: http://sawssankaddoura.com
Click here for The `full documentation`_.
.. _full documentation: http://pygeno.iric.ca/
For the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.
.. _@tariqdaouda: https://www.twitter.com/tariqdaouda
Citing pyGeno:
--------------
Please cite this paper_.
.. _paper: http://f1000research.com/articles/5-381/v1
Installation:
-------------
It is recommended to install pyGeno within a `virtual environement`_, to setup one you can use:
.. code:: shell
virtualenv ~/.pyGenoEnv
source ~/.pyGenoEnv/bin/activate
pyGeno can be installed through pip:
.. code:: shell
pip install pyGeno #for the latest stable version
Or github, for the latest developments:
.. code:: shell
git clone https://github.com/tariqdaouda/pyGeno.git
cd pyGeno
python setup.py develop
.. _`virtual environement`: http://virtualenv.readthedocs.org/
A brief introduction
--------------------
pyGeno is a personal bioinformatic database that runs directly into python, on your laptop and does not depend
upon any REST API. pyGeno is here to make extracting data such as gene sequences a breeze, and is designed to
be able cope with huge queries. The most exciting feature of pyGeno, is that it allows to work with seamlessly with both reference and **Personalized Genomes**.
Personalized Genomes, are custom genomes that you create by combining a reference genome, sets of polymorphisms and an optional filter.
pyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get
direct access to the DNA and Protein sequences of your patients.
.. code:: python
from pyGeno.Genome import *
g = Genome(name = "GRCh37.75")
prot = g.get(Protein, id = 'ENSP00000438917')[0]
#print the protein sequence
print prot.sequence
#print the protein's gene biotype
print prot.gene.biotype
#print protein's transcript sequence
print prot.transcript.sequence
#fancy queries
for exon in g.get(Exon, {"CDS_start >": x1, "CDS_end <=" : x2, "chromosome.number" : "22"}) :
#print the exon's coding sequence
print exon.CDS
#print the exon's transcript sequence
print exon.transcript.sequence
#You can do the same for your subject specific genomes
#by combining a reference genome with polymorphisms
g = Genome(name = "GRCh37.75", SNPs = ["STY21_RNA"], SNPFilter = MyFilter())
And if you ever get lost, there's an online **help()** function for each object type:
.. code:: python
from pyGeno.Genome import *
print Exon.help()
Should output:
.. code::
Available fields for Exon: CDS_start, end, chromosome, CDS_length, frame, number, CDS_end, start, genome, length, protein, gene, transcript, id, strand
Creating a Personalized Genome:
-------------------------------
Personalized Genomes are a powerful feature that allow you to work on the specific genomes and proteomes of your patients. You can even mix several SNP sets together.
.. code:: python
from pyGeno.Genome import Genome
#the name of the snp set is defined inside the datawrap's manifest.ini file
dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')
#you can also define a filter (ex: a quality filter) for the SNPs
dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())
#and even mix several snp sets
dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())
Filtering SNPs:
---------------
pyGeno allows you to select the Polymorphisms that end up into the final sequences. It supports SNPs, Inserts and Deletions.
.. code:: python
from pyGeno.SNPFiltering import SNPFilter, SequenceSNP
class QMax_gt_filter(SNPFilter) :
def __init__(self, threshold) :
self.threshold = threshold
#Here SNPs is a dictionary: SNPSet Name => polymorphism
#This filter ignores deletions and insertions and
#but applis all SNPs
def filter(self, chromosome, **SNPs) :
sources = {}
alleles = []
for snpSet, snp in SNPs.iteritems() :
pos = snp.start
if snp.alt[0] == '-' :
pass
elif snp.ref[0] == '-' :
pass
else :
sources[snpSet] = snp
alleles.append(snp.alt) #if not an indel append the polymorphism
#appends the refence allele to the lot
refAllele = chromosome.refSequence[pos]
alleles.append(refAllele)
sources['ref'] = refAllele
#optional we keep a record of the polymorphisms that were used during the process
return SequenceSNP(alleles, sources = sources)
The filter function can also be made more specific by using arguments that have the same names as the SNPSets
.. code:: python
def filter(self, chromosome, dummySRY = None) :
if dummySRY.Qmax_gt > self.threshold :
#other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)
return SequenceSNP(dummySRY.alt)
return None #None means keep the reference allele
To apply the filter simply specify if while loading the genome.
.. code:: python
persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))
To include several SNPSets use a list.
.. code:: python
persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = ['ARN_P1', 'ARN_P2'], SNPFilter = myFilter())
Getting an arbitrary sequence:
------------------------------
You can ask for any sequence of any chromosome:
.. code:: python
chr12 = myGenome.get(Chromosome, number = "12")[0]
print chr12.sequence[x1:x2]
# for the reference sequence
print chr12.refSequence[x1:x2]
Batteries included (bootstraping):
---------------------------------
pyGeno's database is populated by importing datawraps.
pyGeno comes with a few data wraps, to get the list you can use:
.. code:: python
import pyGeno.bootstrap as B
B.printDatawraps()
.. code::
Available datawraps for boostraping
SNPs
~~~~|
|~~~:> Human_agnostic.dummySRY.tar.gz
|~~~:> Human.dummySRY_casava.tar.gz
|~~~:> dbSNP142_human_common_all.tar.gz
Genomes
~~~~~~~|
|~~~:> Human.GRCh37.75.tar.gz
|~~~:> Human.GRCh37.75_Y-Only.tar.gz
|~~~:> Human.GRCh38.78.tar.gz
|~~~:> Mouse.GRCm38.78.tar.gz
To get a list of remote datawraps that pyGeno can download for you, do:
.. code:: python
B.printRemoteDatawraps()
Importing whole genomes is a demanding process that take more than an hour and requires (according to tests)
at least 3GB of memory. Depending on your configuration, more might be required.
That being said importating a data wrap is a one time operation and once the importation is complete the datawrap
can be discarded without consequences.
The bootstrap module also has some handy functions for importing built-in packages.
Some of them just for playing around with pyGeno (**Fast importation** and **Small memory requirements**):
.. code:: python
import pyGeno.bootstrap as B
#Imports only the Y chromosome from the human reference genome GRCh37.75
#Very fast, requires even less memory. No download required.
B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
#A dummy datawrap for humans SNPs and Indels in pyGeno's AgnosticSNP format.
# This one has one SNP at the begining of the gene SRY
B.importSNPs("Human.dummySRY_casava.tar.gz")
And for more **Serious Work**, the whole reference genome.
.. code:: python
#Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.
B.importGenome("Human.GRCh38.78.tar.gz")
Importing a custom datawrap:
--------------------------
.. code:: python
from pyGeno.importation.Genomes import *
importGenome('GRCh37.75.tar.gz')
To import a patient's specific polymorphisms
.. code:: python
from pyGeno.importation.SNPs import *
importSNPs('patient1.tar.gz')
For a list of available datawraps available for download, please have a look here_.
You can easily make your own datawraps with any tar.gz compressor.
For more details on how datawraps are made you can check wiki_ or have a look inside the folder bootstrap_data.
.. _here: http://pygeno.iric.ca/datawraps.html
.. _wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-friendly-package-to-import-your-data%3F
Instanciating a genome:
-----------------------
.. code:: python
from pyGeno.Genome import Genome
#the name of the genome is defined inside the package's manifest.ini file
ref = Genome(name = 'GRCh37.75')
Printing all the proteins of a gene:
-----------------------------------
.. code:: python
from pyGeno.Genome import Genome
from pyGeno.Gene import Gene
from pyGeno.Protein import Protein
Or simply:
.. code:: python
from pyGeno.Genome import *
then:
.. code:: python
ref = Genome(name = 'GRCh37.75')
#get returns a list of elements
gene = ref.get(Gene, name = 'TPST2')[0]
for prot in gene.get(Protein) :
print prot.sequence
Making queries, get() Vs iterGet():
-----------------------------------
iterGet is a faster version of get that returns an iterator instead of a list.
Making queries, syntax:
----------------------
pyGeno's get function uses the expressivity of rabaDB.
These are all possible query formats:
.. code:: python
ref.get(Gene, name = "SRY")
ref.get(Gene, { "name like" : "HLA"})
chr12.get(Exon, { "start >=" : 12000, "end <" : 12300 })
ref.get(Transcript, { "gene.name" : 'SRY' })
Creating indexes to speed up queries:
------------------------------------
.. code:: python
from pyGeno.Gene import Gene
#creating an index on gene names if it does not already exist
Gene.ensureGlobalIndex('name')
#removing the index
Gene.dropIndex('name')
Find in sequences:
------------------
Internally pyGeno uses a binary representation for nucleotides and amino acids to deal with polymorphisms.
For example,both "AGC" and "ATG" will match the following sequence "...AT/GCCG...".
.. code:: python
#returns the position of the first occurence
transcript.find("AT/GCCG")
#returns the positions of all occurences
transcript.findAll("AT/GCCG")
#similarly, you can also do
transcript.findIncDNA("AT/GCCG")
transcript.findAllIncDNA("AT/GCCG")
transcript.findInUTR3("AT/GCCG")
transcript.findAllInUTR3("AT/GCCG")
transcript.findInUTR5("AT/GCCG")
transcript.findAllInUTR5("AT/GCCG")
#same for proteins
protein.find("DEV/RDEM")
protein.findAll("DEV/RDEM")
#and for exons
exon.find("AT/GCCG")
exon.findAll("AT/GCCG")
exon.findInCDS("AT/GCCG")
exon.findAllInCDS("AT/GCCG")
#...
Progress Bar:
-------------
.. code:: python
from pyGeno.tools.ProgressBar import ProgressBar
pg = ProgressBar(nbEpochs = 155)
for i in range(155) :
pg.update(label = '%d' %i) # or simply p.update()
pg.close()
================================================
FILE: pyGeno/Chromosome.py
================================================
#import copy
#import types
#from tools import UsefulFunctions as uf
from types import *
import configuration as conf
from pyGenoObjectBases import *
from SNP import *
import SNPFiltering as SF
from rabaDB.filters import RabaQuery
import rabaDB.fields as rf
from tools.SecureMmap import SecureMmap as SecureMmap
from tools import SingletonManager
import pyGeno.configuration as conf
class ChrosomeSequence(object) :
"""Represents a chromosome sequence. If 'refOnly' no ploymorphisms are applied and the ref sequence is always returned"""
def __init__(self, data, chromosome, refOnly = False) :
self.data = data
self.refOnly = refOnly
self.chromosome = chromosome
self.setSNPFilter(self.chromosome.genome.SNPFilter)
def setSNPFilter(self, SNPFilter) :
self.SNPFilter = SNPFilter
def getSequenceData(self, slic) :
data = self.data[slic]
SNPTypes = self.chromosome.genome.SNPTypes
if SNPTypes is None or self.refOnly :
return data
iterators = []
for setName, SNPType in SNPTypes.iteritems() :
f = RabaQuery(str(SNPType), namespace = self.chromosome._raba_namespace)
chromosomeNumber = self.chromosome.number
if chromosomeNumber == 'MT':
chromosomeNumber = 'M'
f.addFilter({'start >=' : slic.start, 'start <' : slic.stop, 'setName' : str(setName), 'chromosomeNumber' : chromosomeNumber})
# conf.db.enableDebug(True)
iterators.append(f.iterRun(sqlTail = 'ORDER BY start'))
if len(iterators) < 1 :
return data
polys = {}
for iterator in iterators :
for poly in iterator :
if poly.start not in polys :
polys[poly.start] = {poly.setName : poly}
else :
try :
polys[poly.start][poly.setName].append(poly)
except :
polys[poly.start][poly.setName] = [polys[poly.start][poly.setName]]
polys[poly.start][poly.setName].append(poly)
data = list(data)
for start, setPolys in polys.iteritems() :
seqPos = start - slic.start
sequenceModifier = self.SNPFilter.filter(self.chromosome, **setPolys)
# print sequenceModifier.alleles
if sequenceModifier is not None :
if sequenceModifier.__class__ is SF.SequenceDel :
seqPos = seqPos + sequenceModifier.offset
#To avoid to change the length of the sequence who can create some bug or side effect
data[seqPos:(seqPos + sequenceModifier.length)] = [''] * sequenceModifier.length
elif sequenceModifier.__class__ is SF.SequenceSNP :
data[seqPos] = sequenceModifier.alleles
elif sequenceModifier.__class__ is SF.SequenceInsert :
seqPos = seqPos + sequenceModifier.offset
data[seqPos] = "%s%s" % (data[seqPos], sequenceModifier.bases)
else :
raise TypeError("sequenceModifier on chromosome: %s starting at: %s is of unknown type: %s" % (self.chromosome.number, snp.start, sequenceModifier.__class__))
return data
def _getSequence(self, slic) :
return ''.join(self.getSequenceData(slice(0, None, 1)))[slic]
def __getitem__(self, i) :
return self._getSequence(i)
def __len__(self) :
return self.chromosome.length
class Chromosome_Raba(pyGenoRabaObject) :
"""The wrapped Raba object that really holds the data"""
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
header = rf.Primitive()
number = rf.Primitive()
start = rf.Primitive()
end = rf.Primitive()
length = rf.Primitive()
genome = rf.RabaObject('Genome_Raba')
def _curate(self) :
if self.end != None and self.start != None :
self.length = self.end-self.start
if self.number != None :
self.number = str(self.number).upper()
class Chromosome(pyGenoRabaObjectWrapper) :
"""The wrapper for playing with Chromosomes"""
_wrapped_class = Chromosome_Raba
def __init__(self, *args, **kwargs) :
pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
path = '%s/chromosome%s.dat'%(self.genome.getSequencePath(), self.number)
if not SingletonManager.contains(path) :
datMap = SingletonManager.add(SecureMmap(path), path)
else :
datMap = SingletonManager.get(path)
self.sequence = ChrosomeSequence(datMap, self)
self.refSequence = ChrosomeSequence(datMap, self, refOnly = True)
self.loadSequences = False
def getSequenceData(self, slic) :
return self.sequence.getSequenceData(slic)
def _makeLoadQuery(self, objectType, *args, **coolArgs) :
if issubclass(objectType, SNP_INDEL) :
f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
coolArgs['species'] = self.genome.species
coolArgs['chromosomeNumber'] = self.number
if len(args) > 0 and type(args[0]) is types.ListType :
for a in args[0] :
if type(a) is types.DictType :
f.addFilter(**a)
else :
f.addFilter(*args, **coolArgs)
return f
return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
def __getitem__(self, i) :
return self.sequence[i]
def __str__(self) :
return "Chromosome: number %s > %s" %(self.wrapped_object.number, str(self.wrapped_object.genome))
================================================
FILE: pyGeno/Exon.py
================================================
from pyGenoObjectBases import *
from SNP import SNP_INDEL
import rabaDB.fields as rf
from tools import UsefulFunctions as uf
from tools.BinarySequence import NucBinarySequence
class Exon_Raba(pyGenoRabaObject) :
"""The wrapped Raba object that really holds the data"""
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
id = rf.Primitive()
number = rf.Primitive()
start = rf.Primitive()
end = rf.Primitive()
length = rf.Primitive()
CDS_length = rf.Primitive()
CDS_start = rf.Primitive()
CDS_end = rf.Primitive()
frame = rf.Primitive()
strand = rf.Primitive()
genome = rf.RabaObject('Genome_Raba')
chromosome = rf.RabaObject('Chromosome_Raba')
gene = rf.RabaObject('Gene_Raba')
transcript = rf.RabaObject('Transcript_Raba')
protein = rf.RabaObject('Protein_Raba')
def _curate(self) :
if self.start != None and self.end != None :
if self.start > self.end :
self.start, self.end = self.end, self.start
self.length = self.end-self.start
if self.CDS_start != None and self.CDS_end != None :
if self.CDS_start > self.CDS_end :
self.CDS_start, self.CDS_end = self.CDS_end, self.CDS_start
self.CDS_length = self.CDS_end - self.CDS_start
if self.number != None :
self.number = int(self.number)
if not self.frame or self.frame == '.' :
self.frame = None
else :
self.frame = int(self.frame)
class Exon(pyGenoRabaObjectWrapper) :
"""The wrapper for playing with Exons"""
_wrapped_class = Exon_Raba
def __init__(self, *args, **kwargs) :
pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
self._load_sequencesTriggers = set(["UTR5", "UTR3", "CDS", "sequence", "data"])
def _makeLoadQuery(self, objectType, *args, **coolArgs) :
if issubclass(objectType, SNP_INDEL) :
f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
coolArgs['species'] = self.genome.species
coolArgs['chromosomeNumber'] = self.chromosome.number
coolArgs['start >='] = self.start
coolArgs['start <'] = self.end
if len(args) > 0 and type(args[0]) is types.ListType :
for a in args[0] :
if type(a) is types.DictType :
f.addFilter(**a)
else :
f.addFilter(*args, **coolArgs)
return f
return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
def _load_data(self) :
data = self.chromosome.getSequenceData(slice(self.start,self.end))
diffLen = (self.end-self.start) - len(data)
if self.strand == '+' :
self.data = data
else :
self.data = uf.reverseComplementTab(data)
if self.hasCDS() :
start = self.CDS_start-self.start
end = self.CDS_end-self.start
if self.strand == '+' :
self.UTR5 = self.data[:start]
self.CDS = self.data[start:end+diffLen]
self.UTR3 = self.data[end+diffLen:]
else :
self.UTR5 = self.data[:len(self.data)-(end-diffLen)]
self.CDS = self.data[len(self.data)-(end-diffLen):len(self.data)-start]
self.UTR3 = self.data[len(self.data)-start:]
else :
self.UTR5 = ''
self.CDS = ''
self.UTR3 = ''
self.sequence = ''.join(self.data)
def _load_bin_sequence(self) :
self.bin_sequence = NucBinarySequence(self.sequence)
self.bin_UTR5 = NucBinarySequence(self.UTR5)
self.bin_CDS = NucBinarySequence(self.CDS)
self.bin_UTR3 = NucBinarySequence(self.UTR3)
def hasCDS(self) :
"""returns true or false depending on if the exon has a CDS"""
if self.CDS_start != None and self.CDS_end != None:
return True
return False
def getCDSLength(self) :
"""returns the length of the CDS sequence"""
return len(self.CDS)
def find(self, sequence) :
"""return the position of the first occurance of sequence"""
return self.bin_sequence.find(sequence)
def findAll(self, sequence):
"""Returns a lits of all positions where sequence was found"""
return self.bin_sequence.findAll(sequence)
def findInCDS(self, sequence) :
"""return the position of the first occurance of sequence"""
return self.bin_CDS.find(sequence)
def findAllInCDS(self, sequence):
"""Returns a lits of all positions where sequence was found"""
return self.bin_CDS.findAll(sequence)
def pluck(self) :
"""Returns a plucked object. Plucks the exon off the tree, set the value of self.transcript into str(self.transcript). This effectively disconnects the object and
makes it much more lighter in case you'd like to pickle it"""
e = copy.copy(self)
e.transcript = str(self.transcript)
return e
def nextExon(self) :
"""Returns the next exon of the transcript, or None if there is none"""
try :
return self.transcript.exons[self.number+1]
except IndexError :
return None
def previousExon(self) :
"""Returns the previous exon of the transcript, or None if there is none"""
if self.number == 0 :
return None
try :
return self.transcript.exons[self.number-1]
except IndexError :
return None
def __str__(self) :
return """EXON, id %s, number: %s, (start, end): (%s, %s), cds: (%s, %s) > %s""" %( self.id, self.number, self.start, self.end, self.CDS_start, self.CDS_end, str(self.transcript))
def __len__(self) :
return len(self.sequence)
================================================
FILE: pyGeno/Gene.py
================================================
import configuration as conf
from pyGenoObjectBases import *
from SNP import SNP_INDEL
import rabaDB.fields as rf
class Gene_Raba(pyGenoRabaObject) :
"""The wrapped Raba object that really holds the data"""
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
id = rf.Primitive()
name = rf.Primitive()
strand = rf.Primitive()
biotype = rf.Primitive()
start = rf.Primitive()
end = rf.Primitive()
genome = rf.RabaObject('Genome_Raba')
chromosome = rf.RabaObject('Chromosome_Raba')
def _curate(self) :
self.name = self.name.upper()
class Gene(pyGenoRabaObjectWrapper) :
"""The wrapper for playing with genes"""
_wrapped_class = Gene_Raba
def __init__(self, *args, **kwargs) :
pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
def _makeLoadQuery(self, objectType, *args, **coolArgs) :
if issubclass(objectType, SNP_INDEL) :
f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
coolArgs['species'] = self.genome.species
coolArgs['chromosomeNumber'] = self.chromosome.number
coolArgs['start >='] = self.start
coolArgs['start <'] = self.end
if len(args) > 0 and type(args[0]) is types.ListType :
for a in args[0] :
if type(a) is types.DictType :
f.addFilter(**a)
else :
f.addFilter(*args, **coolArgs)
return f
return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
def __str__(self) :
return "Gene, name: %s, id: %s, strand: '%s' > %s" %(self.name, self.id, self.strand, str(self.chromosome))
================================================
FILE: pyGeno/Genome.py
================================================
import types
import configuration as conf
import pyGeno.tools.UsefulFunctions as uf
from pyGenoObjectBases import *
from Chromosome import Chromosome
from Gene import Gene
from Transcript import Transcript
from Protein import Protein
from Exon import Exon
import SNPFiltering as SF
from SNP import *
import rabaDB.fields as rf
def getGenomeList() :
"""Return the names of all imported genomes"""
import rabaDB.filters as rfilt
f = rfilt.RabaQuery(Genome_Raba)
names = []
for g in f.iterRun() :
names.append(g.name)
return names
class Genome_Raba(pyGenoRabaObject) :
"""The wrapped Raba object that really holds the data"""
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
#_raba_not_a_singleton = True #you can have several instances of the same genome but they all share the same location in the database
name = rf.Primitive()
species = rf.Primitive()
source = rf.Primitive()
packageInfos = rf.Primitive()
def _curate(self) :
self.species = self.species.lower()
def getSequencePath(self) :
return conf.getGenomeSequencePath(self.species, self.name)
def getReferenceSequencePath(self) :
return conf.getReferenceGenomeSequencePath(self.species)
def __len__(self) :
"""Size of the genome in pb"""
l = 0
for c in self.chromosomes :
l += len(c)
return l
class Genome(pyGenoRabaObjectWrapper) :
"""
This is the entry point to pyGeno::
myGeno = Genome(name = 'GRCh37.75', SNPs = ['RNA_S1', 'DNA_S1'], SNPFilter = MyFilter)
for prot in myGeno.get(Protein) :
print prot.sequence
"""
_wrapped_class = Genome_Raba
def __init__(self, SNPs = None, SNPFilter = None, *args, **kwargs) :
pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
if type(SNPs) is types.StringType :
self.SNPsSets = [SNPs]
else :
self.SNPsSets = SNPs
# print "pifpasdf", self.SNPsSets
if SNPFilter is None :
self.SNPFilter = SF.DefaultSNPFilter()
else :
if issubclass(SNPFilter.__class__, SF.SNPFilter) :
self.SNPFilter = SNPFilter
else :
raise ValueError("The value of 'SNPFilter' is not an object deriving from a subclass of SNPFiltering.SNPFilter. Got: '%s'" % SNPFilter)
self.SNPTypes = {}
if SNPs is not None :
f = RabaQuery(SNPMaster, namespace = self._raba_namespace)
for se in self.SNPsSets :
f.addFilter(setName = se, species = self.species)
res = f.run()
if res is None or len(res) < 1 :
raise ValueError("There's no set of SNPs that goes by the name of %s for species %s" % (SNPs, self.species))
for s in res :
# print s.setName, s.SNPType
self.SNPTypes[s.setName] = s.SNPType
def _makeLoadQuery(self, objectType, *args, **coolArgs) :
if issubclass(objectType, SNP_INDEL) :
# conf.db.enableDebug(True)
f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
coolArgs['species'] = self.species
if len(args) > 0 and type(args[0]) is types.ListType :
for a in args[0] :
if type(a) is types.DictType :
f.addFilter(**a)
else :
f.addFilter(*args, **coolArgs)
return f
return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
def __str__(self) :
return "Genome: %s/%s SNPs: %s" %(self.species, self.name, self.SNPTypes)
================================================
FILE: pyGeno/Protein.py
================================================
import configuration as conf
from pyGenoObjectBases import *
from SNP import SNP_INDEL
import rabaDB.fields as rf
from tools import UsefulFunctions as uf
from tools.BinarySequence import AABinarySequence
import copy
class Protein_Raba(pyGenoRabaObject) :
"""The wrapped Raba object that really holds the data"""
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
id = rf.Primitive()
name = rf.Primitive()
genome = rf.RabaObject('Genome_Raba')
chromosome = rf.RabaObject('Chromosome_Raba')
gene = rf.RabaObject('Gene_Raba')
transcript = rf.RabaObject('Transcript_Raba')
def _curate(self) :
if self.name != None :
self.name = self.name.upper()
class Protein(pyGenoRabaObjectWrapper) :
"""The wrapper for playing with Proteins"""
_wrapped_class = Protein_Raba
def __init__(self, *args, **kwargs) :
pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
self._load_sequencesTriggers = set(["sequence"])
def _makeLoadQuery(self, objectType, *args, **coolArgs) :
if issubclass(objectType, SNP_INDEL) :
f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
coolArgs['species'] = self.genome.species
coolArgs['chromosomeNumber'] = self.chromosome.number
coolArgs['start >='] = self.transcript.start
coolArgs['start <'] = self.transcript.end
if len(args) > 0 and type(args[0]) is types.ListType :
for a in args[0] :
if type(a) is types.DictType :
f.addFilter(**a)
else :
f.addFilter(*args, **coolArgs)
return f
return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
def _load_sequences(self) :
if self.chromosome.number != 'MT':
self.sequence = uf.translateDNA(self.transcript.cDNA).rstrip('*')
else:
self.sequence = uf.translateDNA(self.transcript.cDNA, translTable_id='mt').rstrip('*')
def getSequence(self):
return self.sequence
def _load_bin_sequence(self) :
self.bin_sequence = AABinarySequence(self.sequence)
def getDefaultSequence(self) :
"""Returns a version str sequence where only the last allele of each polymorphisms is shown"""
return self.bin_sequence.defaultSequence
def getPolymorphisms(self) :
"""Returns a list of all polymorphisms contained in the protein"""
return self.bin_sequence.getPolymorphisms()
def find(self, sequence):
"""Returns the position of the first occurence of sequence taking polymorphisms into account"""
return self.bin_sequence.find(sequence)
def findAll(self, sequence):
"""Returns all the position of the occurences of sequence taking polymorphisms into accoun"""
return self.bin_sequence.findAll(sequence)
def findString(self, sequence) :
"""Returns the first occurence of sequence using simple string search in sequence that doesn't care about polymorphisms"""
return self.sequence.find(sequence)
def findStringAll(self, sequence):
"""Returns all first occurences of sequence using simple string search in sequence that doesn't care about polymorphisms"""
return uf.findAll(self.sequence, sequence)
def __getitem__(self, i) :
return self.bin_sequence.getChar(i)
def __len__(self) :
return len(self.bin_sequence)
def __str__(self) :
return "Protein, id: %s > %s" %(self.id, str(self.transcript))
================================================
FILE: pyGeno/SNP.py
================================================
import types
import configuration as conf
from pyGenoObjectBases import *
import rabaDB.fields as rf
# from tools import UsefulFunctions as uf
"""General guidelines for SNP classes:
----
-All classes must inherit from SNP_INDEL
-All classes name must end with SNP
-A SNP_INDELs must have at least chromosomeNumber, setName, species, start and ref filled in order to be inserted into sequences
-User can define an alias for the alt field (snp_indel alleles) to indicate the default field from wich to extract alleles
"""
def getSNPSetsList() :
"""Return the names of all imported snp sets"""
import rabaDB.filters as rfilt
f = rfilt.RabaQuery(SNPMaster)
names = []
for g in f.iterRun() :
names.append(g.setName)
return names
class SNPMaster(Raba) :
'This object keeps track of SNP sets and their types'
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
species = rf.Primitive()
SNPType = rf.Primitive()
setName = rf.Primitive()
_raba_uniques = [('setName',)]
def _curate(self) :
self.species = self.species.lower()
self.setName = self.setName.lower()
class SNP_INDEL(Raba) :
"All SNPs should inherit from me. The name of the class must end with SNP"
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
_raba_abstract = True # not saved in db
species = rf.Primitive()
setName = rf.Primitive()
chromosomeNumber = rf.Primitive()
start = rf.Primitive()
end = rf.Primitive()
ref = rf.Primitive()
#every SNP_INDEL must have a field alt. This variable allows you to set an alias for it. Chamging the alias does not impact the schema
altAlias = 'alt'
def __init__(self, *args, **kwargs) :
pass
def __getattribute__(self, k) :
if k == 'alt' :
cls = Raba.__getattribute__(self, '__class__')
return Raba.__getattribute__(self, cls.altAlias)
return Raba.__getattribute__(self, k)
def __setattr__(self, k, v) :
if k == 'alt' :
cls = Raba.__getattribute__(self, '__class__')
Raba.__setattr__(self, cls.altAlias, v)
Raba.__setattr__(self, k, v)
def _curate(self) :
self.species = self.species.lower()
@classmethod
def ensureGlobalIndex(cls, fields) :
cls.ensureIndex(fields)
def __repr__(self) :
return "%s> chr: %s, start: %s, end: %s, alt: %s, ref: %s" %(self.__class__.__name__, self.chromosomeNumber, self.start, self.end, self.alleles, self.ref)
class CasavaSNP(SNP_INDEL) :
"A SNP of Casava"
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
alleles = rf.Primitive()
bcalls_used = rf.Primitive()
bcalls_filt = rf.Primitive()
QSNP = rf.Primitive()
Qmax_gt = rf.Primitive()
max_gt_poly_site = rf.Primitive()
Qmax_gt_poly_site = rf.Primitive()
A_used = rf.Primitive()
C_used = rf.Primitive()
G_used = rf.Primitive()
T_used = rf.Primitive()
altAlias = 'alleles'
class AgnosticSNP(SNP_INDEL) :
"""This is a generic SNPs/Indels format that you can easily make from the result of any SNP caller. AgnosticSNP files are tab delimited files such as:
chromosomeNumber uniqueId start end ref alleles quality caller
Y 1 2655643 2655644 T AG 30 TopHat
Y 2 2655645 2655647 - AG 28 TopHat
Y 3 2655648 2655650 TT - 10 TopHat
All positions must be 0 based
The '-' indicates a deletion or an insertion. Collumn order has no importance.
"""
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
alleles = rf.Primitive()
quality = rf.Primitive()
caller = rf.Primitive()
uniqueId = rf.Primitive() # polymorphism id
altAlias = 'alleles'
def __repr__(self) :
return "AgnosticSNP> start: %s, end: %s, quality: %s, caller %s, alt: %s, ref: %s" %(self.start, self.end, self.quality, self.caller, self.alleles, self.ref)
class dbSNPSNP(SNP_INDEL) :
"This class is for SNPs from dbSNP. Feel free to uncomment the fields that you need"
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
# To add/remove a field comment/uncomentd it. Beware, adding or removing a field results in a significant overhead the first time you relaunch pyGeno. You may have to delete and reimport some snps sets.
#####RSPOS = rf.Primitive() #Chr position reported in already saved into field start
RS = rf.Primitive() #dbSNP ID (i.e. rs number)
ALT = rf.Primitive()
RV = rf.Primitive() #RS orientation is reversed
#VP = rf.Primitive() #Variation Property. Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
#GENEINFO = rf.Primitive() #Pairs each of gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)
dbSNPBuildID = rf.Primitive() #First dbSNP Build for RS
#SAO = rf.Primitive() #Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both
#SSR = rf.Primitive() #Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other
#WGT = rf.Primitive() #Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 or more
VC = rf.Primitive() #Variation Class
PM = rf.Primitive() #Variant is Precious(Clinical,Pubmed Cited)
#TPA = rf.Primitive() #Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will give phenotype data)
#PMC = rf.Primitive() #Links exist to PubMed Central article
#S3D = rf.Primitive() #Has 3D structure - SNP3D table
#SLO = rf.Primitive() #Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out
#NSF = rf.Primitive() #Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44
#NSM = rf.Primitive() #Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42
#NSN = rf.Primitive() #Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41
#REF = rf.Primitive() #Has reference A coding region variation where one allele in the set is identical to the reference sequence. FxnCode = 8
#SYN = rf.Primitive() #Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3
#U3 = rf.Primitive() #In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53
#U5 = rf.Primitive() #In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55
#ASS = rf.Primitive() #In acceptor splice site FxnCode = 73
#DSS = rf.Primitive() #In donor splice-site FxnCode = 75
#INT = rf.Primitive() #In Intron FxnCode = 6
#R3 = rf.Primitive() #In 3' gene region FxnCode = 13
#R5 = rf.Primitive() #In 5' gene region FxnCode = 15
#OTH = rf.Primitive() #Has other variant with exactly the same set of mapped positions on NCBI refernce assembly.
#CFL = rf.Primitive() #Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies.
#ASP = rf.Primitive() #Is Assembly specific. This is set if the variant only maps to one assembly
MUT = rf.Primitive() #Is mutation (journal citation, explicit fact): a low frequency variation that is cited in journal and other reputable sources
VLD = rf.Primitive() #Is Validated. This bit is set if the variant has 2+ minor allele count based on frequency or genotype data.
G5A = rf.Primitive() #>5% minor allele frequency in each and all populations
G5 = rf.Primitive() #>5% minor allele frequency in 1+ populations
#HD = rf.Primitive() #Marker is on high density genotyping kit (50K density or greater). The variant may have phenotype associations present in dbGaP.
#GNO = rf.Primitive() #Genotypes available. The variant has individual genotype (in SubInd table).
KGValidated = rf.Primitive() #1000 Genome validated
#KGPhase1 = rf.Primitive() #1000 Genome phase 1 (incl. June Interim phase 1)
#KGPilot123 = rf.Primitive() #1000 Genome discovery all pilots 2010(1,2,3)
#KGPROD = rf.Primitive() #Has 1000 Genome submission
OTHERKG = rf.Primitive() #non-1000 Genome submission
#PH3 = rf.Primitive() #HAP_MAP Phase 3 genotyped: filtered, non-redundant
#CDA = rf.Primitive() #Variation is interrogated in a clinical diagnostic assay
#LSD = rf.Primitive() #Submitted from a locus-specific database
#MTP = rf.Primitive() #Microattribution/third-party annotation(TPA:GWAS,PAGE)
#OM = rf.Primitive() #Has OMIM/OMIA
#NOC = rf.Primitive() #Contig allele not present in variant allele list. The reference sequence allele at the mapped position is not present in the variant allele list, adjusted for orientation.
#WTD = rf.Primitive() #Is Withdrawn by submitter If one member ss is withdrawn by submitter, then this bit is set. If all member ss' are withdrawn, then the rs is deleted to SNPHistory
#NOV = rf.Primitive() #Rs cluster has non-overlapping allele sets. True when rs set has more than 2 alleles from different submissions and these sets share no alleles in common.
#CAF = rf.Primitive() #An ordered, comma delimited list of allele frequencies based on 1000Genomes, starting with the reference allele followed by alternate alleles as ordered in the ALT column. Where a 1000Genomes alternate allele is not in the dbSNPs alternate allele set, the allele is added to the ALT column. The minor allele is the second largest value in the list, and was previuosly reported in VCF as the GMAF. This is the GMAF reported on the RefSNP and EntrezSNP pages and VariationReporter
COMMON = rf.Primitive() #RS is a common SNP. A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency.
#CLNHGVS = rf.Primitive() #Variant names from HGVS. The order of these variants corresponds to the order of the info in the other clinical INFO tags.
#CLNALLE = rf.Primitive() #Variant alleles from REF or ALT columns. 0 is REF, 1 is the first ALT allele, etc. This is used to match alleles with other corresponding clinical (CLN) INFO tags. A value of -1 indicates that no allele was found to match a corresponding HGVS allele name.
#CLNSRC = rf.Primitive() #Variant Clinical Chanels
#CLNORIGIN = rf.Primitive() #Allele Origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other
#CLNSRCID = rf.Primitive() #Variant Clinical Channel IDs
#CLNSIG = rf.Primitive() #Variant Clinical Significance, 0 - unknown, 1 - untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other
#CLNDSDB = rf.Primitive() #Variant disease database name
#CLNDSDBID = rf.Primitive() #Variant disease database ID
#CLNDBN = rf.Primitive() #Variant disease name
#CLNACC = rf.Primitive() #Variant Accession and Versions
#FILTER_NC = rf.Primitive() #Inconsistent Genotype Submission For At Least One Sample
altAlias = 'ALT'
class TopHatSNP(SNP_INDEL) :
"A SNP from Top Hat, not implemented"
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
pass
================================================
FILE: pyGeno/SNPFiltering.py
================================================
import types
import configuration as conf
from tools import UsefulFunctions as uf
class Sequence_modifiers(object) :
"""Abtract Class. All sequence must inherit from me"""
def __init__(self, sources = {}) :
self.sources = sources
def addSource(self, name, snp) :
"Optional, you can keep a dict that records the polymorphims that were mixed together to make self. They are stored into self.sources"
self.sources[name] = snp
class SequenceSNP(Sequence_modifiers) :
"""Represents a SNP to be applied to the sequence"""
def __init__(self, alleles, sources = {}) :
Sequence_modifiers.__init__(self, sources)
if type(alleles) is types.ListType :
self.alleles = uf.encodePolymorphicNucleotide(''.join(alleles))
else :
self.alleles = uf.encodePolymorphicNucleotide(alleles)
class SequenceInsert(Sequence_modifiers) :
"""Represents an Insertion to be applied to the sequence"""
def __init__(self, bases, sources = {}, ref = '-') :
Sequence_modifiers.__init__(self, sources)
self.bases = bases
self.offset = 0
# Allow to use format like C/CCTGGAA(dbSNP) or CCT/CCTGGAA(samtools)
if ref != '-':
if ref == bases[:len(ref)]:
self.offset = len(ref)
self.bases = self.bases[self.offset:]
#-1 because if the insertion are after the last nuc we go out of table
self.offset -= 1
else:
raise NotImplemented("This format of Insetion is not accepted. Please change your format, or implement your format in pyGeno.")
class SequenceDel(Sequence_modifiers) :
"""Represents a Deletion to be applied to the sequence"""
def __init__(self, length, sources = {}, ref = None, alt = '-') :
Sequence_modifiers.__init__(self, sources)
self.length = length
self.offset = 0
# Allow to use format like CCTGGAA/C(dbSNP) or CCTGGAA/CCT(samtools)
if alt != '-':
if ref is not None:
if alt == ref[:len(alt)]:
self.offset = len(alt)
self.length = self.length - len(alt)
else:
raise NotImplemented("This format of Deletion is not accepted. Please change your format, or implement your format in pyGeno.")
else:
raise Exception("You need to add a ref sequence in your call of SequenceDel. Or implement your format in pyGeno.")
class SNPFilter(object) :
"""Abtract Class. All filters must inherit from me"""
def __init__(self) :
pass
def filter(self, chromosome, **kwargs) :
raise NotImplemented("Must be implemented in child")
class DefaultSNPFilter(SNPFilter) :
"""
Default filtering object, does not filter anything. Doesn't apply insertions or deletions.
This is also a template that you can use for own filters. A prototype for a custom filter might be::
class MyFilter(SNPFilter) :
def __init__(self, thres) :
self.thres = thres
def filter(chromosome, SNP_Set1 = None, SNP_Set2 = None ) :
if SNP_Set1.alt is not None and (SNP_Set1.alt == SNP_Set2.alt) and SNP_Set1.Qmax_gt > self.thres :
return SequenceSNP(SNP_Set1.alt)
return None
Where SNP_Set1 and SNP_Set2 are the actual names of the snp sets supplied to the genome. In the context of the function
they represent single polymorphisms, or lists of polymorphisms, derived from thoses sets that occur at the same position.
Whatever goes on into the function is absolutely up to you, but in the end, it must return an object of one of the following classes:
* SequenceSNP
* SequenceInsert
* SequenceDel
* None
"""
def __init__(self) :
SNPFilter.__init__(self)
def filter(self, chromosome, **kwargs) :
"""The default filter mixes applied all SNPs and ignores Insertions and Deletions."""
def appendAllele(alleles, sources, snp) :
pos = snp.start
if snp.alt[0] == '-' :
pass
# print warn % ('DELETION', snpSet, snp.start, snp.chromosomeNumber)
elif snp.ref[0] == '-' :
pass
# print warn % ('INSERTION', snpSet, snp.start, snp.chromosomeNumber)
else :
sources[snpSet] = snp
alleles.append(snp.alt) #if not an indel append the polymorphism
refAllele = chromosome.refSequence[pos]
alleles.append(refAllele)
sources['ref'] = refAllele
return alleles, sources
warn = 'Warning: the default snp filter ignores indels. IGNORED %s of SNP set: %s at pos: %s of chromosome: %s'
sources = {}
alleles = []
for snpSet, data in kwargs.iteritems() :
if type(data) is list :
for snp in data :
alleles, sources = appendAllele(alleles, sources, snp)
else :
allels, sources = appendAllele(alleles, sources, data)
#appends the refence allele to the lot
#optional we keep a record of the polymorphisms that were used during the process
return SequenceSNP(alleles, sources = sources)
================================================
FILE: pyGeno/Transcript.py
================================================
import configuration as conf
from pyGenoObjectBases import *
import rabaDB.fields as rf
from tools import UsefulFunctions as uf
from Exon import *
from SNP import SNP_INDEL
from tools.BinarySequence import NucBinarySequence
class Transcript_Raba(pyGenoRabaObject) :
"""The wrapped Raba object that really holds the data"""
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
id = rf.Primitive()
name = rf.Primitive()
length = rf.Primitive()
start = rf.Primitive()
end = rf.Primitive()
coding = rf.Primitive()
biotype = rf.Primitive()
selenocysteine = rf.RList()
genome = rf.RabaObject('Genome_Raba')
chromosome = rf.RabaObject('Chromosome_Raba')
gene = rf.RabaObject('Gene_Raba')
protein = rf.RabaObject('Protein_Raba')
exons = rf.Relation('Exon_Raba')
def _curate(self) :
if self.name != None :
self.name = self.name.upper()
self.length = abs(self.end - self.start)
have_CDS_start = False
have_CDS_end = False
for exon in self.exons :
if exon.CDS_start is not None :
have_CDS_start = True
if exon.CDS_end is not None :
have_CDS_end = True
if have_CDS_start and have_CDS_end :
self.coding = True
else :
self.coding = False
class Transcript(pyGenoRabaObjectWrapper) :
"""The wrapper for playing with Transcripts"""
_wrapped_class = Transcript_Raba
def __init__(self, *args, **kwargs) :
pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
self.exons = RLWrapper(self, Exon, self.wrapped_object.exons)
self._load_sequencesTriggers = set(["UTR5", "UTR3", "cDNA", "sequence", "data"])
self.exonsDict = {}
def _makeLoadQuery(self, objectType, *args, **coolArgs) :
if issubclass(objectType, SNP_INDEL) :
# conf.db.enableDebug(True)
f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
coolArgs['species'] = self.genome.species
coolArgs['chromosomeNumber'] = self.chromosome.number
coolArgs['start >='] = self.start
coolArgs['start <'] = self.end
if len(args) > 0 and type(args[0]) is types.ListType :
for a in args[0] :
if type(a) is types.DictType :
f.addFilter(**a)
else :
f.addFilter(*args, **coolArgs)
return f
return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
def _load_data(self) :
def getV(k) :
return pyGenoRabaObjectWrapper.__getattribute__(self, k)
def setV(k, v) :
return pyGenoRabaObjectWrapper.__setattr__(self, k, v)
self.data = []
cDNA = []
UTR5 = []
UTR3 = []
exons = []
prime5 = True
for ee in self.wrapped_object.exons :
e = pyGenoRabaObjectWrapper_metaclass._wrappers[Exon_Raba](wrapped_object_and_bag = (ee, getV('bagKey')))
self.exonsDict[(e.start, e.end)] = e
exons.append(e)
self.data.extend(e.data)
if e.hasCDS() :
UTR5.append(''.join(e.UTR5))
if self.selenocysteine is not None:
for position in self.selenocysteine:
if e.CDS_start <= position <= e.CDS_end:
if e.strand == '+':
ajusted_position = position - e.CDS_start
else:
ajusted_position = e.CDS_end - position - 3
if e.CDS[ajusted_position] == 'T':
e.CDS = list(e.CDS)
e.CDS[ajusted_position] = '!'
if len(cDNA) == 0 and e.frame != 0 :
e.CDS = e.CDS[e.frame:]
if e.strand == '+':
e.CDS_start += e.frame
else:
e.CDS_end -= e.frame
if len(e.CDS):
cDNA.append(''.join(e.CDS))
UTR3.append(''.join(e.UTR3))
prime5 = False
else :
if prime5 :
UTR5.append(''.join(e.data))
else :
UTR3.append(''.join(e.data))
sequence = ''.join(self.data)
cDNA = ''.join(cDNA)
UTR5 = ''.join(UTR5)
UTR3 = ''.join(UTR3)
setV('exons', exons)
setV('sequence', sequence)
setV('cDNA', cDNA)
setV('UTR5', UTR5)
setV('UTR3', UTR3)
if len(cDNA) > 0 and len(cDNA) % 3 != 0 :
setV('flags', {'DUBIOUS' : True, 'cDNA_LEN_NOT_MULT_3': True})
else :
setV('flags', {'DUBIOUS' : False, 'cDNA_LEN_NOT_MULT_3': False})
def _load_bin_sequence(self) :
self.bin_sequence = NucBinarySequence(self.sequence)
self.bin_UTR5 = NucBinarySequence(self.UTR5)
self.bin_cDNA = NucBinarySequence(self.cDNA)
self.bin_UTR3 = NucBinarySequence(self.UTR3)
def getNucleotideCodon(self, cdnaX1) :
"""Returns the entire codon of the nucleotide at pos cdnaX1 in the cdna, and the position of that nocleotide in the codon"""
return uf.getNucleotideCodon(self.cDNA, cdnaX1)
def getCodon(self, i) :
"""returns the ith codon"""
return self.getNucleotideCodon(i*3)[0]
def iterCodons(self) :
"""iterates through the codons"""
for i in range(len(self.cDNA)/3) :
yield self.getCodon(i)
def find(self, sequence) :
"""return the position of the first occurance of sequence"""
return self.bin_sequence.find(sequence)
def findAll(self, sequence):
"""Returns a list of all positions where sequence was found"""
return self.bin_sequence.findAll(sequence)
def findIncDNA(self, sequence) :
"""return the position of the first occurance of sequence"""
return self.bin_cDNA.find(sequence)
def findAllIncDNA(self, sequence) :
"""Returns a list of all positions where sequence was found in the cDNA"""
return self.bin_cDNA.findAll(sequence)
def getcDNALength(self) :
"""returns the length of the cDNA"""
return len(self.cDNA)
def findInUTR5(self, sequence) :
"""return the position of the first occurance of sequence in the 5'UTR"""
return self.bin_UTR5.find(sequence)
def findAllInUTR5(self, sequence) :
"""Returns a list of all positions where sequence was found in the 5'UTR"""
return self.bin_UTR5.findAll(sequence)
def getUTR5Length(self) :
"""returns the length of the 5'UTR"""
return len(self.bin_UTR5)
def findInUTR3(self, sequence) :
"""return the position of the first occurance of sequence in the 3'UTR"""
return self.bin_UTR3.find(sequence)
def findAllInUTR3(self, sequence) :
"""Returns a lits of all positions where sequence was found in the 3'UTR"""
return self.bin_UTR3.findAll(sequence)
def getUTR3Length(self) :
"""returns the length of the 3'UTR"""
return len(self.bin_UTR3)
def getNbCodons(self) :
"""returns the number of codons in the transcript"""
return len(self.cDNA)/3
def __getattribute__(self, name) :
return pyGenoRabaObjectWrapper.__getattribute__(self, name)
def __getitem__(self, i) :
return self.sequence[i]
def __len__(self) :
return len(self.sequence)
def __str__(self) :
return """Transcript, id: %s, name: %s > %s""" %(self.id, self.name, str(self.gene))
================================================
FILE: pyGeno/__init__.py
================================================
__all__ = ['Genome', 'Chromosome', 'Gene', 'Transcript', 'Exon', 'Protein', 'SNP']
from configuration import pyGeno_init
pyGeno_init()
================================================
FILE: pyGeno/bootstrap.py
================================================
import pyGeno.importation.Genomes as PG
import pyGeno.importation.SNPs as PS
from pyGeno.tools.io import printf
import os, tempfile, urllib, urllib2, json
import pyGeno.configuration as conf
this_dir, this_filename = os.path.split(__file__)
def listRemoteDatawraps(location = conf.pyGeno_REMOTE_LOCATION) :
"""Lists all the datawraps availabe from a remote a remote location."""
loc = location + "/datawraps.json"
response = urllib2.urlopen(loc)
js = json.loads(response.read())
return js
def printRemoteDatawraps(location = conf.pyGeno_REMOTE_LOCATION) :
"""
print all available datawraps from a remote location the location must have a datawraps.json in the following format::
{
"Ordered": {
"Reference genomes": {
"Human" : ["GRCh37.75", "GRCh38.78"],
"Mouse" : ["GRCm38.78"],
},
"SNPs":{
}
},
"Flat":{
"Reference genomes": {
"GRCh37.75": "Human.GRCh37.75.tar.gz",
"GRCh38.78": "Human.GRCh37.75.tar.gz",
"GRCm38.78": "Mouse.GRCm38.78.tar.gz"
},
"SNPs":{
}
}
}
"""
l = listRemoteDatawraps(location)
printf("Available datawraps for bootstraping\n")
print json.dumps(l["Ordered"], sort_keys=True, indent=4, separators=(',', ': '))
def _DW(name, url) :
packageDir = tempfile.mkdtemp(prefix = "pyGeno_remote_")
printf("~~~:>\n\tDownloading datawrap: %s..." % name)
finalFile = os.path.normpath('%s/%s' %(packageDir, name))
urllib.urlretrieve (url, finalFile)
printf('\tdone.\n~~~:>')
return finalFile
def importRemoteGenome(name, batchSize = 100) :
"""Import a genome available from http://pygeno.iric.ca (might work)."""
try :
dw = listRemoteDatawraps()["Flat"]["Reference genomes"][name]
except AttributeError :
raise AttributeError("There's no remote genome datawrap by the name of: '%s'" % name)
finalFile = _DW(name, dw["url"])
PG.importGenome(finalFile, batchSize)
def importRemoteSNPs(name) :
"""Import a SNP set available from http://pygeno.iric.ca (might work)."""
try :
dw = listRemoteDatawraps()["Flat"]["SNPs"]
except AttributeError :
raise AttributeError("There's no remote genome datawrap by the name of: '%s'" % name)
finalFile = _DW(name, dw["url"])
PS.importSNPs(finalFile)
def listDatawraps() :
"""Lists all the datawraps pyGeno comes with"""
l = {"Genomes" : [], "SNPs" : []}
for f in os.listdir(os.path.join(this_dir, "bootstrap_data/genomes")) :
if f.find(".tar.gz") > -1 :
l["Genomes"].append(f)
for f in os.listdir(os.path.join(this_dir, "bootstrap_data/SNPs")) :
if f.find(".tar.gz") > -1 :
l["SNPs"].append(f)
return l
def printDatawraps() :
"""print all available datawraps for bootstraping"""
l = listDatawraps()
printf("Available datawraps for boostraping\n")
for k, v in l.iteritems() :
printf(k)
printf("~"*len(k) + "|")
for vv in v :
printf(" "*len(k) + "|" + "~~~:> " + vv)
printf('\n')
def importGenome(name, batchSize = 100) :
"""Import a genome shipped with pyGeno. Most of the datawraps only contain URLs towards data provided by third parties."""
path = os.path.join(this_dir, "bootstrap_data", "genomes/" + name)
PG.importGenome(path, batchSize)
def importSNPs(name) :
"""Import a SNP set shipped with pyGeno. Most of the datawraps only contain URLs towards data provided by third parties."""
path = os.path.join(this_dir, "bootstrap_data", "SNPs/" + name)
PS.importSNPs(path)
================================================
FILE: pyGeno/bootstrap_data/SNPs/Human_agnostic.dummySRY_indels/manifest.ini
================================================
[package_infos]
description = For testing purposes. All polymorphisms at the same position
maintainer = Tariq Daouda
maintainer_contact = tariq.daouda@umontreal.ca
version = 1
[set_infos]
species = human
name = dummySRY_AGN_indels
type = Agnostic
source = my place at IRIC
[snps]
filename = snps.txt
================================================
FILE: pyGeno/bootstrap_data/SNPs/Human_agnostic.dummySRY_indels/snps.txt
================================================
chromosomeNumber uniqueId start end ref alleles quality caller
Y 1 2655643 2655646 T AG 30 test
Y 2 2655643 2655647 - AG 30 test
Y 3 2655643 2655650 TT - 30 test
================================================
FILE: pyGeno/bootstrap_data/SNPs/__init__.py
================================================
================================================
FILE: pyGeno/bootstrap_data/__init__.py
================================================
================================================
FILE: pyGeno/bootstrap_data/genomes/__init__.py
================================================
================================================
FILE: pyGeno/configuration.py
================================================
import sys, os, time
from ConfigParser import SafeConfigParser
import rabaDB.rabaSetup
import rabaDB.Raba
class PythonVersionError(Exception) :
pass
pyGeno_FACE = "~-~-:>"
pyGeno_BRANCH = "V2"
pyGeno_VERSION_NAME = 'Lean Viper!'
pyGeno_VERSION_RELEASE_LEVEL = 'Release'
pyGeno_VERSION_NUMBER = 14.09
pyGeno_VERSION_BUILD_TIME = time.ctime(os.path.getmtime(__file__))
pyGeno_RABA_NAMESPACE = 'pyGenoRaba'
pyGeno_SETTINGS_DIR = os.path.normpath(os.path.expanduser('~/.pyGeno/'))
pyGeno_SETTINGS_PATH = None
pyGeno_RABA_DBFILE = None
pyGeno_DATA_PATH = None
pyGeno_REMOTE_LOCATION = 'http://bioinfo.iric.ca/~daoudat/pyGeno_datawraps'
db = None #proxy for the raba database
dbConf = None #proxy for the raba database configuration
def version() :
"""returns a tuple describing pyGeno's current version"""
return (pyGeno_FACE, pyGeno_BRANCH, pyGeno_VERSION_NAME, pyGeno_VERSION_RELEASE_LEVEL, pyGeno_VERSION_NUMBER, pyGeno_VERSION_BUILD_TIME )
def prettyVersion() :
"""returns pyGeno's current version in a pretty human readable way"""
return "pyGeno %s Branch: %s, Name: %s, Release Level: %s, Version: %s, Build time: %s" % version()
def checkPythonVersion() :
"""pyGeno needs python 2.7+"""
if sys.version_info[0] < 2 or (sys.version_info[0] > 2 and sys.version_info[1] < 7) :
return False
return True
def getGenomeSequencePath(specie, name) :
return os.path.normpath(pyGeno_DATA_PATH+'/%s/%s' % (specie.lower(), name))
def createDefaultConfigFile() :
"""Creates a default configuration file"""
s = "[pyGeno_config]\nsettings_dir=%s\nremote_location=%s" % (pyGeno_SETTINGS_DIR, pyGeno_REMOTE_LOCATION)
f = open('%s/config.ini' % pyGeno_SETTINGS_DIR, 'w')
f.write(s)
f.close()
def getSettingsPath() :
"""Returns the path where the settings are stored"""
parser = SafeConfigParser()
try :
parser.read(os.path.normpath(pyGeno_SETTINGS_DIR+'/config.ini'))
return parser.get('pyGeno_config', 'settings_dir')
except :
createDefaultConfigFile()
return getSettingsPath()
def removeFromDBRegistery(obj) :
"""rabaDB keeps a record of loaded objects to ensure consistency between different queries.
This function removes an object from the registery"""
rabaDB.Raba.removeFromRegistery(obj)
def freeDBRegistery() :
"""rabaDB keeps a record of loaded objects to ensure consistency between different queries. This function empties the registery"""
rabaDB.Raba.freeRegistery()
def reload() :
"""reinitialize pyGeno"""
pyGeno_init()
def pyGeno_init() :
"""This function is automatically called at launch"""
global db, dbConf
global pyGeno_SETTINGS_PATH
global pyGeno_RABA_DBFILE
global pyGeno_DATA_PATH
if not checkPythonVersion() :
raise PythonVersionError("==> FATAL: pyGeno only works with python 2.7 and above, please upgrade your python version")
if not os.path.exists(pyGeno_SETTINGS_DIR) :
os.makedirs(pyGeno_SETTINGS_DIR)
pyGeno_SETTINGS_PATH = getSettingsPath()
pyGeno_RABA_DBFILE = os.path.normpath( os.path.join(pyGeno_SETTINGS_PATH, "pyGenoRaba.db") )
pyGeno_DATA_PATH = os.path.normpath( os.path.join(pyGeno_SETTINGS_PATH, "data") )
if not os.path.exists(pyGeno_SETTINGS_PATH) :
os.makedirs(pyGeno_SETTINGS_PATH)
if not os.path.exists(pyGeno_DATA_PATH) :
os.makedirs(pyGeno_DATA_PATH)
#launching the db
rabaDB.rabaSetup.RabaConfiguration(pyGeno_RABA_NAMESPACE, pyGeno_RABA_DBFILE)
db = rabaDB.rabaSetup.RabaConnection(pyGeno_RABA_NAMESPACE)
dbConf = rabaDB.rabaSetup.RabaConfiguration(pyGeno_RABA_NAMESPACE)
================================================
FILE: pyGeno/doc/Makefile
================================================
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = build
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest coverage gettext
help:
@echo "Please use \`make <target>' where <target> is one of"
@echo " html to make standalone HTML files"
@echo " dirhtml to make HTML files named index.html in directories"
@echo " singlehtml to make a single large HTML file"
@echo " pickle to make pickle files"
@echo " json to make JSON files"
@echo " htmlhelp to make HTML files and a HTML help project"
@echo " qthelp to make HTML files and a qthelp project"
@echo " devhelp to make HTML files and a Devhelp project"
@echo " epub to make an epub"
@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo " latexpdf to make LaTeX files and run them through pdflatex"
@echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
@echo " text to make text files"
@echo " man to make manual pages"
@echo " texinfo to make Texinfo files"
@echo " info to make Texinfo files and run them through makeinfo"
@echo " gettext to make PO message catalogs"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " xml to make Docutils-native XML files"
@echo " pseudoxml to make pseudoxml-XML files for display purposes"
@echo " linkcheck to check all external links for integrity"
@echo " doctest to run all doctests embedded in the documentation (if enabled)"
@echo " coverage to run coverage check of the documentation (if enabled)"
clean:
rm -rf $(BUILDDIR)/*
html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
dirhtml:
$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
singlehtml:
$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
@echo
@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
pickle:
$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
@echo
@echo "Build finished; now you can process the pickle files."
json:
$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
@echo
@echo "Build finished; now you can process the JSON files."
htmlhelp:
$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
@echo
@echo "Build finished; now you can run HTML Help Workshop with the" \
".hhp project file in $(BUILDDIR)/htmlhelp."
qthelp:
$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
@echo
@echo "Build finished; now you can run "qcollectiongenerator" with the" \
".qhcp project file in $(BUILDDIR)/qthelp, like this:"
@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/pyGeno.qhcp"
@echo "To view the help file:"
@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/pyGeno.qhc"
devhelp:
$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
@echo
@echo "Build finished."
@echo "To view the help file:"
@echo "# mkdir -p $$HOME/.local/share/devhelp/pyGeno"
@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/pyGeno"
@echo "# devhelp"
epub:
$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
latex:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
latexpdf:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
latexpdfja:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through platex and dvipdfmx..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
text:
$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
@echo
@echo "Build finished. The text files are in $(BUILDDIR)/text."
man:
$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
@echo
@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
texinfo:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo
@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
@echo "Run \`make' in that directory to run these through makeinfo" \
"(use \`make info' here to do that automatically)."
info:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo "Running Texinfo files through makeinfo..."
make -C $(BUILDDIR)/texinfo info
@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
gettext:
$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
@echo
@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
changes:
$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
@echo
@echo "The overview file is in $(BUILDDIR)/changes."
linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."
doctest:
$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
coverage:
$(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage
@echo "Testing of coverage in the sources finished, look at the " \
"results in $(BUILDDIR)/coverage/python.txt."
xml:
$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
@echo
@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
pseudoxml:
$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
@echo
@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
================================================
FILE: pyGeno/doc/make.bat
================================================
@ECHO OFF
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set BUILDDIR=build
set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source
set I18NSPHINXOPTS=%SPHINXOPTS% source
if NOT "%PAPER%" == "" (
set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%
)
if "%1" == "" goto help
if "%1" == "help" (
:help
echo.Please use `make ^<target^>` where ^<target^> is one of
echo. html to make standalone HTML files
echo. dirhtml to make HTML files named index.html in directories
echo. singlehtml to make a single large HTML file
echo. pickle to make pickle files
echo. json to make JSON files
echo. htmlhelp to make HTML files and a HTML help project
echo. qthelp to make HTML files and a qthelp project
echo. devhelp to make HTML files and a Devhelp project
echo. epub to make an epub
echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter
echo. text to make text files
echo. man to make manual pages
echo. texinfo to make Texinfo files
echo. gettext to make PO message catalogs
echo. changes to make an overview over all changed/added/deprecated items
echo. xml to make Docutils-native XML files
echo. pseudoxml to make pseudoxml-XML files for display purposes
echo. linkcheck to check all external links for integrity
echo. doctest to run all doctests embedded in the documentation if enabled
echo. coverage to run coverage check of the documentation if enabled
goto end
)
if "%1" == "clean" (
for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
del /q /s %BUILDDIR%\*
goto end
)
REM Check if sphinx-build is available and fallback to Python version if any
%SPHINXBUILD% 2> nul
if errorlevel 9009 goto sphinx_python
goto sphinx_ok
:sphinx_python
set SPHINXBUILD=python -m sphinx.__init__
%SPHINXBUILD% 2> nul
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
:sphinx_ok
if "%1" == "html" (
%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/html.
goto end
)
if "%1" == "dirhtml" (
%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
goto end
)
if "%1" == "singlehtml" (
%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
goto end
)
if "%1" == "pickle" (
%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can process the pickle files.
goto end
)
if "%1" == "json" (
%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can process the JSON files.
goto end
)
if "%1" == "htmlhelp" (
%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can run HTML Help Workshop with the ^
.hhp project file in %BUILDDIR%/htmlhelp.
goto end
)
if "%1" == "qthelp" (
%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can run "qcollectiongenerator" with the ^
.qhcp project file in %BUILDDIR%/qthelp, like this:
echo.^> qcollectiongenerator %BUILDDIR%\qthelp\pyGeno.qhcp
echo.To view the help file:
echo.^> assistant -collectionFile %BUILDDIR%\qthelp\pyGeno.ghc
goto end
)
if "%1" == "devhelp" (
%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished.
goto end
)
if "%1" == "epub" (
%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The epub file is in %BUILDDIR%/epub.
goto end
)
if "%1" == "latex" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
if errorlevel 1 exit /b 1
echo.
echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "latexpdf" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
cd %BUILDDIR%/latex
make all-pdf
cd %~dp0
echo.
echo.Build finished; the PDF files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "latexpdfja" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
cd %BUILDDIR%/latex
make all-pdf-ja
cd %~dp0
echo.
echo.Build finished; the PDF files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "text" (
%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The text files are in %BUILDDIR%/text.
goto end
)
if "%1" == "man" (
%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The manual pages are in %BUILDDIR%/man.
goto end
)
if "%1" == "texinfo" (
%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.
goto end
)
if "%1" == "gettext" (
%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The message catalogs are in %BUILDDIR%/locale.
goto end
)
if "%1" == "changes" (
%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
if errorlevel 1 exit /b 1
echo.
echo.The overview file is in %BUILDDIR%/changes.
goto end
)
if "%1" == "linkcheck" (
%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
if errorlevel 1 exit /b 1
echo.
echo.Link check complete; look for any errors in the above output ^
or in %BUILDDIR%/linkcheck/output.txt.
goto end
)
if "%1" == "doctest" (
%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
if errorlevel 1 exit /b 1
echo.
echo.Testing of doctests in the sources finished, look at the ^
results in %BUILDDIR%/doctest/output.txt.
goto end
)
if "%1" == "coverage" (
%SPHINXBUILD% -b coverage %ALLSPHINXOPTS% %BUILDDIR%/coverage
if errorlevel 1 exit /b 1
echo.
echo.Testing of coverage in the sources finished, look at the ^
results in %BUILDDIR%/coverage/python.txt.
goto end
)
if "%1" == "xml" (
%SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The XML files are in %BUILDDIR%/xml.
goto end
)
if "%1" == "pseudoxml" (
%SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml.
goto end
)
:end
================================================
FILE: pyGeno/doc/source/bootstraping.rst
================================================
Bootstraping
==================================
pyGeno can be quick-started by importing these built-in data wraps.
.. automodule:: bootstrap
:members:
================================================
FILE: pyGeno/doc/source/citing.rst
================================================
Citing
=========
If you are using pyGeno please mention it to the rest of the universe by including a link to: https://github.com/tariqdaouda/pyGeno
================================================
FILE: pyGeno/doc/source/conf.py
================================================
# -*- coding: utf-8 -*-
#
# pyGeno documentation build configuration file, created by
# sphinx-quickstart on Thu Nov 6 16:45:34 2014.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
import sys
import os
import sphinx_rtd_theme
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('../..'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.doctest',
'sphinx.ext.intersphinx',
'sphinx.ext.todo',
'sphinx.ext.coverage',
'sphinx.ext.mathjax',
'sphinx.ext.ifconfig',
'sphinx.ext.viewcode',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
#source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = u'pyGeno'
copyright = u'2014, Tariq Daouda'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '1.x'
# The full version, including alpha/beta/rc tags.
release = '1.2.x'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []
# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
# html_theme = "default"
# html_theme_options = {
# "rightsidebar":"true",
# "stickysidebar" : "false",
# }
html_theme = "sphinx_rtd_theme"
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}
# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
html_logo = "logo.png"
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# If false, no module index is generated.
#html_domain_indices = True
# If false, no index is generated.
#html_use_index = True
# If true, the index is split into individual pages for each letter.
#html_split_index = False
# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# Language to be used for generating the HTML full-text search index.
# Sphinx supports the following languages:
# 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'
# 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr'
#html_search_language = 'en'
# A dictionary with options for the search language support, empty by default.
# Now only 'ja' uses this config value
#html_search_options = {'type': 'default'}
# The name of a javascript file (relative to the configuration directory) that
# implements a search results scorer. If empty, the default will be used.
#html_search_scorer = 'scorer.js'
# Output file base name for HTML help builder.
htmlhelp_basename = 'pyGenodoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#'preamble': '',
# Latex figure (float) alignment
#'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
('index', 'pyGeno.tex', u'pyGeno Documentation',
u'Tariq Daouda', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
# latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# If true, show page references after internal links.
#latex_show_pagerefs = False
# If true, show URL addresses after external links.
#latex_show_urls = False
# Documents to append as an appendix to all manuals.
#latex_appendices = []
# If false, no module index is generated.
#latex_domain_indices = True
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'pygeno', u'pyGeno Documentation',
[u'Tariq Daouda'], 1)
]
# If true, show URL addresses after external links.
#man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'pyGeno', u'pyGeno Documentation',
u'Tariq Daouda', 'pyGeno', 'One line description of project.',
'Miscellaneous'),
]
# Documents to append as an appendix to all manuals.
#texinfo_appendices = []
# If false, no module index is generated.
#texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'
# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False
# -- Options for Epub output ----------------------------------------------
# Bibliographic Dublin Core info.
epub_title = u'pyGeno'
epub_author = u'Tariq Daouda'
epub_publisher = u'Tariq Daouda'
epub_copyright = u'2014, Tariq Daouda'
# The basename for the epub file. It defaults to the project name.
#epub_basename = u'pyGeno'
# The HTML theme for the epub output. Since the default themes are not optimized
# for small screen space, using the same theme for HTML and epub output is
# usually not wise. This defaults to 'epub', a theme designed to save visual
# space.
#epub_theme = 'epub'
# The language of the text. It defaults to the language option
# or 'en' if the language is not set.
#epub_language = ''
# The scheme of the identifier. Typical schemes are ISBN or URL.
#epub_scheme = ''
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#epub_identifier = ''
# A unique identification for the text.
#epub_uid = ''
# A tuple containing the cover image and cover page html template filenames.
#epub_cover = ()
# A sequence of (type, uri, title) tuples for the guide element of content.opf.
#epub_guide = ()
# HTML files that should be inserted before the pages created by sphinx.
# The format is a list of tuples containing the path and title.
#epub_pre_files = []
# HTML files shat should be inserted after the pages created by sphinx.
# The format is a list of tuples containing the path and title.
#epub_post_files = []
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# The depth of the table of contents in toc.ncx.
#epub_tocdepth = 3
# Allow duplicate toc entries.
#epub_tocdup = True
# Choose between 'default' and 'includehidden'.
#epub_tocscope = 'default'
# Fix unsupported image types using the PIL.
#epub_fix_images = False
# Scale large images.
#epub_max_image_width = 0
# How to display URL addresses: 'footnote', 'no', or 'inline'.
#epub_show_urls = 'inline'
# If false, no index is generated.
#epub_use_index = True
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {'http://docs.python.org/': None}
================================================
FILE: pyGeno/doc/source/datawraps.rst
================================================
Datawraps
=========
Datawraps are used by pyGeno to import data into it's database. All reference genomes are downloaded from Ensembl, dbSNP data from dbSNP.
The :doc:`/bootstraping` module has functions to import datawraps shipped with pyGeno.
Datawraps can either be tar.gz.archives or folders.
Importation
-----------
Here's how you import a reference genome datawrap::
from pyGeno.importation.Genomes import *
importGenome("my_datawrap.tar.gz")
And a SNP set datawrap::
from pyGeno.importation.SNPs import *
importSNPs("my_datawrap.tar.gz")
Creating you own datawraps
--------------------------
For polymorphims, create a file called **manifest.ini** with the following format::
[package_infos]
description = SNPs for testing purposes
maintainer = Tariq Daouda
maintainer_contact = tariq.daouda [at] umontreal
version = 1
[set_infos]
species = human
name = mySNPSET
type = Agnostic # or CasavaSNP or dbSNPSNP
source = Where do these snps come from?
[snps]
filename = snps.txt # or http://www.example.com/snps.txt or ftp://www.example.com/snps.txt if you chose not to include the file in the archive
And compress the **manifest.ini** file along sith the snps.txt (if you chose to include it and not to specify an url) into a tar.gz archive.
Natively pyGeno supports dbSNP and casava(snp.txt), but it also has its own polymorphism file format (AgnosticSNP) wich is simply a tab delemited file in the following format::
chromosomeNumber uniqueId start end ref alleles quality caller
Y 1 2655643 2655644 T AG 30 TopHat
Y 2 2655645 2655647 - AG 28 TopHat
Y 3 2655648 2655650 TT - 10 TopHat
Even tough all field are mandatory, the only ones that are critical for pyGeno to be able insert polymorphisms at the right places are: *chromosomeNumber* and *start*. All the other fields are non critical and can follow any convention you wish to apply to them, including the *end* field. You can choose the convention that suits best the query that you are planning to make on SNPs through .get(), or the way you intend to filter them using filtering objtecs.
For genomes, the manifet.ini file looks like this::
[package_infos]
description = Test package. This package installs only chromosome Y of mus musculus
maintainer = Tariq Daouda
maintainer_contact = tariq.daouda [at] umontreal
version = GRCm38.73
[genome]
species = Mus_musculus
name = GRCm38_test
source = http://useast.ensembl.org/info/data/ftp/index.html
[chromosome_files]
Y = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz # or an url such as ftp://... or http://
[gene_set]
gtf = Mus_musculus.GRCm38.73_Y-only.gtf.gz # or an url such as ftp://... or http://
File URLs for refercence genomes can be found on `Ensembl: http://useast.ensembl.org/info/data/ftp/index.html`_
To learn more about datawraps and how to make your own you can have a look at :doc:`/importation`, and the Wiki_.
.. _Wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-datawrap-to-import-your-data
.. _`Ensembl: http://useast.ensembl.org/info/data/ftp/index.html`: http://useast.ensembl.org/info/data/ftp/index.html
================================================
FILE: pyGeno/doc/source/importation.rst
================================================
Importation
===========
pyGeno's database is populated by importing tar.gz compressed archives called *datawraps*. An importation is a one time step and once a datawrap has been imported it can be discarded with no concequences to the database.
Here's how you import a reference genome datawrap::
from pyGeno.importation.Genomes import *
importGenome("my_genome_datawrap.tar.gz")
And a SNP set datawrap::
from pyGeno.importation.SNPs import *
importSNPs("my_snp_datawrap.tar.gz")
pyGeno comes with a few datawraps that you can quickly import using the :doc:`/bootstraping` module.
You can find a list of datawraps to import here: :doc:`/datawraps`
You can also easily create your own by simply putting a bunch of URLs into a *manifest.ini* file and compressing int into a *tar.gz archive* (as explained below or on the Wiki_).
.. _Wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-datawrap-to-import-your-data
Genomes
-------
.. automodule:: importation.Genomes
:members:
Polymorphisms (SNPs and Indels)
-------------------------------
.. automodule:: importation.SNPs
:members:
================================================
FILE: pyGeno/doc/source/index.rst
================================================
.. pyGeno documentation master file, created by
sphinx-quickstart on Thu Nov 6 16:45:34 2014.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
.. image:: http://bioinfo.iric.ca/~daoudat/pyGeno/_static/logo.png
:alt: pyGeno's logo
pyGeno: A Python package for precision medicine and proteogenomics
===================================================================
.. image:: http://depsy.org/api/package/pypi/pyGeno/badge.svg
:alt: depsy
:target: http://depsy.org/package/python/pyGeno
.. image:: https://img.shields.io/badge/License-Apache%202.0-blue.svg
:target: https://opensource.org/licenses/Apache-2.0
.. image:: https://img.shields.io/badge/python-2.7-blue.svg
pyGeno's `lair is on Github`_.
.. _lair is on Github: http://www.github.com/tariqdaouda/pyGeno
Citing pyGeno:
--------------
Please cite this paper_.
.. _paper: http://f1000research.com/articles/5-381/v1
A Quick Intro:
-----------------
Even tough more and more research focuses on Personalized/Precision Medicine, treatments that are specically tailored to the patient, pyGeno is (to our knowlege) the only tool available that will gladly build your specific genomes for you and you give an easy access to them.
pyGeno allows you to create and work on **Personalized Genomes**: custom genomes that you create by combining a reference genome, sets of polymorphims and an optional filter.
pyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get
direct access to the DNA and Protein sequences of your patients/subjects.
To know more about how to create a Personalized Genome, have a look at the :doc:`/quickstart` section.
pyGeno can also function as a personal bioinformatic database for Ensembl, that runs directly into python, on your laptop, making faster and more reliable than any REST API. pyGeno makes extracting data such as gene sequences a breeze, and is designed to
be able cope with huge queries.
.. code::
from pyGeno.Genome import *
g = Genome(name = "GRCh37.75")
prot = g.get(Protein, id = 'ENSP00000438917')[0]
#print the protein sequence
print prot.sequence
#print the protein's gene biotype
print prot.gene.biotype
#print protein's transcript sequence
print prot.transcript.sequence
#fancy queries
for exons in g.get(Exons, {"CDS_start >": x1, "CDS_end <=" : x2, "chromosome.number" : "22"}) :
#print the exon's coding sequence
print exon.CDS
#print the exon's transcript sequence
print exon.transcript.sequence
#You can do the same for your subject specific genomes
#by combining a reference genome with polymorphisms
g = Genome(name = "GRCh37.75", SNPs = ["STY21_RNA"], SNPFilter = MyFilter())
Verbose Introduction
---------------------
pyGeno integrates:
* **Reference sequences** and annotations from **Ensembl**
* Genomic polymorphisms from the **dbSNP** database
* SNPs from **next-gen** sequencing
pyGeno is a python package that was designed to be:
* Fast to install. It has no dependencies but its own backend: `rabaDB`_.
* Fast to run and memory efficient, so you can use it on your laptop.
* Fast to use. No queries to foreign APIs all the data rests on your computer, so it is readily accessible when you need it.
* Fast to learn. One single function **get()** can do the job of several other tools at once.
It also comes with:
* Parsers for: FASTA, FASTQ, GTF, VCF, CSV.
* Useful tools for translation etc...
* Optimised genome indexation with *Segment Trees*.
* A funky *Progress Bar*.
One of the the coolest things about pyGeno is that it also allows to quickly create **personalized genomes**.
Genomes that you design yourself by combining a reference genome and SNP sets derived from dbSNP or next-gen sequencing.
pyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_), its logo is the work of the freelance designer `Sawssan Kaddoura`_.
For the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.
.. _rabaDB: https://github.com/tariqdaouda/rabaDB
.. _Tariq Daouda: http://www.tariqdaouda.com
.. _IRIC: http://www.iric.ca
.. _Sawssan Kaddoura: http://www.sawssankaddoura.com
.. _@tariqdaouda: https://www.twitter.com/tariqdaouda
Contents:
----------
.. toctree::
:maxdepth: 2
publications
quickstart
installation
bootstraping
querying
importation
datawraps
objects
snp_filter
tools
parsers
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
================================================
FILE: pyGeno/doc/source/installation.rst
================================================
Installation
=============
Unix (MacOS, Linux)
-------------------
The latest stable version is available from pypi::
pip install pyGeno
**Upgrade**::
pip install pyGeno --upgrade
If you're more adventurous, the bleeding edge version is available from github (look for the 'bloody' branch)::
git clone https://github.com/tariqdaouda/pyGeno.git
cd pyGeno
python setup.py develop
**Upgrade**::
git pull
Windows
-------
* Goto: https://www.python.org/downloads/ and download the installer for the lastest version of python 2.7
* Double click on the installer to begin installation
* Click on the windows start menu
* Type *"cmd"* and click on it to launch the command line interface
* In the command line interface type::
cd C:\Python27\Scripts
* Now type: pip install pyGeno
* Now click on the windows start menu. In the python 2.7 menu you can either launch *Python (Command line)* or *IDLE (Python GUI)*
* You can now go to: http://pygeno.iric.ca/quickstart.html and type the commands into either one of them
**UPGRADE:** to upgrade pyGeno to the latest version, launch *cmd* and type::
cd C:\Python27\Scripts
followed by::
pip install pyGeno --upgrade
================================================
FILE: pyGeno/doc/source/objects.rst
================================================
Objects
=======
With pyGeno you can manipulate familiar object in intuituive way. All the following classes except SNP inherit from pyGenoObjectWrapper and have therefor access to functions sur as get(), count(), ensureIndex()...
Base classes
-------------
Base classes are abstract and are not meant to be instanciated, they nonetheless implement most of the functions that classes such as Genome possess.
.. automodule:: pyGenoObjectBases
:members:
Genome
-------
.. automodule:: Genome
:members:
Chromosome
----------
.. automodule:: Chromosome
:members:
Gene
----
.. automodule:: Gene
:members:
Transcript
----------
.. automodule:: Transcript
:members:
Exon
----
.. automodule:: Exon
:members:
Protein
-------
.. automodule:: Protein
:members:
SNP
---
.. automodule:: SNP
:members:
================================================
FILE: pyGeno/doc/source/parsers.rst
================================================
Parsers
=======
PyGeno comes with a set of parsers that you can use independentely.
CSV
---
To read and write CSV files.
.. automodule:: tools.parsers.CSVTools
:members:
FASTA
-----
To read and write FASTA files.
.. automodule:: tools.parsers.FastaTools
:members:
FASTQ
-----
To read and write FASTQ files.
.. automodule:: tools.parsers.FastqTools
:members:
GTF
---
To read GTF files.
.. automodule:: tools.parsers.GTFTools
:members:
VCF
---
To read VCF files.
.. automodule:: tools.parsers.VCFTools
:members:
Casava
-------
To read casava files.
.. automodule:: tools.parsers.CasavaTools
:members:
================================================
FILE: pyGeno/doc/source/publications.rst
================================================
Publications
============
Please cite this one:
---------------------
`pyGeno: A Python package for precision medicine and proteogenomics. F1000Research, 2016`_
.. _`pyGeno: A Python package for precision medicine and proteogenomics. F1000Research, 2016`: http://f1000research.com/articles/5-381/v2
pyGeno was also used in the following studies:
----------------------------------------------
`MHC class I–associated peptides derive from selective regions of the human genome. J Clin Invest. 2016`_
.. _`MHC class I–associated peptides derive from selective regions of the human genome. J Clin Invest. 2016`: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5127664/
`Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames. Nat. Comm. 2015`_
.. _Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames. Nat. Comm. 2015: http://dx.doi.org/10.1038/ncomms10238
`Impact of genomic polymorphisms on the repertoire of human MHC class I-associated peptides. Nat. Comm. 2014`_
.. _Impact of genomic polymorphisms on the repertoire of human MHC class I-associated peptides. Nat. Comm. 2014: http://www.ncbi.nlm.nih.gov/pubmed/24714562
`MHC I-associated peptides preferentially derive from transcripts bearing miRNA response elements. Blood. 2012`_
.. _MHC I-associated peptides preferentially derive from transcripts bearing miRNA response elements. Blood. 2012: http://www.ncbi.nlm.nih.gov/pubmed/22438248
================================================
FILE: pyGeno/doc/source/querying.rst
================================================
Querying
=========
pyGeno is a personal database that you can query in many ways. Special emphasis has been placed upon ease of use, and you only need to remember two functions::
* get()
* help()
**get()** can be called from any pyGeno object to get any objects.
**help()** is you best friend when you get lost using **get()**. When called, it will give the list of all field that you can use in get queries. You can call it either of the class::
Gene.help()
Or on the object::
ref = Genome(name = "GRCh37.75")
g = ref.get(Gene, name = "TPST2")[0]
g.help()
Both will print::
'Available fields for Gene: end, name, chromosome, start, biotype, id, strand, genome'
================================================
FILE: pyGeno/doc/source/quickstart.rst
================================================
Quickstart
==========
Quick importation
-----------------
pyGeno's database is populated by importing data wraps.
pyGeno comes with a few datawraps, to get the list you can use:
.. code:: python
import pyGeno.bootstrap as B
B.printDatawraps()
.. code::
Available datawraps for bootstraping
SNPs
~~~~|
|~~~:> Human_agnostic.dummySRY.tar.gz
|~~~:> Human.dummySRY_casava.tar.gz
|~~~:> dbSNP142_human_GRCh37_common_all.tar.gz
|~~~:> dbSNP142_human_common_all.tar.gz
Genomes
~~~~~~~|
|~~~:> Human.GRCh37.75.tar.gz
|~~~:> Human.GRCh37.75_Y-Only.tar.gz
|~~~:> Human.GRCh38.78.tar.gz
|~~~:> Mouse.GRCm38.78.tar.gz
To get a list of remote datawraps that pyGeno can download for you, do:
.. code:: python
B.printRemoteDatawraps()
Importing whole genomes is a demanding process that take more than an hour and requires (according to tests)
at least 3GB of memory. Depending on your configuration, more might be required.
That being said importating a data wrap is a one time operation and once the importation is complete the datawrap
can be discarded without consequences.
The bootstrap module also has some handy functions for importing built-in packages.
Some of them just for playing around with pyGeno (**Fast importation** and **Small memory requirements**):
.. code:: python
import pyGeno.bootstrap as B
#Imports only the Y chromosome from the human reference genome GRCh37.75
#Very fast, requires even less memory. No download required.
B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
#A dummy datawrap for humans SNPs and Indels in pyGeno's AgnosticSNP format.
# This one has one SNP at the begining of the gene SRY
B.importSNPs("Human.dummySRY_casava.tar.gz")
And for more serious work, the whole reference genome.
.. code:: python
#Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.
B.importGenome("Human.GRCh38.78.tar.gz")
That's it, you can now print the sequences of all the proteins that a gene can produce::
from pyGeno.Genome import Genome
from pyGeno.Gene import Gene
from pyGeno.Protein import Protein
#the name of the genome is defined inside the package's manifest.ini file
ref = Genome(name = 'GRCh37.75')
#get returns a list of elements
gene = ref.get(Gene, name = 'SRY')[0]
for prot in gene.get(Protein) :
print prot.sequence
You can see pyGeno achitecture as a graph where everything is connected to everything. For instance you can do things such as::
gene = aProt.gene
trans = aProt.transcript
prot = anExon.protein
genome = anExon.genome
Queries
-------
PyGeno allows for several kinds of queries, here are some snippets::
#in this case both queries will yield the same result
myGene.get(Protein, id = "ENSID...")
myGenome.get(Protein, id = "ENSID...")
#even complex stuff
exons = myChromosome.get(Exons, {'start >=' : x1, 'stop <' : x2})
hlaGenes = myGenome.get(Gene, {'name like' : 'HLA'})
sry = myGenome.get(Transcript, { "gene.name" : 'SRY' })
To know the available fields for queries, there's a "help()" function::
Gene.help()
Faster queries
---------------
To speed up loops use iterGet()::
for prot in gene.iterGet(Protein) :
print prot.sequence
For more speed create indexes on the fields you need the most::
Gene.ensureGlobalIndex('name')
Getting sequences
-------------------
Anything that has a sequence can be indexed using the usual python list syntax::
protein[34] # for the 34th amino acid
protein[34:40] # for amino acids in [34, 40[
transcript[23] #for the 23rd nucleotide of the transcript
transcript[23:30] #for nucletotides in [23, 30[
transcript.cDNA[23:30] #the same but for the protein coding DNA (without the UTRs)
Transcripts, Proteins, Exons also have a *.sequence* attribute. This attribute is the string rendered sequence, it is perfect for printing but it may contain '/'s
in case of polymorphic sequence that you must
take into account in the indexing. On the other hand if you use indexes directly on the object (as shown in the snippet above) pyGeno will use a binary representaion
of the sequences thus the indexing is independent of the polymorphisms present in the sequences.
Personalized Genomes
--------------------
Personalized Genomes are a powerful feature that allow to work on the specific genomes and proteomes of your patients. You can even mix several SNPs together::
from pyGeno.Genome import Genome
#the name of the snp set is defined inside the datawraps's manifest.ini file
dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')
#you can also define a filter (ex: a quality filter) for the SNPs
dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())
#and even mix several snp sets
dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())
pyGeno allows you to customize the Polymorphisms that end up into the final sequences. It supports SNPs, Inserts and Deletions::
from pyGeno.SNPFiltering import SNPFilter
from pyGeno.SNPFiltering import SequenceSNP
class QMax_gt_filter(SNPFilter) :
def __init__(self, threshold) :
self.threshold = threshold
def filter(self, chromosome, dummySRY = None) :
if dummySRY.Qmax_gt > self.threshold :
#other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)
return SequenceSNP(dummySRY.alt)
return None #None means keep the reference allele
persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))
================================================
FILE: pyGeno/doc/source/snp_filter.rst
================================================
Filtering Polymorphisms (SNPs and Indels)
=========================================
Polymorphism filtering is an important part of personalized genomes. By creating your own filters you can easily taylor personalized genomes to your needs. The importaant thing to understand about the filtering process, is that it gives you complete freedom about should be inserted.
Once pyGeno finds a polymorphism, it automatically triggers the filter to know what should be inserted at this position, and that can be anything you choose.
.. automodule:: SNPFiltering
:members:
================================================
FILE: pyGeno/doc/source/tools.rst
================================================
Tools
=====
pyGeno provides a set of tools that can be used independentely. Here you'll find goodies for translation, indexation, and more.
Progress Bar
-------------
pyGeno's awesome progress bar, with logging capabilities and remaining time estimation.
.. automodule:: tools.ProgressBar
:members:
Useful functions
-----------------
This module is a bunch of handy bioinfo functions.
.. automodule:: tools.UsefulFunctions
:members:
Binary sequences
-----------------
To encode sequence in binary formats
.. automodule:: tools.BinarySequence
:members:
Segment tree
-------------
Segment trees are an optimised way to index a genome.
.. automodule:: tools.SegmentTree
:members:
Secure memory map
------------------
A write protected memory map.
.. automodule:: tools.SecureMmap
:members:
================================================
FILE: pyGeno/examples/__init__.py
================================================
================================================
FILE: pyGeno/examples/bootstraping.py
================================================
import pyGeno.bootstrap as B
#~ imports the whole human reference genome
#~ B.importHumanReference()
B.importHumanReferenceYOnly()
B.importDummySRY()
================================================
FILE: pyGeno/examples/snps_queries.py
================================================
from pyGeno.Genome import Genome
from pyGeno.Gene import Gene
from pyGeno.Transcript import Transcript
from pyGeno.Protein import Protein
from pyGeno.Exon import Exon
from pyGeno.SNPFiltering import SNPFilter
from pyGeno.SNPFiltering import SequenceSNP
def printing(gene) :
print "printing reference sequences\n-------"
for trans in gene.get(Transcript) :
print "\t-Transcript name:", trans.name
print "\t-Protein:", trans.protein.sequence
print "\t-Exons:"
for e in trans.exons :
print "\t\t Number:", e.number
print "\t\t-CDS:", e.CDS
print "\t\t-Strand:", e.strand
print "\t\t-CDS_start:", e.CDS_start
print "\t\t-CDS_end:", e.CDS_end
def printVs(refGene, presGene) :
print "Vs personalized sequences\n------"
#iterGet returns an iterator instead of a list (faster)
for trans in presGene.iterGet(Transcript) :
refProt = refGene.get(Protein, id = trans.protein.id)[0]
persProt = trans.protein
print persProt.id
print "\tref:" + refProt.sequence[:20] + "..."
print "\tper:" + persProt.sequence[:20] + "..."
print
def fancyExonQuery(gene) :
e = gene.get(Exon, {'CDS_start >' : 2655029, 'CDS_end <' : 2655200})[0]
print "An exon with a CDS in ']2655029, 2655200[':", e.id
class QMax_gt_filter(SNPFilter) :
def __init__(self, threshold) :
self.threshold = threshold
def filter(self, chromosome, dummySRY) :
if dummySRY.Qmax_gt > self.threshold :
#other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)
return SequenceSNP(dummySRY.alt)
return None #None means keep the reference allele
if __name__ == "__main__" :
refGenome = Genome(name = 'GRCh37.75_Y-Only')
persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))
geneRef = refGenome.get(Gene, name = 'SRY')[0]
persGene = persGenome.get(Gene, name = 'SRY')[0]
printing(geneRef)
print "\n"
printVs(geneRef, persGene)
print "\n"
fancyExonQuery(geneRef)
================================================
FILE: pyGeno/importation/Genomes.py
================================================
import os, glob, gzip, tarfile, shutil, time, sys, gc, cPickle, tempfile, urllib2
from contextlib import closing
from ConfigParser import SafeConfigParser
from pyGeno.tools.ProgressBar import ProgressBar
import pyGeno.configuration as conf
from pyGeno.Genome import *
from pyGeno.Chromosome import *
from pyGeno.Gene import *
from pyGeno.Transcript import *
from pyGeno.Exon import *
from pyGeno.Protein import *
from pyGeno.tools.parsers.GTFTools import GTFFile
from pyGeno.tools.ProgressBar import ProgressBar
from pyGeno.tools.io import printf
import gc
#~ import objgraph
def backUpDB() :
"""backup the current database version. automatically called by importGenome(). Returns the filename of the backup"""
st = time.ctime().replace(' ', '_')
fn = conf.pyGeno_RABA_DBFILE.replace('.db', '_%s-bck.db' % st)
shutil.copy2(conf.pyGeno_RABA_DBFILE, fn)
return fn
def _decompressPackage(packageFile) :
pFile = tarfile.open(packageFile)
packageDir = tempfile.mkdtemp(prefix = "pyGeno_import_")
if os.path.isdir(packageDir) :
shutil.rmtree(packageDir)
os.makedirs(packageDir)
for mem in pFile :
pFile.extract(mem, packageDir)
return packageDir
def _getFile(fil, directory) :
if fil.find("http://") == 0 or fil.find("ftp://") == 0 :
printf("Downloading file: %s..." % fil)
finalFile = os.path.normpath('%s/%s' %(directory, fil.split('/')[-1]))
# urllib.urlretrieve (fil, finalFile)
with closing(urllib2.urlopen(fil)) as r:
with open(finalFile, 'wb') as f:
shutil.copyfileobj(r, f)
printf('done.')
else :
finalFile = os.path.normpath('%s/%s' %(directory, fil))
return finalFile
def deleteGenome(species, name) :
"""Removes a genome from the database"""
printf('deleting genome (%s, %s)...' % (species, name))
conf.db.beginTransaction()
objs = []
allGood = True
try :
genome = Genome_Raba(name = name, species = species.lower())
objs.append(genome)
pBar = ProgressBar(label = 'preparing')
for typ in (Chromosome_Raba, Gene_Raba, Transcript_Raba, Exon_Raba, Protein_Raba) :
pBar.update()
f = RabaQuery(typ, namespace = genome._raba_namespace)
f.addFilter({'genome' : genome})
for e in f.iterRun() :
objs.append(e)
pBar.close()
pBar = ProgressBar(nbEpochs = len(objs), label = 'deleting objects')
for e in objs :
pBar.update()
e.delete()
pBar.close()
except KeyError as e :
#~ printf("\tWARNING, couldn't remove genome form db, maybe it's not there: ", e)
raise KeyError("\tWARNING, couldn't remove genome form db, maybe it's not there: ", e)
allGood = False
printf('\tdeleting folder')
try :
shutil.rmtree(conf.getGenomeSequencePath(species, name))
except OSError as e:
#~ printf('\tWARNING, Unable to delete folder: ', e)
OSError('\tWARNING, Unable to delete folder: ', e)
allGood = False
conf.db.endTransaction()
return allGood
def importGenome(packageFile, batchSize = 50, verbose = 0) :
"""Import a pyGeno genome package. A genome packages is folder or a tar.gz ball that contains at it's root:
* gziped fasta files for all chromosomes, or URLs from where them must be downloaded
* gziped GTF gene_set file from Ensembl, or an URL from where it must be downloaded
* a manifest.ini file such as::
[package_infos]
description = Test package. This package installs only chromosome Y of mus musculus
maintainer = Tariq Daouda
maintainer_contact = tariq.daouda [at] umontreal
version = GRCm38.73
[genome]
species = Mus_musculus
name = GRCm38_test
source = http://useast.ensembl.org/info/data/ftp/index.html
[chromosome_files]
Y = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz / or an url such as ftp://... or http://
[gene_set]
gtf = Mus_musculus.GRCm38.73_Y-only.gtf.gz / or an url such as ftp://... or http://
All files except the manifest can be downloaded from: http://useast.ensembl.org/info/data/ftp/index.html
A rollback is performed if an exception is caught during importation
batchSize sets the number of genes to parse before performing a database save. PCs with little ram like
small values, while those endowed with more memory may perform faster with higher ones.
Verbose must be an int [0, 4] for various levels of verbosity
"""
def reformatItems(items) :
s = str(items)
s = s.replace('[', '').replace(']', '').replace("',", ': ').replace('), ', '\n').replace("'", '').replace('(', '').replace(')', '')
return s
printf('Importing genome package: %s... (This may take a while)' % packageFile)
isDir = False
if not os.path.isdir(packageFile) :
packageDir = _decompressPackage(packageFile)
else :
isDir = True
packageDir = packageFile
parser = SafeConfigParser()
parser.read(os.path.normpath(packageDir+'/manifest.ini'))
packageInfos = parser.items('package_infos')
genomeName = parser.get('genome', 'name')
species = parser.get('genome', 'species')
genomeSource = parser.get('genome', 'source')
seqTargetDir = conf.getGenomeSequencePath(species.lower(), genomeName)
if os.path.isdir(seqTargetDir) :
raise KeyError("The directory %s already exists, Please call deleteGenome() first if you want to reinstall" % seqTargetDir)
gtfFile = _getFile(parser.get('gene_set', 'gtf'), packageDir)
chromosomesFiles = {}
chromosomeSet = set()
for key, fil in parser.items('chromosome_files') :
chromosomesFiles[key] = _getFile(fil, packageDir)
chromosomeSet.add(key)
try :
genome = Genome(name = genomeName, species = species)
except KeyError:
pass
else :
raise KeyError("There seems to be already a genome (%s, %s), please call deleteGenome() first if you want to reinstall it" % (genomeName, species))
genome = Genome_Raba()
genome.set(name = genomeName, species = species, source = genomeSource, packageInfos = packageInfos)
printf("Importing:\n\t%s\nGenome:\n\t%s\n..." % (reformatItems(packageInfos).replace('\n', '\n\t'), reformatItems(parser.items('genome')).replace('\n', '\n\t')))
chros = _importGenomeObjects(gtfFile, chromosomeSet, genome, batchSize, verbose)
os.makedirs(seqTargetDir)
startChro = 0
pBar = ProgressBar(nbEpochs = len(chros))
for chro in chros :
pBar.update(label = "Importing DNA, chro %s" % chro.number)
length = _importSequence(chro, chromosomesFiles[chro.number.lower()], seqTargetDir)
chro.start = startChro
chro.end = startChro+length
startChro = chro.end
chro.save()
pBar.close()
if not isDir :
shutil.rmtree(packageDir)
#~ objgraph.show_most_common_types(limit=20)
return True
#~ @profile
def _importGenomeObjects(gtfFilePath, chroSet, genome, batchSize, verbose = 0) :
"""verbose must be an int [0, 4] for various levels of verbosity"""
class Store(object) :
def __init__(self, conf) :
self.conf = conf
self.chromosomes = {}
self.genes = {}
self.transcripts = {}
self.proteins = {}
self.exons = {}
def batch_save(self) :
self.conf.db.beginTransaction()
for c in self.genes.itervalues() :
c.save()
conf.removeFromDBRegistery(c)
for c in self.transcripts.itervalues() :
c.save()
conf.removeFromDBRegistery(c.exons)
conf.removeFromDBRegistery(c)
for c in self.proteins.itervalues() :
c.save()
conf.removeFromDBRegistery(c)
self.conf.db.endTransaction()
del(self.genes)
del(self.transcripts)
del(self.proteins)
del(self.exons)
self.genes = {}
self.transcripts = {}
self.proteins = {}
self.exons = {}
gc.collect()
def save_chros(self) :
pBar = ProgressBar(nbEpochs = len(self.chromosomes))
for c in self.chromosomes.itervalues() :
pBar.update(label = 'Chr %s' % c.number)
c.save()
pBar.close()
printf('Importing gene set infos from %s...' % gtfFilePath)
printf('Backuping indexes...')
indexes = conf.db.getIndexes()
printf("Droping all your indexes, (don't worry i'll restore them later)...")
Genome_Raba.flushIndexes()
Chromosome_Raba.flushIndexes()
Gene_Raba.flushIndexes()
Transcript_Raba.flushIndexes()
Protein_Raba.flushIndexes()
Exon_Raba.flushIndexes()
printf("Parsing gene set...")
gtf = GTFFile(gtfFilePath, gziped = True)
printf('Done. Importation begins!')
store = Store(conf)
chroNumber = None
pBar = ProgressBar(nbEpochs = len(gtf))
for line in gtf :
chroN = line['seqname']
pBar.update(label = "Chr %s" % chroN)
if (chroN.upper() in chroSet or chroN.lower() in chroSet):
strand = line['strand']
gene_biotype = line['gene_biotype']
regionType = line['feature']
frame = line['frame']
start = int(line['start']) - 1
end = int(line['end'])
if start > end :
start, end = end, start
chroNumber = chroN.upper()
if chroNumber not in store.chromosomes :
store.chromosomes[chroNumber] = Chromosome_Raba()
store.chromosomes[chroNumber].set(genome = genome, number = chroNumber)
try :
geneId = line['gene_id']
geneName = line['gene_name']
except KeyError :
geneId = None
geneName = None
if verbose :
printf('Warning: no gene_id/name found in line %s' % gtf[i])
if geneId is not None :
if geneId not in store.genes :
if len(store.genes) > batchSize :
store.batch_save()
if verbose > 0 :
printf('\tGene %s, %s...' % (geneId, geneName))
store.genes[geneId] = Gene_Raba()
store.genes[geneId].set(genome = genome, id = geneId, chromosome = store.chromosomes[chroNumber], name = geneName, strand = strand, biotype = gene_biotype)
if start < store.genes[geneId].start or store.genes[geneId].start is None :
store.genes[geneId].start = start
if end > store.genes[geneId].end or store.genes[geneId].end is None :
store.genes[geneId].end = end
try :
transId = line['transcript_id']
transName = line['transcript_name']
try :
transcript_biotype = line['transcript_biotype']
except KeyError :
transcript_biotype = None
except KeyError :
transId = None
transName = None
if verbose > 2 :
printf('\t\tWarning: no transcript_id, name found in line %s' % gtf[i])
if transId is not None :
if transId not in store.transcripts :
if verbose > 1 :
printf('\t\tTranscript %s, %s...' % (transId, transName))
store.transcripts[transId] = Transcript_Raba()
store.transcripts[transId].set(genome = genome, id = transId, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), name = transName, biotype=transcript_biotype)
if start < store.transcripts[transId].start or store.transcripts[transId].start is None:
store.transcripts[transId].start = start
if end > store.transcripts[transId].end or store.transcripts[transId].end is None:
store.transcripts[transId].end = end
try :
protId = line['protein_id']
except KeyError :
protId = None
if verbose > 2 :
printf('Warning: no protein_id found in line %s' % gtf[i])
# Store selenocysteine positions in transcript
if regionType == 'Selenocysteine':
store.transcripts[transId].selenocysteine.append(start)
if protId is not None and protId not in store.proteins :
if verbose > 1 :
printf('\t\tProtein %s...' % (protId))
store.proteins[protId] = Protein_Raba()
store.proteins[protId].set(genome = genome, id = protId, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), transcript = store.transcripts.get(transId, None), name = transName)
store.transcripts[transId].protein = store.proteins[protId]
try :
exonNumber = int(line['exon_number']) - 1
exonKey = (transId, exonNumber)
except KeyError :
exonNumber = None
exonKey = None
if verbose > 2 :
printf('Warning: no exon number or id found in line %s' % gtf[i])
if exonKey is not None :
if verbose > 3 :
printf('\t\t\texon %s...' % (exonId))
if exonKey not in store.exons and regionType == 'exon' :
store.exons[exonKey] = Exon_Raba()
store.exons[exonKey].set(genome = genome, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), transcript = store.transcripts.get(transId, None), protein = store.proteins.get(protId, None), strand = strand, number = exonNumber, start = start, end = end)
store.transcripts[transId].exons.append(store.exons[exonKey])
try :
store.exons[exonKey].id = line['exon_id']
except KeyError :
pass
if regionType == 'exon' :
if start < store.exons[exonKey].start or store.exons[exonKey].start is None:
store.exons[exonKey].start = start
if end > store.transcripts[transId].end or store.exons[exonKey].end is None:
store.exons[exonKey].end = end
elif regionType == 'CDS' :
store.exons[exonKey].CDS_start = start
store.exons[exonKey].CDS_end = end
store.exons[exonKey].frame = frame
elif regionType == 'stop_codon' :
if strand == '+' :
if store.exons[exonKey].CDS_end != None :
store.exons[exonKey].CDS_end += 3
if store.exons[exonKey].end < store.exons[exonKey].CDS_end :
store.exons[exonKey].end = store.exons[exonKey].CDS_end
if store.transcripts[transId].end < store.exons[exonKey].CDS_end :
store.transcripts[transId].end = store.exons[exonKey].CDS_end
if store.genes[geneId].end < store.exons[exonKey].CDS_end :
store.genes[geneId].end = store.exons[exonKey].CDS_end
if strand == '-' :
if store.exons[exonKey].CDS_start != None :
store.exons[exonKey].CDS_start -= 3
if store.exons[exonKey].start > store.exons[exonKey].CDS_start :
store.exons[exonKey].start = store.exons[exonKey].CDS_start
if store.transcripts[transId].start > store.exons[exonKey].CDS_start :
store.transcripts[transId].start = store.exons[exonKey].CDS_start
if store.genes[geneId].start > store.exons[exonKey].CDS_start :
store.genes[geneId].start = store.exons[exonKey].CDS_start
pBar.close()
store.batch_save()
conf.db.beginTransaction()
printf('almost done saving chromosomes...')
store.save_chros()
printf('saving genome object...')
genome.save()
conf.db.endTransaction()
conf.db.beginTransaction()
printf('restoring core indexes...')
# Genome.ensureGlobalIndex(('name', 'species'))
# Chromosome.ensureGlobalIndex('genome')
# Gene.ensureGlobalIndex('genome')
# Transcript.ensureGlobalIndex('genome')
# Protein.ensureGlobalIndex('genome')
# Exon.ensureGlobalIndex('genome')
Transcript.ensureGlobalIndex('exons')
printf('commiting changes...')
conf.db.endTransaction()
conf.db.beginTransaction()
printf('restoring user indexes')
pBar = ProgressBar(label = "restoring", nbEpochs = len(indexes))
for idx in indexes :
pBar.update()
conf.db.execute(idx[-1].replace('CREATE INDEX', 'CREATE INDEX IF NOT EXISTS'))
pBar.close()
printf('commiting changes...')
conf.db.endTransaction()
return store.chromosomes.values()
#~ @profile
def _importSequence(chromosome, fastaFile, targetDir) :
"Serializes fastas into .dat files"
f = gzip.open(fastaFile)
header = f.readline()
strRes = f.read().upper().replace('\n', '').replace('\r', '')
f.close()
fn = '%s/chromosome%s.dat' % (targetDir, chromosome.number)
f = open(fn, 'w')
f.write(strRes)
f.close()
chromosome.dataFile = fn
chromosome.header = header
return len(strRes)
================================================
FILE: pyGeno/importation/SNPs.py
================================================
import urllib, shutil
from ConfigParser import SafeConfigParser
import pyGeno.configuration as conf
from pyGeno.SNP import *
from pyGeno.tools.ProgressBar import ProgressBar
from pyGeno.tools.io import printf
from Genomes import _decompressPackage, _getFile
from pyGeno.tools.parsers.CasavaTools import SNPsTxtFile
from pyGeno.tools.parsers.VCFTools import VCFFile
from pyGeno.tools.parsers.CSVTools import CSVFile
def importSNPs(packageFile) :
"""The big wrapper, this function should detect the SNP type by the package manifest and then launch the corresponding function.
Here's an example of a SNP manifest file for Casava SNPs::
[package_infos]
description = Casava SNPs for testing purposes
maintainer = Tariq Daouda
maintainer_contact = tariq.daouda [at] umontreal
version = 1
[set_infos]
species = human
name = dummySRY
type = Agnostic
source = my place at the IRIC
[snps]
filename = snps.txt # as with genomes you can either include de file at the root of the package or specify an URL from where it must be downloaded
"""
printf("Importing polymorphism set: %s... (This may take a while)" % packageFile)
isDir = False
if not os.path.isdir(packageFile) :
packageDir = _decompressPackage(packageFile)
else :
isDir = True
packageDir = packageFile
fpMan = os.path.normpath(packageDir+'/manifest.ini')
if not os.path.isfile(fpMan) :
raise ValueError("Not file named manifest.ini! Mais quel SCANDALE!!!!")
parser = SafeConfigParser()
parser.read(os.path.normpath(packageDir+'/manifest.ini'))
packageInfos = parser.items('package_infos')
setName = parser.get('set_infos', 'name')
typ = parser.get('set_infos', 'type')
if typ.lower()[-3:] != 'snp' :
typ += 'SNP'
species = parser.get('set_infos', 'species').lower()
genomeSource = parser.get('set_infos', 'source')
snpsFileTmp = parser.get('snps', 'filename').strip()
snpsFile = _getFile(parser.get('snps', 'filename'), packageDir)
return_value = None
try :
SMaster = SNPMaster(setName = setName)
except KeyError :
if typ.lower() == 'casavasnp' :
return_value = _importSNPs_CasavaSNP(setName, species, genomeSource, snpsFile)
elif typ.lower() == 'dbsnpsnp' :
return_value = _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile)
elif typ.lower() == 'dbsnp' :
return_value = _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile)
elif typ.lower() == 'tophatsnp' :
return_value = _importSNPs_TopHatSNP(setName, species, genomeSource, snpsFile)
elif typ.lower() == 'agnosticsnp' :
return_value = _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile)
else :
raise FutureWarning('Unknown SNP type in manifest %s' % typ)
else :
raise KeyError("There's already a SNP set by the name %s. Use deleteSNPs() to remove it first" %setName)
if not isDir :
shutil.rmtree(packageDir)
return return_value
def deleteSNPs(setName) :
"""deletes a set of polymorphisms"""
con = conf.db
try :
SMaster = SNPMaster(setName = setName)
con.beginTransaction()
SNPType = SMaster.SNPType
con.delete(SNPType, 'setName = ?', (setName,))
SMaster.delete()
con.endTransaction()
except KeyError :
raise KeyError("Can't delete the setName %s because i can't find it in SNPMaster, maybe there's not set by that name" % setName)
#~ printf("can't delete the setName %s because i can't find it in SNPMaster, maybe there's no set by that name" % setName)
return False
return True
def _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile) :
"This function will also create an index on start->chromosomeNumber->setName. Warning : pyGeno wil interpret all positions as 0 based"
printf('importing SNP set %s for species %s...' % (setName, species))
snpData = CSVFile()
snpData.parse(snpsFile, separator = "\t")
AgnosticSNP.dropIndex(('start', 'chromosomeNumber', 'setName'))
conf.db.beginTransaction()
pBar = ProgressBar(len(snpData))
pLabel = ''
currChrNumber = None
for snpEntry in snpData :
tmpChr = snpEntry['chromosomeNumber']
if tmpChr != currChrNumber :
currChrNumber = tmpChr
pLabel = 'Chr %s...' % currChrNumber
snp = AgnosticSNP()
snp.species = species
snp.setName = setName
for f in snp.getFields() :
try :
setattr(snp, f, snpEntry[f])
except KeyError :
if f != 'species' and f != 'setName' :
printf("Warning filetype as no key %s", f)
snp.quality = float(snp.quality)
snp.start = int(snp.start)
snp.end = int(snp.end)
snp.save()
pBar.update(label = pLabel)
pBar.close()
snpMaster = SNPMaster()
snpMaster.set(setName = setName, SNPType = 'AgnosticSNP', species = species)
snpMaster.save()
printf('saving...')
conf.db.endTransaction()
printf('creating indexes...')
AgnosticSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName'))
printf('importation of SNP set %s for species %s done.' %(setName, species))
return True
def _importSNPs_CasavaSNP(setName, species, genomeSource, snpsFile) :
"This function will also create an index on start->chromosomeNumber->setName. Warning : pyGeno positions are 0 based"
printf('importing SNP set %s for species %s...' % (setName, species))
snpData = SNPsTxtFile(snpsFile)
CasavaSNP.dropIndex(('start', 'chromosomeNumber', 'setName'))
conf.db.beginTransaction()
pBar = ProgressBar(len(snpData))
pLabel = ''
currChrNumber = None
for snpEntry in snpData :
tmpChr = snpEntry['chromosomeNumber']
if tmpChr != currChrNumber :
currChrNumber = tmpChr
pLabel = 'Chr %s...' % currChrNumber
snp = CasavaSNP()
snp.species = species
snp.setName = setName
for f in snp.getFields() :
try :
setattr(snp, f, snpEntry[f])
except KeyError :
if f != 'species' and f != 'setName' :
printf("Warning filetype as no key %s", f)
snp.start -= 1
snp.end -= 1
snp.save()
pBar.update(label = pLabel)
pBar.close()
snpMaster = SNPMaster()
snpMaster.set(setName = setName, SNPType = 'CasavaSNP', species = species)
snpMaster.save()
printf('saving...')
conf.db.endTransaction()
printf('creating indexes...')
CasavaSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName'))
printf('importation of SNP set %s for species %s done.' %(setName, species))
return True
def _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile) :
"This function will also create an index on start->chromosomeNumber->setName. Warning : pyGeno positions are 0 based"
snpData = VCFFile(snpsFile, gziped = True, stream = True)
dbSNPSNP.dropIndex(('start', 'chromosomeNumber', 'setName'))
conf.db.beginTransaction()
pBar = ProgressBar()
pLabel = ''
for snpEntry in snpData :
pBar.update(label = 'Chr %s, %s...' % (snpEntry['#CHROM'], snpEntry['ID']))
snp = dbSNPSNP()
for f in snp.getFields() :
try :
setattr(snp, f, snpEntry[f])
except KeyError :
pass
snp.chromosomeNumber = snpEntry['#CHROM']
snp.species = species
snp.setName = setName
snp.start = snpEntry['POS']-1
snp.alt = snpEntry['ALT']
snp.ref = snpEntry['REF']
snp.end = snp.start+len(snp.alt)
snp.save()
pBar.close()
snpMaster = SNPMaster()
snpMaster.set(setName = setName, SNPType = 'dbSNPSNP', species = species)
snpMaster.save()
printf('saving...')
conf.db.endTransaction()
printf('creating indexes...')
dbSNPSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName'))
printf('importation of SNP set %s for species %s done.' %(setName, species))
return True
def _importSNPs_TopHatSNP(setName, species, genomeSource, snpsFile) :
raise FutureWarning('Not implemented yet')
================================================
FILE: pyGeno/importation/__init__.py
================================================
__all__ = ['Genomes', 'SNPs']
================================================
FILE: pyGeno/pyGenoObjectBases.py
================================================
import time, types, string
import configuration as conf
from rabaDB.rabaSetup import *
from rabaDB.Raba import *
from rabaDB.filters import RabaQuery
def nosave() :
raise ValueError('You can only save object that are linked to reference genomes')
class pyGenoRabaObject(Raba) :
"""pyGeno uses rabaDB to persistenly store data. Most persistent
objects have classes that inherit from this one (Genome_Raba,
Chromosome_Raba, Gene_Raba, Protein_Raba, Exon_Raba). Theses classes
are not mean to be accessed directly. Users manipulate wrappers
such as : Genome, Chromosome etc... pyGenoRabaObject extends
the Raba class by adding a function _curate that is called just
before saving. This class is to be considered abstract, and is not
meant to be instanciated"""
_raba_namespace = conf.pyGeno_RABA_NAMESPACE
_raba_abstract = True # not saved in db by default
def __init__(self) :
if self is pyGenoRabaObject :
raise TypeError("This class is abstract")
def _curate(self) :
"Last operations performed before saving, must be implemented in child"
raise TypeError("This method is abstract and should be implemented in child")
def save(self) :
"""Calls _curate() before performing a normal rabaDB lazy save
(saving only occurs if the object has been modified)"""
if self.mutated() :
self._curate()
Raba.save(self)
class pyGenoRabaObjectWrapper_metaclass(type) :
"""This metaclass keeps track of the relationship between wrapped
classes and wrappers """
_wrappers = {}
def __new__(cls, name, bases, dct) :
clsObj = type.__new__(cls, name, bases, dct)
cls._wrappers[dct['_wrapped_class']] = clsObj
return clsObj
class RLWrapper(object) :
"""A wrapper for rabalists that replaces raba objects by pyGeno Object"""
def __init__(self, rabaObj, listObjectType, rl) :
self.rabaObj = rabaObj
self.rl = rl
self.listObjectType = listObjectType
def __getitem__(self, i) :
return self.listObjectType(wrapped_object_and_bag = (self.rl[i], self.rabaObj.bagKey))
def __getattr__(self, name) :
rl = object.__getattribute__(self, 'rl')
return getattr(rl, name)
class pyGenoRabaObjectWrapper(object) :
"""All the wrapper classes such as Genome and Chromosome inherit
from this class. It has most that make pyGeno useful, such as
get(), count(), ensureIndex(). This class is to be considered
abstract, and is not meant to be instanciated"""
__metaclass__ = pyGenoRabaObjectWrapper_metaclass
_wrapped_class = None
_bags = {}
def __init__(self, wrapped_object_and_bag = (), *args, **kwargs) :
if self is pyGenoRabaObjectWrapper :
raise TypeError("This class is abstract")
if wrapped_object_and_bag != () :
assert wrapped_object_and_bag[0]._rabaClass is self._wrapped_class
self.wrapped_object = wrapped_object_and_bag[0]
self.bagKey = wrapped_object_and_bag[1]
pyGenoRabaObjectWrapper._bags[self.bagKey][self._getObjBagKey(self.wrapped_object)] = self
else :
self.wrapped_object = self._wrapped_class(*args, **kwargs)
self.bagKey = time.time()
pyGenoRabaObjectWrapper._bags[self.bagKey] = {}
pyGenoRabaObjectWrapper._bags[self.bagKey][self._getObjBagKey(self.wrapped_object)] = self
self._load_sequencesTriggers = set()
self.loadSequences = True
self.loadData = True
self.loadBinarySequences = True
def _getObjBagKey(self, obj) :
"""pyGeno objects are kept in bags to ensure that reference
objects are loaded only once. This function returns the bag key
of the current object"""
return (obj._rabaClass.__name__, obj.raba_id)
def _makeLoadQuery(self, objectType, *args, **coolArgs) :
# conf.db.enableDebug(True)
f = RabaQuery(objectType._wrapped_class, namespace = self._wrapped_class._raba_namespace)
coolArgs[self._wrapped_class.__name__[:-5]] = self.wrapped_object #[:-5] removes _Raba from class name
if len(args) > 0 and type(args[0]) is types.ListType :
for a in args[0] :
if type(a) is types.DictType :
f.addFilter(**a)
else :
f.addFilter(*args, **coolArgs)
return f
def count(self, objectType, *args, **coolArgs) :
"""Returns the number of elements satisfying the query"""
return self._makeLoadQuery(objectType, *args, **coolArgs).count()
def get(self, objectType, *args, **coolArgs) :
"""Raba Magic inside. This is th function that you use for
querying pyGeno's DB.
Usage examples:
* myGenome.get("Gene", name = 'TPST2')
* myGene.get(Protein, id = 'ENSID...')
* myGenome.get(Transcript, {'start >' : x, 'end <' : y})"""
ret = []
for e in self._makeLoadQuery(objectType, *args, **coolArgs).iterRun() :
if issubclass(objectType, pyGenoRabaObjectWrapper) :
ret.append(objectType(wrapped_object_and_bag = (e, self.bagKey)))
else :
ret.append(e)
return ret
def iterGet(self, objectType, *args, **coolArgs) :
"""Same as get. But retuns the elements one by one, much more efficient for large outputs"""
for e in self._makeLoadQuery(objectType, *args, **coolArgs).iterRun() :
if issubclass(objectType, pyGenoRabaObjectWrapper) :
yield objectType(wrapped_object_and_bag = (e, self.bagKey))
else :
yield e
#~ def ensureIndex(self, fields) :
#~ """
#~ Warning: May not work on some systems, see ensureGlobalIndex
#~
#~ Creates a partial index on self (if it does not exist).
#~ Ex: myTranscript.ensureIndex('name')"""
#~
#~ where, whereValues = '%s=?' %(self._wrapped_class.__name__[:-5]), self.wrapped_object
#~ self._wrapped_class.ensureIndex(fields, where, (whereValues,))
#~ def dropIndex(self, fields) :
#~ """Warning: May not work on some systems, see dropGlobalIndex
#~
#~ Drops a partial index on self. Ex: myTranscript.dropIndex('name')"""
#~ where, whereValues = '%s=?' %(self._wrapped_class.__name__[:-5]), self.wrapped_object
#~ self._wrapped_class.dropIndex(fields, where, (whereValues,))
def __getattr__(self, name) :
"""If a wrapper does not have a specific field, pyGeno will
look for it in the wrapped_object"""
# print "pyGenoObjectBases __getattr__ : " + name + " from " + str(type(self))
if name == 'save' or name == 'delete' :
raise AttributeError("You can't delete or save an object from wrapper, try .wrapped_object.delete()/save()")
if name in self._load_sequencesTriggers and self.loadSequences :
self.loadSequences = False
self._load_sequences()
return getattr(self, name)
if name in self._load_sequencesTriggers and self.loadData :
self.loadData = False
self._load_data()
return getattr(self, name)
if name[:4] == 'bin_' and self.loadBinarySequences :
self.updateBinarySequence = False
self._load_bin_sequence()
return getattr(self, name)
attr = getattr(self.wrapped_object, name)
if isRabaObject(attr) :
attrKey = self._getObjBagKey(attr)
if attrKey in pyGenoRabaObjectWrapper._bags[self.bagKey] :
retObj = pyGenoRabaObjectWrapper._bags[self.bagKey][attrKey]
else :
wCls = pyGenoRabaObjectWrapper_metaclass._wrappers[attr._rabaClass]
retObj = wCls(wrapped_object_and_bag = (attr, self.bagKey))
return retObj
return attr
@classmethod
def getIndexes(cls) :
"""Returns a list of indexes attached to the object's class. Ex
Transcript.getIndexes()"""
return cls._wrapped_class.getIndexes()
@classmethod
def flushIndexes(cls) :
"""Drops all the indexes attached to the object's class. Ex
Transcript.flushIndexes()"""
return cls._wrapped_class.flushIndexes()
@classmethod
def help(cls) :
"""Returns a list of available field for queries. Ex
Transcript.help()"""
return cls._wrapped_class.help().replace("_Raba", "")
@classmethod
def ensureGlobalIndex(cls, fields) :
"""Add a GLOBAL index to the db to speedup lookouts. Fields can be a
list of fields for Multi-Column Indices or simply the name of a
single field. A global index is an index on the entire type.
A global index on 'Transcript' on field 'name', will index the names for all the transcripts in the database"""
cls._wrapped_class.ensureIndex(fields)
@classmethod
def dropGlobalIndex(cls, fields) :
"""Drops an index, the opposite of ensureGlobalIndex()"""
cls._wrapped_class.dropIndex(fields)
def getSequencesData(self) :
"""This lazy abstract function is only called if the object
sequences need to be loaded"""
raise NotImplementedError("This fct loads non binary sequences and should be implemented in child if needed")
def _load_sequences(self) :
"""This lazy abstract function is only called if the object
sequences need to be loaded"""
self._load_data()
def _load_data(self) :
"""This lazy abstract function is only called if the object
sequences need to be loaded"""
raise NotImplementedError("This fct loads non binary sequences and should be implemented in child if needed")
def _load_bin_sequence(self) :
"""Same as _load_sequences(), but loads binary sequences"""
raise NotImplementedError("This fct loads binary sequences and should be implemented in child if needed")
================================================
FILE: pyGeno/tests/__init__.py
================================================
================================================
FILE: pyGeno/tests/test_csv.py
================================================
import unittest
from pyGeno.tools.parsers.CSVTools import *
class CSVTests(unittest.TestCase):
def setUp(self):
pass
def tearDown(self):
pass
def test_createParse(self) :
testVals = ["test", "test2"]
c = CSVFile(legend = ["col1", "col2"], separator = "\t")
l = c.newLine()
l["col1"] = testVals[0]
l = c.newLine()
l["col1"] = testVals[1]
c.save("test.csv")
# print "----", l
c2 = CSVFile()
c2.parse("test.csv", separator = "\t")
i = 0
for l in c2 :
self.assertEqual(l["col1"], testVals[i])
i += 1
def runTests() :
unittest.main()
if __name__ == "__main__" :
runTests()
================================================
FILE: pyGeno/tests/test_genome.py
================================================
import unittest
from pyGeno.Genome import *
import pyGeno.bootstrap as B
from pyGeno.importation.Genomes import *
from pyGeno.importation.SNPs import *
class pyGenoSNPTests(unittest.TestCase):
def setUp(self):
# try :
# B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
# except KeyError :
# deleteGenome("human", "GRCh37.75_Y-Only")
# B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
# print "--> Seems to already exist in db"
# try :
# B.importSNPs("Human_agnostic.dummySRY.tar.gz")
# except KeyError :
# deleteSNPs("dummySRY_AGN")
# B.importSNPs("Human_agnostic.dummySRY.tar.gz")
# print "--> Seems to already exist in db"
# try :
# B.importSNPs("Human_agnostic.dummySRY_indels")
# except KeyError :
# deleteSNPs("dummySRY_AGN_indels")
# B.importSNPs("Human_agnostic.dummySRY_indels")
# print "--> Seems to already exist in db"
self.ref = Genome(name = 'GRCh37.75_Y-Only')
def tearDown(self):
pass
# @unittest.skip("skipping")
def test_vanilla(self) :
dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN')
persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]
self.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20])
self.assertEqual('HTGCAATCATATGCTTCTGC', persProt.transcript.cDNA[:20])
# @unittest.skip("skipping")
def test_noModif(self) :
from pyGeno.SNPFiltering import SNPFilter
class MyFilter(SNPFilter) :
def __init__(self) :
SNPFilter.__init__(self)
def filter(self, chromosome, dummySRY_AGN) :
return None
dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())
persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]
self.assertEqual(persProt.transcript.cDNA[:20], refProt.transcript.cDNA[:20])
# @unittest.skip("skipping")
def test_insert(self) :
from pyGeno.SNPFiltering import SNPFilter
class MyFilter(SNPFilter) :
def __init__(self) :
SNPFilter.__init__(self)
def filter(self, chromosome, dummySRY_AGN) :
from pyGeno.SNPFiltering import SequenceInsert
refAllele = chromosome.refSequence[dummySRY_AGN.start]
return SequenceInsert('XXX')
dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())
persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]
self.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20])
self.assertEqual('XXXATGCAATCATATGCTTC', persProt.transcript.cDNA[:20])
# @unittest.skip("skipping")
def test_SNP(self) :
from pyGeno.SNPFiltering import SNPFilter
class MyFilter(SNPFilter) :
def __init__(self) :
SNPFilter.__init__(self)
def filter(self, chromosome, dummySRY_AGN) :
from pyGeno.SNPFiltering import SequenceSNP
return SequenceSNP(dummySRY_AGN.alt)
dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())
persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]
self.assertEqual('M', refProt.sequence[0])
self.assertEqual('L', persProt.sequence[0])
# @unittest.skip("skipping")
def test_deletion(self) :
from pyGeno.SNPFiltering import SNPFilter
class MyFilter(SNPFilter) :
def __init__(self) :
SNPFilter.__init__(self)
def filter(self, chromosome, dummySRY_AGN) :
from pyGeno.SNPFiltering import SequenceDel
refAllele = chromosome.refSequence[dummySRY_AGN.start]
return SequenceDel(1)
dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())
persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]
self.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20])
self.assertEqual('TGCAATCATATGCTTCTGCT', persProt.transcript.cDNA[:20])
# @unittest.skip("skipping")
def test_indels(self) :
from pyGeno.SNPFiltering import SNPFilter
class MyFilter(SNPFilter) :
def __init__(self) :
SNPFilter.__init__(self)
def filter(self, chromosome, dummySRY_AGN_indels) :
from pyGeno.SNPFiltering import SequenceInsert
ret = ""
for s in dummySRY_AGN_indels :
ret += "X"
return SequenceInsert(ret)
dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN_indels', SNPFilter = MyFilter())
persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]
self.assertEqual('XXXATGCAATCATATGCTTC', persProt.transcript.cDNA[:20])
# @unittest.skip("skipping")
def test_bags(self) :
dummy = Genome(name = 'GRCh37.75_Y-Only')
self.assertEqual(dummy.wrapped_object, self.ref.wrapped_object)
# @unittest.skip("skipping")
def test_prot_find(self) :
prot = self.ref.get(Protein, id = 'ENSP00000438917')[0]
needle = prot.sequence[:10]
self.assertEqual(0, prot.find(needle))
needle = prot.sequence[-10:]
self.assertEqual(len(prot)-10, prot.find(needle))
# @unittest.skip("skipping")
def test_trans_find(self) :
trans = self.ref.get(Transcript, name = "SRY-001")[0]
self.assertEqual(0, trans.find(trans[:5]))
# @unittest.skip("remote server down")
# def test_import_remote_genome(self) :
# self.assertRaises(KeyError, B.importRemoteGenome, "Human.GRCh37.75_Y-Only.tar.gz")
# @unittest.skip("remote server down")
# def test_import_remote_snps(self) :
# self.assertRaises(KeyError, B.importRemoteSNPs, "Human_agnostic.dummySRY.tar.gz")
def runTests() :
try :
B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
except KeyError :
deleteGenome("human", "GRCh37.75_Y-Only")
B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
print "--> Seems to already exist in db"
try :
B.importSNPs("Human_agnostic.dummySRY.tar.gz")
except KeyError :
deleteSNPs("dummySRY_AGN")
B.importSNPs("Human_agnostic.dummySRY.tar.gz")
print "--> Seems to already exist in db"
try :
B.importSNPs("Human_agnostic.dummySRY_indels")
except KeyError :
deleteSNPs("dummySRY_AGN_indels")
B.importSNPs("Human_agnostic.dummySRY_indels")
print "--> Seems to already exist in db"
# import time
# time.sleep(10)
unittest.main()
if __name__ == "__main__" :
runTests()
================================================
FILE: pyGeno/tools/BinarySequence.py
================================================
import array, copy
import UsefulFunctions as uf
class BinarySequence :
"""A class for representing sequences in a binary format"""
ALPHABETA_SIZE = 32
ALPHABETA_KMP = range(ALPHABETA_SIZE)
def __init__(self, sequence, arrayForma, charToBinDict) :
self.forma = arrayForma
self.charToBin = charToBinDict
self.sequence = sequence
self.binSequence, self.defaultSequence, self.polymorphisms = self.encode(sequence)
self.itemsize = self.binSequence.itemsize
self.typecode = self.binSequence.typecode
#print 'bin', len(self.sequence), len(self.binSequence)
def encode(self, sequence):
"""Returns a tuple (binary reprensentation, default sequence, polymorphisms list)"""
polymorphisms = []
defaultSequence = ''
binSequence = array.array(self.forma.typecode)
b = 0
i = 0
trueI = 0 #not inc in case if poly
poly = set()
while i < len(sequence)-1:
b = b | self.forma[self.charToBin[sequence[i]]]
if sequence[i+1] == '/' :
poly.add(sequence[i])
i += 2
else :
binSequence.append(b)
if len(poly) > 0 :
poly.add(sequence[i])
polymorphisms.append((trueI, poly))
poly = set()
bb = 0
while b % 2 != 0 :
b = b/2
defaultSequence += sequence[i]
b = 0
i += 1
trueI += 1
if i < len(sequence) :
b = b | self.forma[self.charToBin[sequence[i]]]
binSequence.append(b)
if len(poly) > 0 :
if sequence[i] not in poly :
poly.add(sequence[i])
polymorphisms.append((trueI, poly))
defaultSequence += sequence[i]
return (binSequence, defaultSequence, polymorphisms)
def __testFind(self, arr) :
if len(arr) == 0:
raise TypeError ('binary find, needle is empty')
if arr.itemsize != self.itemsize :
raise TypeError ('binary find, both arrays must have same item size, arr: %d, self: %d' % (arr.itemsize, self.itemsize))
def __testBinary(self, arr) :
if len(arr) != len(self) :
raise TypeError ('bin operator, both arrays must be of same length')
if arr.itemsize != self.itemsize :
raise TypeError ('bin operator, both arrays must have same item size, arr: %d, self: %d' % (arr.itemsize, self.itemsize))
def findPolymorphisms(self, strSeq, strict = False):
"""
Compares strSeq with self.sequence.
If not 'strict', this function ignores the cases of matching heterozygocity (ex: for a given position i, strSeq[i] = A and self.sequence[i] = 'A/G'). If 'strict' it returns all positions where strSeq differs self,sequence
"""
arr = self.encode(strSeq)[0]
res = []
if not strict :
for i in range(len(arr)+len(self)) :
if i >= len(arr) or i > len(self) :
break
if arr[i] & self[i] == 0:
res.append(i)
else :
for i in range(len(arr)+len(self)) :
if i >= len(arr) or i > len(self) :
break
if arr[i] != self[i] :
res.append(i)
return res
def getPolymorphisms(self):
"""returns all polymorphsims in the form of a dict pos => alleles"""
return self.polymorphisms
def getDefaultSequence(self) :
"""returns a default version of sequence where only the last allele of each polymorphisms is shown"""
return self.defaultSequence
def __getSequenceVariants(self, x1, polyStart, polyStop, listSequence) :
"""polyStop, is the polymorphisme at wixh number where the calcul of combinaisons stops"""
if polyStart < len(self.polymorphisms) and polyStart < polyStop:
sequence = copy.copy(listSequence)
ret = []
pk = self.polymorphisms[polyStart]
posInSequence = pk[0]-x1
if posInSequence < len(listSequence) :
for allele in pk[1] :
sequence[posInSequence] = allele
ret.extend(self.__getSequenceVariants(x1, polyStart+1, polyStop, sequence))
return ret
else :
return [''.join(listSequence)]
def getSequenceVariants(self, x1 = 0, x2 = -1, maxVariantNumber = 128) :
"""returns the sequences resulting from all combinaisons of all polymorphismes between x1 and x2. The results is a couple (bool, variants of sequence between x1 and x2), bool == true if there's more combinaisons than maxVariantNumber"""
if x2 == -1 :
xx2 = len(self.defaultSequence)
else :
xx2 = x2
polyStart = None
nbP = 1
stopped = False
j = 0
for p in self.polymorphisms :
if p[0] >= xx2 :
break
if x1 <= p[0] :
if polyStart == None :
polyStart = j
nbP *= len(p[1])
if nbP > maxVariantNumber :
stopped = True
break
j += 1
if polyStart == None :
return (stopped, [self.defaultSequence[x1:xx2]])
return (stopped, self.__getSequenceVariants(x1, polyStart, j, list(self.defaultSequence[x1:xx2])))
def getNbVariants(self, x1, x2 = -1) :
"""returns the nb of variants of sequences between x1 and x2"""
if x2 == -1 :
xx2 = len(self.defaultSequence)
else :
xx2 = x2
nbP = 1
for p in self.polymorphisms:
if x1 <= p[0] and p[0] <= xx2 :
nbP *= len(p[1])
return nbP
def _dichFind(self, needle, currHaystack, offset, lst = None) :
"""dichotomic search, if lst is None, will return the first position found. If it's a list, will return a list of all positions in lst. returns -1 or [] if no match found"""
if len(currHaystack) == 1 :
if (offset <= (len(self) - len(needle))) and (currHaystack[0] & needle[0]) > 0 and (self[offset+len(needle)-1] & needle[-1]) > 0 :
found = True
for i in xrange(1, len(needle)-1) :
if self[offset + i] & needle[i] == 0 :
found = False
break
if found :
if lst is not None :
lst.append(offset)
else :
return offset
else :
if lst is None :
return -1
else :
if (offset <= (len(self) - len(needle))) :
if lst is not None :
self._dichFind(needle, currHaystack[:len(currHaystack)/2], offset, lst)
self._dichFind(needle, currHaystack[len(currHaystack)/2:], offset + len(currHaystack)/2, lst)
else :
v1 = self._dichFind(needle, currHaystack[:len(currHaystack)/2], offset, lst)
if v1 > -1 :
return v1
return self._dichFind(needle, currHaystack[len(currHaystack)/2:], offset + len(currHaystack)/2, lst)
return -1
def _kmp_construct_next(self, pattern):
"""the helper function for KMP-string-searching is to construct the DFA. pattern should be an integer array. return a 2D array representing the DFA for moving the pattern."""
next = [[0 for state in pattern] for input_token in self.ALPHABETA_KMP]
next[pattern[0]][0] = 1
restart_state = 0
for state in range(1, len(pattern)):
for input_token in self.ALPHABETA_KMP:
next[input_token][state] = next[input_token][restart_state]
next[pattern[state]][state] = state + 1
restart_state = next[pattern[state]][restart_state]
return next
def _kmp_search_first(self, pInput_sequence, pPattern):
"""use KMP algorithm to search the first occurrence in the input_sequence of the pattern. both arguments are integer arrays. return the position of the occurence if found; otherwise, -1."""
input_sequence, pattern = pInput_sequence, [len(bin(e)) for e in pPattern]
n, m = len(input_sequence), len(pattern)
d = p = 0
next = self._kmp_construct_next(pattern)
while d < n and p < m:
p = next[len(bin(input_sequence[d]))][p]
d += 1
if p == m: return d - p
else: return -1
def _kmp_search_all(self, pInput_sequence, pPattern):
"""use KMP algorithm to search all occurrence in the input_sequence of the pattern. both arguments are integer arrays. return a list of the positions of the occurences if found; otherwise, []."""
r = []
input_sequence, pattern = [len(bin(e)) for e in pInput_sequence], [len(bin(e)) for e in pPattern]
n, m = len(input_sequence), len(pattern)
d = p = 0
next = self._kmp_construct_next(pattern)
while d < n:
p = next[input_sequence[d]][p]
d += 1
if p == m:
r.append(d - m)
p = 0
return r
def _kmp_find(self, needle, haystack, lst = None):
"""find with KMP-searching. needle is an integer array, reprensenting a pattern. haystack is an integer array, reprensenting the input sequence. if lst is None, return the first position found or -1 if no match found. If it's a list, will return a list of all positions in lst. returns -1 or [] if no match found."""
if lst != None: return self._kmp_search_all(haystack, needle)
else: return self._kmp_search_first(haystack, needle)
def findByBiSearch(self, strSeq) :
"""returns the first occurence of strSeq in self. Takes polymorphisms into account"""
arr = self.encode(strSeq)
return self._dichFind(arr[0], self, 0, lst = None)
def findAllByBiSearch(self, strSeq) :
"""Same as find but returns a list of all occurences"""
arr = self.encode(strSeq)
lst = []
self._dichFind(arr[0], self, 0, lst)
return lst
def find(self, strSeq) :
"""returns the first occurence of strSeq in self. Takes polymorphisms into account"""
arr = self.encode(strSeq)
return self._kmp_find(arr[0], self)
def findAll(self, strSeq) :
"""Same as find but returns a list of all occurences"""
arr = self.encode(strSeq)
lst = []
lst = self._kmp_find(arr[0], self, lst)
return lst
def __and__(self, arr) :
self.__testBinary(arr)
ret = BinarySequence(self.typecode, self.forma, self.charToBin)
for i in range(len(arr)) :
ret.append(self[i] & arr[i])
return ret
def __xor__(self, arr) :
self.__testBinary(arr)
ret = BinarySequence(self.typecode, self.forma, self.charToBin)
for i in range(len(arr)) :
ret.append(self[i] ^ arr[i])
return ret
def __eq__(self, seq) :
self.__testBinary(seq)
if len(seq) != len(self) :
return False
return all( self[i] == seq[i] for i in range(len(self)) )
def append(self, arr) :
self.binSequence.append(arr)
def extend(self, arr) :
self.binSequence.extend(arr)
def decode(self, binSequence):
"""decodes a binary sequence to return a string"""
try:
binSeq = iter(binSequence[0])
except TypeError, te:
binSeq = binSequence
ret = ''
for b in binSeq :
ch = ''
for c in self.charToBin :
if b & self.forma[self.charToBin[c]] > 0 :
ch += c +'/'
if ch == '' :
raise KeyError('Key %d unkowom, bad format' % b)
ret += ch[:-1]
return ret
def getChar(self, i):
return self.decode([self.binSequence[i]])
def __len__(self):
return len(self.binSequence)
def __getitem__(self,i):
return self.binSequence[i]
def __setitem__(self, i, v):
self.binSequence[i] = v
class AABinarySequence(BinarySequence) :
"""A binary sequence of amino acids"""
def __init__(self, sequence):
f = array.array('I', [1L, 2L, 4L, 8L, 16L, 32L, 64L, 128L, 256L, 512L, 1024L, 2048L, 4096L, 8192L, 16384L, 32768L, 65536L, 131072L, 262144L, 524288L, 1048576L, 2097152L])
c = {'A': 17, 'C': 14, 'E': 19, 'D': 15, 'G': 13, 'F': 16, 'I': 3, 'H': 9, 'K': 8, '*': 1, 'M': 20, 'L': 0, 'N': 4, 'Q': 11, 'P': 6, 'S': 7, 'R': 5, 'T': 2, 'W': 10, 'V': 18, 'Y': 12, 'U': 21}
BinarySequence.__init__(self, sequence, f, c)
class NucBinarySequence(BinarySequence) :
"""A binary sequence of nucleotides"""
def __init__(self, sequence):
f = array.array('B', [1, 2, 4, 8])
c = {'A': 0, 'T': 1, 'C': 2, 'G': 3}
ce = {
'R' : 'A/G', 'Y' : 'C/T', 'M': 'A/C',
'K' : 'T/G', 'W' : 'A/T', 'S' : 'C/G', 'B': 'C/G/T',
'D' : 'A/G/T', 'H' : 'A/C/T', 'V' : 'A/C/G', 'N': 'A/C/G/T'
}
lstSeq = list(sequence)
for i in xrange(len(lstSeq)) :
if lstSeq[i] in ce :
lstSeq[i] = ce[lstSeq[i]]
lstSeq = ''.join(lstSeq)
BinarySequence.__init__(self, lstSeq, f, c)
if __name__=="__main__":
def test0() :
#seq = 'ACED/E/GFIHK/MLMQPS/RTWVY'
seq = 'ACED/E/GFIHK/MLMQPS/RTWVY/A/R'
bSeq = AABinarySequence(seq)
start = 0
stop = 4
rB = bSeq.getSequenceVariants_bck(start, stop)
r = bSeq.getSequenceVariants(start, stop)
#print start, stop, 'nb_comb_r', len(r[1]), set(rB[1])==set(r[1])
print start, stop#, 'nb_comb_r', len(r[1]), set(rB[1])==set(r[1])
#if set(rB[1])!=set(r[1]) :
print '-AV-'
print start, stop, 'nb_comb_r', len(rB[1])
print '\n'.join(rB[1])
print '=AP========'
print start, stop, 'nb_comb_r', len(r[1])
print '\n'.join(r[1])
def testVariants() :
seq = 'ATGAGTTTGCCGCGCN'
bSeq = NucBinarySequence(seq)
print bSeq.getSequenceVariants()
testVariants()
from random import randint
alphabeta = ['A', 'C', 'G', 'T']
seq = ''
for _ in range(8192):
seq += alphabeta[randint(0, 3)]
seq += 'ATGAGTTTGCCGCGCN'
bSeq = NucBinarySequence(seq)
ROUND = 512
PATTERN = 'GCGC'
def testFind():
for i in range(ROUND):
bSeq.find(PATTERN)
def testFindByBiSearch():
for i in range(ROUND):
bSeq.findByBiSearch(PATTERN)
def testFindAll():
for i in range(ROUND):
bSeq.findAll(PATTERN)
def testFindAllByBiSearch():
for i in range(ROUND):
bSeq.findAllByBiSearch(PATTERN)
import cProfile
print('find:\n')
cProfile.run('testFind()')
print('findAll:\n')
cProfile.run('testFindAll()')
print('findByBiSearch:\n')
cProfile.run('testFindByBiSearch()')
print('findAllByBiSearch:\n')
cProfile.run('testFindAllByBiSearch()')
================================================
FILE: pyGeno/tools/ProgressBar.py
================================================
import sys, time, cPickle
class ProgressBar :
"""A very simple unthreaded progress bar. This progress bar also logs stats in .logs.
Usage example::
p = ProgressBar(nbEpochs = -1)
for i in range(200000) :
p.update(label = 'value of i %d' % i)
p.close()
If you don't know the maximum number of epochs you can enter nbEpochs < 1
"""
def __init__(self, nbEpochs = -1, width = 25, label = "progress", minRefeshTime = 1) :
self.width = width
self.currEpoch = 0
self.nbEpochs = float(nbEpochs)
self.bar = ''
self.label = label
self.wheel = ["-", "\\", "|", "/"]
self.startTime = time.time()
self.lastPrintTime = -1
self.minRefeshTime = minRefeshTime
self.runtime = -1
self.runtime_hr = -1
self.avg = -1
self.remtime = -1
self.remtime_hr = -1
self.currTime = time.time()
self.lastEpochDuration = -1
self.bars = []
self.miniSnake = '~-~-~-?:>'
self.logs = {'epochDuration' : [], 'avg' : [], 'runtime' : [], 'remtime' : []}
def formatTime(self, val) :
if val < 60 :
return '%.3fsc' % val
elif val < 3600 :
return '%.3fmin' % (val/60)
else :
return '%dh %dmin' % (int(val)/3600, int(val/60)%60)
def _update(self) :
tim = time.time()
if self.nbEpochs > 1 :
if self.currTime > 0 :
self.lastEpochDuration = tim - self.currTime
self.currTime = tim
self.runtime = tim - self.startTime
self.runtime_hr = self.formatTime(self.runtime)
self.avg = self.runtime/self.currEpoch
self.remtime = self.avg * (self.nbEpochs-self.currEpoch)
self.remtime_hr = self.formatTime(self.remtime)
def log(self) :
"""logs stats about the progression, without printing anything on screen"""
self.logs['epochDuration'].append(self.lastEpochDuration)
self.logs['avg'].append(self.avg)
self.logs['runtime'].append(self.runtime)
self.logs['remtime'].append(self.remtime)
def saveLogs(self, filename) :
"""dumps logs into a nice pickle"""
f = open(filename, 'wb')
cPickle.dump(self.logs, f)
f.close()
def update(self, label = '', forceRefresh = False, log = False) :
"""the function to be called at each iteration. Setting log = True is the same as calling log() just after update()"""
self.currEpoch += 1
tim = time.time()
if (tim - self.lastPrintTime > self.minRefeshTime) or forceRefresh :
self._update()
wheelState = self.wheel[self.currEpoch%len(self.wheel)]
if label == '' :
slabel = self.label
else :
slabel = label
if self.nbEpochs > 1 :
ratio = self.currEpoch/self.nbEpochs
snakeLen = int(self.width*ratio)
voidLen = int(self.width - (self.width*ratio))
if snakeLen + voidLen < self.width :
snakeLen = self.width - voidLen
self.bar = "%s %s[%s:>%s] %.2f%% (%d/%d) runtime: %s, remaining: %s, avg: %s" %(wheelState, slabel, "~-" * snakeLen, " " * voidLen, ratio*100, self.currEpoch, self.nbEpochs, self.runtime_hr, self.remtime_hr, self.formatTime(self.avg))
if self.currEpoch == self.nbEpochs :
self.close()
else :
w = self.width - len(self.miniSnake)
v = self.currEpoch%(w+1)
snake = "%s%s%s" %(" " * (v), self.miniSnake, " " * (w-v))
self.bar = "%s %s[%s] %s%% (%d/%s) runtime: %s, remaining: %s, avg: %s" %(wheelState, slabel, snake, '?', self.currEpoch, '?', self.runtime_hr, '?', self.formatTime(self.avg))
sys.stdout.write("\b" * (len(self.bar)+1))
sys.stdout.write(" " * (len(self.bar)+1))
sys.stdout.write("\b" * (len(self.bar)+1))
sys.stdout.write(self.bar)
sys.stdout.flush()
self.lastPrintTime = time.time()
if log :
self.log()
def close(self) :
"""Closes the bar so your next print will be on another line"""
self.update(forceRefresh = True)
print '\n'
if __name__ == "__main__" :
p = ProgressBar(nbEpochs = 100000000000)
for i in xrange(100000000000) :
p.update()
#time.sleep(3)
p.close()
================================================
FILE: pyGeno/tools/SecureMmap.py
================================================
import mmap
class SecureMmap:
"""In a normal mmap, modifying the string modifies the file. This is a mmap with write protection"""
def __init__(self, filename, enableWrite = False) :
self.enableWrite = enableWrite
self.filename = filename
self.name = filename
f = open(filename, 'r+b')
self.data = mmap.mmap(f.fileno(), 0)
def forceSet(self, x1, v) :
"""Forces modification even if the mmap is write protected"""
self.data[x1] = v
def __getitem__(self, i):
return self.data[i]
def __setitem__(self, i, v) :
if self.enableWrite :
raise IOError("Secure mmap is write protected")
else :
self.data[i] = v
def __str__(self) :
return "secure mmap, file: %s, writing enabled : %s" % (self.filename, str(self.enableWrite))
def __len__(self) :
return len(self.data)
================================================
FILE: pyGeno/tools/SegmentTree.py
================================================
import random, copy, types
def aux_insertTree(childTree, parentTree):
"""This a private (You shouldn't have to call it) recursive function that inserts a child tree into a parent tree."""
if childTree.x1 != None and childTree.x2 != None :
parentTree.insert(childTree.x1, childTree.x2, childTree.name, childTree.referedObject)
for c in childTree.children:
aux_insertTree(c, parentTree)
def aux_moveTree(offset, tree):
"""This a private recursive (You shouldn't have to call it) function that translates tree(and it's children) to a given x1"""
if tree.x1 != None and tree.x2 != None :
tree.x1, tree.x2 = tree.x1+offset, tree.x2+offset
for c in tree.children:
aux_moveTree(offset, c)
class SegmentTree :
""" Optimised genome annotations.
A segment tree is an arborescence of segments. First position is inclusive, second exlusive, respectively refered to as x1 and x2.
A segment tree has the following properties :
* The root has no x1 or x2 (both set to None).
* Segment are arrangend in an ascending order
* For two segment S1 and S2 : [S2.x1, S2.x2[ C [S1.x1, S1.x2[ <=> S2 is a child of S1
Here's an example of a tree :
* Root : 0-15
* ---->Segment : 0-12
* ------->Segment : 1-6
* ---------->Segment : 2-3
* ---------->Segment : 4-5
* ------->Segment : 7-8
* ------->Segment : 9-10
* ---->Segment : 11-14
* ------->Segment : 12-14
* ---->Segment : 13-15
Each segment can have a 'name' and a 'referedObject'. ReferedObject are objects are stored within the graph for future usage.
These objects are always stored in lists. If referedObject is already a list it will be stored as is.
"""
def __init__(self, x1 = None, x2 = None, name = '', referedObject = [], father = None, level = 0) :
if x1 > x2 :
self.x1, self.x2 = x2, x1
else :
self.x1, self.x2 = x1, x2
self.father = father
self.level = level
self.id = random.randint(0, 10**8)
self.name = name
self.children = []
self.referedObject = referedObject
def __addChild(self, segmentTree, index = -1) :
segmentTree.level = self.level + 1
if index < 0 :
self.children.append(segmentTree)
else :
self.children.insert(index, segmentTree)
def insert(self, x1, x2, name = '', referedObject = []) :
"""Insert the segment in it's right place and returns it.
If there's already a segment S as S.x1 == x1 and S.x2 == x2. S.name will be changed to 'S.name U name' and the
referedObject will be appended to the already existing list"""
if x1 > x2 :
xx1, xx2 = x2, x1
else :
xx1, xx2 = x1, x2
rt = None
insertId = None
childrenToRemove = []
for i in range(len(self.children)) :
if self.children[i].x1 == xx1 and xx2 == self.children[i].x2 :
self.children[i].name = self.children[i].name + ' U ' + name
self.children[i].referedObject.append(referedObject)
return self.children[i]
if self.children[i].x1 <= xx1 and xx2 <= self.children[i].x2 :
return self.children[i].insert(x1, x2, name, referedObject)
elif xx1 <= self.children[i].x1 and self.children[i].x2 <= xx2 :
if rt == None :
if type(referedObject) is types.ListType :
rt = SegmentTree(xx1, xx2, name, referedObject, self, self.level+1)
else :
rt = SegmentTree(xx1, xx2, name, [referedObject], self, self.level+1)
insertId = i
rt.__addChild(self.children[i])
self.children[i].father = rt
childrenToRemove.append(self.children[i])
elif xx1 <= self.children[i].x1 and xx2 <= self.children[i].x2
gitextract_0q2fbl_h/ ├── .gitignore ├── .travis.yml ├── CHANGELOG.rst ├── DESCRIPTION.rst ├── LICENSE ├── MANIFEST.in ├── README.rst ├── pyGeno/ │ ├── Chromosome.py │ ├── Exon.py │ ├── Gene.py │ ├── Genome.py │ ├── Protein.py │ ├── SNP.py │ ├── SNPFiltering.py │ ├── Transcript.py │ ├── __init__.py │ ├── bootstrap.py │ ├── bootstrap_data/ │ │ ├── SNPs/ │ │ │ ├── Human_agnostic.dummySRY_indels/ │ │ │ │ ├── manifest.ini │ │ │ │ └── snps.txt │ │ │ └── __init__.py │ │ ├── __init__.py │ │ └── genomes/ │ │ └── __init__.py │ ├── configuration.py │ ├── doc/ │ │ ├── Makefile │ │ ├── make.bat │ │ └── source/ │ │ ├── bootstraping.rst │ │ ├── citing.rst │ │ ├── conf.py │ │ ├── datawraps.rst │ │ ├── importation.rst │ │ ├── index.rst │ │ ├── installation.rst │ │ ├── objects.rst │ │ ├── parsers.rst │ │ ├── publications.rst │ │ ├── querying.rst │ │ ├── quickstart.rst │ │ ├── snp_filter.rst │ │ └── tools.rst │ ├── examples/ │ │ ├── __init__.py │ │ ├── bootstraping.py │ │ └── snps_queries.py │ ├── importation/ │ │ ├── Genomes.py │ │ ├── SNPs.py │ │ └── __init__.py │ ├── pyGenoObjectBases.py │ ├── tests/ │ │ ├── __init__.py │ │ ├── test_csv.py │ │ └── test_genome.py │ └── tools/ │ ├── BinarySequence.py │ ├── ProgressBar.py │ ├── SecureMmap.py │ ├── SegmentTree.py │ ├── SingletonManager.py │ ├── Stats.py │ ├── UsefulFunctions.py │ ├── __init__.py │ ├── io.py │ └── parsers/ │ ├── CSVTools.py │ ├── CasavaTools.py │ ├── FastaTools.py │ ├── FastqTools.py │ ├── GTFTools.py │ ├── VCFTools.py │ └── __init__.py └── setup.py
SYMBOL INDEX (425 symbols across 30 files)
FILE: pyGeno/Chromosome.py
class ChrosomeSequence (line 20) | class ChrosomeSequence(object) :
method __init__ (line 23) | def __init__(self, data, chromosome, refOnly = False) :
method setSNPFilter (line 30) | def setSNPFilter(self, SNPFilter) :
method getSequenceData (line 33) | def getSequenceData(self, slic) :
method _getSequence (line 88) | def _getSequence(self, slic) :
method __getitem__ (line 91) | def __getitem__(self, i) :
method __len__ (line 94) | def __len__(self) :
class Chromosome_Raba (line 97) | class Chromosome_Raba(pyGenoRabaObject) :
method _curate (line 110) | def _curate(self) :
class Chromosome (line 116) | class Chromosome(pyGenoRabaObjectWrapper) :
method __init__ (line 121) | def __init__(self, *args, **kwargs) :
method getSequenceData (line 134) | def getSequenceData(self, slic) :
method _makeLoadQuery (line 137) | def _makeLoadQuery(self, objectType, *args, **coolArgs) :
method __getitem__ (line 154) | def __getitem__(self, i) :
method __str__ (line 157) | def __str__(self) :
FILE: pyGeno/Exon.py
class Exon_Raba (line 8) | class Exon_Raba(pyGenoRabaObject) :
method _curate (line 30) | def _curate(self) :
class Exon (line 49) | class Exon(pyGenoRabaObjectWrapper) :
method __init__ (line 54) | def __init__(self, *args, **kwargs) :
method _makeLoadQuery (line 58) | def _makeLoadQuery(self, objectType, *args, **coolArgs) :
method _load_data (line 77) | def _load_data(self) :
method _load_bin_sequence (line 106) | def _load_bin_sequence(self) :
method hasCDS (line 112) | def hasCDS(self) :
method getCDSLength (line 118) | def getCDSLength(self) :
method find (line 122) | def find(self, sequence) :
method findAll (line 126) | def findAll(self, sequence):
method findInCDS (line 130) | def findInCDS(self, sequence) :
method findAllInCDS (line 134) | def findAllInCDS(self, sequence):
method pluck (line 138) | def pluck(self) :
method nextExon (line 145) | def nextExon(self) :
method previousExon (line 152) | def previousExon(self) :
method __str__ (line 163) | def __str__(self) :
method __len__ (line 166) | def __len__(self) :
FILE: pyGeno/Gene.py
class Gene_Raba (line 8) | class Gene_Raba(pyGenoRabaObject) :
method _curate (line 24) | def _curate(self) :
class Gene (line 27) | class Gene(pyGenoRabaObjectWrapper) :
method __init__ (line 32) | def __init__(self, *args, **kwargs) :
method _makeLoadQuery (line 35) | def _makeLoadQuery(self, objectType, *args, **coolArgs) :
method __str__ (line 54) | def __str__(self) :
FILE: pyGeno/Genome.py
function getGenomeList (line 16) | def getGenomeList() :
class Genome_Raba (line 25) | class Genome_Raba(pyGenoRabaObject) :
method _curate (line 37) | def _curate(self) :
method getSequencePath (line 40) | def getSequencePath(self) :
method getReferenceSequencePath (line 43) | def getReferenceSequencePath(self) :
method __len__ (line 46) | def __len__(self) :
class Genome (line 54) | class Genome(pyGenoRabaObjectWrapper) :
method __init__ (line 65) | def __init__(self, SNPs = None, SNPFilter = None, *args, **kwargs) :
method _makeLoadQuery (line 99) | def _makeLoadQuery(self, objectType, *args, **coolArgs) :
method __str__ (line 116) | def __str__(self) :
FILE: pyGeno/Protein.py
class Protein_Raba (line 12) | class Protein_Raba(pyGenoRabaObject) :
method _curate (line 25) | def _curate(self) :
class Protein (line 29) | class Protein(pyGenoRabaObjectWrapper) :
method __init__ (line 34) | def __init__(self, *args, **kwargs) :
method _makeLoadQuery (line 38) | def _makeLoadQuery(self, objectType, *args, **coolArgs) :
method _load_sequences (line 57) | def _load_sequences(self) :
method getSequence (line 64) | def getSequence(self):
method _load_bin_sequence (line 67) | def _load_bin_sequence(self) :
method getDefaultSequence (line 70) | def getDefaultSequence(self) :
method getPolymorphisms (line 74) | def getPolymorphisms(self) :
method find (line 78) | def find(self, sequence):
method findAll (line 82) | def findAll(self, sequence):
method findString (line 86) | def findString(self, sequence) :
method findStringAll (line 90) | def findStringAll(self, sequence):
method __getitem__ (line 94) | def __getitem__(self, i) :
method __len__ (line 97) | def __len__(self) :
method __str__ (line 100) | def __str__(self) :
FILE: pyGeno/SNP.py
function getSNPSetsList (line 18) | def getSNPSetsList() :
class SNPMaster (line 27) | class SNPMaster(Raba) :
method _curate (line 35) | def _curate(self) :
class SNP_INDEL (line 39) | class SNP_INDEL(Raba) :
method __init__ (line 56) | def __init__(self, *args, **kwargs) :
method __getattribute__ (line 59) | def __getattribute__(self, k) :
method __setattr__ (line 66) | def __setattr__(self, k, v) :
method _curate (line 73) | def _curate(self) :
method ensureGlobalIndex (line 77) | def ensureGlobalIndex(cls, fields) :
method __repr__ (line 80) | def __repr__(self) :
class CasavaSNP (line 83) | class CasavaSNP(SNP_INDEL) :
class AgnosticSNP (line 101) | class AgnosticSNP(SNP_INDEL) :
method __repr__ (line 122) | def __repr__(self) :
class dbSNPSNP (line 125) | class dbSNPSNP(SNP_INDEL) :
class TopHatSNP (line 198) | class TopHatSNP(SNP_INDEL) :
FILE: pyGeno/SNPFiltering.py
class Sequence_modifiers (line 7) | class Sequence_modifiers(object) :
method __init__ (line 9) | def __init__(self, sources = {}) :
method addSource (line 12) | def addSource(self, name, snp) :
class SequenceSNP (line 16) | class SequenceSNP(Sequence_modifiers) :
method __init__ (line 18) | def __init__(self, alleles, sources = {}) :
class SequenceInsert (line 25) | class SequenceInsert(Sequence_modifiers) :
method __init__ (line 28) | def __init__(self, bases, sources = {}, ref = '-') :
class SequenceDel (line 44) | class SequenceDel(Sequence_modifiers) :
method __init__ (line 46) | def __init__(self, length, sources = {}, ref = None, alt = '-') :
class SNPFilter (line 64) | class SNPFilter(object) :
method __init__ (line 67) | def __init__(self) :
method filter (line 70) | def filter(self, chromosome, **kwargs) :
class DefaultSNPFilter (line 73) | class DefaultSNPFilter(SNPFilter) :
method __init__ (line 102) | def __init__(self) :
method filter (line 105) | def filter(self, chromosome, **kwargs) :
FILE: pyGeno/Transcript.py
class Transcript_Raba (line 14) | class Transcript_Raba(pyGenoRabaObject) :
method _curate (line 34) | def _curate(self) :
class Transcript (line 52) | class Transcript(pyGenoRabaObjectWrapper) :
method __init__ (line 57) | def __init__(self, *args, **kwargs) :
method _makeLoadQuery (line 63) | def _makeLoadQuery(self, objectType, *args, **coolArgs) :
method _load_data (line 83) | def _load_data(self) :
method _load_bin_sequence (line 151) | def _load_bin_sequence(self) :
method getNucleotideCodon (line 157) | def getNucleotideCodon(self, cdnaX1) :
method getCodon (line 161) | def getCodon(self, i) :
method iterCodons (line 165) | def iterCodons(self) :
method find (line 170) | def find(self, sequence) :
method findAll (line 174) | def findAll(self, sequence):
method findIncDNA (line 178) | def findIncDNA(self, sequence) :
method findAllIncDNA (line 182) | def findAllIncDNA(self, sequence) :
method getcDNALength (line 186) | def getcDNALength(self) :
method findInUTR5 (line 190) | def findInUTR5(self, sequence) :
method findAllInUTR5 (line 194) | def findAllInUTR5(self, sequence) :
method getUTR5Length (line 198) | def getUTR5Length(self) :
method findInUTR3 (line 202) | def findInUTR3(self, sequence) :
method findAllInUTR3 (line 206) | def findAllInUTR3(self, sequence) :
method getUTR3Length (line 210) | def getUTR3Length(self) :
method getNbCodons (line 214) | def getNbCodons(self) :
method __getattribute__ (line 218) | def __getattribute__(self, name) :
method __getitem__ (line 221) | def __getitem__(self, i) :
method __len__ (line 224) | def __len__(self) :
method __str__ (line 227) | def __str__(self) :
FILE: pyGeno/bootstrap.py
function listRemoteDatawraps (line 11) | def listRemoteDatawraps(location = conf.pyGeno_REMOTE_LOCATION) :
function printRemoteDatawraps (line 19) | def printRemoteDatawraps(location = conf.pyGeno_REMOTE_LOCATION) :
function _DW (line 49) | def _DW(name, url) :
function importRemoteGenome (line 58) | def importRemoteGenome(name, batchSize = 100) :
function importRemoteSNPs (line 68) | def importRemoteSNPs(name) :
function listDatawraps (line 78) | def listDatawraps() :
function printDatawraps (line 91) | def printDatawraps() :
function importGenome (line 102) | def importGenome(name, batchSize = 100) :
function importSNPs (line 107) | def importSNPs(name) :
FILE: pyGeno/configuration.py
class PythonVersionError (line 6) | class PythonVersionError(Exception) :
function version (line 28) | def version() :
function prettyVersion (line 32) | def prettyVersion() :
function checkPythonVersion (line 36) | def checkPythonVersion() :
function getGenomeSequencePath (line 43) | def getGenomeSequencePath(specie, name) :
function createDefaultConfigFile (line 46) | def createDefaultConfigFile() :
function getSettingsPath (line 53) | def getSettingsPath() :
function removeFromDBRegistery (line 63) | def removeFromDBRegistery(obj) :
function freeDBRegistery (line 68) | def freeDBRegistery() :
function reload (line 72) | def reload() :
function pyGeno_init (line 76) | def pyGeno_init() :
FILE: pyGeno/examples/snps_queries.py
function printing (line 10) | def printing(gene) :
function printVs (line 23) | def printVs(refGene, presGene) :
function fancyExonQuery (line 35) | def fancyExonQuery(gene) :
class QMax_gt_filter (line 39) | class QMax_gt_filter(SNPFilter) :
method __init__ (line 41) | def __init__(self, threshold) :
method filter (line 44) | def filter(self, chromosome, dummySRY) :
FILE: pyGeno/importation/Genomes.py
function backUpDB (line 22) | def backUpDB() :
function _decompressPackage (line 30) | def _decompressPackage(packageFile) :
function _getFile (line 43) | def _getFile(fil, directory) :
function deleteGenome (line 58) | def deleteGenome(species, name) :
function importGenome (line 99) | def importGenome(packageFile, batchSize = 50, verbose = 0) :
function _importGenomeObjects (line 201) | def _importGenomeObjects(gtfFilePath, chroSet, genome, batchSize, verbos...
function _importSequence (line 445) | def _importSequence(chromosome, fastaFile, targetDir) :
FILE: pyGeno/importation/SNPs.py
function importSNPs (line 14) | def importSNPs(packageFile) :
function deleteSNPs (line 86) | def deleteSNPs(setName) :
function _importSNPs_AgnosticSNP (line 102) | def _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile) :
function _importSNPs_CasavaSNP (line 150) | def _importSNPs_CasavaSNP(setName, species, genomeSource, snpsFile) :
function _importSNPs_dbSNPSNP (line 197) | def _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile) :
function _importSNPs_TopHatSNP (line 236) | def _importSNPs_TopHatSNP(setName, species, genomeSource, snpsFile) :
FILE: pyGeno/pyGenoObjectBases.py
function nosave (line 7) | def nosave() :
class pyGenoRabaObject (line 10) | class pyGenoRabaObject(Raba) :
method __init__ (line 23) | def __init__(self) :
method _curate (line 27) | def _curate(self) :
method save (line 31) | def save(self) :
class pyGenoRabaObjectWrapper_metaclass (line 39) | class pyGenoRabaObjectWrapper_metaclass(type) :
method __new__ (line 44) | def __new__(cls, name, bases, dct) :
class RLWrapper (line 49) | class RLWrapper(object) :
method __init__ (line 51) | def __init__(self, rabaObj, listObjectType, rl) :
method __getitem__ (line 56) | def __getitem__(self, i) :
method __getattr__ (line 59) | def __getattr__(self, name) :
class pyGenoRabaObjectWrapper (line 63) | class pyGenoRabaObjectWrapper(object) :
method __init__ (line 73) | def __init__(self, wrapped_object_and_bag = (), *args, **kwargs) :
method _getObjBagKey (line 92) | def _getObjBagKey(self, obj) :
method _makeLoadQuery (line 98) | def _makeLoadQuery(self, objectType, *args, **coolArgs) :
method count (line 112) | def count(self, objectType, *args, **coolArgs) :
method get (line 116) | def get(self, objectType, *args, **coolArgs) :
method iterGet (line 137) | def iterGet(self, objectType, *args, **coolArgs) :
method __getattr__ (line 164) | def __getattr__(self, name) :
method getIndexes (line 200) | def getIndexes(cls) :
method flushIndexes (line 206) | def flushIndexes(cls) :
method help (line 212) | def help(cls) :
method ensureGlobalIndex (line 218) | def ensureGlobalIndex(cls, fields) :
method dropGlobalIndex (line 226) | def dropGlobalIndex(cls, fields) :
method getSequencesData (line 230) | def getSequencesData(self) :
method _load_sequences (line 235) | def _load_sequences(self) :
method _load_data (line 240) | def _load_data(self) :
method _load_bin_sequence (line 245) | def _load_bin_sequence(self) :
FILE: pyGeno/tests/test_csv.py
class CSVTests (line 4) | class CSVTests(unittest.TestCase):
method setUp (line 6) | def setUp(self):
method tearDown (line 9) | def tearDown(self):
method test_createParse (line 12) | def test_createParse(self) :
function runTests (line 28) | def runTests() :
FILE: pyGeno/tests/test_genome.py
class pyGenoSNPTests (line 8) | class pyGenoSNPTests(unittest.TestCase):
method setUp (line 10) | def setUp(self):
method tearDown (line 33) | def tearDown(self):
method test_vanilla (line 37) | def test_vanilla(self) :
method test_noModif (line 46) | def test_noModif(self) :
method test_insert (line 63) | def test_insert(self) :
method test_SNP (line 83) | def test_SNP(self) :
method test_deletion (line 103) | def test_deletion(self) :
method test_indels (line 123) | def test_indels(self) :
method test_bags (line 143) | def test_bags(self) :
method test_prot_find (line 148) | def test_prot_find(self) :
method test_trans_find (line 156) | def test_trans_find(self) :
function runTests (line 168) | def runTests() :
FILE: pyGeno/tools/BinarySequence.py
class BinarySequence (line 4) | class BinarySequence :
method __init__ (line 10) | def __init__(self, sequence, arrayForma, charToBinDict) :
method encode (line 21) | def encode(self, sequence):
method __testFind (line 63) | def __testFind(self, arr) :
method __testBinary (line 69) | def __testBinary(self, arr) :
method findPolymorphisms (line 75) | def findPolymorphisms(self, strSeq, strict = False):
method getPolymorphisms (line 96) | def getPolymorphisms(self):
method getDefaultSequence (line 100) | def getDefaultSequence(self) :
method __getSequenceVariants (line 104) | def __getSequenceVariants(self, x1, polyStart, polyStop, listSequence) :
method getSequenceVariants (line 121) | def getSequenceVariants(self, x1 = 0, x2 = -1, maxVariantNumber = 128) :
method getNbVariants (line 152) | def getNbVariants(self, x1, x2 = -1) :
method _dichFind (line 166) | def _dichFind(self, needle, currHaystack, offset, lst = None) :
method _kmp_construct_next (line 197) | def _kmp_construct_next(self, pattern):
method _kmp_search_first (line 209) | def _kmp_search_first(self, pInput_sequence, pPattern):
method _kmp_search_all (line 221) | def _kmp_search_all(self, pInput_sequence, pPattern):
method _kmp_find (line 236) | def _kmp_find(self, needle, haystack, lst = None):
method findByBiSearch (line 241) | def findByBiSearch(self, strSeq) :
method findAllByBiSearch (line 246) | def findAllByBiSearch(self, strSeq) :
method find (line 253) | def find(self, strSeq) :
method findAll (line 258) | def findAll(self, strSeq) :
method __and__ (line 265) | def __and__(self, arr) :
method __xor__ (line 274) | def __xor__(self, arr) :
method __eq__ (line 283) | def __eq__(self, seq) :
method append (line 292) | def append(self, arr) :
method extend (line 295) | def extend(self, arr) :
method decode (line 298) | def decode(self, binSequence):
method getChar (line 317) | def getChar(self, i):
method __len__ (line 320) | def __len__(self):
method __getitem__ (line 323) | def __getitem__(self,i):
method __setitem__ (line 326) | def __setitem__(self, i, v):
class AABinarySequence (line 329) | class AABinarySequence(BinarySequence) :
method __init__ (line 332) | def __init__(self, sequence):
class NucBinarySequence (line 337) | class NucBinarySequence(BinarySequence) :
method __init__ (line 340) | def __init__(self, sequence):
function test0 (line 357) | def test0() :
function testVariants (line 377) | def testVariants() :
function testFind (line 395) | def testFind():
function testFindByBiSearch (line 399) | def testFindByBiSearch():
function testFindAll (line 403) | def testFindAll():
function testFindAllByBiSearch (line 407) | def testFindAllByBiSearch():
FILE: pyGeno/tools/ProgressBar.py
class ProgressBar (line 3) | class ProgressBar :
method __init__ (line 15) | def __init__(self, nbEpochs = -1, width = 25, label = "progress", minR...
method formatTime (line 39) | def formatTime(self, val) :
method _update (line 47) | def _update(self) :
method log (line 60) | def log(self) :
method saveLogs (line 68) | def saveLogs(self, filename) :
method update (line 74) | def update(self, label = '', forceRefresh = False, log = False) :
method close (line 116) | def close(self) :
FILE: pyGeno/tools/SecureMmap.py
class SecureMmap (line 3) | class SecureMmap:
method __init__ (line 6) | def __init__(self, filename, enableWrite = False) :
method forceSet (line 15) | def forceSet(self, x1, v) :
method __getitem__ (line 19) | def __getitem__(self, i):
method __setitem__ (line 22) | def __setitem__(self, i, v) :
method __str__ (line 28) | def __str__(self) :
method __len__ (line 31) | def __len__(self) :
FILE: pyGeno/tools/SegmentTree.py
function aux_insertTree (line 3) | def aux_insertTree(childTree, parentTree):
function aux_moveTree (line 11) | def aux_moveTree(offset, tree):
class SegmentTree (line 19) | class SegmentTree :
method __init__ (line 56) | def __init__(self, x1 = None, x2 = None, name = '', referedObject = []...
method __addChild (line 70) | def __addChild(self, segmentTree, index = -1) :
method insert (line 77) | def insert(self, x1, x2, name = '', referedObject = []) :
method insertTree (line 133) | def insertTree(self, childTree):
method intersect (line 141) | def intersect(self, x1, x2 = None) :
method __dichotomicSearch (line 179) | def __dichotomicSearch(self, x1) :
method __radiateDown (line 196) | def __radiateDown(self, x1, x2, childId, condition) :
method __radiateUp (line 208) | def __radiateUp(self, x1, x2, childId, condition) :
method emptyChildren (line 220) | def emptyChildren(self) :
method removeGaps (line 224) | def removeGaps(self) :
method getX1 (line 231) | def getX1(self) :
method getX2 (line 237) | def getX2(self) :
method getIndexedLength (line 243) | def getIndexedLength(self) :
method getFirstLevel (line 256) | def getFirstLevel(self) :
method flatten (line 269) | def flatten(self) :
method move (line 302) | def move(self, newX1) :
method __str__ (line 311) | def __str__(self) :
method __str (line 323) | def __str(self) :
method __len__ (line 333) | def __len__(self) :
method __repr__ (line 340) | def __repr__(self):
FILE: pyGeno/tools/SingletonManager.py
function add (line 4) | def add(obj, objName='') :
function contains (line 16) | def contains(k) :
function get (line 19) | def get(objName) :
FILE: pyGeno/tools/Stats.py
function kullback_leibler (line 3) | def kullback_leibler(p, q) :
function squaredError_log10 (line 13) | def squaredError_log10(p, q) :
function fisherExactTest (line 22) | def fisherExactTest(table) :
FILE: pyGeno/tools/UsefulFunctions.py
class UnknownNucleotide (line 3) | class UnknownNucleotide(Exception) :
method __init__ (line 4) | def __init__(self, nuc) :
method __str__ (line 7) | def __str__(self) :
function saveResults (line 11) | def saveResults(directoryName, fileName, strResults, log = '', args = ''):
function findAll (line 136) | def findAll(haystack, needle) :
function complementTab (line 153) | def complementTab(seq=[]):
function reverseComplementTab (line 176) | def reverseComplementTab(seq):
function reverseComplement (line 182) | def reverseComplement(seq):
function complement (line 188) | def complement(seq) :
function translateDNA_6Frames (line 196) | def translateDNA_6Frames(sequence) :
function translateDNA (line 210) | def translateDNA(sequence, frame = 'f1', translTable_id='default') :
function getSequenceCombinaisons (line 252) | def getSequenceCombinaisons(polymorphipolymorphicDnaSeqSeq, pos = 0) :
function polymorphicCodonCombinaisons (line 276) | def polymorphicCodonCombinaisons(codon) :
function encodePolymorphicNucleotide (line 280) | def encodePolymorphicNucleotide(polySeq) :
function decodePolymorphicNucleotide (line 333) | def decodePolymorphicNucleotide(nuc) :
function decodePolymorphicNucleotide_str (line 343) | def decodePolymorphicNucleotide_str(nuc) :
function getNucleotideCodon (line 348) | def getNucleotideCodon(sequence, x1) :
function showDifferences (line 363) | def showDifferences(seq1, seq2) :
function highlightSubsequence (line 392) | def highlightSubsequence(sequence, x1, x2, start=' [', stop = '] ') :
FILE: pyGeno/tools/io.py
function printf (line 3) | def printf(*s) :
function enterConfirm_prompt (line 11) | def enterConfirm_prompt(enterMsg) :
function confirm_prompt (line 30) | def confirm_prompt(preMsg) :
FILE: pyGeno/tools/parsers/CSVTools.py
class EmptyLine (line 3) | class EmptyLine(Exception) :
method __init__ (line 6) | def __init__(self, lineNumber) :
method __str__ (line 11) | def __str__(self) :
function removeDuplicates (line 14) | def removeDuplicates(inFileName, outFileName) :
function catCSVs (line 36) | def catCSVs(folder, ouputFileName, removeDups = False) :
function joinCSVs (line 44) | def joinCSVs(csvFilePaths, column, ouputFileName, separator = ',') :
class CSVEntry (line 78) | class CSVEntry(object) :
method __init__ (line 81) | def __init__(self, csvFile, lineNumber = None) :
method commit (line 124) | def commit(self) :
method __iter__ (line 128) | def __iter__(self) :
method next (line 132) | def next(self) :
method __getitem__ (line 141) | def __getitem__(self, key) :
method __setitem__ (line 150) | def __setitem__(self, key, value) :
method toStr (line 166) | def toStr(self) :
method __repr__ (line 169) | def __repr__(self) :
method __str__ (line 176) | def __str__(self) :
class CSVFile (line 179) | class CSVFile(object) :
method __init__ (line 201) | def __init__(self, legend = [], separator = ',', lineSeparator = '\n') :
method addField (line 221) | def addField(self, field) :
method parse (line 231) | def parse(self, filePath, skipLines=0, separator = ',', stringSeparato...
method streamToFile (line 278) | def streamToFile(self, filename, keepInMemory = False, writeRate = 1) :
method commitLine (line 299) | def commitLine(self, line) :
method closeStreamToFile (line 312) | def closeStreamToFile(self) :
method _developLine (line 327) | def _developLine(self, line) :
method get (line 342) | def get(self, line, key) :
method set (line 346) | def set(self, line, key, val) :
method newLine (line 350) | def newLine(self) :
method insertLine (line 357) | def insertLine(self, i) :
method save (line 362) | def save(self, filePath) :
method toStr (line 370) | def toStr(self) :
method __iter__ (line 377) | def __iter__(self) :
method next (line 381) | def next(self) :
method __getitem__ (line 389) | def __getitem__(self, line) :
method __len__ (line 413) | def __len__(self) :
FILE: pyGeno/tools/parsers/CasavaTools.py
class SNPsTxtEntry (line 4) | class SNPsTxtEntry(object) :
method __init__ (line 7) | def __init__(self, lineNumber, snpsTxtFile) :
method __getitem__ (line 30) | def __getitem__(self, fieldName):
method __setitem__ (line 34) | def __setitem__(self, fieldName, value) :
method __str__ (line 38) | def __str__(self):
class SNPsTxtFile (line 41) | class SNPsTxtFile(object) :
method __init__ (line 50) | def __init__(self, fil, gziped = False) :
method reset (line 63) | def reset(self) :
method __iter__ (line 68) | def __iter__(self) :
method next (line 72) | def next(self) :
method __getitem__ (line 79) | def __getitem__(self, i) :
method __len__ (line 84) | def __len__(self) :
FILE: pyGeno/tools/parsers/FastaTools.py
class FastaFile (line 3) | class FastaFile(object) :
method __init__ (line 18) | def __init__(self, fil = None) :
method reset (line 23) | def reset(self) :
method parseStr (line 28) | def parseStr(self, st) :
method parseFile (line 32) | def parseFile(self, fil) :
method __splitLine (line 38) | def __splitLine(self, li) :
method get (line 47) | def get(self, i) :
method add (line 52) | def add(self, header, data) :
method save (line 59) | def save(self, filePath) :
method toStr (line 65) | def toStr(self) :
method __iter__ (line 73) | def __iter__(self) :
method next (line 77) | def next(self) :
method __getitem__ (line 87) | def __getitem__(self, i) :
method __setitem__ (line 91) | def __setitem__(self, i, v) :
method __len__ (line 98) | def __len__(self) :
FILE: pyGeno/tools/parsers/FastqTools.py
class FastqEntry (line 3) | class FastqEntry(object) :
method __init__ (line 6) | def __init__(self, ident = "", seq = "", plus = "", qual = "") :
method __getitem__ (line 13) | def __getitem__(self, i):
method __setitem__ (line 16) | def __setitem__(self, i, v) :
method __str__ (line 19) | def __str__(self):
class FastqFile (line 22) | class FastqFile(object) :
method __init__ (line 41) | def __init__(self, fil = None) :
method reset (line 46) | def reset(self) :
method parseStr (line 51) | def parseStr(self, st) :
method parseFile (line 57) | def parseFile(self, fil) :
method __splitEntry (line 63) | def __splitEntry(self, li) :
method get (line 70) | def get(self, li) :
method newEntry (line 76) | def newEntry(self, ident = "", seq = "", plus = "", qual = "") :
method add (line 82) | def add(self, fastqEntry) :
method save (line 86) | def save(self, filePath) :
method toStr (line 91) | def toStr(self) :
method __iter__ (line 98) | def __iter__(self) :
method next (line 102) | def next(self) :
method __getitem__ (line 112) | def __getitem__(self, i) :
method __setitem__ (line 116) | def __setitem__(self, i, v) :
method __len__ (line 123) | def __len__(self) :
FILE: pyGeno/tools/parsers/GTFTools.py
class GTFEntry (line 3) | class GTFEntry(object) :
method __init__ (line 4) | def __init__(self, gtfFile, lineNumber) :
method __getitem__ (line 17) | def __getitem__(self, k) :
method __repr__ (line 27) | def __repr__(self) :
method __str__ (line 30) | def __str__(self) :
class GTFFile (line 33) | class GTFFile(object) :
method __init__ (line 35) | def __init__(self, filename, gziped = False) :
method get (line 53) | def get(self, line, elmt) :
method __iter__ (line 57) | def __iter__(self) :
method next (line 61) | def next(self) :
method __getitem__ (line 68) | def __getitem__(self, i) :
method __repr__ (line 74) | def __repr__(self) :
method __str__ (line 77) | def __str__(self) :
method __len__ (line 80) | def __len__(self) :
FILE: pyGeno/tools/parsers/VCFTools.py
class VCFEntry (line 3) | class VCFEntry(object) :
method __init__ (line 6) | def __init__(self, vcfFile, line, lineNumber) :
method __getitem__ (line 48) | def __getitem__(self, key) :
method __repr__ (line 65) | def __repr__(self) :
method __str__ (line 68) | def __str__(self) :
class VCFFile (line 71) | class VCFFile(object) :
method __init__ (line 83) | def __init__(self, filename = None, gziped = False, stream = False) :
method parse (line 91) | def parse(self, filename, gziped = False, stream = False) :
method close (line 144) | def close(self) :
method _developLine (line 148) | def _developLine(self, lineNumber) :
method __iter__ (line 152) | def __iter__(self) :
method next (line 156) | def next(self) :
method __getitem__ (line 173) | def __getitem__(self, line) :
method __len__ (line 182) | def __len__(self) :
method __repr__ (line 186) | def __repr__(self) :
method __str__ (line 189) | def __str__(self) :
Condensed preview — 66 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (272K chars).
[
{
"path": ".gitignore",
"chars": 1192,
"preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\n"
},
{
"path": ".travis.yml",
"chars": 644,
"preview": "sudo: false\n\nnotifications:\n email: false\n\nlanguage: python\n\npython:\n - \"2.7\"\n\nbefore_install:\n - wget http://rep"
},
{
"path": "CHANGELOG.rst",
"chars": 3620,
"preview": "1.3.2\n=====\n\n* Search now uses KMD by default instead of dichotomic search (massive speed gain). Many thanks to @Keija f"
},
{
"path": "DESCRIPTION.rst",
"chars": 2592,
"preview": "pyGeno: A python package for Precision Medicine, Personalized Genomics and Proteomics\n=================================="
},
{
"path": "LICENSE",
"chars": 11324,
"preview": "Apache License\n Version 2.0, January 2004\n http://www.apache.org/licens"
},
{
"path": "MANIFEST.in",
"chars": 30,
"preview": "include *.rst\ninclude LICENSE\n"
},
{
"path": "README.rst",
"chars": 12216,
"preview": "CODE FREEZE:\n============\n\nPyGeno has long been limited due to it's backend. We are now ready to take it to the next lev"
},
{
"path": "pyGeno/Chromosome.py",
"chars": 4963,
"preview": "#import copy\n#import types\n#from tools import UsefulFunctions as uf\n\nfrom types import *\nimport configuration as conf\nfr"
},
{
"path": "pyGeno/Exon.py",
"chars": 5111,
"preview": "from pyGenoObjectBases import *\nfrom SNP import SNP_INDEL\n\nimport rabaDB.fields as rf\nfrom tools import UsefulFunctions "
},
{
"path": "pyGeno/Gene.py",
"chars": 1528,
"preview": "import configuration as conf\n\nfrom pyGenoObjectBases import *\nfrom SNP import SNP_INDEL\n\nimport rabaDB.fields as rf\n\ncla"
},
{
"path": "pyGeno/Genome.py",
"chars": 3236,
"preview": "import types\nimport configuration as conf\nimport pyGeno.tools.UsefulFunctions as uf\nfrom pyGenoObjectBases import *\n\nfro"
},
{
"path": "pyGeno/Protein.py",
"chars": 3238,
"preview": "import configuration as conf\n\nfrom pyGenoObjectBases import *\nfrom SNP import SNP_INDEL\n\nimport rabaDB.fields as rf\n\nfro"
},
{
"path": "pyGeno/SNP.py",
"chars": 11108,
"preview": "import types\n\nimport configuration as conf\n\nfrom pyGenoObjectBases import *\nimport rabaDB.fields as rf\n\n# from tools imp"
},
{
"path": "pyGeno/SNPFiltering.py",
"chars": 4665,
"preview": "import types\n\nimport configuration as conf\n\nfrom tools import UsefulFunctions as uf\n\nclass Sequence_modifiers(object) :\n"
},
{
"path": "pyGeno/Transcript.py",
"chars": 6546,
"preview": "import configuration as conf\n\nfrom pyGenoObjectBases import *\n\nimport rabaDB.fields as rf\n\nfrom tools import UsefulFunct"
},
{
"path": "pyGeno/__init__.py",
"chars": 136,
"preview": "__all__ = ['Genome', 'Chromosome', 'Gene', 'Transcript', 'Exon', 'Protein', 'SNP']\n\nfrom configuration import pyGeno_ini"
},
{
"path": "pyGeno/bootstrap.py",
"chars": 3386,
"preview": "import pyGeno.importation.Genomes as PG\nimport pyGeno.importation.SNPs as PS\nfrom pyGeno.tools.io import printf\nimport o"
},
{
"path": "pyGeno/bootstrap_data/SNPs/Human_agnostic.dummySRY_indels/manifest.ini",
"chars": 302,
"preview": "[package_infos]\ndescription = For testing purposes. All polymorphisms at the same position\nmaintainer = Tariq Daouda\nmai"
},
{
"path": "pyGeno/bootstrap_data/SNPs/Human_agnostic.dummySRY_indels/snps.txt",
"chars": 162,
"preview": "chromosomeNumber\tuniqueId\tstart\tend\tref\talleles\tquality\tcaller\nY\t1\t2655643\t2655646\tT\tAG\t30\ttest\nY\t2\t2655643\t2655647\t-\tAG"
},
{
"path": "pyGeno/bootstrap_data/SNPs/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "pyGeno/bootstrap_data/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "pyGeno/bootstrap_data/genomes/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "pyGeno/configuration.py",
"chars": 3493,
"preview": "import sys, os, time\nfrom ConfigParser import SafeConfigParser\nimport rabaDB.rabaSetup\nimport rabaDB.Raba\n\nclass PythonV"
},
{
"path": "pyGeno/doc/Makefile",
"chars": 7057,
"preview": "# Makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS =\nSPHINXBUILD "
},
{
"path": "pyGeno/doc/make.bat",
"chars": 7253,
"preview": "@ECHO OFF\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sphinx-build\r\n)\r\n"
},
{
"path": "pyGeno/doc/source/bootstraping.rst",
"chars": 157,
"preview": "Bootstraping\n==================================\npyGeno can be quick-started by importing these built-in data wraps.\n\n.. "
},
{
"path": "pyGeno/doc/source/citing.rst",
"chars": 149,
"preview": "Citing\n=========\n\nIf you are using pyGeno please mention it to the rest of the universe by including a link to: https://"
},
{
"path": "pyGeno/doc/source/conf.py",
"chars": 11529,
"preview": "# -*- coding: utf-8 -*-\n#\n# pyGeno documentation build configuration file, created by\n# sphinx-quickstart on Thu Nov 6 "
},
{
"path": "pyGeno/doc/source/datawraps.rst",
"chars": 3307,
"preview": "Datawraps\n=========\n\nDatawraps are used by pyGeno to import data into it's database. All reference genomes are downloade"
},
{
"path": "pyGeno/doc/source/importation.rst",
"chars": 1126,
"preview": "Importation\n===========\npyGeno's database is populated by importing tar.gz compressed archives called *datawraps*. An im"
},
{
"path": "pyGeno/doc/source/index.rst",
"chars": 4593,
"preview": ".. pyGeno documentation master file, created by\n sphinx-quickstart on Thu Nov 6 16:45:34 2014.\n You can adapt this "
},
{
"path": "pyGeno/doc/source/installation.rst",
"chars": 1184,
"preview": "Installation\n=============\n\nUnix (MacOS, Linux)\n-------------------\n\nThe latest stable version is available from pypi::\n"
},
{
"path": "pyGeno/doc/source/objects.rst",
"chars": 823,
"preview": "Objects\n=======\nWith pyGeno you can manipulate familiar object in intuituive way. All the following classes except SNP i"
},
{
"path": "pyGeno/doc/source/parsers.rst",
"chars": 631,
"preview": "Parsers\n=======\n\nPyGeno comes with a set of parsers that you can use independentely.\n\nCSV\n---\nTo read and write CSV file"
},
{
"path": "pyGeno/doc/source/publications.rst",
"chars": 1522,
"preview": "Publications\n============\n\nPlease cite this one:\n---------------------\n\n`pyGeno: A Python package for precision medicine"
},
{
"path": "pyGeno/doc/source/querying.rst",
"chars": 679,
"preview": "Querying\n=========\n\npyGeno is a personal database that you can query in many ways. Special emphasis has been placed upon"
},
{
"path": "pyGeno/doc/source/quickstart.rst",
"chars": 5569,
"preview": "Quickstart\n==========\n\nQuick importation\n-----------------\npyGeno's database is populated by importing data wraps.\npyGen"
},
{
"path": "pyGeno/doc/source/snp_filter.rst",
"chars": 571,
"preview": "Filtering Polymorphisms (SNPs and Indels)\n=========================================\n\nPolymorphism filtering is an import"
},
{
"path": "pyGeno/doc/source/tools.rst",
"chars": 813,
"preview": "Tools\n=====\n\npyGeno provides a set of tools that can be used independentely. Here you'll find goodies for translation, i"
},
{
"path": "pyGeno/examples/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "pyGeno/examples/bootstraping.py",
"chars": 151,
"preview": "import pyGeno.bootstrap as B\n\n#~ imports the whole human reference genome\n#~ B.importHumanReference()\nB.importHumanRefer"
},
{
"path": "pyGeno/examples/snps_queries.py",
"chars": 1953,
"preview": "from pyGeno.Genome import Genome\nfrom pyGeno.Gene import Gene\nfrom pyGeno.Transcript import Transcript\nfrom pyGeno.Prote"
},
{
"path": "pyGeno/importation/Genomes.py",
"chars": 18616,
"preview": "import os, glob, gzip, tarfile, shutil, time, sys, gc, cPickle, tempfile, urllib2\nfrom contextlib import closing\nfrom Co"
},
{
"path": "pyGeno/importation/SNPs.py",
"chars": 7555,
"preview": "import urllib, shutil\n\nfrom ConfigParser import SafeConfigParser\nimport pyGeno.configuration as conf\nfrom pyGeno.SNP imp"
},
{
"path": "pyGeno/importation/__init__.py",
"chars": 30,
"preview": "__all__ = ['Genomes', 'SNPs']\n"
},
{
"path": "pyGeno/pyGenoObjectBases.py",
"chars": 9009,
"preview": "import time, types, string\nimport configuration as conf\nfrom rabaDB.rabaSetup import *\nfrom rabaDB.Raba import *\nfrom ra"
},
{
"path": "pyGeno/tests/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "pyGeno/tests/test_csv.py",
"chars": 615,
"preview": "import unittest\nfrom pyGeno.tools.parsers.CSVTools import *\n\nclass CSVTests(unittest.TestCase):\n\t\t\n\tdef setUp(self):\n\t\tp"
},
{
"path": "pyGeno/tests/test_genome.py",
"chars": 6330,
"preview": "import unittest\nfrom pyGeno.Genome import *\n\nimport pyGeno.bootstrap as B\nfrom pyGeno.importation.Genomes import *\nfrom "
},
{
"path": "pyGeno/tools/BinarySequence.py",
"chars": 14017,
"preview": "import array, copy\nimport UsefulFunctions as uf\n\nclass BinarySequence :\n\t\"\"\"A class for representing sequences in a bina"
},
{
"path": "pyGeno/tools/ProgressBar.py",
"chars": 3883,
"preview": "import sys, time, cPickle\n\nclass ProgressBar :\n\t\"\"\"A very simple unthreaded progress bar. This progress bar also logs st"
},
{
"path": "pyGeno/tools/SecureMmap.py",
"chars": 811,
"preview": "import mmap\n\nclass SecureMmap:\n\t\"\"\"In a normal mmap, modifying the string modifies the file. This is a mmap with write p"
},
{
"path": "pyGeno/tools/SegmentTree.py",
"chars": 10443,
"preview": "import random, copy, types\n\ndef aux_insertTree(childTree, parentTree):\n\t\"\"\"This a private (You shouldn't have to call it"
},
{
"path": "pyGeno/tools/SingletonManager.py",
"chars": 309,
"preview": "#This thing is wonderful\n\nobjects = {}\ndef add(obj, objName='') :\n\t\n\tif objName == '' :\n\t\tkey = obj.name\n\telse :\n\t\tkey ="
},
{
"path": "pyGeno/tools/Stats.py",
"chars": 681,
"preview": "import numpy as np\n\ndef kullback_leibler(p, q) :\n\t\"\"\"Discrete Kullback-Leibler divergence D(P||Q)\"\"\"\n\tp = np.asarray(p, "
},
{
"path": "pyGeno/tools/UsefulFunctions.py",
"chars": 18006,
"preview": "import string, os, copy, types\n\nclass UnknownNucleotide(Exception) :\n\tdef __init__(self, nuc) :\n\t\tself.msg = 'Unknown n"
},
{
"path": "pyGeno/tools/__init__.py",
"chars": 170,
"preview": "__all__ = ['BinarySequence', 'CSVTools', 'FastaTools', 'FastqTools', 'GTFTools', 'io', 'ProgressBar', 'SecureMmap', 'Seg"
},
{
"path": "pyGeno/tools/io.py",
"chars": 757,
"preview": "import sys\n\ndef printf(*s) :\n\t'print + sys.stdout.flush()'\n\tfor e in s[:-1] :\n\t\tprint e,\n\tprint s[-1]\n\n\tsys.stdout.flush"
},
{
"path": "pyGeno/tools/parsers/CSVTools.py",
"chars": 10748,
"preview": "import os, types, collections\n\nclass EmptyLine(Exception) :\n\t\"\"\"Raised when an empty or comment line is found (dealt wit"
},
{
"path": "pyGeno/tools/parsers/CasavaTools.py",
"chars": 2182,
"preview": "import gzip\nimport pyGeno.tools.UsefulFunctions as uf\n\nclass SNPsTxtEntry(object) :\n\t\"\"\"A single entry in the casavas sn"
},
{
"path": "pyGeno/tools/parsers/FastaTools.py",
"chars": 2111,
"preview": "import os\n\nclass FastaFile(object) :\n\t\"\"\"\n\tRepresents a whole Fasta file::\n\t\t\n\t\t#reading\n\t\tf = FastaFile()\n\t\tf.parseFile"
},
{
"path": "pyGeno/tools/parsers/FastqTools.py",
"chars": 2766,
"preview": "import os\n\nclass FastqEntry(object) :\n\t\"\"\"A single entry in the FastqEntry file\"\"\"\n\t\n\tdef __init__(self, ident = \"\", seq"
},
{
"path": "pyGeno/tools/parsers/GTFTools.py",
"chars": 2278,
"preview": "import gzip\n\nclass GTFEntry(object) :\n\tdef __init__(self, gtfFile, lineNumber) :\n\t\t\"\"\"A single entry in a GTF file\"\"\"\n\t\t"
},
{
"path": "pyGeno/tools/parsers/VCFTools.py",
"chars": 6458,
"preview": "import os, types, gzip\n\nclass VCFEntry(object) :\n\t\"\"\"A single entry in a VCF file\"\"\"\n\t\n\tdef __init__(self, vcfFile, line"
},
{
"path": "pyGeno/tools/parsers/__init__.py",
"chars": 116,
"preview": "#Parsers for different types of files\n__all__ = ['CSVTools', 'FastaTools', 'FastqTools', 'GTFTools', 'CasavaTools']\n"
},
{
"path": "setup.py",
"chars": 2766,
"preview": "from setuptools import setup, find_packages \nfrom codecs import open\nfrom os import path\n\nhere = path.abspath(path.dirna"
}
]
About this extraction
This page contains the full source code of the tariqdaouda/pyGeno GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 66 files (243.4 KB), approximately 74.1k tokens, and a symbol index with 425 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.