Repository: tariqdaouda/pyGeno
Branch: bloody
Commit: 6311c9cd9444
Files: 66
Total size: 243.4 KB

Directory structure:
gitextract_0q2fbl_h/

├── .gitignore
├── .travis.yml
├── CHANGELOG.rst
├── DESCRIPTION.rst
├── LICENSE
├── MANIFEST.in
├── README.rst
├── pyGeno/
│   ├── Chromosome.py
│   ├── Exon.py
│   ├── Gene.py
│   ├── Genome.py
│   ├── Protein.py
│   ├── SNP.py
│   ├── SNPFiltering.py
│   ├── Transcript.py
│   ├── __init__.py
│   ├── bootstrap.py
│   ├── bootstrap_data/
│   │   ├── SNPs/
│   │   │   ├── Human_agnostic.dummySRY_indels/
│   │   │   │   ├── manifest.ini
│   │   │   │   └── snps.txt
│   │   │   └── __init__.py
│   │   ├── __init__.py
│   │   └── genomes/
│   │       └── __init__.py
│   ├── configuration.py
│   ├── doc/
│   │   ├── Makefile
│   │   ├── make.bat
│   │   └── source/
│   │       ├── bootstraping.rst
│   │       ├── citing.rst
│   │       ├── conf.py
│   │       ├── datawraps.rst
│   │       ├── importation.rst
│   │       ├── index.rst
│   │       ├── installation.rst
│   │       ├── objects.rst
│   │       ├── parsers.rst
│   │       ├── publications.rst
│   │       ├── querying.rst
│   │       ├── quickstart.rst
│   │       ├── snp_filter.rst
│   │       └── tools.rst
│   ├── examples/
│   │   ├── __init__.py
│   │   ├── bootstraping.py
│   │   └── snps_queries.py
│   ├── importation/
│   │   ├── Genomes.py
│   │   ├── SNPs.py
│   │   └── __init__.py
│   ├── pyGenoObjectBases.py
│   ├── tests/
│   │   ├── __init__.py
│   │   ├── test_csv.py
│   │   └── test_genome.py
│   └── tools/
│       ├── BinarySequence.py
│       ├── ProgressBar.py
│       ├── SecureMmap.py
│       ├── SegmentTree.py
│       ├── SingletonManager.py
│       ├── Stats.py
│       ├── UsefulFunctions.py
│       ├── __init__.py
│       ├── io.py
│       └── parsers/
│           ├── CSVTools.py
│           ├── CasavaTools.py
│           ├── FastaTools.py
│           ├── FastqTools.py
│           ├── GTFTools.py
│           ├── VCFTools.py
│           └── __init__.py
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/

### PyCharm ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm

## Directory-based project format
.idea/
# if you remove the above rule, at least ignore user-specific stuff:
# .idea/workspace.xml
# .idea/tasks.xml
# and these sensitive or high-churn files:
# .idea/dataSources.ids
# .idea/dataSources.xml
# .idea/sqlDataSources.xml
# .idea/dynamic.xml

## File-based project format
*.ipr
*.iws
*.iml

## Additional for IntelliJ
out/

# generated by mpeltonen/sbt-idea plugin
.idea_modules/


================================================
FILE: .travis.yml
================================================
sudo: false

notifications:
    email: false

language: python

python:
  - "2.7"

before_install:
    - wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
    - bash miniconda.sh -b -p $HOME/miniconda
    - export PATH="$HOME/miniconda/bin:$PATH"
    - conda update --yes conda

install:
    - conda install --yes python=$TRAVIS_PYTHON_VERSION pip numpy scipy
    - pip install coverage
    - pip install https://github.com/tariqdaouda/rabaDB/archive/stable.zip
    - python setup.py install

script: coverage run -m unittest discover pyGeno/tests/

after_success: bash <(curl -s https://codecov.io/bash)


================================================
FILE: CHANGELOG.rst
================================================
1.3.2
=====

* Search now uses KMD by default instead of dichotomic search (massive speed gain). Many thanks to @Keija for the implementation. Go to https://github.com/tariqdaouda/pyGeno/pull/34 for details and benchmarks.

1.3.1
=====

* CSVFile: fixed bug when slice start was None
* CSVFile: Better support for string separator
* AGN SNPs Quality cast to float by importer
* Travis integration
* Minor CSV parser updates

1.3.0
=====

* CSVFile will now ignore empty lines and comments

* Added synonymousCodonsFrequencies

1.2.9
=====

* It is no longer mandatory to set the whole legend of CSV file at initialization. It can figure it out by itself

* Datawraps can now be uncompressed folders

* Explicit error message if there's no file name manifest.ini in datawrap


1.2.8
=====

* Fixed BUG that prevented proper initialization and importation

1.2.5
=====

* BUG FIX: Opening a lot of chromosomes caused mmap to die screaming

* Removed core indexes. Sqlite sometimes chose them instead of user defined positional indexes, resulting un slow queries

* Doc updates

1.2.3
=====

* Added functions to retrieve the names of imported snps sets and genomes

* Added remote datawraps to the boostrap module that can be downloaded from pyGeno's website or any other location

* Added a field uniqueId to AgnosticSNPs

* Changed all latin datawrap names to english

* Removed datawrap for dbSNP GRCh37

1.2.2
=====

* Updated pypi package to include bootstrap datawraps

1.2.1
=====

* Fixed tests

1.2.0
=====
* BUG FIX: get()/iterGet() now works for SNPs and Indels

* BUG FIX: Default SNP filter used to return unwanted Nones for insertions

* BUG FIX: Added cast of lines to str in VCF and CasavaSNP parsers. Sometimes unicode caracters made the translation bug  

* BUG FIX: Corrected a typo that caused find in Transcript to recursively die 

* Added a new AgnosticSNP type of SNPs that can be easily made from the results of any snp caller. To make for the loss of support for casava by illumina. See SNP.AgnosticSNP for documentation

* pyGeno now comes with the murine reference genome GRCm38.78

* pyGeno now comes with the human reference genome GRCh38.78, GRCh37.75 is still shipped with pyGeno but might be in the upcoming versions

* pyGeno now comes with a datawrap for common dbSNPs human SNPs (SNP_dbSNP142_human_common_all.tar.gz)

* Added a dummy AgnosticSNP datawrap example for boostraping

* Changed the interface of the bootstrap module

* CSV Parser has now the ability to stream directly to a file


1.1.7
=====

* BUG FIX: looping through CSV lines now works

* Added tests for CSV

1.1.6
=====

* BUG FIX: find in BinarySequence could not find some subsequences at the tail of sequence

1.1.5
=====

* BUG FIX in default SNP filter

* Updated description

1.1.4
=====

* Another BUG FIX in progress bar

1.1.3
=====

* Small BUG FIX in the progress bar that caused epochs to be misrepresented

* 'Specie' has been changed to 'species' everywhere. That breaks the database the only way to fix it is to redo all importations

1.1.2
=====

* Genome import is now much more memory efficient

* BUG FIX: find in BinarySequence could not find subsequences at the tail of sequence

* Added a built-in datawrap with only chr1 and y

* Readme update with more infos about importation and link to doc
 
1.1.1
=====

Much better SNP filtering interface
------------------------------------
Easier and much morr coherent:

* SNP filtering has now its own module

* SNP Filters are now objects

* SNP Filters must return SequenceSNP, SNPInsert, SNPDeletion or None objects

1.0.0
=====
Freshly hatched


================================================
FILE: DESCRIPTION.rst
================================================
pyGeno: A python package for Precision Medicine, Personalized Genomics and Proteomics
=====================================================================================

Short description:
------------------

pyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_).

.. _Tariq Daouda: http://www.tariqdaouda.com
.. _IRIC: http://www.iric.ca

With pyGeno you can do that:

.. code:: python

 from pyGeno.Genome import *
 
 #load a genome 
 ref = Genome(name = 'GRCh37.75')
 #load a gene
 gene = ref.get(Gene, name = 'TPST2')[0]
 #print the sequences of all the isoforms
 for prot in gene.get(Protein) :
  print prot.sequence

You can also do it for the **specific genomes** of your subjects:

.. code:: python

 pers = Genome(name = 'GRCh37.75', SNPs = ["RNA_S1"], SNPFilter = myFilter())

And much more: https://github.com/tariqdaouda/pyGeno

Verbose Description
--------------------

pyGeno is a personal bioinformatic database that runs directly into python, on your laptop and does not depend
upon any REST API. pyGeno is here to make extracting data such as gene sequences a breeze, and is designed to
be able cope with huge queries. The most exciting feature of pyGeno, is that it allows to work with seamlessly with both reference and **Presonalized Genomes**.

Personalized Genomes, are custom genomes that you create by combining a reference genome, sets of polymorphims and an optional filter.
pyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get
direct access to the DNA and Protein sequences of your patients.

Multiple sets of of polymorphisms can also be combined together to leverage their independent benefits ex: 

RNA-seq and DNA-seq for the same individual to improve the coverage
RNA-seq of an individual + dbSNP for validation
Combine the results of RNA-seq of several individual to create a genome only containing the common polymorphisms
pyGeno is also a personal database that give you access to all the information provided by Ensembl (for both Reference and Personalized Genomes) without the need of queries to distant HTTP APIs. Allowing for much faster and reliable genome wide study pipelines.

It also comes with parsers for several file types and various other useful tools.

Full Documentation
------------------

The full documentation is available here_

.. _here: http://pygeno.iric.ca/

If you like pyGeno, please let me know.
For the latest news, you can follow me on twitter `@tariqdaouda`_.

.. _@tariqdaouda: https://www.twitter.com/tariqdaouda


================================================
FILE: LICENSE
================================================
Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "{}"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright {yyyy} {name of copyright owner}

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: MANIFEST.in
================================================
include *.rst
include LICENSE


================================================
FILE: README.rst
================================================
CODE FREEZE:
============

PyGeno has long been limited due to it's backend. We are now ready to take it to the next level.

We are working on a major port of pyGeno to the open-source multi-modal database ArangoDB. PyGeno's code on both branches master and bloody is frozen until we are finished. No pull request will be merged until then, and we won't implement any new features.

pyGeno: A Python package for precision medicine and proteogenomics
==================================================================

.. image:: http://depsy.org/api/package/pypi/pyGeno/badge.svg
   :alt: depsy
   :target: http://depsy.org/package/python/pyGeno

.. image:: https://pepy.tech/badge/pygeno
   :alt: downloads
   :target: https://pepy.tech/project/pygeno

.. image:: https://pepy.tech/badge/pygeno/month
   :alt: downloads_month
   :target: https://pepy.tech/project/pygeno/month

.. image:: https://pepy.tech/badge/pygeno/week
   :alt: downloads_week
   :target: https://pepy.tech/project/pygeno/week

.. image:: http://bioinfo.iric.ca/~daoudat/pyGeno/_static/logo.png
   :alt: pyGeno's logo
   

pyGeno is (to our knowledge) the only tool available that will gladly build your specific genomes for you.

pyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_), its logo is the work of the freelance designer `Sawssan Kaddoura`_.
For the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.

.. _Tariq Daouda: http://wwww.tariqdaouda.com
.. _IRIC: http://www.iric.ca
.. _Sawssan Kaddoura: http://sawssankaddoura.com

Click here for The `full documentation`_.

.. _full documentation: http://pygeno.iric.ca/

For the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.

.. _@tariqdaouda: https://www.twitter.com/tariqdaouda

Citing pyGeno:
--------------
Please cite this paper_.

.. _paper: http://f1000research.com/articles/5-381/v1

Installation:
-------------

It is recommended to install pyGeno within a `virtual environement`_, to setup one you can use:

.. code:: shell

        virtualenv ~/.pyGenoEnv
        source ~/.pyGenoEnv/bin/activate

pyGeno can be installed through pip:

.. code:: shell
	
	pip install pyGeno #for the latest stable version

Or github, for the latest developments:

.. code:: shell

	git clone https://github.com/tariqdaouda/pyGeno.git
	cd pyGeno
        python setup.py develop

.. _`virtual environement`: http://virtualenv.readthedocs.org/

A brief introduction
--------------------

pyGeno is a personal bioinformatic database that runs directly into python, on your laptop and does not depend
upon any REST API. pyGeno is here to make extracting data such as gene sequences a breeze, and is designed to
be able cope with huge queries. The most exciting feature of pyGeno, is that it allows to work with seamlessly with both reference and **Personalized Genomes**.

Personalized Genomes, are custom genomes that you create by combining a reference genome, sets of polymorphisms and an optional filter.
pyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get
direct access to the DNA and Protein sequences of your patients.

.. code:: python

	from pyGeno.Genome import *
	
	g = Genome(name = "GRCh37.75")
	prot = g.get(Protein, id = 'ENSP00000438917')[0]
	#print the protein sequence
	print prot.sequence
	#print the protein's gene biotype
	print prot.gene.biotype
	#print protein's transcript sequence
	print prot.transcript.sequence
	
	#fancy queries
	for exon in g.get(Exon, {"CDS_start >": x1, "CDS_end <=" : x2, "chromosome.number" : "22"}) :
		#print the exon's coding sequence
		print exon.CDS
		#print the exon's transcript sequence
		print exon.transcript.sequence
	
	#You can do the same for your subject specific genomes
	#by combining a reference genome with polymorphisms
	g = Genome(name = "GRCh37.75", SNPs = ["STY21_RNA"], SNPFilter = MyFilter())

And if you ever get lost, there's an online **help()** function for each object type:

.. code:: python

	from pyGeno.Genome import *
	
	print Exon.help()

Should output:

.. code::
	
	Available fields for Exon: CDS_start, end, chromosome, CDS_length, frame, number, CDS_end, start, genome, length, protein, gene, transcript, id, strand

	
Creating a Personalized Genome:
-------------------------------
Personalized Genomes are a powerful feature that allow you to work on the specific genomes and proteomes of your patients. You can even mix several SNP sets together.

.. code:: python
  
  from pyGeno.Genome import Genome
  #the name of the snp set is defined inside the datawrap's manifest.ini file
  dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')
  #you can also define a filter (ex: a quality filter) for the SNPs
  dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())
  #and even mix several snp sets  
  dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())

Filtering SNPs:
---------------
pyGeno allows you to select the Polymorphisms that end up into the final sequences. It supports SNPs, Inserts and Deletions.

.. code:: python
	
	from pyGeno.SNPFiltering import SNPFilter, SequenceSNP

	class QMax_gt_filter(SNPFilter) :
		
		def __init__(self, threshold) :
			self.threshold = threshold
		
		#Here SNPs is a dictionary: SNPSet Name => polymorphism  
		#This filter ignores deletions and insertions and
		#but applis all SNPs
		def filter(self, chromosome, **SNPs) :
			sources = {}
			alleles = []
			for snpSet, snp in SNPs.iteritems() :
				pos = snp.start
				if snp.alt[0] == '-' :
					pass
				elif snp.ref[0] == '-' :
					pass
				else :
					sources[snpSet] = snp
					alleles.append(snp.alt) #if not an indel append the polymorphism
				
			#appends the refence allele to the lot
			refAllele = chromosome.refSequence[pos]
			alleles.append(refAllele)
			sources['ref'] = refAllele
	
			#optional we keep a record of the polymorphisms that were used during the process
			return SequenceSNP(alleles, sources = sources)
		
The filter function can also be made more specific by using arguments that have the same names as the SNPSets

.. code:: python

	def filter(self, chromosome, dummySRY = None) :
		if dummySRY.Qmax_gt > self.threshold :
			#other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)
			return SequenceSNP(dummySRY.alt)
		return None #None means keep the reference allele

To apply the filter simply specify if while loading the genome.

.. code:: python

	persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))

To include several SNPSets use a list.

.. code:: python

	persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = ['ARN_P1', 'ARN_P2'], SNPFilter = myFilter())

Getting an arbitrary sequence:
------------------------------
You can ask for any sequence of any chromosome:

.. code:: python
	
	chr12 = myGenome.get(Chromosome, number = "12")[0]
	print chr12.sequence[x1:x2]
	# for the reference sequence
  	print chr12.refSequence[x1:x2]

Batteries included (bootstraping):
---------------------------------

pyGeno's database is populated by importing datawraps.
pyGeno comes with a few data wraps, to get the list you can use:

.. code:: python
	
	import pyGeno.bootstrap as B
	B.printDatawraps()

.. code::

	Available datawraps for boostraping
	
	SNPs
	~~~~|
	    |~~~:> Human_agnostic.dummySRY.tar.gz
	    |~~~:> Human.dummySRY_casava.tar.gz
	    |~~~:> dbSNP142_human_common_all.tar.gz
	
	
	Genomes
	~~~~~~~|
	       |~~~:> Human.GRCh37.75.tar.gz
	       |~~~:> Human.GRCh37.75_Y-Only.tar.gz
	       |~~~:> Human.GRCh38.78.tar.gz
	       |~~~:> Mouse.GRCm38.78.tar.gz

To get a list of remote datawraps that pyGeno can download for you, do:

.. code:: python

	B.printRemoteDatawraps()

Importing whole genomes is a demanding process that take more than an hour and requires (according to tests) 
at least 3GB of memory. Depending on your configuration, more might be required.

That being said importating a data wrap is a one time operation and once the importation is complete the datawrap
can be discarded without consequences.

The bootstrap module also has some handy functions for importing built-in packages.

Some of them just for playing around with pyGeno (**Fast importation** and **Small memory requirements**):

.. code:: python
	
	import pyGeno.bootstrap as B

	#Imports only the Y chromosome from the human reference genome GRCh37.75
	#Very fast, requires even less memory. No download required.
	B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
	
	#A dummy datawrap for humans SNPs and Indels in pyGeno's AgnosticSNP  format. 
	# This one has one SNP at the begining of the gene SRY
	B.importSNPs("Human.dummySRY_casava.tar.gz")

And for more **Serious Work**, the whole reference genome.

.. code:: python

	#Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.
	B.importGenome("Human.GRCh38.78.tar.gz")
	
Importing a custom datawrap:
--------------------------

.. code:: python

  from pyGeno.importation.Genomes import *
  importGenome('GRCh37.75.tar.gz')

To import a patient's specific polymorphisms

.. code:: python

  from pyGeno.importation.SNPs import *
  importSNPs('patient1.tar.gz')

For a list of available datawraps available for download, please have a look here_.

You can easily make your own datawraps with any tar.gz compressor.
For more details on how datawraps are made you can check wiki_ or have a look inside the folder bootstrap_data.

.. _here: http://pygeno.iric.ca/datawraps.html
.. _wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-friendly-package-to-import-your-data%3F

Instanciating a genome:
-----------------------
.. code:: python
	
	from pyGeno.Genome import Genome
	#the name of the genome is defined inside the package's manifest.ini file
	ref = Genome(name = 'GRCh37.75')

Printing all the proteins of a gene:
-----------------------------------
.. code:: python

  from pyGeno.Genome import Genome
  from pyGeno.Gene import Gene
  from pyGeno.Protein import Protein

Or simply:

.. code:: python

  from pyGeno.Genome import *

then:

.. code:: python

  ref = Genome(name = 'GRCh37.75')
  #get returns a list of elements
  gene = ref.get(Gene, name = 'TPST2')[0]
  for prot in gene.get(Protein) :
  	print prot.sequence

Making queries, get() Vs iterGet():
-----------------------------------
iterGet is a faster version of get that returns an iterator instead of a list.

Making queries, syntax:
----------------------
pyGeno's get function uses the expressivity of rabaDB.

These are all possible query formats:

.. code:: python

  ref.get(Gene, name = "SRY")
  ref.get(Gene, { "name like" : "HLA"})
  chr12.get(Exon, { "start >=" : 12000, "end <" : 12300 })
  ref.get(Transcript, { "gene.name" : 'SRY' })

Creating indexes to speed up queries:
------------------------------------
.. code:: python

  from pyGeno.Gene import Gene
  #creating an index on gene names if it does not already exist
  Gene.ensureGlobalIndex('name')
  #removing the index
  Gene.dropIndex('name')

Find in sequences:
------------------

Internally pyGeno uses a binary representation for nucleotides and amino acids to deal with polymorphisms. 
For example,both "AGC" and "ATG" will match the following sequence "...AT/GCCG...".

.. code:: python

	#returns the position of the first occurence
	transcript.find("AT/GCCG")
	#returns the positions of all occurences
	transcript.findAll("AT/GCCG")
	
	#similarly, you can also do
	transcript.findIncDNA("AT/GCCG")
	transcript.findAllIncDNA("AT/GCCG")
	transcript.findInUTR3("AT/GCCG")
	transcript.findAllInUTR3("AT/GCCG")
	transcript.findInUTR5("AT/GCCG")
	transcript.findAllInUTR5("AT/GCCG")
	
	#same for proteins
	protein.find("DEV/RDEM")
	protein.findAll("DEV/RDEM")
	
	#and for exons
	exon.find("AT/GCCG")
	exon.findAll("AT/GCCG")
	exon.findInCDS("AT/GCCG")
	exon.findAllInCDS("AT/GCCG")
	#...

	
Progress Bar:
-------------
.. code:: python

  from pyGeno.tools.ProgressBar import ProgressBar
  pg = ProgressBar(nbEpochs = 155)
  for i in range(155) :
  	pg.update(label = '%d' %i) # or simply p.update() 
  pg.close()


================================================
FILE: pyGeno/Chromosome.py
================================================
#import copy
#import types
#from tools import UsefulFunctions as uf

from types import *
import configuration as conf
from pyGenoObjectBases import *

from SNP import *
import SNPFiltering as SF

from rabaDB.filters import RabaQuery
import rabaDB.fields as rf

from tools.SecureMmap import SecureMmap as SecureMmap
from tools import SingletonManager

import pyGeno.configuration as conf

class ChrosomeSequence(object) :
	"""Represents a chromosome sequence. If 'refOnly' no ploymorphisms are applied and the ref sequence is always returned"""

	def __init__(self, data, chromosome, refOnly = False) :
		
		self.data = data
		self.refOnly = refOnly
		self.chromosome = chromosome
		self.setSNPFilter(self.chromosome.genome.SNPFilter)
	
	def setSNPFilter(self, SNPFilter) :
		self.SNPFilter = SNPFilter
	
	def getSequenceData(self, slic) :
		data = self.data[slic]
		SNPTypes = self.chromosome.genome.SNPTypes
		if SNPTypes is None or self.refOnly :
			return data
		
		iterators = []
		for setName, SNPType in SNPTypes.iteritems() :
			f = RabaQuery(str(SNPType), namespace = self.chromosome._raba_namespace)
			
			chromosomeNumber = self.chromosome.number

			if chromosomeNumber == 'MT':
				chromosomeNumber = 'M'
			
			f.addFilter({'start >=' : slic.start, 'start <' : slic.stop, 'setName' : str(setName), 'chromosomeNumber' : chromosomeNumber})
			# conf.db.enableDebug(True)
			iterators.append(f.iterRun(sqlTail = 'ORDER BY start'))
		
		if len(iterators) < 1 :
			return data
		
		polys = {}
		for iterator in iterators :
			for poly in iterator :
				if poly.start not in polys :
					polys[poly.start] = {poly.setName : poly}
				else :
					try :
						polys[poly.start][poly.setName].append(poly)
					except :
						polys[poly.start][poly.setName] = [polys[poly.start][poly.setName]]
						polys[poly.start][poly.setName].append(poly)
						
		data = list(data)
		for start, setPolys in polys.iteritems() :
			
			seqPos = start - slic.start
			sequenceModifier = self.SNPFilter.filter(self.chromosome, **setPolys)
			# print sequenceModifier.alleles
			if sequenceModifier is not None :
				if sequenceModifier.__class__ is SF.SequenceDel :
					seqPos = seqPos + sequenceModifier.offset
					#To avoid to change the length of the sequence who can create some bug or side effect
					data[seqPos:(seqPos + sequenceModifier.length)] = [''] * sequenceModifier.length
				elif sequenceModifier.__class__ is SF.SequenceSNP :
					data[seqPos] = sequenceModifier.alleles
				elif sequenceModifier.__class__ is SF.SequenceInsert :
					seqPos = seqPos + sequenceModifier.offset
					data[seqPos] = "%s%s" % (data[seqPos], sequenceModifier.bases)
				else :
					raise TypeError("sequenceModifier on chromosome: %s starting at: %s is of unknown type: %s" % (self.chromosome.number, snp.start, sequenceModifier.__class__))

		return data
	
	def _getSequence(self, slic) :
		return ''.join(self.getSequenceData(slice(0, None, 1)))[slic]

	def __getitem__(self, i) :
		return self._getSequence(i)

	def __len__(self) :
		return self.chromosome.length

class Chromosome_Raba(pyGenoRabaObject) :
	"""The wrapped Raba object that really holds the data"""
	
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE

	header = rf.Primitive()
	number = rf.Primitive()
	start = rf.Primitive()
	end = rf.Primitive()
	length = rf.Primitive()

	genome = rf.RabaObject('Genome_Raba')

	def _curate(self) :
		if  self.end != None and self.start != None :
			self.length = self.end-self.start
		if self.number != None :
			self.number =  str(self.number).upper()

class Chromosome(pyGenoRabaObjectWrapper) :
	"""The wrapper for playing with Chromosomes"""
	
	_wrapped_class = Chromosome_Raba

	def __init__(self, *args, **kwargs) :
		pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)

		path = '%s/chromosome%s.dat'%(self.genome.getSequencePath(), self.number)
		if not SingletonManager.contains(path) :
			datMap = SingletonManager.add(SecureMmap(path), path)
		else :
			datMap = SingletonManager.get(path)
			
		self.sequence = ChrosomeSequence(datMap, self)
		self.refSequence = ChrosomeSequence(datMap, self, refOnly = True)
		self.loadSequences = False

	def getSequenceData(self, slic) :
		return self.sequence.getSequenceData(slic)

	def _makeLoadQuery(self, objectType, *args, **coolArgs) :
		if issubclass(objectType, SNP_INDEL) :
			f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
			coolArgs['species'] = self.genome.species
			coolArgs['chromosomeNumber'] = self.number

			if len(args) > 0 and type(args[0]) is types.ListType :
				for a in args[0] :
					if type(a) is types.DictType :
						f.addFilter(**a)
			else :
				f.addFilter(*args, **coolArgs)

			return f
		
		return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)

	def __getitem__(self, i) :
		return self.sequence[i]

	def __str__(self) :
		return "Chromosome: number %s > %s" %(self.wrapped_object.number, str(self.wrapped_object.genome))


================================================
FILE: pyGeno/Exon.py
================================================
from pyGenoObjectBases import *
from SNP import SNP_INDEL

import rabaDB.fields as rf
from tools import UsefulFunctions as uf
from tools.BinarySequence import NucBinarySequence

class Exon_Raba(pyGenoRabaObject) :
	"""The wrapped Raba object that really holds the data"""
	
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE

	id = rf.Primitive()
	number = rf.Primitive()
	start = rf.Primitive()
	end = rf.Primitive()
	length = rf.Primitive()
	CDS_length = rf.Primitive()
	CDS_start = rf.Primitive()
	CDS_end = rf.Primitive()
	frame = rf.Primitive()
	strand = rf.Primitive()

	genome = rf.RabaObject('Genome_Raba')
	chromosome = rf.RabaObject('Chromosome_Raba')
	gene = rf.RabaObject('Gene_Raba')
	transcript = rf.RabaObject('Transcript_Raba')
	protein = rf.RabaObject('Protein_Raba')

	def _curate(self) :
		if self.start != None and self.end != None :
			if self.start > self.end :
				self.start, self.end = self.end, self.start
			self.length = self.end-self.start

		if self.CDS_start != None and self.CDS_end != None :
			if self.CDS_start > self.CDS_end :
				self.CDS_start, self.CDS_end = self.CDS_end, self.CDS_start
			self.CDS_length = self.CDS_end - self.CDS_start
		
		if self.number != None :
			self.number = int(self.number)

		if not self.frame or self.frame == '.' :
			self.frame = None
		else :
			self.frame = int(self.frame)

class Exon(pyGenoRabaObjectWrapper) :
	"""The wrapper for playing with Exons"""
		
	_wrapped_class = Exon_Raba

	def __init__(self, *args, **kwargs) :
		pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
		self._load_sequencesTriggers = set(["UTR5", "UTR3", "CDS", "sequence", "data"])

	def _makeLoadQuery(self, objectType, *args, **coolArgs) :
		if issubclass(objectType, SNP_INDEL) :
			f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
			coolArgs['species'] = self.genome.species
			coolArgs['chromosomeNumber'] = self.chromosome.number
			coolArgs['start >='] = self.start
			coolArgs['start <'] = self.end
		
			if len(args) > 0 and type(args[0]) is types.ListType :
				for a in args[0] :
					if type(a) is types.DictType :
						f.addFilter(**a)
			else :
				f.addFilter(*args, **coolArgs)

			return f
		
		return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
	
	def _load_data(self) :
		data = self.chromosome.getSequenceData(slice(self.start,self.end))

		diffLen = (self.end-self.start) - len(data)
		
		if self.strand == '+' :
			self.data = data
		else :
			self.data = uf.reverseComplementTab(data)

		if self.hasCDS() :
			start = self.CDS_start-self.start
			end = self.CDS_end-self.start
			
			if self.strand == '+' :
				self.UTR5 = self.data[:start]
				self.CDS = self.data[start:end+diffLen]
				self.UTR3 = self.data[end+diffLen:]
			else :
				self.UTR5 = self.data[:len(self.data)-(end-diffLen)]
				self.CDS = self.data[len(self.data)-(end-diffLen):len(self.data)-start]
				self.UTR3 = self.data[len(self.data)-start:]
		else :
			self.UTR5 = ''
			self.CDS = ''
			self.UTR3 = ''

		self.sequence = ''.join(self.data)

	def _load_bin_sequence(self) :
		self.bin_sequence = NucBinarySequence(self.sequence)
		self.bin_UTR5 =  NucBinarySequence(self.UTR5)
		self.bin_CDS =  NucBinarySequence(self.CDS)
		self.bin_UTR3 =  NucBinarySequence(self.UTR3)
		
	def hasCDS(self) :
		"""returns true or false depending on if the exon has a CDS"""
		if self.CDS_start != None and self.CDS_end != None:
			return True
		return False

	def getCDSLength(self) :
		"""returns the length of the CDS sequence"""
		return len(self.CDS)

	def find(self, sequence) :
		"""return the position of the first occurance of sequence"""
		return self.bin_sequence.find(sequence)

	def findAll(self, sequence):
		"""Returns a lits of all positions where sequence was found"""
		return self.bin_sequence.findAll(sequence)

	def findInCDS(self, sequence) :
		"""return the position of the first occurance of sequence"""
		return self.bin_CDS.find(sequence)

	def findAllInCDS(self, sequence):
		"""Returns a lits of all positions where sequence was found"""
		return self.bin_CDS.findAll(sequence)

	def pluck(self) :
		"""Returns a plucked object. Plucks the exon off the tree, set the value of self.transcript into str(self.transcript). This effectively disconnects the object and
		makes it much more lighter in case you'd like to pickle it"""
		e = copy.copy(self)
		e.transcript = str(self.transcript)
		return e

	def nextExon(self) :
		"""Returns the next exon of the transcript, or None if there is none"""
		try :
			return self.transcript.exons[self.number+1]
		except IndexError :
			return None
	
	def previousExon(self) :
		"""Returns the previous exon of the transcript, or None if there is none"""
		
		if self.number == 0 :
			return None
		
		try :
			return self.transcript.exons[self.number-1]
		except IndexError :
			return None
		
	def __str__(self) :
		return """EXON, id %s, number: %s, (start, end): (%s, %s), cds: (%s, %s) > %s""" %( self.id, self.number, self.start, self.end, self.CDS_start, self.CDS_end, str(self.transcript))

	def __len__(self) :
		return len(self.sequence)


================================================
FILE: pyGeno/Gene.py
================================================
import configuration as conf

from pyGenoObjectBases import *
from SNP import SNP_INDEL

import rabaDB.fields as rf

class Gene_Raba(pyGenoRabaObject) :
	"""The wrapped Raba object that really holds the data"""
	
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE

	id = rf.Primitive()
	name = rf.Primitive()
	strand = rf.Primitive()
	biotype = rf.Primitive()
	
	start = rf.Primitive()
	end = rf.Primitive()
	
	genome = rf.RabaObject('Genome_Raba')
	chromosome = rf.RabaObject('Chromosome_Raba')

	def _curate(self) :
		self.name = self.name.upper()

class Gene(pyGenoRabaObjectWrapper) :
	"""The wrapper for playing with genes"""
	
	_wrapped_class = Gene_Raba

	def __init__(self, *args, **kwargs) :
		pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)

	def _makeLoadQuery(self, objectType, *args, **coolArgs) :
		if issubclass(objectType, SNP_INDEL) :
			f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
			coolArgs['species'] = self.genome.species
			coolArgs['chromosomeNumber'] = self.chromosome.number
			coolArgs['start >='] = self.start
			coolArgs['start <'] = self.end
		
			if len(args) > 0 and type(args[0]) is types.ListType :
				for a in args[0] :
					if type(a) is types.DictType :
						f.addFilter(**a)
			else :
				f.addFilter(*args, **coolArgs)

			return f
		
		return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
	
	def __str__(self) :
		return "Gene, name: %s, id: %s, strand: '%s' > %s" %(self.name, self.id, self.strand, str(self.chromosome))


================================================
FILE: pyGeno/Genome.py
================================================
import types
import configuration as conf
import pyGeno.tools.UsefulFunctions as uf
from pyGenoObjectBases import *

from Chromosome import Chromosome
from Gene import Gene
from Transcript import Transcript
from Protein import Protein
from Exon import Exon
import SNPFiltering as SF
from SNP import *

import rabaDB.fields as rf

def getGenomeList() :
	"""Return the names of all imported genomes"""
	import rabaDB.filters as rfilt
	f = rfilt.RabaQuery(Genome_Raba)
	names = []
	for g in f.iterRun() :
		names.append(g.name)
	return names

class Genome_Raba(pyGenoRabaObject) :
	"""The wrapped Raba object that really holds the data"""
	
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE
	#_raba_not_a_singleton = True #you can have several instances of the same genome but they all share the same location in the database

	name = rf.Primitive()
	species = rf.Primitive()

	source = rf.Primitive()
	packageInfos = rf.Primitive()

	def _curate(self) :
		self.species = self.species.lower()

	def getSequencePath(self) :
		return conf.getGenomeSequencePath(self.species, self.name)

	def getReferenceSequencePath(self) :
		return conf.getReferenceGenomeSequencePath(self.species)

	def __len__(self) :
		"""Size of the genome in pb"""
		l = 0
		for c in self.chromosomes :
			l +=  len(c)

		return l

class Genome(pyGenoRabaObjectWrapper) :	
	"""
	This is the entry point to pyGeno::
		
		myGeno = Genome(name = 'GRCh37.75', SNPs = ['RNA_S1', 'DNA_S1'], SNPFilter = MyFilter)
		for prot in myGeno.get(Protein) :
			print prot.sequence
	
	"""
	_wrapped_class = Genome_Raba

	def __init__(self, SNPs = None, SNPFilter = None,  *args, **kwargs) :
		
		pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)

		if type(SNPs) is types.StringType :
			self.SNPsSets = [SNPs]
		else :
			self.SNPsSets = SNPs

		# print "pifpasdf", self.SNPsSets

		if SNPFilter is None :
			self.SNPFilter = SF.DefaultSNPFilter()
		else :
			if issubclass(SNPFilter.__class__, SF.SNPFilter) :
				self.SNPFilter = SNPFilter
			else :
				raise ValueError("The value of 'SNPFilter' is not an object deriving from a subclass of SNPFiltering.SNPFilter. Got: '%s'" % SNPFilter)

		self.SNPTypes = {}
		
		if SNPs is not None :
			f = RabaQuery(SNPMaster, namespace = self._raba_namespace)
			for se in self.SNPsSets :
				f.addFilter(setName = se, species = self.species)

			res = f.run()
			if res is None or len(res) < 1 :
				raise ValueError("There's no set of SNPs that goes by the name of %s for species %s" % (SNPs, self.species))

			for s in res :
				# print s.setName, s.SNPType
				self.SNPTypes[s.setName] = s.SNPType

	def _makeLoadQuery(self, objectType, *args, **coolArgs) :
		if issubclass(objectType, SNP_INDEL) :
			# conf.db.enableDebug(True)
			f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
			coolArgs['species'] = self.species

			if len(args) > 0 and type(args[0]) is types.ListType :
				for a in args[0] :
					if type(a) is types.DictType :
						f.addFilter(**a)
			else :
				f.addFilter(*args, **coolArgs)

			return f
		
		return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
	
	def __str__(self) :
		return "Genome: %s/%s SNPs: %s" %(self.species, self.name, self.SNPTypes)


================================================
FILE: pyGeno/Protein.py
================================================
import configuration as conf

from pyGenoObjectBases import *
from SNP import SNP_INDEL

import rabaDB.fields as rf

from tools import UsefulFunctions as uf
from tools.BinarySequence import AABinarySequence
import copy

class Protein_Raba(pyGenoRabaObject) :
	"""The wrapped Raba object that really holds the data"""
	
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE

	id = rf.Primitive()
	name = rf.Primitive()

	genome = rf.RabaObject('Genome_Raba')
	chromosome = rf.RabaObject('Chromosome_Raba')
	gene = rf.RabaObject('Gene_Raba')
	transcript = rf.RabaObject('Transcript_Raba')

	def _curate(self) :
		if self.name != None :
			self.name = self.name.upper()

class Protein(pyGenoRabaObjectWrapper) :
	"""The wrapper for playing with Proteins"""
	
	_wrapped_class = Protein_Raba

	def __init__(self, *args, **kwargs) :
		pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
		self._load_sequencesTriggers = set(["sequence"])

	def _makeLoadQuery(self, objectType, *args, **coolArgs) :
		if issubclass(objectType, SNP_INDEL) :
			f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
			coolArgs['species'] = self.genome.species
			coolArgs['chromosomeNumber'] = self.chromosome.number
			coolArgs['start >='] = self.transcript.start
			coolArgs['start <'] = self.transcript.end
		
			if len(args) > 0 and type(args[0]) is types.ListType :
				for a in args[0] :
					if type(a) is types.DictType :
						f.addFilter(**a)
			else :
				f.addFilter(*args, **coolArgs)

			return f
		
		return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
	
	def _load_sequences(self) :
		if self.chromosome.number != 'MT':
			self.sequence = uf.translateDNA(self.transcript.cDNA).rstrip('*')
		else:
			self.sequence = uf.translateDNA(self.transcript.cDNA, translTable_id='mt').rstrip('*')

	
	def getSequence(self):
		return self.sequence

	def _load_bin_sequence(self) :
		self.bin_sequence = AABinarySequence(self.sequence)

	def getDefaultSequence(self) :
		"""Returns a version str sequence where only the last allele of each polymorphisms is shown"""
		return self.bin_sequence.defaultSequence

	def getPolymorphisms(self) :
		"""Returns a list of all polymorphisms contained in the protein"""
		return self.bin_sequence.getPolymorphisms()

	def find(self, sequence):
		"""Returns the position of the first occurence of sequence taking polymorphisms into account"""
		return self.bin_sequence.find(sequence)

	def findAll(self, sequence):
		"""Returns all the position of the occurences of sequence taking polymorphisms into accoun"""
		return self.bin_sequence.findAll(sequence)

	def findString(self, sequence) :
		"""Returns the first occurence of sequence using simple string search in sequence that doesn't care about polymorphisms"""
		return self.sequence.find(sequence)

	def findStringAll(self, sequence):
		"""Returns all first occurences of sequence using simple string search in sequence that doesn't care about polymorphisms"""
		return uf.findAll(self.sequence, sequence)

	def __getitem__(self, i) :
		return self.bin_sequence.getChar(i)
		
	def __len__(self) :
		return len(self.bin_sequence)

	def __str__(self) :
		return "Protein, id: %s > %s" %(self.id, str(self.transcript))


================================================
FILE: pyGeno/SNP.py
================================================
import types

import configuration as conf

from pyGenoObjectBases import *
import rabaDB.fields as rf

# from tools import UsefulFunctions as uf

"""General guidelines for SNP classes:
----
-All classes must inherit from SNP_INDEL
-All classes name must end with SNP
-A SNP_INDELs must have at least chromosomeNumber, setName, species, start and ref filled in order to be inserted into sequences
-User can define an alias for the alt field (snp_indel alleles) to indicate the default field from wich to extract alleles
"""

def getSNPSetsList() :
	"""Return the names of all imported snp sets"""
	import rabaDB.filters as rfilt
	f = rfilt.RabaQuery(SNPMaster)
	names = []
	for g in f.iterRun() :
		names.append(g.setName)
	return names

class SNPMaster(Raba) :
	'This object keeps track of SNP sets and their types'
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE
	species = rf.Primitive()
	SNPType = rf.Primitive()
	setName = rf.Primitive()
	_raba_uniques = [('setName',)]

	def _curate(self) :
		self.species = self.species.lower()
		self.setName = self.setName.lower()

class SNP_INDEL(Raba) :
	"All SNPs should inherit from me. The name of the class must end with SNP"
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE
	_raba_abstract = True # not saved in db

	species = rf.Primitive()
	setName = rf.Primitive()
	chromosomeNumber = rf.Primitive()

	start = rf.Primitive()
	end = rf.Primitive()
	
	ref = rf.Primitive()
	
	#every SNP_INDEL must have a field alt. This variable allows you to set an alias for it. Chamging the alias does not impact the schema
	altAlias = 'alt'
	
	def __init__(self, *args, **kwargs) :
		pass

	def __getattribute__(self, k) :
		if k == 'alt' :
			cls = Raba.__getattribute__(self, '__class__')
			return Raba.__getattribute__(self, cls.altAlias)
		
		return Raba.__getattribute__(self, k)

	def __setattr__(self, k, v) :
		if k == 'alt' :
			cls = Raba.__getattribute__(self, '__class__')
			Raba.__setattr__(self, cls.altAlias, v)
			
		Raba.__setattr__(self, k, v)
	
	def _curate(self) :
		self.species = self.species.lower()

	@classmethod
	def ensureGlobalIndex(cls, fields) :
		cls.ensureIndex(fields)

	def __repr__(self) :
		return "%s> chr: %s, start: %s, end: %s, alt: %s, ref: %s" %(self.__class__.__name__, self.chromosomeNumber, self.start, self.end, self.alleles, self.ref)
	
class CasavaSNP(SNP_INDEL) :
	"A SNP of Casava"
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE
	
	alleles = rf.Primitive()
	bcalls_used = rf.Primitive()
	bcalls_filt = rf.Primitive()
	QSNP = rf.Primitive()
	Qmax_gt = rf.Primitive()
	max_gt_poly_site = rf.Primitive()
	Qmax_gt_poly_site = rf.Primitive()
	A_used = rf.Primitive()
	C_used = rf.Primitive()
	G_used = rf.Primitive()
	T_used = rf.Primitive()
	
	altAlias = 'alleles'

class AgnosticSNP(SNP_INDEL) :
	"""This is a generic SNPs/Indels format that you can easily make from the result of any SNP caller. AgnosticSNP files are tab delimited files such as:

	chromosomeNumber	uniqueId  start	      end	   ref    alleles	quality	 caller
	Y					   1 	 2655643	2655644		T		AG	     30		 TopHat
	Y					   2 	 2655645	2655647		-		AG	     28		 TopHat
	Y					   3 	 2655648	2655650		TT		-	     10		 TopHat

	All positions must be 0 based
	The '-' indicates a deletion or an insertion. Collumn order has no importance.
	"""

	_raba_namespace = conf.pyGeno_RABA_NAMESPACE
	
	alleles = rf.Primitive()
	quality = rf.Primitive()
	caller = rf.Primitive()
	uniqueId = rf.Primitive() # polymorphism id
	
	altAlias = 'alleles'

	def __repr__(self) :
		return "AgnosticSNP> start: %s, end: %s, quality: %s, caller %s, alt: %s, ref: %s" %(self.start, self.end, self.quality, self.caller, self.alleles, self.ref)

class dbSNPSNP(SNP_INDEL) :
	"This class is for SNPs from dbSNP. Feel free to uncomment the fields that you need"
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE
	
	# To add/remove a field comment/uncomentd it. Beware, adding or removing a field results in a significant overhead the first time you relaunch pyGeno. You may have to delete and reimport some snps sets.
	
	#####RSPOS = rf.Primitive() #Chr position reported in already saved into field start
	RS = rf.Primitive() #dbSNP ID (i.e. rs number)
	ALT =  rf.Primitive()
	RV = rf.Primitive() #RS orientation is reversed
	#VP = rf.Primitive() #Variation Property.  Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
	#GENEINFO = rf.Primitive() #Pairs each of gene symbol:gene id.  The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)
	dbSNPBuildID = rf.Primitive() #First dbSNP Build for RS
	#SAO = rf.Primitive() #Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both
	#SSR = rf.Primitive() #Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other
	#WGT = rf.Primitive() #Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 or more
	VC = rf.Primitive() #Variation Class
	PM = rf.Primitive() #Variant is Precious(Clinical,Pubmed Cited)
	#TPA = rf.Primitive() #Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will give phenotype data)
	#PMC = rf.Primitive() #Links exist to PubMed Central article
	#S3D = rf.Primitive() #Has 3D structure - SNP3D table
	#SLO = rf.Primitive() #Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out
	#NSF = rf.Primitive() #Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44
	#NSM = rf.Primitive() #Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42
	#NSN = rf.Primitive() #Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41
	#REF = rf.Primitive() #Has reference A coding region variation where one allele in the set is identical to the reference sequence. FxnCode = 8
	#SYN = rf.Primitive() #Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3
	#U3 = rf.Primitive() #In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53
	#U5 = rf.Primitive() #In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55
	#ASS = rf.Primitive() #In acceptor splice site FxnCode = 73
	#DSS = rf.Primitive() #In donor splice-site FxnCode = 75
	#INT = rf.Primitive() #In Intron FxnCode = 6
	#R3 = rf.Primitive() #In 3' gene region FxnCode = 13
	#R5 = rf.Primitive() #In 5' gene region FxnCode = 15
	#OTH = rf.Primitive() #Has other variant with exactly the same set of mapped positions on NCBI refernce assembly.
	#CFL = rf.Primitive() #Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies.
	#ASP = rf.Primitive() #Is Assembly specific. This is set if the variant only maps to one assembly
	MUT = rf.Primitive() #Is mutation (journal citation, explicit fact): a low frequency variation that is cited in journal and other reputable sources
	VLD = rf.Primitive() #Is Validated.  This bit is set if the variant has 2+ minor allele count based on frequency or genotype data.
	G5A = rf.Primitive() #>5% minor allele frequency in each and all populations
	G5 = rf.Primitive() #>5% minor allele frequency in 1+ populations
	#HD = rf.Primitive() #Marker is on high density genotyping kit (50K density or greater).  The variant may have phenotype associations present in dbGaP.
	#GNO = rf.Primitive() #Genotypes available. The variant has individual genotype (in SubInd table).
	KGValidated = rf.Primitive() #1000 Genome validated
	#KGPhase1 = rf.Primitive() #1000 Genome phase 1 (incl. June Interim phase 1)
	#KGPilot123 = rf.Primitive() #1000 Genome discovery all pilots 2010(1,2,3)
	#KGPROD = rf.Primitive() #Has 1000 Genome submission
	OTHERKG = rf.Primitive() #non-1000 Genome submission
	#PH3 = rf.Primitive() #HAP_MAP Phase 3 genotyped: filtered, non-redundant
	#CDA = rf.Primitive() #Variation is interrogated in a clinical diagnostic assay
	#LSD = rf.Primitive() #Submitted from a locus-specific database
	#MTP = rf.Primitive() #Microattribution/third-party annotation(TPA:GWAS,PAGE)
	#OM = rf.Primitive() #Has OMIM/OMIA
	#NOC = rf.Primitive() #Contig allele not present in variant allele list. The reference sequence allele at the mapped position is not present in the variant allele list, adjusted for orientation.
	#WTD = rf.Primitive() #Is Withdrawn by submitter If one member ss is withdrawn by submitter, then this bit is set.  If all member ss' are withdrawn, then the rs is deleted to SNPHistory
	#NOV = rf.Primitive() #Rs cluster has non-overlapping allele sets. True when rs set has more than 2 alleles from different submissions and these sets share no alleles in common.
	#CAF = rf.Primitive() #An ordered, comma delimited list of allele frequencies based on 1000Genomes, starting with the reference allele followed by alternate alleles as ordered in the ALT column. Where a 1000Genomes alternate allele is not in the dbSNPs alternate allele set, the allele is added to the ALT column.  The minor allele is the second largest value in the list, and was previuosly reported in VCF as the GMAF.  This is the GMAF reported on the RefSNP and EntrezSNP pages and VariationReporter
	COMMON = rf.Primitive() #RS is a common SNP.  A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency.
	#CLNHGVS = rf.Primitive() #Variant names from HGVS.    The order of these variants corresponds to the order of the info in the other clinical  INFO tags.
	#CLNALLE = rf.Primitive() #Variant alleles from REF or ALT columns.  0 is REF, 1 is the first ALT allele, etc.  This is used to match alleles with other corresponding clinical (CLN) INFO tags.  A value of -1 indicates that no allele was found to match a corresponding HGVS allele name.
	#CLNSRC = rf.Primitive() #Variant Clinical Chanels
	#CLNORIGIN = rf.Primitive() #Allele Origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other
	#CLNSRCID = rf.Primitive() #Variant Clinical Channel IDs
	#CLNSIG = rf.Primitive() #Variant Clinical Significance, 0 - unknown, 1 - untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other
	#CLNDSDB = rf.Primitive() #Variant disease database name
	#CLNDSDBID = rf.Primitive() #Variant disease database ID
	#CLNDBN = rf.Primitive() #Variant disease name
	#CLNACC = rf.Primitive() #Variant Accession and Versions
	
	#FILTER_NC = rf.Primitive() #Inconsistent Genotype Submission For At Least One Sample
	
	altAlias = 'ALT'

class TopHatSNP(SNP_INDEL) :
	"A SNP from Top Hat, not implemented"
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE
	pass


================================================
FILE: pyGeno/SNPFiltering.py
================================================
import types

import configuration as conf

from tools import UsefulFunctions as uf

class Sequence_modifiers(object) :
	"""Abtract Class. All sequence must inherit from me"""
	def __init__(self, sources = {}) :
		self.sources = sources

	def addSource(self, name, snp) :
		"Optional, you can keep a dict that records the polymorphims that were mixed together to make self. They are stored into self.sources"
		self.sources[name] = snp

class SequenceSNP(Sequence_modifiers) :
	"""Represents a SNP to be applied to the sequence"""
	def __init__(self, alleles, sources = {}) :
		Sequence_modifiers.__init__(self, sources)
		if type(alleles) is types.ListType :
			self.alleles = uf.encodePolymorphicNucleotide(''.join(alleles))
		else :
			self.alleles = uf.encodePolymorphicNucleotide(alleles)
	
class SequenceInsert(Sequence_modifiers) :
	"""Represents an Insertion to be applied to the sequence"""
	
	def __init__(self, bases, sources = {}, ref = '-') :
		Sequence_modifiers.__init__(self, sources)
		self.bases = bases
		self.offset = 0

		# Allow to use format like C/CCTGGAA(dbSNP) or CCT/CCTGGAA(samtools)
		if ref != '-':
			if ref == bases[:len(ref)]:
				self.offset = len(ref) 
				self.bases = self.bases[self.offset:]
				#-1 because if the insertion are after the last nuc we go out of table
				self.offset -= 1
			else:
				raise NotImplemented("This format of Insetion is not accepted. Please change your format, or implement your format in pyGeno.")


class SequenceDel(Sequence_modifiers) :
	"""Represents a Deletion to be applied to the sequence"""
	def __init__(self, length, sources = {}, ref = None, alt = '-') :
		Sequence_modifiers.__init__(self, sources)
		self.length = length
		self.offset = 0

		# Allow to use format like CCTGGAA/C(dbSNP) or CCTGGAA/CCT(samtools)
		if alt != '-':
			if ref is not None:
				if alt == ref[:len(alt)]:
					self.offset = len(alt)
					self.length = self.length - len(alt)
				else:
					raise NotImplemented("This format of Deletion is not accepted. Please change your format, or implement your format in pyGeno.")
			else:
				raise Exception("You need to add a ref sequence in your call of SequenceDel. Or implement your format in pyGeno.")


class SNPFilter(object) :
	"""Abtract Class. All filters must inherit from me"""
	
	def __init__(self) :
		pass

	def filter(self, chromosome, **kwargs) :
		raise NotImplemented("Must be implemented in child")

class DefaultSNPFilter(SNPFilter) :
	"""
	Default filtering object, does not filter anything. Doesn't apply insertions or deletions.
	This is also a template that you can use for own filters. A prototype for a custom filter might be::
	
		class MyFilter(SNPFilter) :
			def __init__(self, thres) :
				self.thres = thres
			
			def filter(chromosome, SNP_Set1 = None, SNP_Set2 = None ) :
				if SNP_Set1.alt is not None and (SNP_Set1.alt == SNP_Set2.alt) and SNP_Set1.Qmax_gt > self.thres :
					return SequenceSNP(SNP_Set1.alt)
				return None
			
	Where SNP_Set1 and SNP_Set2 are the actual names of the snp sets supplied to the genome. In the context of the function
	they represent single polymorphisms, or lists of polymorphisms, derived from thoses sets that occur at the same position.

	Whatever goes on into the function is absolutely up to you, but in the end, it must return an object of one of the following classes:

		* SequenceSNP

		* SequenceInsert

		* SequenceDel

		* None

		"""

	def __init__(self) :
		SNPFilter.__init__(self)

	def filter(self, chromosome, **kwargs) :
		"""The default filter mixes applied all SNPs and ignores Insertions and Deletions."""
		def appendAllele(alleles, sources, snp) :
			pos = snp.start
			if snp.alt[0] == '-' :
				pass
				# print warn % ('DELETION', snpSet, snp.start, snp.chromosomeNumber)
			elif snp.ref[0] == '-' :
				pass
				# print warn % ('INSERTION', snpSet, snp.start, snp.chromosomeNumber)
			else :
				sources[snpSet] = snp
				alleles.append(snp.alt) #if not an indel append the polymorphism
			
			refAllele = chromosome.refSequence[pos]
			alleles.append(refAllele)
			sources['ref'] = refAllele
			return alleles, sources
				
		warn = 'Warning: the default snp filter ignores indels. IGNORED %s of SNP set: %s at pos: %s of chromosome: %s'
		sources = {}
		alleles = []
		for snpSet, data in kwargs.iteritems() :
			if type(data) is list :
				for snp in data :
					alleles, sources = appendAllele(alleles, sources, snp)
			else :
				allels, sources = appendAllele(alleles, sources, data)

		#appends the refence allele to the lot

		#optional we keep a record of the polymorphisms that were used during the process
		return SequenceSNP(alleles, sources = sources)


================================================
FILE: pyGeno/Transcript.py
================================================
import configuration as conf

from pyGenoObjectBases import *

import rabaDB.fields as rf

from tools import UsefulFunctions as uf
from Exon import *
from SNP import SNP_INDEL

from tools.BinarySequence import NucBinarySequence


class Transcript_Raba(pyGenoRabaObject) :
	"""The wrapped Raba object that really holds the data"""
	
	_raba_namespace = conf.pyGeno_RABA_NAMESPACE

	id = rf.Primitive()
	name = rf.Primitive()
	length = rf.Primitive()
	start = rf.Primitive()
	end = rf.Primitive()
	coding = rf.Primitive()
	biotype = rf.Primitive()
	selenocysteine = rf.RList()
	
	genome = rf.RabaObject('Genome_Raba')
	chromosome = rf.RabaObject('Chromosome_Raba')
	gene = rf.RabaObject('Gene_Raba')
	protein = rf.RabaObject('Protein_Raba')
	exons = rf.Relation('Exon_Raba')
	
	def _curate(self) :
		if self.name != None :
			self.name = self.name.upper()
		
		self.length = abs(self.end - self.start)
		have_CDS_start = False
		have_CDS_end = False
		for exon in self.exons :
			if exon.CDS_start is not None :
				have_CDS_start = True
			if exon.CDS_end is not None :
				have_CDS_end = True

		if have_CDS_start and have_CDS_end :
			self.coding = True
		else :
			self.coding = False

class Transcript(pyGenoRabaObjectWrapper) :
	"""The wrapper for playing with Transcripts"""
	
	_wrapped_class = Transcript_Raba

	def __init__(self, *args, **kwargs) :
		pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
		self.exons = RLWrapper(self, Exon, self.wrapped_object.exons)
		self._load_sequencesTriggers = set(["UTR5", "UTR3", "cDNA", "sequence", "data"])
		self.exonsDict = {}
	
	def _makeLoadQuery(self, objectType, *args, **coolArgs) :
		if issubclass(objectType, SNP_INDEL) :
			# conf.db.enableDebug(True)
			f = RabaQuery(objectType, namespace = self._wrapped_class._raba_namespace)
			coolArgs['species'] = self.genome.species
			coolArgs['chromosomeNumber'] = self.chromosome.number
			coolArgs['start >='] = self.start
			coolArgs['start <'] = self.end
		
			if len(args) > 0 and type(args[0]) is types.ListType :
				for a in args[0] :
					if type(a) is types.DictType :
						f.addFilter(**a)
			else :
				f.addFilter(*args, **coolArgs)

			return f
		
		return pyGenoRabaObjectWrapper._makeLoadQuery(self, objectType, *args, **coolArgs)
		
	def _load_data(self) :
		def getV(k) :
			return pyGenoRabaObjectWrapper.__getattribute__(self, k)

		def setV(k, v) :
			return pyGenoRabaObjectWrapper.__setattr__(self, k, v)

		self.data = []
		cDNA = []
		UTR5 = []
		UTR3 = []
		exons = []
		prime5 = True
		for ee in self.wrapped_object.exons :
			e = pyGenoRabaObjectWrapper_metaclass._wrappers[Exon_Raba](wrapped_object_and_bag = (ee, getV('bagKey')))
			self.exonsDict[(e.start, e.end)] = e
			exons.append(e)
			self.data.extend(e.data)
			
			if e.hasCDS() :
				UTR5.append(''.join(e.UTR5))

				if self.selenocysteine is not None:
					for position in self.selenocysteine:
						if e.CDS_start <= position <= e.CDS_end:

							if e.strand == '+':
								ajusted_position = position - e.CDS_start
							else:
								ajusted_position = e.CDS_end - position - 3

							if e.CDS[ajusted_position] == 'T':
								e.CDS = list(e.CDS)
								e.CDS[ajusted_position] = '!'			
				
				if len(cDNA) == 0 and e.frame != 0 :
					e.CDS = e.CDS[e.frame:]
					
					if e.strand == '+':
						e.CDS_start += e.frame
					else:
						e.CDS_end -= e.frame
				
				if len(e.CDS):
					cDNA.append(''.join(e.CDS))
				UTR3.append(''.join(e.UTR3))
				prime5 = False
			else :
				if prime5 :
					UTR5.append(''.join(e.data))
				else :
					UTR3.append(''.join(e.data))
		
		sequence = ''.join(self.data)
		cDNA = ''.join(cDNA)
		UTR5 = ''.join(UTR5)
		UTR3 = ''.join(UTR3)
		setV('exons', exons)
		setV('sequence', sequence)
		setV('cDNA', cDNA)
		setV('UTR5', UTR5)
		setV('UTR3', UTR3)
		
		if len(cDNA) > 0 and len(cDNA) % 3 != 0 :
			setV('flags', {'DUBIOUS' : True, 'cDNA_LEN_NOT_MULT_3': True})
		else :
			setV('flags', {'DUBIOUS' : False, 'cDNA_LEN_NOT_MULT_3': False})

	def _load_bin_sequence(self) :
		self.bin_sequence = NucBinarySequence(self.sequence)
		self.bin_UTR5 =  NucBinarySequence(self.UTR5)
		self.bin_cDNA =  NucBinarySequence(self.cDNA)
		self.bin_UTR3 =  NucBinarySequence(self.UTR3)

	def getNucleotideCodon(self, cdnaX1) :
		"""Returns the entire codon of the nucleotide at pos cdnaX1 in the cdna, and the position of that nocleotide in the codon"""
		return uf.getNucleotideCodon(self.cDNA, cdnaX1)

	def getCodon(self, i) :
		"""returns the ith codon"""
		return self.getNucleotideCodon(i*3)[0]

	def iterCodons(self) :
		"""iterates through the codons"""
		for i in range(len(self.cDNA)/3) :
			yield self.getCodon(i)

	def find(self, sequence) :
		"""return the position of the first occurance of sequence"""
		return self.bin_sequence.find(sequence)

	def findAll(self, sequence):
		"""Returns a list of all positions where sequence was found"""
		return self.bin_sequence.findAll(sequence)

	def findIncDNA(self, sequence) :
		"""return the position of the first occurance of sequence"""
		return self.bin_cDNA.find(sequence)

	def findAllIncDNA(self, sequence) :
		"""Returns a list of all positions where sequence was found in the cDNA"""
		return self.bin_cDNA.findAll(sequence)

	def getcDNALength(self) :
		"""returns the length of the cDNA"""
		return len(self.cDNA)

	def findInUTR5(self, sequence) :
		"""return the position of the first occurance of sequence in the 5'UTR"""
		return self.bin_UTR5.find(sequence)

	def findAllInUTR5(self, sequence) :
		"""Returns a list of all positions where sequence was found in the 5'UTR"""
		return self.bin_UTR5.findAll(sequence)

	def getUTR5Length(self) :
		"""returns the length of the 5'UTR"""
		return len(self.bin_UTR5)

	def findInUTR3(self, sequence) :
		"""return the position of the first occurance of sequence in the 3'UTR"""
		return self.bin_UTR3.find(sequence)

	def findAllInUTR3(self, sequence) :
		"""Returns a lits of all positions where sequence was found in the 3'UTR"""
		return self.bin_UTR3.findAll(sequence)

	def getUTR3Length(self) :
		"""returns the length of the 3'UTR"""
		return len(self.bin_UTR3)

	def getNbCodons(self) :
		"""returns the number of codons in the transcript"""
		return len(self.cDNA)/3
	
	def __getattribute__(self, name) :
		return pyGenoRabaObjectWrapper.__getattribute__(self, name)

	def __getitem__(self, i) :
		return self.sequence[i]

	def __len__(self) :
		return len(self.sequence)

	def __str__(self) :
		return """Transcript, id: %s, name: %s > %s""" %(self.id, self.name, str(self.gene))


================================================
FILE: pyGeno/__init__.py
================================================
__all__ = ['Genome', 'Chromosome', 'Gene', 'Transcript', 'Exon', 'Protein', 'SNP']

from configuration import pyGeno_init
pyGeno_init()


================================================
FILE: pyGeno/bootstrap.py
================================================
import pyGeno.importation.Genomes as PG
import pyGeno.importation.SNPs as PS
from pyGeno.tools.io import printf
import os, tempfile, urllib, urllib2, json
import pyGeno.configuration as conf

this_dir, this_filename = os.path.split(__file__)


def listRemoteDatawraps(location = conf.pyGeno_REMOTE_LOCATION) :
	"""Lists all the datawraps availabe from a remote a remote location."""
	loc = location + "/datawraps.json"
	response = urllib2.urlopen(loc)
	js = json.loads(response.read())
	
	return js

def printRemoteDatawraps(location = conf.pyGeno_REMOTE_LOCATION) :
	"""
		print all available datawraps from a remote location the location must have a datawraps.json in the following format::

			{
			"Ordered": {
				"Reference genomes": {
					"Human" :	["GRCh37.75", "GRCh38.78"],
					"Mouse" : ["GRCm38.78"],
				},
				"SNPs":{
					}
			},
			"Flat":{
				"Reference genomes": {
					"GRCh37.75": "Human.GRCh37.75.tar.gz",
					"GRCh38.78": "Human.GRCh37.75.tar.gz",
					"GRCm38.78": "Mouse.GRCm38.78.tar.gz"
				},
				"SNPs":{
				}
			}
		}
		
	"""
	
	l = listRemoteDatawraps(location)
	printf("Available datawraps for bootstraping\n")
	print json.dumps(l["Ordered"], sort_keys=True, indent=4, separators=(',', ': '))

def _DW(name, url) :
	packageDir = tempfile.mkdtemp(prefix = "pyGeno_remote_")
	
	printf("~~~:>\n\tDownloading datawrap: %s..." % name)
	finalFile = os.path.normpath('%s/%s' %(packageDir, name))
	urllib.urlretrieve (url, finalFile)
	printf('\tdone.\n~~~:>')
	return finalFile

def importRemoteGenome(name, batchSize = 100) :
	"""Import a genome available from http://pygeno.iric.ca (might work)."""
	try :
		dw = listRemoteDatawraps()["Flat"]["Reference genomes"][name]
	except AttributeError :
		raise AttributeError("There's no remote genome datawrap by the name of: '%s'" % name)

	finalFile = _DW(name, dw["url"])
	PG.importGenome(finalFile, batchSize)

def importRemoteSNPs(name) :
	"""Import a SNP set available from http://pygeno.iric.ca (might work)."""
	try :
		dw = listRemoteDatawraps()["Flat"]["SNPs"]
	except AttributeError :
		raise AttributeError("There's no remote genome datawrap by the name of: '%s'" % name)

	finalFile = _DW(name, dw["url"])
	PS.importSNPs(finalFile)

def listDatawraps() :
	"""Lists all the datawraps pyGeno comes with"""
	l = {"Genomes" : [], "SNPs" : []}
	for f in os.listdir(os.path.join(this_dir, "bootstrap_data/genomes")) :
		if f.find(".tar.gz") > -1 :
			l["Genomes"].append(f)
	
	for f in os.listdir(os.path.join(this_dir, "bootstrap_data/SNPs")) :
		if f.find(".tar.gz") > -1 :
			l["SNPs"].append(f)

	return l

def printDatawraps() :
	"""print all available datawraps for bootstraping"""
	l = listDatawraps()
	printf("Available datawraps for boostraping\n")
	for k, v in l.iteritems() :
		printf(k)
		printf("~"*len(k) + "|")
		for vv in v :
			printf(" "*len(k) + "|" + "~~~:> " + vv)
		printf('\n')

def importGenome(name, batchSize = 100) :
	"""Import a genome shipped with pyGeno. Most of the datawraps only contain URLs towards data provided by third parties."""
	path = os.path.join(this_dir, "bootstrap_data", "genomes/" + name)
	PG.importGenome(path, batchSize)

def importSNPs(name) :
	"""Import a SNP set shipped with pyGeno. Most of the datawraps only contain URLs towards data provided by third parties."""
	path = os.path.join(this_dir, "bootstrap_data", "SNPs/" + name)
	PS.importSNPs(path)

================================================
FILE: pyGeno/bootstrap_data/SNPs/Human_agnostic.dummySRY_indels/manifest.ini
================================================
[package_infos]
description = For testing purposes. All polymorphisms at the same position
maintainer = Tariq Daouda
maintainer_contact = tariq.daouda@umontreal.ca
version = 1

[set_infos]
species = human
name = dummySRY_AGN_indels
type = Agnostic
source = my place at IRIC

[snps]
filename = snps.txt


================================================
FILE: pyGeno/bootstrap_data/SNPs/Human_agnostic.dummySRY_indels/snps.txt
================================================
chromosomeNumber	uniqueId	start	end	ref	alleles	quality	caller
Y	1	2655643	2655646	T	AG	30	test
Y	2	2655643	2655647	-	AG	30	test
Y	3	2655643	2655650	TT	-	30	test


================================================
FILE: pyGeno/bootstrap_data/SNPs/__init__.py
================================================


================================================
FILE: pyGeno/bootstrap_data/__init__.py
================================================


================================================
FILE: pyGeno/bootstrap_data/genomes/__init__.py
================================================


================================================
FILE: pyGeno/configuration.py
================================================
import sys, os, time
from ConfigParser import SafeConfigParser
import rabaDB.rabaSetup
import rabaDB.Raba

class PythonVersionError(Exception) :
	pass

pyGeno_FACE = "~-~-:>"
pyGeno_BRANCH = "V2"

pyGeno_VERSION_NAME = 'Lean Viper!'
pyGeno_VERSION_RELEASE_LEVEL = 'Release'
pyGeno_VERSION_NUMBER = 14.09
pyGeno_VERSION_BUILD_TIME = time.ctime(os.path.getmtime(__file__))

pyGeno_RABA_NAMESPACE = 'pyGenoRaba'

pyGeno_SETTINGS_DIR = os.path.normpath(os.path.expanduser('~/.pyGeno/'))
pyGeno_SETTINGS_PATH = None
pyGeno_RABA_DBFILE = None
pyGeno_DATA_PATH = None
pyGeno_REMOTE_LOCATION = 'http://bioinfo.iric.ca/~daoudat/pyGeno_datawraps'

db = None #proxy for the raba database
dbConf = None #proxy for the raba database configuration

def version() :
	"""returns a tuple describing pyGeno's current version"""
	return (pyGeno_FACE, pyGeno_BRANCH, pyGeno_VERSION_NAME, pyGeno_VERSION_RELEASE_LEVEL, pyGeno_VERSION_NUMBER, pyGeno_VERSION_BUILD_TIME )

def prettyVersion() :
	"""returns pyGeno's current version in a pretty human readable way"""
	return "pyGeno %s Branch: %s, Name: %s, Release Level: %s, Version: %s, Build time: %s" % version()

def checkPythonVersion() :
	"""pyGeno needs python 2.7+"""
	
	if sys.version_info[0] < 2 or (sys.version_info[0] > 2  and sys.version_info[1] < 7) :
		return False
	return True

def getGenomeSequencePath(specie, name) :
	return os.path.normpath(pyGeno_DATA_PATH+'/%s/%s' % (specie.lower(), name))

def createDefaultConfigFile() :
	"""Creates a default configuration file"""
	s = "[pyGeno_config]\nsettings_dir=%s\nremote_location=%s" % (pyGeno_SETTINGS_DIR, pyGeno_REMOTE_LOCATION)
	f = open('%s/config.ini' % pyGeno_SETTINGS_DIR, 'w')
	f.write(s)
	f.close()

def getSettingsPath() :
	"""Returns the path where the settings are stored"""
	parser = SafeConfigParser()
	try :
		parser.read(os.path.normpath(pyGeno_SETTINGS_DIR+'/config.ini'))
		return parser.get('pyGeno_config', 'settings_dir')
	except :
		createDefaultConfigFile()
		return getSettingsPath()

def removeFromDBRegistery(obj) :
	"""rabaDB keeps a record of loaded objects to ensure consistency between different queries.
	This function removes an object from the registery"""
	rabaDB.Raba.removeFromRegistery(obj)

def freeDBRegistery() :
	"""rabaDB keeps a record of loaded objects to ensure consistency between different queries. This function empties the registery"""
	rabaDB.Raba.freeRegistery()

def reload() :
	"""reinitialize pyGeno"""
	pyGeno_init()

def pyGeno_init() :
	"""This function is automatically called at launch"""
	
	global db, dbConf
	
	global pyGeno_SETTINGS_PATH
	global pyGeno_RABA_DBFILE
	global pyGeno_DATA_PATH
	
	if not checkPythonVersion() :
		raise PythonVersionError("==> FATAL: pyGeno only works with python 2.7 and above, please upgrade your python version")

	if not os.path.exists(pyGeno_SETTINGS_DIR) :
		os.makedirs(pyGeno_SETTINGS_DIR)
	
	pyGeno_SETTINGS_PATH = getSettingsPath()
	pyGeno_RABA_DBFILE = os.path.normpath( os.path.join(pyGeno_SETTINGS_PATH, "pyGenoRaba.db") )
	pyGeno_DATA_PATH = os.path.normpath( os.path.join(pyGeno_SETTINGS_PATH, "data") )
	
	if not os.path.exists(pyGeno_SETTINGS_PATH) :
		os.makedirs(pyGeno_SETTINGS_PATH)

	if not os.path.exists(pyGeno_DATA_PATH) :
		os.makedirs(pyGeno_DATA_PATH)

	#launching the db
	rabaDB.rabaSetup.RabaConfiguration(pyGeno_RABA_NAMESPACE, pyGeno_RABA_DBFILE)
	db = rabaDB.rabaSetup.RabaConnection(pyGeno_RABA_NAMESPACE)
	dbConf = rabaDB.rabaSetup.RabaConfiguration(pyGeno_RABA_NAMESPACE)

================================================
FILE: pyGeno/doc/Makefile
================================================
# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
PAPER         =
BUILDDIR      = build

# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif

# Internal variables.
PAPEROPT_a4     = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source

.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest coverage gettext

help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo "  html       to make standalone HTML files"
	@echo "  dirhtml    to make HTML files named index.html in directories"
	@echo "  singlehtml to make a single large HTML file"
	@echo "  pickle     to make pickle files"
	@echo "  json       to make JSON files"
	@echo "  htmlhelp   to make HTML files and a HTML help project"
	@echo "  qthelp     to make HTML files and a qthelp project"
	@echo "  devhelp    to make HTML files and a Devhelp project"
	@echo "  epub       to make an epub"
	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
	@echo "  latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
	@echo "  text       to make text files"
	@echo "  man        to make manual pages"
	@echo "  texinfo    to make Texinfo files"
	@echo "  info       to make Texinfo files and run them through makeinfo"
	@echo "  gettext    to make PO message catalogs"
	@echo "  changes    to make an overview of all changed/added/deprecated items"
	@echo "  xml        to make Docutils-native XML files"
	@echo "  pseudoxml  to make pseudoxml-XML files for display purposes"
	@echo "  linkcheck  to check all external links for integrity"
	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"
	@echo "  coverage   to run coverage check of the documentation (if enabled)"

clean:
	rm -rf $(BUILDDIR)/*

html:
	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."

singlehtml:
	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
	@echo
	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."

pickle:
	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
	@echo
	@echo "Build finished; now you can process the pickle files."

json:
	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
	@echo
	@echo "Build finished; now you can process the JSON files."

htmlhelp:
	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
	@echo
	@echo "Build finished; now you can run HTML Help Workshop with the" \
	      ".hhp project file in $(BUILDDIR)/htmlhelp."

qthelp:
	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
	@echo
	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/pyGeno.qhcp"
	@echo "To view the help file:"
	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/pyGeno.qhc"

devhelp:
	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
	@echo
	@echo "Build finished."
	@echo "To view the help file:"
	@echo "# mkdir -p $$HOME/.local/share/devhelp/pyGeno"
	@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/pyGeno"
	@echo "# devhelp"

epub:
	$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
	@echo
	@echo "Build finished. The epub file is in $(BUILDDIR)/epub."

latex:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo
	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
	@echo "Run \`make' in that directory to run these through (pdf)latex" \
	      "(use \`make latexpdf' here to do that automatically)."

latexpdf:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through pdflatex..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

latexpdfja:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through platex and dvipdfmx..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

text:
	$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
	@echo
	@echo "Build finished. The text files are in $(BUILDDIR)/text."

man:
	$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
	@echo
	@echo "Build finished. The manual pages are in $(BUILDDIR)/man."

texinfo:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo
	@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
	@echo "Run \`make' in that directory to run these through makeinfo" \
	      "(use \`make info' here to do that automatically)."

info:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo "Running Texinfo files through makeinfo..."
	make -C $(BUILDDIR)/texinfo info
	@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."

gettext:
	$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
	@echo
	@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."

changes:
	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
	@echo
	@echo "The overview file is in $(BUILDDIR)/changes."

linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	@echo
	@echo "Link check complete; look for any errors in the above output " \
	      "or in $(BUILDDIR)/linkcheck/output.txt."

doctest:
	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
	@echo "Testing of doctests in the sources finished, look at the " \
	      "results in $(BUILDDIR)/doctest/output.txt."

coverage:
	$(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage
	@echo "Testing of coverage in the sources finished, look at the " \
	      "results in $(BUILDDIR)/coverage/python.txt."

xml:
	$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
	@echo
	@echo "Build finished. The XML files are in $(BUILDDIR)/xml."

pseudoxml:
	$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
	@echo
	@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."


================================================
FILE: pyGeno/doc/make.bat
================================================
@ECHO OFF

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set BUILDDIR=build
set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source
set I18NSPHINXOPTS=%SPHINXOPTS% source
if NOT "%PAPER%" == "" (
	set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
	set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%
)

if "%1" == "" goto help

if "%1" == "help" (
	:help
	echo.Please use `make ^<target^>` where ^<target^> is one of
	echo.  html       to make standalone HTML files
	echo.  dirhtml    to make HTML files named index.html in directories
	echo.  singlehtml to make a single large HTML file
	echo.  pickle     to make pickle files
	echo.  json       to make JSON files
	echo.  htmlhelp   to make HTML files and a HTML help project
	echo.  qthelp     to make HTML files and a qthelp project
	echo.  devhelp    to make HTML files and a Devhelp project
	echo.  epub       to make an epub
	echo.  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter
	echo.  text       to make text files
	echo.  man        to make manual pages
	echo.  texinfo    to make Texinfo files
	echo.  gettext    to make PO message catalogs
	echo.  changes    to make an overview over all changed/added/deprecated items
	echo.  xml        to make Docutils-native XML files
	echo.  pseudoxml  to make pseudoxml-XML files for display purposes
	echo.  linkcheck  to check all external links for integrity
	echo.  doctest    to run all doctests embedded in the documentation if enabled
	echo.  coverage   to run coverage check of the documentation if enabled
	goto end
)

if "%1" == "clean" (
	for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
	del /q /s %BUILDDIR%\*
	goto end
)


REM Check if sphinx-build is available and fallback to Python version if any
%SPHINXBUILD% 2> nul
if errorlevel 9009 goto sphinx_python
goto sphinx_ok

:sphinx_python

set SPHINXBUILD=python -m sphinx.__init__
%SPHINXBUILD% 2> nul
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

:sphinx_ok


if "%1" == "html" (
	%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The HTML pages are in %BUILDDIR%/html.
	goto end
)

if "%1" == "dirhtml" (
	%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
	goto end
)

if "%1" == "singlehtml" (
	%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
	goto end
)

if "%1" == "pickle" (
	%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can process the pickle files.
	goto end
)

if "%1" == "json" (
	%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can process the JSON files.
	goto end
)

if "%1" == "htmlhelp" (
	%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can run HTML Help Workshop with the ^
.hhp project file in %BUILDDIR%/htmlhelp.
	goto end
)

if "%1" == "qthelp" (
	%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can run "qcollectiongenerator" with the ^
.qhcp project file in %BUILDDIR%/qthelp, like this:
	echo.^> qcollectiongenerator %BUILDDIR%\qthelp\pyGeno.qhcp
	echo.To view the help file:
	echo.^> assistant -collectionFile %BUILDDIR%\qthelp\pyGeno.ghc
	goto end
)

if "%1" == "devhelp" (
	%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished.
	goto end
)

if "%1" == "epub" (
	%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The epub file is in %BUILDDIR%/epub.
	goto end
)

if "%1" == "latex" (
	%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
	goto end
)

if "%1" == "latexpdf" (
	%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
	cd %BUILDDIR%/latex
	make all-pdf
	cd %~dp0
	echo.
	echo.Build finished; the PDF files are in %BUILDDIR%/latex.
	goto end
)

if "%1" == "latexpdfja" (
	%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
	cd %BUILDDIR%/latex
	make all-pdf-ja
	cd %~dp0
	echo.
	echo.Build finished; the PDF files are in %BUILDDIR%/latex.
	goto end
)

if "%1" == "text" (
	%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The text files are in %BUILDDIR%/text.
	goto end
)

if "%1" == "man" (
	%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The manual pages are in %BUILDDIR%/man.
	goto end
)

if "%1" == "texinfo" (
	%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.
	goto end
)

if "%1" == "gettext" (
	%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The message catalogs are in %BUILDDIR%/locale.
	goto end
)

if "%1" == "changes" (
	%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
	if errorlevel 1 exit /b 1
	echo.
	echo.The overview file is in %BUILDDIR%/changes.
	goto end
)

if "%1" == "linkcheck" (
	%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
	if errorlevel 1 exit /b 1
	echo.
	echo.Link check complete; look for any errors in the above output ^
or in %BUILDDIR%/linkcheck/output.txt.
	goto end
)

if "%1" == "doctest" (
	%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
	if errorlevel 1 exit /b 1
	echo.
	echo.Testing of doctests in the sources finished, look at the ^
results in %BUILDDIR%/doctest/output.txt.
	goto end
)

if "%1" == "coverage" (
	%SPHINXBUILD% -b coverage %ALLSPHINXOPTS% %BUILDDIR%/coverage
	if errorlevel 1 exit /b 1
	echo.
	echo.Testing of coverage in the sources finished, look at the ^
results in %BUILDDIR%/coverage/python.txt.
	goto end
)

if "%1" == "xml" (
	%SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The XML files are in %BUILDDIR%/xml.
	goto end
)

if "%1" == "pseudoxml" (
	%SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml.
	goto end
)

:end


================================================
FILE: pyGeno/doc/source/bootstraping.rst
================================================
Bootstraping
==================================
pyGeno can be quick-started by importing these built-in data wraps.

.. automodule:: bootstrap
   :members:


================================================
FILE: pyGeno/doc/source/citing.rst
================================================
Citing
=========

If you are using pyGeno please mention it to the rest of the universe by including a link to: https://github.com/tariqdaouda/pyGeno

================================================
FILE: pyGeno/doc/source/conf.py
================================================
# -*- coding: utf-8 -*-
#
# pyGeno documentation build configuration file, created by
# sphinx-quickstart on Thu Nov  6 16:45:34 2014.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

import sys
import os

import sphinx_rtd_theme

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('../..'))

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.doctest',
    'sphinx.ext.intersphinx',
    'sphinx.ext.todo',
    'sphinx.ext.coverage',
    'sphinx.ext.mathjax',
    'sphinx.ext.ifconfig',
    'sphinx.ext.viewcode',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix of source filenames.
source_suffix = '.rst'

# The encoding of source files.
#source_encoding = 'utf-8-sig'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = u'pyGeno'
copyright = u'2014, Tariq Daouda'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '1.x'
# The full version, including alpha/beta/rc tags.
release = '1.2.x'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []

# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []

# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False


# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages.  See the documentation for
# a list of builtin themes.
# html_theme = "default"
# html_theme_options = {
# 	"rightsidebar":"true",
# 	"stickysidebar" : "false",
# }
html_theme = "sphinx_rtd_theme"

# Theme options are theme-specific and customize the look and feel of a theme
# further.  For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}

# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]

# The name for this set of Sphinx documents.  If None, it defaults to
# "<project> v<release> documentation".
#html_title = None

# A shorter title for the navigation bar.  Default is the same as html_title.
#html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
html_logo = "logo.png"

# The name of an image file (within the static path) to use as favicon of the
# docs.  This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []

# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True

# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}

# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}

# If false, no module index is generated.
#html_domain_indices = True

# If false, no index is generated.
#html_use_index = True

# If true, the index is split into individual pages for each letter.
#html_split_index = False

# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True

# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it.  The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''

# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None

# Language to be used for generating the HTML full-text search index.
# Sphinx supports the following languages:
#   'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'
#   'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr'
#html_search_language = 'en'

# A dictionary with options for the search language support, empty by default.
# Now only 'ja' uses this config value
#html_search_options = {'type': 'default'}

# The name of a javascript file (relative to the configuration directory) that
# implements a search results scorer. If empty, the default will be used.
#html_search_scorer = 'scorer.js'

# Output file base name for HTML help builder.
htmlhelp_basename = 'pyGenodoc'

# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
'papersize': 'letterpaper',

# The font size ('10pt', '11pt' or '12pt').
'pointsize': '10pt',

# Additional stuff for the LaTeX preamble.
#'preamble': '',

# Latex figure (float) alignment
#'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
  ('index', 'pyGeno.tex', u'pyGeno Documentation',
   u'Tariq Daouda', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
# the title page.
# latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False

# If true, show page references after internal links.
#latex_show_pagerefs = False

# If true, show URL addresses after external links.
#latex_show_urls = False

# Documents to append as an appendix to all manuals.
#latex_appendices = []

# If false, no module index is generated.
#latex_domain_indices = True


# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    ('index', 'pygeno', u'pyGeno Documentation',
     [u'Tariq Daouda'], 1)
]

# If true, show URL addresses after external links.
#man_show_urls = False


# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
  ('index', 'pyGeno', u'pyGeno Documentation',
   u'Tariq Daouda', 'pyGeno', 'One line description of project.',
   'Miscellaneous'),
]

# Documents to append as an appendix to all manuals.
#texinfo_appendices = []

# If false, no module index is generated.
#texinfo_domain_indices = True

# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'

# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False


# -- Options for Epub output ----------------------------------------------

# Bibliographic Dublin Core info.
epub_title = u'pyGeno'
epub_author = u'Tariq Daouda'
epub_publisher = u'Tariq Daouda'
epub_copyright = u'2014, Tariq Daouda'

# The basename for the epub file. It defaults to the project name.
#epub_basename = u'pyGeno'

# The HTML theme for the epub output. Since the default themes are not optimized
# for small screen space, using the same theme for HTML and epub output is
# usually not wise. This defaults to 'epub', a theme designed to save visual
# space.
#epub_theme = 'epub'

# The language of the text. It defaults to the language option
# or 'en' if the language is not set.
#epub_language = ''

# The scheme of the identifier. Typical schemes are ISBN or URL.
#epub_scheme = ''

# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#epub_identifier = ''

# A unique identification for the text.
#epub_uid = ''

# A tuple containing the cover image and cover page html template filenames.
#epub_cover = ()

# A sequence of (type, uri, title) tuples for the guide element of content.opf.
#epub_guide = ()

# HTML files that should be inserted before the pages created by sphinx.
# The format is a list of tuples containing the path and title.
#epub_pre_files = []

# HTML files shat should be inserted after the pages created by sphinx.
# The format is a list of tuples containing the path and title.
#epub_post_files = []

# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']

# The depth of the table of contents in toc.ncx.
#epub_tocdepth = 3

# Allow duplicate toc entries.
#epub_tocdup = True

# Choose between 'default' and 'includehidden'.
#epub_tocscope = 'default'

# Fix unsupported image types using the PIL.
#epub_fix_images = False

# Scale large images.
#epub_max_image_width = 0

# How to display URL addresses: 'footnote', 'no', or 'inline'.
#epub_show_urls = 'inline'

# If false, no index is generated.
#epub_use_index = True


# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {'http://docs.python.org/': None}


================================================
FILE: pyGeno/doc/source/datawraps.rst
================================================
Datawraps
=========

Datawraps are used by pyGeno to import data into it's database. All reference genomes are downloaded from Ensembl, dbSNP data from dbSNP.
The :doc:`/bootstraping` module has functions to import datawraps shipped with pyGeno.
Datawraps can either be tar.gz.archives or folders.

Importation
-----------

Here's how you import a reference genome datawrap::

	from pyGeno.importation.Genomes import *
	importGenome("my_datawrap.tar.gz")


And a SNP set datawrap::
	
	from pyGeno.importation.SNPs import *
	importSNPs("my_datawrap.tar.gz")


Creating you own datawraps
--------------------------

For polymorphims, create a file called **manifest.ini** with the following format::

	[package_infos]
	description = SNPs for testing purposes
	maintainer = Tariq Daouda
	maintainer_contact = tariq.daouda [at] umontreal
	version = 1

	[set_infos]
	species = human
	name = mySNPSET
	type = Agnostic # or CasavaSNP or dbSNPSNP
	source = Where do these snps come from?

	[snps]
	filename = snps.txt # or http://www.example.com/snps.txt or ftp://www.example.com/snps.txt if you chose not to include the file in the archive

And compress the **manifest.ini** file along sith the snps.txt (if you chose to include it and not to specify an url) into a tar.gz archive.


Natively pyGeno supports dbSNP and casava(snp.txt), but it also has its own polymorphism file format (AgnosticSNP) wich is simply a tab delemited file in the following format::

	chromosomeNumber uniqueId   start        end      ref    alleles   quality       caller
	        Y          1       2655643      2655644	   T       AG        30          TopHat
	        Y          2       2655645      2655647    -       AG        28          TopHat
	        Y          3       2655648      2655650    TT      -         10          TopHat

Even tough all field are mandatory, the only ones that are critical for pyGeno to be able insert polymorphisms at the right places are: *chromosomeNumber* and *start*. All the other fields are non critical and can follow any convention you wish to apply to them, including the *end* field. You can choose the convention that suits best the query that you are planning to make on SNPs through .get(), or the way you intend to filter them using filtering objtecs.

For genomes, the manifet.ini file looks like this::

	[package_infos]
	description = Test package. This package installs only chromosome Y of mus musculus
	maintainer = Tariq Daouda
	maintainer_contact = tariq.daouda [at] umontreal
	version = GRCm38.73

	[genome]
	species = Mus_musculus
	name = GRCm38_test
	source = http://useast.ensembl.org/info/data/ftp/index.html

	[chromosome_files]
	Y = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz # or an url such as ftp://... or http://

	[gene_set]
	gtf = Mus_musculus.GRCm38.73_Y-only.gtf.gz # or an url such as ftp://... or http://

File URLs for refercence genomes can be found on `Ensembl: http://useast.ensembl.org/info/data/ftp/index.html`_

To learn more about datawraps and how to make your own you can have a look at :doc:`/importation`, and the Wiki_.

.. _Wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-datawrap-to-import-your-data
.. _`Ensembl: http://useast.ensembl.org/info/data/ftp/index.html`: http://useast.ensembl.org/info/data/ftp/index.html


================================================
FILE: pyGeno/doc/source/importation.rst
================================================
Importation
===========
pyGeno's database is populated by importing tar.gz compressed archives called *datawraps*. An importation is a one time step and once a datawrap has been imported it can be discarded with no concequences to the database. 

Here's how you import a reference genome datawrap::
	
	from pyGeno.importation.Genomes import *
	importGenome("my_genome_datawrap.tar.gz")


And a SNP set datawrap::
	
	from pyGeno.importation.SNPs import *
	importSNPs("my_snp_datawrap.tar.gz")

pyGeno comes with a few datawraps that you can quickly import using the :doc:`/bootstraping` module.

You can find a list of datawraps to import here: :doc:`/datawraps`

You can also easily create your own by simply putting a bunch of URLs into a *manifest.ini* file and compressing int into a *tar.gz archive* (as explained below or on the Wiki_).

.. _Wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-datawrap-to-import-your-data

Genomes
-------
.. automodule:: importation.Genomes
   :members:

Polymorphisms (SNPs and Indels)
-------------------------------

.. automodule:: importation.SNPs
   :members:


================================================
FILE: pyGeno/doc/source/index.rst
================================================
.. pyGeno documentation master file, created by
   sphinx-quickstart on Thu Nov  6 16:45:34 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. image:: http://bioinfo.iric.ca/~daoudat/pyGeno/_static/logo.png
   :alt: pyGeno's logo

pyGeno: A Python package for precision medicine and proteogenomics
===================================================================
.. image:: http://depsy.org/api/package/pypi/pyGeno/badge.svg
   :alt: depsy
   :target: http://depsy.org/package/python/pyGeno
.. image:: https://img.shields.io/badge/License-Apache%202.0-blue.svg
    :target: https://opensource.org/licenses/Apache-2.0
.. image:: https://img.shields.io/badge/python-2.7-blue.svg 

pyGeno's `lair is on Github`_.

.. _lair is on Github: http://www.github.com/tariqdaouda/pyGeno

Citing pyGeno:
--------------

Please cite this paper_.

.. _paper: http://f1000research.com/articles/5-381/v1

A Quick Intro:
-----------------

Even tough more and more research focuses on Personalized/Precision Medicine, treatments that are specically tailored to the patient, pyGeno is (to our knowlege) the only tool available that will gladly build your specific genomes for you and you give an easy access to them.

pyGeno allows you to create and work on **Personalized Genomes**: custom genomes that you create by combining a reference genome, sets of polymorphims and an optional filter.
pyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get
direct access to the DNA and Protein sequences of your patients/subjects.
To know more about how to create a Personalized Genome, have a look at the :doc:`/quickstart` section.

pyGeno can also function as a personal bioinformatic database for Ensembl, that runs directly into python, on your laptop, making faster and more reliable than any REST API. pyGeno makes extracting data such as gene sequences a breeze, and is designed to
be able cope with huge queries.


.. code::

	from pyGeno.Genome import *

	g = Genome(name = "GRCh37.75")
	prot = g.get(Protein, id = 'ENSP00000438917')[0]
	#print the protein sequence
	print prot.sequence
	#print the protein's gene biotype
	print prot.gene.biotype
	#print protein's transcript sequence
	print prot.transcript.sequence

	#fancy queries
	for exons in g.get(Exons, {"CDS_start >": x1, "CDS_end <=" : x2, "chromosome.number" : "22"}) :
		#print the exon's coding sequence
		print exon.CDS
		#print the exon's transcript sequence
		print exon.transcript.sequence

	#You can do the same for your subject specific genomes
	#by combining a reference genome with polymorphisms 
	g = Genome(name = "GRCh37.75", SNPs = ["STY21_RNA"], SNPFilter = MyFilter())


Verbose Introduction
---------------------

pyGeno integrates:

* **Reference sequences** and annotations from **Ensembl**

* Genomic polymorphisms from the **dbSNP** database

* SNPs from **next-gen** sequencing

pyGeno is a python package that  was designed to be:

* Fast to install. It has no dependencies but its own backend: `rabaDB`_.
* Fast to run and memory efficient, so you can use it on your laptop.
* Fast to use. No queries to foreign APIs all the data rests on your computer, so it is readily accessible when you need it.
* Fast to learn. One single function **get()** can do the job of several other tools at once. 

It also comes with:

* Parsers for: FASTA, FASTQ, GTF, VCF, CSV.
* Useful tools for translation etc...
* Optimised genome indexation with *Segment Trees*.
* A funky *Progress Bar*.

One of the the coolest things about pyGeno is that it also allows to quickly create **personalized genomes**.
Genomes that you design yourself by combining a reference genome and SNP sets derived from dbSNP or next-gen sequencing.

pyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_), its logo is the work of the freelance designer `Sawssan Kaddoura`_.
For the latest news about pyGeno, you can follow me on twitter `@tariqdaouda`_.

.. _rabaDB: https://github.com/tariqdaouda/rabaDB
.. _Tariq Daouda: http://www.tariqdaouda.com
.. _IRIC: http://www.iric.ca
.. _Sawssan Kaddoura: http://www.sawssankaddoura.com
.. _@tariqdaouda: https://www.twitter.com/tariqdaouda

Contents:
----------

.. toctree::
   :maxdepth: 2
   
   publications
   quickstart
   installation
   bootstraping
   querying
   importation
   datawraps
   objects
   snp_filter
   tools
   parsers

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


================================================
FILE: pyGeno/doc/source/installation.rst
================================================
Installation
=============

Unix (MacOS, Linux)
-------------------

The latest stable version is available from pypi::
	
	pip install pyGeno

**Upgrade**::

	pip install pyGeno --upgrade

If you're more adventurous, the bleeding edge version is available from github (look for the 'bloody' branch)::

	git clone https://github.com/tariqdaouda/pyGeno.git
	cd pyGeno
	python setup.py develop

**Upgrade**::

	git pull

Windows
-------

* Goto: https://www.python.org/downloads/ and download the installer for the lastest version of python 2.7
* Double click on the installer to begin installation
* Click on the windows start menu
* Type *"cmd"* and click on it to launch the command line interface
* In the command line interface type::
	
	 cd C:\Python27\Scripts

* Now type: pip install pyGeno
* Now click on the windows start menu. In the python 2.7 menu you can either launch *Python (Command line)* or *IDLE (Python GUI)*
* You can now go to: http://pygeno.iric.ca/quickstart.html and type the commands into either one of them

**UPGRADE:** to upgrade pyGeno to the latest version, launch *cmd* and type::

	cd C:\Python27\Scripts

followed by::
	
	pip install pyGeno --upgrade


================================================
FILE: pyGeno/doc/source/objects.rst
================================================
Objects
=======
With pyGeno you can manipulate familiar object in intuituive way. All the following classes except SNP inherit from pyGenoObjectWrapper and have therefor access to functions sur as get(), count(), ensureIndex()... 

Base classes
-------------
Base classes are abstract and are not meant to be instanciated, they nonetheless implement most of the functions that classes such as Genome possess.

.. automodule:: pyGenoObjectBases
   :members:

Genome
-------
.. automodule:: Genome
   :members:

Chromosome
----------
.. automodule:: Chromosome
   :members:

Gene
----
.. automodule:: Gene
   :members:

Transcript
----------
.. automodule:: Transcript
   :members:

Exon
----
.. automodule:: Exon
   :members:

Protein
-------
.. automodule:: Protein
   :members:

SNP
---
.. automodule:: SNP
   :members:


================================================
FILE: pyGeno/doc/source/parsers.rst
================================================
Parsers
=======

PyGeno comes with a set of parsers that you can use independentely.

CSV
---
To read and write CSV files.

.. automodule:: tools.parsers.CSVTools
   :members:

FASTA
-----
To read and write FASTA files.

.. automodule:: tools.parsers.FastaTools
   :members:

FASTQ
-----
To read and write FASTQ files.

.. automodule:: tools.parsers.FastqTools
   :members:

GTF
---
To read GTF files.

.. automodule:: tools.parsers.GTFTools
   :members:

VCF
---
To read VCF files.

.. automodule:: tools.parsers.VCFTools
   :members:
 
Casava
-------
To read casava files.

.. automodule:: tools.parsers.CasavaTools
   :members:


================================================
FILE: pyGeno/doc/source/publications.rst
================================================
Publications
============

Please cite this one:
---------------------

`pyGeno: A Python package for precision medicine and proteogenomics. F1000Research, 2016`_

.. _`pyGeno: A Python package for precision medicine and proteogenomics. F1000Research, 2016`: http://f1000research.com/articles/5-381/v2

pyGeno was also used in the following studies:
----------------------------------------------

`MHC class I–associated peptides derive from selective regions of the human genome. J Clin Invest. 2016`_

.. _`MHC class I–associated peptides derive from selective regions of the human genome. J Clin Invest. 2016`: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5127664/


`Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames. Nat. Comm. 2015`_

.. _Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames. Nat. Comm. 2015: http://dx.doi.org/10.1038/ncomms10238

`Impact of genomic polymorphisms on the repertoire of human MHC class I-associated peptides. Nat. Comm. 2014`_

.. _Impact of genomic polymorphisms on the repertoire of human MHC class I-associated peptides. Nat. Comm. 2014: http://www.ncbi.nlm.nih.gov/pubmed/24714562

`MHC I-associated peptides preferentially derive from transcripts bearing miRNA response elements. Blood. 2012`_

.. _MHC I-associated peptides preferentially derive from transcripts bearing miRNA response elements. Blood. 2012: http://www.ncbi.nlm.nih.gov/pubmed/22438248


================================================
FILE: pyGeno/doc/source/querying.rst
================================================
Querying
=========

pyGeno is a personal database that you can query in many ways. Special emphasis has been placed upon ease of use, and you only need to remember two functions::

	* get()
	* help()

**get()** can be called from any pyGeno object to get any objects.

**help()** is you best friend when you get lost using **get()**. When called, it will give the list of all field that you can use in get queries. You can call it either of the class::

	Gene.help()

Or on the object::

	ref = Genome(name = "GRCh37.75")
	g = ref.get(Gene, name = "TPST2")[0]

	g.help()

Both will print::

	'Available fields for Gene: end, name, chromosome, start, biotype, id, strand, genome'


================================================
FILE: pyGeno/doc/source/quickstart.rst
================================================
Quickstart
==========

Quick importation
-----------------
pyGeno's database is populated by importing data wraps.
pyGeno comes with a few datawraps, to get the list you can use:

.. code:: python
	
	import pyGeno.bootstrap as B
	B.printDatawraps()

.. code::

	Available datawraps for bootstraping
	
	SNPs
	~~~~|
	    |~~~:> Human_agnostic.dummySRY.tar.gz
	    |~~~:> Human.dummySRY_casava.tar.gz
	    |~~~:> dbSNP142_human_GRCh37_common_all.tar.gz
	    |~~~:> dbSNP142_human_common_all.tar.gz
	
	
	Genomes
	~~~~~~~|
	       |~~~:> Human.GRCh37.75.tar.gz
	       |~~~:> Human.GRCh37.75_Y-Only.tar.gz
	       |~~~:> Human.GRCh38.78.tar.gz
	       |~~~:> Mouse.GRCm38.78.tar.gz

To get a list of remote datawraps that pyGeno can download for you, do:

.. code:: python

	B.printRemoteDatawraps()


Importing whole genomes is a demanding process that take more than an hour and requires (according to tests) 
at least 3GB of memory. Depending on your configuration, more might be required.

That being said importating a data wrap is a one time operation and once the importation is complete the datawrap
can be discarded without consequences.

The bootstrap module also has some handy functions for importing built-in packages.

Some of them just for playing around with pyGeno (**Fast importation** and **Small memory requirements**):

.. code:: python
	
	import pyGeno.bootstrap as B

	#Imports only the Y chromosome from the human reference genome GRCh37.75
	#Very fast, requires even less memory. No download required.
	B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
	
	#A dummy datawrap for humans SNPs and Indels in pyGeno's AgnosticSNP  format. 
	# This one has one SNP at the begining of the gene SRY
	B.importSNPs("Human.dummySRY_casava.tar.gz")

And for more serious work, the whole reference genome.

.. code:: python

	#Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.
	B.importGenome("Human.GRCh38.78.tar.gz")

That's it, you can now print the sequences of all the proteins that a gene can produce::

	from pyGeno.Genome import Genome
	from pyGeno.Gene import Gene
	from pyGeno.Protein import Protein

	#the name of the genome is defined inside the package's manifest.ini file
	ref = Genome(name = 'GRCh37.75')
	#get returns a list of elements
	gene = ref.get(Gene, name = 'SRY')[0]
	for prot in gene.get(Protein) :
		  print prot.sequence

You can see pyGeno achitecture as a graph where everything is connected to everything. For instance you can do things such as::

	gene = aProt.gene
	trans = aProt.transcript
	prot = anExon.protein
	genome = anExon.genome

Queries
-------

PyGeno allows for several kinds of queries, here are some snippets::

	#in this case both queries will yield the same result
	myGene.get(Protein, id = "ENSID...")
	myGenome.get(Protein, id = "ENSID...")
	
	#even complex stuff
	exons = myChromosome.get(Exons, {'start >=' : x1, 'stop <' : x2})
	hlaGenes = myGenome.get(Gene, {'name like' : 'HLA'})

	sry = myGenome.get(Transcript, { "gene.name" : 'SRY' })

To know the available fields for queries, there's a "help()" function::

	Gene.help()


Faster queries
---------------

To speed up loops use iterGet()::
	
	for prot in gene.iterGet(Protein) :
	  print prot.sequence

For more speed create indexes on the fields you need the most::
	
	Gene.ensureGlobalIndex('name')


Getting sequences
-------------------

Anything that has a sequence can be indexed using the usual python list syntax::

	protein[34] # for the 34th amino acid
	protein[34:40] # for amino acids in [34, 40[

	transcript[23] #for the 23rd nucleotide of the transcript
	transcript[23:30] #for nucletotides in [23, 30[

	transcript.cDNA[23:30] #the same but for the protein coding DNA (without the UTRs)

Transcripts, Proteins, Exons also have a *.sequence* attribute. This attribute is the string rendered sequence, it is perfect for printing but it  may contain '/'s 
in case of polymorphic sequence that you must
take into account in the indexing. On the other hand if you use indexes directly on the object (as shown in the snippet above) pyGeno will use a binary representaion
of the sequences thus the indexing is independent of the polymorphisms present in the sequences.

Personalized Genomes
--------------------

Personalized Genomes are a powerful feature that allow to work on the specific genomes and proteomes of your patients. You can even mix several SNPs together::
	
	from pyGeno.Genome import Genome
	#the name of the snp set is defined inside the datawraps's manifest.ini file
	dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')
	#you can also define a filter (ex: a quality filter) for the SNPs
	dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())
	#and even mix several snp sets
	dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())

pyGeno allows you to customize the Polymorphisms that end up into the final sequences. It supports SNPs, Inserts and Deletions::
	
	from pyGeno.SNPFiltering import SNPFilter
	from pyGeno.SNPFiltering import SequenceSNP

	class QMax_gt_filter(SNPFilter) :

		def __init__(self, threshold) :
			self.threshold = threshold

		def filter(self, chromosome, dummySRY = None) :
			if dummySRY.Qmax_gt > self.threshold :
				#other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)
				return SequenceSNP(dummySRY.alt)
			return None #None means keep the reference allele

	persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))


================================================
FILE: pyGeno/doc/source/snp_filter.rst
================================================
Filtering Polymorphisms (SNPs and Indels)
=========================================

Polymorphism filtering is an important part of personalized genomes. By creating your own filters you can easily taylor personalized genomes to your needs. The importaant thing to understand about the filtering process, is that it gives you complete freedom about should be inserted. 
Once pyGeno finds a polymorphism, it automatically triggers the filter to know what should be inserted at this position, and that can be anything you choose.

.. automodule:: SNPFiltering
   :members:


================================================
FILE: pyGeno/doc/source/tools.rst
================================================
Tools
=====

pyGeno provides a set of tools that can be used independentely. Here you'll find goodies for translation, indexation, and more.

Progress Bar
-------------
pyGeno's awesome progress bar, with logging capabilities and remaining time estimation.

.. automodule:: tools.ProgressBar
   :members:

Useful functions
-----------------
This module is a bunch of handy bioinfo functions.

.. automodule:: tools.UsefulFunctions
   :members:

Binary sequences
-----------------
To encode sequence in binary formats

.. automodule:: tools.BinarySequence
   :members:

Segment tree
-------------
Segment trees are an optimised way to index a genome.

.. automodule:: tools.SegmentTree
   :members:

Secure memory map
------------------
A write protected memory map.

.. automodule:: tools.SecureMmap
   :members:


================================================
FILE: pyGeno/examples/__init__.py
================================================


================================================
FILE: pyGeno/examples/bootstraping.py
================================================
import pyGeno.bootstrap as B

#~ imports the whole human reference genome
#~ B.importHumanReference()
B.importHumanReferenceYOnly()
B.importDummySRY()


================================================
FILE: pyGeno/examples/snps_queries.py
================================================
from pyGeno.Genome import Genome
from pyGeno.Gene import Gene
from pyGeno.Transcript import Transcript
from pyGeno.Protein import Protein
from pyGeno.Exon import Exon

from pyGeno.SNPFiltering import SNPFilter
from pyGeno.SNPFiltering import SequenceSNP

def printing(gene) :
	print "printing reference sequences\n-------"
	for trans in gene.get(Transcript) :
		print "\t-Transcript name:", trans.name
		print "\t-Protein:", trans.protein.sequence
		print "\t-Exons:"
		for e in trans.exons :
			print "\t\t Number:", e.number
			print "\t\t-CDS:", e.CDS
			print "\t\t-Strand:", e.strand
			print "\t\t-CDS_start:", e.CDS_start
			print "\t\t-CDS_end:", e.CDS_end

def printVs(refGene, presGene) :
	print "Vs personalized sequences\n------"

	#iterGet returns an iterator instead of a list (faster)
	for trans in presGene.iterGet(Transcript) :
		refProt = refGene.get(Protein, id = trans.protein.id)[0]
		persProt = trans.protein
		print persProt.id
		print "\tref:" + refProt.sequence[:20] + "..."
		print "\tper:" + persProt.sequence[:20] + "..."
		print

def fancyExonQuery(gene) :
	e = gene.get(Exon, {'CDS_start >' : 2655029, 'CDS_end <' : 2655200})[0]
	print "An exon with a CDS in ']2655029, 2655200[':", e.id
	
class QMax_gt_filter(SNPFilter) :
	
	def __init__(self, threshold) :
		self.threshold = threshold
		
	def filter(self, chromosome, dummySRY) :
		if dummySRY.Qmax_gt > self.threshold :
			#other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)
			return SequenceSNP(dummySRY.alt)
		return None #None means keep the reference allele

if __name__ == "__main__" :
	refGenome = Genome(name = 'GRCh37.75_Y-Only')
	persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))
	
	geneRef = refGenome.get(Gene, name = 'SRY')[0]
	persGene = persGenome.get(Gene, name = 'SRY')[0]
	
	printing(geneRef)
	print "\n"
	printVs(geneRef, persGene)
	print "\n"
	fancyExonQuery(geneRef)


================================================
FILE: pyGeno/importation/Genomes.py
================================================
import os, glob, gzip, tarfile, shutil, time, sys, gc, cPickle, tempfile, urllib2
from contextlib import closing
from ConfigParser import SafeConfigParser

from pyGeno.tools.ProgressBar import ProgressBar
import pyGeno.configuration as conf

from pyGeno.Genome import *
from pyGeno.Chromosome import *
from pyGeno.Gene import *
from pyGeno.Transcript import *
from pyGeno.Exon import *
from pyGeno.Protein import *

from pyGeno.tools.parsers.GTFTools import GTFFile
from pyGeno.tools.ProgressBar import ProgressBar
from pyGeno.tools.io import printf

import gc
#~ import objgraph

def backUpDB() :
    """backup the current database version. automatically called by importGenome(). Returns the filename of the backup"""
    st = time.ctime().replace(' ', '_')
    fn = conf.pyGeno_RABA_DBFILE.replace('.db', '_%s-bck.db' % st)
    shutil.copy2(conf.pyGeno_RABA_DBFILE, fn)

    return fn

def _decompressPackage(packageFile) :
    pFile = tarfile.open(packageFile)
    
    packageDir = tempfile.mkdtemp(prefix = "pyGeno_import_")
    if os.path.isdir(packageDir) :
        shutil.rmtree(packageDir)
    os.makedirs(packageDir)

    for mem in pFile :
        pFile.extract(mem, packageDir)

    return packageDir

def _getFile(fil, directory) :
    if fil.find("http://") == 0 or fil.find("ftp://") == 0 :
        printf("Downloading file: %s..." % fil)
        finalFile = os.path.normpath('%s/%s' %(directory, fil.split('/')[-1]))
        # urllib.urlretrieve (fil, finalFile)
        with closing(urllib2.urlopen(fil)) as r:
            with open(finalFile, 'wb') as f:
                shutil.copyfileobj(r, f)
        
        printf('done.')
    else :
        finalFile = os.path.normpath('%s/%s' %(directory, fil))
    
    return finalFile

def deleteGenome(species, name) :
    """Removes a genome from the database"""

    printf('deleting genome (%s, %s)...' % (species, name))

    conf.db.beginTransaction()
    objs = []
    allGood = True
    try :
        genome = Genome_Raba(name = name, species = species.lower())
        objs.append(genome)
        pBar = ProgressBar(label = 'preparing')
        for typ in (Chromosome_Raba, Gene_Raba, Transcript_Raba, Exon_Raba, Protein_Raba) :
            pBar.update()
            f = RabaQuery(typ, namespace = genome._raba_namespace)
            f.addFilter({'genome' : genome})
            for e in f.iterRun() :
                objs.append(e)
        pBar.close()
        
        pBar = ProgressBar(nbEpochs = len(objs), label = 'deleting objects')
        for e in objs :
            pBar.update()
            e.delete()
        pBar.close()
        
    except KeyError as e :
        #~ printf("\tWARNING, couldn't remove genome form db, maybe it's not there: ", e)
        raise KeyError("\tWARNING, couldn't remove genome form db, maybe it's not there: ", e)
        allGood = False
    printf('\tdeleting folder')
    try :
        shutil.rmtree(conf.getGenomeSequencePath(species, name))
    except OSError as e:
        #~ printf('\tWARNING, Unable to delete folder: ', e)
        OSError('\tWARNING, Unable to delete folder: ', e)
        allGood = False
        
    conf.db.endTransaction()
    return allGood

def importGenome(packageFile, batchSize = 50, verbose = 0) :
    """Import a pyGeno genome package. A genome packages is folder or a tar.gz ball that contains at it's root:

    * gziped fasta files for all chromosomes, or URLs from where them must be downloaded
    
    * gziped GTF gene_set file from Ensembl, or an URL from where it must be downloaded
    
    * a manifest.ini file such as::
    
        [package_infos]
        description = Test package. This package installs only chromosome Y of mus musculus
        maintainer = Tariq Daouda
        maintainer_contact = tariq.daouda [at] umontreal
        version = GRCm38.73

        [genome]
        species = Mus_musculus
        name = GRCm38_test
        source = http://useast.ensembl.org/info/data/ftp/index.html

        [chromosome_files]
        Y = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz / or an url such as ftp://... or http://

        [gene_set]
        gtf = Mus_musculus.GRCm38.73_Y-only.gtf.gz / or an url such as ftp://... or http://

    All files except the manifest can be downloaded from: http://useast.ensembl.org/info/data/ftp/index.html
    
    A rollback is performed if an exception is caught during importation
    
    batchSize sets the number of genes to parse before performing a database save. PCs with little ram like
    small values, while those endowed with more memory may perform faster with higher ones.
    
    Verbose must be an int [0, 4] for various levels of verbosity
    """

    def reformatItems(items) :
        s = str(items)
        s = s.replace('[', '').replace(']', '').replace("',", ': ').replace('), ', '\n').replace("'", '').replace('(', '').replace(')', '')
        return s

    printf('Importing genome package: %s... (This may take a while)' % packageFile)

    isDir = False
    if not os.path.isdir(packageFile) :
        packageDir = _decompressPackage(packageFile)
    else :
        isDir = True
        packageDir = packageFile

    parser = SafeConfigParser()
    parser.read(os.path.normpath(packageDir+'/manifest.ini'))
    packageInfos = parser.items('package_infos')

    genomeName = parser.get('genome', 'name')
    species = parser.get('genome', 'species')
    genomeSource = parser.get('genome', 'source')
    
    seqTargetDir = conf.getGenomeSequencePath(species.lower(), genomeName)
    if os.path.isdir(seqTargetDir) :
        raise KeyError("The directory %s already exists, Please call deleteGenome() first if you want to reinstall" % seqTargetDir)
        
    gtfFile = _getFile(parser.get('gene_set', 'gtf'), packageDir)
    
    chromosomesFiles = {}
    chromosomeSet = set()
    for key, fil in parser.items('chromosome_files') :
        chromosomesFiles[key] = _getFile(fil, packageDir)
        chromosomeSet.add(key)

    try :
        genome = Genome(name = genomeName, species = species)
    except KeyError:
        pass
    else :
        raise KeyError("There seems to be already a genome (%s, %s), please call deleteGenome() first if you want to reinstall it" % (genomeName, species))

    genome = Genome_Raba()
    genome.set(name = genomeName, species = species, source = genomeSource, packageInfos = packageInfos)

    printf("Importing:\n\t%s\nGenome:\n\t%s\n..."  % (reformatItems(packageInfos).replace('\n', '\n\t'), reformatItems(parser.items('genome')).replace('\n', '\n\t')))

    chros = _importGenomeObjects(gtfFile, chromosomeSet, genome, batchSize, verbose)
    os.makedirs(seqTargetDir)
    startChro = 0
    pBar = ProgressBar(nbEpochs = len(chros))
    for chro in chros :
        pBar.update(label = "Importing DNA, chro %s" % chro.number)
        length = _importSequence(chro, chromosomesFiles[chro.number.lower()], seqTargetDir)
        chro.start = startChro
        chro.end = startChro+length
        startChro = chro.end
        chro.save()
    pBar.close()
    
    if not isDir :
        shutil.rmtree(packageDir)
    
    #~ objgraph.show_most_common_types(limit=20)
    return True

#~ @profile
def _importGenomeObjects(gtfFilePath, chroSet, genome, batchSize, verbose = 0) :
    """verbose must be an int [0, 4] for various levels of verbosity"""

    class Store(object) :
        
        def __init__(self, conf) :
            self.conf = conf
            self.chromosomes = {}
            
            self.genes = {}
            self.transcripts = {}
            self.proteins = {}
            self.exons = {}

        def batch_save(self) :
            self.conf.db.beginTransaction()
            
            for c in self.genes.itervalues() :
                c.save()
                conf.removeFromDBRegistery(c)
                
            for c in self.transcripts.itervalues() :
                c.save()
                conf.removeFromDBRegistery(c.exons)
                conf.removeFromDBRegistery(c)
            
            for c in self.proteins.itervalues() :
                c.save()
                conf.removeFromDBRegistery(c)
            
            self.conf.db.endTransaction()
            
            del(self.genes)
            del(self.transcripts)
            del(self.proteins)
            del(self.exons)
            
            self.genes = {}
            self.transcripts = {}
            self.proteins = {}
            self.exons = {}

            gc.collect()

        def save_chros(self) :
            pBar = ProgressBar(nbEpochs = len(self.chromosomes))
            for c in self.chromosomes.itervalues() :
                pBar.update(label = 'Chr %s' % c.number)
                c.save()
            pBar.close()
        
    printf('Importing gene set infos from %s...' % gtfFilePath)
    
    printf('Backuping indexes...')
    indexes = conf.db.getIndexes()
    printf("Droping all your indexes, (don't worry i'll restore them later)...")
    Genome_Raba.flushIndexes()
    Chromosome_Raba.flushIndexes()
    Gene_Raba.flushIndexes()
    Transcript_Raba.flushIndexes()
    Protein_Raba.flushIndexes()
    Exon_Raba.flushIndexes()
    
    printf("Parsing gene set...")
    gtf = GTFFile(gtfFilePath, gziped = True)
    printf('Done. Importation begins!')
    
    store = Store(conf)
    chroNumber = None
    pBar = ProgressBar(nbEpochs = len(gtf))
    for line in gtf :
        chroN = line['seqname']
        pBar.update(label = "Chr %s" % chroN)
        
        if (chroN.upper() in chroSet or chroN.lower() in chroSet):
            strand = line['strand']
            gene_biotype = line['gene_biotype']
            regionType = line['feature']
            frame = line['frame']

            start = int(line['start']) - 1
            end = int(line['end'])
            if start > end :
                start, end = end, start

            chroNumber = chroN.upper()
            if chroNumber not in store.chromosomes :
                store.chromosomes[chroNumber] = Chromosome_Raba()
                store.chromosomes[chroNumber].set(genome = genome, number = chroNumber)
            
            try :
                geneId = line['gene_id']
                geneName =  line['gene_name']
            except KeyError :
                geneId = None
                geneName = None
                if verbose :
                    printf('Warning: no gene_id/name found in line %s' % gtf[i])
            
            if geneId is not None :
                if geneId not in store.genes :
                    if len(store.genes) > batchSize :
                        store.batch_save()
                    
                    if verbose > 0 :
                        printf('\tGene %s, %s...' % (geneId, geneName))
                    store.genes[geneId] = Gene_Raba()
                    store.genes[geneId].set(genome = genome, id = geneId, chromosome = store.chromosomes[chroNumber], name = geneName, strand = strand, biotype = gene_biotype)
                if start < store.genes[geneId].start or store.genes[geneId].start is None :
                        store.genes[geneId].start = start
                if end > store.genes[geneId].end or store.genes[geneId].end is None :
                    store.genes[geneId].end = end
            try :
                transId = line['transcript_id']
                transName = line['transcript_name']
                try :
                    transcript_biotype = line['transcript_biotype']
                except KeyError :
                    transcript_biotype = None
            except KeyError :
                transId = None
                transName = None
                if verbose > 2 :
                    printf('\t\tWarning: no transcript_id, name found in line %s' % gtf[i])
            
            if transId is not None :
                if transId not in store.transcripts :
                    if verbose > 1 :
                        printf('\t\tTranscript %s, %s...' % (transId, transName))
                    store.transcripts[transId] = Transcript_Raba()
                    store.transcripts[transId].set(genome = genome, id = transId, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), name = transName, biotype=transcript_biotype)
                if start < store.transcripts[transId].start or store.transcripts[transId].start is None:
                    store.transcripts[transId].start = start
                if end > store.transcripts[transId].end or store.transcripts[transId].end is None:
                    store.transcripts[transId].end = end
            
                try :
                    protId = line['protein_id']
                except KeyError :
                    protId = None
                    if verbose > 2 :
                        printf('Warning: no protein_id found in line %s' % gtf[i])

                # Store selenocysteine positions in transcript
                if regionType == 'Selenocysteine':
                    store.transcripts[transId].selenocysteine.append(start)
                        
                if protId is not None and protId not in store.proteins :
                    if verbose > 1 :
                        printf('\t\tProtein %s...' % (protId))
                    store.proteins[protId] = Protein_Raba()
                    store.proteins[protId].set(genome = genome, id = protId, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), transcript = store.transcripts.get(transId, None), name = transName)
                    store.transcripts[transId].protein = store.proteins[protId]

                try :
                    exonNumber = int(line['exon_number']) - 1
                    exonKey = (transId, exonNumber)
                except KeyError :
                    exonNumber = None
                    exonKey = None
                    if verbose > 2 :
                        printf('Warning: no exon number or id found in line %s' % gtf[i])
                
                if exonKey is not None :
                    if verbose > 3 :
                        printf('\t\t\texon %s...' % (exonId))
                    
                    if exonKey not in store.exons and regionType == 'exon' :
                        store.exons[exonKey] = Exon_Raba()
                        store.exons[exonKey].set(genome = genome, chromosome = store.chromosomes[chroNumber], gene = store.genes.get(geneId, None), transcript = store.transcripts.get(transId, None), protein = store.proteins.get(protId, None), strand = strand, number = exonNumber, start = start, end = end)
                        store.transcripts[transId].exons.append(store.exons[exonKey])
                    
                    try :
                        store.exons[exonKey].id = line['exon_id']
                    except KeyError :
                        pass
                    
                    if regionType == 'exon' :
                        if start < store.exons[exonKey].start or store.exons[exonKey].start is None:
                            store.exons[exonKey].start = start
                        if end > store.transcripts[transId].end or store.exons[exonKey].end is None:
                            store.exons[exonKey].end = end
                    elif regionType == 'CDS' :
                        store.exons[exonKey].CDS_start = start
                        store.exons[exonKey].CDS_end = end
                        store.exons[exonKey].frame = frame
                    elif regionType == 'stop_codon' :
                        if strand == '+' :
                            if store.exons[exonKey].CDS_end != None :
                                store.exons[exonKey].CDS_end += 3
                                if store.exons[exonKey].end < store.exons[exonKey].CDS_end :
                                    store.exons[exonKey].end = store.exons[exonKey].CDS_end
                                if store.transcripts[transId].end < store.exons[exonKey].CDS_end :
                                    store.transcripts[transId].end = store.exons[exonKey].CDS_end
                                if store.genes[geneId].end < store.exons[exonKey].CDS_end :
                                    store.genes[geneId].end = store.exons[exonKey].CDS_end
                        if strand == '-' :
                            if store.exons[exonKey].CDS_start != None :
                                store.exons[exonKey].CDS_start -= 3
                                if store.exons[exonKey].start > store.exons[exonKey].CDS_start :
                                    store.exons[exonKey].start = store.exons[exonKey].CDS_start
                                if store.transcripts[transId].start > store.exons[exonKey].CDS_start :
                                    store.transcripts[transId].start = store.exons[exonKey].CDS_start
                                if store.genes[geneId].start > store.exons[exonKey].CDS_start :
                                    store.genes[geneId].start = store.exons[exonKey].CDS_start
    pBar.close()
    
    store.batch_save()
    
    conf.db.beginTransaction()
    printf('almost done saving chromosomes...')
    store.save_chros()
    
    printf('saving genome object...')
    genome.save()
    conf.db.endTransaction()
    
    conf.db.beginTransaction()
    printf('restoring core indexes...')
    # Genome.ensureGlobalIndex(('name', 'species'))
    # Chromosome.ensureGlobalIndex('genome')
    # Gene.ensureGlobalIndex('genome')
    # Transcript.ensureGlobalIndex('genome')
    # Protein.ensureGlobalIndex('genome')
    # Exon.ensureGlobalIndex('genome')
    Transcript.ensureGlobalIndex('exons')
    
    printf('commiting changes...')
    conf.db.endTransaction()
    
    conf.db.beginTransaction()
    printf('restoring user indexes')
    pBar = ProgressBar(label = "restoring", nbEpochs = len(indexes))
    for idx in indexes :
        pBar.update()
        conf.db.execute(idx[-1].replace('CREATE INDEX', 'CREATE INDEX IF NOT EXISTS'))
    pBar.close()
    
    printf('commiting changes...')
    conf.db.endTransaction()
    
    return store.chromosomes.values()

#~ @profile
def _importSequence(chromosome, fastaFile, targetDir) :
    "Serializes fastas into .dat files"

    f = gzip.open(fastaFile)
    header = f.readline()
    strRes = f.read().upper().replace('\n', '').replace('\r', '')
    f.close()

    fn = '%s/chromosome%s.dat' % (targetDir, chromosome.number)
    f = open(fn, 'w')
    f.write(strRes)
    f.close()
    chromosome.dataFile = fn
    chromosome.header = header
    return len(strRes)


================================================
FILE: pyGeno/importation/SNPs.py
================================================
import urllib, shutil

from ConfigParser import SafeConfigParser
import pyGeno.configuration as conf
from pyGeno.SNP import *
from pyGeno.tools.ProgressBar import ProgressBar
from pyGeno.tools.io import printf
from Genomes import _decompressPackage, _getFile

from pyGeno.tools.parsers.CasavaTools import SNPsTxtFile
from pyGeno.tools.parsers.VCFTools import VCFFile
from pyGeno.tools.parsers.CSVTools import CSVFile

def importSNPs(packageFile) :
	"""The big wrapper, this function should detect the SNP type by the package manifest and then launch the corresponding function.
	Here's an example of a SNP manifest file for Casava SNPs::

		[package_infos]
		description = Casava SNPs for testing purposes
		maintainer = Tariq Daouda
		maintainer_contact = tariq.daouda [at] umontreal
		version = 1

		[set_infos]
		species = human
		name = dummySRY
		type = Agnostic
		source = my place at the IRIC

		[snps]
		filename = snps.txt # as with genomes you can either include de file at the root of the package or specify an URL from where it must be downloaded
	"""
	printf("Importing polymorphism set: %s... (This may take a while)" % packageFile)
	
	isDir = False
	if not os.path.isdir(packageFile) :
		packageDir = _decompressPackage(packageFile)
	else :
		isDir = True
		packageDir = packageFile

	fpMan = os.path.normpath(packageDir+'/manifest.ini')
	if not os.path.isfile(fpMan) :
		raise ValueError("Not file named manifest.ini! Mais quel SCANDALE!!!!")

	parser = SafeConfigParser()
	parser.read(os.path.normpath(packageDir+'/manifest.ini'))
	packageInfos = parser.items('package_infos')

	setName = parser.get('set_infos', 'name')
	typ = parser.get('set_infos', 'type')
	
	if typ.lower()[-3:] != 'snp' :
		typ += 'SNP'

	species = parser.get('set_infos', 'species').lower()
	genomeSource = parser.get('set_infos', 'source')
	snpsFileTmp = parser.get('snps', 'filename').strip()
	snpsFile = _getFile(parser.get('snps', 'filename'), packageDir)
	
	return_value = None

	try :
		SMaster = SNPMaster(setName = setName)
	except KeyError :
		if typ.lower() == 'casavasnp' :
			return_value = _importSNPs_CasavaSNP(setName, species, genomeSource, snpsFile)
		elif typ.lower() == 'dbsnpsnp' :
			return_value = _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile)
		elif typ.lower() == 'dbsnp' :
			return_value = _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile)
		elif typ.lower() == 'tophatsnp' :
			return_value = _importSNPs_TopHatSNP(setName, species, genomeSource, snpsFile)
		elif typ.lower() == 'agnosticsnp' :
			return_value = _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile)
		else :
			raise FutureWarning('Unknown SNP type in manifest %s' % typ)
	else :
		raise KeyError("There's already a SNP set by the name %s. Use deleteSNPs() to remove it first" %setName)

	if not isDir :
		shutil.rmtree(packageDir)

	return return_value

def deleteSNPs(setName) :
	"""deletes a set of polymorphisms"""
	con = conf.db
	try :
		SMaster = SNPMaster(setName = setName)
		con.beginTransaction()
		SNPType = SMaster.SNPType
		con.delete(SNPType, 'setName = ?', (setName,))
		SMaster.delete()
		con.endTransaction()
	except KeyError :
		raise KeyError("Can't delete the setName %s because i can't find it in SNPMaster, maybe there's not set by that name" % setName)
		#~ printf("can't delete the setName %s because i can't find it in SNPMaster, maybe there's no set by that name" % setName)
		return False
	return True

def _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile) :
	"This function will also create an index on start->chromosomeNumber->setName. Warning : pyGeno wil interpret all positions as 0 based"
	printf('importing SNP set %s for species %s...' % (setName, species))

	snpData = CSVFile()
	snpData.parse(snpsFile, separator = "\t")

	AgnosticSNP.dropIndex(('start', 'chromosomeNumber', 'setName'))
	conf.db.beginTransaction()
	
	pBar = ProgressBar(len(snpData))
	pLabel = ''
	currChrNumber = None
	for snpEntry in snpData :
		tmpChr = snpEntry['chromosomeNumber']
		if tmpChr != currChrNumber :
			currChrNumber = tmpChr
			pLabel = 'Chr %s...' % currChrNumber

		snp = AgnosticSNP()
		snp.species = species
		snp.setName = setName
		for f in snp.getFields() :
			try :
				setattr(snp, f, snpEntry[f])
			except KeyError :
				if f != 'species' and f != 'setName' :
					printf("Warning filetype as no key %s", f)
		snp.quality = float(snp.quality)
		snp.start = int(snp.start)
		snp.end = int(snp.end)
		snp.save()
		pBar.update(label = pLabel)

	pBar.close()
	
	snpMaster = SNPMaster()
	snpMaster.set(setName = setName, SNPType = 'AgnosticSNP', species = species)
	snpMaster.save()

	printf('saving...')
	conf.db.endTransaction()
	printf('creating indexes...')
	AgnosticSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName'))
	printf('importation of SNP set %s for species %s done.' %(setName, species))
	
	return True

def _importSNPs_CasavaSNP(setName, species, genomeSource, snpsFile) :
	"This function will also create an index on start->chromosomeNumber->setName. Warning : pyGeno positions are 0 based"
	printf('importing SNP set %s for species %s...' % (setName, species))

	snpData = SNPsTxtFile(snpsFile)
	
	CasavaSNP.dropIndex(('start', 'chromosomeNumber', 'setName'))
	conf.db.beginTransaction()
	
	pBar = ProgressBar(len(snpData))
	pLabel = ''
	currChrNumber = None
	for snpEntry in snpData :
		tmpChr = snpEntry['chromosomeNumber']
		if tmpChr != currChrNumber :
			currChrNumber = tmpChr
			pLabel = 'Chr %s...' % currChrNumber

		snp = CasavaSNP()
		snp.species = species
		snp.setName = setName
		
		for f in snp.getFields() :
			try :
				setattr(snp, f, snpEntry[f])
			except KeyError :
				if f != 'species' and f != 'setName' :
					printf("Warning filetype as no key %s", f)
		snp.start -= 1
		snp.end -= 1
		snp.save()
		pBar.update(label = pLabel)

	pBar.close()
	
	snpMaster = SNPMaster()
	snpMaster.set(setName = setName, SNPType = 'CasavaSNP', species = species)
	snpMaster.save()

	printf('saving...')
	conf.db.endTransaction()
	printf('creating indexes...')
	CasavaSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName'))
	printf('importation of SNP set %s for species %s done.' %(setName, species))
	
	return True

def _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile) :
	"This function will also create an index on start->chromosomeNumber->setName. Warning : pyGeno positions are 0 based"
	snpData = VCFFile(snpsFile, gziped = True, stream = True)
	dbSNPSNP.dropIndex(('start', 'chromosomeNumber', 'setName'))
	conf.db.beginTransaction()
	pBar = ProgressBar()
	pLabel = ''
	for snpEntry in snpData :
		pBar.update(label = 'Chr %s, %s...' %  (snpEntry['#CHROM'], snpEntry['ID']))
		
		snp = dbSNPSNP()
		for f in snp.getFields() :
			try :
				setattr(snp, f, snpEntry[f])
			except KeyError :
				pass
		snp.chromosomeNumber = snpEntry['#CHROM']
		snp.species = species
		snp.setName = setName
		snp.start = snpEntry['POS']-1
		snp.alt = snpEntry['ALT']
		snp.ref = snpEntry['REF']
		snp.end = snp.start+len(snp.alt)
		snp.save()
	
	pBar.close()
	
	snpMaster = SNPMaster()
	snpMaster.set(setName = setName, SNPType = 'dbSNPSNP', species = species)
	snpMaster.save()
	
	printf('saving...')
	conf.db.endTransaction()
	printf('creating indexes...')
	dbSNPSNP.ensureGlobalIndex(('start', 'chromosomeNumber', 'setName'))
	printf('importation of SNP set %s for species %s done.' %(setName, species))

	return True
	
def _importSNPs_TopHatSNP(setName, species, genomeSource, snpsFile) :
	raise FutureWarning('Not implemented yet')


================================================
FILE: pyGeno/importation/__init__.py
================================================
__all__ = ['Genomes', 'SNPs']


================================================
FILE: pyGeno/pyGenoObjectBases.py
================================================
import time, types, string
import configuration as conf
from rabaDB.rabaSetup import *
from rabaDB.Raba import *
from rabaDB.filters import RabaQuery

def nosave() :
	raise ValueError('You can only save object that are linked to reference genomes')

class pyGenoRabaObject(Raba) :
	"""pyGeno uses rabaDB to persistenly store data. Most persistent 
	objects have classes that inherit from this one (Genome_Raba, 
	Chromosome_Raba, Gene_Raba, Protein_Raba, Exon_Raba). Theses classes 
	are not mean to be accessed directly. Users manipulate wrappers 
	such as : Genome, Chromosome etc... pyGenoRabaObject extends 
	the Raba class by adding a function _curate that is called just 
	before saving. This class is to be considered abstract, and is not 
	meant to be instanciated"""

	_raba_namespace = conf.pyGeno_RABA_NAMESPACE
	_raba_abstract = True # not saved in db by default

	def __init__(self) :
		if self is pyGenoRabaObject :
			raise TypeError("This class is abstract")
	
	def _curate(self) :
		"Last operations performed before saving, must be implemented in child"
		raise TypeError("This method is abstract and should be implemented in child")

	def save(self) :
		"""Calls _curate() before performing a normal rabaDB lazy save 
		(saving only occurs if the object has been modified)"""
		
		if self.mutated() :
			self._curate()
		Raba.save(self)

class pyGenoRabaObjectWrapper_metaclass(type) :
	"""This metaclass keeps track of the relationship between wrapped 
	classes and wrappers """
	_wrappers = {}

	def __new__(cls, name, bases, dct) :
		clsObj = type.__new__(cls, name, bases, dct)
		cls._wrappers[dct['_wrapped_class']] = clsObj
		return clsObj

class RLWrapper(object) :
	"""A wrapper for rabalists that replaces raba objects by pyGeno Object"""
	def __init__(self, rabaObj, listObjectType, rl) :
		self.rabaObj = rabaObj
		self.rl = rl
		self.listObjectType = listObjectType

	def __getitem__(self, i) :
		return self.listObjectType(wrapped_object_and_bag = (self.rl[i], self.rabaObj.bagKey))
	
	def __getattr__(self, name) :
		rl =  object.__getattribute__(self, 'rl')
		return getattr(rl, name)

class pyGenoRabaObjectWrapper(object) :
	"""All the wrapper classes such as Genome and Chromosome inherit 
	from this class. It has most that make pyGeno useful, such as 
	get(), count(), ensureIndex(). This class is to be considered 
	abstract, and is not meant to be instanciated"""
	__metaclass__ = pyGenoRabaObjectWrapper_metaclass

	_wrapped_class = None
	_bags = {}

	def __init__(self, wrapped_object_and_bag = (), *args, **kwargs) :
		if self is pyGenoRabaObjectWrapper :
			raise TypeError("This class is abstract")
		if wrapped_object_and_bag != () :
			assert wrapped_object_and_bag[0]._rabaClass is self._wrapped_class
			self.wrapped_object = wrapped_object_and_bag[0]
			self.bagKey = wrapped_object_and_bag[1]
			pyGenoRabaObjectWrapper._bags[self.bagKey][self._getObjBagKey(self.wrapped_object)] = self
		else :
			self.wrapped_object = self._wrapped_class(*args, **kwargs)
			self.bagKey = time.time()
			pyGenoRabaObjectWrapper._bags[self.bagKey] = {}
			pyGenoRabaObjectWrapper._bags[self.bagKey][self._getObjBagKey(self.wrapped_object)] = self

		self._load_sequencesTriggers = set()
		self.loadSequences = True
		self.loadData = True
		self.loadBinarySequences = True

	def _getObjBagKey(self, obj) :
		"""pyGeno objects are kept in bags to ensure that reference 
		objects are loaded only once. This function returns the bag key 
		of the current object"""
		return (obj._rabaClass.__name__, obj.raba_id)

	def _makeLoadQuery(self, objectType, *args, **coolArgs) :
		# conf.db.enableDebug(True)
		f = RabaQuery(objectType._wrapped_class, namespace = self._wrapped_class._raba_namespace)
		coolArgs[self._wrapped_class.__name__[:-5]] = self.wrapped_object #[:-5] removes _Raba from class name

		if len(args) > 0 and type(args[0]) is types.ListType :
			for a in args[0] :
				if type(a) is types.DictType :
					f.addFilter(**a)
		else :
			f.addFilter(*args, **coolArgs)

		return f

	def count(self, objectType, *args, **coolArgs) :
		"""Returns the number of elements satisfying the query"""
		return self._makeLoadQuery(objectType, *args, **coolArgs).count()

	def get(self, objectType, *args, **coolArgs) :
		"""Raba Magic inside. This is th function that you use for 
		querying pyGeno's DB.
		
		Usage examples:
		
			* myGenome.get("Gene", name = 'TPST2')
		
			* myGene.get(Protein, id = 'ENSID...')
		
			* myGenome.get(Transcript, {'start >' : x, 'end <' : y})"""
		
		ret = []
		for e in self._makeLoadQuery(objectType, *args, **coolArgs).iterRun() :
			if issubclass(objectType, pyGenoRabaObjectWrapper) :
				ret.append(objectType(wrapped_object_and_bag = (e, self.bagKey)))
			else :
				ret.append(e)

		return ret

	def iterGet(self, objectType, *args, **coolArgs) :
		"""Same as get. But retuns the elements one by one, much more efficient for large outputs"""

		for e in self._makeLoadQuery(objectType, *args, **coolArgs).iterRun() :
			if issubclass(objectType, pyGenoRabaObjectWrapper) :
				yield objectType(wrapped_object_and_bag = (e, self.bagKey))
			else :
				yield e

	#~ def ensureIndex(self, fields) :
		#~ """
		#~ Warning: May not work on some systems, see ensureGlobalIndex
		#~ 
		#~ Creates a partial index on self (if it does not exist). 
		#~ Ex: myTranscript.ensureIndex('name')"""
		#~ 
		#~ where, whereValues = '%s=?' %(self._wrapped_class.__name__[:-5]), self.wrapped_object
		#~ self._wrapped_class.ensureIndex(fields, where, (whereValues,))

	#~ def dropIndex(self, fields) :
		#~ """Warning: May not work on some systems, see dropGlobalIndex
		#~ 
		#~ Drops a partial index on self. Ex: myTranscript.dropIndex('name')"""

		#~ where, whereValues = '%s=?' %(self._wrapped_class.__name__[:-5]), self.wrapped_object
		#~ self._wrapped_class.dropIndex(fields, where, (whereValues,))
	
	def __getattr__(self, name) :
		"""If a wrapper does not have a specific field, pyGeno will 
		look for it in the wrapped_object"""
		# print "pyGenoObjectBases __getattr__ : " + name + " from " + str(type(self))

		
		if name == 'save' or name == 'delete' :
			raise AttributeError("You can't delete or save an object from wrapper, try .wrapped_object.delete()/save()")
		
		if name in self._load_sequencesTriggers and self.loadSequences :
			self.loadSequences = False
			self._load_sequences()
			return getattr(self, name)

		if name in self._load_sequencesTriggers and self.loadData :
			self.loadData = False
			self._load_data()
			return getattr(self, name)
			
		if name[:4] == 'bin_' and self.loadBinarySequences :
			self.updateBinarySequence = False
			self._load_bin_sequence()
			return getattr(self, name)
		
		attr = getattr(self.wrapped_object, name)
		if isRabaObject(attr) :
			attrKey = self._getObjBagKey(attr)
			if attrKey in pyGenoRabaObjectWrapper._bags[self.bagKey] :
				retObj = pyGenoRabaObjectWrapper._bags[self.bagKey][attrKey]
			else :
				wCls = pyGenoRabaObjectWrapper_metaclass._wrappers[attr._rabaClass]
				retObj = wCls(wrapped_object_and_bag = (attr, self.bagKey))
			return retObj
		return attr

	@classmethod
	def getIndexes(cls) :
		"""Returns a list of indexes attached to the object's class. Ex 
		Transcript.getIndexes()"""
		return cls._wrapped_class.getIndexes()

	@classmethod
	def flushIndexes(cls) :
		"""Drops all the indexes attached to the object's class. Ex 
		Transcript.flushIndexes()"""
		return cls._wrapped_class.flushIndexes()
	
	@classmethod
	def help(cls) :
		"""Returns a list of available field for queries. Ex 
		Transcript.help()"""
		return cls._wrapped_class.help().replace("_Raba", "")

	@classmethod
	def ensureGlobalIndex(cls, fields) :
		"""Add a GLOBAL index to the db to speedup lookouts. Fields can be a 
		list of fields for Multi-Column Indices or simply the name of a 
		single field. A global index is an index on the entire type.
		A global index on 'Transcript' on field 'name', will index the names for all the transcripts in the database"""
		cls._wrapped_class.ensureIndex(fields)

	@classmethod
	def dropGlobalIndex(cls, fields) :
		"""Drops an index, the opposite of ensureGlobalIndex()"""
		cls._wrapped_class.dropIndex(fields)

	def getSequencesData(self) :
		"""This lazy abstract function is only called if the object 
		sequences need to be loaded"""
		raise NotImplementedError("This fct loads non binary sequences and should be implemented in child if needed")

	def _load_sequences(self) :
		"""This lazy abstract function is only called if the object 
		sequences need to be loaded"""
		self._load_data()

	def _load_data(self) :
		"""This lazy abstract function is only called if the object 
		sequences need to be loaded"""
		raise NotImplementedError("This fct loads non binary sequences and should be implemented in child if needed")
	
	def _load_bin_sequence(self) :
		"""Same as _load_sequences(), but loads binary sequences"""
		raise NotImplementedError("This fct loads binary sequences and should be implemented in child if needed")


================================================
FILE: pyGeno/tests/__init__.py
================================================


================================================
FILE: pyGeno/tests/test_csv.py
================================================
import unittest
from pyGeno.tools.parsers.CSVTools import *

class CSVTests(unittest.TestCase):
		
	def setUp(self):
		pass

	def tearDown(self):
		pass

	def test_createParse(self) :
		testVals = ["test", "test2"]
		c = CSVFile(legend = ["col1", "col2"], separator = "\t")
		l = c.newLine()
		l["col1"] = testVals[0]
		l = c.newLine()
		l["col1"] = testVals[1]
		c.save("test.csv")
		# print "----", l
		c2 = CSVFile()
		c2.parse("test.csv", separator = "\t")
		i = 0
		for l in c2 :
			self.assertEqual(l["col1"], testVals[i])
			i += 1

def runTests() :
	unittest.main()

if __name__ == "__main__" :
	runTests()


================================================
FILE: pyGeno/tests/test_genome.py
================================================
import unittest
from pyGeno.Genome import *

import pyGeno.bootstrap as B
from pyGeno.importation.Genomes import *
from pyGeno.importation.SNPs import *

class pyGenoSNPTests(unittest.TestCase):

	def setUp(self):
		# try :
		# 	B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
		# except KeyError :
		# 	deleteGenome("human", "GRCh37.75_Y-Only")
		# 	B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
		# 	print "--> Seems to already exist in db"
     
		# try :
		# 	B.importSNPs("Human_agnostic.dummySRY.tar.gz")
		# except KeyError :
		# 	deleteSNPs("dummySRY_AGN")
		# 	B.importSNPs("Human_agnostic.dummySRY.tar.gz")
		# 	print "--> Seems to already exist in db"
		
		# try :
		# 	B.importSNPs("Human_agnostic.dummySRY_indels")
		# except KeyError :
		# 	deleteSNPs("dummySRY_AGN_indels")
		# 	B.importSNPs("Human_agnostic.dummySRY_indels")
		# 	print "--> Seems to already exist in db"
		self.ref = Genome(name = 'GRCh37.75_Y-Only')

	def tearDown(self):
		pass

	# @unittest.skip("skipping")
	def test_vanilla(self) :
		dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN')
		persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
		refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]

		self.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20])
		self.assertEqual('HTGCAATCATATGCTTCTGC', persProt.transcript.cDNA[:20])

	# @unittest.skip("skipping")
	def test_noModif(self) :
		from pyGeno.SNPFiltering import SNPFilter

		class MyFilter(SNPFilter) :
			def __init__(self) :
				SNPFilter.__init__(self)

			def filter(self, chromosome, dummySRY_AGN) :
				return None

		dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())
		persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
		refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]

		self.assertEqual(persProt.transcript.cDNA[:20], refProt.transcript.cDNA[:20])

	# @unittest.skip("skipping")
	def test_insert(self) :
		from pyGeno.SNPFiltering import SNPFilter

		class MyFilter(SNPFilter) :
			def __init__(self) :
				SNPFilter.__init__(self)

			def filter(self, chromosome, dummySRY_AGN) :
				from pyGeno.SNPFiltering import  SequenceInsert
		
				refAllele = chromosome.refSequence[dummySRY_AGN.start]
				return SequenceInsert('XXX')

		dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())
		persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
		refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]
		self.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20])
		self.assertEqual('XXXATGCAATCATATGCTTC', persProt.transcript.cDNA[:20])

	# @unittest.skip("skipping")
	def test_SNP(self) :
		from pyGeno.SNPFiltering import SNPFilter

		class MyFilter(SNPFilter) :
			def __init__(self) :
				SNPFilter.__init__(self)

			def filter(self, chromosome, dummySRY_AGN) :
				from pyGeno.SNPFiltering import SequenceSNP
	
				return SequenceSNP(dummySRY_AGN.alt)

		dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())
		persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]

		refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]
		self.assertEqual('M', refProt.sequence[0])
		self.assertEqual('L', persProt.sequence[0])

	# @unittest.skip("skipping")
	def test_deletion(self) :
		from pyGeno.SNPFiltering import SNPFilter

		class MyFilter(SNPFilter) :
			def __init__(self) :
				SNPFilter.__init__(self)

			def filter(self, chromosome, dummySRY_AGN) :
				from pyGeno.SNPFiltering import SequenceDel
		
				refAllele = chromosome.refSequence[dummySRY_AGN.start]
				return SequenceDel(1)

		dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN', SNPFilter = MyFilter())
		persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
		refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]
		self.assertEqual('ATGCAATCATATGCTTCTGC', refProt.transcript.cDNA[:20])
		self.assertEqual('TGCAATCATATGCTTCTGCT', persProt.transcript.cDNA[:20])

	# @unittest.skip("skipping")
	def test_indels(self) :
		from pyGeno.SNPFiltering import SNPFilter

		class MyFilter(SNPFilter) :
			def __init__(self) :
				SNPFilter.__init__(self)

			def filter(self, chromosome, dummySRY_AGN_indels) :
				from pyGeno.SNPFiltering import  SequenceInsert
				ret = ""
				for s in dummySRY_AGN_indels :
					ret += "X"
				return SequenceInsert(ret)

		dummy = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY_AGN_indels', SNPFilter = MyFilter())
		persProt = dummy.get(Protein, id = 'ENSP00000438917')[0]
		refProt = self.ref.get(Protein, id = 'ENSP00000438917')[0]
		self.assertEqual('XXXATGCAATCATATGCTTC', persProt.transcript.cDNA[:20])

	# @unittest.skip("skipping")
	def test_bags(self) :
		dummy = Genome(name = 'GRCh37.75_Y-Only')
		self.assertEqual(dummy.wrapped_object, self.ref.wrapped_object)

	# @unittest.skip("skipping")
	def test_prot_find(self) :
		prot = self.ref.get(Protein, id = 'ENSP00000438917')[0]
		needle = prot.sequence[:10]
		self.assertEqual(0, prot.find(needle))
		needle = prot.sequence[-10:]
		self.assertEqual(len(prot)-10, prot.find(needle))

	# @unittest.skip("skipping")
	def test_trans_find(self) :
		trans = self.ref.get(Transcript, name = "SRY-001")[0]
		self.assertEqual(0, trans.find(trans[:5]))

	# @unittest.skip("remote server down")
	# def test_import_remote_genome(self) :
		# self.assertRaises(KeyError, B.importRemoteGenome, "Human.GRCh37.75_Y-Only.tar.gz")

	# @unittest.skip("remote server down")
	# def test_import_remote_snps(self) :
		# self.assertRaises(KeyError, B.importRemoteSNPs, "Human_agnostic.dummySRY.tar.gz")

def runTests() :
	try :
		B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
	except KeyError :
		deleteGenome("human", "GRCh37.75_Y-Only")
		B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
		print "--> Seems to already exist in db"
 
	try :
		B.importSNPs("Human_agnostic.dummySRY.tar.gz")
	except KeyError :
		deleteSNPs("dummySRY_AGN")
		B.importSNPs("Human_agnostic.dummySRY.tar.gz")
		print "--> Seems to already exist in db"
	
	try :
		B.importSNPs("Human_agnostic.dummySRY_indels")
	except KeyError :
		deleteSNPs("dummySRY_AGN_indels")
		B.importSNPs("Human_agnostic.dummySRY_indels")
		print "--> Seems to already exist in db"
	# import time
	# time.sleep(10)
	unittest.main()

if __name__ == "__main__" :
	runTests()


================================================
FILE: pyGeno/tools/BinarySequence.py
================================================
import array, copy
import UsefulFunctions as uf

class BinarySequence :
	"""A class for representing sequences in a binary format"""

        ALPHABETA_SIZE = 32
        ALPHABETA_KMP = range(ALPHABETA_SIZE)
        
	def __init__(self, sequence, arrayForma, charToBinDict) :
	
		self.forma = arrayForma
		self.charToBin = charToBinDict
		self.sequence = sequence
		
		self.binSequence, self.defaultSequence, self.polymorphisms = self.encode(sequence)
		self.itemsize = self.binSequence.itemsize
		self.typecode = self.binSequence.typecode
		#print 'bin', len(self.sequence), len(self.binSequence)

	def encode(self, sequence):
		"""Returns a tuple (binary reprensentation, default sequence, polymorphisms list)"""
		
		polymorphisms = []
		defaultSequence = '' 
		binSequence = array.array(self.forma.typecode)
		b = 0
		i = 0
		trueI = 0 #not inc in case if poly
		poly = set()
		while i < len(sequence)-1:
			b = b | self.forma[self.charToBin[sequence[i]]]
			if sequence[i+1] == '/' :
				poly.add(sequence[i])
				i += 2
			else :
				binSequence.append(b)
				if len(poly) > 0 :
					poly.add(sequence[i])
					polymorphisms.append((trueI, poly))
					poly = set()
				
				bb = 0
				while b % 2 != 0 :
					b = b/2
					
				defaultSequence += sequence[i]
				b = 0
				i += 1
				trueI += 1
		
		if i < len(sequence) :
			b = b | self.forma[self.charToBin[sequence[i]]]
			binSequence.append(b)
			if len(poly) > 0 :
				if sequence[i] not in poly :
					poly.add(sequence[i])
				polymorphisms.append((trueI, poly))
			defaultSequence += sequence[i]
		
		return (binSequence, defaultSequence, polymorphisms)

	def __testFind(self, arr) :
		if len(arr)  == 0:
			raise TypeError ('binary find, needle is empty')
		if arr.itemsize != self.itemsize :
			raise TypeError ('binary find, both arrays must have same item size, arr: %d, self: %d' % (arr.itemsize, self.itemsize))
	
	def __testBinary(self, arr) :
		if len(arr) != len(self) :
			raise TypeError ('bin operator, both arrays must be of same length')
		if arr.itemsize != self.itemsize :
			raise TypeError ('bin operator, both arrays must have same item size, arr: %d, self: %d' % (arr.itemsize, self.itemsize))
	
	def findPolymorphisms(self, strSeq, strict = False):
		"""
		Compares strSeq with self.sequence.
		If not 'strict', this function ignores the cases of matching heterozygocity (ex: for a given position i, strSeq[i] = A and self.sequence[i] = 'A/G'). If 'strict' it returns all positions where strSeq differs self,sequence
		"""
		arr = self.encode(strSeq)[0]
		res = []
		if not strict :
			for i in range(len(arr)+len(self)) :
				if i >= len(arr) or i > len(self) :
					break
				if arr[i] & self[i] == 0:
					res.append(i)
		else :
			for i in range(len(arr)+len(self)) :
				if i >= len(arr) or i > len(self) :
					break
				if arr[i] != self[i] :
					res.append(i)
		return res
		
	def getPolymorphisms(self):
		"""returns all polymorphsims in the form of a dict pos => alleles"""
		return self.polymorphisms
	
	def getDefaultSequence(self) :
		"""returns a default version of sequence where only the last allele of each polymorphisms is shown"""
		return self.defaultSequence
	
	def __getSequenceVariants(self, x1, polyStart, polyStop, listSequence) :
		"""polyStop, is the polymorphisme at wixh number where the calcul of combinaisons stops"""
		if polyStart < len(self.polymorphisms) and polyStart < polyStop: 
			sequence = copy.copy(listSequence)
			ret = []
			
			pk = self.polymorphisms[polyStart]
			posInSequence = pk[0]-x1
			if posInSequence < len(listSequence) : 
				for allele in pk[1] :
					sequence[posInSequence] = allele
					ret.extend(self.__getSequenceVariants(x1, polyStart+1, polyStop, sequence))
			
			return ret
		else :
			return [''.join(listSequence)]

	def getSequenceVariants(self, x1 = 0, x2 = -1, maxVariantNumber = 128) :
		"""returns the sequences resulting from all combinaisons of all polymorphismes between x1 and x2. The results is a couple (bool, variants of sequence between x1 and x2), bool == true if there's more combinaisons than maxVariantNumber"""
		if x2 == -1 :
			xx2 = len(self.defaultSequence)
		else :
			xx2 = x2
		
		polyStart = None
		nbP = 1
		stopped = False
		j = 0
		for p in self.polymorphisms :
			if p[0] >= xx2 :
				break
			
			if x1 <= p[0] :
				if polyStart == None :
					polyStart = j
				
				nbP *= len(p[1])
				
				if nbP > maxVariantNumber :
					stopped = True
					break
			j += 1
		
		if polyStart == None :
			return (stopped, [self.defaultSequence[x1:xx2]])
		
		return (stopped, self.__getSequenceVariants(x1, polyStart, j, list(self.defaultSequence[x1:xx2])))

	def getNbVariants(self, x1, x2 = -1) :
		"""returns the nb of variants of sequences between x1 and x2"""
		if x2 == -1 :
			xx2 = len(self.defaultSequence)
		else :
			xx2 = x2
		
		nbP = 1
		for p in self.polymorphisms:
			if x1 <= p[0] and p[0] <= xx2 :
				nbP *= len(p[1])
		
		return nbP

	def _dichFind(self, needle, currHaystack, offset, lst = None) :
		"""dichotomic search, if lst is None, will return the first position found. If it's a list, will return a list of all positions in lst. returns -1 or [] if no match found"""
		
		if len(currHaystack) == 1 :
			if (offset <= (len(self) - len(needle))) and (currHaystack[0] & needle[0]) > 0 and (self[offset+len(needle)-1] & needle[-1]) > 0 :
				found = True
				for i in xrange(1, len(needle)-1) :
					if self[offset + i] & needle[i] == 0 :
						found = False
						break
				if found :
					if lst is not None :
						lst.append(offset)
					else :
						return offset
				else :
					if lst is None :
						return -1
		else :
			if (offset <= (len(self) - len(needle))) :
				if lst is not None :
					self._dichFind(needle, currHaystack[:len(currHaystack)/2], offset, lst)
					self._dichFind(needle, currHaystack[len(currHaystack)/2:], offset + len(currHaystack)/2, lst)
				else :
					v1 = self._dichFind(needle, currHaystack[:len(currHaystack)/2], offset, lst)
					if v1 > -1 :
						return v1
					
					return self._dichFind(needle, currHaystack[len(currHaystack)/2:], offset + len(currHaystack)/2, lst)
			return -1

        def _kmp_construct_next(self, pattern):
                """the helper function for KMP-string-searching is to construct the DFA. pattern should be an integer array. return a 2D array representing the DFA for moving the pattern."""
                next = [[0 for state in pattern] for input_token in self.ALPHABETA_KMP]
                next[pattern[0]][0] = 1
                restart_state = 0
                for state in range(1, len(pattern)):
                        for input_token in self.ALPHABETA_KMP:
                                next[input_token][state] = next[input_token][restart_state]
                        next[pattern[state]][state] = state + 1
                        restart_state = next[pattern[state]][restart_state]
                return next

        def _kmp_search_first(self, pInput_sequence, pPattern):
                """use KMP algorithm to search the first occurrence in the input_sequence of the pattern. both arguments are integer arrays. return the position of the occurence if found; otherwise, -1."""
                input_sequence, pattern = pInput_sequence, [len(bin(e)) for e in pPattern]
                n, m = len(input_sequence), len(pattern)
                d = p = 0
                next = self._kmp_construct_next(pattern)
                while d < n and p < m:
                        p = next[len(bin(input_sequence[d]))][p]
                        d += 1
                if p == m: return d - p
                else: return -1

        def _kmp_search_all(self, pInput_sequence, pPattern):
                """use KMP algorithm to search all occurrence in the input_sequence of the pattern. both arguments are integer arrays. return a list of the positions of the occurences if found; otherwise, []."""
                r = []
                input_sequence, pattern = [len(bin(e)) for e in pInput_sequence], [len(bin(e)) for e in pPattern]
                n, m = len(input_sequence), len(pattern)
                d = p = 0
                next = self._kmp_construct_next(pattern)
                while d < n:
                        p = next[input_sequence[d]][p]
                        d += 1
                        if p == m:
                                r.append(d - m)
                                p = 0
                return r

        def _kmp_find(self, needle, haystack, lst = None):
		"""find with KMP-searching. needle is an integer array, reprensenting a pattern. haystack is an integer array, reprensenting the input sequence. if lst is None, return the first position found or -1 if no match found. If it's a list, will return a list of all positions in lst. returns -1 or [] if no match found."""
                if lst != None: return self._kmp_search_all(haystack, needle)
                else: return self._kmp_search_first(haystack, needle)
		
	def findByBiSearch(self, strSeq) :
		"""returns the first occurence of strSeq in self. Takes polymorphisms into account"""
		arr = self.encode(strSeq)
		return self._dichFind(arr[0], self, 0, lst = None)

	def findAllByBiSearch(self, strSeq) :
		"""Same as find but returns a list of all occurences"""
		arr = self.encode(strSeq)
		lst = []
		self._dichFind(arr[0], self, 0, lst)
		return lst

	def find(self, strSeq) :
		"""returns the first occurence of strSeq in self. Takes polymorphisms into account"""
		arr = self.encode(strSeq)
                return self._kmp_find(arr[0], self)

	def findAll(self, strSeq) :
		"""Same as find but returns a list of all occurences"""
		arr = self.encode(strSeq)
		lst = []
                lst = self._kmp_find(arr[0], self, lst)
		return lst
		
	def __and__(self, arr) :
		self.__testBinary(arr)
		
		ret = BinarySequence(self.typecode, self.forma, self.charToBin)
		for i in range(len(arr)) :
			ret.append(self[i] & arr[i])
		
		return ret
	
	def __xor__(self, arr) :
		self.__testBinary(arr)
		
		ret = BinarySequence(self.typecode, self.forma, self.charToBin)
		for i in range(len(arr)) :
			ret.append(self[i] ^ arr[i])
		
		return ret

	def __eq__(self, seq) :
		self.__testBinary(seq)
		
		if len(seq) != len(self) :
			return False

		return all( self[i] == seq[i] for i in range(len(self)) )

		
	def append(self, arr) :
		self.binSequence.append(arr)

	def extend(self, arr) :
		self.binSequence.extend(arr)

	def decode(self, binSequence):
		"""decodes a binary sequence to return a string"""
		try:
			binSeq = iter(binSequence[0])
		except TypeError, te:
			binSeq = binSequence
    
		ret = ''
		for b in binSeq :
			ch = ''
			for c in self.charToBin :
				if b & self.forma[self.charToBin[c]] > 0 :
					ch += c +'/'
			if ch == '' :
				raise KeyError('Key %d unkowom, bad format' % b)
			
			ret += ch[:-1]
		return ret
		
	def getChar(self, i):
		return self.decode([self.binSequence[i]])
		
	def __len__(self):
		return len(self.binSequence)

	def __getitem__(self,i):
		return self.binSequence[i]
	
	def __setitem__(self, i, v):
		self.binSequence[i] = v

class AABinarySequence(BinarySequence) :
	"""A binary sequence of amino acids"""
	
	def __init__(self, sequence):
		f = array.array('I', [1L, 2L, 4L, 8L, 16L, 32L, 64L, 128L, 256L, 512L, 1024L, 2048L, 4096L, 8192L, 16384L, 32768L, 65536L, 131072L, 262144L, 524288L, 1048576L, 2097152L])
		c = {'A': 17, 'C': 14, 'E': 19, 'D': 15, 'G': 13, 'F': 16, 'I': 3, 'H': 9, 'K': 8, '*': 1, 'M': 20, 'L': 0, 'N': 4, 'Q': 11, 'P': 6, 'S': 7, 'R': 5, 'T': 2, 'W': 10, 'V': 18, 'Y': 12, 'U': 21}
		BinarySequence.__init__(self, sequence, f, c)
	
class NucBinarySequence(BinarySequence) :
	"""A binary sequence of nucleotides"""
	
	def __init__(self, sequence):
		f = array.array('B', [1, 2, 4, 8])
		c = {'A': 0, 'T': 1, 'C': 2, 'G': 3}
		ce = {
			'R' : 'A/G', 'Y' : 'C/T', 'M': 'A/C',
			'K' : 'T/G', 'W' : 'A/T', 'S' : 'C/G', 'B': 'C/G/T',
			'D' : 'A/G/T', 'H' : 'A/C/T', 'V' : 'A/C/G', 'N': 'A/C/G/T'
			}
		lstSeq = list(sequence)
		for i in xrange(len(lstSeq)) :
			if lstSeq[i] in ce :
				lstSeq[i] = ce[lstSeq[i]]
		lstSeq = ''.join(lstSeq)
		BinarySequence.__init__(self, lstSeq, f, c)

if __name__=="__main__":
	
	def test0() :
		#seq = 'ACED/E/GFIHK/MLMQPS/RTWVY'
		seq = 'ACED/E/GFIHK/MLMQPS/RTWVY/A/R'
		bSeq = AABinarySequence(seq)
		start = 0
		stop = 4
		rB = bSeq.getSequenceVariants_bck(start, stop)
		r = bSeq.getSequenceVariants(start, stop)
		
		#print start, stop, 'nb_comb_r', len(r[1]), set(rB[1])==set(r[1]) 
		print start, stop#, 'nb_comb_r', len(r[1]), set(rB[1])==set(r[1]) 
		
		#if set(rB[1])!=set(r[1]) :
		print '-AV-'
		print start, stop, 'nb_comb_r', len(rB[1])
		print '\n'.join(rB[1])
		print '=AP========'
		print start, stop, 'nb_comb_r', len(r[1]) 
		print '\n'.join(r[1])
	
	def testVariants() :
		seq = 'ATGAGTTTGCCGCGCN'
		bSeq = NucBinarySequence(seq)
		print bSeq.getSequenceVariants() 

	testVariants()

        from random import randint
        alphabeta = ['A', 'C', 'G', 'T']
        seq = ''
        for _ in range(8192):
                seq += alphabeta[randint(0, 3)]
        seq += 'ATGAGTTTGCCGCGCN'
        bSeq = NucBinarySequence(seq)

        ROUND = 512
        PATTERN = 'GCGC'

        def testFind():
                for i in range(ROUND):
                        bSeq.find(PATTERN)

        def testFindByBiSearch():
                for i in range(ROUND):
                        bSeq.findByBiSearch(PATTERN)
                        
        def testFindAll():
                for i in range(ROUND):
                        bSeq.findAll(PATTERN)

        def testFindAllByBiSearch():
                for i in range(ROUND):
                        bSeq.findAllByBiSearch(PATTERN)

        import cProfile
        print('find:\n')
	cProfile.run('testFind()')
        print('findAll:\n')
        cProfile.run('testFindAll()')
        print('findByBiSearch:\n')
        cProfile.run('testFindByBiSearch()')
        print('findAllByBiSearch:\n')
        cProfile.run('testFindAllByBiSearch()')


================================================
FILE: pyGeno/tools/ProgressBar.py
================================================
import sys, time, cPickle

class ProgressBar :
	"""A very simple unthreaded progress bar. This progress bar also logs stats in .logs.
	Usage example::

		p = ProgressBar(nbEpochs = -1)
			for i in range(200000) :
				p.update(label = 'value of i %d' % i)
		p.close()
	
	If you don't know the maximum number of epochs you can enter nbEpochs < 1
	"""
	
	def __init__(self, nbEpochs = -1, width = 25, label = "progress", minRefeshTime = 1) :
		self.width = width
		self.currEpoch = 0
		self.nbEpochs = float(nbEpochs)
		self.bar = ''

		self.label = label
		self.wheel = ["-", "\\", "|", "/"]
		self.startTime = time.time()
		self.lastPrintTime = -1
		self.minRefeshTime = minRefeshTime
		
		self.runtime = -1
		self.runtime_hr = -1
		self.avg = -1
		self.remtime = -1
		self.remtime_hr = -1
		self.currTime = time.time()
		self.lastEpochDuration = -1 
		
		self.bars = []
		self.miniSnake = '~-~-~-?:>' 
		self.logs = {'epochDuration' : [], 'avg' : [], 'runtime' : [], 'remtime' : []}
		
	def formatTime(self, val) :
		if val < 60 :
			return '%.3fsc' % val
		elif val < 3600 :
			return '%.3fmin' % (val/60)
		else :
			return '%dh %dmin' % (int(val)/3600, int(val/60)%60)

	def _update(self) :
		tim = time.time()
		if self.nbEpochs > 1 :
			if self.currTime > 0 :
				self.lastEpochDuration = tim - self.currTime
			self.currTime = tim
			self.runtime = tim - self.startTime
			self.runtime_hr = self.formatTime(self.runtime)
			self.avg = self.runtime/self.currEpoch
			
			self.remtime = self.avg * (self.nbEpochs-self.currEpoch)
			self.remtime_hr = self.formatTime(self.remtime)
	
	def log(self) :
		"""logs stats about the progression, without printing anything on screen"""
		
		self.logs['epochDuration'].append(self.lastEpochDuration)
		self.logs['avg'].append(self.avg)
		self.logs['runtime'].append(self.runtime)
		self.logs['remtime'].append(self.remtime)
	
	def saveLogs(self, filename) :
		"""dumps logs into a nice pickle"""
		f = open(filename, 'wb')
		cPickle.dump(self.logs, f)
		f.close()

	def update(self, label = '', forceRefresh = False, log = False) :
		"""the function to be called at each iteration. Setting log = True is the same as calling log() just after update()"""
		self.currEpoch += 1
		tim = time.time()
		if (tim - self.lastPrintTime > self.minRefeshTime) or forceRefresh :
			self._update()
			
			wheelState = self.wheel[self.currEpoch%len(self.wheel)]
			
			if label == '' :
				slabel = self.label
			else :
				slabel = label
			
			if self.nbEpochs > 1 :
				ratio = self.currEpoch/self.nbEpochs
				snakeLen = int(self.width*ratio)
				voidLen = int(self.width - (self.width*ratio))

				if snakeLen + voidLen < self.width :
					snakeLen = self.width - voidLen
				
				self.bar = "%s %s[%s:>%s] %.2f%% (%d/%d) runtime: %s, remaining: %s, avg: %s" %(wheelState, slabel, "~-" * snakeLen, "  " * voidLen, ratio*100, self.currEpoch, self.nbEpochs, self.runtime_hr, self.remtime_hr, self.formatTime(self.avg))
				if self.currEpoch == self.nbEpochs :
					self.close()
				
			else :
				w = self.width - len(self.miniSnake)
				v = self.currEpoch%(w+1)
				snake = "%s%s%s" %("  " * (v), self.miniSnake, "  " * (w-v))
				self.bar = "%s %s[%s] %s%% (%d/%s) runtime: %s, remaining: %s, avg: %s" %(wheelState, slabel, snake, '?', self.currEpoch, '?', self.runtime_hr, '?', self.formatTime(self.avg))
			
			sys.stdout.write("\b" * (len(self.bar)+1))
			sys.stdout.write(" " * (len(self.bar)+1))
			sys.stdout.write("\b" * (len(self.bar)+1))
			sys.stdout.write(self.bar)
			sys.stdout.flush()
			self.lastPrintTime = time.time()
			
			if log :
				self.log()

	def close(self) :
		"""Closes the bar so your next print will be on another line"""
		self.update(forceRefresh = True)
		print '\n'
		
if __name__ == "__main__" :
	p = ProgressBar(nbEpochs = 100000000000)
	for i in xrange(100000000000) :
		p.update()
		#time.sleep(3)
	p.close()
	

================================================
FILE: pyGeno/tools/SecureMmap.py
================================================
import mmap

class SecureMmap:
	"""In a normal mmap, modifying the string modifies the file. This is a mmap with write protection"""
	
	def __init__(self, filename, enableWrite = False) :
		
		self.enableWrite = enableWrite
		self.filename = filename
		self.name = filename
		
		f = open(filename, 'r+b')
		self.data = mmap.mmap(f.fileno(), 0)
	
	def forceSet(self, x1, v) :
		"""Forces modification even if the mmap is write protected"""
		self.data[x1] = v
	
	def __getitem__(self, i):
		return self.data[i]
	
	def __setitem__(self, i, v) :
		if self.enableWrite :
			raise IOError("Secure mmap is write protected")
		else :
			self.data[i] = v

	def __str__(self) :
		return "secure mmap, file: %s, writing enabled : %s" % (self.filename, str(self.enableWrite))

	def __len__(self) :
		return len(self.data)


================================================
FILE: pyGeno/tools/SegmentTree.py
================================================
import random, copy, types

def aux_insertTree(childTree, parentTree):
	"""This a private (You shouldn't have to call it) recursive function that inserts a child tree into a parent tree."""
	if childTree.x1 != None and childTree.x2 != None :
		parentTree.insert(childTree.x1, childTree.x2, childTree.name, childTree.referedObject)

	for c in childTree.children:
		aux_insertTree(c, parentTree)
		
def aux_moveTree(offset, tree):
	"""This a private recursive (You shouldn't have to call it) function that translates tree(and it's children) to a given x1"""
	if tree.x1 != None and tree.x2 != None :
		tree.x1, tree.x2 = tree.x1+offset, tree.x2+offset
		
	for c in tree.children:
		aux_moveTree(offset, c)
		
class SegmentTree :
	""" Optimised genome annotations.
	A segment tree is an arborescence of segments. First position is inclusive, second exlusive, respectively refered to as x1 and x2.
	A segment tree has the following properties :
	
	* The root has no x1 or x2 (both set to None).
	
	* Segment are arrangend in an ascending order
	
	* For two segment S1 and S2 : [S2.x1, S2.x2[ C [S1.x1, S1.x2[ <=> S2 is a child of S1
	
	Here's an example of a tree :
	
	* Root : 0-15
	
	* ---->Segment : 0-12
	
	* ------->Segment : 1-6
	
	* ---------->Segment : 2-3
	
	* ---------->Segment : 4-5
	
	* ------->Segment : 7-8
	
	* ------->Segment : 9-10
	
	* ---->Segment : 11-14
	
	* ------->Segment : 12-14
	
	* ---->Segment : 13-15
	
	Each segment can have a 'name' and a 'referedObject'. ReferedObject are objects are stored within the graph for future usage.
	These objects are always stored in lists. If referedObject is already a list it will be stored as is.
	"""
	
	def __init__(self, x1 = None, x2 = None, name = '', referedObject = [], father = None, level = 0) :
		if x1 > x2 :
			self.x1, self.x2 = x2, x1
		else :
			self.x1, self.x2 = x1, x2
		
		self.father = father
		self.level = level
		self.id = random.randint(0, 10**8)
		self.name = name
		
		self.children = []
		self.referedObject = referedObject

	def __addChild(self, segmentTree, index = -1) :
		segmentTree.level = self.level + 1
		if index < 0 :
			self.children.append(segmentTree)
		else :
			self.children.insert(index, segmentTree)
	
	def insert(self, x1, x2, name = '', referedObject = []) :
		"""Insert the segment in it's right place and returns it. 
		If there's already a segment S as S.x1 == x1 and S.x2 == x2. S.name will be changed to 'S.name U name' and the
		referedObject will be appended to the already existing list"""
		
		if x1 > x2 :
			xx1, xx2 = x2, x1
		else :
			xx1, xx2 = x1, x2

		rt = None
		insertId = None
		childrenToRemove = []
		for i in range(len(self.children)) :
			if self.children[i].x1 == xx1 and xx2 == self.children[i].x2 :
				self.children[i].name = self.children[i].name + ' U ' + name
				self.children[i].referedObject.append(referedObject)
				return self.children[i]
			
			if self.children[i].x1 <= xx1 and xx2 <= self.children[i].x2 :
				return self.children[i].insert(x1, x2, name, referedObject)
			
			elif xx1 <= self.children[i].x1 and self.children[i].x2 <= xx2 :
				if rt == None :
					if type(referedObject) is types.ListType :
						rt = SegmentTree(xx1, xx2, name, referedObject, self, self.level+1)
					else :
						rt = SegmentTree(xx1, xx2, name, [referedObject], self, self.level+1)
					
					insertId = i
					
				rt.__addChild(self.children[i])
				self.children[i].father = rt
				childrenToRemove.append(self.children[i])
		
			elif xx1 <= self.children[i].x1 and xx2 <= self.children[i].x2 :
				insertId = i
				break
				
		if rt != None :
			self.__addChild(rt, insertId)
			for c in childrenToRemove :
				self.children.remove(c)
		else :
			if type(referedObject) is types.ListType :
				rt = SegmentTree(xx1, xx2, name, referedObject, self, self.level+1)
			else :
				rt = SegmentTree(xx1, xx2, name, [referedObject], self, self.level+1)
			
			if insertId != None :
				self.__addChild(rt, insertId)
			else :
				self.__addChild(rt)
		
		return rt
	
	def insertTree(self, childTree):
		"""inserts childTree in the right position (regions will be rearanged to fit the organisation of self)"""
		aux_insertTree(childTree, self)
		
	#~ def included_todo(self, x1, x2=None) :
		#~ "Returns all the segments where [x1, x2] is included"""
		#~ pass
		
	def intersect(self, x1, x2 = None) :
		"""Returns a list of all segments intersected by [x1, x2]"""
		
		def condition(x1, x2, tree) :
			#print self.id, tree.x1, tree.x2, x1, x2
			if (tree.x1 != None and tree.x2 != None) and (tree.x1 <= x1 and x1 < tree.x2 or tree.x1 <= x2 and x2 < tree.x2) :
				return True
			return False
			
		if x2 == None :
			xx1, xx2 = x1, x1
		elif x1 > x2 :
			xx1, xx2 = x2, x1
		else :
			xx1, xx2 = x1, x2
			
		c1 = self.__dichotomicSearch(xx1)
		c2 = self.__dichotomicSearch(xx2)
		
		if c1 == -1 or c2 == -1 :
			return []
			
		if xx1 < self.children[c1].x1 :
			c1 -= 1
			
		inter = self.__radiateDown(x1, x2, c1, condition)
		if self.children[c1].id == self.children[c2].id :
			inter.extend(self.__radiateUp(x1, x2, c2+1, condition))
		else :
			inter.extend(self.__radiateUp(x1, x2, c2, condition))
		
		ret = []
		for c in inter :
			ret.extend(c.intersect(x1, x2))
		
		inter.extend(ret)
		return inter
		
	def __dichotomicSearch(self, x1) :
		r1 = 0
		r2 = len(self.children)-1
		pos = -1
		while (r1 <= r2) :
			pos = (r1+r2)/2
			val = self.children[pos].x1

			if val == x1 :
				return pos
			elif x1 < val :
				r2 = pos -1
			else :
				r1 = pos +1
		
		return pos
	
	def __radiateDown(self, x1, x2, childId, condition) :
		"Radiates down: walks self.children downward until condition is no longer verifed or there's no childrens left "
		ret = []
		i = childId
		while 0 <= i :
			if condition(x1, x2, self.children[i]) :
				ret.append(self.children[i])
			else :
				break
			i -= 1
		return ret
	
	def __radiateUp(self, x1, x2, childId, condition) :
		"Radiates uo: walks self.children upward until condition is no longer verifed or there's no childrens left "
		ret = []
		i = childId
		while i < len(self.children):
			if condition(x1, x2, self.children[i]) :
				ret.append(self.children[i])
			else :
				break
			i += 1
		return ret
	
	def emptyChildren(self) :
		"""Kills of children"""
		self.children = []
	
	def removeGaps(self) :
		"""Remove all gaps between regions"""
		
		for i in range(1, len(self.children)) :
			if self.children[i].x1 > self.children[i-1].x2:				
				aux_moveTree(self.children[i-1].x2-self.children[i].x1, self.children[i])
		
	def getX1(self) :
		"""Returns the starting position of the tree"""
		if self.x1 != None :
			return self.x1
		return self.children[0].x1

	def getX2(self) :
		"""Returns the ending position of the tree"""
		if self.x2 != None :
			return self.x2
		return self.children[-1].x2
	
	def getIndexedLength(self) :
		"""Returns the total length of indexed regions"""
		if self.x1 != None and self.x2 != None:
			return self.x2 - self.x1
		else :
			if len(self.children) == 0 :
				return 0
			else :
				l = self.children[0].x2 - self.children[0].x1
				for i in range(1, len(self.children)) :
					l += self.children[i].x2 - self.children[i].x1 - max(0, self.children[i-1].x2 - self.children[i].x1)
				return l
	
	def getFirstLevel(self) :
		"""returns a list of couples (x1, x2) of all the first level indexed regions"""
		res = []
		if len(self.children) > 0 :
			for c in self.children:
				res.append((c.x1, c.x2)) 
		else :
			if self.x1 != None :
				res = [(self.x1, self.x2)]
			else :
				res = None
		return res
		
	def flatten(self) :
		"""Flattens the tree. The tree become a tree of depth 1 where overlapping regions have been merged together"""
		if len(self.children) > 1 :
			children = self.children
			self.emptyChildren()
			
			children[0].emptyChildren()
			x1 = children[0].x1
			x2 = children[0].x2
			refObjs = [children[0].referedObject]
			name = children[0].name
			
			for i in range(1, len(children)) :
				children[i].emptyChildren()
				if children[i-1] >= children[i] :
					x2 = children[i].x2
					refObjs.append(children[i].referedObject)
					name += " U " + children[i].name
				else :
					if len(refObjs) == 1 :
						refObjs = refObjs[0]
		
					self.insert(x1, x2, name, refObjs)
					x1 = children[i].x1
					x2 = children[i].x2
					refObjs = [children[i].referedObject]
					name = children[i].name
			
			if len(refObjs) == 1 :
				refObjs = refObjs[0]
		
			self.insert(x1, x2, name, refObjs)

	def move(self, newX1) :
		"""Moves tree to a new starting position, updates x1s of children"""
		if self.x1 != None and self.x2 != None :
			offset = newX1-self.x1
			aux_moveTree(offset, self)
		elif len(self.children) > 0 :
			offset = newX1-self.children[0].x1
			aux_moveTree(offset, self)

	def __str__(self) :
		strRes = self.__str()
		
		offset = ''
		for i in range(self.level+1) :
			offset += '\t'
			
		for c in self.children :
			strRes += '\n'+offset+'-->'+str(c)
		
		return strRes
	
	def __str(self) :
		if self.x1 == None :
			if len(self.children) > 0 :
				return "Root: %d-%d, name: %s, id: %d, obj: %s" %(self.children[0].x1, self.children[-1].x2, self.name, self.id, repr(self.referedObject))
			else :
				return "Root: EMPTY , name: %s, id: %d, obj: %s" %(self.name, self.id, repr(self.referedObject))
		else :
			return "Segment: %d-%d, name: %s, id: %d, father id: %d, obj: %s" %(self.x1, self.x2, self.name, self.id, self.father.id, repr(self.referedObject))
			
	
	def __len__(self) :
		"returns the size of the complete indexed region"
		if self.x1 != None and self.x2 != None :
			return self.x2-self.x1
		else :
			return self.children[-1].x2 - self.children[0].x1
	
	def __repr__(self):
		return 'Segment Tree, id:%s, father id:%s, (x1, x2): (%s, %s)' %(self.id, self.father.id, self.x1, self.x2)


if __name__== "__main__" :
	s = SegmentTree()
	s.insert(5, 10, 'region 1')
	s.insert(8, 12, 'region 2')
	s.insert(5, 8, 'region 3')
	s.insert(34, 40, 'region 4')
	s.insert(35, 38, 'region 5')
	s.insert(36, 37, 'region 6', 'aaa')
	s.insert(36, 37, 'region 6', 'aaa2')
	print "Tree:"
	print s
	print "indexed length", s.getIndexedLength()
	print "removing gaps and adding region 7 : [13-37["
	s.removeGaps()
	#s.insert(13, 37, 'region 7')
	print s
	print "indexed length", s.getIndexedLength()
	#print "intersections"
	#for c in [6, 10, 14, 1000] :
	#	print c, s.intersect(c)
	
	print "Move"
	s.move(0)
	print s
	print "indexed length", s.getIndexedLength()


================================================
FILE: pyGeno/tools/SingletonManager.py
================================================
#This thing is wonderful

objects = {}
def add(obj, objName='') :
	
	if objName == '' :
		key = obj.name
	else :
		key = objName
		
	if key not in objects :
		objects[key] = obj

	return obj

def contains(k) :
	return k in objects
	
def get(objName) :
	try :
		return objects[objName]
	except :
		return None


================================================
FILE: pyGeno/tools/Stats.py
================================================
import numpy as np

def kullback_leibler(p, q) :
	"""Discrete Kullback-Leibler divergence D(P||Q)"""
	p = np.asarray(p, dtype=np.float)
	q = np.asarray(q, dtype=np.float)

	if p.shape != q.shape :
		raise ValueError("p and q must be of the same dimensions")
	
	return np.sum(np.where(p > 0, np.log(p / q) * p, 0))

def squaredError_log10(p, q) :
	p = np.asarray(p, dtype=np.float)
	q = np.asarray(q, dtype=np.float)
	
	if p.shape != q.shape :
		raise ValueError("p and q must be of the same dimensions")
		
	return np.log10(sum((p-q)**2)) - np.log(len(p))
	
def fisherExactTest(table) :
	"""Fisher's exact test
	----------
	table: contengency table
	"""
	raise NotImplementedError


================================================
FILE: pyGeno/tools/UsefulFunctions.py
================================================
import string, os, copy, types

class UnknownNucleotide(Exception) :
	def __init__(self, nuc) :
		self.msg =  'Unknown nucleotides %s' % str(nuc)

	def __str__(self) :
		return self.msg

#This will probably be moved somewhere else in the futur
def saveResults(directoryName, fileName, strResults, log = '', args = ''):

	if not os.path.exists(directoryName):
		os.makedirs(directoryName)

	resPath = "%s/%s"%(directoryName, fileName)
	resFile = open(resPath, 'w')
	print "Saving results :\n\t%s..."%resPath
	resFile.write(strResults)
	resFile.close()

	if log != '' :
		errPath = "%s.err.txt"%(resPath)
		errFile = open(errPath, 'w')

		print "Saving log :\n\t%s..." %errPath
		errFile.write(log)
		errFile.close()

	if args != '' :
		paramPath = "%s.args.txt"%(resPath)
		paramFile = open(paramPath, 'w')

		print "Saving arguments :\n\t%s..." %paramPath
		paramFile.write(args)
		paramFile.close()

	return "%s/"%(directoryName)

nucleotides = ['A', 'T', 'C', 'G']
polymorphicNucleotides = {
			'R' : ['A','G'], 'Y' : ['C','T'], 'M': ['A','C'],
			'K' : ['T','G'], 'W' : ['A','T'], 'S' : ['C','G'], 'B': ['C','G','T'],
			'D' : ['A','G','T'], 'H' : ['A','C','T'], 'V' : ['A','C','G'], 'N': ['A','C','G','T']
			}

#<7iyed>
#from Molecular Systems Biology 8; Article number 572; doi:10.1038/msb.2012.3
codonAffinity = {'CTT': 'low', 'ACC': 'high', 'ACA': 'low', 'ACG': 'high', 'ATC': 'high', 'AAC': 'high', 'ATA': 'low', 'AGG': 'high', 'CCT': 'low', 'ACT': 'low', 'AGC': 'high', 'AAG': 'high', 'AGA': 'low', 'CAT': 'low', 'AAT': 'low', 'ATT': 'low', 'CTG': 'high', 'CTA': 'low', 'CTC': 'high', 'CAC': 'high', 'AAA': 'low', 'CCG': 'high', 'AGT': 'low', 'CCA': 'low', 'CAA': 'low', 'CCC': 'high', 'TAT': 'low', 'GGT': 'low', 'TGT': 'low', 'CGA': 'low', 'CAG': 'high', 'TCT': 'low', 'GAT': 'low', 'CGG': 'high', 'TTT': 'low', 'TGC': 'high', 'GGG': 'high', 'TAG': 'high', 'GGA': 'low', 'TGG': 'high', 'GGC': 'high', 'TAC': 'high', 'TTC': 'high', 'TCG': 'high', 'TTA': 'low', 'TTG': 'high', 'TCC': 'high', 'GAA': 'low', 'TAA': 'high', 'GCA': 'low', 'GTA': 'low', 'GCC': 'high', 'GTC': 'high', 'GCG': 'high', 'GTG': 'high', 'GAG': 'high', 'GTT': 'low', 'GCT': 'low', 'TGA': 'high', 'GAC': 'high', 'CGT': 'low', 'TCA': 'low', 'ATG': 'high', 'CGC': 'high'}

lowAffinityCodons = set(['CTT', 'ACA', 'AAA', 'ATA', 'CCT', 'AGA', 'CAT', 'AAT', 'ATT', 'CTA', 'ACT', 'CAA', 'AGT', 'CCA', 'TAT', 'GGT', 'TGT', 'CGA', 'TCT', 'GAT', 'TTT', 'GGA', 'TTA', 'CGT', 'GAA', 'TCA', 'GCA', 'GTA', 'GTT', 'GCT'])
highAffinityCodons = set(['ACC', 'ATG', 'AAG', 'ACG', 'ATC', 'AAC', 'AGG', 'AGC', 'CTG', 'CTC', 'CAC', 'CCG', 'CAG', 'CCC', 'CGC', 'CGG', 'TGC', 'GGG', 'TAG', 'TGG', 'GGC', 'TAC', 'TTC', 'TCG', 'TTG', 'TCC', 'TAA', 'GCC', 'GTC', 'GCG', 'GTG', 'GAG', 'TGA', 'GAC'])

#</7iyed>

#Empiraclly calculated using genome GRCh37.74 and Ensembl annotations
humanCodonCounts = {'CTT': 588990, 'ACC': 760250, 'ACA': 671093, 'ACG': 248588, 'ATC': 819539, 'AAC': 777291, 'ATA': 326568, 'AGG': 520514, 'CCT': 784233, 'ACT': 581281, 'AGC': 826157, 'AAG': 1373474, 'AGA': 560614, 'CAT': 487348, 'AAT': 745200, 'ATT': 685951, 'CTG': 1579105, 'CTA': 311963, 'CTC': 772503, 'CAC': 618558, 'AAA': 1111269, 'CCG': 285345, 'AGT': 558788, 'CCA': 771391, 'CAA': 572531, 'CCC': 809928, 'TAT': 507376, 'GGT': 459267, 'TGT': 443487, 'CGA': 276584, 'CAG': 1483627, 'TCT': 675336, 'GAT': 982540, 'CGG': 477748, 'TTT': 721642, 'TGC': 495033, 'GGG': 661842, 'TAG': 28685, 'GGA': 731598, 'TGG': 535340, 'GGC': 877641, 'TAC': 588108, 'TTC': 774303, 'TCG': 185384, 'TTA': 348372, 'TTG': 563764, 'TCC': 729893, 'GAA': 1355256, 'TAA': 37503, 'GCA': 718158, 'GTA': 316640, 'GCC': 1120424, 'GTC': 576027, 'GCG': 289438, 'GTG': 1119171, 'GAG': 1685297, 'GTT': 486471, 'GCT': 806491, 'TGA': 82954, 'GAC': 1033108, 'CGT': 200762, 'TCA': 569093, 'ATG': 935789, 'CGC': 404889}

humanCodonCount = 42433513

humanCodonRatios = {'CTT': 0.013880302580651288, 'ACC': 0.017916263496731935, 'ACA': 0.01581516477318293, 'ACG': 0.005858294127097137, 'ATC': 0.019313484603549088, 'AAC': 0.018317856454637634, 'ATA': 0.007695992551924702, 'AGG': 0.012266578070026868, 'CCT': 0.018481453562423644, 'ACT': 0.0136986301369863, 'AGC': 0.01946944623698726, 'AAG': 0.03236767127906662, 'AGA': 0.013211585851965638, 'CAT': 0.011484978865643295, 'AAT': 0.017561591000019253, 'ATT': 0.016165312544356155, 'CTG': 0.0372136287655467, 'CTA': 0.007351807049300867, 'CTC': 0.018205021111497414, 'CAC': 0.014577110313727737, 'AAA': 0.02618847513285077, 'CCG': 0.006724519838835875, 'AGT': 0.01316855382678309, 'CCA': 0.018178815409414725, 'CAA': 0.013492425197037068, 'CCC': 0.01908698909750885, 'TAT': 0.011956964298477951, 'GGT': 0.010823214189218791, 'TGT': 0.010451338308944631, 'CGA': 0.006518055669819277, 'CAG': 0.034963567593378375, 'TCT': 0.015915156494349172, 'GAT': 0.023154811622596506, 'CGG': 0.011258742588670422, 'TTT': 0.017006416602839365, 'TGC': 0.01166608571861585, 'GGG': 0.015597153127529177, 'TAG': 0.0006759987088507143, 'GGA': 0.017241042475083315, 'TGG': 0.012615971720276849, 'GGC': 0.020682732537369696, 'TAC': 0.013859517122704406, 'TTC': 0.01824744041342983, 'TCG': 0.004368811038576985, 'TTA': 0.008209831695999339, 'TTG': 0.013285819630347362, 'TCC': 0.017200861969641778, 'GAA': 0.03193834081095289, 'TAA': 0.0008838061557618385, 'GCA': 0.01692431168732129, 'GTA': 0.007462026535488589, 'GCC': 0.026404224415734798, 'GTC': 0.013574812907901357, 'GCG': 0.006820976618174413, 'GTG': 0.026374695868334068, 'GAG': 0.039716179049328296, 'GTT': 0.011464311239090669, 'GCT': 0.01900599179709679, 'TGA': 0.0019549170958341345, 'GAC': 0.024346511211551115, 'CGT': 0.004731213274752906, 'TCA': 0.013411404330346158, 'ATG': 0.022053064520017467, 'CGC': 0.009541727077840574}

synonymousCodonsFrequencies = {'A': {'GCA': 0.24472833804337418, 'GCC': 0.38180943946027124, 'GCT': 0.2748297757275403, 'GCG': 0.09863244676881429}, 'C': {'TGC': 0.5274613220815753, 'TGT': 0.47253867791842474}, 'E': {'GAG': 0.5542731864894314, 'GAA': 0.44572681351056864}, 'D': {'GAT': 0.48745614313610314, 'GAC': 0.5125438568638969}, 'G': {'GGT': 0.1682082284016543, 'GGG': 0.24240206742876733, 'GGA': 0.26795045906236126, 'GGC': 0.3214392451072171}, 'F': {'TTC': 0.51760124870901, 'TTT': 0.48239875129098997}, 'I': {'ATT': 0.3744155479793762, 'ATC': 0.4473324534485262, 'ATA': 0.17825199857209761}, 'H': {'CAC': 0.5593224017231121, 'CAT': 0.4406775982768879}, 'K': {'AAG': 0.552763002048904, 'AAA': 0.44723699795109595}, '*': {'TAG': 0.19233348084375965, 'TGA': 0.5562081774416328, 'TAA': 0.25145834171460757}, 'M': {'ATG': 1.0}, 'L': {'CTT': 0.14142445416797428, 'CTG': 0.37916443861342136, 'CTA': 0.07490652981477404, 'CTC': 0.18548840407837594, 'TTA': 0.08364882247135866, 'TTG': 0.13536735085409574}, 'N': {'AAT': 0.48946102144446174, 'AAC': 0.5105389785555383}, 'Q': {'CAA': 0.27844698705060605, 'CAG': 0.721553012949394}, 'P': {'CCT': 0.29583684315158226, 'CCG': 0.1076409230535928, 'CCA': 0.2909924451987384, 'CCC': 0.30552978859608654}, 'S': {'TCT': 0.19052256484488883, 'AGC': 0.23307146458142142, 'TCG': 0.05229964811768493, 'AGT': 0.15764260007543762, 'TCC': 0.2059139249534016, 'TCA': 0.1605497974271656}, 'R': {'AGG': 0.21322832103906786, 'CGC': 0.16586259289315397, 'CGG': 0.19570924878057572, 'CGA': 0.11330250857089251, 'AGA': 0.2296552676219967, 'CGT': 0.0822420610943132}, 'T': {'ACC': 0.33621349966301256, 'ACA': 0.2967846446949689, 'ACG': 0.10993573358004469, 'ACT': 0.2570661220619738}, 'W': {'TGG': 1.0}, 'V': {'GTA': 0.12674172810489015, 'GTC': 0.230566755353321, 'GTT': 0.19472010868151218, 'GTG': 0.44797140786027667}, 'Y': {'TAT': 0.46315236005272553, 'TAC': 0.5368476399472745}}


# Translation tables
# Ref: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
translTable = dict()

# Standard Code (NCBI transl_table=1)
translTable['default'] = {
'TTT' : 'F', 'TCT' : 'S', 'TAT' : 'Y', 'TGT' : 'C',
'TTC' : 'F', 'TCC' : 'S', 'TAC' : 'Y', 'TGC' : 'C',
'TTA' : 'L', 'TCA' : 'S', 'TAA' : '*', 'TGA' : '*',
'TTG' : 'L', 'TCG' : 'S', 'TAG' : '*', 'TGG' : 'W',

'CTT' : 'L', 'CTC' : 'L', 'CTA' : 'L', 'CTG' : 'L',
'CCT' : 'P', 'CCC' : 'P', 'CCA' : 'P', 'CCG' : 'P',
'CAT' : 'H', 'CAC' : 'H', 'CAA' : 'Q', 'CAG' : 'Q',
'CGT' : 'R', 'CGC' : 'R', 'CGA' : 'R', 'CGG' : 'R',

'ATT' : 'I', 'ATC' : 'I', 'ATA' : 'I', 'ATG' : 'M',
'ACT' : 'T', 'ACC' : 'T', 'ACA' : 'T', 'ACG' : 'T',
'AAT' : 'N', 'AAC' : 'N', 'AAA' : 'K', 'AAG' : 'K',
'AGT' : 'S', 'AGC' : 'S', 'AGA' : 'R', 'AGG' : 'R',

'GTT' : 'V', 'GTC' : 'V', 'GTA' : 'V', 'GTG' : 'V',
'GCT' : 'A', 'GCC' : 'A', 'GCA' : 'A', 'GCG' : 'A',
'GAT' : 'D', 'GAC' : 'D', 'GAA' : 'E', 'GAG' : 'E',
'GGT' : 'G', 'GGC' : 'G', 'GGA' : 'G', 'GGG' : 'G',

'!GA' : 'U'

}
codonTable = translTable['default']


# The Vertebrate Mitochondrial Code (NCBI transl_table=2)
translTable['mt'] = {
'TTT' : 'F', 'TCT' : 'S', 'TAT' : 'Y', 'TGT' : 'C',
'TTC' : 'F', 'TCC' : 'S', 'TAC' : 'Y', 'TGC' : 'C',
'TTA' : 'L', 'TCA' : 'S', 'TAA' : '*', 'TGA' : 'W',
'TTG' : 'L', 'TCG' : 'S', 'TAG' : '*', 'TGG' : 'W',

'CTT' : 'L', 'CTC' : 'L', 'CTA' : 'L', 'CTG' : 'L',
'CCT' : 'P', 'CCC' : 'P', 'CCA' : 'P', 'CCG' : 'P',
'CAT' : 'H', 'CAC' : 'H', 'CAA' : 'Q', 'CAG' : 'Q',
'CGT' : 'R', 'CGC' : 'R', 'CGA' : 'R', 'CGG' : 'R',

'ATT' : 'I', 'ATC' : 'I', 'ATA' : 'M', 'ATG' : 'M',
'ACT' : 'T', 'ACC' : 'T', 'ACA' : 'T', 'ACG' : 'T',
'AAT' : 'N', 'AAC' : 'N', 'AAA' : 'K', 'AAG' : 'K',
'AGT' : 'S', 'AGC' : 'S', 'AGA' : '*', 'AGG' : '*',

'GTT' : 'V', 'GTC' : 'V', 'GTA' : 'V', 'GTG' : 'V',
'GCT' : 'A', 'GCC' : 'A', 'GCA' : 'A', 'GCG' : 'A',
'GAT' : 'D', 'GAC' : 'D', 'GAA' : 'E', 'GAG' : 'E',
'GGT' : 'G', 'GGC' : 'G', 'GGA' : 'G', 'GGG' : 'G'
}


AATable = {'A': ['GCA', 'GCC', 'GCG', 'GCT'], 'C': ['TGT', 'TGC'], 'E': ['GAG', 'GAA'], 'D': ['GAT', 'GAC'], 'G': ['GGT', 'GGG', 'GGA', 'GGC'], 'F': ['TTT', 'TTC'], 'I': ['ATC', 'ATA', 'ATT'], 'H': ['CAT', 'CAC'], 'K': ['AAG', 'AAA'], '*': ['TAG', 'TGA', 'TAA'], 'M': ['ATG'], 'L': ['CTT', 'CTG', 'CTA', 'CTC', 'TTA', 'TTG'], 'N': ['AAC', 'AAT'], 'Q': ['CAA', 'CAG'], 'P': ['CCT', 'CCG', 'CCA', 'CCC'], 'S': ['AGC', 'AGT', 'TCT', 'TCG', 'TCC', 'TCA'], 'R': ['AGG', 'AGA', 'CGA', 'CGG', 'CGT', 'CGC'], 'T': ['ACA', 'ACG', 'ACT', 'ACC'], 'W': ['TGG'], 'V': ['GTA', 'GTC', 'GTG', 'GTT'], 'Y': ['TAT', 'TAC']}

AAs = ['A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', '*', 'M', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y']

toFloat = lambda x: float(x)
toInt = lambda x: int(x)
floatToStr = lambda x:"%f"%(x)
intToStr = lambda x:"%d"%(x)

splitStr = lambda x: x.split(';')
stripSplitStr = lambda x: x.strip().split(';')


def findAll(haystack, needle) :
	"""returns a list of all occurances of needle in haystack"""
	
	h = haystack
	res = []
	f = haystack.find(needle)
	offset = 0
	while (f >= 0) :
		#print h, needle, f, offset
		res.append(f+offset)
		offset += f+len(needle)
		h = h[f+len(needle):]
		f = h.find(needle)

	return res


def complementTab(seq=[]):
    """returns a list of complementary sequence without inversing it"""
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'R': 'Y', 'Y': 'R', 'M': 'K', 'K': 'M',
                  'W': 'W', 'S': 'S', 'B': 'V', 'D': 'H', 'H': 'D', 'V': 'B', 'N': 'N', 'a': 't',
                  'c': 'g', 'g': 'c', 't': 'a', 'r': 'y', 'y': 'r', 'm': 'k', 'k': 'm', 'w': 'w',
                  's': 's', 'b': 'v', 'd': 'h', 'h': 'd', 'v': 'b', 'n': 'n'}
    seq_tmp = []
    for bps in seq:
        if len(bps) == 0:
        	#Need manage '' for deletion
            seq_tmp.append('') 
        elif len(bps) == 1:
            seq_tmp.append(complement[bps])
        else:
        	#Need manage 'ACT' for insertion
        	#The insertion need to be reverse complement (like seq)
            seq_tmp.append(reverseComplement(bps))
            
    #Doesn't work in the second for when bps==''
    #seq = [complement[bp] if bp != '' else '' for bps in seq for bp in bps]

    return seq_tmp

def reverseComplementTab(seq):
    '''
    Complements a DNA sequence, returning the reverse complement in a list to manage INDEL.
    '''
    return complementTab(seq[::-1])

def reverseComplement(seq):
	'''
	Complements a DNA sequence, returning the reverse complement.
	'''
	return complement(seq)[::-1]

def complement(seq) :
	"""returns the complementary sequence without inversing it"""
	tb = string.maketrans("ACGTRYMKWSBDHVNacgtrymkwsbdhvn",
						  "TGCAYRKMWSVHDBNtgcayrkmwsvhdbn")
	
	#just to be sure that seq isn't unicode
	return str(seq).translate(tb)

def translateDNA_6Frames(sequence) :
	"""returns 6 translation of sequence. One for each reading frame"""
	trans = (
				translateDNA(sequence, 'f1'),
				translateDNA(sequence, 'f2'),
				translateDNA(sequence, 'f3'),

				translateDNA(sequence, 'r1'),
				translateDNA(sequence, 'r2'),
				translateDNA(sequence, 'r3'),
			)

	return trans

def translateDNA(sequence, frame = 'f1', translTable_id='default') :
	"""Translates DNA code, frame : fwd1, fwd2, fwd3, rev1, rev2, rev3"""

	protein = ""

	if frame == 'f1' :
		dna = sequence
	elif frame == 'f2':
		dna = sequence[1:]
	elif frame == 'f3' :
		dna = sequence[2:]
	elif frame == 'r1' :
		dna = reverseComplement(sequence)
	elif frame == 'r2' :
		dna = reverseComplement(sequence)
		dna = dna[1:]
	elif frame == 'r3' :
		dna = reverseComplement(sequence)
		dna = dna[2:]
	else :
		raise ValueError('unknown reading frame: %s, should be one of the following: fwd1, fwd2, fwd3, rev1, rev2, rev3' % frame)

	for i in range(0, len(dna),  3) :
		codon = dna[i:i+3]

		# Check if variant messed with selenocysteine codon
		if '!' in codon and codon != '!GA':
			codon = codon.replace('!', 'T')

		if (len(codon) == 3) :
			try :
				# MC
				protein += translTable[translTable_id][codon]
			except KeyError :
				combinaisons = polymorphicCodonCombinaisons(list(codon))
				translations = set()
				for ci in range(len(combinaisons)):
					translations.add(translTable[translTable_id][combinaisons[ci]])
				protein += '/'.join(translations)

	return protein

def getSequenceCombinaisons(polymorphipolymorphicDnaSeqSeq, pos = 0) :
	"""Takes a dna sequence with polymorphismes and returns all the possible sequences that it can yield"""

	if type(polymorphipolymorphicDnaSeqSeq) is not types.ListType :
		seq = list(polymorphipolymorphicDnaSeqSeq)
	else :
		seq = polymorphipolymorphicDnaSeqSeq

	if pos >= len(seq) :
		return [''.join(seq)]

	variants = []
	if seq[pos] in polymorphicNucleotides :
		chars = decodePolymorphicNucleotide(seq[pos])
	else :
		chars = seq[pos]#.split('/')

	for c in chars :
		rseq = copy.copy(seq)
		rseq[pos] = c
		variants.extend(getSequenceCombinaisons(rseq, pos + 1))

	return variants

def polymorphicCodonCombinaisons(codon) :
	"""Returns all the possible amino acids encoded by codon"""
	return getSequenceCombinaisons(codon, 0)

def encodePolymorphicNucleotide(polySeq) :
	"""returns a single character encoding all nucletides of polySeq 
	in a single character. PolySeq must have one of the following forms: 
	['A', 'T', 'G'], 'ATG', 'A/T/G'"""
	
	if type(polySeq) is types.StringType :
		if polySeq.find("/") < 0 :
			sseq = list(polySeq)
		else :
			sseq = polySeq.split('/')
			
	else :
		sseq = polySeq
	
	seq = []
	for n in sseq :
		try :
			for n2 in polymorphicNucleotides[n] :
				seq.append(n2)
		except KeyError :
			seq.append(n)
	
	seq = set(seq)
	
	if len(seq) == 4:
		return 'N'
	elif len(seq) == 3 :
		if 'T' not in seq :
			return 'V'
		elif 'G' not in seq :
			return 'H'
		elif 'C' not in seq :
			return 'D'
		elif 'A' not in seq :
			return 'B'
	elif len(seq) == 2 :
		if 'A' in seq and 'G' in seq :
			return 'R'
		elif 'C' in seq and 'T' in seq :
			return 'Y'
		elif 'A' in seq and 'C' in seq :
			return 'M'
		elif 'T' in seq and 'G' in seq :
			return 'K'
		elif 'A' in seq and 'T' in seq :
			return 'W'
		elif 'C' in seq and 'G' in seq :
			return 'S'
	elif polySeq[0] in nucleotides :
		return polySeq[0]
	else :
		raise UnknownNucleotide(polySeq)

def decodePolymorphicNucleotide(nuc) :
	"""the opposite of encodePolymorphicNucleotide, from 'R' to ['A', 'G']"""
	if nuc in polymorphicNucleotides :
		return polymorphicNucleotides[nuc]

	if nuc in nucleotides :
		return nuc

	raise ValueError('nuc: %s, is not a valid nucleotide' % nuc)

def decodePolymorphicNucleotide_str(nuc) :
	"""same as decodePolymorphicNucleotide but returns a string instead 
	of a list, from 'R' to 'A/G"""
	return '/'.join(decodePolymorphicNucleotide(nuc))

def getNucleotideCodon(sequence, x1) :
	"""Returns the entire codon of the nucleotide at pos x1 in sequence, 
	and the position of that nocleotide in the codon in a tuple"""

	if x1 < 0 or x1 >= len(sequence) :
		return None

	p = x1%3
	if p == 0 :
		return (sequence[x1: x1+3], 0)
	elif p ==1 :
		return (sequence[x1-1: x1+2], 1)
	elif p == 2 :
		return (sequence[x1-2: x1+1], 2)

def showDifferences(seq1, seq2) :
	"""Returns a string highligthing differences between seq1 and seq2:
	
	* Matches by '-'
	
	* Differences : 'A|T'
	
	* Exceeded length : '#'
	
	"""
	ret = []
	for i in range(max(len(seq1), len(seq2))) :

		if i >= len(seq1) :
			c1 = '#'
		else :
			c1 = seq1[i]
		if i >= len(seq2) :
			c2 = '#'
		else :
			c2 = seq2[i]

		if c1 != c2 :
			ret.append('%s|%s' % (c1, c2))
		else :
			ret.append('-')

	return ''.join(ret)

def highlightSubsequence(sequence, x1, x2, start=' [', stop = '] ') :
	"""returns a sequence where the subsequence in [x1, x2[ is placed 
	in bewteen 'start' and 'stop'"""

	seq = list(sequence)
	print x1, x2-1, len(seq)
	seq[x1] = start + seq[x1]
	seq[x2-1] = seq[x2-1] + stop
	return ''.join(seq)

# def highlightSubsequence(sequence, x1, x2, start=' [', stop = '] ') :
# 	"""returns a sequence where the subsequence in [x1, x2[ is placed 
# 	in bewteen 'start' and 'stop'"""

# 	seq = list(sequence)
# 	ii = 0
# 	acc = True
# 	for i in range(len(seq)) :
# 		if ii == x1 :
# 			seq[i] = start+seq[i]
# 		if ii == x2-1 :
# 			seq[i] = seq[i] + stop

# 		if i < len(seq) - 1 :
# 			if seq[i+1] == '/':
# 				acc = False
# 			else :
# 				acc = True

# 		if acc :
# 			ii += 1

# 	return ''.join(seq)


================================================
FILE: pyGeno/tools/__init__.py
================================================
__all__ = ['BinarySequence', 'CSVTools', 'FastaTools', 'FastqTools', 'GTFTools', 'io', 'ProgressBar', 'SecureMmap', 'SegmentTree', 'SingletonManager', 'UsefulFunctions']


================================================
FILE: pyGeno/tools/io.py
================================================
import sys

def printf(*s) :
	'print + sys.stdout.flush()'
	for e in s[:-1] :
		print e,
	print s[-1]

	sys.stdout.flush()

def enterConfirm_prompt(enterMsg) :
	stopi = False
	while not stopi :
		print "====\n At any time you can quit by entering 'quit'\n===="
		vali = raw_input(enterMsg)
		if vali.lower() == 'quit' :
			vali = None
			stopi = True
		else :
			print "You've entered:\n\t%s" % vali
			valj = confirm_prompt("")
			if valj == 'yes' :
				stopi = True
			if valj == 'quit' :
				vali = None
				stopi = True
				
	return vali

def confirm_prompt(preMsg) :
	while True :
		val = raw_input('%splease confirm ("yes", "no", "quit"): ' % preMsg)
		if val.lower() == 'yes' or val.lower() == 'no' or val.lower() == 'quit':
			return val.lower()
		

================================================
FILE: pyGeno/tools/parsers/CSVTools.py
================================================
import os, types, collections

class EmptyLine(Exception) :
	"""Raised when an empty or comment line is found (dealt with internally)"""

	def __init__(self, lineNumber) :
		message = "Empty line: #%d" % lineNumber 
		Exception.__init__(self, message)
		self.message = message

	def __str__(self) :
		return self.message

def removeDuplicates(inFileName, outFileName) :
	"""removes duplicated lines from a 'inFileName' CSV file, the results are witten in 'outFileName'"""
	f = open(inFileName)
	legend = f.readline()
	
	data = ''
	h = {}
	h[legend] = 0
	
	lines = f.readlines()
	for l in lines :
		if not h.has_key(l) :
			h[l] = 0
			data += l
			
	f.flush()
	f.close()
	f = open(outFileName, 'w')
	f.write(legend+data)
	f.flush()
	f.close()

def catCSVs(folder, ouputFileName, removeDups = False) :
	"""Concatenates all csv in 'folder' and wites the results in 'ouputFileName'. My not work on non Unix systems"""
	strCmd = r"""cat %s/*.csv > %s""" %(folder, ouputFileName)
	os.system(strCmd)

	if removeDups :
		removeDuplicates(ouputFileName, ouputFileName)

def joinCSVs(csvFilePaths, column, ouputFileName, separator = ',') :
	"""csvFilePaths should be an iterable. Joins all CSVs according to the values in the column 'column'. Write the results in a new file 'ouputFileName' """
	
	res = ''
	legend = []
	csvs = []
	
	for f in csvFilePaths :
		c = CSVFile()
		c.parse(f)
		csvs.append(c)
		legend.append(separator.join(c.legend.keys()))
	legend = separator.join(legend)
	
	lines = []
	for i in range(len(csvs[0])) :
		val = csvs[0].get(i, column)
		line = separator.join(csvs[0][i])
		for c in csvs[1:] :
			for j in range(len(c)) :
				if val == c.get(j, column) :
					line += separator + separator.join(c[j])
					
		lines.append( line )
	
	res = legend + '\n' + '\n'.join(lines)
	
	f = open(ouputFileName, 'w')
	f.write(res)
	f.flush()
	f.close()
	
	return res

class CSVEntry(object) :
	"""A single entry in a CSV file"""
	
	def __init__(self, csvFile, lineNumber = None) :
		
		self.csvFile = csvFile
		self.data = []
		if lineNumber != None :
			self.lineNumber = lineNumber
			
			tmpL = csvFile.lines[lineNumber].replace('\r', '\n').replace('\n', '')
			if len(tmpL) == 0 or tmpL[0] in ["#", "\r", "\n", csvFile.lineSeparator] :
				raise EmptyLine(lineNumber)

			tmpData = tmpL.split(csvFile.separator)

			# tmpDatum = []
			i = 0
			while i < len(tmpData) :
			# for d in tmpData :
				d = tmpData[i]
				sd = d.strip()
				# print sd, tmpData, i
				if len(sd) > 0 and sd[0] == csvFile.stringSeparator :
					more = []	
					for i in xrange(i, len(tmpData)) :
						more.append(tmpData[i])
						i+=1
						if more[-1][-1] == csvFile.stringSeparator :
							break
					self.data.append(",".join(more)[1:-1])
					
				# if len(tmpDatum) > 0 or (len(sd) > 0 and sd[0] == csvFile.stringSeparator) :
				# 	tmpDatum.append(sd)
					
				# 	if len(sd) > 0 and sd[-1] == csvFile.stringSeparator :
				# 		self.data.append(csvFile.separator.join(tmpDatum))
				# 		tmpDatum = []
				else :
					self.data.append(sd)
					i += 1
		else :
			self.lineNumber = len(csvFile)
			for i in range(len(self.csvFile.legend)) :
				self.data.append('')

	def commit(self) :
		"""commits the line so it is added to a file stream"""
		self.csvFile.commitLine(self)

	def __iter__(self) :
		self.currentField = -1
		return self
	
	def next(self) :
		self.currentField += 1
		if self.currentField >= len(self.csvFile.legend) :
			raise StopIteration
			
		k = self.csvFile.legend.keys()[self.currentField]
		v = self.data[self.currentField]
		return k, v

	def __getitem__(self, key) :
		"""Returns the value of field 'key'"""
		try :
			indice = self.csvFile.legend[key.lower()]
		except KeyError :
			raise KeyError("CSV File has no column: '%s'" % key)
		
		return self.data[indice]

	def __setitem__(self, key, value) :
		"""Sets the value of field 'key' to 'value' """
		try :
			field = self.csvFile.legend[key.lower()]
		except KeyError :
			self.csvFile.addField(key)
			field = self.csvFile.legend[key.lower()]
			self.data.append(str(value))
		else :
			try:
				self.data[field] = str(value)
			except Exception as e:
				for i in xrange(field-len(self.data)+1) :
					self.data.append("")
				self.data[field] = str(value)

	def toStr(self) :
		return self.csvFile.separator.join(self.data)
		
	def __repr__(self) :
		r = {}
		for k, v in self.csvFile.legend.iteritems() :
			r[k] = self.data[v]

		return "<line %d: %s>" %(self.lineNumber, str(r))
		
	def __str__(self) :
		return repr(self)
	
class CSVFile(object) :
	"""
	Represents a whole CSV file::
		
		#reading
		f = CSVFile()
		f.parse('hop.csv')
		for line in f :
			print line['ref']

		#writing, legend can either be a list of a dict {field : column number}
		f = CSVFile(legend = ['name', 'email'])
		l = f.newLine()
		l['name'] = 'toto'
		l['email'] = "hop@gmail.com"
		
		for field, value in l :
			print field, value

		f.save('myCSV.csv')		
	"""
	
	def __init__(self, legend = [], separator = ',', lineSeparator = '\n') :
		
		self.legend = collections.OrderedDict()
		for i in range(len(legend)) :
			if legend[i].lower() in self.legend :
				raise  ValueError("%s is already in the legend" % legend[i].lower())
			self.legend[legend[i].lower()] = i
		self.strLegend = separator.join(legend)

		self.filename = ""
		self.lines = []	
		self.separator = separator
		self.lineSeparator = lineSeparator
		self.currentPos = -1
	
		self.streamFile = None
		self.writeRate = None
		self.streamBuffer = None
		self.keepInMemory = True

	def addField(self, field) :
		"""add a filed to the legend"""
		if field.lower() in self.legend :
			raise  ValueError("%s is already in the legend" % field.lower())
		self.legend[field.lower()] = len(self.legend)
		if len(self.strLegend) > 0 :
			self.strLegend += self.separator + field
		else :
			self.strLegend += field
			
	def parse(self, filePath, skipLines=0, separator = ',', stringSeparator = '"', lineSeparator = '\n') :
		"""Loads a CSV file"""
		
		self.filename = filePath
		f = open(filePath)
		if lineSeparator == '\n' :
			lines = f.readlines()
		else :
			lines = f.read().split(lineSeparator)
		f.flush()
		f.close()
		
		lines = lines[skipLines:]
		self.lines = []
		self.comments = []
		for l in lines :
			# print l
			if len(l) != 0 and l[0] != "#" :
				self.lines.append(l)
			elif l[0] == "#" :
				self.comments.append(l)

		self.separator = separator
		self.lineSeparator = lineSeparator
		self.stringSeparator = stringSeparator
		self.legend = collections.OrderedDict()
		
		i = 0
		for c in self.lines[0].lower().replace(stringSeparator, '').split(separator) :
			legendElement = c.strip()
			if legendElement not in self.legend :
				self.legend[legendElement] = i
			i+=1
	
		self.strLegend = self.lines[0].replace('\r', '\n').replace('\n', '')
		self.lines = self.lines[1:]
		# sk = skipLines+1
		# for l in self.lines :
		# 	if l[0] == "#" :
		# 		sk += 1
		# 	else :
		# 		break

		# self.header = self.lines[:sk]
		# self.lines = self.lines[sk:]
	

	def streamToFile(self, filename, keepInMemory = False, writeRate = 1) :
		"""Starts a stream to a file. Every line must be committed (l.commit()) to be appended in to the file.

		If keepInMemory is set to True, the parser will keep a version of the whole CSV in memory, writeRate is the number
		of lines that must be committed before an automatic save is triggered.
		"""
		if len(self.legend) < 1 :
			raise ValueError("There's no legend defined")

		try :
			os.remove(filename)
		except :
			pass
		
		self.streamFile = open(filename, "a")
		self.writeRate = writeRate
		self.streamBuffer = []
		self.keepInMemory = keepInMemory

		self.streamFile.write(self.strLegend + "\n")

	def commitLine(self, line) :
		"""Commits a line making it ready to be streamed to a file and saves the current buffer if needed. If no stream is active, raises a ValueError"""
		if self.streamBuffer is None :
			raise ValueError("Commit lines is only for when you are streaming to a file")

		self.streamBuffer.append(line)
		if len(self.streamBuffer) % self.writeRate == 0 :
			for i in xrange(len(self.streamBuffer)) :
				self.streamBuffer[i] = str(self.streamBuffer[i])
			self.streamFile.write("%s\n" % ('\n'.join(self.streamBuffer)))
			self.streamFile.flush()
			self.streamBuffer = []

	def closeStreamToFile(self) :
		"""Appends the remaining commited lines and closes the stream. If no stream is active, raises a ValueError"""
		if self.streamBuffer is None :
			raise ValueError("Commit lines is only for when you are streaming to a file")

		for i in xrange(len(self.streamBuffer)) :
			self.streamBuffer[i] = str(self.streamBuffer[i])
		self.streamFile.write('\n'.join(self.streamBuffer))
		self.streamFile.close()

		self.streamFile = None
		self.writeRate = None
		self.streamBuffer = None
		self.keepInMemory = True

	def _developLine(self, line) :
		stop = False
		while not stop :
			try :
				if self.lines[line].__class__ is not CSVEntry :
					devL = CSVEntry(self, line)
					stop = True
				else :
					devL = self.lines[line]
					stop = True
			except EmptyLine as e :
				del(self.lines[line])
					
		self.lines[line] = devL
	
	def get(self, line, key) :
		self._developLine(line)
		return self.lines[line][key]

	def set(self, line, key, val) :
		self._developLine(line)
		self.lines[line][key] = val
	
	def newLine(self) :
		"""Appends an empty line at the end of the CSV and returns it"""
		l = CSVEntry(self)
		if self.keepInMemory :
			self.lines.append(l)
		return l
	
	def insertLine(self, i) :
		"""Inserts an empty line at position i and returns it"""
		self.data.insert(i, CSVEntry(self))
		return self.lines[i]
	
	def save(self, filePath) :
		"""save the CSV to a file"""
		self.filename = filePath
		f = open(filePath, 'w')
		f.write(self.toStr())
		f.flush()
		f.close()

	def toStr(self) :
		"""returns a string version of the CSV"""
		s = [self.strLegend]
		for l in self.lines :
			s.append(l.toStr())
		return self.lineSeparator.join(s)
	
	def __iter__(self) :
		self.currentPos = -1
		return self
	
	def next(self) :
		self.currentPos += 1
		if self.currentPos >= len(self) :
			raise StopIteration
			
		self._developLine(self.currentPos)
		return self.lines[self.currentPos]
	
	def __getitem__(self, line) :
		try :
			if self.lines[line].__class__ is not CSVEntry :
				self._developLine(line)
		except AttributeError :
			start = line.start
			if start is None :
				start = 0

			for l in xrange(len(self.lines[line])) :
				self._developLine(l + start)

			# start, stop = line.start, line.stop
			# if start is None :
			# 	start = 0
			
			# if stop is None :
			# 	stop = 0

			# for l in xrange(start, stop) :
			# 	self._developLine(l)

		return self.lines[line]

	def __len__(self) :
		return len(self.lines)


================================================
FILE: pyGeno/tools/parsers/CasavaTools.py
================================================
import gzip
import pyGeno.tools.UsefulFunctions as uf

class SNPsTxtEntry(object) :
	"""A single entry in the casavas snps.txt file"""
	
	def __init__(self, lineNumber, snpsTxtFile) :
		self.snpsTxtFile = snpsTxtFile
		self.lineNumber = lineNumber
		self.values = {}
		sl = str(snpsTxtFile.data[lineNumber]).replace('\t\t', '\t').split('\t')
		
		self.values['chromosomeNumber'] = sl[0].upper().replace('CHR', '')
		#first column: chro, second first of range (identical to second column)
		self.values['start'] = int(sl[2])
		self.values['end'] = int(sl[2])+1
		self.values['bcalls_used'] = sl[3]
		self.values['bcalls_filt'] = sl[4]
		self.values['ref'] = sl[5]
		self.values['QSNP'] = int(sl[6])
		self.values['alleles'] = uf.encodePolymorphicNucleotide(sl[7]) #max_gt
		self.values['Qmax_gt'] = int(sl[8])
		self.values['max_gt_poly_site'] = sl[9]
		self.values['Qmax_gt_poly_site'] = int(sl[10])
		self.values['A_used'] = int(sl[11])
		self.values['C_used'] = int(sl[12])
		self.values['G_used'] = int(sl[13])
		self.values['T_used'] = int(sl[14])
	
	def __getitem__(self, fieldName):
		"""Returns the value of field 'fieldName'"""
		return self.values[fieldName]
	
	def __setitem__(self, fieldName, value) :
		"""Sets the value of field 'fieldName' to 'value' """
		self.values[fieldName] = value
		
	def __str__(self):
		return str(self.values)
	
class SNPsTxtFile(object) :
	"""
	Represents a whole casava's snps.txt file::
		
		f = SNPsTxtFile('snps.txt')
		for line in f :
			print line['ref']
	
	"""
	def __init__(self, fil, gziped = False) :
		self.reset()
		if not gziped :
			f = open(fil)
		else :
			f = gzip.open(fil)
		
		for l in f :
			if l[0] != '#' :
				self.data.append(l)

		f.close()

	def reset(self) :
		"""Frees the file"""
		self.data = []
		self.currentPos = 0
	
	def __iter__(self) :
		self.currentPos = 0
		return self
	
	def next(self) :
		if self.currentPos >= len(self) :
			raise StopIteration()
		v = self[self.currentPos]
		self.currentPos += 1
		return v

	def __getitem__(self, i) :
		if self.data[i].__class__ is not SNPsTxtEntry :
			self.data[i] = SNPsTxtEntry(i, self)
		return self.data[i]
	
	def __len__(self) :
		return len(self.data)


================================================
FILE: pyGeno/tools/parsers/FastaTools.py
================================================
import os

class FastaFile(object) :
	"""
	Represents a whole Fasta file::
		
		#reading
		f = FastaFile()
		f.parseFile('hop.fasta')
		for line in f :
			print line
		
		#writing
		f = FastaFile()
		f.add(">prot1", "MLPADEV")
		f.save('myFasta.fasta')
	"""
	def __init__(self, fil = None) :
		self.reset()
		if fil != None :
			self.parseFile(fil)
	
	def reset(self) :
		"""Erases everything"""
		self.data = []
		self.currentPos = 0
	
	def parseStr(self, st) :
		"""Parses a string"""
		self.data = st.split('>')[1:]

	def parseFile(self, fil) :
		"""Opens a file and parses it"""
		f = open(fil)
		self.parseStr(f.read())
		f.close()

	def __splitLine(self, li) :
		if len(self.data[li]) != 2 :
			self.data[li] = self.data[li].replace('\r', '\n')
			self.data[li] = self.data[li].replace('\n\n', '\n')
			l = self.data[li].split('\n')
			header = '>'+l[0]
			data = ''.join(l[1:])
			self.data[li] = (header, data)

	def get(self, i) :
		"""returns the ith entry"""
		self.__splitLine(i)
		return self.data[i]
		
	def add(self, header, data) :
		"""appends a new entry to the file"""
		if header[0] != '>' :
			self.data.append(('>'+header, data))
		else :
			self.data.append((header, data))
	
	def save(self, filePath) :
		"""saves the file into filePath"""
		f = open(filePath, 'w')
		f.write(self.toStr())
		f.close()
	
	def toStr(self) :
		"""returns a string version of self"""
		st = ""
		for d in self.data :
			st += "%s\n%s\n" % (d[0], d[1]) 
	
		return st
		
	def __iter__(self) :
		self.currentPos = 0
		return self
	
	def next(self) :
		#self to call getitem, and split he line if necessary
		i = self.currentPos +1
		#print i-1, self.currentPos
		if i > len(self) :
			raise StopIteration()
			
		self.currentPos = i
		return self[self.currentPos-1]

	def __getitem__(self, i) :
		"""returns the ith entry"""
		return self.get(i)
	
	def __setitem__(self, i, v) :
		"""sets the value of the ith entry"""
		if len(v) != 2: 
			raise TypeError("v must have a len of 2 : (header, data)")
			
		self.data[i] = v
		
	def __len__(self) :
		"""returns the number of entries"""
		return len(self.data)


================================================
FILE: pyGeno/tools/parsers/FastqTools.py
================================================
import os

class FastqEntry(object) :
	"""A single entry in the FastqEntry file"""
	
	def __init__(self, ident = "", seq = "", plus = "", qual = "") :
		self.values = {}
		self.values['identifier'] = ident
		self.values['sequence'] = seq
		self.values['+'] = plus
		self.values['qualities'] = qual
	
	def __getitem__(self, i):
		return self.values[i]
	
	def __setitem__(self, i, v) :
		self.values[i] = v
		
	def __str__(self):
		return "%s\n%s\n%s\n%s" %(self.values['identifier'], self.values['sequence'], self.values['+'], self.values['qualities'])
	
class FastqFile(object) :
	"""
	Represents a whole Fastq file::
		
		#reading
		f = FastqFile()
		f.parse('hop.fastq')
		for line in f :
			print line['sequence']
		
		#writing, legend can either be a list of a dict {field : column number}
		f = CSVFile(legend = ['name', 'email'])
		l = f.newLine()
		l['name'] = 'toto'
		l['email'] = "hop@gmail.com"
		f.save('myCSV.csv')
		
	"""
	
	def __init__(self, fil = None) :
		self.reset()
		if fil != None :
			self.parseFile(fil)
	
	def reset(self) :
		"""Frees the file"""
		self.data = []
		self.currentPos = 0
	
	def parseStr(self, st) :
		"""Parses a string"""
		self.data = st.replace('\r', '\n')
		self.data = self.data.replace('\n\n', '\n')
		self.data = self.data.split('\n')

	def parseFile(self, fil) :
		"""Parses a file on disc"""
		f = open(fil)
		self.parseStr(f.read())
		f.close()		
		
	def __splitEntry(self, li) :
		try :
			self.data['+'] 
			return self.data
		except :
			self.data[li] = FastqEntry(self.data[li], self.data[li+1], self.data[li+2], self.data[li+3])
			
	def get(self, li) :
		"""returns the ith entry"""
		i = li*4
		self.__splitEntry(i)
		return self.data[i]
	
	def newEntry(self, ident = "", seq = "", plus = "", qual = "") :
		"""Appends an empty entry at the end of the CSV and returns it"""
		e = FastqEntry()
		self.data.append(e)
		return e
	
	def add(self, fastqEntry) :
		"""appends an entry to self"""
		self.data.append(fastqEntry)
		
	def save(self, filePath) :
		f = open(filePath, 'w')
		f.write(self.make())
		f.close()
			
	def toStr(self) :
		st = ""
		for d in self.data :
			st += "%s\n%s" % (d[0], d[1]) 
	
		return st
		
	def __iter__(self) :
		self.currentPos = 0
		return self
	
	def next(self) :
		#self to call getitem, and split he line if necessary
		i = self.currentPos +1
		#print i-1, self.currentPos
		if i > len(self) :
			raise StopIteration()
			
		self.currentPos = i
		return self[self.currentPos-1]

	def __getitem__(self, i) :
		"""returns the ith entry"""
		return self.get(i)
	
	def __setitem__(self, i, v) :
		"""sets the ith entry"""
		if len(v) != 2: 
			raise TypeError("v must have a len of 2 : (header, data)")
			
		self.data[i] = v
		
	def __len__(self) :
		return len(self.data)/4


================================================
FILE: pyGeno/tools/parsers/GTFTools.py
================================================
import gzip

class GTFEntry(object) :
	def __init__(self, gtfFile, lineNumber) :
		"""A single entry in a GTF file"""
		
		self.lineNumber = lineNumber
		self.gtfFile = gtfFile
		self.data = gtfFile.lines[lineNumber][:-2].split('\t') #-2 remove ';\n'
		proto_atts = self.data[gtfFile.legend['attributes']].strip().split('; ')
		atts = {}
		for a in proto_atts :
			sa = a.split(' ')
			atts[sa[0]] = sa[1].replace('"', '')
		self.data[gtfFile.legend['attributes']] = atts
	
	def __getitem__(self, k) :
		try :
			return self.data[self.gtfFile.legend[k]]
		except KeyError :
			try :
				return self.data[self.gtfFile.legend['attributes']][k]
			except KeyError :
				#return None
				raise KeyError("Line %d does not have an element %s.\nline:%s" %(self.lineNumber, k, self.gtfFile.lines[self.lineNumber]))
	
	def __repr__(self) :
		return "<GTFEntry line: %d>" % self.lineNumber
	
	def __str__(self) :
		return  "<GTFEntry line: %d, %s>" % (self.lineNumber, str(self.data))

class GTFFile(object) :
	"""This is a simple GTF2.2 (Revised Ensembl GTF) parser, see http://mblab.wustl.edu/GTF22.html for more infos"""
	def __init__(self, filename, gziped = False) :
		
		self.filename = filename
		self.legend = {'seqname' : 0, 'source' : 1, 'feature' : 2, 'start' : 3, 'end' : 4, 'score' : 5, 'strand' : 6, 'frame' : 7, 'attributes' : 8}

		if gziped : 
			f = gzip.open(filename)
		else :
			f = open(filename)
		
		self.lines = []
		for l in f :
			if l[0] != '#' and l != '' :
				self.lines.append(l)
		f.close()
		
		self.currentIt = -1

	def get(self, line, elmt) :
		"""returns the value of the field 'elmt' of line 'line'"""
		return self[line][elmt]

	def __iter__(self) :
		self.currentPos = -1
		return self

	def next(self) :
		self.currentPos += 1
		try :
			return GTFEntry(self, self.currentPos)
		except IndexError:
			raise StopIteration

	def __getitem__(self, i) :
		"""returns the ith entry"""
		if self.lines[i].__class__ is not GTFEntry :
			self.lines[i] = GTFEntry(self, i)
		return self.lines[i]

	def __repr__(self) :
		return "<GTFFile: %s>" % (os.path.basename(self.filename))

	def __str__(self) :
		return "<GTFFile: %s, gziped: %s, len: %d>" % (os.path.basename(self.filename), self.gziped, len(self))
	
	def __len__(self) :
		return len(self.lines)


================================================
FILE: pyGeno/tools/parsers/VCFTools.py
================================================
import os, types, gzip

class VCFEntry(object) :
	"""A single entry in a VCF file"""
	
	def __init__(self, vcfFile, line, lineNumber) :
		#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
		#20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
		self.vcfFile = vcfFile
		self.lineNumber = lineNumber
		self.data = {}
		
		tmpL = line.replace('\r', '\n').replace('\n', '')
		tmpData = str(tmpL).split('\t')
		for i in range(6) :
			self.data[vcfFile.dnegel[i]] = tmpData[i]
		self.data['POS'] = int(self.data['POS'])
		
		filters = tmpData[6].split(';')
		if len(filters) == 1 :
			self.data['FILTER'] = filters
		else :
			for filter_value in tmpData[6].split(';') :
				filt, value = info_value.split('=')
				self.data['FILTER'][filt] = value
		
		self.data['INFO'] = {}
		for s in tmpData[7].split(';') :
			info_value = s.split('=')
			try :
				typ = self.vcfFile.meta['INFO'][info_value[0]]['Type']
			except KeyError :
				typ = None
				
			if len(info_value) == 1 :
				if typ == 'Flag' or typ == None :
					self.data['INFO'][info_value[0]] = True
				else :
					raise ValueError('%s is not a flag and has no value' % info_value[0])
			else :
				if typ == 'Integer' :
					self.data['INFO'][info_value[0]] = int(info_value[1])
				elif typ == 'Float' :
					self.data['INFO'][info_value[0]] = float(info_value[1])
				else :
					self.data['INFO'][info_value[0]] = info_value[1]
				
	def __getitem__(self, key) :
		"with the vcf file format some fields are not present in all elements therefor, this fct never raises an exception but returns None or False if the field is definied as a Flag in Meta"
		try :
			return self.data[key]
		except KeyError:
			try :
				return self.data['INFO'][key]
			except KeyError:
				try :
					if self.vcfFile.meta['INFO'][key]['Type'] == 'Flag' :
						self.data['INFO'][key] = False
						return self.data['INFO'][key]
					else :
						return None
				except KeyError:
					return None
	
	def __repr__(self) :
		return "<VCFEntry line: %d>" % self.lineNumber
	
	def __str__(self) :
		return "<VCFEntry line: %d,  %s" % (self.lineNumber, str(self.data))

class VCFFile(object) :
	"""
	This is a small parser for VCF files, it should work with any VCF file but has only been tested on dbSNP138 files.
	Represents a whole VCF file::
		
		#reading
		f = VCFFile()
		f.parse('hop.vcf')
		for line in f :
			print line['pos']
	"""
	
	def __init__(self, filename = None, gziped = False, stream = False) :
		self.legend = {}
		self.dnegel = {}
		self.meta = {}
		self.lines = None
		if filename :
			self.parse(filename, gziped, stream)
		
	def parse(self, filename, gziped = False, stream = False) :
		"""opens a file"""
		self.stream = stream
		
		if gziped :
			self.f = gzip.open(filename)
		else :
			self.f = open(filename)
		
		self.filename = filename
		self.gziped = gziped
		
		lineId = 0
		inLegend = True
		while inLegend :
			ll = self.f.readline()
			l = ll.replace('\r', '\n').replace('\n', '')
			if l[:2] == '##' :
				eqPos = l.find('=')
				key = l[2:eqPos]
				values = l[eqPos+1:].strip()
				
				if l[eqPos+1] != '<' :
					self.meta[key] = values
				else :
					if key not in self.meta :
						self.meta[key] = {}
					svalues = l[eqPos+2:-1].split(',') #remove the < and > that surounf the string 
					idKey = svalues[0].split('=')[1]
					self.meta[key][idKey] = {} 
					i = 1
					for v in svalues[1:] :
						sv = v.split("=")
						field = sv[0]
						value = sv[1]
						if field.lower() == 'description' :
							self.meta[key][idKey][field] = ','.join(svalues[i:])[len(field)+2:-1]
							break
						self.meta[key][idKey][field] = value
						i += 1
			elif l[:6] == '#CHROM': #we are in legend
				sl = l.split('\t')
				for i in range(len(sl)) :
					self.legend[sl[i]] = i
					self.dnegel[i] = sl[i]
				break
			
			lineId += 1
		
		if not stream :
			self.lines = self.f.readlines()
			self.f.close()
	
	def close(self) :
		"""closes the file"""
		self.f.close()
		
	def _developLine(self, lineNumber) :
		if self.lines[lineNumber].__class__ is not VCFEntry :
			self.lines[lineNumber] = VCFEntry(self, self.lines[lineNumber], lineNumber)
		
	def __iter__(self) :
		self.currentPos = -1
		return self
	
	def next(self) :
		self.currentPos += 1
		if not self.stream :
			try :
				return self[self.currentPos-1]
			except IndexError:
				raise StopIteration
		else :
			midfile_header = True
			while midfile_header:
				line = self.f.readline()
				if not line :
					raise StopIteration
				if not line.startswith('#'):
					midfile_header = False
			return VCFEntry(self, line, self.currentPos)
	
	def __getitem__(self, line) :
		"""returns the lineth element"""
		if self.stream :
			raise KeyError("When the file is opened as a stream it's impossible to ask for specific item")
		
		if self.lines[line].__class__ is not VCFEntry :
			self._developLine(line)
		return self.lines[line]

	def __len__(self) :
		"""returns the number of entries"""
		return len(self.lines)
	
	def __repr__(self) :
		return "<VCFFile: %s>" % (os.path.basename(self.filename))
	
	def __str__(self) :
		if self.stream :
			return "<VCFFile: %s, gziped: %s, stream: %s, len: undef>" % (os.path.basename(self.filename), self.gziped, self.stream)
		else :
			return "<VCFFile: %s, gziped: %s, stream: %s, len: %d>" % (os.path.basename(self.filename), self.gziped, self.stream, len(self))
		
if __name__ == '__main__' :
	from pyGeno.tools.ProgressBar import ProgressBar
	
	#v = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/00-All.vcf.gz', gziped = True, stream = True)
	v = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/test/00-All-test.vcf.gz', gziped = True, stream = True)
	startTime = time.time()
	i = 0
	pBar = ProgressBar()
	for f in v :
		#print f
		pBar.update('%s' % i)
		if i > 1000000 :
			break
		i += 1
	pBar.close()
	
	#v = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/00-All.vcf', gziped = False, stream = True)
	v = VCFFile('/u/daoudat/py/pySrc/pyGeno_stuff/V2/packages/dbSNP/human/dbSNP138/test/00-All-test.vcf', gziped = False, stream = False)
	startTime = time.time()
	i = 0
	pBar = ProgressBar()
	for f in v :
		pBar.update('%s' % i)
		if i > 1000000 :
			break
		i += 1
		#print f
	pBar.close()
	#print v.lines


================================================
FILE: pyGeno/tools/parsers/__init__.py
================================================
#Parsers for different types of files
__all__ = ['CSVTools', 'FastaTools', 'FastqTools', 'GTFTools', 'CasavaTools']


================================================
FILE: setup.py
================================================
from setuptools import setup, find_packages 
from codecs import open
from os import path

here = path.abspath(path.dirname(__file__))

# Get the long description from the relevant file
with open(path.join(here, 'DESCRIPTION.rst'), encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='pyGeno',

    version='1.3.2',

    description='A python package for Personalized Genomics and Proteomics',
    long_description=long_description,
    
    url='http://pyGeno.iric.ca',

    author='Tariq Daouda',
    author_email='tariq.daouda@umontreal.ca',

	test_suite="pyGeno.tests", 
	
    license='ApacheV2.0',

    # See https://pypi.python.org/pypi?%3Aaction=list_classifiers
    classifiers=[
        # How mature is this project? Common values are
        #   3 - Alpha
        #   4 - Beta
        #   5 - Production/Stable
        'Development Status :: 5 - Production/Stable',

        'Intended Audience :: Science/Research',
        'Intended Audience :: Healthcare Industry',
        'Topic :: Scientific/Engineering :: Bio-Informatics',
        'Topic :: Software Development :: Libraries',
        'Topic :: Software Development :: Libraries :: Application Frameworks',

        'License :: OSI Approved :: Apache Software License',

        'Programming Language :: Python :: 2.7',
    ],

    keywords='proteogenomics genomics proteomics annotations medicine research personalized gene sequence protein',

    packages=find_packages(exclude=[]),

    # List run-time dependencies here.  These will be installed by pip when your
    # project is installed. For an analysis of "install_requires" vs pip's
    # requirements files see:
    # https://packaging.python.org/en/latest/technical.html#install-requires-vs-requirements-files
    install_requires=['rabaDB >= 1.0.5'],

    # If there are data files included in your packages that need to be
    # installed, specify them here.  If using Python 2.6 or less, then these
    # have to be included in MANIFEST.in as well.
    package_data={
        '': ['*.txt', '*.rst', '*.tar.gz'],
    },

    # Although 'package_data' is the preferred approach, in some case you may
    # need to place data files outside of your packages.
    # see http://docs.python.org/3.4/distutils/setupscript.html#installing-additional-files
    # In this case, 'data_file' will be installed into '<sys.prefix>/my_data'
    #~ data_files=[('my_data', ['data/data_file'])],

    # To provide executable scripts, use entry points in preference to the
    # "scripts" keyword. Entry points provide cross-platform support and allow
    # pip to create the appropriate form of executable for the target platform.
    entry_points={
        'console_scripts': [
            'sample=sample:main',
        ],
    },
)