[
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\npip-wheel-metadata/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n.python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# Pycharm\n.idea\n\n# Macos\n.DS_Store\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2019 Fedor Indukaev\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4016732.svg)](https://doi.org/10.5281/zenodo.4016732)\n\n\n# supervenn: precise and easy-to-read multiple sets visualization in Python\n\n### What it is\n**supervenn** is a matplotlib-based tool for visualization of any number of intersecting sets. It supports Python\n`set`s as inputs natively, but there is a [simple workaround](#use-intersection-sizes-as-inputs-instead-of-sets) to use just intersection sizes.\n\nNote that despite its name, `supervenn` does not produce actual (Euler-)Venn diagrams.\n\nThe easiest way to understand how supervenn diagrams work, is to compare some simple examples to their Euler-Venn\ncounterparts. Top row is Euler-Venn diagrams made with [matplotlib-venn](https://github.com/konstantint/matplotlib-venn)\npackage, bottom row is supervenn diagrams:\n\n<img src=\"https://i.imgur.com/dJoNhYQ.png\" width=800>\n\n### Installation\n`pip install supervenn`\n\n### Requirements\nPython 2.7 or 3.6+ with `numpy`, `matplotlib` and `pandas`.\n\n### Basic usage \nThe main entry point is the eponymous `supervenn` function. It takes a list of python `set`s as its first and only\nrequired argument and returns a `SupervennPlot` object.\n```python\nfrom supervenn import supervenn\nsets = [{1, 2, 3, 4}, {3, 4, 5}, {1, 6, 7, 8}]\nsupervenn(sets, side_plots=False)\n```\n<img src=\"https://i.imgur.com/aAOP6dq.png\" width=330>\n\nEach row represents a set, the order from bottom to top is the same as in the `sets` list. Overlapping parts correspond\nto set intersections.\n\nThe numbers at the bottom show the sizes (cardinalities) of all intersections, which we will call **chunks**.\nThe sizes of sets and their intersections (chunks) are up to proportion, but the order of elements is not preserved,\ne.g. the leftmost chunk of size 3 is `{6, 7, 8}`.\n\nA combinatorial optimization algorithms is applied that rearranges the chunks (the columns of the\narray plotted) to minimize the number of parts the sets are broken into. In the example above each set is in one piece\n( no gaps in rows at all), but it's not always possible, even for three sets:\n\n```python\nsupervenn([{1, 2}, {2, 3}, {1, 3}], side_plots=False)\n```\n\n<img src=\"https://i.imgur.com/8aTSg2A.png\" width=\"330\">\n\nBy default, additional *side plots* are also displayed:\n\n```python\nsupervenn(sets)\n```\n<img src=\"https://i.imgur.com/9IhLBcK.png\" width=330>\nHere, the numbers on the right are the set sizes (cardinalities), and numbers on the top show how many sets does this\nintersection make part of. The grey bars represent the same numbers visually.\n\nIf you need only one of the two side plots, use `side_plots='top'` or `side_plots='right'`\n\n### Features (how to)\n\n#### Use intersection sizes as inputs instead of sets\n(New in 0.5.0). Use the utility function `make_sets_from_chunk_sizes` to produce synthetic sets of integers from your intersection sizes.\nThen pass these sets to `supervenn()`: \n\n```python\nfrom supervenn import supervenn, make_sets_from_chunk_sizes\nsets, labels = make_sets_from_chunk_sizes(sizes_df)  # see below for the structure of sizes_df\nsupervenn(sets, labels)\n```\n\nIntersection sizes `sizes_df` should be a `pandas.DataFrame` with the following structure:\n\n- For `N` sets, it must have `N` boolean (or 0/1) columns and the last column must be integer, so `N+1` columns in total.\n- Each row represents a unique intersection (chunk) of the sets. The boolean value in column `set_x` indicate whether\nthis chunk lies within `set_x`. The integer value represents the size of the chunk.\n\nFor example, consider the following dataframe\n\n```\n   set_1  set_2  set_3  size\n0  False   True   True     1\n1   True  False  False     3\n2   True  False   True     2\n3   True   True  False     1\n```\n\nIt represents a configuration of three sets such that\n- [row 0] there is one element that lies in `set_2` and `set_3` but not in `set_1`,\n- [row 1] there are three elements that lie in `set_1` only and not in `set_2` or `set_3`,\n- etc two more rows.\n\n#### Add custom set annotations instead of `set_1`, `set_2` etc\nUse the `set_annotations` argument to pass a list of annotations. It should be in the same order as the sets. It is\nthe second positional argument.\n```python\nsets = [{1, 2, 3, 4}, {3, 4, 5}, {1, 6, 7, 8}]\nlabels = ['alice', 'bob', 'third party']\nsupervenn(sets, labels)\n```\n<img src=\"https://i.imgur.com/YlPKs7u.png\" width=330>\n\n#### Change size and dpi of the plot\nCreate a new figure and plot into it:\n```python\nimport matplotlib.pyplot as plt\nplt.figure(figsize=(16, 8))\nsupervenn(sets)\n```\n\nThe `supervenn` function has `figsize` and `dpi` arguments, but they are **deprecated** and will be removed in a future\nversion. Please don't use them.\n\n#### Plot into an existing axis\nUse the `ax` argument:\n\n```python\nsupervenn(sets, ax=my_axis)\n```\n\n#### Access the figure and axes objects of the plot\nUse `.figure` and `axes`  attributes of the object returned by `supervenn()`. The `axes` attribute is\norganized as a dict with descriptive strings for keys: `main`, `top_side_plot`, `right_side_plot`, `unused`. \nIf `side_plots=False`, the dict has only one key `main`.\n\n#### Save the plot to an image file\n\n```python\nimport matplotlib.pyplot as plt\nsupervenn(sets)\nplt.savefig('myplot.png')\n```\n\n#### Use a different ordering of chunks (columns)\nUse the `chunks_ordering` argument. The following options are available:\n- `'minimize gaps'`: default, use an optimization algorithm to find an order of columns with fewer\ngaps in each row;\n- `'size'`: bigger chunks go first;\n- `'occurrence'`: chunks that are in more sets go first;\n- `'random'`: randomly shuffle the columns.\n\nTo reverse the order (e.g. you want smaller chunks to go first), pass `reverse_chunks_order=False` (by default\nit's `True`) \n\n#### Reorder the sets (rows) instead of keeping the order as passed into function\nUse the `sets_ordering` argument. The following options are available:\n- `None`: default - keep the order of sets as passed into function;\n- `'minimize gaps'`: use the same algorithm as for chunks to group similar sets closer together. The difference in the\nalgorithm is that now gaps are minimized in columns instead of rows, and they are weighted by the column widths\n(i.e. chunk sizes), as we want to minimize total gap width;\n- `'size'`: bigger sets go first;\n- `'chunk count'`: sets that contain most chunks go first;\n- `'random'`: randomly shuffle the rows.\n\nTo reverse the order (e.g. you want smaller sets to go first), pass `reverse_sets_order=False` (by default\nit's `True`) \n\n#### Inspect the chunks' contents\n`supervenn(sets, ...)` returns a `SupervennPlot` object, which has a `chunks` attribute.\nIt is a `dict` with `frozenset`s of set indices as keys, and chunks as values. For example, \n`my_supervenn_object.chunks[frozenset([0, 2])]` is the chunk with all the items that are in `sets[0]` and\n`sets[2]`, but not in any of the other sets.\n\nThere is also a `get_chunk(set_indices)` method that is slightly more convenient, because you\ncan pass a `list` or any other iterable of indices instead of a `frozenset`. For example:\n`my_supervenn_object.get_chunk([0, 2])`. \n\nIf you have a good idea of a more convenient method of chunks lookup, let me know and I'll\nimplement it as well.\n\n#### Make the plot prettier if sets and/or chunks are very different in size\nUse the `widths_minmax_ratio` argument, with a value between 0.01 and 1. Consider the following example\n```python\nsets = [set(range(200)), set(range(201)), set(range(203)), set(range(206))]\nsupervenn(sets, side_plots=False)\n```\n<img src=\"https://i.imgur.com/i05lgNU.png\" width=330>\n\nAnnotations in the bottom left corner are unreadable.\n\nOne solution is to trade exact chunk proportionality for readability. This is done by making small chunks visually\nlarger. To be exact, a linear function is applied to the chunk sizes, with slope and intercept chosen so that the\nsmallest chunk size is exactly `widths_minmax_ratio` times the largest chunk size. If the ratio is already greater than\nthis value, the sizes are left unchanged. Setting `widths_minmax_ratio=1` will result in all chunks being displayed as\nsame size.\n\n```python\nsupervenn(sets, side_plots=False, widths_minmax_ratio=0.05)\n```\nThe image now looks clean, but chunks of size 1 to 3 look almost the same.\n\n\n<img src=\"https://i.imgur.com/cIp42uD.png\" width=330>\n\n\n#### Avoid clutter in the X axis annotations\n- Use the `min_width_for_annotation` argument to hide annotations for chunks smaller than this value. \n```python\nsupervenn(sets, side_plots=False, min_width_for_annotation=100)\n```\n<img src=\"https://i.imgur.com/YdCmHtZ.png\" width=330>\n\n- Pass `rotate_col_annotations=True` to print chunk sizes vertically.\n\n- There's also `col_annotations_ys_count` argument, but it is **deprecated** and will be removed in a future version.\n\n#### Change bars appearance in the main plot\nUse arguments `bar_height` (default `1`), `bar_alpha` (default `0.6`), `bar_align` (default `edge`)', `color_cycle` (\ndefault is current style's default palette). You can also use styles, for example:\n```python\nimport matplotlib.pyplot as plt\nwith plt.style.context('bmh'):\n    supervenn([{1,2,3}, {3,4}])\n```\n<img src=\"https://i.imgur.com/yEUChI4.png\" width=\"330\">\n\n\n#### Change side plots size and color\nUse `side_plot_width` (in inches, default 1) and `side_plot_color` (default `'tab:gray'`) arguments.\n\n#### Change axes labels from `SETS`, `ITEMS` to something else\nJust use `plt.xlabel` and `plt.ylabel` as usual.\n\n#### Change other parameters\nOther arguments can be found in the docstring to the function. \n\n### Algorithm used to minimize gaps\nIf there are are no more than 8 chunks, the optimal permutation is found with exhaustive search (you can increase this\nlimit up to 12 using the `max_bruteforce_size` argument). For greater chunk counts, a randomized quasi-greedy algorithm\nis applied. The description of the algorithm can be found in the docstring to `supervenn._algorithms` module.\n\n### Less trivial examples: \n\n#### Words with many meanings\n\n```python\nletters = {'a', 'r', 'c', 'i', 'z'}\nprogramming_languages = {'python', 'r', 'c', 'c++', 'java', 'julia'}\nanimals = {'python', 'buffalo', 'turkey', 'cat', 'dog', 'robin'}\ngeographic_places = {'java', 'buffalo', 'turkey', 'moscow'}\nnames = {'robin', 'julia', 'alice', 'bob', 'conrad'}\ngreen_things = {'python', 'grass'}\nsets = [letters, programming_languages, animals, geographic_places, names, green_things]\nlabels = ['letters', 'programming languages', 'animals', 'geographic places',\n          'human names', 'green things']\nplt.figure(figsize=(10, 6))\nsupervenn(sets, labels , sets_ordering='minimize gaps')\n```\n<img src=\"https://i.imgur.com/hinM4I8.png\" width=400>\n\nAnd this is how the figure would look without the smart column reordering algorithm:\n<img src=\"https://i.imgur.com/sWFah6k.png\" width=400>\n\n#### Banana genome compared to 5 other species\n[Data courtesy of Jake R Conway, Alexander Lex, Nils Gehlenborg - creators of UpSet](https://github.com/hms-dbmi/UpSetR-paper/blob/master/bananaPlot.R)\n\nImage from [D’Hont, A., Denoeud, F., Aury, J. et al. The banana (Musa acuminata) genome and the evolution of\nmonocotyledonous plants](https://www.nature.com/articles/nature11241)\n\nFigure from original article (note that it is by no means proportional!):\n\n<img src=\"https://i.imgur.com/iQlcLVG.jpg\" width=650>\n\nFigure made with [UpSetR](https://caleydo.org/tools/upset/)\n\n<img src=\"https://i.imgur.com/DH72eJJ.png\" width=700>\n\nFigure made with supervenn (using the `widths_minmax_ratio` argument)\n\n```python\nplt.figure(figsize=(20, 10))\nsupervenn(sets_list, species_names, widths_minmax_ratio=0.1,\n          sets_ordering='minimize gaps', rotate_col_annotations=True, col_annotations_area_height=1.2)\n```\n<img src=\"https://i.imgur.com/1FGvOLu.png\" width=850>\n\nFor comparison, here's the same data visualized to scale (no `widths_minmax_ratio`, but argument\n`min_width_for_annotation` is used instead to avoid column annotations overlap):\n\n```python\nplt.figure(figsize=(20, 10))\nsupervenn(sets_list, species_names, rotate_col_annotations=True,\n          col_annotations_area_height=1.2, sets_ordering='minimize gaps',\n          min_width_for_annotation=180)\n\n```\n\n<img src=\"https://i.imgur.com/MgUqkL6.png\" width=850>\n\nIt must be noted that `supervenn` produces best results when there is some inherent structure to the sets in question.\nThis typically means that the number of non-empty intersections is significantly lower than the maximum possible\n(which is `2^n_sets - 1`). This is not the case in the present example, as 62 of the 63 intersections are non-empty, \nhence the results are not that pretty.\n\n#### Order IDs in requests to a multiple vehicle routing problem solver\nThis was actually my motivation in creating this package. The team I'm currently working in provides an API that solves\na variation of the Multiple Vehicles Routing Problem. The API solves tasks of the form\n\"Given 1000 delivery orders each with lat, lon, time window and weight, and 50 vehicles each with capacity and work\nshift, distribute the orders between the vehicles and build an optimal route for each vehicle\". \n\nA given client can send tens of such requests per day and sometimes it is useful to look at their requests and\nunderstand how they are related to each other in terms of what orders are included in each of the requests. Are they\nsending the same task over and over again  - a sign that they are not satisfied with routes they get and they might need\nour help in using the API? Are they manually editing the routes (a process that results in more requests to our API, with\nonly the orders from affected routes included)? Or are they solving for several independent order sets and are happy\nwith each individual result?\n\nWe can use `supervenn` with some custom annotations to look at sets of order IDs in each of the client's requests.\nHere's an example of an OK but not perfect client's workday:\n<img src=\"https://i.imgur.com/9YfRC61.png\" width=800>\n\nRows from bottom to top are requests to our API from earlier to later, represented by their sets of order IDs.\nWe see that they solved a big task at 10:54, were not satisfied with the result, and applied some manual edits until\n11:11. Then in the evening they re-solved the whole task twice over, probably with some change in parameters.\n\nHere's a perfect day:\n\n<img src=\"https://i.imgur.com/E2o2ela.png\" width=800>\n\nThey solved three unrelated tasks and were happy with each (no repeated requests, no manual edits; each order is\ndistributed only once).\n\nAnd here's a rather extreme example of a client whose scheme of operation involves sending requests to our API every\n15-30 minutes to account for live updates on newly created orders and couriers' GPS positions.\n\n<img src=\"https://i.imgur.com/vKxHOF7.jpg\" width=800>\n\n### Comparison to similar tools\n\n#### [matplotlib-venn](https://github.com/konstantint/matplotlib-venn) \nThis tool plots area-weighted Venn diagrams with circles for two or three sets. But the problem with circles\nis that they are pretty useless even in the case of three sets. For example, if one set is symmetrical difference of the\nother two:\n```python\nfrom matplotlib_venn import venn3\nset_1 = {1, 2, 3, 4}\nset_2 = {3, 4, 5}\nset_3 = set_1 ^ set_2\nvenn3([set_1, set_2, set_3], set_colors=['steelblue', 'orange', 'green'], alpha=0.8)\n```\n<img src=\"https://i.imgur.com/Mijyzj8.png\" width=260>\n\nSee all that zeros? This image makes little sense. The `supervenn`'s approach to this problem is to allow the sets to be\nbroken into separate parts, while trying to minimize the number of such breaks and guaranteeing exact proportionality of\nall parts:\n\n<img src=\"https://i.imgur.com/e3sMQrO.png\" width=400>\n\n\n#### [UpSetR and pyUpSet](https://caleydo.org/tools/upset/)\n<img src=\"https://raw.githubusercontent.com/ImSoErgodic/py-upset/master/pictures/basic.png\" width=800>\nThis approach, while very powerful, is less visual, as it displays, so to say only _statistics about_ the sets, not the\nsets in flesh.\n\n#### [pyvenn](https://raw.githubusercontent.com/wiki/tctianchi/pyvenn)\n<img src=\"https://raw.githubusercontent.com/wiki/tctianchi/pyvenn/venn6.png\" width=800>\nThis package produces diagrams for up to 6 sets, but they are not in any way proportional. It just has pre-set images\nfor every given sets count, your actual sets only affect the labels that are placed on top of the fixed image,\nnot unlike the banana diagram above. \n\n#### [RainBio](http://www.lesfleursdunormal.fr/static/appliweb/rainbio/index.html) ([article](https://hal.archives-ouvertes.fr/hal-02264217/document))\nThis approach is quite similar to supervenn. I'll let the reader decide which one does the job better:\n\n##### RainBio:\n\n<img src=\"https://i.imgur.com/jwQAltx.png\" width=400>\n\n\n##### supervenn:\n\n<img src=\"https://i.imgur.com/hinM4I8.png\" width=400>\n\n\n_Thanks to Dr. Bilal Alsallakh for referring me to this work_\n\n#### [Linear Diagram Generator](https://www.cs.kent.ac.uk/people/staff/pjr/linear/index.html?abstractDescription=programming_languages+1%0D%0Aletters+programming_languages+2%0D%0Aprogramming_languages+animals+green_things+1%0D%0Ageographic_places+1%0D%0Aletters+3%0D%0Ahuman_names+3%0D%0Agreen_things+1%0D%0Aprogramming_languages+geographic_places+1%0D%0Aanimals+2%0D%0Aanimals+geographic_places+2%0D%0Aanimals+human_names+1%0D%0Aprogramming_languages+human_names+1%0D%0A&width=700&height=250&guides=lines)\nThis tool has a similar concept, but only available as a Javascript web app with minimal functionality, and you have to\ncompute all the intersection sizes yourself. Apparently there is also an columns rearrangement algorithm in place, but\nthe target function (number of gaps within sets) is higher than in the diagram made with supervenn.\n<img src=\"https://i.imgur.com/tZN8QAb.png\" width=600>\n\n_Thanks to [u/aboutscientific](https://www.reddit.com/user/aboutscientific/) for the link._\n\n### Credits\nThis package was created and is maintained by [Fedor Indukaev](https://www.linkedin.com/in/fedor-indukaev-4a52961b/). \nYou can contact me on Gmail and Telegram by the same username as on github.\n\n### How can I help?\n- If you like supervenn, you can click the star at the top of the page and tell other people about this tool\n- If you have an idea or even an implementation of a algorithm for matrix columns rearrangement, I'll be happy to try\nit, as my current algorithm is quite primitive. (The problem in question is almost  the traveling salesman problem in\nHamming metric).\n- If you are a Python developer, you can help by reviewing the code in any way that is convenient to you.\n- If you found a bug or have a feature request, you can submit them via the \n[Issues section](https://github.com/gecko984/supervenn/issues).\n \n"
  },
  {
    "path": "setup.py",
    "content": "# -*- coding: utf-8 -*-\n\n# python setup.py sdist bdist_wheel\n# python2 setup.py sdist bdist_wheel\n# twine upload dist/*\n\n\nimport setuptools\n\n\nwith open('README.md') as f:\n    README = f.read()\n\nsetuptools.setup(\n    author='Fedor Indukaev',\n    author_email='gecko984@gmail.com',\n    name='supervenn',\n    license='MIT',\n    description='supervenn is a tool for visualization of relations of many sets using matplotlib',\n    version='0.5.0',\n    long_description='See https://github.com/gecko984/supervenn/blob/master/README.md',\n    url='https://github.com/gecko984/supervenn',\n    packages=setuptools.find_packages(),\n    install_requires=['numpy', 'matplotlib>=2.2.5', 'pandas'],\n    classifiers=[\n        'Development Status :: 3 - Alpha',\n        'License :: OSI Approved :: MIT License',\n        'Programming Language :: Python',\n        'Programming Language :: Python :: 2.7',\n        'Programming Language :: Python :: 3.6',\n        'Programming Language :: Python :: 3.7',\n        'Programming Language :: Python :: 3.8',\n        'Programming Language :: Python :: 3.9',\n        'Programming Language :: Python :: 3.10',\n        'Programming Language :: Python :: 3.11',\n        'Topic :: Scientific/Engineering :: Visualization',\n        'Topic :: Scientific/Engineering :: Information Analysis',\n        'Intended Audience :: Science/Research',\n        'Intended Audience :: Developers'\n    ]\n)\n"
  },
  {
    "path": "supervenn/__init__.py",
    "content": "from supervenn._algorithms import (\n    get_chunks_and_composition_array,\n    get_permutations\n)\nfrom supervenn._plots import supervenn\nfrom supervenn._utils import make_sets_from_chunk_sizes\n"
  },
  {
    "path": "supervenn/_algorithms.py",
    "content": "# -*- coding: utf-8 -*-\n\"\"\"\n    supervenn._algorithms\n    This module implements all the algorithms used to prepare data for plotting.\n    ~~~~~~~~~~~\n    A semi-formal explanation of what is going on here\n    ==================================================\n\n    CHUNKS\n\n    Consider a three-way Venn diagram, shown below using squares rather than the usual circles, for obvious reasons.\n    It is easy to see that the three squares S1, S2, S3 break each other into seven elementary undivided parts, which we\n    will name :chunks:. The 7 chunks are marked as ch1 - ch7.\n\n\n              *-------S1-------*\n              |                |\n              |           ch1  |\n        *-----|---------*      |\n        |     |  ch4    |      |\n        |     |         |      |\n        |     |   *-----|------|-----*\n        |     |   | ch7 | ch5  |     |\n       S2     |   |     |      |     |\n        |     *---|-----|------*     |\n        |         |     |            |\n        |  ch2    | ch6 |            S3\n        *---------|-----*            |\n                  |          ch3     |\n                  |                  |\n                  *------------------*\n\n    The number 7 is easily derived by the following simple consideration. Each chunk is defined by a unique\n    subset of {S1, S2, S3}. For example, chunk ch4 is defined by {S1, S2} - it consists of exactly the points that lie\n    inside S1 and S2, but outside of S3. So the number of chunks is 2^3 - 1, 2^3 being the number of possible subsets of\n    {S1, S2, S3}, and -1 being for the empty set which has no business being among our chunks.\n\n    Note that, depending on how the three squares are positioned, there can be less chunks then 7. For instance, if the\n    three squares are disjoint, the number of chunks will be 3, as each square will be a chunk in itself.\n\n    What is important for us about the chunks, is that they are like the elementary building blocks of our\n    configuration of squares. In other words, we can represent any of the squares as a disjoint union of some of the\n    chunks. Also any combination of the squares w.r.t. the set-theoretical operations of intersection, union and\n    difference can be represented in the same way.\n\n    Now let's move from squares on the plane to abstract finite sets. The same line of reasoning leads us to\n    the conclusion that for N sets S1, ..., SN, there exist at most 2^N - 1 chunks (sets), such that any of the sets\n    (and any combination of the sets w.r.t union, intersection and difference) can be uniquely represented as the\n    disjoint union of several of the chunks.\n\n    COMPOSITION ARRAY\n\n    Suppose we have N sets S1, ..., SN, and we have found the chunks ch1, ... chM. As we said above, each of the sets\n    can be uniquely represented as a disjoint union of some of the chunks. Suppose we have found such decompositions for\n    all of the sets.\n    Now, represent the decompositions as a N x M array of zeros and ones, where the i,j-th item is 1 <=> set S_i\n    contains chunk ch_j. We'll call this array the :composition array: of our sets.\n    For example, for the squares above, the composition array will look like this:\n\n       ch1  ch2  ch3  ch4  ch5  ch6  ch7\n    S1  1    0    0    1    1    0    1  ->  Square 1 is made up of chunks ch1, ch4, ch5 and ch7\n    S2  0    1    0    1    0    1    1\n    S3  0    0    1    0    1    1    1\n\n    Note that since chunks are something we've just made up by ourselves, they don't have to be ordered in any\n    particular way. This means that we can reorder the chunks (= the columns if the composition array) if we\n    need to.\n\n    Most part of the present module is actually about finding a permutations of the columns, that will minimize the\n    number of gaps between the 1's in all the rows of the array, so that the sets are visually broken into as few parts\n    as possible.\n\n    For the array above, we see that there are 5 total zero-filled gaps between ones: one in the first row, two in the\n    second, one in the third. But if we find and apply the right permutation of columns, we'll have only one gap between\n    the 1's in the whole array, highlighted by asterisks below:\n\n\n       ch1  ch4  ch2  ch6  ch7  ch5  ch3\n    S1  1    1   *0*   1    1    0    0\n    S2  0    1    1    1    1    0    0\n    S3  0    0    0    1    1    1    1\n\n    When there are few chunks (say, no more than 8) the optimal permutations is easily found in a bruteforce manner by\n    checking all the possible permutations. But for a greater number of chunks this would take too long. So instead an\n    approximate greedy algorithm with randomization is used. It is implemented in the run_randomized_greedy_algorithm()\n    function. The algorithm will be described below.\n\n    REORDERING THE ROWS / SETS\n\n    Unlike the order of chunks, the order of sets can be relevant in some cases. For example, our sets sets may relate\n    to different moments in time, and we may deem it important to preserve this temporal structure in the plot. But in\n    other cases when there is no inherent order to our sets, we might want to reorder the sets (rows of the composition\n    array) so that similar sets are closer to each other.\n\n    This is done by running the same algorithm on the transposed array. But there is an important distinction here: the\n    chunks are of different size, and it makes more sense to minimize total gap width, instead of just gap count.\n    To allow that, the run_randomized_greedy_algorithm() function accepts an additional argument named :row_weights:,\n    and minimizes gap counts weighted with these coefficients. In other words, the minimization target becomes\n    sum_i(gaps_count_in_row[i] * row_weights[i]) instead of just sum_i(gaps_count_in_row[i]).\n\n    DESCRIPTION OF THE ALGORITHM\n\n    Recall that our goal is to a find a permutation of columns of a matrix of zeros and ones, so that the row-weighted\n    sum of gaps counts in rows is as low as possible.\n\n    Define similarity of two columns as the sum of row weights for rows where the values are equal in the two columns.\n    Precompute matrix of similarities of all columns (if arr has N columns, this matrix has shape N x N)\n    Find two most similar columns C1, C2, initialize two lists [C1,] and [C2,]\n    Among the remaining N-2 columns, find one that has largest similarity to either of C1 or C2. Append that column to\n    the corresponding list. E.g. we now have lists [C1, C3] and [C2,]\n    Among the remaining N-3 columns find one that has largest similarity to the last elemt of either of the lists (in\n    our example, C3 and C2). Append it to the corresponding list.\n    Continue until all columns are distributed between in two lists.\n    Finally, one of the lists is reversed and the other is concatenated to it on the right, which gives the resulting\n    permutation. This concludes the greedy algorithm.\n\n    To mitigate the greediness, the similarity matrix is perturbed by means of adding a random matrix R with\n    elements {0, noise_value} with a fixed  probability of non-zero, and the whole greedy procedure is repeated.\n    This is done with :seeds: different random matrices, and so :seeds: + 1 permutations are obtained (including the one\n    produced from the unperturbed similarities matrix). The permutation that yields the smallest value\n    of count_runs_of_ones_in_rows(arr[:, permutation]) is returned.\n\"\"\"\nfrom collections import defaultdict\nimport datetime\nfrom itertools import permutations\nimport warnings\n\nimport numpy as np\n\nHUGE_NUMBER = 1e10 # can fail for weighted! FIXME\nDEFAULT_MAX_BRUTEFORCE_SIZE = 8\nBRUTEFORCE_SIZE_HARD_LIMIT = 12\nDEFAULT_SEEDS = 10000\nDEFAULT_NOISE_PROB = 0.0075\nDEFAULT_MAX_NOISE = 1.1\n\n\ndef get_total_gaps_in_rows(arr, row_weights=None):\n    \"\"\"\n    In a numpy.array arr, count how many gaps of zeros are there between contigous runs of non-zero values in each row.\n    The counts in each row are multiplied by weights given by row_weights array and summed. By default, all row weights\n    are equal to 1, so the returned value is just the total count of gaps in rows.\n    :param arr: 2D numpy.array, all that matters is which elements are zero and which are not.\n    :param row_weights: 1D numpy array with len equal to len(arr), provides weights for each row.\n    :return: weighted sum of number of gaps in rows.\n    \"\"\"\n    if row_weights is None:\n        row_weights = np.ones(len(arr), dtype=int)\n    if len(arr) != len(row_weights):\n        raise ValueError('len(row_weights) == {} != {} == len(arr)'.format(len(row_weights), len(arr)))\n\n    arr = arr.astype(bool)\n\n    # Shift array to the left, dropping last column and appending column of zeros on the left.\n    shifted_arr = np.concatenate([np.zeros(len(arr), dtype=bool).reshape((-1, 1)), arr[:, :-1]], 1)\n    rowwise_runs_counts = (arr & (~shifted_arr)).sum(1)\n    rowwise_gaps_counts = np.maximum(rowwise_runs_counts - 1, 0)\n\n    return rowwise_gaps_counts.dot(row_weights)\n\n\ndef break_into_chunks(sets):\n    \"\"\"\n    Let us have a collection {S_1, ..., S_n} of finite sets and U be the union of all these sets.\n    For a given subset C = {i_1, ..., i_k} of indices {1, ..., n}, define the 'chunk', corresponding to C, as the set\n    of elements of U, that belong to S_i_1, ..., S_i_k, but not to any of the other sets.\n    For example, for a collection of two sets {S_1, S_2}, there can be max three chunks: S_1 - S_2, S_2 - S_1, S1 & S_2.\n    For three sets, there can be max 7 chunks (imagine a generic three-way Venn diagram and count how many different\n    area colors it can have).\n    In general, the number of possible non-empty chunks for a collection of n sets is equal to min(|U|, 2^n - 1).\n    Any chunk either lies completely inside any or completely outside any of the sets S_1, ... S_n.\n\n    This function takes a list of sets as its only argument and returns a dict with frozensets of indices as keys and\n    chunks as values.\n    :param sets: list of sets\n    :return: chunks_dict - dict with frozensets as keys and sets as values.\n    \"\"\"\n    if not sets:\n        raise ValueError('Sets list is empty.')\n\n    all_items = set.union(*sets)\n\n    if not all_items:\n        raise ValueError('All sets are empty')\n\n    # Each chunk is characterized by its occurrence pattern, which is a unique subset of indices of our sets.\n    # E.g. chunk with signature {1, 2, 5} is exactly the set of items such that they belong to sets 1, 2, 5, and\n    # don't belong to any of the other sets.\n    # Build a dict with signatures as keys (as frozensets), and lists of items as values,\n    chunks_dict = defaultdict(set)\n    for item in all_items:\n        occurrence_pattern = frozenset({i for i, set_ in enumerate(sets) if item in set_})\n        chunks_dict[occurrence_pattern].add(item)\n    return dict(chunks_dict)\n\n\ndef get_chunks_and_composition_array(sets):\n    \"\"\"\n    Take\n    - list of all chunks (each chunk is a set of items)\n    - a numpy.array A of zeros and ones with len(sets) rows and len(chunks) columns,\n    where A[i, j] == 1 <=> sets[i] includes chunks[j].\n    :param sets: list of sets\n    :return: chunks - list of sets, arr - numpy.array, chunks_dict - dict w\n    \"\"\"\n    chunks_dict = break_into_chunks(sets)\n    chunks_count = len(chunks_dict)\n    chunks = []\n    arr = np.zeros((len(sets), chunks_count), dtype=int)\n\n    for idx, (sets_indices, items) in enumerate(chunks_dict.items()):\n        chunks.append(items)\n        arr[list(sets_indices), idx] = 1\n\n    return chunks, arr\n\n\ndef find_best_columns_permutation_bruteforce(arr, row_weights=None):\n    \"\"\"\n    Using exhaustive search, find permutation of columns of np.array arr that will provide minimum value of weighted sum\n    of gaps counts in rows of arr. Will take unreasonably long time if arr has > 10 cols.\n    :param arr: 2D numpy.array.  All that matters is which elements are zero and which are not\n    :param row_weights: 1D numpy array with len equal to len(arr). Provides weights for each row.\n    :return: optimal permutation as a list of column indices.\n    \"\"\"\n    if arr.shape[1] > BRUTEFORCE_SIZE_HARD_LIMIT:\n        raise ValueError('Bruteforce ordering method accepts max {} columns, got {} instead. It would take too long.'\n                         .format(BRUTEFORCE_SIZE_HARD_LIMIT, arr.shape[1]))\n    best_permutation = None\n    best_total_gaps = HUGE_NUMBER\n    for permutation in permutations(range(arr.shape[1])):\n        total_gaps = get_total_gaps_in_rows(arr[:, permutation], row_weights=row_weights)\n        if total_gaps < best_total_gaps:\n            best_permutation = permutation\n            best_total_gaps = total_gaps\n    return list(best_permutation)\n\n\ndef columns_similarities_matrix(arr, row_weights=None):\n    \"\"\"\n    Let A and B be two 1D arrays with zeros and ones, of same shape, and row_weigths a non-negative array of same shape.\n    Define weighted similarity of A and B w.r.t. row weights as the sum of elements of row_weights in positions where\n    A and B have equal values.\n    So, given a M x N array arr, this function computes the N x N matrix of weighted similarities of its columns.\n    All weights are equal to 1 by default, in which case the similarity of two columns is equal to the number of\n    positions where values coincide in the two columns.\n    :param arr: M x N array with zeros and ones\n    :param row_weights: 1D numpy array with len equal to len(arr). Provides weights for each row.\n    :return: N x N array. with weighted similarities of columns of arr as defined above.\n    \"\"\"\n\n    if row_weights is None:\n        row_weights = np.ones(len(arr), dtype=int)\n    if len(arr) != len(row_weights):\n        raise ValueError('len(row_weights) must be equal to number of rows of arr')\n\n    return (arr.T * row_weights).dot(arr) + ((1 - arr).T * row_weights).dot(1 - arr)\n\n\ndef find_columns_permutation_greedily(similarities):\n    \"\"\"\n    Given an array of columns similarities, use a greedy algorithm to try and find a permutation of columns that\n    will lower the number of gaps in rows.\n    :param similarities: numpy.array of column similarities.\n    :return: a permutation (as a list of indices) of column indices, producing lower value of count_runs\n    \"\"\"\n\n    if len(similarities) == 1:\n        return [0]\n\n    # fill diagonal with negative value so that same column is never paired with itself, even if all similarities\n    # between different columns are zero.\n    similarities = similarities.copy()\n    np.fill_diagonal(similarities, -1)\n\n    ncols = similarities.shape[0]\n\n    # define placed_flags[i] == 1 <=> i-th column of arr has already been assinged a place in the permutation\n    placed_flags = np.zeros(ncols, dtype=int)\n\n    # find two most similar columns. Initialize two lists with tham.\n    first_col_index, second_col_index = np.unravel_index(similarities.argmax(), similarities.shape)\n    first_tail = [first_col_index]\n    second_tail = [second_col_index]\n    placed_flags[first_col_index] = 1\n    placed_flags[second_col_index] = 1\n\n    # the main greedy loop\n    for _ in range(ncols - 2):\n        # find column most similar to the last element of any of the two lists, append it to the corresponding list\n        similarities_to_first = similarities[:, first_tail[-1]] - placed_flags * HUGE_NUMBER\n        similarities_to_second = similarities[:, second_tail[-1]] - placed_flags * HUGE_NUMBER\n        first_argmax = similarities_to_first.argmax()\n        second_argmax = similarities_to_second.argmax()\n        if similarities_to_first[first_argmax] >= similarities_to_second[second_argmax]:\n            idx = first_argmax\n            first_tail.append(idx)\n        else:\n            idx = second_argmax\n            second_tail.append(idx)\n\n        # mark the chosen column as placed\n        placed_flags[idx] = 1\n\n    permutation = second_tail[::-1] + first_tail\n\n    return permutation\n\n\n# todo rename to reflect what it does, not only how it does it\ndef run_greedy_algorithm_on_composition_array(arr, row_weights=None):\n    \"\"\"\n    Given a composition array, use a greedy algorithm to find a permutation of columns that will try to minimize the\n    total weighted number of gaps in the rows of the array.\n    :param arr: numpy array with zeros and ones\n    :param row_weights: 1D numpy array with len equal to len(arr). Provides weights for each row.\n    :return:\n    \"\"\"\n    similarities = columns_similarities_matrix(arr, row_weights=row_weights)\n    return find_columns_permutation_greedily(similarities)\n\n\n# todo rename to reflect what it does, not only how it does it\ndef run_randomized_greedy_algorithm(arr, row_weights=None, seeds=DEFAULT_SEEDS, noise_prob=DEFAULT_NOISE_PROB):\n    \"\"\"\n    For a 2D np.array arr, find a permutation of columns that approximately minimizes the row-weighted number of gaps in\n    the rows of the permuted array. An approximate randomized greedy algorithm is used.\n\n\n\n    :param arr: np.array\n    :param seeds: number of different random perturbations to figh sticking in local minima.\n    :return: a list of indices representing an approximately optimal permutation of columns of arr\n    \"\"\"\n\n    arr = arr.astype(int)\n\n    # Compute similarities matrix\n    similarities = columns_similarities_matrix(arr, row_weights=row_weights)\n\n    best_found_permutation = None\n    best_found_gaps_count = HUGE_NUMBER\n\n    for seed in range(seeds + 1):\n        # Perturb similarities matrix if seed != 0\n        if seed == 0:\n            noise = np.zeros_like(similarities)\n        else:\n            np.random.seed(seed)\n            noise = (np.random.uniform(0, 1, size=similarities.shape) < noise_prob).astype(int)\n            np.random.seed(datetime.datetime.now().microsecond)\n\n        np.fill_diagonal(noise, 0)\n        noisy_similarities = similarities + noise\n\n        permutation = find_columns_permutation_greedily(noisy_similarities)\n\n        # Compute the target value for the found permutation, compare with current best.\n        gaps = get_total_gaps_in_rows(arr[:, permutation], row_weights=row_weights)\n        if gaps < best_found_gaps_count:\n            best_found_permutation = permutation\n            best_found_gaps_count = gaps\n\n    return best_found_permutation\n\n\ndef get_permutations(chunks, composition_array, chunks_ordering='minimize gaps', sets_ordering=None,\n                     reverse_chunks_order=True, reverse_sets_order=True,\n                     max_bruteforce_size=DEFAULT_MAX_BRUTEFORCE_SIZE,\n                     seeds=DEFAULT_SEEDS, noise_prob=DEFAULT_NOISE_PROB):\n    \"\"\"\n    Given chunks and composition array, get permutations which will order the chunks and the sets according to specified\n    ordering methods.\n    :param chunks, composition_array: as returned by get_chunks_and_composition_array\n    For explanation of other params, see docstring to supervenn()\n    :return: a dict of the form {'chunks_ordering': [3, 2, 5, 4, 1, 6, 0], 'sets_ordering': [2, 0, 1, 3]}\n    \"\"\"\n    chunk_sizes = [len(chunk) for chunk in chunks]\n    set_sizes = composition_array.dot(np.array(chunk_sizes))\n\n    chunks_case = {\n        'sizes': chunk_sizes,\n        'param': 'chunks_ordering',\n        'array': composition_array,\n        'row_weights': None,\n        'ordering': chunks_ordering,\n        'allowed_orderings': ['size', 'occurrence', 'random', 'minimize gaps'] + ['occurence'],  # todo remove with typo\n        'reverse': reverse_chunks_order\n    }\n\n    if chunks_ordering == 'occurence':\n        warnings.warn('Please use chunks_ordering=\"occurrence\" (with double \"r\") instead of \"occurence\" (spelling fixed'\n                      'in 0.3.0). The incorrect variant is still supported, but will be removed in a future version')\n\n    sets_case = {\n        'sizes': set_sizes,\n        'param': 'sets_ordering',\n        'array': composition_array.T,\n        'row_weights': chunk_sizes,\n        'ordering': sets_ordering,\n        'allowed_orderings': ['size', 'chunk count', 'random', 'minimize gaps', None],\n        'reverse': reverse_sets_order\n    }\n\n    permutations_ = {}\n\n    for case in chunks_case, sets_case:\n        if case['ordering'] not in case['allowed_orderings']:\n            raise ValueError('Unknown {}: {} (should be one of {})'\n                             .format(case['param'], case['ordering'], case['allowed_orderings']))\n\n        if case['ordering'] == 'size':\n            permutation = np.argsort(case['sizes'])\n        elif case['ordering'] in ['occurrence', 'chunk count'] + ['occurence']:\n            permutation = np.argsort(case['array'].sum(0))\n        elif case['ordering'] == 'random':\n            permutation = np.array(range(len(case['sizes'])))\n            np.random.shuffle(permutation)\n        elif case['ordering'] is None:\n            permutation = np.array(range(len(case['sizes'])))\n        elif case['ordering'] == 'minimize gaps':\n            if len(case['sizes']) <= min(max_bruteforce_size, BRUTEFORCE_SIZE_HARD_LIMIT):\n                permutation = find_best_columns_permutation_bruteforce(case['array'], row_weights=case['row_weights'])\n            else:\n                permutation = run_randomized_greedy_algorithm(case['array'], seeds=seeds, noise_prob=noise_prob,\n                                                              row_weights=case['row_weights'])\n        else:\n            raise ValueError(case['ordering'])\n\n        if case['ordering'] is not None and case['reverse']:\n            permutation = permutation[::-1]\n\n        permutations_[case['param']] = permutation\n\n    return permutations_\n\n\ndef _get_ordered_chunks_and_composition_array(sets, **kwargs):\n    \"\"\"\n    Wrapper for get_permutations, used only for testing\n    :param sets: list of sets\n    :param kwargs: all arguments to get_permutations except for chunks and composition_array,\n    :return:\n    \"\"\"\n    chunks, composition_array = get_chunks_and_composition_array(sets)\n\n    permutations_ = get_permutations(chunks, composition_array, **kwargs)\n\n    ordered_chunks = [chunks[i] for i in permutations_['chunks_ordering']]\n    ordered_composition_array = composition_array[:, permutations_['chunks_ordering']]\n    ordered_composition_array = ordered_composition_array[permutations_['sets_ordering'], :]\n\n    return ordered_chunks, ordered_composition_array\n"
  },
  {
    "path": "supervenn/_plots.py",
    "content": "# -*- coding: utf-8 -*-\n\"\"\"\nRoutines for plotting multiple sets.\n\"\"\"\n\nfrom functools import partial\nimport warnings\n\nimport numpy as np\nimport matplotlib.gridspec as gridspec\nimport matplotlib.pyplot as plt\n\nfrom supervenn._algorithms import (\n    break_into_chunks,\n    get_chunks_and_composition_array,\n    get_permutations,\n    DEFAULT_MAX_BRUTEFORCE_SIZE,\n    DEFAULT_SEEDS,\n    DEFAULT_NOISE_PROB)\n\n\nDEFAULT_FONTSIZE = 12\n\n\nclass SupervennPlot(object):\n    \"\"\"\n    Attributes\n    ----------\n    axes: `dict\n        a dict containing all the plot's axes under descriptive keys: 'main', 'top_side_plot', 'right_side_plot',\n        'unused' (the small empty square in the top right corner)\n    figure: matplotlib.figure.Figure\n        figure containing the plot.\n    chunks: dict\n        a dictionary allowing to get the contents of chunks. It has frozensets of key indices as keys and chunks\n        as values.\n\n    Methods\n    -------\n    get_chunk(set_indices)\n        get a chunk by the indices of its defining sets without them to a frozenset first\n   \"\"\"\n\n    def __init__(self, axes, figure, chunks_dict):\n        self.axes = axes\n        self.figure = figure\n        self.chunks = chunks_dict\n\n    def get_chunk(self, set_indices):\n        \"\"\"\n        Get the contents of a chunk defined by the sets indicated by sets_indices. Indices refer to the original sets\n        order as it was passed to supervenn() function (any reordering of sets due to use of sets_ordering params is\n        ignored).\n        For example .get_chunk_by_set_indices([1,5]) will return the chunk containing all the items that are in\n        sets[1] and sets[5], but not in any of the other sets.\n        supervenn() function, the\n        :param set_indices: iterable of integers, referring to positions in sets list, as passed into supervenn().\n        :return: chunk with items, that are in each of the sets with indices from set_indices, but not in any of the\n        other sets.\n        \"\"\"\n        return self.chunks[frozenset(set_indices)]\n\n\ndef get_alternated_ys(ys_count, low, high):\n    \"\"\"\n    A helper function generating y-positions for x-axis annotations, useful when some annotations positioned along the\n    x axis are too crowded.\n    :param ys_count: integer from 1 to 3.\n    :param low: lower bound of the area designated for annotations\n    :param high: higher bound for thr area designated for annotations.\n    :return:\n    \"\"\"\n    if ys_count not in [1, 2, 3]:\n        raise ValueError('Argument ys_count should be 1, 2 or 3.')\n    if ys_count == 1:\n        coefs = [0.5]\n        vas = ['center']\n    elif ys_count == 2:\n        coefs = [0.15, 0.85]\n        vas = ['bottom', 'top']\n    else:\n        coefs = [0.15, 0.5, 0.85]\n        vas = ['bottom', 'center', 'top']\n\n    ys = [low + coef * (high - low) for coef in coefs]\n\n    return ys, vas\n\n\ndef plot_binary_array(arr, ax=None, col_widths=None, row_heights=None, min_width_for_annotation=1,\n                      row_annotations=None, row_annotations_y=0.5,\n                      col_annotations=None, col_annotations_area_height=0.75, col_annotations_ys_count=1,\n                      rotate_col_annotations=False,\n                      color_by='row', bar_height=1, bar_alpha=0.6, bar_align='edge', color_cycle=None,\n                      alternating_background=True, fontsize=DEFAULT_FONTSIZE):\n    \"\"\"\n    Visualize a binary array as a grid with variable sized columns and rows, where cells with 1 are filled using bars\n    and cells with 0 are blank.\n    :param arr: numpy.array of zeros and ones\n    :param ax: axis to plot into (current axis by default)\n    :param col_widths: widths for grid columns, must have len equal to arr.shape[1]\n    :param row_heights: heights for grid rows, must have len equal to arr.shape[0]\n    :param min_width_for_annotation: don't annotate column with its size if size is less than this value (default 1)\n    :param row_annotations: annotations for each row, plotted in the middle of the row\n    :param row_annotations_y: a number in (0, 1), position for row annotations in the row. Default 0.5 - center of row.\n    :param col_annotations: annotations for columns, plotted in the bottom, below the x axis.\n    :param col_annotations_area_height: height of area for column annotations in inches, 1 by default\n    :param col_annotations_ys_count: 1 (default), 2, or 3 - use to reduce clutter in column annotations area\n    :param rotate_col_annotations: True / False\n    :param color_by: 'row' (default) or 'column'. If 'row', all cells in same row are same color, etc.\n    :param bar_height: height of cell fill as a fraction of row height, a number in (0, 1).\n    :param bar_alpha: alpha for cell fills.\n    :param bar_align: vertical alignment of bars, 'edge' (defaulr) or 'center'. Only matters when bar_height < 1.\n    :param color_cycle: a list of colors, given as names of matplotlib named colors, or hex codes (e.g. '#1f77b4')\n    :param alternating_background: True (default) / False - give avery second row a slight grey tint\n    :param fontsize: font size for annotations (default {}).\n    \"\"\".format(DEFAULT_FONTSIZE)\n    if row_heights is None:\n        row_heights = [1] * arr.shape[0]\n\n    if col_widths is None:\n        col_widths = [1] * arr.shape[1]\n\n    if len(row_heights) != arr.shape[0]:\n        raise ValueError('len(row_heights) doesnt match number of rows of array')\n\n    if len(col_widths) != arr.shape[1]:\n        raise ValueError('len(col_widths) doesnt match number of columns of array')\n\n    allowed_argument_values = {\n        'bar_align': ['center', 'edge'],\n        'color_by': ['row', 'column'],\n        'col_annotations_ys_count': [1, 2, 3],\n    }\n\n    for argument_name, allowed_argument_values in allowed_argument_values.items():\n        if locals()[argument_name] not in allowed_argument_values:\n            raise ValueError('Argument {} should be one of {}'.format(argument_name, allowed_argument_values))\n\n    if not 0 <= row_annotations_y <= 1:\n        raise ValueError('row_annotations_y should be a number between 0 and 1')\n\n    if color_cycle is None:\n        color_cycle = plt.rcParams['axes.prop_cycle'].by_key()['color']\n\n    grid_xs = np.insert(np.cumsum(col_widths), 0, 0)[:-1]\n    grid_ys = np.insert(np.cumsum(row_heights), 0, 0)[:-1]\n\n    if ax is not None:\n        plt.sca(ax)\n\n    # BARS\n    for row_index, (row, grid_y, row_height) in enumerate(zip(arr, grid_ys, row_heights)):\n\n        bar_y = grid_y + 0.5 * row_height if bar_align == 'center' else grid_y\n\n        # alternating background\n        if alternating_background and row_index % 2:\n            plt.barh(y=bar_y, left=0, width=sum(col_widths), height=bar_height * row_height, align=bar_align,\n                     color='grey', alpha=0.15)\n\n        for col_index, (is_filled, grid_x, col_width) in enumerate(zip(row, grid_xs, col_widths)):\n            if is_filled:\n                color_index = row_index if color_by == 'row' else col_index\n                color = color_cycle[color_index % len(color_cycle)]\n                plt.barh(y=bar_y, left=grid_x, width=col_width, height=bar_height * row_height, align=bar_align,\n                         color=color, alpha=bar_alpha)\n\n    # ROW ANNOTATIONS\n    if row_annotations is not None:\n        for row_index, (grid_y, row_height, annotation) in enumerate(zip(grid_ys, row_heights, row_annotations)):\n            annot_y = grid_y + row_annotations_y * row_height\n            plt.annotate(str(annotation), xy=(0.5 * sum(col_widths), annot_y),\n                         ha='center', va='center', fontsize=fontsize)\n\n    # COL ANNOTATIONS\n    min_y = 0\n    if col_annotations is not None:\n\n        min_y = - 1.0 * col_annotations_area_height / plt.gcf().get_size_inches()[1] * arr.shape[0]\n        plt.axhline(0, c='k')\n\n        annot_ys, vas = get_alternated_ys(col_annotations_ys_count, min_y, 0)\n\n        for col_index, (grid_x, col_width, annotation) in enumerate(zip(grid_xs, col_widths, col_annotations)):\n            annot_y = annot_ys[col_index % len(annot_ys)]\n            if col_width >= min_width_for_annotation:\n                plt.annotate(str(annotation), xy=(grid_x + col_width * 0.5, annot_y),\n                             ha='center', va=vas[col_index % len(vas)], fontsize=fontsize,\n                             rotation=90 * rotate_col_annotations)\n\n    plt.xlim(0, sum(col_widths))\n    plt.ylim(min_y, sum(row_heights))\n    plt.xticks(grid_xs, [])\n    plt.yticks(grid_ys, [])\n    plt.grid(True)\n\n\ndef side_plot(values, widths, orient, fontsize=DEFAULT_FONTSIZE, min_width_for_annotation=1, rotate_annotations=False,\n              color='tab:gray'):\n    \"\"\"\n    Barplot with multiple bars of variable width right next to each other, with an option to rotate the plot 90 degrees.\n    :param values: the values to be plotted.\n    :param widths: Widths of bars\n    :param orient: 'h' / 'horizontal' (default) or 'v' / 'vertical'\n    :param fontsize: font size for annotations\n    :param min_width_for_annotation: for horizontal plot, don't annotate bars of widths less than this value (to avoid\n    clutter. Default 1 - annotate all.)\n    :param rotate_annotations: True/False, whether to print annotations vertically instead of horizontally\n    :param color: color of bars, default 'tab:gray'\n    \"\"\"\n    bar_edges = np.insert(np.cumsum(widths), 0, 0)\n    annotation_positions = [0.5 * (begin + end) for begin, end in zip(bar_edges[:-1], bar_edges[1:])]\n    max_value = max(values)\n    if orient in ['h', 'horizontal']:\n        horizontal = True\n        plt.bar(x=bar_edges[:-1], height=values, width=widths, align='edge', alpha=0.5, color=color)\n        ticks = plt.xticks\n        lim = plt.ylim\n    elif orient in ['v', 'vertical']:\n        horizontal = False\n        plt.barh(y=bar_edges[:-1], width=values, height=widths, align='edge', alpha=0.5, color=color)\n        ticks = plt.yticks\n        lim = plt.xlim\n    else:\n        raise ValueError('Unknown orient: {} (should be \"h\" or \"v\")'.format(orient))\n\n    for i, (annotation_position, value, width) in enumerate(zip(annotation_positions, values, widths)):\n        if width < min_width_for_annotation and horizontal:\n            continue\n        x, y = 0.5 * max_value, annotation_position\n        if horizontal:\n            x, y = y, x\n        plt.annotate(value, xy=(x, y), ha='center', va='center',\n                     rotation=rotate_annotations * 90, fontsize=fontsize)\n\n    ticks(bar_edges, [])\n    lim(0, max(values))\n    plt.grid(True)\n\n\ndef get_widths_balancer(widths, minmax_ratio=0.02):\n    \"\"\"\n    Given a list of positive numbers, find a linear function, such that when applied to the numbers, the maximum value\n    remains the same, and the minimum value is minmax_ratio times the maximum value.\n    :param widths: list of numbers\n    :param minmax_ratio: the desired max / min ratio in the transformed list.\n    :return: a linear function with one float argument that has the above property\n    \"\"\"\n    if not 0 <= minmax_ratio <= 1:\n        raise ValueError('minmax_ratio must be between 0 and 1')\n    max_width = max(widths)\n    min_width = min(widths)\n    if 1.0 * min_width / max_width >= minmax_ratio:\n        slope = 1\n        intercept = 0\n    else:\n        slope = max_width * (1.0 - minmax_ratio) / (max_width - min_width)\n        intercept = max_width * (max_width * minmax_ratio - min_width) / (max_width - min_width)\n\n    def balancer(width):\n        return slope * width + intercept\n\n    return balancer\n\n\ndef remove_ticks(ax):\n    ax.set_xticks([])\n    ax.set_yticks([])\n\n\ndef setup_axes(side_plots, figsize=None, dpi=None, ax=None, side_plot_width=1.5):\n    \"\"\"\n    Set up axes for plot and return them in a dictionary. The dictionary may include the following keys:\n    - 'main': always present\n    - 'top_side_plot': present if side_plots = True, 'both' or 'top'\n    - 'right_side_plot': present if side_plots = True, 'both' or 'right'\n    - 'unused': present if side_plots = 'True' or 'both' (unused area in the top right corner)\n    :param side_plots: True / False / 'top' / 'right'\n    :param figsize: deprecated, will be removed in future versions\n    :param dpi: deprecated, will be removed in future versions\n    :param ax: optional encasing axis to plot into, default None - plot into current axis.\n    :param side_plot_width: side plots width in inches, default 1.5\n    :return: dict with string as keys and axes as values, as described above.\n    \"\"\"\n\n    if side_plots not in (True, False, 'top', 'right'):\n        raise ValueError('Incorrect value for side_plots: {}'.format(side_plots))\n\n    # Define and optionally create the encasing axis for plot according to arguments\n    if ax is None:\n        if figsize is not None or dpi is not None:\n            plt.figure(figsize=figsize, dpi=dpi)\n        supervenn_ax = plt.gca()\n    else:\n        supervenn_ax = ax\n\n    # if no side plots, there is only one axis\n    if not side_plots:\n        axes = {'main': supervenn_ax}\n\n    # if side plots are used, break encasing axis into four smaller axis using matplotlib magic and store them in a dict\n    else:\n        bbox = supervenn_ax.get_window_extent().transformed(supervenn_ax.get_figure().dpi_scale_trans.inverted())\n        plot_width, plot_height = bbox.width, bbox.height\n        width_ratios = [plot_width - side_plot_width, side_plot_width]\n        height_ratios = [side_plot_width, plot_height - side_plot_width]\n        fig = supervenn_ax.get_figure()\n        get_gridspec = partial(gridspec.GridSpecFromSubplotSpec, subplot_spec=supervenn_ax.get_subplotspec(),\n                               hspace=0, wspace=0)\n\n        if side_plots == True:\n\n            gs = get_gridspec(2, 2, height_ratios=height_ratios, width_ratios=width_ratios)\n\n            axes = {\n                'main': fig.add_subplot(gs[1, 0]),\n                'top_side_plot': fig.add_subplot(gs[0, 0]),\n                'unused': fig.add_subplot(gs[0, 1]),\n                'right_side_plot': fig.add_subplot(gs[1, 1])\n            }\n\n        elif side_plots == 'top':\n            gs = get_gridspec(2, 1, height_ratios=height_ratios)\n            axes = {\n                'main': fig.add_subplot(gs[1, 0]),\n                'top_side_plot': fig.add_subplot(gs[0, 0])\n            }\n\n        elif side_plots == 'right':\n            gs = get_gridspec(1, 2, width_ratios=width_ratios)\n            axes = {\n                'main': fig.add_subplot(gs[0, 0]),\n                'right_side_plot': fig.add_subplot(gs[0, 1])\n            }\n\n    # Remove tick from every axis, and set ticks length to 0 (we'll add ticks to the side plots manually later)\n    for ax in axes.values():\n        remove_ticks(ax)\n        ax.tick_params(which='major', length=0)\n    remove_ticks(supervenn_ax)  # if side plots are used, supervenn_ax isn't included in axes dict\n\n    return axes\n\n\ndef supervenn(sets, set_annotations=None, figsize=None, side_plots=True,\n              chunks_ordering='minimize gaps', sets_ordering=None,\n              reverse_chunks_order=True, reverse_sets_order=True,\n              max_bruteforce_size=DEFAULT_MAX_BRUTEFORCE_SIZE, seeds=DEFAULT_SEEDS, noise_prob=DEFAULT_NOISE_PROB,\n              side_plot_width=1, min_width_for_annotation=1, widths_minmax_ratio=None, side_plot_color='gray',\n              dpi=None, ax=None, **kw):\n    \"\"\"\n    Plot a diagram visualizing relationship of multiple sets.\n    :param sets: list of sets\n    :param set_annotations: list of annotations for the sets\n    :param figsize: figure size\n    :param side_plots: True / False: add small barplots on top and on the right. On top, for each chunk it is shown,\n    how many sets does this chunk lie inslde. On the right, set sizes are shown.\n    :param chunks_ordering: method of ordering the chunks (columns of the grid)\n        - 'minimize gaps' (default): use a smart algorithm to find an order of columns giving fewer gaps in each row,\n            making the plot as readable as possible.\n        - 'size': bigger chunks go first (or last if reverse_chunks_order=False)\n        - 'occurence': chunks that are in most sets go first (or last if reverse_chunks_order=False)\n        - 'random': randomly shuffle the columns\n    :param sets_ordering: method of ordering the sets (rows of the grid)\n        - None (default): keep the order as it is passed\n        - 'minimize gaps': use a smart algorithm to find an order of rows giving fewer gaps in each column\n        - 'size': bigger sets go first (or last if reverse_sets_order = False)\n        - 'chunk count': sets that contain most chunks go first (or last if reverse_sets_order = False)\n        - 'random': randomly shuffle\n    :param reverse_chunks_order: True (default) / False when chunks_ordering is \"size\" or \"occurence\",\n        chunks with bigger corresponding property go first if reverse_chunks_order=True, smaller go first if False.\n    :param reverse_sets_order: True / False, works the same way as reverse_chunks_order\n    :param max_bruteforce_size: maximal number of items for which bruteforce method is applied to find permutation\n    :param seeds: number of different random seeds for the randomized greedy algorithm to find permutation\n    :param noise_prob: probability of given element being equal to 1 in the noise array for randomized greedy algorithm\n    :param side_plot_width: width of side plots in inches (default 1.5)\n    :param side_plot_color: color of bars in side plots, default 'gray'\n    :param dpi: figure DPI\n    :param ax: axis to plot into. If ax is specified, figsize and dpi will be ignored.\n    :param min_width_for_annotation: for horizontal plot, don't annotate bars of widths less than this value (to avoid\n    clutter)\n    :param widths_minmax_ratio: desired max/min ratio of displayed chunk widths, default None (show actual widths)\n    :param rotate_col_annotations: True / False, whether to print annotations vertically\n    :param fontsize: font size for all text elements\n    :param row_annotations_y: a number in (0, 1), position for row annotations in the row. Default 0.5 - center of row.\n    :param col_annotations_area_height: height of area for column annotations in inches, 1 by default\n    :param col_annotations_ys_count: 1 (default), 2, or 3 - use to reduce clutter in column annotations area\n    :param color_by: 'row' (default) or 'column'. If 'row', all cells in same row are same color, etc.\n    :param bar_height: height of cell fill as a fraction of row height, a number in (0, 1).\n    :param bar_alpha: alpha for cell fills.\n    :param bar_align: vertical alignment of bars, 'edge' (default) or 'center'. Only matters when bar_height < 1.\n    :param color_cycle: a list of set colors, given as names of matplotlib named colors, or hex codes (e.g. '#1f77b4')\n    :param alternating_background: True (default) / False - give avery second row a slight grey tint\n\n    :return: SupervennPlot instance with attributes `axes`, `figure`, `chunks`\n        and method `get_chunk(set_indices)`. See docstring to returned object.\n    \"\"\"\n\n    if figsize is not None or dpi is not None:\n        warnings.warn(\n            'Parameters figsize and dpi of supervenn() are deprecated and will be removed in a future version.\\n'\n            'Instead of this:\\n'\n            '    supervenn(sets, figsize=(8, 5), dpi=90)'\n            '\\nPlease either do this:\\n'\n            '    plt.figure(figsize=(8, 5), dpi=90)\\n'\n            '    supervenn(sets)\\n'\n            'or plot into an existing axis by passing it as the ax argument:\\n'\n            '    supervenn(sets, ax=my_axis)\\n'\n        )\n\n    axes = setup_axes(side_plots, figsize, dpi, ax, side_plot_width)\n\n    if set_annotations is None:\n        set_annotations = ['set_{}'.format(i) for i in range(len(sets))]\n\n    chunks, composition_array = get_chunks_and_composition_array(sets)\n\n    # Find permutations of rows and columns\n    permutations_ = get_permutations(\n        chunks,\n        composition_array,\n        chunks_ordering=chunks_ordering,\n        sets_ordering=sets_ordering,\n        reverse_chunks_order=reverse_chunks_order,\n        reverse_sets_order=reverse_sets_order,\n        max_bruteforce_size=max_bruteforce_size,\n        seeds=seeds,\n        noise_prob=noise_prob)\n\n    # Apply permutations\n    chunks = [chunks[i] for i in permutations_['chunks_ordering']]\n    composition_array = composition_array[:, permutations_['chunks_ordering']]\n    composition_array = composition_array[permutations_['sets_ordering'], :]\n    set_annotations = [set_annotations[i] for i in permutations_['sets_ordering']]\n\n    # Main plot\n    chunk_sizes = [len(chunk) for chunk in chunks]\n\n    if widths_minmax_ratio is not None:\n        widths_balancer = get_widths_balancer(chunk_sizes, widths_minmax_ratio)\n        col_widths = [widths_balancer(chunk_size) for chunk_size in chunk_sizes]\n        effective_min_width_for_annotation = widths_balancer(min_width_for_annotation)\n    else:\n        col_widths = chunk_sizes\n        effective_min_width_for_annotation = min_width_for_annotation\n\n    plot_binary_array(\n        arr=composition_array,\n        row_annotations=set_annotations,\n        col_annotations=chunk_sizes,\n        ax=axes['main'],\n        col_widths=col_widths,\n        min_width_for_annotation=effective_min_width_for_annotation,\n        **kw)\n\n    xlim = axes['main'].get_xlim()\n    ylim = axes['main'].get_ylim()\n    fontsize = kw.get('fontsize', DEFAULT_FONTSIZE)\n    plt.xlabel('ITEMS', fontsize=fontsize)\n    plt.ylabel('SETS', fontsize=fontsize)\n\n    # Side plots\n\n    if 'top_side_plot' in axes:\n        plt.sca(axes['top_side_plot'])\n        side_plot(composition_array.sum(0), col_widths, 'h',\n                  min_width_for_annotation=effective_min_width_for_annotation,\n                  rotate_annotations=kw.get('rotate_col_annotations', False), color=side_plot_color, fontsize=fontsize)\n        plt.xlim(xlim)\n\n    if 'right_side_plot' in axes:\n        plt.sca(axes['right_side_plot'])\n        side_plot([len(sets[i]) for i in permutations_['sets_ordering']], [1] * len(sets), 'v', color=side_plot_color,\n                  fontsize=fontsize)\n        plt.ylim(ylim)\n\n    plt.sca(axes['main'])\n    return SupervennPlot(axes, plt.gcf(), break_into_chunks(sets))  # todo: break_into_chunks is called twice, fix\n"
  },
  {
    "path": "supervenn/_utils.py",
    "content": "from collections import defaultdict\nimport doctest\nimport sys\n\nimport pandas as pd\n\n\ndef make_sets_from_chunk_sizes(sizes_df):\n    \"\"\"\n    Given a pandas.DataFrame with sizes of intersections, produces synthetic sets of integers that\n    have exactly these intersection sizes.\n    It is a temporary roundabout before proper handling of intersection sizes without\n    the necessity for physical sets is implemented.\n    If your intersection sizes are huge, say in the tens of millions and higher, I recommend to scale\n    th numbers down to avoid using up too much memory. Don't forget to cast the scaled numbers to integer though.\n\n    :param sizes_df: pd.DataFrame, must have the following structure:\n        - For N sets, it must have N boolean columns and the last integer column, so N+1 columns in total.\n        - The names of the boolean columns are the names (labels) of the sets, the name of the integer column doesn't matter.\n        - Each row represents an unique intersection (chunk) of the sets. The boolean value in column 'set_x' indicate whether\n        this chunk lies within set_x. The integer value represents the size of the chunk.\n        - The index of the dataframe doesn't matter.\n        - 0's and 1's instead of booleans will work too.\n\n    For example, consider the following dataframe\n\n       set_1  set_2  set_3  size\n    0  False   True   True     1\n    1   True  False  False     3\n    2   True  False   True     2\n    3   True   True  False     1\n\n    It represents a configuration of three sets such that\n    - [row 0] there is one element that lies in set_2 and set_3 but not in set_1\n    - [row 1] there are three elements that lie in set_1 only and not in set_2 or set_3\n    - etc two more rows.\n\n    :return: sets, labels\n        where sets is a list of python sets, and labels is the list of set names (boolean colums names)\n\n    so the result can be used as follows:\n\n    sets, labels = make_sets_from_chunk_sizes(df)\n    supervenn(sets, labels, ...)\n\n    >>> values = [(False, True, True, 1), (True, False, False, 3), (True, False, True, 2), (True, True, False, 1)]\n    >>> df = pd.DataFrame(values, columns=['set_1', 'set_2', 'set_3', 'size'])\n    >>> sets, labels = make_sets_from_chunk_sizes(df)\n    >>> sets\n    [{1, 2, 3, 4, 5, 6}, {0, 6}, {0, 4, 5}]\n\n    >>> values = [(0, 1, 1, 1), (1, 0, 0, 3), (1, 0, 1, 2), (1, 1, 0, 1)]\n    >>> df = pd.DataFrame(values, columns=['set_1', 'set_2', 'set_3', 'size'])\n    >>> sets, labels = make_sets_from_chunk_sizes(df)\n    >>> sets\n    [{1, 2, 3, 4, 5, 6}, {0, 6}, {0, 4, 5}]\n    \"\"\"\n\n    sets = defaultdict(set)\n\n    idx = 0\n\n    for _, row in sizes_df.iterrows():\n        flags = row.iloc[:-1]\n        count = row.iloc[-1]\n        items = range(idx, idx + count)\n\n        try:\n            row_iterator = flags.iteritems()  # deprecated since pandas 1.5\n        except AttributeError:\n            row_iterator = flags.items()\n        for col_name, bool_value in row_iterator:\n            if bool_value:\n                sets[col_name].update(items)\n        idx += count\n\n    sets_labels = sizes_df.columns[:-1].tolist()\n\n    return [sets[col] for col in sets_labels], sets_labels\n\n\ndoctest.testmod(sys.modules[__name__])\n"
  },
  {
    "path": "tests/test_algorithms.py",
    "content": "# -*- coding: utf-8 -*-\n\"\"\"\nTests for _algorithms module.\n\"\"\"\nimport unittest\nimport numpy as np\nfrom collections import Counter\nfrom itertools import product\n\n\nfrom supervenn._algorithms import (\n    get_total_gaps_in_rows,\n    get_chunks_and_composition_array,\n    _get_ordered_chunks_and_composition_array,\n    find_best_columns_permutation_bruteforce,\n    run_greedy_algorithm_on_composition_array,\n    run_randomized_greedy_algorithm\n)\n\n\ndef freeze_sets(sets):\n    \"\"\"\n    Convert an iterable of sets to a set of frozensets, and check that all sets are not repeated.\n    \"\"\"\n    counter = Counter(frozenset(set_) for set_ in sets)\n    assert max(counter.values()) == 1\n    return set(counter)\n\n\ndef is_ascending(sequence):\n    return all(sequence[i + 1] >= sequence[i] for i in range(len(sequence) - 1))\n\n\ndef array_is_binary(arr):\n    \"\"\"\n    :param arr: np.array\n    :return: True if the array only contains zeros and ones, False otherwise\n    \"\"\"\n    return not set(arr.flatten()) - {0, 1}\n\n\ndef make_random_sets(min_sets_count=1):\n    sets_count = np.random.randint(min_sets_count, 10)\n    max_item = np.random.choice([2 ** i for i in range(8)]) + 1\n    max_size = np.random.choice([2 ** i for i in range(8)]) + 1\n    sets = [set(np.random.randint(1, max_item, size=np.random.randint(1, max_size)))\n            for _ in range(sets_count)]\n\n    # decimate sets to introduce some empty sets into the list\n    for index in np.random.randint(0, len(sets), size=int(len(sets) / 10)):\n        sets[index] = set()\n    return sets\n\n\ndef is_permutation(sequence):\n    n = len(sequence)\n    if len(sequence) != n:\n        return False\n    if len(set(sequence)) != n:\n        return False\n    if set(sequence) != set(range(n)):\n        return False\n    return True\n\n\nclass TestGetTotalGaps(unittest.TestCase):\n\n    def test_all_zeros(self):\n        \"\"\"\n        Test that number of runs of a zero matrix is zero\n        \"\"\"\n        arr = np.zeros((2, 3), dtype=bool)\n        gaps_count = get_total_gaps_in_rows(arr)\n        self.assertEqual(gaps_count, 0)\n\n    def test_all_ones(self):\n        \"\"\"\n        Test that the number of runs of a matrix of ones is equal to the number of rows.\n        \"\"\"\n        arr = np.ones((7, 3), dtype=bool)\n        gaps_count = get_total_gaps_in_rows(arr)\n        self.assertEqual(gaps_count, 0)\n\n    def test_one_col(self):\n        \"\"\"\n        Test that the number of runs of a matrix with one col is equal to the number of ones\n        \"\"\"\n        rows_count = 100\n        arr = np.random.randint(0, 2, size=(rows_count, 1))\n        gaps_count = get_total_gaps_in_rows(arr)\n        self.assertEqual(gaps_count, 0)\n\n    def test_nontrivial_one(self):\n        arr = np.array([[1, 0, 1, 0],\n                        [1, 0, 0, 1],\n                        [0, 1, 0, 1]])\n        gaps_count = get_total_gaps_in_rows(arr)\n        self.assertEqual(gaps_count, 3)\n\n    def test_nontrivial_two(self):\n        arr = np.array([[1, 0, 0, 0],\n                        [0, 0, 0, 1],\n                        [0, 0, 0, 1]])\n        gaps_count = get_total_gaps_in_rows(arr)\n        self.assertEqual(gaps_count, 0)\n\n    def test_nontrivial_three(self):\n        arr = np.array([[1, 1, 0, 1],\n                        [0, 1, 1, 0],\n                        [1, 0, 1, 1]])\n        gaps_count = get_total_gaps_in_rows(arr)\n        self.assertEqual(gaps_count, 2)\n\n\nclass TestGetChunksAndCompositionArray(unittest.TestCase):\n    def test_single_set(self):\n        sets = [set(np.random.randint(1, 1000, size=100))]\n        chunks, arr = get_chunks_and_composition_array(sets)\n        self.assertEqual(chunks, sets)\n        self.assertEqual(arr.shape, (1, 1))\n        self.assertEqual(arr[0, 0], 1)\n\n    def test_one_empty(self):\n        set_ = set(np.random.randint(1, 1000, size=100))\n        sets = [set_, set()]\n        chunks, arr = get_chunks_and_composition_array(sets)\n        self.assertEqual(chunks, [set_])\n        self.assertEqual(arr.shape, (2, 1))\n        self.assertEqual(list(arr[:, 0]), [1, 0])\n\n    def test_disjoint_small(self):\n        sets = [{1}, {2}]\n        chunks, arr = get_chunks_and_composition_array(sets)\n        self.assertEqual(freeze_sets(sets), freeze_sets(chunks))\n        self.assertTrue(np.array_equal(arr, np.eye(2, dtype=int)))  # fixme can be not eye\n\n    def test_disjoint_large(self):\n        set_count = 4\n        sets = [set(np.random.randint(1000 * i, 1000 * (i + 1), size=100 * (i + 1))) for i in range(set_count)]\n        chunks, arr = get_chunks_and_composition_array(sets)\n        self.assertEqual(freeze_sets(sets), freeze_sets(chunks))\n        # Verify that array is made of zeros and ones\n        self.assertTrue(array_is_binary(arr))\n        # Verify that each row and each column of arr sum to 1\n        self.assertEqual(set(arr.sum(0)), {1})\n        self.assertEqual(set(arr.sum(1)), {1})\n\n    def test_two_sets(self):\n        sets = [{1, 2, 3}, {3, 4}]\n        chunks, arr = get_chunks_and_composition_array(sets)\n        self.assertEqual(freeze_sets(chunks), {frozenset([1, 2]), frozenset([3]), frozenset([4])})\n        self.assertTrue(array_is_binary(arr))\n\n    def test_chunks_for_random_sets(self):\n        \"\"\"\n        For random sets, test that\n        1) the union of sets is the disjoint union of chunks\n        2) each chunk is either completely inside any set either completely outside\n        :return:\n        \"\"\"\n        for _ in range(100):\n            sets = make_random_sets()\n            chunks, arr = get_chunks_and_composition_array(sets)\n            all_elements = set.union(*sets)\n            self.assertEqual(set.union(*chunks), all_elements)\n            self.assertEqual(sum(len(chunk) for chunk in chunks), len(all_elements))\n            for chunk, set_ in product(chunks, sets):\n                self.assertTrue(not chunk - set_ or chunk - set_ == chunk)\n\n    def test_chunks_and_array(self):\n        \"\"\"\n        For random sets, test that the you indeed can recreate every original set as the union of chunks indicated in\n        the corresponding row of array.\n        \"\"\"\n        for _ in range(100):\n            sets = make_random_sets()\n            chunks, arr = get_chunks_and_composition_array(sets)\n            for (set_, row) in zip(sets, arr):\n                chunks_in_this_set = [chunk for is_included, chunk in zip(row, chunks) if is_included]\n                recreated_set = set.union(*chunks_in_this_set)\n                self.assertEqual(set_, recreated_set)\n                self.assertEqual(sum(len(chunk) for chunk in chunks_in_this_set), len(set_))\n\n\ndef make_test_for_algorithm(function):\n\n    class TestPermutationFinder(unittest.TestCase):\n        def test_single_set(self):\n            arr = np.array([[1]])\n            permutation = function(arr)\n            self.assertEqual(permutation, [0])\n\n        def test_single_chunk(self):\n            arr = np.array([1, 0, 1, 0]).reshape((-1, 1))\n            permutation = function(arr)\n            self.assertEqual(permutation, [0])\n\n        def test_two_cols(self):\n            for sets_count in 1, 2, 10:\n                col_1 = (np.random.uniform(0, 1, size=sets_count) > 0.6).astype(int)\n                col_2 = (np.random.uniform(0, 1, size=sets_count) > 0.6).astype(int)\n                col_1[0] = 1\n                col_2[0] = 1\n                arr = np.concatenate([col_1.reshape((-1, 1)), col_2.reshape((-1, 1))], 1)\n                permutation = function(arr)\n                self.assertIn(permutation, [[0, 1], [1, 0]])\n\n        def test_is_permutation(self):\n            for _ in range(100):\n                ncols = np.random.randint(1, 5)\n                nrows = np.random.randint(1, 5)\n                arr = (np.random.uniform(0, 1, size=(nrows, ncols)) > 0.5).astype(int)\n                arr[0][0] = 1\n                permutation = function(arr)\n                self.assertTrue(is_permutation(permutation))\n\n        def test_obvious_one(self):\n            arr = np.array([[1, 1, 1, 1],\n                            [1, 0, 1, 0],\n                            [1, 0, 1, 0]])\n            permutation = function(arr)\n            permutation_string = ''.join((str(i) for i in permutation))\n            self.assertIn('02', permutation_string + ' ' + permutation_string[::-1])\n\n        def test_obvious_two(self):\n            arr = np.array([[0, 1, 0, 1, 0, 0],\n                            [1, 0, 1, 1, 0, 1],\n                            [1, 0, 0, 1, 1, 0]])\n            permutation = function(arr)\n            permutation_string = ''.join((str(i) for i in permutation))\n            self.assertIn('03', permutation_string + ' ' + permutation_string[::-1])\n\n        def test_obvious_three(self):\n            arr = np.array([[0, 1, 0, 1, 1, 0],\n                            [1, 0, 1, 1, 0, 1],\n                            [1, 0, 0, 1, 0, 0],\n                            [1, 1, 0, 1, 0, 0],\n                            [1, 0, 0, 0, 1, 0]])\n            permutation = function(arr)\n            permutation_string = ''.join((str(i) for i in permutation))\n            self.assertIn('03', permutation_string + ' ' + permutation_string[::-1])\n            self.assertIn('13', permutation_string + ' ' + permutation_string[::-1])\n\n    return TestPermutationFinder\n\n\nTestFindBestColumnPermutationBruteforce = make_test_for_algorithm(find_best_columns_permutation_bruteforce)\nTestRunGreedyAlgorithmOnCompositionArray = make_test_for_algorithm(run_greedy_algorithm_on_composition_array)\nTestRunRandomizedGreedyAlgorithm = make_test_for_algorithm(run_randomized_greedy_algorithm)\n\n# TODO: A test that randomized is not worse than non-randomized\n\nclass TestRandomizedAlgorithmQuality(unittest.TestCase):\n\n    def test_quality(self):\n        for _ in range(100):\n            one_prob = np.random.uniform(0.1, 0.9)\n            nrows = np.random.randint(1, 30)\n            ncols = np.random.randint(1, 8)\n            arr = np.random.uniform(0, 1, size=(nrows, ncols)) < one_prob\n            permutation = run_randomized_greedy_algorithm(arr)\n            target = get_total_gaps_in_rows(arr[:, permutation])\n            best_permutation = find_best_columns_permutation_bruteforce(arr)\n            best_target = get_total_gaps_in_rows(arr[:, best_permutation])\n            self.assertLessEqual((target - best_target), 0.3 * best_target + 1)\n\n\nclass TestRandomizedAlgorithmReproducible(unittest.TestCase):\n\n    def test_reproducible(self):\n        arr = np.random.uniform(0, 1, size=(20, 20)) < 0.5\n        representations_set = set()\n        for _ in range(10):\n            permutation = run_randomized_greedy_algorithm(arr)\n            representations_set.add(str(permutation))\n        self.assertEqual(len(representations_set), 1)\n\n\nclass TestGetOrderedChunksAndCompositionArray(unittest.TestCase):\n    def test_order_chunks_size_descending(self):\n        for _ in range(10):\n            sets = make_random_sets()\n            chunks, _ = _get_ordered_chunks_and_composition_array(sets, chunks_ordering='size')\n            chunk_sizes_descending = is_ascending([len(chunk) for chunk in chunks][::-1])\n            self.assertTrue(chunk_sizes_descending)\n\n    def test_order_chunks_size_ascending(self):\n        for _ in range(10):\n            sets = make_random_sets()\n            chunks, _ = _get_ordered_chunks_and_composition_array(sets, chunks_ordering='size',\n                                                                  reverse_chunks_order=False)\n            chunk_sizes_ascending = is_ascending([len(chunk) for chunk in chunks])\n            self.assertTrue(chunk_sizes_ascending)\n\n    def test_order_chunks_occurence_descending(self):\n        for _ in range(10):\n            sets = make_random_sets()\n            _, composition_matrix = _get_ordered_chunks_and_composition_array(sets, chunks_ordering='occurrence')\n            occurences = composition_matrix.sum(0)\n            occurences_descending = is_ascending(occurences[::-1])\n            self.assertTrue(occurences_descending)\n\n    def test_order_chunks_occurence_ascending(self):\n        for _ in range(10):\n            sets = make_random_sets()\n            _, composition_matrix = _get_ordered_chunks_and_composition_array(sets, chunks_ordering='occurrence',\n                                                                              reverse_chunks_order=False)\n            occurences = composition_matrix.sum(0)\n            occurences_ascending = is_ascending(occurences)\n            self.assertTrue(occurences_ascending)\n\n    def test_order_chunks_random(self):\n        \"\"\"\n        Test that in 10 runs not all matrices are equal.\n        :return:\n        \"\"\"\n        done = False\n        while not done:\n            sets = make_random_sets(min_sets_count=9)\n            chunks, composition_matrix = get_chunks_and_composition_array(sets)\n            if len(chunks) == 1:\n                continue\n            representations_set = set()\n            for i in range(10):\n                _, composition_matrix = _get_ordered_chunks_and_composition_array(sets, chunks_ordering='random')\n                representations_set.add(str(composition_matrix))\n            self.assertGreater(len(representations_set), 1)\n            done = True\n\n    def test_order_sets_size_descending(self):\n        for _ in range(1):\n            sets = make_random_sets()\n            chunks, composition_matrix = _get_ordered_chunks_and_composition_array(sets, sets_ordering='size')\n            chunk_sizes = np.array([len(chunk) for chunk in chunks])\n            set_sizes = composition_matrix.dot(chunk_sizes.reshape((-1, 1))).flatten()\n            set_sizes_descending = is_ascending(set_sizes[::-1])\n            self.assertTrue(set_sizes_descending)\n\n    def test_order_sets_size_ascending(self):\n        for _ in range(1):\n            sets = make_random_sets()\n            chunks, composition_matrix = _get_ordered_chunks_and_composition_array(sets, sets_ordering='size',\n                                                                                   reverse_sets_order=False)\n            chunk_sizes = np.array([len(chunk) for chunk in chunks])\n            set_sizes = composition_matrix.dot(chunk_sizes.reshape((-1, 1))).flatten()\n            set_sizes_ascending = is_ascending(set_sizes)\n            self.assertTrue(set_sizes_ascending)\n\n    def test_order_sets_chunk_count_descending(self):\n        for _ in range(10):\n            sets = make_random_sets()\n            _, composition_matrix = _get_ordered_chunks_and_composition_array(sets, sets_ordering='chunk count')\n            chunk_counts = composition_matrix.sum(1)\n            chunk_counts_descending = is_ascending(chunk_counts[::-1])\n            self.assertTrue(chunk_counts_descending)\n\n    def test_order_sets_chunk_count_ascending(self):\n        for _ in range(10):\n            sets = make_random_sets()\n            _, composition_matrix = _get_ordered_chunks_and_composition_array(sets, sets_ordering='chunk count',\n                                                                              reverse_sets_order=False)\n            chunk_counts = composition_matrix.sum(1)\n            chunk_counts_ascending = is_ascending(chunk_counts)\n            self.assertTrue(chunk_counts_ascending)\n\n    def test_order_sets_random(self):\n        done = False\n        while not done:\n            sets = make_random_sets(min_sets_count=9)\n            chunks, composition_matrix = get_chunks_and_composition_array(sets)\n            if len(chunks) == 1:\n                continue\n            representations_set = set()\n            for i in range(10):\n                _, composition_matrix = _get_ordered_chunks_and_composition_array(sets, sets_ordering='random')\n                representations_set.add(str(composition_matrix))\n            self.assertGreater(len(representations_set), 1)\n            done = True\n\n\nif __name__ == '__main__':\n    unittest.main()\n"
  },
  {
    "path": "tests/test_plots.py",
    "content": "import matplotlib.pyplot as plt\nfrom matplotlib.patches import Rectangle\nimport unittest\n\nfrom supervenn._plots import supervenn\n\n\ndef rectangle_present(ax, expected_bbox_bounds):\n    \"\"\"\n    Check whether Axes ax contains a matplotlib.patches.Rectangle with given bbox bounds\n    :param ax: matplolib Axes\n    :param expected_bbox_bounds: expected bbox bounds [min_x, min_y, max_x, max_y]\n    :return: True / False\n    \"\"\"\n    found = False\n    for child in ax.get_children():\n        if isinstance(child, Rectangle) and child.get_bbox().bounds == expected_bbox_bounds:\n            found = True\n            break\n    return found\n\n\nclass TestSupervenn(unittest.TestCase):\n\n    def test_no_sets(self):\n        sets = []\n        with self.assertRaises(ValueError):\n            supervenn(sets)\n        plt.close()\n\n    def test_empty_sets(self):\n        sets = [set(), set()]\n        with self.assertRaises(ValueError):\n            supervenn(sets)\n        plt.close()\n\n    def test_no_side_plots(self):\n        \"\"\"\n        Test that the supervenn() runs without side plots, produces only one axes, and if given a list of one set\n        produces a rectangle of correct dimensions.\n        \"\"\"\n        set_size = 3\n        sets = [set(range(set_size))]\n        supervenn_plot = supervenn(sets, side_plots=False)\n        self.assertEqual(list(supervenn_plot.axes), ['main'])\n        self.assertEqual(len(supervenn_plot.figure.axes), 1)\n        expected_rectangle_bounds = (0.0, 0.0, float(set_size), 1.0)\n        self.assertTrue(rectangle_present(supervenn_plot.axes['main'], expected_rectangle_bounds))\n        plt.close()\n\n    def test_with_side_plots(self):\n        \"\"\"\n        Test that supervenn() runs with side plots, produces four axes, and if given a list of one set\n        produces a rectangle of correct dimensions.\n        \"\"\"\n        set_size = 3\n        sets = [set(range(set_size))]\n        supervenn_plot = supervenn(sets, side_plots=True)\n        self.assertEqual(set(supervenn_plot.axes), {'main', 'right_side_plot', 'top_side_plot', 'unused'})\n        expected_rectangle_bounds = (0.0, 0.0, float(set_size), 1.0)  # same for all the three axes actually!\n        for ax_name, ax in supervenn_plot.axes.items():\n            if ax_name == 'unused':\n                continue\n            self.assertTrue(rectangle_present(ax, expected_rectangle_bounds))\n        plt.close()\n\n    def test_with_one_side_plot(self):\n        set_size = 3\n        sets = [set(range(set_size))]\n        for side_plots in ('top', 'right'):\n            supervenn_plot = supervenn(sets, side_plots=side_plots)\n            self.assertEqual(set(supervenn_plot.axes), {'main', '{}_side_plot'.format(side_plots)})\n            expected_rectangle_bounds = (0.0, 0.0, float(set_size), 1.0)  # same for both axes\n            for ax_name, ax in supervenn_plot.axes.items():\n                self.assertTrue(rectangle_present(ax, expected_rectangle_bounds))\n            plt.close()\n\n\nif __name__ == '__main__':\n    unittest.main()\n"
  }
]