Full Code of virgili0/Virgilio for AI

Repository: virgili0/Virgilio
Branch: dev
Commit: 678bd1a25f1b
Files: 123
Total size: 2.5 MB

Directory structure:
gitextract_50nlmjw2/
├── .github/
│   └── workflows/
│       └── deploy-vuepress.yml
├── .gitignore
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
├── Specializations/
│   ├── HardSkills/
│   │   ├── DataPreprocessing.md
│   │   └── DataVisualization.md
│   └── SoftSkills/
│       └── ImpactfulPresentations.md
├── Tools/
│   ├── GeoGebra.md
│   ├── Latex.md
│   ├── MLDemos/
│   │   └── README.md
│   ├── Regex.ipynb
│   ├── WolframAlpha.md
│   └── regex-bin/
│       ├── pi.txt
│       └── regexPrinter.py
├── Topics/
│   ├── ANN.md
│   ├── Computer Vision/
│   │   ├── Introduction_to_Computer_Vision_using_OpenCV_and_Python.ipynb
│   │   ├── Object_Instance_Segmentation_using_TensorFlow_Framework_and_Cloud_GPU_Technology.ipynb
│   │   ├── Object_Tracking_based_on_Deep_Learning.ipynb
│   │   └── Object_detection_based_on_Deep_Learning.ipynb
│   ├── Deep learning in cloud/
│   │   └── README.md
│   ├── Demystification.md
│   ├── DialogFlow.md
│   ├── MLSystems.md
│   ├── NLP/
│   │   └── NLP.ipynb
│   ├── do_you_need_ml.md
│   ├── ds_process.md
│   ├── frame-the-problem.md
│   ├── jupyter-notebook.md
│   ├── math-fundamentals.md
│   ├── prerequisites.md
│   ├── python-fundamentals.md
│   ├── starting-a-data-project.md
│   ├── statistics-fundamentals.md
│   ├── teaching.md
│   ├── usage-and-integration.md
│   └── use-cases.md
├── content/
│   ├── .vuepress/
│   │   ├── LICENSE
│   │   ├── config.js
│   │   ├── public/
│   │   │   ├── googlece1290fc3980cafc.html
│   │   │   └── vollkorn/
│   │   │       └── SIL Open Font License.txt
│   │   └── theme/
│   │       ├── LICENSE
│   │       ├── components/
│   │       │   ├── AlgoliaSearchBox.vue
│   │       │   ├── DropdownLink.vue
│   │       │   ├── DropdownTransition.vue
│   │       │   ├── Home.vue
│   │       │   ├── NavLink.vue
│   │       │   ├── NavLinks.vue
│   │       │   ├── Navbar.vue
│   │       │   ├── Page.vue
│   │       │   ├── PageEdit.vue
│   │       │   ├── PageNav.vue
│   │       │   ├── Sidebar.vue
│   │       │   ├── SidebarButton.vue
│   │       │   ├── SidebarGroup.vue
│   │       │   ├── SidebarLink.vue
│   │       │   └── SidebarLinks.vue
│   │       ├── global-components/
│   │       │   └── Badge.vue
│   │       ├── index.js
│   │       ├── layouts/
│   │       │   ├── 404.vue
│   │       │   └── Layout.vue
│   │       ├── noopModule.js
│   │       ├── styles/
│   │       │   ├── arrow.styl
│   │       │   ├── code.styl
│   │       │   ├── config.styl
│   │       │   ├── custom-blocks.styl
│   │       │   ├── index.styl
│   │       │   ├── mobile.styl
│   │       │   ├── toc.styl
│   │       │   └── wrapper.styl
│   │       └── util/
│   │           └── index.js
│   ├── README.md
│   ├── docs/
│   │   ├── contributing.md
│   │   ├── contributors.md
│   │   └── template.md
│   ├── inferno/
│   │   ├── computer-vision/
│   │   │   ├── Object_detection_based_on_Deep_Learning.ipynb
│   │   │   ├── introduction-to-computer-vision.ipynb
│   │   │   ├── object-instance-segmentation.ipynb
│   │   │   └── object-tracking.ipynb
│   │   ├── research/
│   │   │   ├── sota-papers.md
│   │   │   └── zotero.md
│   │   ├── soft-skills/
│   │   │   └── impactful-presentations.md
│   │   ├── time-series/
│   │   │   └── introduction-to-time-series.md
│   │   ├── tools/
│   │   │   ├── geo-gebra.md
│   │   │   ├── latex.md
│   │   │   ├── regex.ipynb
│   │   │   └── wolfram-alpha.md
│   │   ├── virtual-assistants/
│   │   │   └── dialogflow-chatbot.md
│   │   └── welcome-to-inferno/
│   │       └── welcome-to-inferno.md
│   ├── package.json
│   ├── paradiso/
│   │   ├── demystification-ai-ml-dl.md
│   │   ├── do-you-really-need-ml.md
│   │   ├── introduction-to-ml.md
│   │   ├── use-cases.md
│   │   ├── virgilio-teaching-strategy.md
│   │   └── what-do-i-need-for-ml.md
│   └── purgatorio/
│       ├── collect-and-prepare-data/
│       │   ├── data-collection-text-to-diagram-01.txt
│       │   ├── data-collection.md
│       │   ├── data-preparation.md
│       │   └── data-visualization.md
│       ├── define-the-scope-and-ask-questions/
│       │   ├── frame-the-problem.md
│       │   ├── starting-a-data-project.md
│       │   ├── usage-and-integration.md
│       │   └── workspace-setup-and-cloud-computing.md
│       ├── fundamentals/
│       │   ├── jupyter-notebook.md
│       │   ├── math-fundamentals.md
│       │   ├── python-fundamentals.md
│       │   ├── statistics-fundamentals.md
│       │   └── the-data-science-process.md
│       ├── launch-and-mantain-the-system/
│       │   ├── automation-and-reproducibility.md
│       │   ├── monitoring-usage-and-behavior.md
│       │   └── serving-trained-models.md
│       ├── now-go-build/
│       │   ├── a-messy-real-world.md
│       │   ├── best-practices.md
│       │   └── transfer-learning.md
│       └── select-and-train-machine-learning-models/
│           ├── deep-learning-theory.md
│           ├── evaluation-and-finetuning.md
│           ├── machine-learning-theory.md
│           └── tools-and-libraries.md
├── docs/
│   ├── contributing.md
│   ├── contributors.md
│   └── template.md
└── google50cedcfbb5fc73b6.html

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/deploy-vuepress.yml
================================================
name: Build and deploy an updated version of the website

on:
  push

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout virgili0/Virgilio
      uses: actions/checkout@v2
      with:
        repository: virgili0/Virgilio
        path: folder/repo
    - name: Checkout virgili0/Virgilio master
      uses: actions/checkout@v2
      with:
        repository: virgili0/Virgilio
        ref: master
        path: folder/build

    - name: Set up Python 3.5
      uses: actions/setup-python@v1
      with:
        python-version: 3.5

    - name: Install Python dependencies
      run: |
        python -m pip install --upgrade pip
        pip install jupyter
        
    - uses: actions/setup-node@v1
      with:
        node-version: '12'

    - name: Install npm dependencies
      working-directory: folder/repo/
      run: |
        cd content
        npm install
        
    - name: Vuepress build and deploy
      working-directory: folder/
      run: |
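        cd repo
        # Convert every notebook to Markdown before the site build
        find . -type f -name "*.ipynb" -exec jupyter nbconvert --to markdown {} \;
        # Build the static site with VuePress
        cd content
        npm run build
        cd ..
        # Copy the build output into the checkout of the master branch
        mkdir dist
        cp -r content/.vuepress/dist/* dist/
        cd ..
        cp -a repo/dist/. build/
        cd build
        # Install the deploy key, then commit and push the new build
        mkdir -m 700 ~/.ssh
        echo "${{ secrets.SSH_KEY_SECRET }}" > ~/.ssh/id_ed25519
        chmod 0600 ~/.ssh/id_ed25519
        git config --local user.name "GitHub Action"
        git config --global user.email "virgilio.datascience@gmail.com"
        git add .
        # "|| :" lets the step succeed when there is nothing to commit (it also hides push failures)
        git commit -m "Update build" && git push || :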


================================================
FILE: .gitignore
================================================
.DS_Store
.vscode/*

# Node things
node_modules/
package-lock.json

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don’t work, or not
#   install all needed dependencies.
#Pipfile.lock

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/


================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at `virgilio.datascience (at) gmail.com`. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at [http://contributor-covenant.org/version/1/4][version]

[homepage]: http://contributor-covenant.org
[version]: http://contributor-covenant.org/version/1/4/


================================================
FILE: LICENSE
================================================
Attribution-NonCommercial-ShareAlike 4.0 International

=======================================================================

Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.

Using Creative Commons Public Licenses

Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.

     Considerations for licensors: Our public licenses are
     intended for use by those authorized to give the public
     permission to use material in ways otherwise restricted by
     copyright and certain other rights. Our licenses are
     irrevocable. Licensors should read and understand the terms
     and conditions of the license they choose before applying it.
     Licensors should also secure all rights necessary before
     applying our licenses so that the public can reuse the
     material as expected. Licensors should clearly mark any
     material not subject to the license. This includes other CC-
     licensed material, or material used under an exception or
     limitation to copyright. More considerations for licensors:
    wiki.creativecommons.org/Considerations_for_licensors

     Considerations for the public: By using one of our public
     licenses, a licensor grants the public permission to use the
     licensed material under specified terms and conditions. If
     the licensor's permission is not necessary for any reason--for
     example, because of any applicable exception or limitation to
     copyright--then that use is not regulated by the license. Our
     licenses grant only permissions under copyright and certain
     other rights that a licensor has authority to grant. Use of
     the licensed material may still be restricted for other
     reasons, including because others have copyright or other
     rights in the material. A licensor may make special requests,
     such as asking that all changes be marked or described.
     Although not required by our licenses, you are encouraged to
     respect those requests where reasonable. More considerations
     for the public:
    wiki.creativecommons.org/Considerations_for_licensees

=======================================================================

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Public License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International Public License
("Public License"). To the extent this Public License may be
interpreted as a contract, You are granted the Licensed Rights in
consideration of Your acceptance of these terms and conditions, and the
Licensor grants You such rights in consideration of benefits the
Licensor receives from making the Licensed Material available under
these terms and conditions.


Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. BY-NC-SA Compatible License means a license listed at
     creativecommons.org/compatiblelicenses, approved by Creative
     Commons as essentially the equivalent of this Public License.

  d. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.

  e. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  f. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  g. License Elements means the license attributes listed in the name
     of a Creative Commons Public License. The License Elements of this
     Public License are Attribution, NonCommercial, and ShareAlike.

  h. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  i. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  j. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  k. NonCommercial means not primarily intended for or directed towards
     commercial advantage or monetary compensation. For purposes of
     this Public License, the exchange of the Licensed Material for
     other material subject to Copyright and Similar Rights by digital
     file-sharing or similar means is NonCommercial provided there is
     no payment of monetary compensation in connection with the
     exchange.

  l. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  m. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  n. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.


Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part, for NonCommercial purposes only; and

            b. produce, reproduce, and Share Adapted Material for
               NonCommercial purposes only.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. Additional offer from the Licensor -- Adapted Material.
               Every recipient of Adapted Material from You
               automatically receives an offer from the Licensor to
               exercise the Licensed Rights in the Adapted Material
               under the conditions of the Adapter's License You apply.

            c. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties, including when
          the Licensed Material is used other than for NonCommercial
          purposes.


Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.
       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

  b. ShareAlike.

     In addition to the conditions in Section 3(a), if You Share
     Adapted Material You produce, the following conditions also apply.

       1. The Adapter's License You apply must be a Creative Commons
          license with the same License Elements, this version or
          later, or a BY-NC-SA Compatible License.

       2. You must include the text of, or the URI or hyperlink to, the
          Adapter's License You apply. You may satisfy this condition
          in any reasonable manner based on the medium, means, and
          context in which You Share Adapted Material.

       3. You may not offer or impose any additional or different terms
          or conditions on, or apply any Effective Technological
          Measures to, Adapted Material that restrict exercise of the
          rights granted under the Adapter's License You apply.


Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database for NonCommercial purposes
     only;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material,
     including for purposes of Section 3(b); and

  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.


Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.


Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.


Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.


Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.

=======================================================================

Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the “Licensor.” The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.

Creative Commons may be contacted at creativecommons.org.


================================================
FILE: README.md
================================================





# [I've Launched a GenAI Framework that's robust and easy to learn and maintain, check it!](https://github.com/datapizza-labs/datapizza-ai)

Virgilio is an **open-source initiative** that aims to **mentor and guide** anyone in the world of **Data Science**.
Our vision is to give *everyone* the chance to get involved in this field, **get self-started** as a practitioner, **gain new skills** and **learn to navigate** through the infinite web of resources and find the ones useful for *you*.

[Find me](https://twitter.com/giac290595) on Twitter to have a chat!

## -----> [**Meet Virgilio now!**](https://virgili0.github.io/Virgilio/)
![Figure 1](virgilio.PNG "1") 


### Table of Contents

- [What is Virgilio](#what-is-virgilio)
- [About](#About)
  * [License](#license)
  * [Contribute](#contribute)


# What is Virgilio?

Studying and reading through the Internet means swimming in an **infinite jungle of chaotic information**, even more so in rapidly changing innovative fields. 

_Have you ever felt overwhelmed_ when trying to approach **Data Science** without a real “path” to follow? 

Are you tired of clicking “Run”, “Run”, “Run”.. on a Jupyter Notebook, with that false sense of confidence given by the comfort zone of the work of others?

Have you ever been confused by the several contradictory names for the same algorithm or approach, across different websites and fragmented tutorials? 

**Virgilio addresses these critical issues for free, for everyone.**

## [**Enter the new web version of Virgilio!**](https://virgili0.github.io/Virgilio/)

## About

Virgilio is developed and maintained by [these awesome people](docs/contributors.md).
[Find me](https://twitter.com/giac290595) on Twitter to have a chat!

### Contribute

That's awesome! Check the [contribution guidelines](docs/contributing.md) and get involved in our project!

### License

Contents are released under the Creative Commons BY-NC-SA 4.0 [license](https://github.com/virgili0/Virgilio/blob/dev/LICENSE). Code is released under the [MIT license](https://github.com/virgili0/Virgilio/blob/dev/.vuepress/LICENSE).
The Virgilio image comes from [here](https://upload.wikimedia.org/wikipedia/commons/c/ce/Virgil_.jpg).


================================================
FILE: Specializations/HardSkills/DataPreprocessing.md
================================================
# Data Preprocessing

Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the [iterative process](http://www.jsoftware.us/vol12/306-JSW15277.pdf) of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics, and machine learning applications.

[Real-world data](https://www.quanticate.com/blog/real-world-data-analysis-in-clinical-trials) is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

It's the [core ability](https://blogs.sas.com/content/hiddeninsights/2017/11/30/analytical-data-preparation-important/) of any data scientist or data engineer: you must _be able to manipulate, clean, and structure_ your data in your everyday work (and expect this to take up most of your [daily time](https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html)!).

There are a lot of different data types out there, and they deserve [different treatments](http://blog.appliedinformaticsinc.com/data-mining-challenges-in-data-cleaning/).

As usual, the structure I've planned to get you started consists of a [general overview](https://searchbusinessanalytics.techtarget.com/definition/data-preparation), followed by a deep dive into each data processing situation you can encounter. 

[Here](https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6) you have a gentle end-to-end panoramic view of the entire process.

The concepts we'll go through are the following:

- [Don't Joke With Data](#Don't-Joke-With-Data)
- [Business Questions](#Business-Questions)
- [Data Profiling](#Data-Profiling)
- [Who To Leave Behind](#Who-To-Leave-Behind)
- [Start Small](#Start-small)
- [The Toolkit](#The-Toolkit)
- [Data Cleaning](#Data-Cleaning)
  - [Get Rid of Extra Spaces](#Get-Rid-of-Extra-Spaces)
  - [Select and Treat All Blank Cells](#Select-and-Treat-All-Blank-Cells)
  - [Convert Values Type](#Convert-Values-Type)
  - [Remove Duplicates](#Remove-Duplicates)
  - [Change Text to Lower/Upper Case](#Change-Text-to-Lower/Upper-Case)
  - [Spell Check](#Spell-Check)
  - [Dealing with Special Characters](#Dealing-with-Special-Characters)
  - [Normalizing Dates](#Normalizing-Dates)
  - [Verification To Enrich Data](#Verification-To-Enrich-Data)
  - [Data Discretization](#Data-Discretization)
  - [Feature Scaling](#Feature-Scaling)
  - [Data Cleaning Tools](#Data-Cleaning-Tools)
- [Merge Data Sets and Integration](#Merge-Data-Sets-and-Integration)
- [Sanity Check](#Sanity-Check)
- [Automate These Boring Stuffs!](#Automate-These-Boring-Stuffs!)

**Let's Start!**

### Don't Joke With Data
First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/). In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means losing tremendous amounts of value for a company, in the present and in the future. So, respect your King and take good care of him. The most immediate way to do this is to plan and [work hard](https://nektardata.com/high-quality-data/) to _produce_ good quality data.
Your goal is to plan a data collection infrastructure that fixes problems beforehand. This means caring a lot about planning your database schemas well (do I need [third-normal form](https://social.technet.microsoft.com/Forums/Lync/en-US/7bf4ca30-a1bc-415d-97e6-ce0ac3137b53/normalized-3nf-vs-denormalizedstar-schema-data-warehouse-?forum=sqldatawarehousing) or not?), how you collect data from sensors (physical or conceptual), and so on. These are the problems you face when building a system from the ground up, but most of the time you're going to face real-world problems that someone wants to solve with [_already available_](https://www.wired.com/insights/2013/05/more-data-more-problems-is-big-data-always-right/) data.  

### Business Questions
Asking the [right business questions](https://www.datapine.com/blog/data-analysis-questions/) is hard, but it has the [biggest impact](https://towardsdatascience.com/start-your-data-exploration-with-questions-2f1d42cff29e) on how well you solve a particular problem. Remember, you want to [solve a problem](http://www.informit.com/articles/article.aspx?p=2271188&seqNum=2), not to create new ones! 

### Data Profiling
According to the (cold as ice) [Wikipedia definition](https://en.wikipedia.org/wiki/Data_profiling): "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."\
So Wikipedia is subtly suggesting that we have a coffee with the data. 

During this informal meeting, ask the data questions like:
- which business problem are you meant to solve? (what is important, and what is not) 
- how have you been collected (with noise, missing values...)?
- how many friends of yours are there and where can I find them? (data dimensions and retrieving from storages)

Eventually, you may find the data too quiet; maybe they're just shy! \
Anyway, you're going to [ask these questions to the business user](https://business-analysis-excellence.com/business-requirements-meeting/)!

_Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/downloads/di_loreto_data_profiling_tutorial_monday_am.pdf), [2](https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Data-profiling-tutorials-use-cases-and-exercise/td-p/145347)
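
If you want to have that coffee with the data in Pandas, a minimal sketch could look like the following. The tiny DataFrame and its column names are invented for illustration; in practice you would load your own source.

```python
import pandas as pd

# Hypothetical toy dataset; replace it with your own data source.
df = pd.DataFrame({
    "customer": ["Ann", "Bob", None, "Dan"],
    "age": [34, None, 29, 41],
    "city": ["Rome", "Milan", "Rome", "Turin"],
})

n_rows, n_cols = df.shape              # how many "friends" are there?
missing_per_column = df.isna().sum()   # how incomplete is each column?
summary = df.describe(include="all")   # basic statistics for every column

print(n_rows, n_cols)
print(missing_per_column)
print(summary)
```

These three calls only give a first impression; the business questions still have to come from the business user.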

### Who To Leave Behind
During the data profiling process, it's common to realize that some of your data are [useless](https://ambisense.net/why-useless-data-is-worse-than-no-data/). Your data may be too noisy or incomplete, and most likely you don't need all of them to answer your business problems.
[To drop or not to drop, the Dilemma](https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/).
Each time you're facing a data-related problem, try to understand what data you need and what you don't. That is, for each piece of information, ask yourself (and ask the _business user_): 
- How is this data going to help me?
- Is it possible to use it while reducing noise or missing values?
- Considering the benefits/costs of the preparation process versus the business value created, is this data worth it?

### Start Small
It's stupid to handle GBs of data each time you want to try a data preparation step. Just use [small subsets](https://sdtimes.com/bi/data-gets-big-best-practices-data-preparation-scale/) of the data (but take care that they are representative and you catch all the problems). Remember, if you want to experiment with string cleaning, you don't need to launch your script on 10M rows. 
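
As a rough sketch of this idea, assuming a hypothetical DataFrame with a messy `name` column, you can draw a reproducible random sample and try the cleaning step there first:

```python
import pandas as pd

# Hypothetical "big" dataset with one messy string column.
big_df = pd.DataFrame({"id": range(10_000), "name": ["  Ann "] * 10_000})

# Take a reproducible 1% random sample to experiment on.
small_df = big_df.sample(frac=0.01, random_state=42)

# Try the cleaning step on the sample before running it on everything.
small_df = small_df.assign(name=small_df["name"].str.strip())
print(len(small_df))
```

A purely random sample can miss rare problems, so check that the subset really is representative before trusting it.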

### The Toolkit
The tools we're gonna use are Python 3 and its [Pandas library](https://pandas.pydata.org/), the de-facto standard for manipulating datasets.
The heavy lifting here is done by the [DataFrame class](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which comes with a bunch of useful functions for your daily data tasks.
Hopefully, you already know Python. If not, start from there (follow the steps I suggest in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if some ideas are not totally clear yet, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/). 

_Best practices and exercises:_ [1](https://github.com/guipsamora/pandas_exercises), [2](https://www.w3resource.com/python-exercises/pandas/index.php), [3](https://www.machinelearningplus.com/python/101-pandas-exercises-python/), [4](https://www.kaggle.com/pistak/pandas-tutorial-with-interactive-exercises), [5](http://disi.unitn.it/~teso/courses/sciprog/python_pandas_exercises.html)

### Data Cleaning
[Data cleaning](https://en.wikipedia.org/wiki/Data_cleansing) is the general process you carry out once you have a clear big picture of your data: actually replacing characters, dropping incomplete rows, filling missing values, and so forth. In the next sections, we'll explore all the common data cleaning situations.

### Get Rid of Extra Spaces
One of the first things you want to do is [remove extra spaces](https://stackoverflow.com/questions/43332057/pandas-strip-white-space). Take care! Some spaces can carry information, but it heavily depends on the situation. For example, in "Complete Name": "Giacomo Ciarlini" it is nice to have the space so we can later split this into "Name": "Giacomo" and "Surname": "Ciarlini". I want you to notice that, in general, apart from recommendation and personalization systems, unique identifiers like names or IDs are something you can drop. Often, they do not carry information. 
_Bonus tip_: learn how to use [Regex](https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/) for pattern matching; it is one of the most powerful tools every data person needs to master.
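
A minimal sketch of both ideas, trimming the useless spaces and keeping the useful one, assuming a hypothetical `complete_name` column:

```python
import pandas as pd

df = pd.DataFrame({"complete_name": ["  Giacomo   Ciarlini ", "Ada  Lovelace"]})

# Trim both ends, then collapse runs of whitespace into a single space.
clean = (
    df["complete_name"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)

# The single space that is left carries information: use it to split.
parts = clean.str.split(" ", n=1, expand=True)
df["name"] = parts[0]
df["surname"] = parts[1]
print(df[["name", "surname"]])
```

Splitting on the first space is a simplification that only suits two-part names like these; real names need more care.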

_Best practices and exercises:_ [1](https://www.quora.com/How-do-you-remove-all-whitespace-from-a-Python-string), [2](https://towardsdatascience.com/5-methods-to-remove-the-from-your-data-in-python-and-the-fastest-one-281489382455), [3](https://www.tutorialspoint.com/How-to-remove-all-leading-whitespace-in-string-in-Python)

_RegeX exercises_: [1](https://www.w3resource.com/python-exercises/re/), [2](https://pycon2016.regex.training/exercises)

_Bonus Resource_: A super useful [tool](http://regviz.org/) for visualizing RegeX expressions and their effect on the text.

###  Select and Treat All Blank Cells
Often real-world data is incomplete, and it is necessary to handle this situation. [These](https://code.likeagirl.io/how-to-use-python-to-remove-or-modify-empty-values-in-a-csv-dataset-34426c816347) are two ways of dealing with it. [Here](https://hackersandslackers.com/pandas-dataframe-drop/) you have a more in-depth tutorial.
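
As a small sketch of two common options in Pandas (dropping or filling), assuming an invented dataset where the median and an "unknown" label are acceptable fill values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "city": ["Rome", None, "Rome", "Turin"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill the gaps, here with the median age and a placeholder label.
filled = df.fillna({"age": df["age"].median(), "city": "unknown"})

print(dropped)
print(filled)
```

Which option is right depends on the business problem: dropping loses rows, filling invents values.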

_Best practices and exercises:_ [1](https://www.kaggle.com/nirmal51194/data-cleaning-challenge-handling-missing-values), [2](https://stefvanbuuren.name/fimd/missing-data-pattern.html), [3](https://www.ethz.ch/content/dam/ethz/special-interest/math/statistics/sfs/Education/Advanced%20Studies%20in%20Applied%20Statistics/course-material-1719/Multivariate/w10-in-class-exercise-imputation-solution.pdf), [4](http://uc-r.github.io/missing_values)

###  Convert Values Type
[Different data types](https://pbpython.com/pandas_dtypes.html) carry different information, and you need to care about this.
[Here](https://www.geeksforgeeks.org/python-pandas-series-astype-to-convert-data-type-of-series/) is a good tutorial on how to convert value types. Remember that Python has some shortcuts for doing this (executing str(3) will give you back the "3" string), but I recommend you learn how to do it with Pandas.
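
A short sketch of the Pandas way, assuming hypothetical `price` and `quantity` columns that arrived as strings:

```python
import pandas as pd

df = pd.DataFrame({"price": ["10.5", "7", "n/a"], "quantity": ["1", "2", "3"]})

# astype works when every value is convertible.
df["quantity"] = df["quantity"].astype(int)

# to_numeric with errors="coerce" turns unparseable entries into NaN
# instead of raising, so you can treat them as missing values later.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)
```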

###  Remove Duplicates
You don't want duplicate data: they both add noise and occupy space! Learn [how to handle them simply](https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/) with Pandas.
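
A minimal sketch with `drop_duplicates`, assuming an invented table where `email` acts as the key:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "a@x.com"],
    "plan": ["free", "pro", "free", "pro"],
})

# Drop rows that are identical in every column (keeps the first occurrence).
no_exact_dupes = df.drop_duplicates()

# Drop rows that share the same key column, keeping the last record seen.
one_per_email = df.drop_duplicates(subset="email", keep="last")

print(no_exact_dupes)
print(one_per_email)
```

Deciding which columns define "the same record" is a business decision, so the `subset` here is only an example.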

###  Change Text to Lower/Upper Case
You want to _Capitalize_ names, or maybe make them uniform (some people can enter data with or without capital letters!). Check [here](https://www.geeksforgeeks.org/python-pandas-series-str-lower-upper-and-title/) for the Pandas way to do it.
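
For example, a small sketch using the Pandas string methods on two invented Series:

```python
import pandas as pd

names = pd.Series(["giacomo CIARLINI", "ADA lovelace"])
emails = pd.Series(["Ada@Example.COM", "GIACOMO@example.com"])

title_names = names.str.title()    # first letter of each word in upper case
lower_emails = emails.str.lower()  # uniform lower case for comparisons

print(title_names.tolist())
print(lower_emails.tolist())
```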

###  Spell Check
You want to correct misspelled words, for the sake of consistency. Check [here](https://www.tutorialspoint.com/python/python_spelling_check.htm) for a good Python module to do it. Also, this is a good starting point to [implement it](https://stackoverflow.com/questions/46409475/spell-checker-in-pandas). 

_Best practices and exercises:_ [1](https://stackoverflow.com/questions/7315114/spell-check-program-in-python), [2](https://norvig.com/spell-correct.html), [3](https://github.com/garytse89/Python-Exercises/tree/master/autoCorrect)

###  Reshape your data
Maybe you're going to feed your data into a neural network or show them in a colorful bar plot. Anyway, you need to transform your data and give them the right shape for your data pipeline. [Here](https://towardsdatascience.com/seven-clean-steps-to-reshape-your-data-with-pandas-or-how-i-use-python-where-excel-fails-62061f86ef9c) is a very good tutorial for this task. 
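
One common reshaping move is going from a "wide" table to a "long" one and back. A minimal sketch, assuming an invented sales table with one column per month:

```python
import pandas as pd

# Wide format: one row per store, one column per month.
wide_df = pd.DataFrame({
    "store": ["A", "B"],
    "jan": [10, 20],
    "feb": [15, 25],
})

# Wide -> long: one row per (store, month) pair.
long_df = wide_df.melt(id_vars="store", var_name="month", value_name="sales")

# Long -> wide again, with stores as the index.
back_df = long_df.pivot(index="store", columns="month", values="sales")

print(long_df)
print(back_df)
```

Plotting libraries often prefer the long form, while humans usually find the wide form easier to read.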

_Best practices and exercises:_ [1](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html), [2](https://discuss.codecademy.com/t/faq-data-cleaning-with-pandas-reshaping-your-data/384794).

###  Dealing with Special Characters
UTF-8 encoding is the standard to follow, but remember that not everyone follows the rules (otherwise, we wouldn't need [crime predictive analytics](http://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=1633&context=etd_projects)). You can learn [here](https://stackoverflow.com/questions/45596529/replacing-special-characters-in-pandas-dataframe) how to deal with strange accents or special characters.
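
A small sketch of two typical treatments, using only the standard `unicodedata` module and Pandas. The helper `strip_accents` is a name invented here, and the examples are made up:

```python
import unicodedata

import pandas as pd


def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


cities = pd.Series(["Málaga", "São Paulo", "Zürich"])
ascii_cities = cities.map(strip_accents)

# Remove anything that is not an ASCII letter, a digit, or a space.
codes = pd.Series(["AB#12!", "C-3 PO"])
clean_codes = codes.str.replace(r"[^A-Za-z0-9 ]", "", regex=True)

print(ascii_cities.tolist())
print(clean_codes.tolist())
```

Stripping accents throws information away, so do it only when your downstream task really needs plain ASCII.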

_Best practices and exercises:_ [1](https://www.w3resource.com/python-exercises/python-basic-exercise-92.php), [2](https://stackoverflow.com/questions/22518703/escape-sequences-exercise-in-python?rq=1), [3](https://learnpythonthehardway.org/book/ex2.html)

###  Normalizing Dates
I think there could be one hundred ways to write down a date. You need to decide your format and make it uniform across your dataset, and [here](https://medium.com/jbennetcodes/dealing-with-datetimes-like-a-pro-in-pandas-b80d3d808a7f) you can learn how to do it.
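
A minimal sketch with `pd.to_datetime`, assuming two invented columns that store the same dates in a European and a US style:

```python
import pandas as pd

df = pd.DataFrame({
    "eu_date": ["31/12/2019", "01/02/2020"],  # day/month/year
    "us_date": ["12-31-2019", "02-01-2020"],  # month-day-year
})

# Being explicit about the format avoids day/month mix-ups.
df["eu_date"] = pd.to_datetime(df["eu_date"], format="%d/%m/%Y")
df["us_date"] = pd.to_datetime(df["us_date"], format="%m-%d-%Y")

# Once parsed, both columns can be written out in one uniform ISO style.
iso = df["eu_date"].dt.strftime("%Y-%m-%d")

print(iso.tolist())
```

Letting Pandas guess the format also works in many cases, but an explicit `format` makes the ambiguity between day and month visible.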

_Best practices and exercises:_ [1](https://www.w3resource.com/python-exercises/python-conditional-exercise-41.php), [2](https://www.w3resource.com/python-exercises/date-time-exercise/), [3](https://www.kaggle.com/anezka/data-cleaning-challenge-parsing-dates)

###  Verification to enrich data
Sometimes it can be useful to engineer some data. For example, suppose you're dealing with [e-commerce data](https://www.edataindia.com/why-data-cleansing-is-important/), and you have the prices of each object sold. You may want to add a new column in your dataset, with a label carrying handy information like a Price_level [low, medium, high] based on upper and lower bounds you can decide. This is really simple in Pandas, check [here](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column). Another example is to add a Gender column (M, F) to easily explore data and gain insights in a customers dataset.
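
The Price_level idea can be sketched with `pd.cut`. The bounds below (10 and 500) are arbitrary assumptions chosen only for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["pen", "bag", "phone", "laptop"],
    "price": [2.0, 45.0, 300.0, 1200.0],
})

# Hypothetical bounds: up to 10 is low, up to 500 is medium, above is high.
bins = [0, 10, 500, float("inf")]
labels = ["low", "medium", "high"]
df["Price_level"] = pd.cut(df["price"], bins=bins, labels=labels)

print(df)
```

By default the intervals are closed on the right, so a price of exactly 10 would still count as "low".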

_Best practices and exercises:_ [1](http://www.inweb.org.br/w3c/dataenrichment/), [2](https://solutionsreview.com/data-integration/best-practices-for-data-enrichment-after-etl/), [3](http://www.inweb.org.br/w3c/dataenrichment/)

### Data Discretization
Some Machine Learning and Data Analysis methods cannot handle continuous data directly, and dealing with them can be computationally prohibitive. [Here](https://www.youtube.com/watch?v=TF3_6lwITQg) you find a good video explaining why and how you need to discretize data.
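
Two simple discretization strategies are equal-width and equal-frequency binning. A minimal sketch with Pandas, on an invented Series of ages and an arbitrary choice of four bins:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 40, 52, 67, 80])

# Equal-width binning: 4 intervals of the same length over the value range.
equal_width = pd.cut(ages, bins=4, labels=False)

# Equal-frequency binning: 4 intervals holding the same number of values.
equal_freq = pd.qcut(ages, q=4, labels=False)

print(equal_width.tolist())
print(equal_freq.tolist())
```

Equal-width bins can end up nearly empty when the data are skewed, which is where the equal-frequency variant helps.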

_Best practices and exercises:_ [1](https://www.researchgate.net/post/What_are_the_best_methods_for_discretization_of_continuous_features), [2](https://towardsdatascience.com/discretisation-using-decision-trees-21910483fa4b), [3](https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/discretization-methods-data-mining)

### Feature Scaling
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
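
Two widely used variants are min-max scaling and standardization (z-score). A minimal sketch in plain Pandas, on invented columns with very different ranges:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "income": [20_000.0, 35_000.0, 50_000.0, 95_000.0],
})

# Min-max scaling: every column ends up in the [0, 1] range.
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): every column gets mean 0 and std 1.
standardized = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(standardized)
```

In a real machine learning workflow, compute the statistics on the training data only and reuse them for new data.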

_Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleaning-challenge-scale-and-normalize-data), [2](https://www.quora.com/When-should-you-perform-feature-scaling-and-mean-normalization-on-the-given-data-What-are-the-advantages-of-these-techniques), [3](https://www.quora.com/When-do-I-have-to-do-feature-scaling-in-machine-learning)

### Data Cleaning Tools
You're not going to hunt tigers without a rifle! There are a ton of tools out there that will help you during the data cleaning process; the one I want to suggest is [this](https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/) open source tool from Google. Check [here](https://www.quora.com/What-are-the-best-open-source-data-cleansing-tools-software-available) for more.  

### Merge Data Sets and Integration
Now that you hopefully have been successful in your data cleaning process, you can merge data from different sources to create big [de-normalized](https://www.researchgate.net/post/When_and_why_do_we_need_data_normalization_in_data_mining_algorithms) data tables, ready to be explored and consumed. [This](https://www.quora.com/Is-data-warehouse-normalized-or-denormalized-Why) is why.
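
A minimal sketch of such a merge, assuming two invented tables that share a `customer_id` key:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cid"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 4],
    "amount": [9.5, 20.0, 5.0],
})

# Left join: keep every order and attach the customer data when it exists.
# indicator=True adds a "_merge" column saying where each row came from.
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)

# Orders pointing to a customer that is missing from the customers table.
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(unmatched)
```

Looking at the unmatched rows right after the merge is a cheap way to spot integration problems early.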

_Best practices and exercises:_ [1](https://www.ssc.wisc.edu/sscc/pubs/sfr-combine.htm), [2](https://rpubs.com/wsundstrom/t_merge), [3](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), [4](https://searchbusinessanalytics.techtarget.com/feature/Using-data-merging-and-concatenation-techniques-to-integrate-data), [5](https://www.analyticsvidhya.com/blog/2016/06/9-challenges-data-merging-subsetting-r-python-beginner/)

### Sanity Check
You always want to be sure that your data are _exactly_ how you want them to be. Because of this, it is a good rule of thumb to apply a sanity check after each complete iteration of the data preprocessing pipeline (i.e. each step we have seen until now).
Look [here](https://www.trifacta.com/blog/4-key-steps-to-sanity-checking-your-data/) for a good overview. Depending on your case, the sanity check can vary a lot.
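
As one possible shape for such a check, here is a small sketch. The function `sanity_check`, the column names, and the three rules are all assumptions made up for the example; your own rules will differ.

```python
import pandas as pd


def sanity_check(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if df["order_id"].duplicated().any():
        problems.append("duplicated order_id")
    if df["amount"].isna().any():
        problems.append("missing amount")
    if (df["amount"] < 0).any():
        problems.append("negative amount")
    return problems


good = pd.DataFrame({"order_id": [1, 2], "amount": [9.5, 20.0]})
bad = pd.DataFrame({"order_id": [1, 1], "amount": [-3.0, None]})

print(sanity_check(good))
print(sanity_check(bad))
```

Returning the list of problems, instead of stopping at the first one, makes it easier to see everything that went wrong in a single run.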

_Best practices and exercises:_ [1](https://blog.socialcops.com/academy/resources/4-data-checks-clean-data/), [2](https://www.r-bloggers.com/data-sanity-checks-data-proofer-and-r-analogues/), [3](https://www.quora.com/What-is-the-example-of-Sanity-testing-and-smoke-testing)

### Automate These Boring Stuffs!
As I told you at the very beginning, data preprocessing can take a long time and be very tedious. Because of this, you want to [automate](https://www.youtube.com/watch?v=UZUoH7_mYx4) as much as you can. Also, **automation is married with iteration**, so this is the way you need to plan your data preprocessing pipelines. [Here](https://github.com/mdkearns/automated-data-preprocessing) you find a good command line tool for doing that. I'm almost sure you'll need to build your own (remember, each problem is unique!), but this is a good starting point.
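
If you build your own, one simple pattern is to write each step as a small function and chain them with `DataFrame.pipe`. The step names and columns below are hypothetical and only sketch the idea:

```python
import pandas as pd


def strip_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Remove leading and trailing spaces from one string column."""
    return df.assign(**{column: df[column].str.strip()})


def drop_missing(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop the rows where the given column is missing."""
    return df.dropna(subset=[column])


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Run every cleaning step in a fixed, repeatable order."""
    return (
        df.pipe(strip_column, "name")
          .pipe(drop_missing, "age")
          .drop_duplicates()
          .reset_index(drop=True)
    )


raw = pd.DataFrame({"name": [" Ann", "Ann ", "Bob"], "age": [34, 34, None]})
clean = preprocess(raw)
print(clean)
```

Because every step takes a DataFrame and returns a new one, you can rerun the whole pipeline on each iteration without touching the raw data.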

_Best practices and exercises:_ [1](https://blog.panoply.io/5-data-preparation-tools-1-automated-data-platform), [2](https://www.quora.com/How-do-I-make-an-automated-data-cleaning-in-Python-for-ML-Is-there-a-trick-for-that), [3](https://www.quora.com/Is-there-a-python-package-to-automate-data-preparation-in-machine-learning), [4](https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/), [5](https://www.analyticsvidhya.com/blog/2018/10/rapidminer-data-preparation-machine-learning/)

### Conclusions
Now you're ready to take your data and play with them in a variety of ways, and you have a nice panoramic overview of the entire process. You can refer to this page when you clean data, to check that you're not missing any steps. Remember that each situation probably requires only a subset of these steps.

-------------------------
Written by [_clone95_](https://github.com/clone95)


================================================
FILE: Specializations/HardSkills/DataVisualization.md
================================================
# Data Visualization 

It was hard for Homo sapiens to survive in the African savannah: a human or animal could kill you at any time.
The human brain has evolved in this wild and unpredictable context, and evolution has "coincidentally" chosen to devote a great deal of computing power to capturing and understanding the world through **sight** ([more than 60 %](https://www.quora.com/How-much-of-the-brain-is-involved-with-vision-What-about-hearing-touch-etc)).\
So, it's no surprise that clear and effective data visualization is one of your best weapons in the Data Science world.

The track that inspired me for this guide is the must-buy book [**Storytelling with Data**](https://www.amazon.it/Storytelling-Data-Visualization-Business-Professionals/dp/1119002257/ref=sr_1_1?adgrpid=52005426669&gclid=CjwKCAjwndvlBRANEiwABrR32EhKMtGs8M5mBgl5lQJZCf9fglkx87ujqYVZk6gHsMDxKOd9yQa7uRoCin8QAvD_BwE&hvadid=255222968297&hvdev=c&hvlocphy=1008297&hvnetw=g&hvpos=1t3&hvqmt=e&hvrand=3841532584099296285&hvtargid=kwd-297573901809&keywords=storytelling+with+data&qid=1555538994&s=gateway&sr=8-1). It is by far the best data visualization book I've ever read.

You can find [here](http://www.bdbanalytics.ir/media/1123/storytelling-with-data-cole-nussbaumer-knaflic.pdf) the free PDF. 

Another piece of dense knowledge, exceptionally concise and the "father" of every data visualization book, is [**The Visual Display of Quantitative Information**](https://www.amazon.it/Visual-Display-Quantitative-Information/dp/0961392142).

I assume you know [basic Python](https://github.com/clone95/Virgilio/blob/master/NewToDataScience/PythonBasic.md).

The content listed here **is not** tool-specific (apart from "tools", did you ever imagine that?).

The concepts we'll go through are the following:

- [Legolas, how do your elf eyes see?](#Legolas,-how-do-your-elf-eyes-see?)
- [The Importance of Context](#The-importance-of-context)
- [The Data / Ink Ratio](#The-Data-/-Ink-Ratio)
- [Choose an Effective Visual](#Choose-an-Effective-Visual)
- [Focus your Audience’s Attention](#Focus-your-Audience’s-Attention)
- [Think like a Designer](#Think-like-a-Designer)
- [Exploring Model Visuals](#Exploring-Model-Visuals)
  - [Line Graph](#Line-Graph)
  - [Annotated Line Graph](#Annotated-Line-Graph)
  - [Stacked Bars](#Stacked-Bars)
  - [Positive and Negative Stacked Bars](#Positive-and-Negative-Stacked-Bars)
  - [Horizontal Stacked Bars](#Horizontal-Stacked-Bars)
- [Data Visualization tools](#Data-Visualization-tools)
  - [Microsoft Excel](#Microsoft-Excel)
  - [MatplotLib](#MatplotLib)
  - [Seaborn](#Seaborn)
  - [Bokeh](#Bokeh)
  - [Tableau](#Tableau)
  - [Power Bi](#Power-Bi)
- [Take Inspiration](#Take-Inspiration)
- [Storytelling with Data](#Storytelling-with-Data)
- [Common Visualization Mistakes](#Common-Visualization-Mistakes)
- [Additional Resources](#Additional-Resources)
- [Wrapping up and looking forward](#Wrapping-up-and-looking-forward)

#### **Let's Start!**
------------------------------------------------



### Legolas, how do your elf eyes see?
What do I mean by Data Visualization?\
Let's consider the [Tableau](https://www.tableau.com/learn/articles/data-visualization) definition:
>Data visualization is a graphical representation of information. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.

And according to [Wikipedia](https://en.wikipedia.org/wiki/Data_visualization):

>Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic should follow the task.

So, the **goal** of Data Visualization is to _communicate data facts_ to drive wise business decisions. Often these decisions have to be taken by executives, councils or managers and maybe they don't know all the technical stuff behind data!

Another interesting concept you should be familiar with is the Data-Driven company, a business model that more and more organizations are being convinced to adopt.

[Here](https://triggerbee.com/data-driven-marketing) you find a nice definition of Data-Driven company and [here](https://www.businessmodelsinc.com/big-data-business-models/) an interesting article about it.

As a data scientist, you are the interface between several business functions (product, research, techies and managers), and your main goal is to convince people to take the right decisions, based on data.

Often you intend to abstract the representation of the data from the underlying technical details and make them available for others. 
As usual, the audience you address is fundamental in deciding what data to communicate, and how.

The natural consequence of this statement is that you need to consider the importance of context. 

### The Importance of Context
As in any other field of communication, knowing your audience is critical to understand what you need to communicate.\
[Here](https://www.watershedlrs.com/blog/data-storytelling-know-your-audience) you find an article with some tips to know your audience.\
Basically, the more you know about your audience's interests, jobs, and individual situations, the more you can intercept their business needs and desires.
The more you can be specific about who your audience is, the more effective your position will be for successful communication.\
Avoid a general audience, such as "external stakeholders" or "anyone in the product department". If you try to communicate to too many different individuals with different needs at once, you risk not communicating to any of them as effectively as you would if you narrowed your target audience.\
If _you must_ remain general for some reason, try to simplify the most you can, and check [here](https://www.anl.gov/education/writing-a-general-audience-abstract) for some useful tips.\
[Here](https://www.techchange.org/2015/05/21/audience-matters-in-data-visualization/) you have some other reasons why your data presentation should be driven by the target audience.\
Once you have your target clearly in mind, you can start developing the content you want to present.


### The Data / Ink Ratio
The human brain has limited resources, and overloading it with numbers and notions can only lead to negative effects. People get bored easily, especially if your charts are hard to read or offer _too much_ information. 
As with most of the concepts I taught you in the [Impactful presentation guide](https://github.com/clone95/Virgilio/blob/master/Specializations/SoftSkills/ImpactfulPresentations.md), Less Is More is one of the principles you need to follow strictly.
Tufte's book stresses this mercilessly, calling it the "Data-Ink Ratio".
[Here](https://www.darkhorseanalytics.com/blog/data-looks-better-naked) you find the interesting journey of a chart, which takes it from unreadable to state-of-the-art minimalism.\
The general lesson here is to get rid of everything that is not needed to communicate the core of your data: extra lines, numbers, legends, names, points and so on. 

The more noise you can avoid, the more your information will flow gently to your audience and the more they'll remember it.

**Data/Ink Ratio = Amount of Ink used on Data / Total Amount of Ink used** 

Some additional resources to learn how to optimize the Data / Ink Ratio:
- [1](https://thedoublethink.com/tuftes-principles-for-visualizing-quantitative-information/), [2](https://medium.com/@sudharsanasai/declutter-your-chart-with-data-ink-ratio-6f6908727842), [3](http://davidgiard.com/2011/05/12/DataVisualizationPart5DataInk.aspx), [4](https://www.blue-granite.com/blog/data-visualization-remove-chart-clutter-and-focus-on-the-insights), [5](https://www.idashboards.com/blog/2016/05/19/spring-cleaning-eliminate-the-data-clutter/), [6](http://www.storytellingwithdata.com/blog/2016/3/1/declutter-your-data-visualizations)
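
To make the idea concrete, here is a rough sketch with Matplotlib (one of the tools listed in this guide) that removes some of the non-data ink from a bar chart. The sales numbers are invented, and the `Agg` backend is only an assumption so that the script also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [12, 15, 11, 18]

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.bar(months, sales, color="gray")

# Remove ink that does not encode data: frame lines and tick marks.
for side in ("top", "right", "left"):
    ax.spines[side].set_visible(False)
ax.tick_params(left=False, bottom=False)
ax.set_yticks([])

# Label the bars directly instead of relying on a y axis.
ax.bar_label(bars)
```

Call `fig.savefig(...)` or `plt.show()` afterwards to look at the result, and compare it with the default chart to see how much ink was not carrying data.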


### Choose an Effective Visual
Just as a warrior chooses a weapon depending on the context, you have to choose wisely the chart that represents each number you want to communicate.\
[Here](https://chartio.com/learn/dashboards-and-charts/what-are-common-chart-types-and-how-to-use-them/) is a list of the most common shapes and ideas to present data.\
As you can see, there are many different graphs and other types of visual displays of information, but a handful will work for the majority of your needs ([please don't use pie charts](https://www.businessinsider.com/pie-charts-are-the-worst-2013-6?IR=T)!).\
[Here](https://support.geckoboard.com/hc/en-us/articles/115002929972-How-to-choose-the-right-data-visualization) and [here](https://blog.hubspot.com/marketing/types-of-graphs-for-data-visualization) you'll find detailed, easy-to-follow checklists for deciding which type of chart best suits your case.

### Focus your Audience’s Attention
Within the brain, there are three types of memory that are important to understand as we design visual communications: [iconic](https://en.wikipedia.org/wiki/Iconic_memory) memory, [short‐term](https://en.wikipedia.org/wiki/Short-term_memory) memory, and [long‐term](https://en.wikipedia.org/wiki/Long-term_memory) memory. The one we need to leverage well in our presentations is iconic memory. It is responsible for most of the first impression of what we see, and has by far the greatest impact on our perception.\
[Here](https://brevitaz.com/data-visualisations/) you'll find a good explanation of how to leverage iconic memory.\
[Here](https://www.clarityinsights.com/blog/perception-communication) is another good read on this topic.

### Think like a Designer
The most important principle in design is that "the design of _____ should be driven by its function".\
Imagine a [gladius](https://it.wikipedia.org/wiki/Gladius_hispaniensis), the bread-and-butter weapon of the Roman army: you can _easily understand_ what its purpose is, even if no one has told you!\
[Here](https://www.team-consulting.com/insights/design-drivers-what-drives-great-design/) is a gentle introduction to design theory, highly recommended!\
[Here](http://guides.lib.berkeley.edu/data-visualization/design) you'll find useful design guidelines, and [here](https://uxdesign.cc/designing-a-dashboard-how-to-make-sure-it-will-show-useful-data-23af7e233d21) how to design an effective dashboard.

### Exploring Model visuals
#### Line Graph
Despite its simplicity, it's the most effective chart you can show (remember, less is more!). Most of the data you have can probably be presented through a line graph.\
[Here](https://www.smartdraw.com/line-graph/) you'll learn how to use its power properly.
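
As a rough sketch of a plain line graph in Matplotlib (the visit counts and year labels are invented for illustration), the example below draws two series, mutes the one that only provides context, and labels the lines at their ends instead of using a legend:

```python
import matplotlib.pyplot as plt

# Hypothetical website visits (in thousands) per month, for illustration only.
months = list(range(1, 13))
visits_2018 = [20, 21, 23, 22, 25, 27, 26, 28, 30, 31, 33, 34]
visits_2019 = [24, 26, 27, 30, 32, 35, 37, 36, 40, 43, 45, 48]

fig, ax = plt.subplots(figsize=(7, 3.5))

# The context series is gray; the series we want to talk about is colored.
ax.plot(months, visits_2018, color="lightgray", linewidth=2)
ax.plot(months, visits_2019, color="tab:blue", linewidth=2.5)

# Direct labels at the end of each line replace a separate legend.
ax.text(12.2, visits_2018[-1], "2018", color="gray", va="center")
ax.text(12.2, visits_2019[-1], "2019", color="tab:blue", va="center")

ax.set_xlim(0.5, 13.5)
ax.set_xticks(months)
ax.set_xlabel("Month")
ax.set_ylabel("Visits (thousands)")
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
```

Use `plt.show()` or `fig.savefig("visits.png")` to view it.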

#### Annotated Line Graph
Like the previous one, but with annotations that can help readability.\
[Here](http://www.storytellingwithdata.com/blog/2018/1/22/88-annotated-line-graphs) you find only 88 examples of that :-)
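
A small sketch of the same idea in Matplotlib, using `Axes.annotate` to attach a note to a single point. The sign-up numbers and the "new landing page" event are hypothetical, chosen only so there is something worth pointing at:

```python
import matplotlib.pyplot as plt

# Hypothetical weekly sign-ups per month; the jump in month 6 is invented.
months = list(range(1, 13))
signups = [30, 32, 31, 35, 36, 52, 50, 49, 51, 53, 55, 58]
event_month = 6

fig, ax = plt.subplots(figsize=(7, 3.5))
ax.plot(months, signups, color="tab:blue", linewidth=2.5)

# The annotation explains *why* the line changes, so the reader need not guess.
note = ax.annotate(
    "New landing page goes live",
    xy=(event_month, signups[event_month - 1]),  # the point being explained
    xytext=(1.5, 54),                            # where the text is placed
    arrowprops=dict(arrowstyle="->", color="gray"),
)

ax.set_xticks(months)
ax.set_xlabel("Month")
ax.set_ylabel("Sign-ups per week")
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
```

One or two annotations per chart are usually enough; more than that starts to work against the data-ink principle discussed earlier.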

#### Stacked Bars
Probably the most effective chart to compare quantities, they were already in use roughly [250 years ago](https://gizmodo.com/these-250-year-old-charts-and-graphs-were-the-very-firs-1445388576)!\
[Here](https://www.smashingmagazine.com/2017/03/understanding-stacked-bar-charts/) you'll find complete guidelines for using them.\
[Here](https://peltiertech.com/excel-3d-charts-charts-with-no-value/) you can understand why it is important to keep them as simple as possible, without 3D effects. A really interesting and in-depth read.
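
In Matplotlib, stacking is done by telling each series where the previous ones ended, through the `bottom` argument of `Axes.bar`. The sketch below assumes three hypothetical products with invented quarterly sales:

```python
import matplotlib.pyplot as plt

# Hypothetical quarterly sales per product, invented for illustration.
quarters = ["Q1", "Q2", "Q3", "Q4"]
series = {
    "Product A": [20, 25, 30, 35],
    "Product B": [15, 18, 20, 22],
    "Product C": [5, 7, 9, 12],
}
positions = list(range(len(quarters)))

fig, ax = plt.subplots(figsize=(6, 3.5))

# Each series starts where the running total of the previous ones stops.
running_total = [0] * len(quarters)
for label, values in series.items():
    ax.bar(positions, values, bottom=running_total, label=label)
    running_total = [total + v for total, v in zip(running_total, values)]

ax.set_xticks(positions)
ax.set_xticklabels(quarters)
ax.set_ylabel("Units sold (thousands)")
ax.legend(frameon=False)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
```

After the loop, `running_total` holds the height of each full bar, which is handy if you want to print the totals on top of the bars.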

#### Positive and Negative Stacked Bars
With negative values, you can easily show bad-vs-good performance or in-vs-out flows.\
[Here](https://peltiertech.com/diverging-stacked-bar-charts/) is a detailed explanation of how and when to use them.
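
A minimal sketch of an in-vs-out chart, assuming invented monthly cash flows: inflows are drawn as positive bars, outflows as negative bars on the same positions, and a line shows the net result.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly cash flows (thousands), invented for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
inflow = [50, 65, 40, 70]
outflow = [30, 45, 55, 35]
net = [i - o for i, o in zip(inflow, outflow)]
positions = list(range(len(months)))

fig, ax = plt.subplots(figsize=(6, 3.5))

# Positive values grow upward from zero, negative values grow downward.
ax.bar(positions, inflow, color="tab:blue", label="In")
ax.bar(positions, [-value for value in outflow], color="tab:red", label="Out")
ax.plot(positions, net, color="black", marker="o", label="Net")

# A visible zero line is the reference the whole chart is read against.
ax.axhline(0, color="black", linewidth=0.8)

ax.set_xticks(positions)
ax.set_xticklabels(months)
ax.set_ylabel("Cash flow (thousands)")
ax.legend(frameon=False)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
```

The same pattern works for good-vs-bad survey answers: plot the "bad" category with negated values so that it extends to the opposite side of zero.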

#### Horizontal Stacked Bars
You don't need to be a fan of the Flat Earth "theory" to use horizontal bar charts! They're similar to their vertical cousins, but orienting the chart horizontally means the category names along the left are written as horizontal text and are easy to read.\
[Here](https://apexcharts.com/javascript-chart-demos/bar-charts/) is a guide to using them.\
[Here](https://depictdatastudio.com/when-to-use-horizontal-bar-charts-vs-vertical-column-charts/) is an interesting article that explains when to choose horizontal or vertical bars.
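
A short sketch with `Axes.barh`, assuming a hypothetical tally of customer complaints with long category names. Sorting the categories before plotting is my own addition, not something the articles above require; it simply puts the largest bar at the top.

```python
import matplotlib.pyplot as plt

# Hypothetical complaint counts by category, invented for illustration.
complaints = {
    "Late delivery": 42,
    "Damaged packaging": 17,
    "Wrong item shipped": 28,
    "Difficult returns process": 35,
}

# barh draws the first item at the bottom, so sort in ascending order
# to end up with the largest category at the top of the chart.
ordered = sorted(complaints.items(), key=lambda item: item[1])
labels = [name for name, _ in ordered]
counts = [count for _, count in ordered]

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(labels, counts, color="steelblue")

ax.set_xlabel("Number of complaints")
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
fig.tight_layout()  # leaves room for the long category names on the left
```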

### Storytelling with Data
When you see a great play, watch a captivating movie, or read a fantastic book, you’ve experienced the magic of the story. A good story grabs your attention and takes you on a journey, evoking an emotional response. In the middle of it, you find yourself not wanting to turn away or put it down. After finishing it—a day, a week, or even a month later—you could easily describe it to a friend.

If you achieve this with your audience, you've arrived: you have won first prize!

- **Find a subject you care about**. It is this genuine caring, and not your games with language, which will be the most compelling and seductive element in your style.
- **Keep it simple**. Great masters wrote sentences which were almost childlike when their subjects were most profound. “To be or not to be?” asks Shakespeare’s Hamlet. The longest word is three letters.
- **Choose what to leave behind**. If a sentence or a chart, no matter how excellent, does not illuminate your subject in some new and useful way, scratch it out.
- **Don't fool people with data**. [These](https://venngage.com/blog/misleading-graphs/) are clear examples of what I'm saying.
- **Be clear**. If I broke the rules of punctuation, or bent the meaning of words (technical or not), I simply would not be understood.
- **Pity the readers**. Our audience requires us to be sympathetic and patient teachers, ever willing to simplify and clarify.
- **Be suggestive**. Try to summon pictures, sounds, and feelings during your stories.
- **Have a great End**. Leave your audience with a sentence that will serve as the reminder of your presentation, the innermost core of your topic: the thing you want your audience to think about when they remember your presentation.

For other tips and suggestions about storytelling, check my other guide, [Impactful Presentations](https://github.com/clone95/Virgilio/blob/master/Specializations/SoftSkills/ImpactfulPresentations.md).

Sorry, I'm a hopeless fan of the [DRY principle](https://it.wikipedia.org/wiki/Don%27t_repeat_yourself).


### Data Visualization tools
In this section I introduce you to the most accessible and well-known tools, which will give you a marketable skill in Data Visualization.

#### Microsoft Excel
Do yourself a favor and learn [**Excel now!**](https://www.youtube.com/watch?v=-ujVQzTtxSg&list=PLWPirh4EWFpEpO6NjjWLbKSCb-wx3hMql)\
Excel is the Swiss Army knife for a lot of basic data management, computation, and representation.\
Despite its scalability limits, it's still one of the tools that _support companies_ today.\
Take [this](https://www.youtube.com/watch?v=RwUSUjRGKVM) course about data visualization with Microsoft Excel.  
[Here](https://www.keynotesupport.com/excel-basics/excel-charts-beginners.shtml) you have another good one.\
[Here](https://www.webucator.com/tutorial/intermediate-microsoft-excel/visualizing-your-data.cfm) you have some exercises to test your skill.\
[Here](https://policyviz.com/2017/07/25/my-top-10-data-visualization-excel-websites/) is a list of cool websites about Excel visualizations.

#### Matplotlib

[Matplotlib](https://matplotlib.org/) is one of the most used libraries for graphical representation in Python, and a lot of other libraries are built on top of it.
In my opinion it is not the easiest library to understand and use, but it is still worth learning in order to get the most out of the tutorials on the Internet. You can also find a lot of examples on [StackOverflow](https://stackoverflow.com/).\
The [official beginner's guide](https://matplotlib.org/users/beginner.html) is really complete and contains everything you need to get started and then become proficient with the library.\
[Here](https://matplotlib.org/Matplotlib.pdf) you have the complete documentation.\
[Here](https://pythonspot.com/matplotlib/) is another bunch of chart-specific tutorials.\
[Here](https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/) is a collection of the 50 most useful visualizations, with code.\
[Here](http://www.randalolson.com/2014/06/28/how-to-make-beautiful-data-visualizations-in-python-with-matplotlib/) you'll find advanced charts and the code to produce them.\
[Here](https://www.cheatography.com/gabriellerab/cheat-sheets/matplotlib-pyplot/) is a handy cheat sheet.

Challenge yourself:
- [1](http://www.ceda.ac.uk/static/media/uploads/ncas-reading-2015/matplotlib_exercises_solutions.pdf), [2](https://pynative.com/python-matplotlib-exercise/), [3](https://anaconda.org/gwinnen/matplotlib-exercises/notebook), [4](https://www.w3resource.com/graphics/matplotlib/)

Best practices:
- [1](https://www.scivision.dev/best-practices-for-matplotlib-plots/), [2](https://www.quora.com/What-are-some-best-practices-for-matplotlib-to-improve-the-quality-and-appearance-of-your-graphs-and-plots), [3](https://stackoverflow.com/questions/18059269/best-practices-to-write-function-embedding-matplotlib-plot-call), [4](https://matplotlib.org/tutorials/introductory/lifecycle.html)

#### Seaborn
Just as your brain is fascinated by beauty in humans, art, or cute puppies, it is fascinated by beautiful visualizations. A common library **built on top of Matplotlib** is [Seaborn](https://seaborn.pydata.org/). It's used to enhance Matplotlib charts, so you need to become comfortable with the "mother library" first.\
Follow [this](https://www.youtube.com/playlist?list=PL998lXKj66MpNd0_XkEXwzTGPxY2jYM2d) YouTube tutorial; it covers most of what you need to get started.\
Then read [this](https://stepupanalytics.com/introduction-to-python-for-data-visualization-with-seaborn/) long and complete blog post.\
[Here](https://www.kaggle.com/kanncaa1/seaborn-tutorial-for-beginners) you find another long tutorial for beginners. 

Challenge yourself: [1](https://anaconda.org/gwinnen/seaborn-exercises/notebook), [2](http://unsupervisedlearning.co.uk/2017/11/08/seaborn-exercises-solutions/), [3](https://www.codecademy.com/courses/learn-seaborn/lessons/seaborn-distributions/exercises/box-plots-ii)

Best practices: [1](http://walkerke.github.io/geog30323/slides/data-visualization/), [2](https://mode.com/resources/analytics-dispatch/data-visualization-best-practices/), [3](https://www.datacamp.com/courses/improving-your-data-visualizations-in-python)

Additional examples: [1](https://python-graph-gallery.com/category/seaborn/), [2](https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html), [3](https://towardsdatascience.com/data-visualization-using-seaborn-fc24db95a850), [4](https://www.kaggle.com/mchirico/plotly-seaborn-examples)

#### Bokeh
From the [Bokeh](http://bokeh.pydata.org/en/latest/) documentation:

>Bokeh is an interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of versatile graphics, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.

Bokeh prides itself on being a library for interactive data visualization.

Unlike popular counterparts in the Python visualization space, like Matplotlib and Seaborn, Bokeh renders its graphics using HTML and JavaScript. This makes it a great candidate for building interactive web-based dashboards and applications. 

But what's the real difference among Bokeh, Matplotlib and Seaborn?

As a comment in this Reddit [thread](https://www.reddit.com/r/Python/comments/4tuwoz/how_do_you_decide_between_the_plotting_libraries/) says: 

> Each library has its own distinct purpose:
>
> Matplotlib is for basic plotting -- bars, pies, lines, scatter plots, etc.
>
> Seaborn is for statistical visualization -- use it if you're creating heatmaps or somehow summarizing your data and still want to show the distribution of your data
>
> Bokeh is for interactive visualization -- if your data is so complex (or you haven't yet found the "message" in your data), then use Bokeh to create interactive visualizations that will allow your viewers to explore the data themselves.

[Here](https://mybinder.org/v2/gh/bokeh/bokeh-notebooks/master?filepath=tutorial%2F00%20-%20Introduction%20and%20Setup.ipynb) you have the official tutorial. It covers pretty much everything you need to know, so go through it. It contains exercises too.\
[Here](http://bokeh.pydata.org/en/latest/docs/user_guide.html) you have the official user guide.

Another list of useful additional tutorials: [1](https://towardsdatascience.com/data-visualization-with-bokeh-in-python-part-one-getting-started-a11655a467d4), [2](https://realpython.com/python-data-visualization-bokeh/)

Additional examples: [1](https://www.journaldev.com/19527/bokeh-python-data-visualization), [2](https://programminghistorian.org/en/lessons/visualizing-with-bokeh), [3](https://www.analyticsvidhya.com/blog/2015/08/interactive-data-visualization-library-python-bokeh/), [4](https://www.geeksforgeeks.org/python-data-visualization-using-bokeh/), [5](https://github.com/bokeh/bokeh/tree/master/examples)

#### Power BI
[Power BI](https://powerbi.microsoft.com/it-it/) is a super cool tool from Microsoft, used mostly in Business Intelligence to build relationships among data, and to clean and visualize them in wonderful interactive dashboards. What I love about Power BI is that it's free for personal use and very cheap for enterprise purposes. It's also super easy to use.\
Check [this](https://www.youtube.com/watch?v=gqO0EiCn4cY) tutorial for beginners and then explore the official [Guided Learning](https://docs.microsoft.com/en-us/power-bi/guided-learning/); it has a lot of step-by-step tutorials and side projects to challenge yourself.

Good additional resources to follow: [1](https://www.youtube.com/user/mspowerbi), [2](https://www.youtube.com/channel/UCFp1vaKzpfvoGai0vE5VJ0w), [3](https://www.youtube.com/channel/UC-h-wArcxJC8zBOD-UxfCOg), [4](https://www.youtube.com/channel/UCaTn-yDjPDvf-1CtJJHTNcQ), [5](https://www.youtube.com/user/ModernExcel)

Best practices: [1](https://www.c-sharpcorner.com/article/power-bi-best-practices-part-3/), [2](https://docs.microsoft.com/it-it/power-bi/visuals/power-bi-visualization-best-practices), [3](https://community.powerbi.com/t5/Community-Blog/Best-Practices-For-Power-BI-Desktop-Development/ba-p/521710), [4](https://powerpivotpro.com/2017/06/top-5-power-bi-visual-design-practices-transforming-good-great/)

### Take Inspiration
The best way to gain confidence with data visualization is to look at lots and lots of data visualizations.
Here are plenty of resources you can draw inspiration and ideas from.

Websites: [1](https://www.idashboards.com/blog/2018/07/06/get-inspired-19-inspiring-data-viz-designs/), [2](https://medium.com/@Infogram/18-data-visualization-resources-for-education-and-inspiration-529c6f528983), [3](https://www.pinterest.it/stevenschillema/data-visualization-inspiration/?lp=true), [4](https://www.designyourway.net/blog/inspiration/data-visualization-designs-that-should-inspire-you-23-infographics/), [5](https://www.awwwards.com/websites/data-visualization/), [6](https://mode.com/resources/analytics-dispatch/data-visualization-examples/), [7](https://visme.co/blog/examples-data-visualizations/), [8](https://datavizproject.com/)

Bonus point!
Try [Google Facets](https://pair-code.github.io/facets/), a super useful web tool for fast visualizations. It's really EASY to use: you can upload your dataset and get the first insights from it. It's also awesome for showing data to non-technical people.

### Storytelling with Data
I can't stress this point enough. When you prepare data visualizations, focus on a story to tell your audience.\
This approach has several [proven and positive](https://www.dataplusscience.com/files/Kosara_Computer_2013.pdf) effects.\
[**Definitely check this**](https://www.slideshare.net/kris77chan/edward-segel-interactivestorytelling) presentation about storytelling with data: it is the best resource I've ever found on this concept applied to data visualization.\
[Here](http://www.nickdiakopoulos.com/2013/04/12/storytelling-with-data-what-are-the-impacts-on-the-audience/) you'll find a good article that explains _why_.\
[Here](https://www.forbes.com/sites/brentdykes/2016/03/31/data-storytelling-the-essential-data-science-skill-everyone-needs/#202002b852ad) is another interesting read.

### Common Visualization Mistakes
As an old Chinese saying goes:
> Look at others' mistakes, and correct your own.

Knowing the most frequent mistakes is fundamental to mastering a skill, so here is a bunch of resources that will make you aware of the "don'ts" of data visualization:

- [1](https://www.anychart.com/blog/2017/08/29/data-visualization-mistakes-avoid/), [2](https://undullify.com/data-visualization-102-common-mistakes-visualizing-data/), [3](https://www.rtinsights.com/what-are-the-5-most-common-data-visualization-mistakes/), [4](https://thenextweb.com/dd/2015/05/15/7-most-common-data-visualization-mistakes/), [5](https://www.reddit.com/r/datascience/comments/8wj1nr/play_your_charts_right_an_illustrated_collection/)

### Additional Resources
I really love data visualization, and over the last few years I've collected a lot of cool websites and "need-to-bookmark" places. I've already given you a lot of them; here I list everything that remains.

- [Data is Beautiful SubReddit](https://www.reddit.com/r/dataisbeautiful/)
- [Dataviz SubReddit](https://www.reddit.com/r/dataviz/)
- [The Pudding](https://pudding.cool/)
- [FlowingData](https://flowingdata.com/)
- [Small Multiples](https://smallmultiples.com.au/projects/)
- [Awesome Interactive Journalism](https://github.com/wbkd/awesome-interactive-journalism)
- [EdwardTufte Twitter account](https://twitter.com/EdwardTufte)
- [FiveThirtyEight](https://fivethirtyeight.com/)
- [List of super cool websites](https://www.reddit.com/r/dataisbeautiful/comments/435g7b/i_love_live_data_visualizations_heres_every_one/)
- [Every line of Hamilton](https://pudding.cool/2017/03/hamilton/)
- [Storytelling with Data blog](http://www.storytellingwithdata.com/)

### Wrapping up and looking forward
What I've tried to do here is draw a map of the most useful resources about data visualization (I've searched for and compared a lot of them), to give you a reference point for the subject.\
As I suggested earlier, the only way to become really comfortable with something is to practice it first-hand. So the best tip I can give you is "find your project".

- Choose a topic that interests you in some way. You can find a lot of free public datasets to experiment with. Check your country's websites or browse [Kaggle](https://www.kaggle.com/) or [UCI](https://archive.ics.uci.edu/ml/index.php) to find a lot of them.
- Plot the data in every way you can think of, applying the techniques you have seen.
- Get inspired by looking at how people have visualized similar datasets. Search Kaggle for "Visualization" and you'll be stunned by the number of examples.

It's better to be proficient in one tool and barely know the others than to be a jack of all trades but master of none. So I suggest you choose the tool that inspires you most and dive deep into it. The tools we've seen overlap with each other in many ways, but they differ in scale and approach.

Happy Learning and good luck with your studies!

-------------------------
Written by [_clone95_](https://github.com/clone95)


================================================
FILE: Specializations/SoftSkills/ImpactfulPresentations.md
================================================
# Impactful Presentations

## Why do you need to impress your audience?

In my day, in ancient Rome, data scientists were called [haruspices](https://en.wikipedia.org/wiki/Haruspex).
The state-of-the-art technique they used to represent data and make decisions based on it was to spread the entrails of a bird on a sacred table and try to interpret them.
"Such a small stomach? Aha! I'm going to tell the general now that it's time to attack decisively!"

But they always had the problem that they couldn't exceed an average success rate of 50 percent!
Today, fortunately, we have more effective tools, and the core of "Data Science" is looking at the factual evidence of the past to try to make better decisions in the present that affect the future. In the age of data-driven decisions, it is increasingly important to have a clear representation of the data and of how it advises us to act in practice.

Let us give a few examples:

An e-commerce company may want to understand which products sell best, to what kind of target, and in what volumes.
It may want to understand which products are often bought together so it can
show the famous "hey, the other users who bought X also bought Y!".\
A newspaper might be interested in how its readers are divided into age groups, and in what numbers,
and might also want to see at what time of day its articles are read and from what type of device (mobile, desktop, real paper?).\
A bank wants to understand the maximum margins that can be put on policies and loans, to find the best compromise between competitiveness and profit. These are just a few examples; now let's play a game:
try to think of any kind of human activity, involving more than 5 humans, that would benefit from a clear representation
of the data of their activity, to make actual decisions.

You will find that this reasoning applies to anything, _if you have enough data!_

Extracting knowledge from the data, however, is useless if then the audience 
to which we must communicate (managers, customers, colleagues, departments)
does not perceive the urgency of looking at the data to make decisions!

With this guide, I want to introduce you to the key principles for effectively presenting your data and the conclusions
that can be drawn from them. Remember! Every decision you propose will always have a certain risk of not being the right one 
(the real world is based on unpredictable and chaotic mechanics and interactions), but the important point in all this is to 
get as far away as possible from the success rate achieved by the haruspices and their weak and random 50 percent.

Throughout this guide, I assume that you're presenting with the support of simple slides.

## Prerequisites
None! But read [this book](https://www.amazon.it/Pyramid-Principle-BarbaraMinto/dp/0273710516) if you have time!

# Index
- [How to build the content](#How-to-build-the-content)
   - [Know your audience](#Know-your-audience)
   - [Develop high quality content](#Develop-high-quality-content)
   - [Build a structure](#Build-a-structure)
   - [Less is more](#Less-is-more)
   - [Leverage data power wisely](#Leverage-data-power-wisely)
   
- [How to present the content](#How-to-present-the-content)
   - [Connect with your audience](#Connect-with-your-audience) 
   - [Don’t read](#Don’t-read)
   - [Be intriguing](#Be-intriguing)
   - [Use humor](#Use-humor)
   - [Do not tell lies](#Do-not-tell-lies)
   - [Take care with contradictions](#Take-care-with-contradictions)
   - [The Pragmatic Storyteller](#The-Pragmatic-Storyteller)
 
Let's dive right in!

## How to build the content

### Know your audience
Are you talking to managers or developers? Are you training your salespeople, or are you yourself the seller who has to convince a customer?
In any case, the first rule is [**Know your audience**](https://www.asme.org/career-education/articles/public-speaking/public-speaking-know-your-audience).
If you know your audience, their tastes, their interests, you can build your presentation in the best and most targeted way. [Here](https://www.ethos3.com/2009/10/5-ways-to-get-to-know-your-audience/) you find 5 more ways to do it.  

### Develop high-quality content
The content is the nucleus of your presentation. While all the other ideas below will help you make your content more effective, a great presentation starts and ends with great content. So don't shortchange your audience by skimping on the effort you spend developing your content. You will need to invest many hours in research, writing and asking for feedback if you want to create a presentation that your audience will love. You want to refine your material as much as you can, but don't over-prepare everything; otherwise you'll seem too rigid and artificial: we'll see that one of the key points in this guide is "be authentic".

### Build a structure
I strongly advise you to build an organic structure that you then follow in your presentation.

[Here](https://virtualspeech.com/blog/how-to-structure-your-presentation) you find a good overview of "how to build a structure".
[Here](https://visme.co/blog/presentation-structure/) you find 7 common structures.
A simple and well-known structure is the 3-act one.

Although not all presentations fit easily into the 3-act structure, it is generally a good method to follow (with the necessary adjustments according to the situation).

**1** - The first Step is the introduction, the setting of the presentation. This is the moment when you capture the audience's attention, giving them the expectation of what will come out of it and a reason to keep listening.

**2** - The intermediate Step is the moment when you sustain their interest. You are usually detailing a problem and offering a solution while you educate and inform along the way. It is here that you truly build your case and sell the benefits. This is where you want to provide compelling examples, data, statistics, etc. to support your points.

**3** - The final Step is where you solve the problem, summarize it and remind the audience of the highlights of your presentation. Then leave the audience with a call to action and a list of practical points. What is the audience supposed to take away from your presentation? This should be clearly defined in the closing Step. Also, a final story or illustration and questions from the audience are a great way to end the presentation and help people remember the sense of your discourse.

### Less is more
Packing slides with information does not necessarily make them more effective. In fact, you often get the opposite effect, producing confusing slides that take away from, rather than add value to, a presentation.
Well-designed slides help the speaker to emphasize his or her point of view and the audience to understand the key steps of a presentation.\
Follow the principle of ["Less is More"](https://www.presentation-guru.com/when-it-comes-to-presentations-less-is-more/).\
[The Hemingway Editor](http://www.hemingwayapp.com/) will help you a lot to write in a good style, favoring conciseness and fluency.

### Leverage data power wisely
According to Spider-Man's uncle:
> With great power comes great responsibility.

Data can be your ally or your enemy during a presentation, depending on how you use it and especially on how you show it graphically. Don't put too much data in your presentation (remember, Less is More)!
Take this [awesome read](https://moz.com/blog/data-visualization-principles-lessons-from-tufte), then find [here](http://mkweb.bcgsc.ca/talks/datavisualization/datavisualization.pdf) great examples of what I'm saying. 

## How to present the content

### Connect with your audience
First, you need to [create a link](https://www.forbes.com/sites/lisaroepe/2017/03/14/6-ways-to-connect-with-your-audience-during-a-presentation/#73e158396516) between you and your audience. In this way, they'll follow you because they trust that you're useful for them and you genuinely want them to learn something. I like to call this process ["building trust"](https://www.trainingjournal.com/articles/opinion/how-build-trust-audience-when-making-presentation). 

### Don’t read
Really, don't do it. Reading slides will bore your audience and you'll seem less confident.
[Here's](https://www.techwell.com/2013/10/give-better-presentation-don-t-read-your-slides) why.
[Here's](https://academia.stackexchange.com/questions/76370/why-do-most-people-think-its-a-bad-idea-to-read-from-slides) a gold mine of other reasons.

### Be intriguing
The best way to attract people is to grab their attention. Doing this is not so easy, but with some psychological tricks you can bewitch and convince them. [Here](https://www.inc.com/sims-wyeth/how-to-capture-and-hold-audience-attention.html) you have a really good explanation of what I'm saying.

### Use humor
If you're here, it's mostly because I've intrigued you and caught your attention. A big part of this game has been played by humor.
[Here](https://www.writersdigest.com/online-editor/how-to-mix-humor-into-your-writing) you can learn how to embed humor in your writings, and [here](https://www.fastcompany.com/3068891/how-to-incorporate-humor-into-presentations-in-the-most-un-cringeworthy-way) you can learn how to do it in your speeches.

### Do not tell lies
Nowadays it is very easy to check a lot of information. Always be sure your information is true. When the audience knows or figures out that your information is false, they disconnect from you and you lose your credibility.

### Take care with contradictions
Before you start, check that your whole speech is coherent and free of contradictions. Saying something and then saying the opposite a few minutes later is very bad for your credibility. Always take care with that.

### The Pragmatic Storyteller
A sunny morning in 44 B.C.
> "Good morning, noble court legislators. Today I want to tell you a personal story. When I was 7, my father took me to the Colosseum in Rome for the first time to see the gladiators fight. Like all children, I was very excited about that day: I had seen all the older boys playing gladiators with fake swords and talking about the furious fights between men, lions, and elephants!
Tens of thousands of people were there for the same reason: to see two prisoners fight in blood until one of them wins.
I could feel different sensations in the air: the excitement of the audience, who want to vent their repressed violence, and the fear in the eyes of the fighters, each knowing that the other will fight with the utmost commitment because his life is at stake.
The clash begins, the crowd screams and enjoys a show that I can only consider terrifying. In the end, the physically stronger man wins and mercilessly kills the other.
Dear noble legislators, today I'm 50 years old and I'm here to ask you: is this the symbol we want for the Roman civilization? How do we differ from the barbarians beyond the Alps?
How can we consider ourselves a refined people rich in culture, if we fall into easy vices like that of gratuitous violence? For these reasons, appealing to your intelligence, _I suggest you abolish the violent games in the Colosseum, for the sake of the image of our civilization._"

This was **Virgilio's personal story** talking, not me.
This was Storytelling.

You can notice several things:
- Coherent flow (I didn't improvise.)
   - Setting
   - Story 
   - Emotions and sensations (you need to reproduce them in your audience)
   - Motivations for the conclusion
   - Conclusion
- Use of the first person
- Rhetorical questions
- Naturalness
- Familiarity with the audience that grows with the story, while we show spontaneity and humility
- In the end, a wrap-up of reasons and the practical suggestion

Apply [StoryTelling](https://blog.hubspot.com/marketing/storytelling) in a coherent fashion.

The more authentic, visceral and suggestive you are, the more your audience will trust you and remember
the concepts.

[Here](https://www.youtube.com/watch?v=Nj-hdQMa3uA) is a brief TED Talk about this.\
[Here](https://www.articulatemarketing.com/blog/22-rules-of-storytelling-from-pixar) you find a great list of storytelling best practices from Pixar.\
[Here](https://visme.co/blog/visual-storytelling-rules/) you find the rules of thumb of visual storytelling.

----
Written by [_clone95_](https://github.com/clone95)


================================================
FILE: Tools/GeoGebra.md
================================================
# GeoGebra
[GeoGebra](https://www.geogebra.org) (GG) is a powerful dynamic mathematics application for all levels of education that combines geometry, algebra, spreadsheets, graphing, statistics and calculus in a single easy-to-use package. Its community is growing quickly, with millions of users in many countries. GeoGebra has become a leading provider of dynamic mathematics software, supporting science, technology, engineering and mathematics (STEM) education and innovations in teaching and learning around the world.

## Installation
GeoGebra applications can be used offline on [iOS](https://itunes.apple.com/us/app/geogebra-graphing-calculator/id1146717204), [Android](https://play.google.com/store/apps/details?id=org.geogebra.android), [Windows](https://www.geogebra.org/download), Mac, Chromebook and Linux.

Note that all installers are subject to the non-commercial license. If you intend to install GeoGebra on several devices, you may be interested in GeoGebra Mass Installation.


## Features
- Geometry, Algebra and Spreadsheet are connected and fully dynamic
- Ease of use of the interface with many powerful features
- Innovative tool for creating interactive learning resources in the form of web pages
- Open source software freely usable by non-commercial users

In the simplest way, you can make constructions containing points, vectors, segments, lines and conics as well as functions, which can then be modified dynamically with the mouse.

## Mathematical representations of GeoGebra
GeoGebra offers three different representations of mathematical objects: _a graphical representation_ (e.g. points, function curves), _an algebraic representation_ (e.g. point coordinates, equations) and _a spreadsheet representation_ (values in spreadsheet cells).

## GeoGebra applications
We will demonstrate how easy it is to use GeoGebra applications by presenting:

- [Scientific Calculator](#scientific-calculator)
- [Graphing Calculator](#graphing-calculator)

All GeoGebra applications are fully compatible because they are based on the same powerful GeoGebra math engine. Your GeoGebra files will work in all applications and on all your devices.

### Scientific Calculator
The GeoGebra scientific calculator is available online via this [site](https://www.geogebra.org/calculator).
The scientific calculator includes:
- Calculations using fractions
- Trigonometric functions: `sin`, `cos`, `tan`
- Statistical functions
- Exponential functions and logarithms
- Exam mode for tests

The GeoGebra scientific calculator consists of a header bar, an input bar and an on-screen keyboard. The keyboard has three layouts, and you switch between them by selecting the one you want to use:
- _`123` keyboard_: keys for numbers and for basic arithmetic, trigonometric and logarithmic operations.
- _`f(x)` keyboard_: keys for statistical and other mathematical functions.
- _`ABC` keyboard_: letter keys.

### Graphing Calculator
The GeoGebra Graphing calculator is available online via this [site](https://www.geogebra.org/graphing). This graphing calculator gives you the possibility to draw functions and to explore equations.
 
To create a new curve, type your expression in the input field. The software traces the curve of your expression as you type.
For example, you can draw a simple line by typing the expression `y = 2x + 3`. To make the graph more dynamic, you can use parameters instead of constants (for example, `y = ax + b`).
Add sliders for the parameters by clicking on the relevant buttons, or define them yourself by entering `a=2` and `b=3`. Once parameters such as `a` and `b` have values, you can adjust those values using sliders.

Graphing Calculator features include: 
- Representing functions, polar and parametric curves
- Solving equations with a powerful mathematical engine
- Experimenting with transformations using sliders
- Calculating derivatives and integrals
- Computing statistics and regressions with fitted lines

------------
Created by Khaled Bayoudh. Contacts: [mail](mailto:khaled.isimm@gmail.com) [platform](http://deep-tech.cf)




================================================
FILE: Tools/Latex.md
================================================
# LaTeX
LaTeX is a markup language (or, as stated on the [official website](https://www.latex-project.org/about/), "a document preparation system for high-quality typesetting") used to create wonderful papers and presentations. Almost all papers you will read during your career are written using LaTeX. So, let's see how it works!

## Why LaTeX?
For years now, LaTeX has been the go-to tool whenever someone needs to create a document that contains mathematical formulas. LaTeX is widely used to write scientific papers, and it is also used by bloggers and scientific content creators on the internet. You can even use LaTeX syntax on Facebook Messenger! (It only renders if you are on your computer.)

## Installation
There are several LaTeX distributions; you can see a complete list [here](http://www.tug.org/interest.html#free).

- On Unix systems, you can install [TeX Live](http://www.tug.org/texlive/). In particular, on Ubuntu you can type `sudo apt-get install texlive-full` in the terminal.
- On Windows, you can install [MiKTeX](https://miktex.org/) or [TeX Live](http://www.tug.org/texlive/).
- On macOS, you can install [MacTeX](http://www.tug.org/mactex/).

After the installation, you need an editor to write your LaTeX document. You can use whatever editor you want (notepad, vim, nano, gedit and so on), but I recommend [Texmaker](http://www.xm1math.net/texmaker/), which is free and cross-platform. Visual Studio Code with some dedicated extensions (such as [LaTeX Workshop](https://marketplace.visualstudio.com/items?itemName=James-Yu.latex-workshop)) also works well.

## Writing a document
There are tons of on-line guides about LaTeX to get you started. Among them:
- A comprehensive guide can be found [here](https://en.wikibooks.org/wiki/LaTeX).
- Another cool guide [here](https://www.latex-tutorial.com/tutorials/).
- [This one](http://www.docs.is.ed.ac.uk/skills/documents/3722/3722-2014.pdf) is perfect for beginners.
- Also [here](http://web.mit.edu/rsi/www/pdfs/new-latex.pdf) another guide.

It's also possible to write your LaTeX document online and share it with your collaborators using [Overleaf](https://www.overleaf.com/).

There are already lots of templates made. You can find some of them [here](https://www.latextemplates.com/).

To draw awesome graphs and charts, you can use the package [TikZ](https://en.wikipedia.org/wiki/PGF/TikZ).

A good site to keep in mind when facing a problem with LaTeX is [TeX Stack Exchange](https://tex.stackexchange.com/), the Stack Exchange site dedicated to TeX and LaTeX.

### Tools to increase productivity
The LaTeX syntax can seem daunting at first, with plenty of new commands for all the mathematical symbols you know and need to use.
  - [This website](https://www.codecogs.com/latex/eqneditor.php) lets you write a formula online, and it has plenty of symbols you can click on to generate the code you need. You can also preview your formula, which makes it easier to check that everything is written properly.
  - Whenever you need a symbol but you don't know the command, use [this site](http://detexify.kirelabs.org/classify.html). All you have to do is draw the symbol and then suggestions will appear on the right.
  - Creating tables in LaTeX can be particularly annoying. I usually do it [here](https://www.tablesgenerator.com/) and then ask the site to generate the appropriate code.
  - [MathJax](https://www.mathjax.org/) is one of the ways in which you can get LaTeX to render, say, in your blog! (example [here](http://mathspp.blogspot.com/2018/11/twitter-proof-roots-go-hand-in-hand.html), where the formulas are rendered with MathJax)
  - [Mathpix Snipping Tool](https://mathpix.com/) helps you convert images to LaTeX by just taking a screenshot of the desired math formula. It can also recognize arrays and various math fonts.

## Useful Packages

Now that you know how to produce a (simple) LaTeX document, you may feel the need to write or draw particular content. To do so, you can use specific packages. All you need to do is to include them at the beginning of your document with the command `\usepackage{name_of_the_package}`.
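
As a minimal sketch of where `\usepackage` goes, the document below loads two packages in the preamble. The choice of `amsmath` and `graphicx` is only illustrative; any package is loaded the same way.

```latex
\documentclass{article}

% Packages are loaded in the preamble, before \begin{document}.
\usepackage{amsmath}   % extra math environments
\usepackage{graphicx}  % \includegraphics for figures

\begin{document}
Hello, \LaTeX! Euler's identity: $e^{i\pi} + 1 = 0$.
\end{document}
```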

### Displaying Math

The [`amsmath` package](https://ctan.org/pkg/amsmath) provides miscellaneous enhancements for improving the information structure and printed output of documents that contain mathematical formulas, as stated in [this useful guide](http://texdoc.net/texmf-dist/doc/latex/amsmath/amsldoc.pdf). 

Extra mathematical fonts and symbols can be used by including the [`amssymb` package](https://ctan.org/pkg/amsfonts). A recap can be found [here](http://milde.users.sourceforge.net/LUCR/Math/mathpackages/amssymb-symbols.pdf).
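
The following sketch shows one typical use of each package: the `align` environment and `\text` from `amsmath`, and `\mathbb` made available through `amssymb`. The formula itself is only an illustration.

```latex
\documentclass{article}
\usepackage{amsmath}  % align, \text, ...
\usepackage{amssymb}  % \mathbb and extra symbols

\begin{document}
For every $x \in \mathbb{R}$:
\begin{align}
  (x + 1)^2 &= x^2 + 2x + 1 \\
            &\geq 2x + 1 \quad \text{since } x^2 \geq 0.
\end{align}
\end{document}
```

The `&` marks the alignment point on each line, and `\\` ends a line.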

### Code Blocks

The [`listings` package](https://ctan.org/pkg/listings) lets you insert programming code in your LaTeX document. You can highlight code, or specify your language of choice and let the package automatically colour special words, comments, etc. for you. [Here](https://www.overleaf.com/learn/latex/Code_listing) is a guide with examples.
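
A small sketch of a `listings` setup might look like this. The style options and the Python snippet are arbitrary choices for illustration, and `xcolor` is loaded only to provide the colours.

```latex
\documentclass{article}
\usepackage{listings}
\usepackage{xcolor}  % colours used by the listing style

\lstset{
  language=Python,
  basicstyle=\ttfamily\small,
  keywordstyle=\color{blue},
  commentstyle=\color{gray},
  numbers=left
}

\begin{document}
\begin{lstlisting}
# Sum the integers from 0 to n.
def total(n):
    return sum(range(n + 1))
\end{lstlisting}
\end{document}
```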

Based on the previous package, [`pythonhighlight`](https://ctan.org/pkg/pythonhighlight) is a simple Python highlighting style to be used with LaTeX. You can find the very simple instructions [here](https://github.com/olivierverdier/python-latex-highlighting).

To write pseudocode, you can use [algorithms](https://ctan.org/pkg/algorithms), which consists of two packages: `algorithm` and `algorithmic`. [Here](https://math-linux.com/latex-26/faq/latex-faq/article/how-to-write-algorithm-and-pseudocode-in-latex-usepackage-algorithm-usepackage-algorithmic) you can find examples and useful commands.
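
As an illustration, the sketch below typesets a small algorithm. `algorithm` provides the floating, captioned box and `algorithmic` provides the pseudocode commands; the algorithm itself is just an example.

```latex
\documentclass{article}
\usepackage{algorithm}
\usepackage{algorithmic}

\begin{document}
\begin{algorithm}
\caption{Maximum of a non-empty list}
\begin{algorithmic}[1]  % [1] numbers every line
\REQUIRE a non-empty list $A$ of numbers
\ENSURE the largest element of $A$
\STATE $m \leftarrow A[1]$
\FOR{$i = 2$ \TO $|A|$}
  \IF{$A[i] > m$}
    \STATE $m \leftarrow A[i]$
  \ENDIF
\ENDFOR
\RETURN $m$
\end{algorithmic}
\end{algorithm}
\end{document}
```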

### Logic

For natural deductions there's the [`bussproofs` package](https://ctan.org/pkg/bussproofs). You can find the user guide with examples [here](https://www.math.ucsd.edu/~sbuss/ResearchWeb/bussproofs/BussGuide2_Smith2012.pdf).
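
For a first taste, here is a minimal sketch of a proof tree for modus ponens. Premises are listed with `\AxiomC`, and `\BinaryInfC` combines the last two of them into a conclusion.

```latex
\documentclass{article}
\usepackage{bussproofs}

\begin{document}
% Modus ponens: from A and A -> B, infer B.
\begin{prooftree}
  \AxiomC{$A$}
  \AxiomC{$A \rightarrow B$}
  \BinaryInfC{$B$}
\end{prooftree}
\end{document}
```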

If you find it tedious to write truth tables manually, [here](http://www.siafoo.net/snippet/249) you'll find an incredibly useful Python script. It automatically generates the LaTeX code of a complete truth table given one or more propositional logic formulas. (Note: remember that in Python you can write *p* &rarr; *q* as `not p or q`, as they are logically equivalent.)
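
To give an idea of how such a generator can work, here is a minimal sketch. It is not the linked script: the function name `truth_table_latex` and its interface are hypothetical and chosen only for this illustration. It enumerates every assignment of truth values and emits a `tabular` environment.

```python
from itertools import product


def truth_table_latex(variables, formulas):
    """Return LaTeX code for a truth table (hypothetical helper).

    variables: list of variable names, e.g. ["p", "q"]
    formulas: dict mapping a LaTeX column header to a function of the variables
    """
    headers = [f"${v}$" for v in variables] + [f"${h}$" for h in formulas]
    lines = [
        "\\begin{tabular}{" + "c" * len(headers) + "}",
        " & ".join(headers) + " \\\\ \\hline",
    ]
    # One table row per assignment of truth values to the variables.
    for values in product([True, False], repeat=len(variables)):
        results = [f(*values) for f in formulas.values()]
        cells = ["T" if x else "F" for x in (*values, *results)]
        lines.append(" & ".join(cells) + " \\\\")
    lines.append("\\end{tabular}")
    return "\n".join(lines)


# The implication p -> q is written as "(not p) or q".
table = truth_table_latex(
    ["p", "q"],
    {"p \\rightarrow q": lambda p, q: (not p) or q},
)
print(table)
```

The printed text can be pasted directly into a LaTeX document.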

### Automata

To draw finite state machines with LaTeX you can use the `automata` library of TikZ; [here](https://www3.nd.edu/~kogge/courses/cse30151-fa17/Public/other/tikz_tutorial.pdf) is a quick tutorial. You can also generate the code automatically using [this website](https://notendur.hi.is/aee11/automataLatexGen/).
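
As a small sketch, the picture below draws a two-state automaton over the alphabet {0, 1} that accepts strings ending in 1. The automaton is only an example; the `state`, `initial` and `accepting` styles come from TikZ's `automata` library, and `positioning` is loaded for the `right=of` syntax.

```latex
\documentclass{article}
\usepackage{tikz}
\usetikzlibrary{automata, positioning}

\begin{document}
% DFA over {0, 1} accepting strings that end in 1.
\begin{tikzpicture}[shorten >=1pt, node distance=2.5cm, on grid, auto]
  \node[state, initial]   (q0)               {$q_0$};
  \node[state, accepting] (q1) [right=of q0] {$q_1$};

  \path[->]
    (q0) edge [loop above] node {0} ()
         edge [bend left]  node {1} (q1)
    (q1) edge [loop above] node {1} ()
         edge [bend left]  node {0} (q0);
\end{tikzpicture}
\end{document}
```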

### This is why you need to learn Machine Learning 
Oh damn, take a look at [this](https://mathpix.com/).

------------
Created by Damiano Azzolini. Contacts: [mail](mailto:damiazz94@gmail.com) [github](https://github.com/damianoazzolini)

Expanded upon by the editor of the [Mathspp Blog](https://mathspp.blogspot.com), [RojerGS](https://github.com/RojerGS), and by Lara Vignotto ([mail](mailto:lara.vignotto@gmail.com), [github](https://github.com/laravignotto))


================================================
FILE: Tools/MLDemos/README.md
================================================
# MLDemos

[MLDemos](http://mldemos.b4silio.com/) is an open-source visualization tool for machine learning algorithms, created to help you study and understand how several algorithms work and how their parameters affect the results in problems of classification, regression, clustering, dimensionality reduction, dynamical systems and reward maximization.

MLDemos is open-source and free for personal and academic use.

![organizations](organizations.png)

## Install

### Binary Packages

#### Legalities  

The packages contain binary versions of a number of opensource libraries. I am including them here with the knowledge that this might not be entirely compatible with the distribution policies of each respective library. I will try to contact and get the necessary permissions, to the extent to which this is possible, from the related parties. In the meantime, I distribute this software in good faith, my goal is for people to be able to study and work with the different methods implemented here. See the acknowledgements section below for a list of the people who contributed.
You are free to use this software for personal and educational purposes, you are not allowed to use it for commercial purposes. You can redistribute the software as long as you provide a link to this page. Then again, this page will always link to the latest version of the software so you may be better off taking the version here anyway.

### Source Code

The MLDemos source code can be obtained directly via git or from the public repository (get the *devel* branch for the latest release)

```sh
git clone git://gitorious.org/mldemos/mldemos.git -b devel
```

[public GitHub repository](https://github.com/b4silio/MLDemos)
[source_backup](http://mldemos.b4silio.com/MLDemos-0.3.0-source.zip) (0.3.0)

#### **Requirements**

The code requires Qt (5.10) and (in part) OpenCV (3.1) and Boost (1.47). Previous versions of these libraries might work as well but you might as well use the newer version. Be sure to adjust your include and lib paths to point them to the correct directories.

The software was compiled and tested on Mac OSX High Sierra, Windows 10, Gentoo, Ubuntu and Kubuntu 10.04, using QtCreator 2.1 and 2.6.

* Windows
In order to compile MLDemos on Windows, you will need MinGW (commonly installed with the QtSDK for MinGW).

* Debian
Prof. Barak A. Pearlmutter has created a Debian package, which will be available soon. In the meantime you can build it following the instructions below:

```sh
git clone git://github.com/barak/mldemos.git
cd mldemos
git checkout debian
dpkg-checkbuilddeps
fakeroot debian/rules binary
sudo dpkg --install ../mldemos_*.deb
```

> Note: OpenCV 2.4 is not available directly (only 2.1 is), so you will need to build OpenCV 2.4 yourself. This is only necessary to use MLP and Boosting, but these are two important algorithms, so it is worth the effort:

```sh
git clone git://github.com/barak/opencv.git
cd opencv
git checkout master
dpkg-checkbuilddeps
fakeroot debian/rules binary
sudo dpkg --install ../*opencv*.deb
```

Again, a huge thanks to Barak!

### **Known Bugs**

* WINDOWS: Clearing the canvas while in the 3D display leaves part of the memory occupied, which can accumulate when this is done several times (part of a memory bug on Windows only)
* LINUX (CDE package) loading and saving of external files does not work
* Approximate KNN classification creates weird blank spaces on some machines and with some metrics.
* Saving does not work on the linux CDE package
* Resizing the canvas when a reward map has been drawn does not update the underlying data (avoid doing it).
* In Boosting, changing the data does not recompute the learners, which can lead to bad results if the data has changed boundaries significantly

### **What's New** [Changelog](http://mldemos.b4silio.com/changelog.txt)

v0.5.0

### *New Visualization and Dataset Features*

* *Added 3D visualization of samples and classification, regression and maximization results*
* *Added Visualization panel with individual plots, correlations, density, etc.*
* *Added Editing tools to drag/magnet data, change class, increase or decrease dimensions of the dataset*
* *Added categorical dimensions (indexed dimensions with non-numerical values)*
* *Added Dataset Editing panel to swap, delete and rename dimensions, classes or categorical values*
* *Several bug-fixes for display, import/export of data, classification performance*

### *New Algorithms and methodologies*

* *Added Grid-Search panel for batch-testing ranges of values for up to two parameters at a time*

* *Added One-vs-All multi-class classification for non-multi-class algorithms*

* *Trained models can now be kept and tested on new data (training on one dataset, testing on another)*

* *Added Automatic Relevance Determination for SVM with RBF kernel (Thanks to Ashwini Shukla!)*

* *Added Growing Hierarchical Self Organizing Maps (original code by Michael Dittenbach)*

* *Added Random Forest classification*

* *Added LDA as a classifier (in addition to projector)*

* *Added Save/Load Model option for GMMs and SVMs*

## Screenshots

![MLDemos](MLDemos.png)

## Algorithms

### Implemented Methods

#### **Classification**

* Support Vector Machine (SVM) (C, nu, Pegasos)
* Relevance Vector Machine (RVM)
* Gaussian Mixture Models (GMM)
* Multi-Layer Perceptron + BackPropagation
* Gentle AdaBoost + Naive Bayes
* Approximate K-Nearest Neighbors (KNN)
* Gaussian Process Classification (GP)
* Random Forests

#### **Regression**

* Support Vector Regression (SVR)
* Relevance Vector Regression (RVR)
* Gaussian Mixture Regression (GMR)
* MLP + BackProp
* Approximate KNN
* Gaussian Process Regression (GPR)
* Sparse Optimized Gaussian Processes (SOGP)
* Locally Weighted Scatterplot Smoothing (LOWESS)
* Locally Weighted Projection Regression (LWPR)

#### **Dynamical Systems**

* GMM+GMR
* LWPR
* SVR
* SEDS
* SOGP (Slow!)
* MLP
* KNN
* Augmented-SVM (ASVM)

#### **Clustering**

* K-Means
* Soft K-Means
* Kernel K-Means
* K-Means++
* GMM
* One Class SVM
* FLAME
* DBSCAN

#### **Projections**

* Principal Component Analysis (PCA)
* Kernel PCA
* Independent Component Analysis (ICA)
* Canonical Correlation Analysis (CCA)
* Linear Discriminant Analysis (LDA)
* Fisher Linear Discriminant
* EigenFaces to 2D (using PCA)

#### **Reward Maximization** *(Reinforcement Learning)*

* Random Search
* Random Walk
* PoWER
* Genetic Algorithms (GA)
* Particle Swarm Optimization
* Particle Filters
* Donut
* Gradient-Free Methods (nlopt)

### Contributing

If you are developing a new algorithm that could fit into the MLDemos framework and would like to see it integrated into the software, please get in contact (see info below) and describe what type of help you require for the implementation of a MLDemos plugin.

### Acknowledgements

This program would not exist if a number of people had not put a lot of effort into implementing the different algorithms that are combined here into a single program.

* Florent D'Hallouin (GMM + GMR) - [LASA](http://lasa.epfl.ch/)
* Dan Grollman (SOGP) - [LASA](http://lasa.epfl.ch/)
* Mohammad Khansari (SEDS + DSAvoid) - [LASA](http://lasa.epfl.ch/)
* Ashwini Shukla (ASVM, ARD Kernels) - [LASA](http://lasa.epfl.ch/)
* Stephane Magnenat (ESMLR) - [website](http://stephane.magnenat.net/)
* Chih-Chung Chang and Chih-Jen Lin (libSVM) - [website](http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
* David Mount and Sunil Arya (ANN library) - [website](http://www.cs.umd.edu/~mount/ANN/)
* Davis E. King (DLIB) - [website](http://dlib.net/)
* Stefan Klanke and Sethu Vijayakumar (LWPR) - [website](http://www.ipab.inf.ed.ac.uk/slmc/software/lwpr/)
* Robert Davies (Newmat) - [website](http://www.robertnz.net/nm_intro.htm)
* JF Cardoso (ICA) - [website](http://www.tsi.enst.fr/icacentral/algos.html)
* Steven G. Johnson (NLOpt) - [website](http://ab-initio.mit.edu/wiki/index.php/NLopt)
* The WillowGarage crowd (OpenCV) - [website](http://opencv.org/)
* Trolltech/Nokia/Digia (Qt) - [website](http://qt.digia.com/)
* The authors of several of the icons - [website](http://www.iconeasy.com/)
* The PhD students following the 2012 ML class at EPFL (Julien Eberle, Pierre-Antoine Sondag, Guillaume deChambrier, Klas Kronander, Renaud Richardet, Raphael Ullman)

Moreover, the program itself would be far less performant without the work of the support and development team at LASA: Christophe Paccolat, Nicolas Sommer and Otpal Vittoz.

Thanks also to the people who have not contributed code but have contributed no less directly: Aude Billard, for being one of the best bosses one could wish for, François Fleuret, for a bunch of fruitful discussions, and the AML 2010 and 2011 classes for patiently giving it a first test-drive.

## Quick start

### Very quick start

1. Launch the software
1. Draw samples by clicking either the left or right mouse button.
    1. left-click generates samples of class 0
    1. right-click generates samples of the class selected in the toolbar (default: 1)
1. Select the Display Options icon
    1. this will allow you to display model information, confidence/likelihood maps and to hide the original samples
    1. the mouse wheel will allow you to zoom in and out
    1. alt+dragging will allow you to pan around the space
1. Select the Algorithms Options icon
1. Select one of the algorithm icons to open their respective option panels
1. Click the Classify button to run the algorithm on the current data

### Importing data

Generating data in MLDemos is done in three different ways: by manually drawing samples, by projecting image data through PCA (via the PCAFaces plugin), or by loading external data.

Comma-separated values, or other text-based value tables, can be dragged and dropped into the interface. In this case a Data Loading dialog will appear, letting you choose which columns or rows should be loaded, interpreted as class labels or headers, etc.

Alternatively, a native data format used by the software is ascii-based and contains:

1. The # of samples followed by # of dimensions
1. For each sample, one line containing
    1. The sample values space-separated (float, one per each dimension)
    1. The sample class index (integer 0 ... 255)
    1. A flag value (integer 0-3) to terminate the line (unused for the time being)

A simple example would be

```code
4 3
0.10 0.11 0.12 0 0
0.14 0.91 0.11 0 0
0.43 0.74 0.41 1 0
0.28 0.34 0.33 1 0
```

which presents 4 three-dimensional samples, two from class 0 and two from class 1.

When the file is saved from MLDemos, the software adds the current algorithm parameters (provided an algorithm was selected), which can be useful for demonstration purposes. If no such information is present, the default algorithm parameters are selected.

Manually drawing some samples, or importing a standard dataset and saving it from within MLDemos, should give you ample examples of the file syntax.
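
To make the native format concrete, here is a minimal Python sketch of a reader for it. It is not part of MLDemos: the function name `parse_mldemos` is hypothetical, and the sketch assumes that anything after the declared number of sample lines (such as saved algorithm parameters) can simply be ignored.

```python
def parse_mldemos(text):
    """Parse the ASCII sample format described above (hypothetical helper).

    Returns (samples, labels, flags). Lines after the declared number of
    samples, e.g. saved algorithm parameters, are ignored.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header line: number of samples, then number of dimensions.
    n_samples, n_dims = (int(tok) for tok in lines[0].split())
    samples, labels, flags = [], [], []
    for line in lines[1:1 + n_samples]:
        tokens = line.split()
        if len(tokens) != n_dims + 2:
            raise ValueError(f"expected {n_dims + 2} fields, got: {line!r}")
        samples.append([float(tok) for tok in tokens[:n_dims]])
        labels.append(int(tokens[n_dims]))     # class index
        flags.append(int(tokens[n_dims + 1]))  # trailing flag value
    if len(samples) != n_samples:
        raise ValueError("file has fewer samples than its header declares")
    return samples, labels, flags


example = """4 3
0.10 0.11 0.12 0 0
0.14 0.91 0.11 0 0
0.43 0.74 0.41 1 0
0.28 0.34 0.33 1 0
"""
samples, labels, flags = parse_mldemos(example)
print(len(samples), "samples of dimension", len(samples[0]))
print("class counts:", {c: labels.count(c) for c in sorted(set(labels))})
```

Checking the field count on every line catches files whose header does not match their contents.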

------------
Created by Dr. Basilio Noris at the [Learning Algorithms and Systems Laboratory](http://lasa.epfl.ch/)

Expanded upon by the editor of the [iOSDevLog Blog](http://iosdevlog.com/), [jiaxianhua](https://github.com/iOSDevLog)

================================================
FILE: Tools/Regex.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regex introduction\n",
    "\n",
    "## What is a regex?\n",
    "[**Regex**](https://en.wikipedia.org/wiki/Regular_expression) stands for _regular expression_, and regular expressions are a way of writing patterns that match strings. Usually these patterns can be used to search strings for specific things, or to search and then replace certain things, etc. Regular expressions are great for string manipulation!\n",
    "\n",
    "## Why do regular expressions matter?\n",
    "As you might have guessed from the first paragraph of this guide, regular expressions can be very useful **whenever you have to deal with strings**, from the basic renaming of a set of similarly named variables in your source code to [data preprocessing](https://github.com/clone95/Virgilio/blob/master/Specializations/HardSkills/DataPreprocessing.md). Regular expressions usually offer a concise way of expressing whatever you want to find. For example, if you wanted to parse a form and look for the year that someone might have been born in, you could use something like `(19|20)[0-9][0-9]`. This is an example of a regular expression!\n",
    "\n",
    "## Prerequisites\n",
    "This guide does not assume any prior knowledge. Examples will be coded in Python, but mastery of the programming language is neither assumed nor needed. You are welcome to read the guide in your browser, or to download it, run the examples and toy around with them.\n",
    "\n",
    "# Index\n",
    " - [Basic regex](#Basic-regex)\n",
    "   - [Using Python re](#Using-Python-re)\n",
    "   - [$\\pi$ lookup](#$\\pi$-lookup)\n",
    " - [Matching options](#Matching-options)\n",
    "   - [Virgilio or Virgil?](#Virgilio-or-Virgil?)\n",
    " - [Matching repetitions](#Matching-repetitions)\n",
    "   - [Greed](#Greed)\n",
    "   - [Removing excessive spaces](#Removing-excessive-spaces)\n",
    " - [Character classes](#Character-classes)\n",
    "   - [Phone numbers v1](#Phone-numbers-v1)\n",
    " - [More `re` functions](#More-re-functions)\n",
    "   - [`search` with `match`](#search-with-match)\n",
    "   - [Count matches with `findall`](#Count-matches-with-findall)\n",
    " - [Special characters](#Special-characters)\n",
    "   - [Phone numbers v2](#Phone-numbers-v2)\n",
    " - [Groups](#Groups)\n",
    "   - [Phone numbers v3](#Phone-numbers-v3)\n",
    " - [Toy project about regex](#Toy-project-about-regex)\n",
    " - [Further reading](#Further-reading)\n",
    " - [Suggested solutions](#Suggested-solutions)\n",
    " \n",
    "Let's dive right in!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Just a quick word:** I tried to include some small exercises whenever I show you something new, so that you can try and test your knowledge. Examples of solutions are provided at the [end of the notebook](#Suggested-solutions)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic regex\n",
    "\n",
    "A regex is just a string written in a certain format, that can then be used by specific tools/libraries/programs to perform pattern matching on strings. Throughout this guide we will use `this formatting` to refer to regular expressions!\n",
    "\n",
    "The simplest regular expressions that one can create are just composed of regular characters. If you wanted to find all the occurrences of the word _\"Virgilio\"_ in a text, you could write the regex `Virgilio`. In this regular expression, no character is doing anything special or different. In fact, this regular expression is just a normal word. That is ok, regular expressions are strings, after all!\n",
    "\n",
    "If you were given the text _\"Project Virgilio is great\"_, you could use your `Virgilio` regex to find the occurrence of the word _\"Virgilio\"_. However, if the text was _\"Project virgilio is great\"_, then your regex wouldn't work, because regular expressions are **case-sensitive** by default and thus should match everything exactly. We say that `Virgilio` matches the sequence of characters \"Virgilio\" literally."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Python re\n",
    "\n",
    "To check if our regular expressions are working well and to give you the opportunity to directly experiment with them, we will be using Python's `re` module to work with regular expressions. To use the `re` module we first import it, then define a regular expression and then use the `search()` function over a string! Pretty simple:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'Virgilio' is in 'Project Virgilio is great'\n",
      "'Virgilio' is not in 'Project virgilio is great'\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "regex = \"Virgilio\"\n",
    "str1 = \"Project Virgilio is great\"\n",
    "str2 = \"Project virgilio is great\"\n",
    "\n",
    "if re.search(regex, str1):\n",
    "    print(\"'{}' is in '{}'\".format(regex, str1))\n",
    "else:\n",
    "    print(\"'{}' is not in '{}'\".format(regex, str1))\n",
    "    \n",
    "if re.search(regex, str2):\n",
    "    print(\"'{}' is in '{}'\".format(regex, str2))\n",
    "else:\n",
    "    print(\"'{}' is not in '{}'\".format(regex, str2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `re.search(regex, string)` function takes a regex as first argument and then searches for any matches over the string that was given as the second argument. However, the return value of the function is **not** a boolean, but a *match object*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(8, 16), match='Virgilio'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search(regex, str1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Match objects have relevant information about the match(es) encountered: the start and end positions, the string that was matched, and even some other things for more complex regular expressions.\n",
    "\n",
    "We can see that in this case the match is exactly the same as the regular expression, so it may look like the `match` information inside the match object is irrelevant... but it becomes relevant as soon as we introduce options or repetitions into our regex.\n",
    "\n",
    "If no matches are found, then the `.search()` function returns `None`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "print(re.search(regex, str2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whenever the match is not `None`, we can save the returned match object and use it to extract all the needed information!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The match started at pos 8 and ended at pos 16\n",
      "Or with tuple notation, the match is at (8, 16)\n",
      "And btw, the actual string matched was 'Virgilio'\n"
     ]
    }
   ],
   "source": [
    "m = re.search(regex, str1)\n",
    "if m is not None:\n",
    "    print(\"The match started at pos {} and ended at pos {}\".format(m.start(), m.end()))\n",
    "    print(\"Or with tuple notation, the match is at {}\".format(m.span()))\n",
    "    print(\"And btw, the actual string matched was '{}'\".format(m.group()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you should try to get some more matches and some fails with your own literal regular expressions. I provide three examples of my own:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The match is at (20, 25)\n",
      "\n",
      "Woops, did I just get the alphabet wrong..?\n",
      "\n",
      "I just matched 'a' inside 'aaaaa aaaaaa a aaa'\n"
     ]
    }
   ],
   "source": [
    "m1 = re.search(\"regex\", \"This guide is about regexes\")\n",
    "if m1 is not None:\n",
    "    print(\"The match is at {}\\n\".format(m1.span()))\n",
    "\n",
    "m2 = re.search(\"abc\", \"The alphabet goes 'abdefghij...'\")\n",
    "if m2 is None:\n",
    "    print(\"Woops, did I just get the alphabet wrong..?\\n\")\n",
    "    \n",
    "s = \"aaaaa aaaaaa a aaa\"\n",
    "m3 = re.search(\"a\", s)\n",
    "if m3 is not None:\n",
    "    print(\"I just matched '{}' inside '{}'\".format(m3.group(), s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### $\\pi$ lookup\n",
    "\n",
    "$$\\pi = 3.1415\\cdots$$\n",
    "\n",
    "right? Well, what comes after the dots? An infinite sequence of digits, right? Could it be that your date of birth appears in the first million digits of $\\pi$? Well, we could use a regex to find that out! Change the `regex` variable below to look for your date of birth or for any number you want, in the first million digits of $\\pi$!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "pifile = \"regex-bin/pi.txt\"\n",
    "regex = \"\"  # define your regex to look your favourite number up\n",
    "\n",
    "with open(pifile, \"r\") as f:\n",
    "    pistr = f.read()  # pistr is a string that contains 1M digits of pi\n",
    "    \n",
    "## search for your number here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To search for numbers in the first 100 million digits of $\\pi$ (or 200 million, I didn't really get it) you can check [this](https://www.angio.net/pi/piquery) website."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matching options\n",
    "\n",
    "We just saw a very simple regular expression that was trying to find the word _\"Virgilio\"_ in text, but we also saw that we had zero flexibility and we couldn't even handle the fact that someone may have forgotten to capitalize the name properly, spelling it like _\"virgilio\"_ instead.\n",
    "\n",
    "To prevent problems like this, regular expressions can be written in a way to handle different possibilities. For our case, we want the first letter to be either _\"V\"_ or _\"v\"_, and that should be followed by _\"irgilio\"_.\n",
    "\n",
    "In order to handle different possibilities, we use the character `|`. For instance, `V|v` matches the letter vee, regardless of its capitalization:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "small v found\n",
      "big V found\n"
     ]
    }
   ],
   "source": [
    "v = \"v\"\n",
    "V = \"V\"\n",
    "regex = \"v|V\"\n",
    "if re.search(regex, v):\n",
    "    print(\"small v found\")\n",
    "if re.search(regex, V):\n",
    "    print(\"big V found\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can concatenate the regex for the first letter and the `irgilio` regex (for the rest of the name) to get a regex that matches the name of Virgilio, regardless of the capitalization of its first letter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "virgilio found!\n",
      "Virgilio found!\n"
     ]
    }
   ],
   "source": [
    "virgilio = \"virgilio\"\n",
    "Virgilio = \"Virgilio\"\n",
    "regex = \"(V|v)irgilio\"\n",
    "if re.search(regex, virgilio):\n",
    "    print(\"virgilio found!\")\n",
    "if re.search(regex, Virgilio):\n",
    "    print(\"Virgilio found!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that we write the regex with parentheses: `(V|v)irgilio`\n",
    "\n",
    "If we only wrote `V|virgilio`, then the regular expression would match either \"V\" or \"virgilio\", instead of \"Virgilio\" or \"virgilio\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(29, 30), match='V'>\n"
     ]
    }
   ],
   "source": [
    "regex = \"V|virgilio\"\n",
    "print(re.search(regex, \"This sentence only has a big V\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we really need to parenthesize the `(V|v)` there. If we do, it will work as expected!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(27, 35), match='virgilio'>\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = \"(V|v)irgilio\"\n",
    "print(re.search(regex, \"The name of the project is virgilio, but with a big V!\"))\n",
    "print(re.search(regex, \"This sentence only has a big V\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Maybe you didn't even notice, but there is something else going on! Notice that we used the characters `|`, `(` and `)`, which are not present in the word _\"virgilio\"_, and nonetheless our regex `(V|v)irgilio` matched it. That is because these three characters have special meanings in the regex world, and hence are **not** interpreted literally, unlike the letters in `irgilio`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Virgilio or Virgil?\n",
    "\n",
    "Here are a couple of paragraphs from Wikipedia's [article on Virgil](https://en.wikipedia.org/wiki/Virgil):\n",
    "\n",
    " > Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called Virgil or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]\n",
    "\n",
    " > Virgil is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. Virgil's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which Virgil appears as Dante's guide through Hell and Purgatory.\n",
    " \n",
    "\"Virgilio\" is the Italian form of \"Virgil\", and in the code cell below I edited these paragraphs to use the Italian version instead of the English one. I want you to revert this!\n",
    "\n",
    "You might want to take a look at [`while` loops in Python](https://realpython.com/python-while-loop/), [string indexing](https://www.digitalocean.com/community/tutorials/how-to-index-and-slice-strings-in-python-3) and [string concatenation](https://realpython.com/python-string-split-concatenate-join/). The point is that you find a match, you break the string into the part _before_ the match and the part _after_ the match, and you glue those two together with _Virgil_ in between.\n",
    "\n",
    "Notice that [string replacement](https://www.tutorialspoint.com/python/string_replace.htm) would probably be faster and easier, but that would defeat the purpose of this exercise. After fixing everything, print the final results to be sure that you fixed every occurrence of the name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "paragraphs = \\\n",
    "\"\"\"Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called virgilio or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]\n",
    "\n",
    "Virgilio is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. virgilio's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which virgilio appears as Dante's guide through Hell and Purgatory.\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matching repetitions\n",
    "\n",
    "Sometimes we want to find patterns that have bits that will be repeated. For example, people make an _\"awww\"_ or _\"owww\"_ sound when they see something cute, like a baby. But the number of _\"w\"_ I used there was completely arbitrary! If the baby is really really cute, someone might write _\"awwwwwwwwwww\"_. So how can I write a regex that matches _\"aww\"_ and _\"oww\"_, but with an arbitrary number of characters _\"w\"_?\n",
    "\n",
    "I will illustrate several ways of capturing repetitions, by testing regular expressions against the following strings:\n",
    "\n",
    " - \"awww\" (3 letters \"w\")\n",
    " - \"awwww\" (4 letters \"w\")\n",
    " - \"awwwwwww\" (7 letters \"w\")\n",
    " - \"awwwwwwwwwwwwwwww\" (16 letters \"w\")\n",
    " - \"aw\" (1 letter \"w\")\n",
    " - \"a\" (0 letters \"w\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "cute_strings = [\n",
    "    \"awww\",\n",
    "    \"awwww\",\n",
    "    \"awwwwwww\",\n",
    "    \"awwwwwwwwwwwwwwww\",\n",
    "    \"aw\",\n",
    "    \"a\"\n",
    "]\n",
    "\n",
    "def match_cute_strings(regex):\n",
    "    \"\"\"Takes a regex, prints matches and non-matches\"\"\"\n",
    "    for s in cute_strings:\n",
    "        m = re.search(regex, s)\n",
    "        if m:\n",
    "            print(\"match: {}\".format(s))\n",
    "        else:\n",
    "            print(\"non match: {}\".format(s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### At least once\n",
    "\n",
    "If I want to match all strings that contain **at least** one \"w\", I can use the character `+`. A `+` means that we want to find **one or more repetitions** of whatever was to the left of it. For example, the regex `a+` will match any string that has at least one \"a\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "match: awww\n",
      "match: awwww\n",
      "match: awwwwwww\n",
      "match: awwwwwwwwwwwwwwww\n",
      "match: aw\n",
      "non match: a\n"
     ]
    }
   ],
   "source": [
    "regex = \"aw+\"\n",
    "match_cute_strings(regex)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Any number of times\n",
    "\n",
    "If I want to match all strings that contain an arbitrary number of letters \"w\", I can use the character `*`. The character `*` means **match any number of repetitions** of whatever comes on the left of it, _even 0 repetitions_! So the regex `a*` would match the empty string \"\", because the empty string \"\" has 0 repetitions of the letter \"a\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "match: awww\n",
      "match: awwww\n",
      "match: awwwwwww\n",
      "match: awwwwwwwwwwwwwwww\n",
      "match: aw\n",
      "match: a\n"
     ]
    }
   ],
   "source": [
    "regex = \"aw*\"\n",
    "match_cute_strings(regex)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### A specific number of times\n",
    "\n",
    "If I want to match a string that contains a certain particle a specific number of times, I can use the `{n}` notation, where `n` is replaced by the number of repetitions I want. For example, `a{3}` matches the string \"aaa\" but not the string \"aa\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "match: awww\n",
      "match: awwww\n",
      "match: awwwwwww\n",
      "match: awwwwwwwwwwwwwwww\n",
      "non match: aw\n",
      "non match: a\n"
     ]
    }
   ],
   "source": [
    "regex = \"aw{3}\"\n",
    "match_cute_strings(regex)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Wait a minute**, why did the pattern `aw{3}` match the longer expressions of cuteness, like \"awwww\" or \"awwwwwww\"? Because the regular expressions try to find _substrings_ that match the pattern. Our pattern is `awww` (if I write the `w{3}` explicitly) and the string **awww**w has that substring, just like the string **awww**wwww has it, or the longer version with 16 letters \"w\". If we wanted to exclude the strings \"awwww\", \"awwwwwww\" and \"awwwwwwwwwwwwwwww\" we would have to fix our regex. A better example that demonstrates how `{n}` works is by considering, instead of expressions of cuteness, expressions of amusement like \"wow\", \"woow\" and \"wooooooooooooow\". We define some expressions of amusement:\n",
    "\n",
    " - \"wow\"\n",
    " - \"woow\"\n",
    " - \"wooow\"\n",
    " - \"woooow\"\n",
    " - \"wooooooooow\"\n",
    " \n",
    "and now we test our `{3}` pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "wow_strings = [\n",
    "    \"wow\",\n",
    "    \"woow\",\n",
    "    \"wooow\",\n",
    "    \"woooow\",\n",
    "    \"wooooooooow\"\n",
    "]\n",
    "\n",
    "def match_wow_strings(regex):\n",
    "    \"\"\"Takes a regex, prints matches and non-matches\"\"\"\n",
    "    for s in wow_strings:\n",
    "        m = re.search(regex, s)\n",
    "        if m:\n",
    "            print(\"match: {}\".format(s))\n",
    "        else:\n",
    "            print(\"non match: {}\".format(s))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "non match: wow\n",
      "non match: woow\n",
      "match: wooow\n",
      "non match: woooow\n",
      "non match: wooooooooow\n"
     ]
    }
   ],
   "source": [
    "regex = \"wo{3}w\"\n",
    "match_wow_strings(regex)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Between $n$ and $m$ times\n",
    "\n",
    "Expressing amusement with only three \"o\" is ok, but people might also use two or four \"o\". How can we capture a variable number of letters, but within a range? Say I only want to capture versions of \"wow\" that have between 2 and 4 letters \"o\". I can do it with `{2,4}`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "non match: wow\n",
      "match: woow\n",
      "match: wooow\n",
      "match: woooow\n",
      "non match: wooooooooow\n"
     ]
    }
   ],
   "source": [
    "regex = \"wo{2,4}w\"\n",
    "match_wow_strings(regex)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Up to $n$ times or at least $m$ times\n",
    "\n",
    "Now we are just playing with the type of repetitions we might want, but of course we might say that we want **no more** than $n$ repetitions, which you would do with `{,n}`, or that we want **at least** $m$ repetitions, which you would do with `{m,}`.\n",
    "\n",
    "In fact, take a look at these regular expressions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "match: wow\n",
      "match: woow\n",
      "match: wooow\n",
      "match: woooow\n",
      "non match: wooooooooow\n"
     ]
    }
   ],
   "source": [
    "regex = \"wo{,4}w\" # should not match strings with more than 4 o's\n",
    "match_wow_strings(regex)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "non match: wow\n",
      "non match: woow\n",
      "match: wooow\n",
      "match: woooow\n",
      "match: wooooooooow\n"
     ]
    }
   ],
   "source": [
    "regex = \"wo{3,}w\" # should not match strings with fewer than 3 o's\n",
    "match_wow_strings(regex)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### To be or not to be\n",
    "\n",
    "Last but not least, sometimes we care about something that might or might not be present. For example, above we dealt with the English and Italian versions of the name Virgilio. If we wanted to write a regular expression to capture both versions, we could write `((V|v)irgil)|((V|v)irgilio)`, or slightly more compactly, `(V|v)((irgil)|(irgilio))`. But this does not look good at all, right? All we need to say is that the final \"io\" might or might not be present. We do this with the `?` character. So the regex `(V|v)irgil(io)?` matches the upper and lower case versions of \"Virgil\" and \"Virgilio\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The name virgil was matched!\n",
      "The name Virgil was matched!\n",
      "The name virgilio was matched!\n",
      "The name Virgilio was matched!\n"
     ]
    }
   ],
   "source": [
    "regex = \"(V|v)irgil(io)?\"\n",
    "names = [\"virgil\", \"Virgil\", \"virgilio\", \"Virgilio\"]\n",
    "for name in names:\n",
    "    m = re.search(regex, name)\n",
    "    if m:\n",
    "        print(\"The name {} was matched!\".format(name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Greed\n",
    "\n",
    "The `+`, `?`, `*` and `{,}` operators are all greedy. What does this mean? It means that they will try to match as much as possible. This is their default behaviour, as opposed to stopping as soon as the regex is satisfied. To better illustrate what I mean by this, let us look again at the information contained in the `match` object we have been dealing with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 3), match='aaa'>\n"
     ]
    }
   ],
   "source": [
    "regex = \"a+\"\n",
    "s = \"aaa\"\n",
    "m = re.search(regex, s)\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the part of the printed information that says `match='aaa'`. The function `m.group()` tells me the actual string that was matched by the regular expression, and in this case it was \"aaa\". Why does it make sense to have access to this information? Well, the regex I wrote, `a+`, will match one or more letters \"a\" in a row. If I use the regex over a string and I get a match, how would I be able to know how many \"a\"s were matched, if I didn't have access to that type of information?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "aaa\n"
     ]
    }
   ],
   "source": [
    "print(m.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So let us verify that the operators I mentioned are, in fact, all greedy, meaning that they all match as many characters as they can.\n",
    "\n",
    "Below, we see that given a string of thirty times the letter \"a\",\n",
    "\n",
    "  - the pattern `a?` matches 1 \"a\", which is as much as it could\n",
    "  - the pattern `a+` matches 30 \"a\"s, which is as much as it could\n",
    "  - the pattern `a*` also matches 30\n",
    "  - the pattern `a{5,10}` matches 10 \"a\"s, which was the limit imposed by us"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a\n",
      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n",
      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n",
      "aaaaaaaaaa\n"
     ]
    }
   ],
   "source": [
    "s = \"a\"*30\n",
    "print(re.search(\"a?\", s).group())\n",
    "print(re.search(\"a+\", s).group())\n",
    "print(re.search(\"a*\", s).group())\n",
    "print(re.search(\"a{5,10}\", s).group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we don't want our operators to be greedy, we just put an extra `?` after them. So the following regular expressions are **not** greedy:\n",
    "\n",
    "  - the pattern `a??` will match **no** characters, much like `a*?`, because now their goal is to match as little as possible. But a match of length 0 is the shortest match possible!\n",
    "  - the pattern `a+?` will only match 1 \"a\"\n",
    "  - the pattern `a{5,10}?` will only match 5 \"a\"s\n",
    "  \n",
    "We can easily confirm what I just said by running the code below. Notice that now I print things differently, because otherwise we wouldn't be able to see the `a??` and `a*?` patterns matching nothing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "''\n",
      "'a'\n",
      "''\n",
      "'aaaaa'\n"
     ]
    }
   ],
   "source": [
    "s = \"a\"*30\n",
    "print(\"'{}'\".format(re.search(\"a??\", s).group()))\n",
    "print(\"'{}'\".format(re.search(\"a+?\", s).group()))\n",
    "print(\"'{}'\".format(re.search(\"a*?\", s).group()))\n",
    "print(\"'{}'\".format(re.search(\"a{5,10}?\", s).group()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Removing excessive spaces\n",
    "\n",
    "Now that we know about repetitions, I am going to tell you about the `sub` function, and we are going to use it to parse a piece of text and remove all the extra spaces in it. Calling `re.sub(regex, rep, string)` applies the given regex to the given string and replaces every match it finds with `rep`.\n",
    "\n",
    "For example, I can use that to replace all English/Italian occurrences of the name Virgilio with a standardized one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Virgilio has many names, like Virgilio, Virgilio, Virgilio, Virgilio, or even Virgilio.\n"
     ]
    }
   ],
   "source": [
    "s = \"Virgilio has many names, like virgil, virgilio, Virgil, Vergil, or even vergil.\"\n",
    "regex = \"(V|v)(e|i)rgil(io)?\"\n",
    "\n",
    "print(\n",
    "    re.sub(regex, \"Virgilio\", s)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "    Now   it  is your   turn.  I am     going  to give   you this    sentence as        input, and   your  job    is to      fix the     whitespace         in it. When you    are  done,    save the    result in a  string  named   `s`, and   check    if  `s.count(\"  \")` is   equal   to    0  or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weird_text = 'Now   it  is your   turn.  I am     going  to give   you this    sentence as        input, and   your  job    is to      fix the     whitespace         in it. When you    are  done,    save the    result in a  string  named   `s`, and   check    if  `s.count(\"  \")` is   equal   to    0  or not.'\n",
    "regex = \"\"  # put your regex here\n",
    "\n",
    "# substitute the extra whitespace here\n",
    "# save the result in 's'\n",
    "\n",
    "# this print should be 0\n",
    "print(s.count(\"  \"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Character classes\n",
    "\n",
    "So far we have been writing some simple regular expressions that match words, names, and things like that. Now we have a different plan. We will write a regular expression that matches US phone numbers, which we will assume are of the form xxx-xxx-xxxx. The first three digits are the area code, but we will not care about whether the area code actually makes sense or not. How do we match this, then?\n",
    "\n",
    "In fact, how can I match the first digit? It can be any number from 0 to 9, so should I write `(0|1|2|3|4|5|6|7|8|9)` to match the first digit, and then repeat? Actually, we could do that, yes, to get this regex:\n",
    "\n",
    "`(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){4}`\n",
    "\n",
    "Does this work?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 12), match='202-555-0181'>\n",
      "None\n",
      "None\n",
      "<re.Match object; span=(0, 12), match='512-555-0191'>\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = \"(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){4}\"\n",
    "numbers = [\n",
    "    \"202-555-0181\",\n",
    "    \"202555-0181\",\n",
    "    \"202 555 0181\",\n",
    "    \"512-555-0191\",\n",
    "    \"96-125-3546\",\n",
    "]\n",
    "for nr in numbers:\n",
    "    print(re.search(regex, nr))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like it works, but surely there must be a better way... and there is! Instead of writing out every digit like we did, we can actually write a range of values! In fact, the regex `[0-9]` matches any single digit from 0 to 9. So we can actually shorten our regex to `[0-9]{3}-[0-9]{3}-[0-9]{4}`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 12), match='202-555-0181'>\n",
      "None\n",
      "None\n",
      "<re.Match object; span=(0, 12), match='512-555-0191'>\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = \"[0-9]{3}-[0-9]{3}-[0-9]{4}\"\n",
    "numbers = [\n",
    "    \"202-555-0181\",\n",
    "    \"202555-0181\",\n",
    "    \"202 555 0181\",\n",
    "    \"512-555-0191\",\n",
    "    \"96-125-3546\",\n",
    "]\n",
    "for nr in numbers:\n",
    "    print(re.search(regex, nr))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The magic here is being done by the `[]`, which denotes a character class. The way `[]` works is that the regex will match any single one of the characters listed inside, and it just so happens that `0-9` is a shorter way of listing all the digits. Of course you could also write `[0123456789]{3}-[0123456789]{3}-[0123456789]{4}`, which is slightly shorter than our first attempt, but still pretty bad. Similar to `0-9`, we have `a-z` and `A-Z`, which go through all the lowercase and uppercase letters of the alphabet, respectively.\n",
    "\n",
    "You can also start and end in different places, for example `c-o` can be used to match words that only use letters between the \"c\" and the \"o\", like \"hello\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 5), match='hello'>\n",
      "<re.Match object; span=(1, 4), match='ice'>\n"
     ]
    }
   ],
   "source": [
    "regex = \"[c-o]+\"\n",
    "print(re.search(regex, \"hello\"))\n",
    "print(re.search(regex, \"rice\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With these character classes we can actually rewrite our Virgilio regex into something slightly shorter, going from `(V|v)(e|i)rgil(io)?` to `[Vv][ie]rgil(io)?`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Virgilio has many names, like Virgilio, Virgilio, Virgilio, Virgilio, or even Virgilio.\n"
     ]
    }
   ],
   "source": [
    "s = \"Virgilio has many names, like virgil, virgilio, Virgil, Vergil, or even vergil.\"\n",
    "regex = \"[Vv][ie]rgil(io)?\"\n",
    "\n",
    "print(\n",
    "    re.sub(regex, \"Virgilio\", s)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Going back to the `[c-o]+` example, notice that the regular expression matched only the **ice** in r**ice**, because the \"r\" was not inside the legal range of letters, but **ice** was.\n",
    "\n",
    "The _character class_ is the square brackets `[]` and whatever goes inside it. Also, note that the special characters we have been using lose their meaning inside a character class! So `[()?+*{}]` will actually look to match any of those characters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(25, 26), match='?'>\n"
     ]
    }
   ],
   "source": [
    "regex = \"[()?+*{}]\"\n",
    "print(re.search(regex, \"Did I just ask a question?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A final note on character classes: if they start with `^`, then we are actually saying \"match any character _except_ the ones listed inside\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n",
      "<re.Match object; span=(0, 1), match='r'>\n"
     ]
    }
   ],
   "source": [
    "regex = \"[^c-o]+\"\n",
    "print(re.search(regex, \"hello\"))\n",
    "print(re.search(regex, \"rice\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Phone numbers v1\n",
    "\n",
    "Now that you know how to use character classes to denote ranges, you need to write a regular expression that matches American phone numbers with the format xxx-xxx-xxxx. Not only that, but you must also cope with the fact that the numbers may or may not be preceded by the country indicator, which you can assume will look like \"+1\" or \"001\". The country indicator may be separated from the rest of the number with a space or with a dash."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regex = \"\"  # write your regex here\n",
    "matches = [  # you should be able to match those\n",
    "    \"202-555-0181\",\n",
    "    \"001 202-555-0181\",\n",
    "    \"+1-512-555-0191\"\n",
    "]\n",
    "non_matches = [  # for now, none of these should be matched\n",
    "    \"202555-0181\",\n",
    "    \"96-125-3546\",\n",
    "    \"(+1)5125550191\"\n",
    "]\n",
    "for s in matches:\n",
    "    print(re.search(regex, s))\n",
    "for s in non_matches:\n",
    "    print(re.search(regex, s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More `re` functions\n",
    "\n",
    "So far we only looked at the `.search()` function of the `re` module, but now I am going to tell you about a couple more functions that can be quite handy when you are dealing with pattern matching. By the time you are done with this small section, you will know the following functions: `match()`, `search()`, `findall()`, `sub()` and `split()`.\n",
    "\n",
    "If you are here mostly for the regular expressions, and you don't care much about using them with Python, you can just skim through this section... even though it is still a nice read."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### `search()` and `sub()`\n",
    "\n",
    "You already know these two functions. `re.search(regex, string)` will try to find your pattern given by `regex` in the given `string` and return the information of the match in a `match` object. The function `re.sub(regex, rep, string)` takes a regex and two strings; it looks for the pattern you specified in `string` and replaces the matches with the other string `rep` you gave it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### `match()`\n",
    "\n",
    "The function `re.match(regex, string)` is similar to the function `re.search()`, except that `.match()` will only check if your pattern applies to the **beginning** of the string. That is, if your string does not **start** with the pattern you provided, the function returns `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ".search() found abc in abcdef\n",
      ".search() found abc in the alphabet starts with abc\n",
      ".match() says that abcdef starts with abc\n"
     ]
    }
   ],
   "source": [
    "regex = \"abc\"\n",
    "string1 = \"abcdef\"\n",
    "string2 = \"the alphabet starts with abc\"\n",
    "# the .search() function finds the patterns, regardless of position\n",
    "if re.search(regex, string1):\n",
    "    print(\".search() found {} in {}\".format(regex, string1))\n",
    "if re.search(regex, string2):\n",
    "    print(\".search() found {} in {}\".format(regex, string2))\n",
    "    \n",
    "# the .match() function only checks if the string STARTS with the pattern\n",
    "if re.match(regex, string1):\n",
    "    print(\".match() says that {} starts with {}\".format(string1, regex))\n",
    "if re.match(regex, string2):  # this one should NOT print\n",
    "    print(\".match() says that {} starts with {}\".format(string2, regex))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### `findall()`\n",
    "\n",
    "The `re.findall(regex, string)` function is similar to the `.search()` function, except that it will return **all** the matches it can find, instead of just the first one. Instead of returning `match` objects, it returns a list with the strings that matched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 3), match='wow'>\n",
      "['wow', 'wow', 'wow']\n"
     ]
    }
   ],
   "source": [
    "regex = \"wow\"\n",
    "string = \"wow wow wow!\"\n",
    "\n",
    "print(re.search(regex, string))\n",
    "\n",
    "print(re.findall(regex, string))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 3), match='ab1'>\n",
      "['ab1', 'ab2', 'ab3']\n"
     ]
    }
   ],
   "source": [
    "regex = \"ab[0-9]\"\n",
    "string = \"ab1 ab2 ab3\"\n",
    "\n",
    "print(re.search(regex, string))\n",
    "\n",
    "print(re.findall(regex, string))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is important to note that the `findall()` function only returns _non-overlapping_ matches. That is, one could argue that `wow` appears twice in \"wowow\", in the beginning: **wow**ow, and in the end: wo**wow**. Nonetheless, `findall()` only returns one match because the second match overlaps with the first:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['wow']\n"
     ]
    }
   ],
   "source": [
    "regex = \"wow\"\n",
    "string = \"wowow\"\n",
    "print(re.findall(regex, string))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this information it now makes a bit more sense to consider the greediness of the operators we showed before, like `?` and `+`. Imagine we are dealing with the regex `a+` and we have a string \"aaaaaaaaa\". If we use the greedy version of `+`, then we get a single match which is the whole string. If we use the non-greedy version of the operator `+`, perhaps because we want as many matches as possible, we will get a bunch of \"a\" matches!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['aaaaaaaaa']\n",
      "['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']\n"
     ]
    }
   ],
   "source": [
    "regex_greedy = \"a+\"\n",
    "regex_nongreedy = \"a+?\"\n",
    "string = \"aaaaaaaaa\"\n",
    "\n",
    "print(re.findall(regex_greedy, string))\n",
    "\n",
    "print(re.findall(regex_nongreedy, string))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### `split()`\n",
    "\n",
    "The `re.split(regex, string)` splits the given string into bits wherever it is able to find the pattern you specified. Say we are interested in finding all the sequences of consecutive consonants in a sentence (I don't know why you would want that...). Then we can use the vowels and the space \" \" to break up the sentence:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Th', 's', 's', 'j', 'st', 'r', 'g', 'l', 'r', 's', 'nt', 'nc', '']\n"
     ]
    }
   ],
   "source": [
    "regex = \"[aeiou ]+\" # this will eliminate all vowels/spaces that appear consecutively\n",
    "string = \"This is just a regular sentence\"\n",
    "\n",
    "print(re.split(regex, string))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `search` with `match`\n",
    "\n",
    "Recall that the `match()` function only checks if your pattern is at the beginning of the string. What I want you to do is define your own `search` function that takes a regex and a string, and returns `True` if the pattern is inside the string, and `False` otherwise. Can you do it?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def my_search(regex, string):\n",
    "    pass  # write your code here\n",
    "\n",
    "regex = \"[0-9]{2,4}\"\n",
    "\n",
    "# your function should be able to match in all these strings\n",
    "string1 = \"1984 was already some years ago.\"\n",
    "string2 = \"There is also a book whose title is '1984', but the story isn't set in the year of 1984.\"\n",
    "string3 = \"Sometimes people write '84 for short.\"\n",
    "\n",
    "# your function should also match with this regex and this string\n",
    "regex = \"a*\"\n",
    "string = \"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count matches with `findall`\n",
    "\n",
    "Now I want you to define the `count_matches` function, which takes a regex and a string, and returns the number of non-overlapping matches that exist in the given string. Can you do it?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_matches(regex, string):\n",
    "    pass  # your code goes here\n",
    "\n",
    "regex = \"wow\"\n",
    "\n",
    "string1 = \"wow wow wow\" # this should be 3\n",
    "string2 = \"wowow\" # this should be 1\n",
    "string3 = \"wowowow\" # this should be 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Special characters\n",
    "\n",
    "It is time to ramp things up a bit! We have seen some characters that have special meanings, and now I am going to introduce a couple more of those! I will start by listing them, and then I'll explain them in more detail:\n",
    "\n",
    " - `.` is used to match **any** character, except for a newline\n",
    " - `^` is used to match at the beginning of the string\n",
    " - `$` is used to match at the end of the string\n",
    " - `\\d` is used to match any digit\n",
    " - `\\w` is used to match any alphanumeric character\n",
    " - `\\s` is used to match any type of whitespace\n",
    " - `\\` is used to remove the special meaning of the characters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dot `.`\n",
    "\n",
    "The `.` can be used in a regular expression to capture any character that might have been used there, as long as we are still in the same line. That is, the only place where `.` doesn't work is if we changed lines in the text. Imagine the pattern was `d.ck`. Then the pattern would match\n",
    "\n",
    "```\n",
    "\"duck\"\n",
    "```\n",
    "\n",
    "but it would not match\n",
    "\n",
    "```\n",
    "\"d\n",
    "ck\"\n",
    "```\n",
    "\n",
    "because we changed lines in the middle of the string."
   ]
  },
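  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small sketch of the behaviour described above (the test strings are only illustrative), the pattern `d.ck` should match when the dot stands for a letter or a space, but not when it would have to stand for a newline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "regex = \"d.ck\"\n",
    "\n",
    "print(re.search(regex, \"duck\"))    # the dot stands for the \"u\"\n",
    "print(re.search(regex, \"d ck\"))    # the dot can also stand for a space\n",
    "print(re.search(regex, \"d\\nck\"))   # a newline is not matched by the dot"
   ]
  },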
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Caret `^`\n",
    "\n",
    "If we use a `^` in the beginning of the regular expression, then we only care about matches in the beginning of the string. That is, `^wow` would only match if the string started with \"wow\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 3), match='wow'>\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = \"^wow\"\n",
    "\n",
    "print(re.search(regex, \"wow, this is awesome\"))\n",
    "print(re.search(regex, \"this is awesome, wow\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that `^` inside the character class can also mean \"anything but whatever is in this class\", so the regular expression `[^d]uck` matches **uck** preceded by any character other than \"d\": it matches in \"luck\" and \"truck\", but not in \"duck\". If the caret `^` appears inside a character class `[]` but it is not the first character, then it has no special meaning and it just stands for the character itself. This means that the regex `[()^{}]` is looking to match any of the characters listed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 1), match='^'>\n",
      "<re.Match object; span=(0, 1), match='('>\n",
      "<re.Match object; span=(0, 1), match='}'>\n"
     ]
    }
   ],
   "source": [
    "regex = \"[()^{}]\"\n",
    "print(re.search(regex, \"^\"))\n",
    "print(re.search(regex, \"(\"))\n",
    "print(re.search(regex, \"}\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dollar sign `$`\n",
    "\n",
    "Contrary to the caret `^`, the dollar sign only matches at the end of the string!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n",
      "<re.Match object; span=(17, 20), match='wow'>\n"
     ]
    }
   ],
   "source": [
    "regex = \"wow$\"\n",
    "\n",
    "print(re.search(regex, \"wow, this is awesome\"))\n",
    "print(re.search(regex, \"this is awesome, wow\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Combining the `^` with the `$` means we are looking to match the whole string with our pattern. For example `^[a-zA-Z ]*$` checks if our string only contains letters and spaces and nothing else:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 47), match='this is a sentence with only letters and spaces'>\n",
      "None\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = \"^[a-zA-Z ]*$\"\n",
    "\n",
    "s1 = \"this is a sentence with only letters and spaces\"\n",
    "s2 = \"this sentence has 1 number\"\n",
    "s3 = \"this one has punctuation...\"\n",
    "\n",
    "print(re.search(regex, s1))\n",
    "print(re.search(regex, s2))\n",
    "print(re.search(regex, s3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Character groups `\\d`, `\\w` and `\\s`\n",
    "\n",
    "Whenever you see a backslash followed by a letter, that probably means that something _special_ is going on. These three special \"characters\" are shorthand notation for some character classes `[]`. For example, the `\\d` is the same as `[0-9]`. The `\\w` represents any alphanumeric character (like letters, numbers and `_`), and `\\s` represents any whitespace character (like the space \" \", the tab, the newline, etc).\n",
    "\n",
    "All three of these special characters can be capitalized. If they are, then they mean the exact opposite! So `\\D` means \"anything **except** a digit\", `\\W` means \"anything **except** an alphanumeric character\" and `\\S` means \"anything **except** whitespace characters\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['these are some words']\n"
     ]
    }
   ],
   "source": [
    "regex = \"\\D+\"\n",
    "s = \"these are some words\"\n",
    "print(re.findall(regex, s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On top of that, these special characters can be used inside a character class, so for instance `[abc\\d]` would match any digit and the letters \"a\", \"b\" and \"c\". If the caret character `^` is used, then we are excluding whatever the special character refers to. As an example, if `[\\d]` would match any digit, then `[^\\d]` will match anything that is not a digit."
   ]
  },
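  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a minimal sketch of both ideas, using a made-up string: the first pattern should pick up runs of digits and the letters \"a\", \"b\" and \"c\", while the second one should pick up runs of anything that is not a digit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "s = \"a1b2c3 xyz\"  # illustrative string, not from the exercises\n",
    "\n",
    "# digits together with the letters a, b and c\n",
    "print(re.findall(\"[abc\\d]+\", s))\n",
    "\n",
    "# anything that is not a digit\n",
    "print(re.findall(\"[^\\d]+\", s))"
   ]
  },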
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The backslash `\\`\n",
    "\n",
    "We already saw the backslash being used before letters to give them some special meaning... Well, the backslash before a special character also strips it of its special meaning! So, if you wanted to match a backslash, you could use `\\\\`. If you want to match any of the other special characters we already saw, you could put a `\\` before them, like `\\+` to match a plus sign. The next regular expression can be used to match an addition expression like \"16 + 6\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 6), match='16 + 6'>\n",
      "<re.Match object; span=(0, 6), match='4325+2'>\n",
      "<re.Match object; span=(0, 6), match='4+ 564'>\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = \"[\\d]+ ?\\+ ?[\\d]+\"\n",
    "add1 = \"16 + 6\"\n",
    "add2 = \"4325+2\"\n",
    "add3 = \"4+ 564\"\n",
    "mult1 = \"56 * 2\"\n",
    "\n",
    "print(re.search(regex, add1))\n",
    "print(re.search(regex, add2))\n",
    "print(re.search(regex, add3))\n",
    "print(re.search(regex, mult1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Phone numbers v2\n",
    "\n",
    "Now I invite you to take a look at [Phone numbers v1](#Phone-numbers-v1) and rewrite your regular expression to include some new special characters that you didn't know before!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regex = \"\"  # write your regex here\n",
    "matches = [  # you should be able to match those\n",
    "    \"202-555-0181\",\n",
    "    \"001 202-555-0181\",\n",
    "    \"+1-512-555-0191\"\n",
    "]\n",
    "non_matches = [  # for now, none of these should be matched\n",
    "    \"202555-0181\",\n",
    "    \"96-125-3546\",\n",
    "    \"(+1)5125550191\"\n",
    "]\n",
    "for s in matches:\n",
    "    print(re.search(regex, s))\n",
    "for s in non_matches:\n",
    "    print(re.search(regex, s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Groups\n",
    "\n",
    "So far, when we used a regex to match a string we could retrieve the whole information of the match by using the `.group()` function on the match object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "my nam is\n"
     ]
    }
   ],
   "source": [
    "regex = \"my name? is\"\n",
    "\n",
    "m = re.search(regex, \"my nam is Virgilio\")\n",
    "if m is not None:\n",
    "    print(m.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Say we are dealing with phone numbers again, and we want to look for phone numbers in a big text. But after that, we also want to extract the country code of each number. How could we do it..? Well, we can use a regex to match the phone numbers, and then use a second regex to extract the country code, right? (Let us just assume that phone numbers are written with the digits all in a sequence, with no spaces or \"-\" separating them.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The country code is: +351\n",
      "The country code is: 001\n",
      "The country code is: +1\n",
      "The country code is: 0048\n"
     ]
    }
   ],
   "source": [
    "regex_number = \"((00|[+])\\d{1,3}[ -])\\d{8,12}\"\n",
    "regex_code = \"((00|[+])\\d{1,3})\"\n",
    "matches = [  # you should be able to match those\n",
    "    \"+351 2025550181\",\n",
    "    \"001 2025550181\",\n",
    "    \"+1-5125550191\",\n",
    "    \"0048 123456789\"\n",
    "]\n",
    "\n",
    "for s in matches:\n",
    "    m = re.search(regex_number, s)  # match the phone number\n",
    "    if m is not None:\n",
    "        phone_number = m.group()    # extract the phone number\n",
    "        code = re.search(regex_code, phone_number)  # match the country code\n",
    "        print(\"The country code is: {}\".format(code.group()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But not only is this repetitive, because I just copied the beginning of the `regex_number` into the `regex_code`, it also becomes very cumbersome if I am trying to retrieve several different parts of my match. This is why regular expressions support _grouping_. By grouping parts of the regular expression, you can do things like using the repetition operators on them and **retrieve their information** later on.\n",
    "\n",
    "To do grouping, one only needs to use the `()` parentheses. For example, the regex `(ab)+` looks for matches of the form \"ab\", \"abab\", \"ababab\", etc.\n",
    "\n",
    "We also used the grouping [in the beginning](#Matching-options) to create a regex that matched \"Virgilio\" and \"virgilio\", by writing `(V|v)irgilio`.\n",
    "\n",
    "Now off to the part that really matters! We can use grouping to retrieve portions of the matches, and we do that with the `.group()` function! Any set of `()` defines a group, and then we can use the `.group(i)` function to retrieve group `i`. Just note that the 0th group is always the whole match, and the remaining groups are numbered from 1 by the position of their opening parenthesis, counting from the left!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abc defghi\n",
      "abc defghi\n",
      "abc\n",
      "defghi\n",
      "fg\n",
      "('abc', 'defghi', 'fg')\n"
     ]
    }
   ],
   "source": [
    "regex_with_grouping = \"(abc) (de(fg)hi)\"\n",
    "m = re.search(regex_with_grouping, \"abc defghi jklm n opq\")\n",
    "print(m.group())\n",
    "print(m.group(0))\n",
    "print(m.group(1))\n",
    "print(m.group(2))\n",
    "print(m.group(3))\n",
    "print(m.groups())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that `match.group()` and `match.group(0)` are the same thing. Also note that the function `match.groups()` returns all the groups, except for the 0th one, in a tuple!"
   ]
  },
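  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Going back to the country codes, here is a minimal sketch of how a single regex with groups could replace the two regexes used earlier. The placement of the parentheses is an assumption made for illustration: the separator is moved outside the first group, so that `.group(1)` should hold only the country code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# group 1 is the country code; the separator and the number stay outside of it\n",
    "regex_number = \"((00|[+])\\d{1,3})[ -]\\d{8,12}\"\n",
    "\n",
    "for s in [\"+351 2025550181\", \"0048 123456789\"]:\n",
    "    m = re.search(regex_number, s)\n",
    "    if m is not None:\n",
    "        # no second regex needed, the group already holds the code\n",
    "        print(\"The country code is: {}\".format(m.group(1)))"
   ]
  },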
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Phone numbers v3\n",
    "\n",
    "Using what you learned so far, write a regex that matches phone numbers with different country codes. Assume the following:\n",
    "\n",
    "  - The country code starts with either `00` or `+`, followed by one to three digits\n",
    "  - The phone number has between 8 and 12 digits\n",
    "  - The phone number and country code are separated by a space \" \" or by a hyphen \"-\"\n",
    "  \n",
    "Have your code look for phone numbers in the string I will provide next, and have it print the different country codes it finds.\n",
    "\n",
    "You might want to read what the exact behaviour of `re.findall()` is when the regex has groups in it. You can do that by checking the [documentation of the `re` module](https://docs.python.org/3/library/re.html#re.findall)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "paragraph = \"\"\"Hello, I am Virgilio and I am from Italy.\n",
    "If phones were a thing when I was alive, my number would've probably been 0039 3123456789.\n",
    "I would also love to get a house with 3 floors and something like +1 000 square meters.\n",
    "Now that we are at it, I can also tell you that the number 0039 3135313531 would have suited Leo da Vinci very well...\n",
    "And come to think of it, someone told me that Socrates had dibs on +30-2111112222\"\"\"\n",
    "# you should find 3 phone numbers\n",
    "# and you should not be fooled by the other numbers that show up in the text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Toy project about regex\n",
    "\n",
    "For the toy project, which is far from trivial, your task is to mimic what [I did here](http://mathspp.blogspot.com/2017/11/on-computing-all-patterns-matched-by.html). If you follow that link, you will find a piece of code that takes a regular expression and then prints all the strings that the given regex would match.\n",
    "\n",
    "I'll just give you a couple of examples on how this works:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append(\"./regex-bin\")\n",
    "import regexPrinter\n",
    "\n",
    "def get_iter(regex):\n",
    "    return regexPrinter.printRegex(regex).print()\n",
    "\n",
    "def printall(regex):\n",
    "    for poss_match in get_iter(regex):\n",
    "        print(poss_match)\n",
    "\n",
    "regex = \"V|virgilio\"\n",
    "printall(regex)\n",
    "print(\"-\"*30)\n",
    "regex = \"wo+w\"\n",
    "printall(regex)\n",
    "print(\"-\"*30)\n",
    "# notice that for some reason, dumb me used {n:m} instead of {n,m}\n",
    "# also note that I only implemented {n,m}, and not {n,} nor {,m} nor {n}\n",
    "# also note that this supports neither \\d nor [0-9]\n",
    "regex = \"((00|[+])1[ -])?[0123456789]{3:3}\"\n",
    "printall(regex)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the code is protected against infinite patterns, which are signaled with `...`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "this is infinite!\n",
      "this is infinite!!\n",
      "this is infinite!...!\n"
     ]
    }
   ],
   "source": [
    "printall(\"this is infinite!+\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are completely new to this sort of thing, then this will look completely impossible... but it is not, because I am a normal person and I was able to do it! So if you really want to, you can also do it! The link lists all the functionality I decided to include, which excludes `\\d`, for example.\n",
    "\n",
    "I was only able to do this in the way I did because I had gone through some (not all) of the blog posts in [this amazing series](https://ruslanspivak.com/lsbasi-part1/).\n",
    "\n",
    "Maybe you can implement a smaller subset of the features without too much trouble? The point of this is that you could only print the strings matched by a regex if you know how regular expressions work. Try starting with only implementing literal matching and the `|` and `?` operators. Can you now include grouping `()` so that `(ab)?` would work as expected? Can you add `[]`? What about `+` and `*`? Or maybe start with `{n,m}` and write `?`, `+` and `*` as `{0,1}`, `{1,}` and `{0,}` respectively.\n",
    "\n",
    "You can also postpone this project for a bit, and dig deeper into the world of regex. The next section contains some additional references and some websites with exercises to practice your new knowledge!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Further reading\n",
    "For regular expressions in Python, you can take a look at the [documentation](https://docs.python.org/3/library/re.html) of the `re` module, as well as this [regex HOWTO](https://docs.python.org/3/howto/regex.html).\n",
    "\n",
    "Some nice topics to follow up on this would include, but are not limited to:\n",
    "  - Non capturing groups (and named groups for Python)\n",
    "  - Lookaheads (positive, negative, ...)\n",
    "  - Regex compilation and flags (for Python)\n",
    "  - Recursive regular expressions\n",
    "\n",
    "[This](https://regexr.com/) interesting website (and [this one](https://regex101.com/) as well) provides an interface for you to type regular expressions and see what they match in a text. The tool also gives you an explanation of what your regular expression is doing.\n",
    "\n",
    "---\n",
    "\n",
    "I found some interesting websites with exercises on regular expressions. [This one](https://regexone.com/lesson/introduction_abcs) has more \"basic\" exercises, each one of them preceded by an explanation of whatever you will need to complete the exercise. I suggest you go through them. [Hackerrank](https://www.hackerrank.com/domains/regex) and [regexplay](http://play.inginf.units.it/#/) also have some interesting exercises, but those require you to log in in some way.\n",
    "\n",
    "---\n",
    "\n",
    "If you enjoyed this guide and/or it was useful, consider leaving a star in the [Virgilio repository](https://github.com/clone95/Virgilio) and sharing it with your friends!\n",
    "\n",
    "This was brought to you by the editor of the [Mathspp Blog](https://mathspp.blogspot.com), [RojerGS](https://github.com/RojerGS)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Suggested solutions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### $\\pi$ lookup (solved)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found the number '9876' at positions (4087, 4091)\n"
     ]
    }
   ],
   "source": [
    "pifile = \"regex-bin/pi.txt\"\n",
    "regex = \"9876\"  # define your regex to look your favourite number up\n",
    "\n",
    "with open(pifile, \"r\") as f:\n",
    "    pistr = f.read()  # pistr is a string that contains 1M digits of pi\n",
    "    \n",
    "## search for your number here\n",
    "m = re.search(regex, pistr)\n",
    "if m:\n",
    "    print(\"Found the number '{}' at positions {}\".format(regex, m.span()))\n",
    "else:\n",
    "    print(\"Sorry, the first million digits of pi can't help you with that...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Virgilio or Virgil? (solved)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called Virgil or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]\n",
      "\n",
      "Virgil is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. Virgil's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which Virgil appears as Dante's guide through Hell and Purgatory.\n"
     ]
    }
   ],
   "source": [
    "paragraphs = \\\n",
    "\"\"\"Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called virgilio or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]\n",
    "\n",
    "Virgilio is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. virgilio's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which virgilio appears as Dante's guide through Hell and Purgatory.\"\"\"\n",
    "\n",
    "regex = \"(V|v)irgilio\"\n",
    "parsed_str = paragraphs\n",
    "m = re.search(regex, parsed_str)\n",
    "while m is not None:\n",
    "    parsed_str = parsed_str[:m.start()] + \"Virgil\" + parsed_str[m.end():]\n",
    "    m = re.search(regex, parsed_str)\n",
    "\n",
    "print(parsed_str)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Removing excessive spaces (solved)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "Now it is your turn. I am going to give you this sentence as input, and your job is to fix the whitespace in it. When you are done, save the result in a string named `s`, and check if `s.count(\" \")` is equal to 0 or not.\n"
     ]
    }
   ],
   "source": [
    "weird_text = \"Now   it  is your   turn.  I am     going  to give   you this    sentence as        input, and   your  job    is to      fix the     whitespace         in it. When you    are  done,    save the    result in a  string  named   `s`, and   check    if  `s.count(\\\"  \\\")` is   equal   to    0  or not.\"\n",
    "regex = \" +\"  # put your regex here\n",
    "# there are several possible solutions, I chose this one\n",
    "\n",
    "# substitute the extra whitespace here\n",
    "s = re.sub(regex, \" \", weird_text)\n",
    "\n",
    "# this print should be 0\n",
    "print(s.count(\"  \"))\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Phone numbers v1 (solved)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 12), match='202-555-0181'>\n",
      "<re.Match object; span=(0, 16), match='001 202-555-0181'>\n",
      "<re.Match object; span=(0, 15), match='+1-512-555-0191'>\n",
      "None\n",
      "None\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = \"((00|[+])1[ -])?[0-9]{3}-[0-9]{3}-[0-9]{4}\"  # write your regex here\n",
    "matches = [  # you should be able to match those\n",
    "    \"202-555-0181\",\n",
    "    \"001 202-555-0181\",\n",
    "    \"+1-512-555-0191\"\n",
    "]\n",
    "non_matches = [  # for now, none of these should be matched\n",
    "    \"202555-0181\",\n",
    "    \"96-125-3546\",\n",
    "    \"(+1)5125550191\"\n",
    "]\n",
    "for s in matches:\n",
    "    print(re.search(regex, s))\n",
    "for s in non_matches:\n",
    "    print(re.search(regex, s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `search` with `match` (solved)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n",
      "True\n",
      "True\n",
      "True\n"
     ]
    }
   ],
   "source": [
    "def my_search(regex, string):\n",
    "    # try to match at every starting position by dropping one leading character at a time\n",
    "    while string:\n",
    "        if re.match(regex, string):\n",
    "            return True\n",
    "        string = string[1:]\n",
    "    # the string is now empty: check if the pattern matches the empty string\n",
    "    return re.match(regex, string) is not None\n",
    "\n",
    "regex = \"[0-9]{2,4}\"\n",
    "\n",
    "# your function should be able to match in all these strings\n",
    "string1 = \"1984 was already some years ago.\"\n",
    "print(my_search(regex, string1))\n",
    "string2 = \"There is also a book whose title is '1984', but the story isn't set in the year of 1984.\"\n",
    "print(my_search(regex, string2))\n",
    "string3 = \"Sometimes people write '84 for short.\"\n",
    "print(my_search(regex, string3))\n",
    "\n",
    "# your function should also match with this regex and this string\n",
    "regex = \"a*\"\n",
    "string = \"\"\n",
    "print(my_search(regex, string))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count matches with `findall` (solved)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\n",
      "1\n",
      "2\n"
     ]
    }
   ],
   "source": [
    "def count_matches(regex, string):\n",
    "    return len(re.findall(regex, string))\n",
    "\n",
    "regex = \"wow\"\n",
    "\n",
    "string1 = \"wow wow wow\" # this should be 3\n",
    "print(count_matches(regex, string1))\n",
    "string2 = \"wowow\" # this should be 1\n",
    "print(count_matches(regex, string2))\n",
    "string3 = \"wowowow\" # this should be 2\n",
    "print(count_matches(regex, string3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Phone numbers v2 (solved)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 12), match='202-555-0181'>\n",
      "<re.Match object; span=(0, 16), match='001 202-555-0181'>\n",
      "<re.Match object; span=(0, 15), match='+1-512-555-0191'>\n",
      "None\n",
      "None\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = r\"((00|[+])1[ -])?\\d{3}-\\d{3}-\\d{4}\"  # write your regex here\n",
    "matches = [  # you should be able to match those\n",
    "    \"202-555-0181\",\n",
    "    \"001 202-555-0181\",\n",
    "    \"+1-512-555-0191\"\n",
    "]\n",
    "non_matches = [  # for now, none of these should be matched\n",
    "    \"202555-0181\",\n",
    "    \"96-125-3546\",\n",
    "    \"(+1)5125550191\"\n",
    "]\n",
    "for s in matches:\n",
    "    print(re.search(regex, s))\n",
    "for s in non_matches:\n",
    "    print(re.search(regex, s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Phone numbers v3 (solved)\n",
    "\n",
    "For this problem, the natural first choice is the `re.findall()` function, which looks for all the matches. Note that `re.findall()` never returns match objects; since our regex contains two groups, it returns a list of tuples, where each tuple holds the text captured by each group in our regex. This is the behaviour that is [documented for the `re.findall()` function](https://docs.python.org/3/library/re.html#re.findall).\n",
    "\n",
    "This is fine, because we really only care about the country code, and we can print it easily. If we want the match objects, the alternative is to use the [`re.finditer()`](https://docs.python.org/3/library/re.html#re.finditer) function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('0039', '00')\n",
      "('0039', '00')\n",
      "('+30', '+')\n",
      "The number '0039 3123456789' has country code: 0039\n",
      "The number '0039 3135313531' has country code: 0039\n",
      "The number '+30-2111112222' has country code: +30\n"
     ]
    }
   ],
   "source": [
    "paragraph = \"\"\"Hello, I am Virgilio and I am from Italy.\n",
    "If phones were a thing when I was alive, my number would've probably been 0039 3123456789.\n",
    "I would also love to get a house with 3 floors and something like +1 000 square meters.\n",
    "Now that we are at it, I can also tell you that the number 0039 3135313531 would have suited Leo da Vinci very well...\n",
    "And come to think of it, someone told me that Socrates had dibs on +30-2111112222\"\"\"\n",
    "# you should find 3 phone numbers\n",
    "# and you should not be fooled by the other numbers that show up in the text\n",
    "\n",
    "regex = r\"((00|[+])\\d{1,3})[ -]\\d{8,12}\"\n",
    "ns = re.findall(regex, paragraph)  # find numbers\n",
    "for n in ns:\n",
    "    # n is a tuple with the two groups our regex has\n",
    "    print(n)\n",
    "    \n",
    "for n in re.finditer(regex, paragraph):\n",
    "    print(\"The number '{}' has country code: {}\".format(n.group(), n.group(1)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: Tools/WolframAlpha.md
================================================
# WolframAlpha
[WolframAlpha](https://www.wolframalpha.com) (WA) is a computational knowledge engine, which is a very fancy way of saying that WolframAlpha is a platform that can answer your questions. WolframAlpha is most notable for its capabilities regarding mathematics and it can be a very powerful tool to help you with your computations.

## Accessing WolframAlpha
WolframAlpha's knowledge engine can be accessed online through [wolframalpha.com](https://www.wolframalpha.com), but if you have access to a license, perhaps through your university, research center or company, you might want to install [Wolfram Mathematica](https://www.wolfram.com/mathematica/), _"a modern technical computing system spanning most areas of technical computing — including neural networks, machine learning, image processing, geometry, data science, visualizations, and others"_.

## WolframAlpha's mathematical capabilities
This little guide will focus on teaching some of WA's mathematical capabilities. Please bear in mind that there is much more that it can do! This is what we will be covering:

  - [Basic calculations](#basic-calculations)
  - [Plotting functions](#plotting-functions)
  - [Solving equations](#solving-equations)
  - [Solving inequalities](#solving-inequalities)
  - [Matrix algebra](#matrix-algebra)
  - [Computing series and summations](#computing-series-and-summations)
  - [Finding derivatives](#finding-derivatives)
  - [Computing integrals](#computing-integrals)
  - [Finding limits](#finding-limits)
  - [Miscellaneous](#miscellaneous)
  
Whenever you input something into WA, you get a link to your query, so that you can easily share what you asked and the answer you were given. For example, following [this link](https://www.wolframalpha.com/input/?i=Who+is+the+US+president) you can see what WA told me when I asked it who the US president is. Throughout this guide, blue letters with a gray background give a link to a WA query. So if you click this -> [`What is the 345th decimal place of pi`](https://www.wolframalpha.com/input/?i=What+is+the+345th+decimal+place+of+pi) you will see what WA answered when I asked for the 345th decimal place of pi (it's 5, by the way).
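
If you want to build such links yourself, here is a minimal Python sketch. It assumes the URL pattern seen in the links of this guide (`https://www.wolframalpha.com/input/?i=` followed by the query with spaces turned into `+`); the helper name `wa_query_link` is made up for this illustration and is not part of any official WA API.

```python
from urllib.parse import quote_plus

# Assumption: base URL inferred from the query links used in this guide
WA_INPUT_URL = "https://www.wolframalpha.com/input/"


def wa_query_link(query: str) -> str:
    """Return a shareable WolframAlpha link for a plain-text query (hypothetical helper)."""
    # quote_plus turns spaces into '+' and percent-encodes other special characters
    return f"{WA_INPUT_URL}?i={quote_plus(query)}"


print(wa_query_link("What is the 345th decimal place of pi"))
```

Calling the helper with the query from the example above gives back the same link used in this guide.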

Another important thing to notice is that you don't have to follow a strict syntax when asking WA things, even though the easier you make WA's life, the better.

Also note that Mathematica, the system developed by the creators of WA, uses `[]` for function calls instead of `()`, and all function names are capitalized, so `Sqrt[n]` gives you the usual square root function, which in many languages would be written as `sqrt(n)`. This is relevant because WA supports a subset of Mathematica's functions.

One final **very important** hint is that if you have Mathematica, you can get step-by-step solutions to limits, integrals and derivatives (to name only a few) by starting a command with `==`.
  
### Basic calculations
WolframAlpha can, of course, be used as a pretty advanced calculator. Typing in `2^100` will give you the well-known answer of `1267650600228229401496703205376`. Some useful operators to know include:
  - The usual addition `+`, subtraction `-`, multiplication `*` and division `/`
  - The power operator `^`, used as `x^y`, which can also be used as `Power[x, y]`
  - To find the remainder of a division, either type in `x mod m` or use `Mod[x, m]`
  - The square root is `Sqrt[x]`, and the `n`-th root of `x` is given by `R




SYMBOL INDEX (48 symbols across 3 files)

FILE: Tools/regex-bin/regexPrinter.py
  class Token (line 33) | class Token(Enum):
  class TreeNode (line 53) | class TreeNode(object):
    method __init__ (line 54) | def __init__(self, token, value, next_node):
    method print (line 59) | def print(self):
  class LiteralNode (line 62) | class LiteralNode(TreeNode):
    method __init__ (line 63) | def __init__(self, token, value, quantifier, next_node):
    method print (line 67) | def print(self):
  class ChooseNode (line 73) | class ChooseNode(TreeNode):
    method __init__ (line 74) | def __init__(self, token, value, quantifier, value_range, next_node):
    method print (line 79) | def print(self):
  class OrNode (line 85) | class OrNode(TreeNode):
    method __init__ (line 86) | def __init__(self, token, value, children, quantifier, next_node):
    method print (line 91) | def print(self):
  class NoQuantifierNode (line 98) | class NoQuantifierNode(TreeNode):
    method __init__ (line 99) | def __init__(self):
    method get_printer (line 102) | def get_printer(self, values):
  class QuantifierNode (line 108) | class QuantifierNode(TreeNode):
    method __init__ (line 109) | def __init__(self, token, value):
    method get_printer (line 112) | def get_printer(self, values):
  class EOFNode (line 145) | class EOFNode(TreeNode):
    method __init__ (line 146) | def __init__(self):
    method print (line 149) | def print(self):
  function tokenize (line 152) | def tokenize(s):
  function printRegex (line 197) | def printRegex(r):
  function parse_expr (line 200) | def parse_expr(tokens):
  function parse_orexpr (line 208) | def parse_orexpr(tokens):
  function parse_word (line 218) | def parse_word(tokens):
  function parse_quantchar (line 230) | def parse_quantchar(tokens):
  function parse_char (line 241) | def parse_char(tokens):
  function parse_quant (line 262) | def parse_quant(tokens):
  function parse_digit (line 268) | def parse_digit(tokens):

FILE: content/.vuepress/theme/index.js
  method alias (line 18) | alias () {

FILE: content/.vuepress/theme/util/index.js
  function normalize (line 6) | function normalize (path) {
  function getHash (line 12) | function getHash (path) {
  function isExternal (line 19) | function isExternal (path) {
  function isMailto (line 23) | function isMailto (path) {
  function isTel (line 27) | function isTel (path) {
  function ensureExt (line 31) | function ensureExt (path) {
  function isActive (line 45) | function isActive (route, path) {
  function resolvePage (line 56) | function resolvePage (pages, rawPath, base) {
  function resolvePath (line 79) | function resolvePath (relative, base, append) {
  function resolveSidebarItems (line 124) | function resolveSidebarItems (page, regularPath, site, localePath) {
  function resolveHeaders (line 151) | function resolveHeaders (page) {
  function groupHeaders (line 168) | function groupHeaders (headers) {
  function resolveNavLinkItem (line 182) | function resolveNavLinkItem (linkItem) {
  function resolveMatchingConfig (line 193) | function resolveMatchingConfig (regularPath, config) {
  function ensureEndingSlash (line 211) | function ensureEndingSlash (path) {
  function resolveItem (line 217) | function resolveItem (item, pages, base, groupDepth = 1) {


Condensed preview — 123 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (2,648K chars).

[
  {
    "path": ".github/workflows/deploy-vuepress.yml",
    "chars": 1590,
    "preview": "name: Build and deploy an updated version of the website\n\non:\n  push\n\njobs:\n  build:\n    runs-on: ubuntu-latest\n    step"
  },
  {
    "path": ".gitignore",
    "chars": 1762,
    "preview": ".DS_Store\n.vscode/*\n\n# Node things\nnode_modules/\npackage-lock.json\n\n# Byte-compiled / optimized / DLL files\n__pycache__/"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 3234,
    "preview": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nIn the interest of fostering an open and welcoming environment, w"
  },
  {
    "path": "LICENSE",
    "chars": 20845,
    "preview": "Attribution-NonCommercial-ShareAlike 4.0 International\n\n================================================================"
  },
  {
    "path": "README.md",
    "chars": 2225,
    "preview": "\n\n\n\n\n# [I've Launched a GenAI Framework that's robust and easy to learn and maintain, check it!](https://github.com/data"
  },
  {
    "path": "Specializations/HardSkills/DataPreprocessing.md",
    "chars": 19995,
    "preview": "# Data Preprocessing\n\nData preprocessing (also known as Data Preparation, but \"Preprocessing\" sounds more like magic) is"
  },
  {
    "path": "Specializations/HardSkills/DataVisualization.md",
    "chars": 28483,
    "preview": "# Data Visualization \n\nIt was hard for the Homo Sapiens to survive in the African savannah: a human or animal could kill"
  },
  {
    "path": "Specializations/SoftSkills/ImpactfulPresentations.md",
    "chars": 12377,
    "preview": "# Impactful Presentations\n\n## Why do you need to impress your audience?\n\nIn my day, in the ancient Rome, data scientists"
  },
  {
    "path": "Tools/GeoGebra.md",
    "chars": 4589,
    "preview": "# GeoGebra\n[GeoGebra](https://www.geogebra.org) (GG) is a powerful dynamic mathematics application for all levels of edu"
  },
  {
    "path": "Tools/Latex.md",
    "chars": 7160,
    "preview": "# LaTeX\r\nLaTeX is a markup language (or, as said in the [official website](https://www.latex-project.org/about/), \"a doc"
  },
  {
    "path": "Tools/MLDemos/README.md",
    "chars": 11181,
    "preview": "# MLDemos\n\n[MLDemos](http://mldemos.b4silio.com/) is an open-source visualization tool for machine learning algorithms c"
  },
  {
    "path": "Tools/Regex.ipynb",
    "chars": 81367,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Regex introduction\\n\",\n    \"\\n\",\n"
  },
  {
    "path": "Tools/WolframAlpha.md",
    "chars": 20847,
    "preview": "# WolframAlpha\n[WolframAlpha](https://www.wolframalpha.com) (WA) is a computational knowledge engine, which is a very fa"
  },
  {
    "path": "Tools/regex-bin/pi.txt",
    "chars": 51199,
    "preview": "3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066"
  },
  {
    "path": "Tools/regex-bin/regexPrinter.py",
    "chars": 9572,
    "preview": "\"\"\"Implements a parser for a subset of the regular expressions grammar\r\n        The parser returns a generator for all t"
  },
  {
    "path": "Topics/ANN.md",
    "chars": 9096,
    "preview": "### Deep Learning\nTo deal with problems of great complexity, such as the recognition of images or the understanding of h"
  },
  {
    "path": "Topics/Computer Vision/Introduction_to_Computer_Vision_using_OpenCV_and_Python.ipynb",
    "chars": 41559,
    "preview": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"name\": \"Introduction to Computer Vision "
  },
  {
    "path": "Topics/Computer Vision/Object_Instance_Segmentation_using_TensorFlow_Framework_and_Cloud_GPU_Technology.ipynb",
    "chars": 29318,
    "preview": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"name\": \"Object Instance Segmentation usi"
  },
  {
    "path": "Topics/Computer Vision/Object_Tracking_based_on_Deep_Learning.ipynb",
    "chars": 33579,
    "preview": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"name\": \"Object Tracking based on Deep Le"
  },
  {
    "path": "Topics/Computer Vision/Object_detection_based_on_Deep_Learning.ipynb",
    "chars": 32640,
    "preview": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"name\": \"Object detection based on  Deep "
  },
  {
    "path": "Topics/Deep learning in cloud/README.md",
    "chars": 7578,
    "preview": "# Deep learning in cloud\n\nEver had a laptop which is not powerful enough to run your models? Forget about it and use **C"
  },
  {
    "path": "Topics/Demystification.md",
    "chars": 11004,
    "preview": "---\ntitle: Demystification of the key concepts of AI and ML\nauthor: clone95\ndescription: Clarify the jargon and the idea"
  },
  {
    "path": "Topics/DialogFlow.md",
    "chars": 11928,
    "preview": "\n# ChatBots with DialogFlow, Python, and Flask\n\n## We have 99.94847 percent probability of death, Luke\nIn simple terms, "
  },
  {
    "path": "Topics/MLSystems.md",
    "chars": 15143,
    "preview": "## Machine Learning Systems\n\nThe following paragraphs aim to introduce in more detail how an ML system can present itsel"
  },
  {
    "path": "Topics/NLP/NLP.ipynb",
    "chars": 1095204,
    "preview": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"name\": \"NLP.ipynb\",\n      \"provenance\": "
  },
  {
    "path": "Topics/do_you_need_ml.md",
    "chars": 10968,
    "preview": "\n# Do you really need Machine Learning?\n\n# What you will learn \nThe purpose of this guide is to warn you that there is n"
  },
  {
    "path": "Topics/ds_process.md",
    "chars": 11017,
    "preview": "# The Data Science Process Lifecycle\n\n# What you will learn \nIn this guide you will understand the big picture of the Da"
  },
  {
    "path": "Topics/frame-the-problem.md",
    "chars": 19972,
    "preview": "# Frame the problem\n\n# What you will learn \nIn this guide, we try to figure out how to frame the kind of problem we want"
  },
  {
    "path": "Topics/jupyter-notebook.md",
    "chars": 9211,
    "preview": "# Jupyter Notebook\n\n# What you will learn \nIn this guide you'll learn how to use the Jupyter Notebook and the integrated"
  },
  {
    "path": "Topics/math-fundamentals.md",
    "chars": 9544,
    "preview": "# Math Fundamentals\n\n# What you will learn \nIn this guide, you'll learn what is the fundamental mathematical knowledge y"
  },
  {
    "path": "Topics/prerequisites.md",
    "chars": 17908,
    "preview": "# **What** do you need to do Machine Learning?\n\n\n# What you will learn \nIn this guide you will learn which elements do y"
  },
  {
    "path": "Topics/python-fundamentals.md",
    "chars": 13619,
    "preview": "# Python Fundamentals\n\n# What you will learn \n\nIn this guide you'll learn the basics concepts computer science and progr"
  },
  {
    "path": "Topics/starting-a-data-project.md",
    "chars": 17463,
    "preview": "# Starting a Data Project\n\n# What you will learn \nIn this guide, you'll learn to prepare yourself to start the project. "
  },
  {
    "path": "Topics/statistics-fundamentals.md",
    "chars": 9585,
    "preview": "# Statistics Fundamentals\n\n# What you will learn \n\nIn this guide, you'll learn what is the fundamental statistical knowl"
  },
  {
    "path": "Topics/teaching.md",
    "chars": 15474,
    "preview": "\n# Virgilio's Learning Strategy - Learning to Learn\n\n# What you will learn \nThis guide serves various purposes: \n\n- To u"
  },
  {
    "path": "Topics/usage-and-integration.md",
    "chars": 7873,
    "preview": "# Usage and Integration\n\n# What you will learn \n\nIn this guide we see which are the key questions to ask when framing a "
  },
  {
    "path": "Topics/use-cases.md",
    "chars": 10904,
    "preview": "\n# Machine Learning Use Cases\n\n# What you will learn \nThe purpose of this guide is to give a high-level overview of the "
  },
  {
    "path": "content/.vuepress/LICENSE",
    "chars": 1114,
    "preview": "The MIT License (MIT)\n\nCopyright (c) 2018-present, Yuxi (Evan) You, Virgilio contributors\n\nPermission is hereby granted,"
  },
  {
    "path": "content/.vuepress/config.js",
    "chars": 10921,
    "preview": "module.exports = {\n    title: 'Virgilio',\n    base: \"/Virgilio/\",\n    description: 'Data Science E-Learning',\n  plugins:"
  },
  {
    "path": "content/.vuepress/public/googlece1290fc3980cafc.html",
    "chars": 53,
    "preview": "google-site-verification: googlece1290fc3980cafc.html"
  },
  {
    "path": "content/.vuepress/public/vollkorn/SIL Open Font License.txt",
    "chars": 4299,
    "preview": "This Font Software is licensed under the SIL Open Font License, Version 1.1.\nThis license is copied below, and is also a"
  },
  {
    "path": "content/.vuepress/theme/LICENSE",
    "chars": 1091,
    "preview": "The MIT License (MIT)\n\nCopyright (c) 2018-present, Yuxi (Evan) You\n\nPermission is hereby granted, free of charge, to any"
  },
  {
    "path": "content/.vuepress/theme/components/AlgoliaSearchBox.vue",
    "chars": 4589,
    "preview": "<template>\n  <form\n    id=\"search-form\"\n    class=\"algolia-search-wrapper search-box\"\n    role=\"search\"\n  >\n    <input\n "
  },
  {
    "path": "content/.vuepress/theme/components/DropdownLink.vue",
    "chars": 5108,
    "preview": "<template>\n  <div\n    class=\"dropdown-wrapper\"\n    :class=\"{ open }\"\n  >\n    <button\n      class=\"dropdown-title\"\n      "
  },
  {
    "path": "content/.vuepress/theme/components/DropdownTransition.vue",
    "chars": 561,
    "preview": "<template>\n  <transition\n    name=\"dropdown\"\n    @enter=\"setHeight\"\n    @after-enter=\"unsetHeight\"\n    @before-leave=\"se"
  },
  {
    "path": "content/.vuepress/theme/components/Home.vue",
    "chars": 3538,
    "preview": "<template>\n  <main\n    class=\"home\"\n    aria-labelledby=\"main-title\"\n  >\n    <header class=\"hero\">\n      <img\n        v-"
  },
  {
    "path": "content/.vuepress/theme/components/NavLink.vue",
    "chars": 1561,
    "preview": "<template>\n  <RouterLink\n    v-if=\"isInternal\"\n    class=\"nav-link\"\n    :to=\"link\"\n    :exact=\"exact\"\n    @focusout.nati"
  },
  {
    "path": "content/.vuepress/theme/components/NavLinks.vue",
    "chars": 3804,
    "preview": "<template>\n  <nav\n    v-if=\"userLinks.length || repoLink\"\n    class=\"nav-links\"\n  >\n    <!-- user links -->\n    <div\n   "
  },
  {
    "path": "content/.vuepress/theme/components/Navbar.vue",
    "chars": 3522,
    "preview": "<template>\n  <header class=\"navbar\">\n    <SidebarButton @toggle-sidebar=\"$emit('toggle-sidebar')\" />\n\n    <RouterLink\n  "
  },
  {
    "path": "content/.vuepress/theme/components/Page.vue",
    "chars": 535,
    "preview": "<template>\n  <main class=\"page\">\n    <slot name=\"top\" />\n\n    <Content class=\"theme-default-content\" />\n    <PageEdit />"
  },
  {
    "path": "content/.vuepress/theme/components/PageEdit.vue",
    "chars": 3167,
    "preview": "<template>\n  <footer class=\"page-edit\">\n    <div\n      v-if=\"editLink\"\n      class=\"edit-link\"\n    >\n      <a\n        :h"
  },
  {
    "path": "content/.vuepress/theme/components/PageNav.vue",
    "chars": 3376,
    "preview": "<template>\n  <div\n    v-if=\"prev || next\"\n    class=\"page-nav\"\n  >\n    <p class=\"inner\">\n      <span\n        v-if=\"prev\""
  },
  {
    "path": "content/.vuepress/theme/components/Sidebar.vue",
    "chars": 1240,
    "preview": "<template>\n  <aside class=\"sidebar\">\n    <NavLinks />\n\n    <slot name=\"top\" />\n\n    <SidebarLinks\n      :depth=\"0\"\n     "
  },
  {
    "path": "content/.vuepress/theme/components/SidebarButton.vue",
    "chars": 990,
    "preview": "<template>\n  <div\n    class=\"sidebar-button\"\n    @click=\"$emit('toggle-sidebar')\"\n  >\n    <svg\n      class=\"icon\"\n      "
  },
  {
    "path": "content/.vuepress/theme/components/SidebarGroup.vue",
    "chars": 2892,
    "preview": "<template>\n  <section\n    class=\"sidebar-group\"\n    :class=\"[\n      {\n        collapsable,\n        'is-sub-group': depth"
  },
  {
    "path": "content/.vuepress/theme/components/SidebarLink.vue",
    "chars": 3291,
    "preview": "<script>\nimport { isActive, hashRE, groupHeaders } from '../util'\n\nexport default {\n  functional: true,\n\n  props: ['item"
  },
  {
    "path": "content/.vuepress/theme/components/SidebarLinks.vue",
    "chars": 1991,
    "preview": "<template>\n  <ul\n    v-if=\"items.length\"\n    class=\"sidebar-links\"\n  >\n    <li\n      v-for=\"(item, i) in items\"\n      :k"
  },
  {
    "path": "content/.vuepress/theme/global-components/Badge.vue",
    "chars": 805,
    "preview": "<script>\nexport default {\n  functional: true,\n  props: {\n    type: {\n      type: String,\n      default: 'tip'\n    },\n   "
  },
  {
    "path": "content/.vuepress/theme/index.js",
    "chars": 1442,
    "preview": "const path = require('path')\n\n// Theme API.\nmodule.exports = (options, ctx) => {\n  const { themeConfig, siteConfig } = c"
  },
  {
    "path": "content/.vuepress/theme/layouts/404.vue",
    "chars": 530,
    "preview": "<template>\n  <div class=\"theme-container\">\n    <div class=\"theme-default-content\">\n      <h1>404</h1>\n\n      <blockquote"
  },
  {
    "path": "content/.vuepress/theme/layouts/Layout.vue",
    "chars": 3185,
    "preview": "<template>\n  <div\n    class=\"theme-container\"\n    :class=\"pageClasses\"\n    @touchstart=\"onTouchStart\"\n    @touchend=\"onT"
  },
  {
    "path": "content/.vuepress/theme/noopModule.js",
    "chars": 18,
    "preview": "export default {}\n"
  },
  {
    "path": "content/.vuepress/theme/styles/arrow.styl",
    "chars": 577,
    "preview": "@require './config'\n\n.arrow\n  display inline-block\n  width 0\n  height 0\n  &.up\n    border-left 4px solid transparent\n   "
  },
  {
    "path": "content/.vuepress/theme/styles/code.styl",
    "chars": 2861,
    "preview": "{$contentClass}\n  code\n    color lighten($textColor, 20%)\n    padding 0.25rem 0.5rem\n    margin 0\n    font-size 0.85em\n "
  },
  {
    "path": "content/.vuepress/theme/styles/config.styl",
    "chars": 41,
    "preview": "$contentClass = '.theme-default-content'\n"
  },
  {
    "path": "content/.vuepress/theme/styles/custom-blocks.styl",
    "chars": 961,
    "preview": ".custom-block\n  .custom-block-title\n    font-weight 600\n    margin-bottom -0.4rem\n  &.tip, &.warning, &.danger\n    paddi"
  },
  {
    "path": "content/.vuepress/theme/styles/index.styl",
    "chars": 4147,
    "preview": "@require './config'\n@require './code'\n@require './custom-blocks'\n@require './arrow'\n@require './wrapper'\n@require './toc"
  },
  {
    "path": "content/.vuepress/theme/styles/mobile.styl",
    "chars": 731,
    "preview": "@require './config'\n\n$mobileSidebarWidth = $sidebarWidth * 0.82\n\n// narrow desktop / iPad\n@media (max-width: $MQNarrow)\n"
  },
  {
    "path": "content/.vuepress/theme/styles/toc.styl",
    "chars": 54,
    "preview": ".table-of-contents\n  .badge\n    vertical-align middle\n"
  },
  {
    "path": "content/.vuepress/theme/styles/wrapper.styl",
    "chars": 180,
    "preview": "$wrapper\n  max-width $contentWidth\n  margin 0 auto\n  padding 2rem 2.5rem\n  @media (max-width: $MQNarrow)\n    padding 2re"
  },
  {
    "path": "content/.vuepress/theme/util/index.js",
    "chars": 5699,
    "preview": "export const hashRE = /#.*$/\nexport const extRE = /\\.(md|html)$/\nexport const endingSlashRE = /\\/$/\nexport const outboun"
  },
  {
    "path": "content/README.md",
    "chars": 10142,
    "preview": "# <div class=\"title\">*Virgilio* <a style=\"display:inline\" href=\"https://github.com/Virgili0/Virgilio\"><img alt=\"GitHub s"
  },
  {
    "path": "content/docs/contributing.md",
    "chars": 6720,
    "preview": "# Index\n - [You have immense power](#Contribute)\n - [Easiest way to contribute](#Easiest-way-to-contribute)\n - [Contribu"
  },
  {
    "path": "content/docs/contributors.md",
    "chars": 1271,
    "preview": "# Contributors\n\nHere's the list of the awesome people making Virgilio possible.\n\nCore team:\n\n- **[clone95](https://githu"
  },
  {
    "path": "content/docs/template.md",
    "chars": 1650,
    "preview": "This file contains a generic template for a guide about some subject X.\n\nPlease remember to:\n - be consistent throughout"
  },
  {
    "path": "content/inferno/computer-vision/Object_detection_based_on_Deep_Learning.ipynb",
    "chars": 31961,
    "preview": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"OaEGVW0XCzhL\",\n        \"colab_typ"
  },
  {
    "path": "content/inferno/computer-vision/introduction-to-computer-vision.ipynb",
    "chars": 41587,
    "preview": "{\n  \"cells\": [\n    {\n      \"metadata\": {\n        \"id\": \"OmqAkANMsWql\",\n        \"colab_type\": \"text\"\n      },\n      \"cell"
  },
  {
    "path": "content/inferno/computer-vision/object-instance-segmentation.ipynb",
    "chars": 29325,
    "preview": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"s7QFZ6ztHsSC\",\n        \"colab_typ"
  },
  {
    "path": "content/inferno/computer-vision/object-tracking.ipynb",
    "chars": 33578,
    "preview": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"_3CcxHNErMI2\",\n        \"colab_typ"
  },
  {
    "path": "content/inferno/research/sota-papers.md",
    "chars": 31727,
    "preview": "# Research papers explained\n\n## Year-by-Year\n* [2020](#2020)\n* [2019](#2019)\n* [2018](#2018)\n* [2017](#2017)\n\n\n## 2020\nC"
  },
  {
    "path": "content/inferno/research/zotero.md",
    "chars": 4398,
    "preview": "---\ntitle: Zotero\nauthor: khaledbay\ndescription: Explain how to use the bibliographic reference management tool Zotero. "
  },
  {
    "path": "content/inferno/soft-skills/impactful-presentations.md",
    "chars": 12233,
    "preview": "---\ntitle: Impactful Presentations\nauthor: clone95\ndescription: A detailed guide about \"how to impress your audience\".\n-"
  },
  {
    "path": "content/inferno/time-series/introduction-to-time-series.md",
    "chars": 21629,
    "preview": "---\ntitle: Introduction to Time Series\nauthor: clone95\ndescription: This guide aims to show you the Data Science applica"
  },
  {
    "path": "content/inferno/tools/geo-gebra.md",
    "chars": 5057,
    "preview": "---\ntitle: Geo Gebra\nauthor: khaledbay\ndescription: The purpose of this guide is to show you the powerful mathematics ap"
  },
  {
    "path": "content/inferno/tools/latex.md",
    "chars": 7742,
    "preview": "---\ntitle: LaTex\nauthor: damianoazzolini\ndescription: The purpose of this guide is to show you the endless capabilities "
  },
  {
    "path": "content/inferno/tools/regex.ipynb",
    "chars": 81367,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Regex introduction\\n\",\n    \"\\n\",\n"
  },
  {
    "path": "content/inferno/tools/wolfram-alpha.md",
    "chars": 20980,
    "preview": "---\ntitle: Wolfram Alpha\nauthor: rogerjs\ndescription: The purpose of this guide is to show you the endless capabilities "
  },
  {
    "path": "content/inferno/virtual-assistants/dialogflow-chatbot.md",
    "chars": 12266,
    "preview": "---\ntitle: Chatbot with DialogFlo\nauthor: clone95\ndescription: The purpose of this guide is to guide you through the cre"
  },
  {
    "path": "content/inferno/welcome-to-inferno/welcome-to-inferno.md",
    "chars": 2467,
    "preview": "---\ntitle: Welcome To Inferno\nauthor: clone95\ndescription: This guide explains how the Inferno section of Virgilio is or"
  },
  {
    "path": "content/package.json",
    "chars": 755,
    "preview": "{\n  \"name\": \"virgilio\",\n  \"version\": \"0.0.1\",\n  \"description\": \"Your new Mentor for Data Science E-Learning.\",\n  \"main\":"
  },
  {
    "path": "content/paradiso/demystification-ai-ml-dl.md",
    "chars": 11115,
    "preview": "---\ntitle: Demystification of the key concepts of AI and ML\nauthor: clone95\ndescription: This guide wants to clarify ide"
  },
  {
    "path": "content/paradiso/do-you-really-need-ml.md",
    "chars": 11087,
    "preview": "---\ntitle: Do you really need Machine Learning?\nauthor: clone95\ndescription: Understand for which kind of problems it ma"
  },
  {
    "path": "content/paradiso/introduction-to-ml.md",
    "chars": 15581,
    "preview": "---\ntitle: Introduction to Machine Learning\nauthor: clone95\ndescription: This guide aims to introduce you to how an ML s"
  },
  {
    "path": "content/paradiso/use-cases.md",
    "chars": 10912,
    "preview": "---\ntitle: Machine Learning use cases\nauthor: clone95\ndescription: The purpose of this guide is to give a high-level ove"
  },
  {
    "path": "content/paradiso/virgilio-teaching-strategy.md",
    "chars": 15584,
    "preview": "---\ntitle: Virgilio's Teaching Strategy\nauthor: clone95\ndescription: Give learning advices, best practice in using onlin"
  },
  {
    "path": "content/paradiso/what-do-i-need-for-ml.md",
    "chars": 18014,
    "preview": "---\ntitle: What do I need to do Machine Learning?\nauthor: clone95\ndescription: In this guide, you will learn which eleme"
  },
  {
    "path": "content/purgatorio/collect-and-prepare-data/data-collection-text-to-diagram-01.txt",
    "chars": 439,
    "preview": "# Object And Messages\n[ ... start ] -> Data Collection : research or business goal / question\nData Collection -> [... in"
  },
  {
    "path": "content/purgatorio/collect-and-prepare-data/data-collection.md",
    "chars": 18627,
    "preview": "---\ntitle: Data Collection\nauthor: neomatrix369\ndescription: The purpose of this guide is to talk about data collection "
  },
  {
    "path": "content/purgatorio/collect-and-prepare-data/data-preparation.md",
    "chars": 28154,
    "preview": "---\ntitle: Data Preparation\nauthor: clone95, neomatrix369\ndescription: The purpose of this guide is to show you the diff"
  },
  {
    "path": "content/purgatorio/collect-and-prepare-data/data-visualization.md",
    "chars": 28653,
    "preview": "---\ntitle: Data Visualization\nauthor: clone95\ndescription: The purpose of this guide is to show you the importance of da"
  },
  {
    "path": "content/purgatorio/define-the-scope-and-ask-questions/frame-the-problem.md",
    "chars": 19299,
    "preview": "---\ntitle: Frame the Problem\nauthor: clone95\ndescription: Understand which kind of problem you want to solve and define "
  },
  {
    "path": "content/purgatorio/define-the-scope-and-ask-questions/starting-a-data-project.md",
    "chars": 16999,
    "preview": "---\ntitle: Starting a Data Project\nauthor: clone95\ndescription: Learn to look for sources that can help you solve the pr"
  },
  {
    "path": "content/purgatorio/define-the-scope-and-ask-questions/usage-and-integration.md",
    "chars": 7198,
    "preview": "---\ntitle: Usage and Integration\nauthor: clone95\ndescription: In this guide, we see which are the key questions to ask w"
  },
  {
    "path": "content/purgatorio/define-the-scope-and-ask-questions/workspace-setup-and-cloud-computing.md",
    "chars": 10961,
    "preview": "---\ntitle: Workspace Setup and Cloud Computing \nauthor: zszazi | clone95\ndescription: Setup your workspace locally and u"
  },
  {
    "path": "content/purgatorio/fundamentals/jupyter-notebook.md",
    "chars": 8670,
    "preview": "---\ntitle: Jupyter Notebook\nauthor: clone95\ndescription: Learn how to use the Jupyter Notebook, the most popular applica"
  },
  {
    "path": "content/purgatorio/fundamentals/math-fundamentals.md",
    "chars": 9080,
    "preview": "---\ntitle: Math Fundamentals\nauthor: clone95\ndescription: In this guide, you'll learn what is the fundamental mathematic"
  },
  {
    "path": "content/purgatorio/fundamentals/python-fundamentals.md",
    "chars": 13666,
    "preview": "---\ntitle: Python Fundamentals\nauthor: clone95\ndescription: In this guide you'll learn the basics concepts computer scie"
  },
  {
    "path": "content/purgatorio/fundamentals/statistics-fundamentals.md",
    "chars": 9267,
    "preview": "---\ntitle: Statistics Fundamentals\nauthor: clone95\ndescription: Learn what is the fundamental statistical knowledge you "
  },
  {
    "path": "content/purgatorio/fundamentals/the-data-science-process.md",
    "chars": 11093,
    "preview": "---\ntitle: The Data Science Process\nauthor: clone95\ndescription: In this guide, you will understand the big picture of t"
  },
  {
    "path": "content/purgatorio/launch-and-mantain-the-system/automation-and-reproducibility.md",
    "chars": 15907,
    "preview": "---\ntitle: Automation and Reproducibility\nauthor: clone95\ndescription: This guide introduces you to best practices of th"
  },
  {
    "path": "content/purgatorio/launch-and-mantain-the-system/monitoring-usage-and-behavior.md",
    "chars": 10446,
    "preview": "---\ntitle: Monitoring Usage and Behavior\nauthor: clone95\ndescription: This guide introduces you to the best practices of"
  },
  {
    "path": "content/purgatorio/launch-and-mantain-the-system/serving-trained-models.md",
    "chars": 9925,
    "preview": "---\ntitle: Serving Trained Models\nauthor: clone95\ndescription: In this guide, you will learn what are the most widely us"
  },
  {
    "path": "content/purgatorio/now-go-build/a-messy-real-world.md",
    "chars": 12315,
    "preview": "---\ntitle: A Messy Real World\nauthor: clone95\ndescription: This guide aims to show you the challenges and the messiness "
  },
  {
    "path": "content/purgatorio/now-go-build/best-practices.md",
    "chars": 8503,
    "preview": "---\ntitle: Best Practices\nauthor: clone95\ndescription: A detailed collection of the best practices of the Data Science p"
  },
  {
    "path": "content/purgatorio/now-go-build/transfer-learning.md",
    "chars": 8206,
    "preview": "---\ntitle: Transfer Learning\nauthor: clone95\ndescription: A detailed guide about what is Transfer Learning, how to use i"
  },
  {
    "path": "content/purgatorio/select-and-train-machine-learning-models/deep-learning-theory.md",
    "chars": 33676,
    "preview": "---\ntitle: Deep Learning Theory\nauthor: clone95\ndescription: Get started with Deep Learning, the branch of machine learn"
  },
  {
    "path": "content/purgatorio/select-and-train-machine-learning-models/evaluation-and-finetuning.md",
    "chars": 10676,
    "preview": "---\ntitle: Evaluation and Fine Tuning\nauthor: clone95\ndescription: Learn how to evaluate your models after the training "
  },
  {
    "path": "content/purgatorio/select-and-train-machine-learning-models/machine-learning-theory.md",
    "chars": 14015,
    "preview": "---\ntitle: Machine Learning Theory\nauthor: clone95\ndescription: Learn how to train and build powerful Machine Learning m"
  },
  {
    "path": "content/purgatorio/select-and-train-machine-learning-models/tools-and-libraries.md",
    "chars": 23978,
    "preview": "---\ntitle: Tools and Libraries\nauthor: clone95\ndescription: Learn to use existing libraries, frameworks and out-of-the-b"
  },
  {
    "path": "docs/contributing.md",
    "chars": 6388,
    "preview": "# Index\n - [You have immense power](#Contribute)\n - [Easiest way to contribute](#Easiest-way-to-contribute)\n - [Contribu"
  },
  {
    "path": "docs/contributors.md",
    "chars": 1271,
    "preview": "# Contributors\n\nHere's the list of the awesome people making Virgilio possible.\n\nCore team:\n\n- **[clone95](https://githu"
  },
  {
    "path": "docs/template.md",
    "chars": 1650,
    "preview": "This file contains a generic template for a guide about some subject X.\n\nPlease remember to:\n - be consistent throughout"
  },
  {
    "path": "google50cedcfbb5fc73b6.html",
    "chars": 53,
    "preview": "google-site-verification: google50cedcfbb5fc73b6.html"
  }
]

About this extraction

This page contains the full source code of the virgili0/Virgilio GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 123 files (2.5 MB), approximately 650.1k tokens, and a symbol index with 48 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
