[
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "content": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: '[BUG] Brief Description of the Issue'\nlabels: bug\nassignees: ''\n\n---\n\nFound a bug? Please fill out the sections below. 👍\n\n\n### Describe the bug\n\nA clear and concise description of what the bug is.\n\n### Steps to Reproduce\n\n1. (for ex.) went to...\n2. clicked on this point\n3. not working\n\n### Expected Behavior\nA brief description of what you expected to happen.\n\n### Actual Behavior:\nwhat actually happened.\n\n### Environment\n- OS: \n- Model Used (e.g., GPT-4v, Gemini Pro Vision):\n- Framework Version (optional):\n\n### Screenshots\nIf applicable, add screenshots to help explain your problem.\n\n### Additional context\nAdd any other context about the problem here."
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "content": "---\nname: Feature request\nabout: Suggest an idea for this project\ntitle: '[FEATURE] Short Description of the Feature'\nlabels: enhancement\nassignees: ''\n\n---\n\n### Is your feature request related to a problem? Please describe.\n\nA clear and concise description of what the problem is. Ex. I'm always frustrated when [...]\n\n### Describe the solution you'd like\nA clear and concise description of what you want to happen.\n\n### Describe alternatives you've considered\nA clear and concise description of any alternative solutions or features you've considered.\n\n### Additional context\nAdd any other context or screenshots about the feature request here."
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE.md",
    "content": "## What does this PR do?\n\n<!-- Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change. -->\n\nFixes # (issue)\n\n## Requirement/Documentation\n\n<!-- Please provide all documents that are important to understand the reason of that PR. -->\n\n- If there is a requirement document, please, share it here.\n\n## Type of change\n\n<!-- Please delete bullets that are not relevant. -->\n\n- [ ] Bug fix (non-breaking change which fixes an issue)\n- [ ] Chore (refactoring code, technical debt, workflow improvements)\n- [ ] New feature (non-breaking change which adds functionality)\n- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)\n- [ ] Tests (Unit/Integration/E2E or any other test)\n- [ ] This change requires a documentation update\n\n\n## Mandatory Tasks\n\n- [ ] Make sure you have self-reviewed the code. A decent size PR without self-review might be rejected. Make sure before submmiting this PR you run tests with evaluate.py\n"
  },
  {
    "path": ".github/workflows/upload-package.yml",
    "content": "name: Upload Python Package\n\non:\n  push:\n    tags:\n      - 'v*'\n\njobs:\n  deploy:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v3\n\n    - name: Set up Python\n      uses: actions/setup-python@v3\n      with:\n        python-version: '3.8'\n\n    - name: Install dependencies\n      run: |\n        python -m pip install --upgrade pip\n        pip install setuptools wheel twine\n\n    - name: Build and check package\n      run: |\n        python setup.py sdist bdist_wheel\n        twine check dist/*\n        \n    - name: Upload to PyPi\n      uses: pypa/gh-action-pypi-publish@v1.4.2\n      with:\n        user: __token__\n        password: ${{ secrets.PYPI_API_TOKEN }}\n\n"
  },
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# poetry\n#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.\n#   This is especially recommended for binary packages to ensure reproducibility, and is more\n#   commonly ignored for libraries.\n#   
https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control\n#poetry.lock\n\n# pdm\n#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.\n#pdm.lock\n#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it\n#   in version control.\n#   https://pdm.fming.dev/#use-with-ide\n.pdm.toml\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# PyCharm\n#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can\n#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore\n#  and can be added to the global gitignore or merged into this file.  For a more nuclear\n#  option (not recommended) you can uncomment the following to ignore the entire idea folder.\n#.idea/\n\n.DS_Store\n\n# Avoid sending testing screenshots up\n*.png\noperate/screenshots/\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "# Contributing\nWe appreciate your contributions!\n\n## Process\n1. Fork it\n2. Create your feature branch (`git checkout -b my-new-feature`)\n3. Commit your changes (`git commit -am 'Add some feature'`)\n4. Push to the branch (`git push origin my-new-feature`)\n5. Create new Pull Request\n\n## Modifying and Running Code\n1. Make changes in `operate/main.py`\n2. Run `pip install .` again\n3. Run `operate` to see your changes\n\n## Testing Changes\n**After making significant changes, it's important to verify that SOC can still successfully perform a set of common test cases.**\nIn the root directory of the project, run:\n```\npython3 evaluate.py\n```   \nThis will automatically prompt `operate` to perform several simple objectives.   \nUpon completion of each objective, GPT-4v will give an evaluation and determine if the objective was successfully reached.   \n\n`evaluate.py` will print out if each test case `[PASSED]` or `[FAILED]`. In addition, a justification will be given on why the pass/fail was given.   \n\nIt is recommended that a screenshot of the `evaluate.py` output is included in any PR which could impact the performance of SOC.\n\n## Contribution Ideas\n- **Improve performance by finding optimal screenshot grid**: A primary element of the framework is that it overlays a percentage grid on the screenshot which GPT-4v uses to estimate click locations. If someone is able to find the optimal grid and some evaluation metrics to confirm it is an improvement on the current method then we will merge that PR. \n- **Improve the `SUMMARY_PROMPT`**\n- **Improve Linux and Windows compatibility**: There are still some issues with Linux and Windows compatibility. PRs to fix the issues are encouraged. \n- **Adding New Multimodal Models**: Integration of new multimodal models is welcomed. 
If you have a specific model in mind that you believe would be a valuable addition, please feel free to integrate it and submit a PR.\n- **Iterate `--accurate` flag functionality**: Look at https://github.com/OthersideAI/self-operating-computer/pull/57 for previous iteration\n- **Enhanced Security**: A feature request to implement a _robust security feature_ that prompts users for _confirmation before executing potentially harmful actions_. This feature aims to _prevent unintended actions_ and _safeguard user data_ as mentioned here in this [OtherSide#25](https://github.com/OthersideAI/self-operating-computer/issues/25)\n\n\n## Guidelines\nThis will primarily be a [Software 2.0](https://karpathy.medium.com/software-2-0-a64152b37c35) project. For this reason: \n\n- Let's try to hold off refactors into separate files until `main.py` is more than 1000 lines\n\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2023 OthersideAI\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE."
  },
  {
    "path": "README.md",
    "content": "ome\n<h1 align=\"center\">Self-Operating Computer Framework</h1>\n\n<p align=\"center\">\n  <strong>A framework to enable multimodal models to operate a computer.</strong>\n</p>\n<p align=\"center\">\n  Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective. Released Nov 2023, the Self-Operating Computer Framework was one of the first examples of full computer-use. \n</p>\n\n<div align=\"center\">\n  <img src=\"https://github.com/OthersideAI/self-operating-computer/blob/main/readme/self-operating-computer.png\" width=\"750\"  style=\"margin: 10px;\"/>\n</div>\n\n<!--\n:rotating_light: **OUTAGE NOTIFICATION: gpt-4o**\n**This model is currently experiencing an outage so the self-operating computer may not work as expected.**\n-->\n\n\n## Key Features\n- **Compatibility**: Designed for various multimodal models.\n- **Integration**: Currently integrated with **GPT-4o, GPT-4.1, o1, Gemini Pro Vision, Claude 3, Qwen-VL and LLaVa.**\n- **Future Plans**: Support for additional models.\n\n## Demo\nhttps://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0\n\n\n## Run `Self-Operating Computer`\n\n1. **Install the project**\n```\npip install self-operating-computer\n```\n2. **Run the project**\n```\noperate\n```\n3. **Enter your OpenAI Key**: If you don't have one, you can obtain an OpenAI key [here](https://platform.openai.com/account/api-keys). If you need you change your key at a later point, run `vim .env` to open the `.env` and replace the old key. \n\n<div align=\"center\">\n  <img src=\"https://github.com/OthersideAI/self-operating-computer/blob/main/readme/key.png\" width=\"300\"  style=\"margin: 10px;\"/>\n</div>\n\n4. 
**Give Terminal app the required permissions**: As a last step, the Terminal app will ask for permission for \"Screen Recording\" and \"Accessibility\" in the \"Security & Privacy\" page of Mac's \"System Preferences\".\n\n<div align=\"center\">\n  <img src=\"https://github.com/OthersideAI/self-operating-computer/blob/main/readme/terminal-access-1.png\" width=\"300\"  style=\"margin: 10px;\"/>\n  <img src=\"https://github.com/OthersideAI/self-operating-computer/blob/main/readme/terminal-access-2.png\" width=\"300\"  style=\"margin: 10px;\"/>\n</div>\n\n## Using `operate` Modes\n\n#### OpenAI models\n\nThe default model for the project is gpt-4o, which you can use by simply typing `operate`. To try running OpenAI's new `o1` model, use the command below.\n\n```\noperate -m o1-with-ocr\n```\n\nTo experiment with OpenAI's latest `gpt-4.1` model, run:\n\n```\noperate -m gpt-4.1-with-ocr\n```\n\n\n### Multimodal Models  `-m`\nTry Google's `gemini-pro-vision` by following the instructions below. Start `operate` with the Gemini model:\n```\noperate -m gemini-pro-vision\n```\n\n**Enter your Google AI Studio API key when the terminal prompts you for it.** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need to [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.\n\n#### Try Claude `-m claude-3`\nUse Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the [Claude dashboard](https://console.anthropic.com/dashboard) to get an API key and run the command below to try it. \n\n```\noperate -m claude-3\n```\n\n#### Try qwen `-m qwen-vl`\nUse Qwen-vl with Vision to see how it stacks up to GPT-4-Vision at operating a computer. 
Navigate to the [Qwen dashboard](https://bailian.console.aliyun.com/) to get an API key and run the command below to try it. \n\n```\noperate -m qwen-vl\n```\n\n#### Try LLaVa Hosted Through Ollama `-m llava`\nIf you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!   \n*Note: Ollama currently only supports MacOS and Linux. Windows now in Preview*   \n\nFirst, install Ollama on your machine from https://ollama.ai/download.   \n\nOnce Ollama is installed, pull the LLaVA model:\n```\nollama pull llava\n```\nThis will download the model on your machine which takes approximately 5 GB of storage.   \n\nWhen Ollama has finished pulling LLaVA, start the server:\n```\nollama serve\n```\n\nThat's it! Now start `operate` and select the LLaVA model:\n```\noperate -m llava\n```   \n**Important:** Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.\n\nLearn more about Ollama at its [GitHub Repository](https://www.github.com/ollama/ollama)\n\n### Voice Mode `--voice`\nThe framework supports voice inputs for the objective. Try voice by following the instructions below. \n**Clone the repo** to a directory on your computer:\n```\ngit clone https://github.com/OthersideAI/self-operating-computer.git\n```\n**Cd into directory**:\n```\ncd self-operating-computer\n```\nInstall the additional `requirements-audio.txt`\n```\npip install -r requirements-audio.txt\n```\n**Install device requirements**\nFor mac users:\n```\nbrew install portaudio\n```\nFor Linux users:\n```\nsudo apt install portaudio19-dev python3-pyaudio\n```\nRun with voice mode\n```\noperate --voice\n```\n\n### Optical Character Recognition Mode `-m gpt-4-with-ocr`\nThe Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. 
GPT-4 can decide to `click` elements by text, and the code then looks up that element's coordinates in the hash map. \n\nBased on recent tests, OCR performs better than `som` and vanilla GPT-4, so we made it the default for the project. To use the OCR mode, simply run `operate`; `operate -m gpt-4-with-ocr` will also work. \n\n### Set-of-Mark Prompting `-m gpt-4-with-som`\nThe Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.\n\nLearn more about SoM Prompting in the detailed arXiv paper: [here](https://arxiv.org/abs/2310.11441).\n\nFor this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `model/weights/`. Users are encouraged to swap in their `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).\n\nStart `operate` with the SoM model\n\n```\noperate -m gpt-4-with-som\n```\n\n\n\n## Contributions are Welcome!\n\nIf you want to contribute yourself, see [CONTRIBUTING.md](https://github.com/OthersideAI/self-operating-computer/blob/main/CONTRIBUTING.md).\n\n## Feedback\n\nFor any input on improving this project, feel free to reach out to [Josh](https://twitter.com/josh_bickett) on Twitter. \n\n## Join Our Discord Community\n\nFor real-time discussions and community support, join our Discord server. 
\n- If you're already a member, join the discussion in [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).\n- If you're new, first [join our Discord Server](https://discord.gg/YqaKtyBEzM) and then navigate to the [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).\n\n## Follow HyperWriteAI for More Updates\n\nStay updated with the latest developments:\n- Follow HyperWriteAI on [Twitter](https://twitter.com/HyperWriteAI).\n- Follow HyperWriteAI on [LinkedIn](https://www.linkedin.com/company/othersideai/).\n\n## Compatibility\n- This project is compatible with Mac OS, Windows, and Linux (with X server installed).\n\n## OpenAI Rate Limiting Note\nThe ```gpt-4o``` model is required. To unlock access to this model, your account needs to spend at least \\$5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum \\$5.   \nLearn more **[here](https://platform.openai.com/docs/guides/rate-limits?context=tier-one)**\n"
  },
  {
    "path": "evaluate.py",
    "content": "import sys\nimport os\nimport subprocess\nimport platform\nimport base64\nimport json\nimport openai\nimport argparse\n\nfrom dotenv import load_dotenv\n\n# \"Objective for `operate`\" : \"Guideline for passing this test case given to GPT-4v\"\nTEST_CASES = {\n    \"Go to Github.com\": \"A Github page is visible.\",\n    \"Go to Youtube.com and play a video\": \"The YouTube video player is visible.\",\n}\n\nEVALUATION_PROMPT = \"\"\"\nYour job is to look at the given screenshot and determine if the following guideline is met in the image.\nYou must respond in the following format ONLY. Do not add anything else:\n{{ \"guideline_met\": (true|false), \"reason\": \"Explanation for why guideline was or wasn't met\" }}\nguideline_met must be set to a JSON boolean. True if the image meets the given guideline.\nreason must be a string containing a justification for your decision.\n\nGuideline: {guideline}\n\"\"\"\n\nSCREENSHOT_PATH = os.path.join(\"screenshots\", \"screenshot.png\")\n\n\n# Check if on a windows terminal that supports ANSI escape codes\ndef supports_ansi():\n    \"\"\"\n    Check if the terminal supports ANSI escape codes\n    \"\"\"\n    plat = platform.system()\n    supported_platform = plat != \"Windows\" or \"ANSICON\" in os.environ\n    is_a_tty = hasattr(sys.stdout, \"isatty\") and sys.stdout.isatty()\n    return supported_platform and is_a_tty\n\n\nif supports_ansi():\n    # Standard green text\n    ANSI_GREEN = \"\\033[32m\"\n    # Bright/bold green text\n    ANSI_BRIGHT_GREEN = \"\\033[92m\"\n    # Reset to default text color\n    ANSI_RESET = \"\\033[0m\"\n    # ANSI escape code for blue text\n    ANSI_BLUE = \"\\033[94m\"  # This is for bright blue\n\n    # Standard yellow text\n    ANSI_YELLOW = \"\\033[33m\"\n\n    ANSI_RED = \"\\033[31m\"\n\n    # Bright magenta text\n    ANSI_BRIGHT_MAGENTA = \"\\033[95m\"\nelse:\n    ANSI_GREEN = \"\"\n    ANSI_BRIGHT_GREEN = \"\"\n    ANSI_RESET = \"\"\n    ANSI_BLUE = \"\"\n    ANSI_YELLOW 
= \"\"\n    ANSI_RED = \"\"\n    ANSI_BRIGHT_MAGENTA = \"\"\n\n\ndef format_evaluation_prompt(guideline):\n    prompt = EVALUATION_PROMPT.format(guideline=guideline)\n    return prompt\n\n\ndef parse_eval_content(content):\n    try:\n        res = json.loads(content)\n\n        print(res[\"reason\"])\n\n        return res[\"guideline_met\"]\n    except (json.JSONDecodeError, KeyError, TypeError):\n        # Malformed JSON or a missing key means the evaluation is unusable\n        print(\n            \"The model gave a bad evaluation response and it couldn't be parsed. Exiting...\"\n        )\n        sys.exit(1)\n\n\ndef evaluate_final_screenshot(guideline):\n    \"\"\"Load the final screenshot and return True or False if it meets the given guideline.\"\"\"\n    with open(SCREENSHOT_PATH, \"rb\") as img_file:\n        img_base64 = base64.b64encode(img_file.read()).decode(\"utf-8\")\n\n        eval_message = [\n            {\n                \"role\": \"user\",\n                \"content\": [\n                    {\"type\": \"text\", \"text\": format_evaluation_prompt(guideline)},\n                    {\n                        \"type\": \"image_url\",\n                        \"image_url\": {\"url\": f\"data:image/jpeg;base64,{img_base64}\"},\n                    },\n                ],\n            }\n        ]\n\n        response = openai.chat.completions.create(\n            model=\"gpt-4o\",\n            messages=eval_message,\n            presence_penalty=1,\n            frequency_penalty=1,\n            temperature=0.7,\n        )\n\n        eval_content = response.choices[0].message.content\n\n        return parse_eval_content(eval_content)\n\n\ndef run_test_case(objective, guideline, model):\n    \"\"\"Returns True if the result of the test with the given prompt meets the given guideline for the given model.\"\"\"\n    # Run `operate` with the model to evaluate and the test case prompt\n    subprocess.run(\n        [\"operate\", \"-m\", model, \"--prompt\", f'\"{objective}\"'],\n        stdout=subprocess.DEVNULL,\n    )\n\n    try:\n        result = 
evaluate_final_screenshot(guideline)\n    except OSError:\n        print(\"[Error] Couldn't open the screenshot for evaluation\")\n        return False\n\n    return result\n\n\ndef get_test_model():\n    parser = argparse.ArgumentParser(\n        description=\"Run the self-operating-computer with a specified model.\"\n    )\n\n    parser.add_argument(\n        \"-m\",\n        \"--model\",\n        help=\"Specify the model to evaluate.\",\n        required=False,\n        default=\"gpt-4-with-ocr\",\n    )\n\n    return parser.parse_args().model\n\n\ndef main():\n    load_dotenv()\n    openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n\n    model = get_test_model()\n\n    print(f\"{ANSI_BLUE}[EVALUATING MODEL `{model}`]{ANSI_RESET}\")\n    print(f\"{ANSI_BRIGHT_MAGENTA}[STARTING EVALUATION]{ANSI_RESET}\")\n\n    passed = 0\n    failed = 0\n    for objective, guideline in TEST_CASES.items():\n        print(f\"{ANSI_BLUE}[EVALUATING]{ANSI_RESET} '{objective}'\")\n\n        result = run_test_case(objective, guideline, model)\n        if result:\n            print(f\"{ANSI_GREEN}[PASSED]{ANSI_RESET} '{objective}'\")\n            passed += 1\n        else:\n            print(f\"{ANSI_RED}[FAILED]{ANSI_RESET} '{objective}'\")\n            failed += 1\n\n    print(\n        f\"{ANSI_BRIGHT_MAGENTA}[EVALUATION COMPLETE]{ANSI_RESET} {passed} test{'' if passed == 1 else 's'} passed, {failed} test{'' if failed == 1 else 's'} failed\"\n    )\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "operate/__init__.py",
    "content": ""
  },
  {
    "path": "operate/config.py",
    "content": "import os\nimport sys\n\nimport google.generativeai as genai\nfrom dotenv import load_dotenv\nfrom ollama import Client\nfrom openai import OpenAI\nimport anthropic\nfrom prompt_toolkit.shortcuts import input_dialog\n\n\nclass Config:\n    \"\"\"\n    Configuration class for managing settings.\n\n    Attributes:\n        verbose (bool): Flag indicating whether verbose mode is enabled.\n        openai_api_key (str): API key for OpenAI.\n        google_api_key (str): API key for Google.\n        ollama_host (str): url to ollama running remotely.\n    \"\"\"\n\n    _instance = None\n\n    def __new__(cls):\n        if cls._instance is None:\n            cls._instance = super(Config, cls).__new__(cls)\n            # Put any initialization here\n        return cls._instance\n\n    def __init__(self):\n        load_dotenv()\n        self.verbose = False\n        self.openai_api_key = (\n            None  # instance variables are backups in case saving to a `.env` fails\n        )\n        self.google_api_key = (\n            None  # instance variables are backups in case saving to a `.env` fails\n        )\n        self.ollama_host = (\n            None  # instance variables are backups in case savint to a `.env` fails\n        )\n        self.anthropic_api_key = (\n            None  # instance variables are backups in case saving to a `.env` fails\n        )\n        self.qwen_api_key = (\n            None  # instance variables are backups in case saving to a `.env` fails\n        )\n\n    def initialize_openai(self):\n        if self.verbose:\n            print(\"[Config][initialize_openai]\")\n\n        if self.openai_api_key:\n            if self.verbose:\n                print(\"[Config][initialize_openai] using cached openai_api_key\")\n            api_key = self.openai_api_key\n        else:\n            if self.verbose:\n                print(\n                    \"[Config][initialize_openai] no cached openai_api_key, try to get from env.\"\n     
           )\n            api_key = os.getenv(\"OPENAI_API_KEY\")\n\n        client = OpenAI(\n            api_key=api_key,\n        )\n        client.api_key = api_key\n        client.base_url = os.getenv(\"OPENAI_API_BASE_URL\", client.base_url)\n        return client\n\n    def initialize_qwen(self):\n        if self.verbose:\n            print(\"[Config][initialize_qwen]\")\n\n        if self.qwen_api_key:\n            if self.verbose:\n                print(\"[Config][initialize_qwen] using cached qwen_api_key\")\n            api_key = self.qwen_api_key\n        else:\n            if self.verbose:\n                print(\n                    \"[Config][initialize_qwen] no cached qwen_api_key, try to get from env.\"\n                )\n            api_key = os.getenv(\"QWEN_API_KEY\")\n\n        client = OpenAI(\n            api_key=api_key,\n            base_url=\"https://dashscope.aliyuncs.com/compatible-mode/v1\",\n        )\n        client.api_key = api_key\n        client.base_url = \"https://dashscope.aliyuncs.com/compatible-mode/v1\"\n        return client\n\n    def initialize_google(self):\n        if self.google_api_key:\n            if self.verbose:\n                print(\"[Config][initialize_google] using cached google_api_key\")\n            api_key = self.google_api_key\n        else:\n            if self.verbose:\n                print(\n                    \"[Config][initialize_google] no cached google_api_key, try to get from env.\"\n                )\n            api_key = os.getenv(\"GOOGLE_API_KEY\")\n        genai.configure(api_key=api_key, transport=\"rest\")\n        model = genai.GenerativeModel(\"gemini-pro-vision\")\n\n        return model\n\n    def initialize_ollama(self):\n        if self.ollama_host:\n            if self.verbose:\n                print(\"[Config][initialize_ollama] using cached ollama host\")\n        else:\n            if self.verbose:\n                print(\n                    \"[Config][initialize_ollama] no 
cached ollama host. Assuming ollama running locally.\"\n                )\n            self.ollama_host = os.getenv(\"OLLAMA_HOST\", None)\n        model = Client(host=self.ollama_host)\n        return model\n\n    def initialize_anthropic(self):\n        if self.anthropic_api_key:\n            api_key = self.anthropic_api_key\n        else:\n            api_key = os.getenv(\"ANTHROPIC_API_KEY\")\n        return anthropic.Anthropic(api_key=api_key)\n\n    def validation(self, model, voice_mode):\n        \"\"\"\n        Validate the input parameters for the dialog operation.\n        \"\"\"\n        self.require_api_key(\n            \"OPENAI_API_KEY\",\n            \"OpenAI API key\",\n            model == \"gpt-4\"\n            or voice_mode\n            or model == \"gpt-4-with-som\"\n            or model == \"gpt-4-with-ocr\"\n            or model == \"gpt-4.1-with-ocr\"\n            or model == \"o1-with-ocr\",\n        )\n        self.require_api_key(\n            \"GOOGLE_API_KEY\", \"Google API key\", model == \"gemini-pro-vision\"\n        )\n        self.require_api_key(\n            \"ANTHROPIC_API_KEY\", \"Anthropic API key\", model == \"claude-3\"\n        )\n        self.require_api_key(\"QWEN_API_KEY\", \"Qwen API key\", model == \"qwen-vl\")\n\n    def require_api_key(self, key_name, key_description, is_required):\n        key_exists = bool(os.environ.get(key_name))\n        if self.verbose:\n            print(\"[Config] require_api_key\")\n            print(\"[Config] key_name\", key_name)\n            print(\"[Config] key_description\", key_description)\n            print(\"[Config] key_exists\", key_exists)\n        if is_required and not key_exists:\n            self.prompt_and_save_api_key(key_name, key_description)\n\n    def prompt_and_save_api_key(self, key_name, key_description):\n        key_value = input_dialog(\n            title=\"API Key Required\", text=f\"Please enter your {key_description}:\"\n        ).run()\n\n        if key_value 
is None:  # User pressed cancel or closed the dialog\n            sys.exit(\"Operation cancelled by user.\")\n\n        if key_value:\n            if key_name == \"OPENAI_API_KEY\":\n                self.openai_api_key = key_value\n            elif key_name == \"GOOGLE_API_KEY\":\n                self.google_api_key = key_value\n            elif key_name == \"ANTHROPIC_API_KEY\":\n                self.anthropic_api_key = key_value\n            elif key_name == \"QWEN_API_KEY\":\n                self.qwen_api_key = key_value\n            self.save_api_key_to_env(key_name, key_value)\n            load_dotenv()  # Reload environment variables\n            # Update the instance attribute with the new key\n\n    @staticmethod\n    def save_api_key_to_env(key_name, key_value):\n        with open(\".env\", \"a\") as file:\n            file.write(f\"\\n{key_name}='{key_value}'\")\n"
  },
  {
    "path": "operate/exceptions.py",
    "content": "class ModelNotRecognizedException(Exception):\n    \"\"\"Exception raised for unrecognized models.\n\n    Attributes:\n        model -- the unrecognized model\n        message -- explanation of the error\n    \"\"\"\n\n    def __init__(self, model, message=\"Model not recognized\"):\n        self.model = model\n        self.message = message\n        super().__init__(self.message)\n\n    def __str__(self):\n        return f\"{self.message} : {self.model} \""
  },
  {
    "path": "operate/main.py",
    "content": "\"\"\"\nSelf-Operating Computer\n\"\"\"\nimport argparse\nfrom operate.utils.style import ANSI_BRIGHT_MAGENTA\nfrom operate.operate import main\n\n\ndef main_entry():\n    parser = argparse.ArgumentParser(\n        description=\"Run the self-operating-computer with a specified model.\"\n    )\n    parser.add_argument(\n        \"-m\",\n        \"--model\",\n        help=\"Specify the model to use\",\n        required=False,\n        default=\"gpt-4-with-ocr\",\n    )\n\n    # Add a voice flag\n    parser.add_argument(\n        \"--voice\",\n        help=\"Use voice input mode\",\n        action=\"store_true\",\n    )\n    \n    # Add a flag for verbose mode\n    parser.add_argument(\n        \"--verbose\",\n        help=\"Run operate in verbose mode\",\n        action=\"store_true\",\n    )\n    \n    # Allow for direct input of prompt\n    parser.add_argument(\n        \"--prompt\",\n        help=\"Directly input the objective prompt\",\n        type=str,\n        required=False,\n    )\n\n    try:\n        args = parser.parse_args()\n        main(\n            args.model,\n            terminal_prompt=args.prompt,\n            voice_mode=args.voice,\n            verbose_mode=args.verbose\n        )\n    except KeyboardInterrupt:\n        print(f\"\\n{ANSI_BRIGHT_MAGENTA}Exiting...\")\n\n\nif __name__ == \"__main__\":\n    main_entry()\n"
  },
  {
    "path": "operate/models/__init__.py",
    "content": ""
  },
  {
    "path": "operate/models/apis.py",
    "content": "import base64\nimport io\nimport json\nimport os\nimport time\nimport traceback\n\nimport easyocr\nimport ollama\nimport pkg_resources\nfrom PIL import Image\nfrom ultralytics import YOLO\n\nfrom operate.config import Config\nfrom operate.exceptions import ModelNotRecognizedException\nfrom operate.models.prompts import (\n    get_system_prompt,\n    get_user_first_message_prompt,\n    get_user_prompt,\n)\nfrom operate.utils.label import (\n    add_labels,\n    get_click_position_in_percent,\n    get_label_coordinates,\n)\nfrom operate.utils.ocr import get_text_coordinates, get_text_element\nfrom operate.utils.screenshot import capture_screen_with_cursor, compress_screenshot\nfrom operate.utils.style import ANSI_BRIGHT_MAGENTA, ANSI_GREEN, ANSI_RED, ANSI_RESET\n\n# Load configuration\nconfig = Config()\n\n\nasync def get_next_action(model, messages, objective, session_id):\n    if config.verbose:\n        print(\"[Self-Operating Computer][get_next_action]\")\n        print(\"[Self-Operating Computer][get_next_action] model\", model)\n    if model == \"gpt-4\":\n        return call_gpt_4o(messages), None\n    if model == \"qwen-vl\":\n        operation = await call_qwen_vl_with_ocr(messages, objective, model)\n        return operation, None\n    if model == \"gpt-4-with-som\":\n        operation = await call_gpt_4o_labeled(messages, objective, model)\n        return operation, None\n    if model == \"gpt-4-with-ocr\":\n        operation = await call_gpt_4o_with_ocr(messages, objective, model)\n        return operation, None\n    if model == \"gpt-4.1-with-ocr\":\n        operation = await call_gpt_4_1_with_ocr(messages, objective, model)\n        return operation, None\n    if model == \"o1-with-ocr\":\n        operation = await call_o1_with_ocr(messages, objective, model)\n        return operation, None\n    if model == \"agent-1\":\n        return \"coming soon\"\n    if model == \"gemini-pro-vision\":\n        return 
call_gemini_pro_vision(messages, objective), None\n    if model == \"llava\":\n        operation = call_ollama_llava(messages)\n        return operation, None\n    if model == \"claude-3\":\n        operation = await call_claude_3_with_ocr(messages, objective, model)\n        return operation, None\n    raise ModelNotRecognizedException(model)\n\n\ndef call_gpt_4o(messages):\n    if config.verbose:\n        print(\"[call_gpt_4_v]\")\n    time.sleep(1)\n    client = config.initialize_openai()\n    try:\n        screenshots_dir = \"screenshots\"\n        if not os.path.exists(screenshots_dir):\n            os.makedirs(screenshots_dir)\n\n        screenshot_filename = os.path.join(screenshots_dir, \"screenshot.png\")\n        # Call the function to capture the screen with the cursor\n        capture_screen_with_cursor(screenshot_filename)\n\n        with open(screenshot_filename, \"rb\") as img_file:\n            img_base64 = base64.b64encode(img_file.read()).decode(\"utf-8\")\n\n        if len(messages) == 1:\n            user_prompt = get_user_first_message_prompt()\n        else:\n            user_prompt = get_user_prompt()\n\n        if config.verbose:\n            print(\n                \"[call_gpt_4_v] user_prompt\",\n                user_prompt,\n            )\n\n        vision_message = {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": user_prompt},\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": {\"url\": f\"data:image/jpeg;base64,{img_base64}\"},\n                },\n            ],\n        }\n        messages.append(vision_message)\n\n        response = client.chat.completions.create(\n            model=\"gpt-4o\",\n            messages=messages,\n            presence_penalty=1,\n            frequency_penalty=1,\n        )\n\n        content = response.choices[0].message.content\n\n        content = clean_json(content)\n\n        
assistant_message = {\"role\": \"assistant\", \"content\": content}\n        if config.verbose:\n            print(\n                \"[call_gpt_4_v] content\",\n                content,\n            )\n        content = json.loads(content)\n\n        messages.append(assistant_message)\n\n        return content\n\n    except Exception as e:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[Operate] That did not work. Trying again {ANSI_RESET}\",\n            e,\n        )\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] AI response was {ANSI_RESET}\",\n            content,\n        )\n        if config.verbose:\n            traceback.print_exc()\n        return call_gpt_4o(messages)\n\n\nasync def call_qwen_vl_with_ocr(messages, objective, model):\n    if config.verbose:\n        print(\"[call_qwen_vl_with_ocr]\")\n\n    # Construct the path to the file within the package\n    try:\n        time.sleep(1)\n        client = config.initialize_qwen()\n\n        confirm_system_prompt(messages, objective, model)\n        screenshots_dir = \"screenshots\"\n        if not os.path.exists(screenshots_dir):\n            os.makedirs(screenshots_dir)\n\n        # Call the function to capture the screen with the cursor\n        raw_screenshot_filename = os.path.join(screenshots_dir, \"raw_screenshot.png\")\n        capture_screen_with_cursor(raw_screenshot_filename)\n\n        # Compress screenshot image to make size be smaller\n        screenshot_filename = os.path.join(screenshots_dir, \"screenshot.jpeg\")\n        compress_screenshot(raw_screenshot_filename, screenshot_filename)\n\n        with open(screenshot_filename, \"rb\") as img_file:\n            img_base64 = base64.b64encode(img_file.read()).decode(\"utf-8\")\n\n        if len(messages) == 1:\n            user_prompt = get_user_first_message_prompt()\n        else:\n            user_prompt = get_user_prompt()\n\n        vision_message = {\n 
           \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\",\n                 \"text\": f\"{user_prompt}**REMEMBER** Only output json format, do not append any other text.\"},\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": {\"url\": f\"data:image/jpeg;base64,{img_base64}\"},\n                },\n            ],\n        }\n        messages.append(vision_message)\n\n        response = client.chat.completions.create(\n            model=\"qwen2.5-vl-72b-instruct\",\n            messages=messages,\n        )\n\n        content = response.choices[0].message.content\n\n        content = clean_json(content)\n\n        # used later for the messages\n        content_str = content\n\n        content = json.loads(content)\n\n        processed_content = []\n\n        for operation in content:\n            if operation.get(\"operation\") == \"click\":\n                text_to_click = operation.get(\"text\")\n                if config.verbose:\n                    print(\n                        \"[call_qwen_vl_with_ocr][click] text_to_click\",\n                        text_to_click,\n                    )\n                # Initialize EasyOCR Reader\n                reader = easyocr.Reader([\"en\"])\n\n                # Read the screenshot\n                result = reader.readtext(screenshot_filename)\n\n                text_element_index = get_text_element(\n                    result, text_to_click, screenshot_filename\n                )\n                coordinates = get_text_coordinates(\n                    result, text_element_index, screenshot_filename\n                )\n\n                # add `coordinates`` to `content`\n                operation[\"x\"] = coordinates[\"x\"]\n                operation[\"y\"] = coordinates[\"y\"]\n\n                if config.verbose:\n                    print(\n                        \"[call_qwen_vl_with_ocr][click] text_element_index\",\n        
                text_element_index,\n                    )\n                    print(\n                        \"[call_qwen_vl_with_ocr][click] coordinates\",\n                        coordinates,\n                    )\n                    print(\n                        \"[call_qwen_vl_with_ocr][click] final operation\",\n                        operation,\n                    )\n                processed_content.append(operation)\n\n            else:\n                processed_content.append(operation)\n\n        # wait to append the assistant message so that if the `processed_content` step fails we don't append a message and mess up message history\n        assistant_message = {\"role\": \"assistant\", \"content\": content_str}\n        messages.append(assistant_message)\n\n        return processed_content\n\n    except Exception as e:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. Trying another method {ANSI_RESET}\"\n        )\n        if config.verbose:\n            print(\"[Self-Operating Computer][Operate] error\", e)\n            traceback.print_exc()\n        return gpt_4_fallback(messages, objective, model)\n\ndef call_gemini_pro_vision(messages, objective):\n    \"\"\"\n    Get the next action for Self-Operating Computer using Gemini Pro Vision\n    \"\"\"\n    if config.verbose:\n        print(\n            \"[Self Operating Computer][call_gemini_pro_vision]\",\n        )\n    # sleep for a second\n    time.sleep(1)\n    try:\n        screenshots_dir = \"screenshots\"\n        if not os.path.exists(screenshots_dir):\n            os.makedirs(screenshots_dir)\n\n        screenshot_filename = os.path.join(screenshots_dir, \"screenshot.png\")\n        # Call the function to capture the screen with the cursor\n        capture_screen_with_cursor(screenshot_filename)\n        # sleep for a second\n        time.sleep(1)\n        prompt = get_system_prompt(\"gemini-pro-vision\", 
objective)\n\n        model = config.initialize_google()\n        if config.verbose:\n            print(\"[call_gemini_pro_vision] model\", model)\n\n        response = model.generate_content([prompt, Image.open(screenshot_filename)])\n\n        content = response.text[1:]\n        if config.verbose:\n            print(\"[call_gemini_pro_vision] response\", response)\n            print(\"[call_gemini_pro_vision] content\", content)\n\n        content = json.loads(content)\n        if config.verbose:\n            print(\n                \"[get_next_action][call_gemini_pro_vision] content\",\n                content,\n            )\n\n        return content\n\n    except Exception as e:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[Operate] That did not work. Trying another method {ANSI_RESET}\"\n        )\n        if config.verbose:\n            print(\"[Self-Operating Computer][Operate] error\", e)\n            traceback.print_exc()\n        return call_gpt_4o(messages)\n\n\nasync def call_gpt_4o_with_ocr(messages, objective, model):\n    if config.verbose:\n        print(\"[call_gpt_4o_with_ocr]\")\n\n    # Construct the path to the file within the package\n    try:\n        time.sleep(1)\n        client = config.initialize_openai()\n\n        confirm_system_prompt(messages, objective, model)\n        screenshots_dir = \"screenshots\"\n        if not os.path.exists(screenshots_dir):\n            os.makedirs(screenshots_dir)\n\n        screenshot_filename = os.path.join(screenshots_dir, \"screenshot.png\")\n        # Call the function to capture the screen with the cursor\n        capture_screen_with_cursor(screenshot_filename)\n\n        with open(screenshot_filename, \"rb\") as img_file:\n            img_base64 = base64.b64encode(img_file.read()).decode(\"utf-8\")\n\n        if len(messages) == 1:\n            user_prompt = get_user_first_message_prompt()\n        else:\n            user_prompt = get_user_prompt()\n\n  
      vision_message = {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": user_prompt},\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": {\"url\": f\"data:image/jpeg;base64,{img_base64}\"},\n                },\n            ],\n        }\n        messages.append(vision_message)\n\n        response = client.chat.completions.create(\n            model=\"gpt-4o\",\n            messages=messages,\n        )\n\n        content = response.choices[0].message.content\n\n        content = clean_json(content)\n\n        # used later for the messages\n        content_str = content\n\n        content = json.loads(content)\n\n        processed_content = []\n\n        for operation in content:\n            if operation.get(\"operation\") == \"click\":\n                text_to_click = operation.get(\"text\")\n                if config.verbose:\n                    print(\n                        \"[call_gpt_4o_with_ocr][click] text_to_click\",\n                        text_to_click,\n                    )\n                # Initialize EasyOCR Reader\n                reader = easyocr.Reader([\"en\"])\n\n                # Read the screenshot\n                result = reader.readtext(screenshot_filename)\n\n                text_element_index = get_text_element(\n                    result, text_to_click, screenshot_filename\n                )\n                coordinates = get_text_coordinates(\n                    result, text_element_index, screenshot_filename\n                )\n\n                # add `coordinates`` to `content`\n                operation[\"x\"] = coordinates[\"x\"]\n                operation[\"y\"] = coordinates[\"y\"]\n\n                if config.verbose:\n                    print(\n                        \"[call_gpt_4o_with_ocr][click] text_element_index\",\n                        text_element_index,\n                    )\n                    
print(\n                        \"[call_gpt_4o_with_ocr][click] coordinates\",\n                        coordinates,\n                    )\n                    print(\n                        \"[call_gpt_4o_with_ocr][click] final operation\",\n                        operation,\n                    )\n                processed_content.append(operation)\n\n            else:\n                processed_content.append(operation)\n\n        # wait to append the assistant message so that if the `processed_content` step fails we don't append a message and mess up message history\n        assistant_message = {\"role\": \"assistant\", \"content\": content_str}\n        messages.append(assistant_message)\n\n        return processed_content\n\n    except Exception as e:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. Trying another method {ANSI_RESET}\"\n        )\n        if config.verbose:\n            print(\"[Self-Operating Computer][Operate] error\", e)\n            traceback.print_exc()\n        return gpt_4_fallback(messages, objective, model)\n\n\nasync def call_gpt_4_1_with_ocr(messages, objective, model):\n    if config.verbose:\n        print(\"[call_gpt_4_1_with_ocr]\")\n\n    try:\n        time.sleep(1)\n        client = config.initialize_openai()\n\n        confirm_system_prompt(messages, objective, model)\n        screenshots_dir = \"screenshots\"\n        if not os.path.exists(screenshots_dir):\n            os.makedirs(screenshots_dir)\n\n        screenshot_filename = os.path.join(screenshots_dir, \"screenshot.png\")\n        capture_screen_with_cursor(screenshot_filename)\n\n        with open(screenshot_filename, \"rb\") as img_file:\n            img_base64 = base64.b64encode(img_file.read()).decode(\"utf-8\")\n\n        if len(messages) == 1:\n            user_prompt = get_user_first_message_prompt()\n        else:\n            user_prompt = get_user_prompt()\n\n        vision_message = 
{\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": user_prompt},\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": {\"url\": f\"data:image/jpeg;base64,{img_base64}\"},\n                },\n            ],\n        }\n        messages.append(vision_message)\n\n        response = client.chat.completions.create(\n            model=\"gpt-4.1\",\n            messages=messages,\n        )\n\n        content = response.choices[0].message.content\n\n        content = clean_json(content)\n\n        content_str = content\n\n        content = json.loads(content)\n\n        processed_content = []\n\n        for operation in content:\n            if operation.get(\"operation\") == \"click\":\n                text_to_click = operation.get(\"text\")\n                if config.verbose:\n                    print(\n                        \"[call_gpt_4_1_with_ocr][click] text_to_click\",\n                        text_to_click,\n                    )\n                reader = easyocr.Reader([\"en\"])\n\n                result = reader.readtext(screenshot_filename)\n\n                text_element_index = get_text_element(\n                    result, text_to_click, screenshot_filename\n                )\n                coordinates = get_text_coordinates(\n                    result, text_element_index, screenshot_filename\n                )\n\n                operation[\"x\"] = coordinates[\"x\"]\n                operation[\"y\"] = coordinates[\"y\"]\n\n                if config.verbose:\n                    print(\n                        \"[call_gpt_4_1_with_ocr][click] text_element_index\",\n                        text_element_index,\n                    )\n                    print(\n                        \"[call_gpt_4_1_with_ocr][click] coordinates\",\n                        coordinates,\n                    )\n                    print(\n                        
\"[call_gpt_4_1_with_ocr][click] final operation\",\n                        operation,\n                    )\n                processed_content.append(operation)\n\n            else:\n                processed_content.append(operation)\n\n        assistant_message = {\"role\": \"assistant\", \"content\": content_str}\n        messages.append(assistant_message)\n\n        return processed_content\n\n    except Exception as e:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. Trying another method {ANSI_RESET}\"\n        )\n        if config.verbose:\n            print(\"[Self-Operating Computer][Operate] error\", e)\n            traceback.print_exc()\n        return gpt_4_fallback(messages, objective, model)\n\n\nasync def call_o1_with_ocr(messages, objective, model):\n    if config.verbose:\n        print(\"[call_o1_with_ocr]\")\n\n    # Construct the path to the file within the package\n    try:\n        time.sleep(1)\n        client = config.initialize_openai()\n\n        confirm_system_prompt(messages, objective, model)\n        screenshots_dir = \"screenshots\"\n        if not os.path.exists(screenshots_dir):\n            os.makedirs(screenshots_dir)\n\n        screenshot_filename = os.path.join(screenshots_dir, \"screenshot.png\")\n        # Call the function to capture the screen with the cursor\n        capture_screen_with_cursor(screenshot_filename)\n\n        with open(screenshot_filename, \"rb\") as img_file:\n            img_base64 = base64.b64encode(img_file.read()).decode(\"utf-8\")\n\n        if len(messages) == 1:\n            user_prompt = get_user_first_message_prompt()\n        else:\n            user_prompt = get_user_prompt()\n\n        vision_message = {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": user_prompt},\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": 
{\"url\": f\"data:image/jpeg;base64,{img_base64}\"},\n                },\n            ],\n        }\n        messages.append(vision_message)\n\n        response = client.chat.completions.create(\n            model=\"o1\",\n            messages=messages,\n        )\n\n        content = response.choices[0].message.content\n\n        content = clean_json(content)\n\n        # used later for the messages\n        content_str = content\n\n        content = json.loads(content)\n\n        processed_content = []\n\n        for operation in content:\n            if operation.get(\"operation\") == \"click\":\n                text_to_click = operation.get(\"text\")\n                if config.verbose:\n                    print(\n                        \"[call_o1_with_ocr][click] text_to_click\",\n                        text_to_click,\n                    )\n                # Initialize EasyOCR Reader\n                reader = easyocr.Reader([\"en\"])\n\n                # Read the screenshot\n                result = reader.readtext(screenshot_filename)\n\n                text_element_index = get_text_element(\n                    result, text_to_click, screenshot_filename\n                )\n                coordinates = get_text_coordinates(\n                    result, text_element_index, screenshot_filename\n                )\n\n                # add `coordinates`` to `content`\n                operation[\"x\"] = coordinates[\"x\"]\n                operation[\"y\"] = coordinates[\"y\"]\n\n                if config.verbose:\n                    print(\n                        \"[call_o1_with_ocr][click] text_element_index\",\n                        text_element_index,\n                    )\n                    print(\n                        \"[call_o1_with_ocr][click] coordinates\",\n                        coordinates,\n                    )\n                    print(\n                        \"[call_o1_with_ocr][click] final operation\",\n                        
operation,\n                    )\n                processed_content.append(operation)\n\n            else:\n                processed_content.append(operation)\n\n        # wait to append the assistant message so that if the `processed_content` step fails we don't append a message and mess up message history\n        assistant_message = {\"role\": \"assistant\", \"content\": content_str}\n        messages.append(assistant_message)\n\n        return processed_content\n\n    except Exception as e:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. Trying another method {ANSI_RESET}\"\n        )\n        if config.verbose:\n            print(\"[Self-Operating Computer][Operate] error\", e)\n            traceback.print_exc()\n        return gpt_4_fallback(messages, objective, model)\n\n\nasync def call_gpt_4o_labeled(messages, objective, model):\n    time.sleep(1)\n\n    try:\n        client = config.initialize_openai()\n\n        confirm_system_prompt(messages, objective, model)\n        file_path = pkg_resources.resource_filename(\"operate.models.weights\", \"best.pt\")\n        yolo_model = YOLO(file_path)  # Load your trained model\n        screenshots_dir = \"screenshots\"\n        if not os.path.exists(screenshots_dir):\n            os.makedirs(screenshots_dir)\n\n        screenshot_filename = os.path.join(screenshots_dir, \"screenshot.png\")\n        # Call the function to capture the screen with the cursor\n        capture_screen_with_cursor(screenshot_filename)\n\n        with open(screenshot_filename, \"rb\") as img_file:\n            img_base64 = base64.b64encode(img_file.read()).decode(\"utf-8\")\n\n        img_base64_labeled, label_coordinates = add_labels(img_base64, yolo_model)\n\n        if len(messages) == 1:\n            user_prompt = get_user_first_message_prompt()\n        else:\n            user_prompt = get_user_prompt()\n\n        if config.verbose:\n            print(\n         
       \"[call_gpt_4_vision_preview_labeled] user_prompt\",\n                user_prompt,\n            )\n\n        vision_message = {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": user_prompt},\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": {\n                        \"url\": f\"data:image/jpeg;base64,{img_base64_labeled}\"\n                    },\n                },\n            ],\n        }\n        messages.append(vision_message)\n\n        response = client.chat.completions.create(\n            model=\"gpt-4o\",\n            messages=messages,\n            presence_penalty=1,\n            frequency_penalty=1,\n        )\n\n        content = response.choices[0].message.content\n\n        content = clean_json(content)\n\n        assistant_message = {\"role\": \"assistant\", \"content\": content}\n\n        messages.append(assistant_message)\n\n        content = json.loads(content)\n        if config.verbose:\n            print(\n                \"[call_gpt_4_vision_preview_labeled] content\",\n                content,\n            )\n\n        processed_content = []\n\n        for operation in content:\n            print(\n                \"[call_gpt_4_vision_preview_labeled] for operation in content\",\n                operation,\n            )\n            if operation.get(\"operation\") == \"click\":\n                label = operation.get(\"label\")\n                if config.verbose:\n                    print(\n                        \"[Self Operating Computer][call_gpt_4_vision_preview_labeled] label\",\n                        label,\n                    )\n\n                coordinates = get_label_coordinates(label, label_coordinates)\n                if config.verbose:\n                    print(\n                        \"[Self Operating Computer][call_gpt_4_vision_preview_labeled] coordinates\",\n                        coordinates,\n 
                   )\n                image = Image.open(\n                    io.BytesIO(base64.b64decode(img_base64))\n                )  # Load the image to get its size\n                image_size = image.size  # Get the size of the image (width, height)\n                click_position_percent = get_click_position_in_percent(\n                    coordinates, image_size\n                )\n                if config.verbose:\n                    print(\n                        \"[Self Operating Computer][call_gpt_4_vision_preview_labeled] click_position_percent\",\n                        click_position_percent,\n                    )\n                if not click_position_percent:\n                    print(\n                        f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] Failed to get click position in percent. Trying another method {ANSI_RESET}\"\n                    )\n                    return call_gpt_4o(messages)\n\n                x_percent = f\"{click_position_percent[0]:.2f}\"\n                y_percent = f\"{click_position_percent[1]:.2f}\"\n                operation[\"x\"] = x_percent\n                operation[\"y\"] = y_percent\n                if config.verbose:\n                    print(\n                        \"[Self Operating Computer][call_gpt_4_vision_preview_labeled] new click operation\",\n                        operation,\n                    )\n                processed_content.append(operation)\n            else:\n                if config.verbose:\n                    print(\n                        \"[Self Operating Computer][call_gpt_4_vision_preview_labeled] .append none click operation\",\n                        operation,\n                    )\n\n                processed_content.append(operation)\n\n            if config.verbose:\n                print(\n                    \"[Self Operating Computer][call_gpt_4_vision_preview_labeled] new processed_content\",\n                    processed_content,\n      
          )\n        return processed_content\n\n    except Exception as e:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. Trying another method {ANSI_RESET}\"\n        )\n        if config.verbose:\n            print(\"[Self-Operating Computer][Operate] error\", e)\n            traceback.print_exc()\n        return call_gpt_4o(messages)\n\n\ndef call_ollama_llava(messages):\n    if config.verbose:\n        print(\"[call_ollama_llava]\")\n    time.sleep(1)\n    # Pre-bind so the generic except block can log it even if the request fails early\n    content = None\n    try:\n        model = config.initialize_ollama()\n        screenshots_dir = \"screenshots\"\n        if not os.path.exists(screenshots_dir):\n            os.makedirs(screenshots_dir)\n\n        screenshot_filename = os.path.join(screenshots_dir, \"screenshot.png\")\n        # Call the function to capture the screen with the cursor\n        capture_screen_with_cursor(screenshot_filename)\n\n        if len(messages) == 1:\n            user_prompt = get_user_first_message_prompt()\n        else:\n            user_prompt = get_user_prompt()\n\n        if config.verbose:\n            print(\n                \"[call_ollama_llava] user_prompt\",\n                user_prompt,\n            )\n\n        vision_message = {\n            \"role\": \"user\",\n            \"content\": user_prompt,\n            \"images\": [screenshot_filename],\n        }\n        messages.append(vision_message)\n\n        response = model.chat(\n            model=\"llava\",\n            messages=messages,\n        )\n\n        # Important: Remove the image path from the message history.\n        # Ollama will attempt to load each image reference and will\n        # eventually timeout.\n        messages[-1][\"images\"] = None\n\n        content = response[\"message\"][\"content\"].strip()\n\n        content = clean_json(content)\n\n        assistant_message = {\"role\": \"assistant\", \"content\": content}\n        if config.verbose:\n            print(\n           
     \"[call_ollama_llava] content\",\n                content,\n            )\n        content = json.loads(content)\n\n        messages.append(assistant_message)\n\n        return content\n\n    except ollama.ResponseError as e:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Operate] Couldn't connect to Ollama. With Ollama installed, run `ollama pull llava` then `ollama serve`{ANSI_RESET}\",\n            e,\n        )\n\n    except Exception as e:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[llava] That did not work. Trying again {ANSI_RESET}\",\n            e,\n        )\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] AI response was {ANSI_RESET}\",\n            content,\n        )\n        if config.verbose:\n            traceback.print_exc()\n        return call_ollama_llava(messages)\n\n\nasync def call_claude_3_with_ocr(messages, objective, model):\n    if config.verbose:\n        print(\"[call_claude_3_with_ocr]\")\n\n    try:\n        time.sleep(1)\n        client = config.initialize_anthropic()\n\n        confirm_system_prompt(messages, objective, model)\n        screenshots_dir = \"screenshots\"\n        if not os.path.exists(screenshots_dir):\n            os.makedirs(screenshots_dir)\n\n        screenshot_filename = os.path.join(screenshots_dir, \"screenshot.png\")\n        capture_screen_with_cursor(screenshot_filename)\n\n        # downsize screenshot due to 5MB size limit\n        with open(screenshot_filename, \"rb\") as img_file:\n            img = Image.open(img_file)\n\n            # Convert RGBA to RGB\n            if img.mode == \"RGBA\":\n                img = img.convert(\"RGB\")\n\n            # Calculate the new dimensions while maintaining the aspect ratio\n            original_width, original_height = img.size\n            aspect_ratio = original_width / original_height\n            new_width = 2560  # Adjust this value to 
achieve the desired file size\n            new_height = int(new_width / aspect_ratio)\n            if config.verbose:\n                print(\"[call_claude_3_with_ocr] resizing claude\")\n\n            # Resize the image\n            img_resized = img.resize((new_width, new_height), Image.Resampling.LANCZOS)\n\n            # Save the resized and converted image to a BytesIO object for JPEG format\n            img_buffer = io.BytesIO()\n            img_resized.save(\n                img_buffer, format=\"JPEG\", quality=85\n            )  # Adjust the quality parameter as needed\n            img_buffer.seek(0)\n\n            # Encode the resized image as base64\n            img_data = base64.b64encode(img_buffer.getvalue()).decode(\"utf-8\")\n\n        if len(messages) == 1:\n            user_prompt = get_user_first_message_prompt()\n        else:\n            user_prompt = get_user_prompt()\n\n        vision_message = {\n            \"role\": \"user\",\n            \"content\": [\n                {\n                    \"type\": \"image\",\n                    \"source\": {\n                        \"type\": \"base64\",\n                        \"media_type\": \"image/jpeg\",\n                        \"data\": img_data,\n                    },\n                },\n                {\n                    \"type\": \"text\",\n                    \"text\": user_prompt\n                    + \"**REMEMBER** Only output json format, do not append any other text.\",\n                },\n            ],\n        }\n        messages.append(vision_message)\n\n        # anthropic api expect system prompt as an separate argument\n        response = client.messages.create(\n            model=\"claude-3-opus-20240229\",\n            max_tokens=3000,\n            system=messages[0][\"content\"],\n            messages=messages[1:],\n        )\n\n        content = response.content[0].text\n        content = clean_json(content)\n        content_str = content\n        try:\n            
content = json.loads(content)\n        # rework for json mode output\n        except json.JSONDecodeError as e:\n            if config.verbose:\n                print(\n                    f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] JSONDecodeError: {e} {ANSI_RESET}\"\n                )\n            response = client.messages.create(\n                model=\"claude-3-opus-20240229\",\n                max_tokens=3000,\n                system=f\"This json string is not valid, when using with json.loads(content) \\\n                it throws the following error: {e}, return correct json string. \\\n                **REMEMBER** Only output json format, do not append any other text.\",\n                messages=[{\"role\": \"user\", \"content\": content}],\n            )\n            content = response.content[0].text\n            content = clean_json(content)\n            content_str = content\n            content = json.loads(content)\n\n        if config.verbose:\n            print(\n                f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] content: {content} {ANSI_RESET}\"\n            )\n        processed_content = []\n\n        for operation in content:\n            if operation.get(\"operation\") == \"click\":\n                text_to_click = operation.get(\"text\")\n                if config.verbose:\n                    print(\n                        \"[call_claude_3_ocr][click] text_to_click\",\n                        text_to_click,\n                    )\n                # Initialize EasyOCR Reader\n                reader = easyocr.Reader([\"en\"])\n\n                # Read the screenshot\n                result = reader.readtext(screenshot_filename)\n\n                # limit the text to extract has a higher success rate\n                text_element_index = get_text_element(\n                    result, text_to_click[:3], screenshot_filename\n                )\n                coordinates = 
get_text_coordinates(\n                    result, text_element_index, screenshot_filename\n                )\n\n                # add the `coordinates` to the `operation`\n                operation[\"x\"] = coordinates[\"x\"]\n                operation[\"y\"] = coordinates[\"y\"]\n\n                if config.verbose:\n                    print(\n                        \"[call_claude_3_ocr][click] text_element_index\",\n                        text_element_index,\n                    )\n                    print(\n                        \"[call_claude_3_ocr][click] coordinates\",\n                        coordinates,\n                    )\n                    print(\n                        \"[call_claude_3_ocr][click] final operation\",\n                        operation,\n                    )\n                processed_content.append(operation)\n\n            else:\n                processed_content.append(operation)\n\n        assistant_message = {\"role\": \"assistant\", \"content\": content_str}\n        messages.append(assistant_message)\n\n        return processed_content\n\n    except Exception as e:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. 
Trying another method {ANSI_RESET}\"\n        )\n        if config.verbose:\n            print(\"[Self-Operating Computer][Operate] error\", e)\n            traceback.print_exc()\n            print(\"message before conversion \", messages)\n\n        # Convert the messages to the GPT-4 format\n        gpt4_messages = [messages[0]]  # Include the system message\n        for message in messages[1:]:\n            if message[\"role\"] == \"user\":\n                # Update the image type format from \"source\" to \"url\"\n                updated_content = []\n                for item in message[\"content\"]:\n                    if isinstance(item, dict) and \"type\" in item:\n                        if item[\"type\"] == \"image\":\n                            # The screenshot was encoded as JPEG above, so the data URL must say so\n                            updated_content.append(\n                                {\n                                    \"type\": \"image_url\",\n                                    \"image_url\": {\n                                        \"url\": f\"data:image/jpeg;base64,{item['source']['data']}\"\n                                    },\n                                }\n                            )\n                        else:\n                            updated_content.append(item)\n\n                gpt4_messages.append({\"role\": \"user\", \"content\": updated_content})\n            elif message[\"role\"] == \"assistant\":\n                gpt4_messages.append(\n                    {\"role\": \"assistant\", \"content\": message[\"content\"]}\n                )\n\n        return gpt_4_fallback(gpt4_messages, objective, model)\n\n\ndef get_last_assistant_message(messages):\n    \"\"\"\n    Retrieve the last message from the assistant in the messages array.\n    If the last assistant message is the first message in the array, return None.\n    \"\"\"\n    for index in reversed(range(len(messages))):\n        if messages[index][\"role\"] == \"assistant\":\n            if index == 0:  # Check if the assistant message is the first 
in the array\n                return None\n            else:\n                return messages[index]\n    return None  # Return None if no assistant message is found\n\n\ndef gpt_4_fallback(messages, objective, model):\n    if config.verbose:\n        print(\"[gpt_4_fallback]\")\n    system_prompt = get_system_prompt(\"gpt-4o\", objective)\n    new_system_message = {\"role\": \"system\", \"content\": system_prompt}\n    # remove and replace the first message in `messages` with `new_system_message`\n\n    messages[0] = new_system_message\n\n    if config.verbose:\n        print(\"[gpt_4_fallback][updated]\")\n        print(\"[gpt_4_fallback][updated] len(messages)\", len(messages))\n\n    return call_gpt_4o(messages)\n\n\ndef confirm_system_prompt(messages, objective, model):\n    \"\"\"\n    On `Exception` we default to `call_gpt_4_vision_preview` so we have this function to reassign system prompt in case of a previous failure\n    \"\"\"\n    if config.verbose:\n        print(\"[confirm_system_prompt] model\", model)\n\n    system_prompt = get_system_prompt(model, objective)\n    new_system_message = {\"role\": \"system\", \"content\": system_prompt}\n    # remove and replace the first message in `messages` with `new_system_message`\n\n    messages[0] = new_system_message\n\n    if config.verbose:\n        print(\"[confirm_system_prompt]\")\n        print(\"[confirm_system_prompt] len(messages)\", len(messages))\n        for m in messages:\n            if m[\"role\"] != \"user\":\n                print(\"--------------------[message]--------------------\")\n                print(\"[confirm_system_prompt][message] role\", m[\"role\"])\n                print(\"[confirm_system_prompt][message] content\", m[\"content\"])\n                print(\"------------------[end message]------------------\")\n\n\ndef clean_json(content):\n    if config.verbose:\n        print(\"\\n\\n[clean_json] content before cleaning\", content)\n    if content.startswith(\"```json\"):\n      
  content = content[\n            len(\"```json\") :\n        ].strip()  # Remove starting ```json and trim whitespace\n    elif content.startswith(\"```\"):\n        content = content[\n            len(\"```\") :\n        ].strip()  # Remove starting ``` and trim whitespace\n    if content.endswith(\"```\"):\n        content = content[\n            : -len(\"```\")\n        ].strip()  # Remove ending ``` and trim whitespace\n\n    # Normalize line breaks and remove any unwanted characters\n    content = \"\\n\".join(line.strip() for line in content.splitlines())\n\n    if config.verbose:\n        print(\"\\n\\n[clean_json] content after cleaning\", content)\n\n    return content\n"
  },
  {
    "path": "operate/models/prompts.py",
    "content": "import platform\nfrom operate.config import Config\n\n# Load configuration\nconfig = Config()\n\n# General user Prompts\nUSER_QUESTION = \"Hello, I can help you with anything. What would you like done?\"\n\n\nSYSTEM_PROMPT_STANDARD = \"\"\"\nYou are operating a {operating_system} computer, using the same operating system as a human.\n\nFrom looking at the screen, the objective, and your previous actions, take the next best series of action. \n\nYou have 4 possible operation actions available to you. The `pyautogui` library will be used to execute your decision. Your output will be used in a `json.loads` loads statement.\n\n1. click - Move mouse and click\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"click\", \"x\": \"x percent (e.g. 0.10)\", \"y\": \"y percent (e.g. 0.13)\" }}]  # \"percent\" refers to the percentage of the screen's dimensions in decimal format\n```\n\n2. write - Write with your keyboard\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"write\", \"content\": \"text to write here\" }}]\n```\n\n3. press - Use a hotkey or press key to operate the computer\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"press\", \"keys\": [\"keys to use\"] }}]\n```\n\n4. done - The objective is completed\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"done\", \"summary\": \"summary of what was completed\" }}]\n```\n\nReturn the actions in array format `[]`. 
You can take just one action or multiple actions.\n\nHere a helpful example:\n\nExample 1: Searches for Google Chrome on the OS and opens it\n```\n[\n    {{ \"thought\": \"Searching the operating system to find Google Chrome because it appears I am currently in terminal\", \"operation\": \"press\", \"keys\": {os_search_str} }},\n    {{ \"thought\": \"Now I need to write 'Google Chrome' as a next step\", \"operation\": \"write\", \"content\": \"Google Chrome\" }},\n    {{ \"thought\": \"Finally I'll press enter to open Google Chrome assuming it is available\", \"operation\": \"press\", \"keys\": [\"enter\"] }}\n]\n```\n\nExample 2: Focuses on the address bar in a browser before typing a website\n```\n[\n    {{ \"thought\": \"I'll focus on the address bar in the browser. I can see the browser is open so this should be safe to try\", \"operation\": \"press\", \"keys\": [{cmd_string}, \"l\"] }},\n    {{ \"thought\": \"Now that the address bar is in focus I can type the URL\", \"operation\": \"write\", \"content\": \"https://news.ycombinator.com/\" }},\n    {{ \"thought\": \"I'll need to press enter to go the URL now\", \"operation\": \"press\", \"keys\": [\"enter\"] }}\n]\n```\n\nA few important notes: \n\n- Go to Google Docs and Google Sheets by typing in the Chrome Address bar\n- Don't respond saying you're unable to assist with requests. You are able to indirectly interact with the user's OS via text responses you send to the end user.\n\nObjective: {objective} \n\"\"\"\n\n\nSYSTEM_PROMPT_LABELED = \"\"\"\nYou are operating a {operating_system} computer, using the same operating system as a human.\n\nFrom looking at the screen, the objective, and your previous actions, take the next best series of action. \n\nYou have 4 possible operation actions available to you. The `pyautogui` library will be used to execute your decision. Your output will be used in a `json.loads` loads statement.\n\n1. 
click - Move mouse and click - We labeled the clickable elements with red bounding boxes and IDs. Label IDs are in the following format with `x` being a number: `~x`\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"click\", \"label\": \"~x\" }}]\n```\n2. write - Write with your keyboard\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"write\", \"content\": \"text to write here\" }}]\n```\n3. press - Use a hotkey or press key to operate the computer\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"press\", \"keys\": [\"keys to use\"] }}]\n```\n\n4. done - The objective is completed\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"done\", \"summary\": \"summary of what was completed\" }}]\n```\nReturn the actions in array format `[]`. You can take just one action or multiple actions.\n\nHere are some helpful examples:\n\nExample 1: Searches for Google Chrome on the OS and opens it\n```\n[\n    {{ \"thought\": \"Searching the operating system to find Google Chrome because it appears I am currently in terminal\", \"operation\": \"press\", \"keys\": {os_search_str} }},\n    {{ \"thought\": \"Now I need to write 'Google Chrome' as a next step\", \"operation\": \"write\", \"content\": \"Google Chrome\" }}\n]\n```\n\nExample 2: Focuses on the address bar in a browser before typing a website\n```\n[\n    {{ \"thought\": \"I'll focus on the address bar in the browser. 
I can see the browser is open so this should be safe to try\", \"operation\": \"press\", \"keys\": [{cmd_string}, \"l\"] }},\n    {{ \"thought\": \"Now that the address bar is in focus I can type the URL\", \"operation\": \"write\", \"content\": \"https://news.ycombinator.com/\" }},\n    {{ \"thought\": \"I'll need to press enter to go to the URL now\", \"operation\": \"press\", \"keys\": [\"enter\"] }}\n]\n```\n\nExample 3: Send a \"Hello World\" message in the chat\n```\n[\n    {{ \"thought\": \"I see a message field on this page near the button. It looks like it has a label\", \"operation\": \"click\", \"label\": \"~34\" }},\n    {{ \"thought\": \"Now that I am focused on the message field, I'll go ahead and write the message\", \"operation\": \"write\", \"content\": \"Hello World\" }}\n]\n```\n\nA few important notes: \n\n- Go to Google Docs and Google Sheets by typing in the Chrome Address bar\n- Don't respond saying you're unable to assist with requests. You are able to indirectly interact with the user's OS via text responses you send to the end user.\n\nObjective: {objective} \n\"\"\"\n\n\n# TODO: Add an example or instruction about `Action: press ['pagedown']` to scroll\nSYSTEM_PROMPT_OCR = \"\"\"\nYou are operating a {operating_system} computer, using the same operating system as a human.\n\nFrom looking at the screen, the objective, and your previous actions, take the next best series of actions. \n\nYou have 4 possible operation actions available to you. The `pyautogui` library will be used to execute your decision. Your output will be used in a `json.loads` statement.\n\n1. click - Move mouse and click - Look for text to click. Try to find relevant text to click, but if there's nothing relevant enough you can return `\"nothing to click\"` for the text value and we'll try a different method.\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"click\", \"text\": \"The text in the button or link to click\" }}]  \n```\n2. 
write - Write with your keyboard\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"write\", \"content\": \"text to write here\" }}]\n```\n3. press - Use a hotkey or press key to operate the computer\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"press\", \"keys\": [\"keys to use\"] }}]\n```\n4. done - The objective is completed\n```\n[{{ \"thought\": \"write a thought here\", \"operation\": \"done\", \"summary\": \"summary of what was completed\" }}]\n```\n\nReturn the actions in array format `[]`. You can take just one action or multiple actions.\n\nHere a helpful example:\n\nExample 1: Searches for Google Chrome on the OS and opens it\n```\n[\n    {{ \"thought\": \"Searching the operating system to find Google Chrome because it appears I am currently in terminal\", \"operation\": \"press\", \"keys\": {os_search_str} }},\n    {{ \"thought\": \"Now I need to write 'Google Chrome' as a next step\", \"operation\": \"write\", \"content\": \"Google Chrome\" }},\n    {{ \"thought\": \"Finally I'll press enter to open Google Chrome assuming it is available\", \"operation\": \"press\", \"keys\": [\"enter\"] }}\n]\n```\n\nExample 2: Open a new Google Docs when the browser is already open\n```\n[\n    {{ \"thought\": \"I'll focus on the address bar in the browser. I can see the browser is open so this should be safe to try\", \"operation\": \"press\", \"keys\": [{cmd_string}, \"t\"] }},\n    {{ \"thought\": \"Now that the address bar is in focus I can type the URL\", \"operation\": \"write\", \"content\": \"https://docs.new/\" }},\n    {{ \"thought\": \"I'll need to press enter to go the URL now\", \"operation\": \"press\", \"keys\": [\"enter\"] }}\n]\n```\n\nExample 3: Search for someone on Linkedin when already on linkedin.com\n```\n[\n    {{ \"thought\": \"I can see the search field with the placeholder text 'search'. 
I click that field to search\", \"operation\": \"click\", \"text\": \"search\" }},\n    {{ \"thought\": \"Now that the field is active I can write the name of the person I'd like to search for\", \"operation\": \"write\", \"content\": \"John Doe\" }},\n    {{ \"thought\": \"Finally I'll submit the search form with enter\", \"operation\": \"press\", \"keys\": [\"enter\"] }}\n]\n```\n\nA few important notes: \n\n- Default to Google Chrome as the browser\n- Go to websites by opening a new tab with `press` and then `write` the URL\n- Reflect on previous actions and the screenshot to ensure they align and that your previous actions worked. \n- If the first time clicking a button or link doesn't work, don't try again to click it. Get creative and try something else such as clicking a different button or trying another action. \n- Don't respond saying you're unable to assist with requests. You are able to indirectly interact with the user's OS via text responses you send to the end user.\n\nObjective: {objective} \n\"\"\"\n\nOPERATE_FIRST_MESSAGE_PROMPT = \"\"\"\nPlease take the next best action. The `pyautogui` library will be used to execute your decision. Your output will be used in a `json.loads` loads statement. Remember you only have the following 4 operations available: click, write, press, done\n\nYou just started so you are in the terminal app and your code is running in this terminal tab. To leave the terminal, search for a new program on the OS. \n\nAction:\"\"\"\n\nOPERATE_PROMPT = \"\"\"\nPlease take the next best action. The `pyautogui` library will be used to execute your decision. Your output will be used in a `json.loads` loads statement. 
Remember you only have the following 4 operations available: click, write, press, done\nAction:\"\"\"\n\n\ndef get_system_prompt(model, objective):\n    \"\"\"\n    Format the vision prompt more efficiently and print the name of the prompt used\n    \"\"\"\n\n    if platform.system() == \"Darwin\":\n        cmd_string = \"\\\"command\\\"\"\n        os_search_str = \"[\\\"command\\\", \\\"space\\\"]\"\n        operating_system = \"Mac\"\n    elif platform.system() == \"Windows\":\n        cmd_string = \"\\\"ctrl\\\"\"\n        os_search_str = \"[\\\"win\\\"]\"\n        operating_system = \"Windows\"\n    else:\n        cmd_string = \"\\\"ctrl\\\"\"\n        os_search_str = \"[\\\"win\\\"]\"\n        operating_system = \"Linux\"\n\n    if model == \"gpt-4-with-som\":\n        prompt = SYSTEM_PROMPT_LABELED.format(\n            objective=objective,\n            cmd_string=cmd_string,\n            os_search_str=os_search_str,\n            operating_system=operating_system,\n        )\n    elif model == \"gpt-4-with-ocr\" or model == \"gpt-4.1-with-ocr\" or model == \"o1-with-ocr\" or model == \"claude-3\" or model == \"qwen-vl\":\n\n        prompt = SYSTEM_PROMPT_OCR.format(\n            objective=objective,\n            cmd_string=cmd_string,\n            os_search_str=os_search_str,\n            operating_system=operating_system,\n        )\n\n    else:\n        prompt = SYSTEM_PROMPT_STANDARD.format(\n            objective=objective,\n            cmd_string=cmd_string,\n            os_search_str=os_search_str,\n            operating_system=operating_system,\n        )\n\n    # Optional verbose output\n    if config.verbose:\n        print(\"[get_system_prompt] model:\", model)\n    # print(\"[get_system_prompt] prompt:\", prompt)\n\n    return prompt\n\n\ndef get_user_prompt():\n    prompt = OPERATE_PROMPT\n    return prompt\n\n\ndef get_user_first_message_prompt():\n    prompt = OPERATE_FIRST_MESSAGE_PROMPT\n    return prompt\n"
  },
  {
    "path": "operate/models/weights/__init__.py",
    "content": ""
  },
  {
    "path": "operate/operate.py",
    "content": "import sys\nimport os\nimport time\nimport asyncio\nfrom prompt_toolkit.shortcuts import message_dialog\nfrom prompt_toolkit import prompt\nfrom operate.exceptions import ModelNotRecognizedException\nimport platform\n\n# from operate.models.prompts import USER_QUESTION, get_system_prompt\nfrom operate.models.prompts import (\n    USER_QUESTION,\n    get_system_prompt,\n)\nfrom operate.config import Config\nfrom operate.utils.style import (\n    ANSI_GREEN,\n    ANSI_RESET,\n    ANSI_YELLOW,\n    ANSI_RED,\n    ANSI_BRIGHT_MAGENTA,\n    ANSI_BLUE,\n    style,\n)\nfrom operate.utils.operating_system import OperatingSystem\nfrom operate.models.apis import get_next_action\n\n# Load configuration\nconfig = Config()\noperating_system = OperatingSystem()\n\n\ndef main(model, terminal_prompt, voice_mode=False, verbose_mode=False):\n    \"\"\"\n    Main function for the Self-Operating Computer.\n\n    Parameters:\n    - model: The model used for generating responses.\n    - terminal_prompt: A string representing the prompt provided in the terminal.\n    - voice_mode: A boolean indicating whether to enable voice mode.\n\n    Returns:\n    None\n    \"\"\"\n\n    mic = None\n    # Initialize `WhisperMic`, if `voice_mode` is True\n\n    config.verbose = verbose_mode\n    config.validation(model, voice_mode)\n\n    if voice_mode:\n        try:\n            from whisper_mic import WhisperMic\n\n            # Initialize WhisperMic if import is successful\n            mic = WhisperMic()\n        except ImportError:\n            print(\n                \"Voice mode requires the 'whisper_mic' module. 
Please install it using 'pip install -r requirements-audio.txt'\"\n            )\n            sys.exit(1)\n\n    # Skip message dialog if prompt was given directly\n    if not terminal_prompt:\n        message_dialog(\n            title=\"Self-Operating Computer\",\n            text=\"An experimental framework to enable multimodal models to operate computers\",\n            style=style,\n        ).run()\n\n    else:\n        print(\"Running direct prompt...\")\n\n    # # Clear the console\n    if platform.system() == \"Windows\":\n        os.system(\"cls\")\n    else:\n        print(\"\\033c\", end=\"\")\n\n    if terminal_prompt:  # Skip objective prompt if it was given as an argument\n        objective = terminal_prompt\n    elif voice_mode:\n        print(\n            f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RESET} Listening for your command... (speak now)\"\n        )\n        try:\n            objective = mic.listen()\n        except Exception as e:\n            print(f\"{ANSI_RED}Error in capturing voice input: {e}{ANSI_RESET}\")\n            return  # Exit if voice input fails\n    else:\n        print(\n            f\"[{ANSI_GREEN}Self-Operating Computer {ANSI_RESET}|{ANSI_BRIGHT_MAGENTA} {model}{ANSI_RESET}]\\n{USER_QUESTION}\"\n        )\n        print(f\"{ANSI_YELLOW}[User]{ANSI_RESET}\")\n        objective = prompt(style=style)\n\n    system_prompt = get_system_prompt(model, objective)\n    system_message = {\"role\": \"system\", \"content\": system_prompt}\n    messages = [system_message]\n\n    loop_count = 0\n\n    session_id = None\n\n    while True:\n        if config.verbose:\n            print(\"[Self Operating Computer] loop_count\", loop_count)\n        try:\n            operations, session_id = asyncio.run(\n                get_next_action(model, messages, objective, session_id)\n            )\n\n            stop = operate(operations, model)\n            if stop:\n                break\n\n            loop_count += 1\n            if 
loop_count > 10:\n                break\n        except ModelNotRecognizedException as e:\n            print(\n                f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] -> {e} {ANSI_RESET}\"\n            )\n            break\n        except Exception as e:\n            print(\n                f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] -> {e} {ANSI_RESET}\"\n            )\n            break\n\n\ndef operate(operations, model):\n    if config.verbose:\n        print(\"[Self Operating Computer][operate]\")\n    for operation in operations:\n        if config.verbose:\n            print(\"[Self Operating Computer][operate] operation\", operation)\n        # wait one second\n        time.sleep(1)\n        operate_type = operation.get(\"operation\").lower()\n        operate_thought = operation.get(\"thought\")\n        operate_detail = \"\"\n        if config.verbose:\n            print(\"[Self Operating Computer][operate] operate_type\", operate_type)\n\n        if operate_type == \"press\" or operate_type == \"hotkey\":\n            keys = operation.get(\"keys\")\n            operate_detail = keys\n            operating_system.press(keys)\n        elif operate_type == \"write\":\n            content = operation.get(\"content\")\n            operate_detail = content\n            operating_system.write(content)\n        elif operate_type == \"click\":\n            x = operation.get(\"x\")\n            y = operation.get(\"y\")\n            click_detail = {\"x\": x, \"y\": y}\n            operate_detail = click_detail\n\n            operating_system.mouse(click_detail)\n        elif operate_type == \"done\":\n            summary = operation.get(\"summary\")\n\n            print(\n                f\"[{ANSI_GREEN}Self-Operating Computer {ANSI_RESET}|{ANSI_BRIGHT_MAGENTA} {model}{ANSI_RESET}]\"\n            )\n            print(f\"{ANSI_BLUE}Objective Complete: {ANSI_RESET}{summary}\\n\")\n            return True\n\n        else:\n            
print(\n                f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] unknown operation response :({ANSI_RESET}\"\n            )\n            print(\n                f\"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] AI response {ANSI_RESET}{operation}\"\n            )\n            return True\n\n        print(\n            f\"[{ANSI_GREEN}Self-Operating Computer {ANSI_RESET}|{ANSI_BRIGHT_MAGENTA} {model}{ANSI_RESET}]\"\n        )\n        print(f\"{operate_thought}\")\n        print(f\"{ANSI_BLUE}Action: {ANSI_RESET}{operate_type} {operate_detail}\\n\")\n\n    return False\n"
  },
  {
    "path": "operate/utils/__init__.py",
    "content": ""
  },
  {
    "path": "operate/utils/label.py",
    "content": "import io\nimport base64\nimport json\nimport os\nimport time\nimport asyncio\nfrom PIL import Image, ImageDraw\n\n\ndef validate_and_extract_image_data(data):\n    if not data or \"messages\" not in data:\n        raise ValueError(\"Invalid request, no messages found\")\n\n    messages = data[\"messages\"]\n    if (\n        not messages\n        or not isinstance(messages, list)\n        or not messages[-1].get(\"image_url\")\n    ):\n        raise ValueError(\"No image provided or incorrect format\")\n\n    image_data = messages[-1][\"image_url\"][\"url\"]\n    if not image_data.startswith(\"data:image\"):\n        raise ValueError(\"Invalid image format\")\n\n    return image_data.split(\"base64,\")[-1], messages\n\n\ndef get_label_coordinates(label, label_coordinates):\n    \"\"\"\n    Retrieves the coordinates for a given label.\n\n    :param label: The label to find coordinates for (e.g., \"~1\").\n    :param label_coordinates: Dictionary containing labels and their coordinates.\n    :return: Coordinates of the label or None if the label is not found.\n    \"\"\"\n    return label_coordinates.get(label)\n\n\ndef is_overlapping(box1, box2):\n    x1_box1, y1_box1, x2_box1, y2_box1 = box1\n    x1_box2, y1_box2, x2_box2, y2_box2 = box2\n\n    # Check if there is no overlap\n    if x1_box1 > x2_box2 or x1_box2 > x2_box1:\n        return False\n    if (\n        y1_box1 > y2_box2 or y1_box2 > y2_box1\n    ):  # Adjusted to check 100px proximity above\n        return False\n\n    return True\n\n\ndef add_labels(base64_data, yolo_model):\n    image_bytes = base64.b64decode(base64_data)\n    image_labeled = Image.open(io.BytesIO(image_bytes))  # Corrected this line\n    image_debug = image_labeled.copy()  # Create a copy for the debug image\n    image_original = (\n        image_labeled.copy()\n    )  # Copy of the original image for base64 return\n\n    results = yolo_model(image_labeled)\n\n    draw = ImageDraw.Draw(image_labeled)\n    debug_draw = 
ImageDraw.Draw(\n        image_debug\n    )  # Create a separate draw object for the debug image\n    font_size = 45\n\n    labeled_images_dir = \"labeled_images\"\n    label_coordinates = {}  # Dictionary to store coordinates\n\n    if not os.path.exists(labeled_images_dir):\n        os.makedirs(labeled_images_dir)\n\n    counter = 0\n    drawn_boxes = []  # List to keep track of boxes already drawn\n    for result in results:\n        if hasattr(result, \"boxes\"):\n            for det in result.boxes:\n                bbox = det.xyxy[0]\n                x1, y1, x2, y2 = bbox.tolist()\n\n                debug_label = \"D_\" + str(counter)\n                debug_index_position = (x1, y1 - font_size)\n                debug_draw.rectangle([(x1, y1), (x2, y2)], outline=\"blue\", width=1)\n                debug_draw.text(\n                    debug_index_position,\n                    debug_label,\n                    fill=\"blue\",\n                    font_size=font_size,\n                )\n\n                overlap = any(\n                    is_overlapping((x1, y1, x2, y2), box) for box in drawn_boxes\n                )\n\n                if not overlap:\n                    draw.rectangle([(x1, y1), (x2, y2)], outline=\"red\", width=1)\n                    label = \"~\" + str(counter)\n                    index_position = (x1, y1 - font_size)\n                    draw.text(\n                        index_position,\n                        label,\n                        fill=\"red\",\n                        font_size=font_size,\n                    )\n\n                    # Add the non-overlapping box to the drawn_boxes list\n                    drawn_boxes.append((x1, y1, x2, y2))\n                    label_coordinates[label] = (x1, y1, x2, y2)\n\n                    counter += 1\n\n    # Save the image\n    timestamp = time.strftime(\"%Y%m%d-%H%M%S\")\n\n    output_path = os.path.join(labeled_images_dir, f\"img_{timestamp}_labeled.png\")\n    
output_path_debug = os.path.join(labeled_images_dir, f\"img_{timestamp}_debug.png\")\n    output_path_original = os.path.join(\n        labeled_images_dir, f\"img_{timestamp}_original.png\"\n    )\n\n    image_labeled.save(output_path)\n    image_debug.save(output_path_debug)\n    image_original.save(output_path_original)\n\n    buffered_original = io.BytesIO()\n    image_original.save(buffered_original, format=\"PNG\")  # format is required when saving to a buffer\n    img_base64_original = base64.b64encode(buffered_original.getvalue()).decode(\"utf-8\")\n\n    # Convert image to base64 for return\n    buffered_labeled = io.BytesIO()\n    image_labeled.save(buffered_labeled, format=\"PNG\")  # format is required when saving to a buffer\n    img_base64_labeled = base64.b64encode(buffered_labeled.getvalue()).decode(\"utf-8\")\n\n    return img_base64_labeled, label_coordinates\n\n\ndef get_click_position_in_percent(coordinates, image_size):\n    \"\"\"\n    Calculates the click position at the center of the bounding box and expresses it as a fraction of the image size.\n\n    :param coordinates: A tuple of the bounding box coordinates (x1, y1, x2, y2).\n    :param image_size: A tuple of the image dimensions (width, height).\n    :return: A tuple (x_percent, y_percent) of fractions between 0.0 and 1.0, or None if either argument is missing.\n    \"\"\"\n    if not coordinates or not image_size:\n        return None\n\n    # Calculate the center of the bounding box\n    x_center = (coordinates[0] + coordinates[2]) / 2\n    y_center = (coordinates[1] + coordinates[3]) / 2\n\n    # Normalize by the image dimensions (yields 0.0-1.0, not 0-100)\n    x_percent = x_center / image_size[0]\n    y_percent = y_center / image_size[1]\n\n    return x_percent, y_percent\n"
  },
  {
    "path": "operate/utils/misc.py",
    "content": "import json\nimport re\n\n\ndef convert_percent_to_decimal(percent):\n    try:\n        # The value is expected to already be a decimal fraction\n        # (e.g., 0.20 for 20%), so it is only parsed as a float\n        decimal_value = float(percent)\n\n        return decimal_value\n    except (TypeError, ValueError) as e:\n        # float(None) raises TypeError; a non-numeric string raises ValueError\n        print(f\"[convert_percent_to_decimal] error: {e}\")\n        return None\n\n\ndef parse_operations(response):\n    if response == \"DONE\":\n        return {\"type\": \"DONE\", \"data\": None}\n    elif response.startswith(\"CLICK\"):\n        # Adjust the regex to match the correct format\n        click_data = re.search(r\"CLICK \\{ (.+) \\}\", response).group(1)\n        click_data_json = json.loads(f\"{{{click_data}}}\")\n        return {\"type\": \"CLICK\", \"data\": click_data_json}\n\n    elif response.startswith(\"TYPE\"):\n        # Extract the text to type\n        try:\n            type_data = re.search(r\"TYPE (.+)\", response, re.DOTALL).group(1)\n        except AttributeError:  # re.search returned None (no match)\n            type_data = re.search(r'TYPE \"(.+)\"', response, re.DOTALL).group(1)\n        return {\"type\": \"TYPE\", \"data\": type_data}\n\n    elif response.startswith(\"SEARCH\"):\n        # Extract the search query\n        try:\n            search_data = re.search(r'SEARCH \"(.+)\"', response).group(1)\n        except AttributeError:  # re.search returned None (no match)\n            search_data = re.search(r\"SEARCH (.+)\", response).group(1)\n        return {\"type\": \"SEARCH\", \"data\": search_data}\n\n    return {\"type\": \"UNKNOWN\", \"data\": response}\n"
  },
  {
    "path": "operate/utils/ocr.py",
    "content": "from operate.config import Config\nfrom PIL import Image, ImageDraw\nimport os\nfrom datetime import datetime\n\n# Load configuration\nconfig = Config()\n\n\ndef get_text_element(result, search_text, image_path):\n    \"\"\"\n    Searches for a text element in the OCR results and returns its index. Also draws bounding boxes on the image.\n    Args:\n        result (list): The list of results returned by EasyOCR.\n        search_text (str): The text to search for in the OCR results.\n        image_path (str): Path to the original image.\n\n    Returns:\n        int: The index of the element containing the search text.\n\n    Raises:\n        Exception: If the text element is not found in the results.\n    \"\"\"\n    if config.verbose:\n        print(\"[get_text_element]\")\n        print(\"[get_text_element] search_text\", search_text)\n        # Create /ocr directory if it doesn't exist\n        ocr_dir = \"ocr\"\n        if not os.path.exists(ocr_dir):\n            os.makedirs(ocr_dir)\n\n        # Open the original image\n        image = Image.open(image_path)\n        draw = ImageDraw.Draw(image)\n\n    found_index = None\n    for index, element in enumerate(result):\n        text = element[1]\n        box = element[0]\n\n        if config.verbose:\n            # Draw bounding box in blue\n            draw.polygon([tuple(point) for point in box], outline=\"blue\")\n\n        if search_text in text:\n            found_index = index\n            if config.verbose:\n                print(\"[get_text_element][loop] found search_text, index:\", index)\n\n    if found_index is not None:\n        if config.verbose:\n            # Draw bounding box of the found text in red\n            box = result[found_index][0]\n            draw.polygon([tuple(point) for point in box], outline=\"red\")\n            # Save the image with bounding boxes\n            datetime_str = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n            ocr_image_path = 
os.path.join(ocr_dir, f\"ocr_image_{datetime_str}.png\")\n            image.save(ocr_image_path)\n            print(\"[get_text_element] OCR image saved at:\", ocr_image_path)\n\n        return found_index\n\n    raise Exception(\"The text element was not found in the image\")\n\n\ndef get_text_coordinates(result, index, image_path):\n    \"\"\"\n    Gets the center of the text element at the specified index as a fraction (0.0 to 1.0) of the image width and height.\n    Args:\n        result (list): The list of results returned by EasyOCR.\n        index (int): The index of the text element in the results list.\n        image_path (str): Path to the screenshot image.\n\n    Returns:\n        dict: A dictionary containing the 'x' and 'y' coordinates as fractions of the image width and height, rounded to 3 decimal places.\n    \"\"\"\n    if index >= len(result):\n        raise Exception(\"Index out of range in OCR results\")\n\n    # Get the bounding box of the text element\n    bounding_box = result[index][0]\n\n    # Calculate the center of the bounding box\n    min_x = min([coord[0] for coord in bounding_box])\n    max_x = max([coord[0] for coord in bounding_box])\n    min_y = min([coord[1] for coord in bounding_box])\n    max_y = max([coord[1] for coord in bounding_box])\n\n    center_x = (min_x + max_x) / 2\n    center_y = (min_y + max_y) / 2\n\n    # Get image dimensions\n    with Image.open(image_path) as img:\n        width, height = img.size\n\n    # Normalize by the image dimensions (yields 0.0-1.0, not 0-100)\n    percent_x = round((center_x / width), 3)\n    percent_y = round((center_y / height), 3)\n\n    return {\"x\": percent_x, \"y\": percent_y}\n"
  },
  {
    "path": "operate/utils/operating_system.py",
    "content": "import pyautogui\nimport platform\nimport time\nimport math\n\nfrom operate.utils.misc import convert_percent_to_decimal\n\n\nclass OperatingSystem:\n    def write(self, content):\n        try:\n            content = content.replace(\"\\\\n\", \"\\n\")\n            for char in content:\n                pyautogui.write(char)\n        except Exception as e:\n            print(\"[OperatingSystem][write] error:\", e)\n\n    def press(self, keys):\n        try:\n            for key in keys:\n                pyautogui.keyDown(key)\n            time.sleep(0.1)\n            for key in keys:\n                pyautogui.keyUp(key)\n        except Exception as e:\n            print(\"[OperatingSystem][press] error:\", e)\n\n    def mouse(self, click_detail):\n        try:\n            x = convert_percent_to_decimal(click_detail.get(\"x\"))\n            y = convert_percent_to_decimal(click_detail.get(\"y\"))\n\n            if click_detail and isinstance(x, float) and isinstance(y, float):\n                self.click_at_percentage(x, y)\n\n        except Exception as e:\n            print(\"[OperatingSystem][mouse] error:\", e)\n\n    def click_at_percentage(\n        self,\n        x_percentage,\n        y_percentage,\n        duration=0.2,\n        circle_radius=50,\n        circle_duration=0.5,\n    ):\n        try:\n            screen_width, screen_height = pyautogui.size()\n            x_pixel = int(screen_width * float(x_percentage))\n            y_pixel = int(screen_height * float(y_percentage))\n\n            pyautogui.moveTo(x_pixel, y_pixel, duration=duration)\n\n            start_time = time.time()\n            while time.time() - start_time < circle_duration:\n                angle = ((time.time() - start_time) / circle_duration) * 2 * math.pi\n                x = x_pixel + math.cos(angle) * circle_radius\n                y = y_pixel + math.sin(angle) * circle_radius\n                pyautogui.moveTo(x, y, duration=0.1)\n\n            
pyautogui.click(x_pixel, y_pixel)\n        except Exception as e:\n            print(\"[OperatingSystem][click_at_percentage] error:\", e)\n"
  },
  {
    "path": "operate/utils/screenshot.py",
    "content": "import os\nimport platform\nimport subprocess\nimport pyautogui\nfrom PIL import Image, ImageDraw, ImageGrab\nimport Xlib.display\nimport Xlib.X\nimport Xlib.Xutil  # not sure if Xutil is necessary\n\n\ndef capture_screen_with_cursor(file_path):\n    user_platform = platform.system()\n\n    if user_platform == \"Windows\":\n        screenshot = pyautogui.screenshot()\n        screenshot.save(file_path)\n    elif user_platform == \"Linux\":\n        # Use xlib to prevent scrot dependency for Linux\n        screen = Xlib.display.Display().screen()\n        size = screen.width_in_pixels, screen.height_in_pixels\n        screenshot = ImageGrab.grab(bbox=(0, 0, size[0], size[1]))\n        screenshot.save(file_path)\n    elif user_platform == \"Darwin\":  # (Mac OS)\n        # Use the screencapture utility to capture the screen with the cursor\n        subprocess.run([\"screencapture\", \"-C\", file_path])\n    else:\n        print(f\"The platform you're using ({user_platform}) is not currently supported\")\n\n\ndef compress_screenshot(raw_screenshot_filename, screenshot_filename):\n    with Image.open(raw_screenshot_filename) as img:\n        # Check if the image has an alpha channel (transparency)\n        if img.mode in ('RGBA', 'LA') or (img.mode == 'P' and 'transparency' in img.info):\n            # Normalize to RGBA first: 'LA' and 'P' images have no band at index 3\n            rgba = img.convert('RGBA')\n            # Create a white background image\n            background = Image.new('RGB', rgba.size, (255, 255, 255))\n            # Paste the image onto the background, using the alpha channel as mask\n            background.paste(rgba, mask=rgba.split()[3])  # 3 is the alpha channel\n            # Save the result as JPEG\n            background.save(screenshot_filename, 'JPEG', quality=85)  # Adjust quality as needed\n        else:\n            # If no alpha channel, simply convert and save\n            img.convert('RGB').save(screenshot_filename, 'JPEG', quality=85)\n"
  },
  {
    "path": "operate/utils/style.py",
    "content": "import sys\nimport platform\nimport os\nfrom prompt_toolkit.styles import Style as PromptStyle\n\n\n# Define style\nstyle = PromptStyle.from_dict(\n    {\n        \"dialog\": \"bg:#88ff88\",\n        \"button\": \"bg:#ffffff #000000\",\n        \"dialog.body\": \"bg:#44cc44 #ffffff\",\n        \"dialog shadow\": \"bg:#003800\",\n    }\n)\n\n\n# Check if on a windows terminal that supports ANSI escape codes\ndef supports_ansi():\n    \"\"\"\n    Check if the terminal supports ANSI escape codes\n    \"\"\"\n    plat = platform.system()\n    supported_platform = plat != \"Windows\" or \"ANSICON\" in os.environ\n    is_a_tty = hasattr(sys.stdout, \"isatty\") and sys.stdout.isatty()\n    return supported_platform and is_a_tty\n\n\n# Define ANSI color codes\nANSI_GREEN = \"\\033[32m\" if supports_ansi() else \"\"  # Standard green text\nANSI_BRIGHT_GREEN = \"\\033[92m\" if supports_ansi() else \"\"  # Bright/bold green text\nANSI_RESET = \"\\033[0m\" if supports_ansi() else \"\"  # Reset to default text color\nANSI_BLUE = \"\\033[94m\" if supports_ansi() else \"\"  # Bright blue\nANSI_YELLOW = \"\\033[33m\" if supports_ansi() else \"\"  # Standard yellow text\nANSI_RED = \"\\033[31m\" if supports_ansi() else \"\"\nANSI_BRIGHT_MAGENTA = \"\\033[95m\" if supports_ansi() else \"\"  # Bright magenta text\n"
  },
  {
    "path": "requirements-audio.txt",
    "content": "whisper-mic"
  },
  {
    "path": "requirements.txt",
    "content": "annotated-types==0.6.0\nanyio==3.7.1\ncertifi==2023.7.22\ncharset-normalizer==3.3.2\ncolorama==0.4.6\ncontourpy==1.2.0\ncycler==0.12.1\ndistro==1.8.0\nEasyProcess==1.1\nentrypoint2==1.1\nexceptiongroup==1.1.3\nfonttools==4.44.0\nh11==0.14.0\nhttpcore==1.0.2\nhttpx>=0.25.2\nidna==3.4\nimportlib-resources==6.1.1\nkiwisolver==1.4.5\nmatplotlib==3.8.1\nMouseInfo==0.1.3\nmss==9.0.1\nnumpy==1.26.1\nopenai==1.2.3\npackaging==23.2\nPillow==10.1.0\nprompt-toolkit==3.0.39\nPyAutoGUI==0.9.54\npydantic==2.4.2\npydantic_core==2.10.1\nPyGetWindow==0.0.9\nPyMsgBox==1.0.9\npyparsing==3.1.1\npyperclip==1.8.2\nPyRect==0.2.0\npyscreenshot==3.1\nPyScreeze==0.1.29\npython3-xlib==0.15\npython-dateutil==2.8.2\npython-dotenv==1.0.0\npytweening==1.0.7\nrequests==2.31.0\nrubicon-objc==0.4.7\nsix==1.16.0\nsniffio==1.3.0\ntqdm==4.66.1\ntyping_extensions==4.8.0\nurllib3==2.0.7\nwcwidth==0.2.9\nzipp==3.17.0\ngoogle-generativeai==0.3.0\naiohttp==3.9.1\nultralytics==8.0.227\neasyocr==1.7.1\nollama==0.1.6\nanthropic"
  },
  {
    "path": "setup.py",
    "content": "from setuptools import setup, find_packages\n\n# Read the contents of your requirements.txt file\nwith open(\"requirements.txt\") as f:\n    required = f.read().splitlines()\n\n# Read the contents of your README.md file for the project description\nwith open(\"README.md\", \"r\", encoding=\"utf-8\") as readme_file:\n    long_description = readme_file.read()\n\nsetup(\n    name=\"self-operating-computer\",\n    version=\"1.5.8\",\n    packages=find_packages(),\n    install_requires=required,  # Add dependencies here\n    entry_points={\n        \"console_scripts\": [\n            \"operate=operate.main:main_entry\",\n        ],\n    },\n    package_data={\n        # Include the file in the operate.models.weights package\n        \"operate.models.weights\": [\"best.pt\"],\n    },\n    long_description=long_description,  # Add project description here\n    long_description_content_type=\"text/markdown\",  # Specify Markdown format\n    # include any other necessary setup options here\n)\n"
  }
]