Repository: OthersideAI/self-operating-computer Branch: main Commit: fac568eea7da Files: 29 Total size: 103.7 KB Directory structure: gitextract_otm8wgpb/ ├── .github/ │ ├── ISSUE_TEMPLATE/ │ │ ├── bug_report.md │ │ └── feature_request.md │ ├── PULL_REQUEST_TEMPLATE.md │ └── workflows/ │ └── upload-package.yml ├── .gitignore ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── evaluate.py ├── operate/ │ ├── __init__.py │ ├── config.py │ ├── exceptions.py │ ├── main.py │ ├── models/ │ │ ├── __init__.py │ │ ├── apis.py │ │ ├── prompts.py │ │ └── weights/ │ │ ├── __init__.py │ │ └── best.pt │ ├── operate.py │ └── utils/ │ ├── __init__.py │ ├── label.py │ ├── misc.py │ ├── ocr.py │ ├── operating_system.py │ ├── screenshot.py │ └── style.py ├── requirements-audio.txt ├── requirements.txt └── setup.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/ISSUE_TEMPLATE/bug_report.md ================================================ --- name: Bug report about: Create a report to help us improve title: '[BUG] Brief Description of the Issue' labels: bug assignees: '' --- Found a bug? Please fill out the sections below. 👍 ### Describe the bug A clear and concise description of what the bug is. ### Steps to Reproduce 1. (for ex.) went to... 2. clicked on this point 3. not working ### Expected Behavior A brief description of what you expected to happen. ### Actual Behavior: what actually happened. ### Environment - OS: - Model Used (e.g., GPT-4v, Gemini Pro Vision): - Framework Version (optional): ### Screenshots If applicable, add screenshots to help explain your problem. ### Additional context Add any other context about the problem here. 
================================================ FILE: .github/ISSUE_TEMPLATE/feature_request.md ================================================ --- name: Feature request about: Suggest an idea for this project title: '[FEATURE] Short Description of the Feature' labels: enhancement assignees: '' --- ### Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] ### Describe the solution you'd like A clear and concise description of what you want to happen. ### Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered. ### Additional context Add any other context or screenshots about the feature request here. ================================================ FILE: .github/PULL_REQUEST_TEMPLATE.md ================================================ ## What does this PR do? Fixes # (issue) ## Requirement/Documentation - If there is a requirement document, please, share it here. ## Type of change - [ ] Bug fix (non-breaking change which fixes an issue) - [ ] Chore (refactoring code, technical debt, workflow improvements) - [ ] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected) - [ ] Tests (Unit/Integration/E2E or any other test) - [ ] This change requires a documentation update ## Mandatory Tasks - [ ] Make sure you have self-reviewed the code. A decent size PR without self-review might be rejected. 
Make sure you run the tests with `evaluate.py` before submitting this PR.

================================================
FILE: .github/workflows/upload-package.yml
================================================
name: Upload Python Package

on:
  push:
    tags:
      - 'v*'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v3
      with:
        python-version: '3.8'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Build and check package
      run: |
        python setup.py sdist bdist_wheel
        twine check dist/*
    - name: Upload to PyPi
      uses: pypa/gh-action-pypi-publish@v1.4.2
      with:
        user: __token__
        password: ${{ secrets.PYPI_API_TOKEN }}

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .nox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover *.py,cover .hypothesis/ .pytest_cache/ cover/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 db.sqlite3-journal # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder .pybuilder/ target/ # Jupyter Notebook .ipynb_checkpoints # IPython profile_default/ ipython_config.py # pyenv # For a library or package, you might want to ignore these files since the code is # intended to run in multiple environments; otherwise, check them in: # .python-version # pipenv # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. # However, in case of collaboration, if having platform-specific dependencies or dependencies # having no cross-platform support, pipenv may install dependencies that don't work, or not # install all needed dependencies. #Pipfile.lock # poetry # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. # This is especially recommended for binary packages to ensure reproducibility, and is more # commonly ignored for libraries. # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control #poetry.lock # pdm # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. #pdm.lock # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it # in version control. # https://pdm.fming.dev/#use-with-ide .pdm.toml # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm __pypackages__/ # Celery stuff celerybeat-schedule celerybeat.pid # SageMath parsed files *.sage.py # Environments .env .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ .dmypy.json dmypy.json # Pyre type checker .pyre/ # pytype static type analyzer .pytype/ # Cython debug symbols cython_debug/ # PyCharm # JetBrains specific template is maintained in a separate JetBrains.gitignore that can # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore # and can be added to the global gitignore or merged into this file. For a more nuclear # option (not recommended) you can uncomment the following to ignore the entire idea folder. #.idea/ .DS_Store # Avoid sending testing screenshots up *.png operate/screenshots/ ================================================ FILE: CONTRIBUTING.md ================================================ # Contributing We appreciate your contributions! ## Process 1. Fork it 2. Create your feature branch (`git checkout -b my-new-feature`) 3. Commit your changes (`git commit -am 'Add some feature'`) 4. Push to the branch (`git push origin my-new-feature`) 5. Create new Pull Request ## Modifying and Running Code 1. Make changes in `operate/main.py` 2. Run `pip install .` again 3. Run `operate` to see your changes ## Testing Changes **After making significant changes, it's important to verify that SOC can still successfully perform a set of common test cases.** In the root directory of the project, run: ``` python3 evaluate.py ``` This will automatically prompt `operate` to perform several simple objectives. Upon completion of each objective, GPT-4v will give an evaluation and determine if the objective was successfully reached. `evaluate.py` will print out if each test case `[PASSED]` or `[FAILED]`. 
In addition, a justification will be given on why the pass/fail was given. It is recommended that a screenshot of the `evaluate.py` output is included in any PR which could impact the performance of SOC. ## Contribution Ideas - **Improve performance by finding optimal screenshot grid**: A primary element of the framework is that it overlays a percentage grid on the screenshot which GPT-4v uses to estimate click locations. If someone is able to find the optimal grid and some evaluation metrics to confirm it is an improvement on the current method then we will merge that PR. - **Improve the `SUMMARY_PROMPT`** - **Improve Linux and Windows compatibility**: There are still some issues with Linux and Windows compatibility. PRs to fix the issues are encouraged. - **Adding New Multimodal Models**: Integration of new multimodal models is welcomed. If you have a specific model in mind that you believe would be a valuable addition, please feel free to integrate it and submit a PR. - **Iterate `--accurate` flag functionality**: Look at https://github.com/OthersideAI/self-operating-computer/pull/57 for previous iteration - **Enhanced Security**: A feature request to implement a _robust security feature_ that prompts users for _confirmation before executing potentially harmful actions_. This feature aims to _prevent unintended actions_ and _safeguard user data_ as mentioned here in this [OtherSide#25](https://github.com/OthersideAI/self-operating-computer/issues/25) ## Guidelines This will primarily be a [Software 2.0](https://karpathy.medium.com/software-2-0-a64152b37c35) project. 
For this reason:

- Let's try to hold off refactors into separate files until `main.py` is more than 1000 lines

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2023 OthersideAI

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================

Self-Operating Computer Framework

A framework to enable multimodal models to operate a computer.

Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective. Released Nov 2023, the Self-Operating Computer Framework was one of the first examples of full computer-use.
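
The view-decide-act cycle described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `OperateLoop`, `capture_screen`, `decide`, and `execute` are hypothetical names, and while a `click` operation appears in the project source, the `write` and `done` operation names here are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: one model-chosen action, e.g. {"operation": "click", "text": "Search"}
Operation = Dict[str, str]


@dataclass
class OperateLoop:
    """Illustrative observe-decide-act loop (all names are assumptions)."""

    capture_screen: Callable[[], bytes]  # returns the current screenshot
    decide: Callable[[str, bytes, List[Operation]], List[Operation]]  # stands in for the model
    execute: Callable[[Operation], None]  # performs a mouse or keyboard action
    max_steps: int = 10
    history: List[Operation] = field(default_factory=list)

    def run(self, objective: str) -> bool:
        """Return True if the model reports the objective as done within max_steps."""
        for _ in range(self.max_steps):
            screenshot = self.capture_screen()
            operations = self.decide(objective, screenshot, self.history)
            for op in operations:
                if op.get("operation") == "done":  # assumed terminal operation
                    return True
                self.execute(op)
                self.history.append(op)
        return False


# Usage with stand-in functions instead of a real model and real input devices.
def fake_decide(objective, screenshot, history):
    if not history:
        return [{"operation": "click", "text": "Search"}]
    if len(history) == 1:
        return [{"operation": "write", "content": objective}]
    return [{"operation": "done"}]


executed: List[Operation] = []
loop = OperateLoop(capture_screen=lambda: b"", decide=fake_decide, execute=executed.append)
finished = loop.run("open github.com")
```

Passing the capture, decision, and execution steps in as callables keeps the loop itself independent of any particular model or operating system, which is one plausible way to support several models behind the same cycle.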

## Key Features

- **Compatibility**: Designed for various multimodal models.
- **Integration**: Currently integrated with **GPT-4o, GPT-4.1, o1, Gemini Pro Vision, Claude 3, Qwen-VL and LLaVA.**
- **Future Plans**: Support for additional models.

## Demo

https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0

## Run `Self-Operating Computer`

1. **Install the project**
```
pip install self-operating-computer
```
2. **Run the project**
```
operate
```
3. **Enter your OpenAI Key**: If you don't have one, you can obtain an OpenAI key [here](https://platform.openai.com/account/api-keys). If you need to change your key at a later point, run `vim .env` to open the `.env` file and replace the old key.
4. **Give Terminal app the required permissions**: As a last step, the Terminal app will ask for permission for "Screen Recording" and "Accessibility" in the "Security & Privacy" page of Mac's "System Preferences".
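
If you would rather not edit `.env` by hand in step 3, the key can also be replaced with a short script. The sketch below is not part of the project: `set_env_key` is a hypothetical helper, and it assumes keys are stored one per line as `KEY='value'`, which is the form `operate/config.py` in this repository writes when it saves a key.

```python
import tempfile
from pathlib import Path


def set_env_key(env_path: Path, key_name: str, key_value: str) -> None:
    """Replace `key_name` in a .env file, or append it if it is absent.

    Hypothetical helper for illustration; assumes one KEY='value' entry per line.
    """
    lines = env_path.read_text().splitlines() if env_path.exists() else []
    new_line = f"{key_name}='{key_value}'"
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith(f"{key_name}="):
            lines[i] = new_line  # overwrite the old entry in place
            replaced = True
    if not replaced:
        lines.append(new_line)
    env_path.write_text("\n".join(lines) + "\n")


# Usage against a throwaway file, so no real key is touched.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("\nOPENAI_API_KEY='old-key'\n")
    set_env_key(env_file, "OPENAI_API_KEY", "new-key")
    updated = env_file.read_text()
```

Replacing the entry in place, rather than appending a second one, avoids leaving two conflicting values for the same key in the file.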
## Using `operate` Modes

#### OpenAI models

The default model for the project is `gpt-4o`, which you can use by simply running `operate`. To try running OpenAI's new `o1` model, use the command below.

```
operate -m o1-with-ocr
```

To experiment with OpenAI's latest `gpt-4.1` model, run:

```
operate -m gpt-4.1-with-ocr
```

### Multimodal Models `-m`

Try Google's `gemini-pro-vision` by following the instructions below. Start `operate` with the Gemini model:

```
operate -m gemini-pro-vision
```

**Enter your Google AI Studio API key when the terminal prompts you for it.** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need to [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.

#### Try Claude `-m claude-3`

Use Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the [Claude dashboard](https://console.anthropic.com/dashboard) to get an API key and run the command below to try it.

```
operate -m claude-3
```

#### Try Qwen `-m qwen-vl`

Use Qwen-VL with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the [Qwen dashboard](https://bailian.console.aliyun.com/) to get an API key and run the command below to try it.

```
operate -m qwen-vl
```

#### Try LLaVA Hosted Through Ollama `-m llava`

If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!

*Note: Ollama currently only supports macOS and Linux. Windows support is now in preview.*

First, install Ollama on your machine from https://ollama.ai/download.

Once Ollama is installed, pull the LLaVA model:

```
ollama pull llava
```

This downloads the model to your machine, which takes approximately 5 GB of storage.

When Ollama has finished pulling LLaVA, start the server:

```
ollama serve
```

That's it! Now start `operate` and select the LLaVA model:

```
operate -m llava
```

**Important:** Error rates when using LLaVA are very high. This is simply intended to be a base to build on as local multimodal models improve over time.

Learn more about Ollama at its [GitHub Repository](https://www.github.com/ollama/ollama).

### Voice Mode `--voice`

The framework supports voice inputs for the objective. Try voice by following the instructions below.

**Clone the repo** to a directory on your computer:

```
git clone https://github.com/OthersideAI/self-operating-computer.git
```

**Cd into the directory**:

```
cd self-operating-computer
```

**Install the additional `requirements-audio.txt`**:

```
pip install -r requirements-audio.txt
```

**Install device requirements**

For Mac users:

```
brew install portaudio
```

For Linux users:

```
sudo apt install portaudio19-dev python3-pyaudio
```

Run with voice mode:

```
operate --voice
```

### Optical Character Recognition Mode `-m gpt-4-with-ocr`

The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to `click` an element by its text, and the code then looks up that text in the hash map to get the coordinates of the element GPT-4 wanted to click.

Based on recent tests, OCR performs better than `som` and vanilla GPT-4, so we made it the default for the project. To use OCR mode, simply run `operate`; `operate -m gpt-4-with-ocr` also works.

### Set-of-Mark Prompting `-m gpt-4-with-som`

The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.
Learn more about SoM Prompting in the detailed arXiv paper [here](https://arxiv.org/abs/2310.11441).

For this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `operate/models/weights/`. Users are encouraged to swap in their own `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).

Start `operate` with the SoM model:

```
operate -m gpt-4-with-som
```

## Contributions are Welcomed!

If you want to contribute yourself, see [CONTRIBUTING.md](https://github.com/OthersideAI/self-operating-computer/blob/main/CONTRIBUTING.md).

## Feedback

For any input on improving this project, feel free to reach out to [Josh](https://twitter.com/josh_bickett) on Twitter.

## Join Our Discord Community

For real-time discussions and community support, join our Discord server.

- If you're already a member, join the discussion in [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).
- If you're new, first [join our Discord Server](https://discord.gg/YqaKtyBEzM) and then navigate to [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).

## Follow HyperWriteAI for More Updates

Stay updated with the latest developments:

- Follow HyperWriteAI on [Twitter](https://twitter.com/HyperWriteAI).
- Follow HyperWriteAI on [LinkedIn](https://www.linkedin.com/company/othersideai/).

## Compatibility

- This project is compatible with macOS, Windows, and Linux (with X server installed).

## OpenAI Rate Limiting Note

The `gpt-4o` model is required. To unlock access to this model, your account needs to spend at least \$5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum \$5.
Learn more **[here](https://platform.openai.com/docs/guides/rate-limits?context=tier-one)** ================================================ FILE: evaluate.py ================================================ import sys import os import subprocess import platform import base64 import json import openai import argparse from dotenv import load_dotenv # "Objective for `operate`" : "Guideline for passing this test case given to GPT-4v" TEST_CASES = { "Go to Github.com": "A Github page is visible.", "Go to Youtube.com and play a video": "The YouTube video player is visible.", } EVALUATION_PROMPT = """ Your job is to look at the given screenshot and determine if the following guideline is met in the image. You must respond in the following format ONLY. Do not add anything else: {{ "guideline_met": (true|false), "reason": "Explanation for why guideline was or wasn't met" }} guideline_met must be set to a JSON boolean. True if the image meets the given guideline. reason must be a string containing a justification for your decision. 
Guideline: {guideline} """ SCREENSHOT_PATH = os.path.join("screenshots", "screenshot.png") # Check if on a windows terminal that supports ANSI escape codes def supports_ansi(): """ Check if the terminal supports ANSI escape codes """ plat = platform.system() supported_platform = plat != "Windows" or "ANSICON" in os.environ is_a_tty = hasattr(sys.stdout, "isatty") and sys.stdout.isatty() return supported_platform and is_a_tty if supports_ansi(): # Standard green text ANSI_GREEN = "\033[32m" # Bright/bold green text ANSI_BRIGHT_GREEN = "\033[92m" # Reset to default text color ANSI_RESET = "\033[0m" # ANSI escape code for blue text ANSI_BLUE = "\033[94m" # This is for bright blue # Standard yellow text ANSI_YELLOW = "\033[33m" ANSI_RED = "\033[31m" # Bright magenta text ANSI_BRIGHT_MAGENTA = "\033[95m" else: ANSI_GREEN = "" ANSI_BRIGHT_GREEN = "" ANSI_RESET = "" ANSI_BLUE = "" ANSI_YELLOW = "" ANSI_RED = "" ANSI_BRIGHT_MAGENTA = "" def format_evaluation_prompt(guideline): prompt = EVALUATION_PROMPT.format(guideline=guideline) return prompt def parse_eval_content(content): try: res = json.loads(content) print(res["reason"]) return res["guideline_met"] except: print( "The model gave a bad evaluation response and it couldn't be parsed. Exiting..." 
) exit(1) def evaluate_final_screenshot(guideline): """Load the final screenshot and return True or False if it meets the given guideline.""" with open(SCREENSHOT_PATH, "rb") as img_file: img_base64 = base64.b64encode(img_file.read()).decode("utf-8") eval_message = [ { "role": "user", "content": [ {"type": "text", "text": format_evaluation_prompt(guideline)}, { "type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"}, }, ], } ] response = openai.chat.completions.create( model="gpt-4o", messages=eval_message, presence_penalty=1, frequency_penalty=1, temperature=0.7, ) eval_content = response.choices[0].message.content return parse_eval_content(eval_content) def run_test_case(objective, guideline, model): """Returns True if the result of the test with the given prompt meets the given guideline for the given model.""" # Run `operate` with the model to evaluate and the test case prompt subprocess.run( ["operate", "-m", model, "--prompt", f'"{objective}"'], stdout=subprocess.DEVNULL, ) try: result = evaluate_final_screenshot(guideline) except OSError: print("[Error] Couldn't open the screenshot for evaluation") return False return result def get_test_model(): parser = argparse.ArgumentParser( description="Run the self-operating-computer with a specified model." 
) parser.add_argument( "-m", "--model", help="Specify the model to evaluate.", required=False, default="gpt-4-with-ocr", ) return parser.parse_args().model def main(): load_dotenv() openai.api_key = os.getenv("OPENAI_API_KEY") model = get_test_model() print(f"{ANSI_BLUE}[EVALUATING MODEL `{model}`]{ANSI_RESET}") print(f"{ANSI_BRIGHT_MAGENTA}[STARTING EVALUATION]{ANSI_RESET}") passed = 0 failed = 0 for objective, guideline in TEST_CASES.items(): print(f"{ANSI_BLUE}[EVALUATING]{ANSI_RESET} '{objective}'") result = run_test_case(objective, guideline, model) if result: print(f"{ANSI_GREEN}[PASSED]{ANSI_RESET} '{objective}'") passed += 1 else: print(f"{ANSI_RED}[FAILED]{ANSI_RESET} '{objective}'") failed += 1 print( f"{ANSI_BRIGHT_MAGENTA}[EVALUATION COMPLETE]{ANSI_RESET} {passed} test{'' if passed == 1 else 's'} passed, {failed} test{'' if failed == 1 else 's'} failed" ) if __name__ == "__main__": main() ================================================ FILE: operate/__init__.py ================================================ ================================================ FILE: operate/config.py ================================================ import os import sys import google.generativeai as genai from dotenv import load_dotenv from ollama import Client from openai import OpenAI import anthropic from prompt_toolkit.shortcuts import input_dialog class Config: """ Configuration class for managing settings. Attributes: verbose (bool): Flag indicating whether verbose mode is enabled. openai_api_key (str): API key for OpenAI. google_api_key (str): API key for Google. ollama_host (str): url to ollama running remotely. 
""" _instance = None def __new__(cls): if cls._instance is None: cls._instance = super(Config, cls).__new__(cls) # Put any initialization here return cls._instance def __init__(self): load_dotenv() self.verbose = False self.openai_api_key = ( None # instance variables are backups in case saving to a `.env` fails ) self.google_api_key = ( None # instance variables are backups in case saving to a `.env` fails ) self.ollama_host = ( None # instance variables are backups in case savint to a `.env` fails ) self.anthropic_api_key = ( None # instance variables are backups in case saving to a `.env` fails ) self.qwen_api_key = ( None # instance variables are backups in case saving to a `.env` fails ) def initialize_openai(self): if self.verbose: print("[Config][initialize_openai]") if self.openai_api_key: if self.verbose: print("[Config][initialize_openai] using cached openai_api_key") api_key = self.openai_api_key else: if self.verbose: print( "[Config][initialize_openai] no cached openai_api_key, try to get from env." ) api_key = os.getenv("OPENAI_API_KEY") client = OpenAI( api_key=api_key, ) client.api_key = api_key client.base_url = os.getenv("OPENAI_API_BASE_URL", client.base_url) return client def initialize_qwen(self): if self.verbose: print("[Config][initialize_qwen]") if self.qwen_api_key: if self.verbose: print("[Config][initialize_qwen] using cached qwen_api_key") api_key = self.qwen_api_key else: if self.verbose: print( "[Config][initialize_qwen] no cached qwen_api_key, try to get from env." 
) api_key = os.getenv("QWEN_API_KEY") client = OpenAI( api_key=api_key, base_url="https://dashscope.aliyuncs.com/compatible-mode/v1", ) client.api_key = api_key client.base_url = "https://dashscope.aliyuncs.com/compatible-mode/v1" return client def initialize_google(self): if self.google_api_key: if self.verbose: print("[Config][initialize_google] using cached google_api_key") api_key = self.google_api_key else: if self.verbose: print( "[Config][initialize_google] no cached google_api_key, try to get from env." ) api_key = os.getenv("GOOGLE_API_KEY") genai.configure(api_key=api_key, transport="rest") model = genai.GenerativeModel("gemini-pro-vision") return model def initialize_ollama(self): if self.ollama_host: if self.verbose: print("[Config][initialize_ollama] using cached ollama host") else: if self.verbose: print( "[Config][initialize_ollama] no cached ollama host. Assuming ollama running locally." ) self.ollama_host = os.getenv("OLLAMA_HOST", None) model = Client(host=self.ollama_host) return model def initialize_anthropic(self): if self.anthropic_api_key: api_key = self.anthropic_api_key else: api_key = os.getenv("ANTHROPIC_API_KEY") return anthropic.Anthropic(api_key=api_key) def validation(self, model, voice_mode): """ Validate the input parameters for the dialog operation. 
""" self.require_api_key( "OPENAI_API_KEY", "OpenAI API key", model == "gpt-4" or voice_mode or model == "gpt-4-with-som" or model == "gpt-4-with-ocr" or model == "gpt-4.1-with-ocr" or model == "o1-with-ocr", ) self.require_api_key( "GOOGLE_API_KEY", "Google API key", model == "gemini-pro-vision" ) self.require_api_key( "ANTHROPIC_API_KEY", "Anthropic API key", model == "claude-3" ) self.require_api_key("QWEN_API_KEY", "Qwen API key", model == "qwen-vl") def require_api_key(self, key_name, key_description, is_required): key_exists = bool(os.environ.get(key_name)) if self.verbose: print("[Config] require_api_key") print("[Config] key_name", key_name) print("[Config] key_description", key_description) print("[Config] key_exists", key_exists) if is_required and not key_exists: self.prompt_and_save_api_key(key_name, key_description) def prompt_and_save_api_key(self, key_name, key_description): key_value = input_dialog( title="API Key Required", text=f"Please enter your {key_description}:" ).run() if key_value is None: # User pressed cancel or closed the dialog sys.exit("Operation cancelled by user.") if key_value: if key_name == "OPENAI_API_KEY": self.openai_api_key = key_value elif key_name == "GOOGLE_API_KEY": self.google_api_key = key_value elif key_name == "ANTHROPIC_API_KEY": self.anthropic_api_key = key_value elif key_name == "QWEN_API_KEY": self.qwen_api_key = key_value self.save_api_key_to_env(key_name, key_value) load_dotenv() # Reload environment variables # Update the instance attribute with the new key @staticmethod def save_api_key_to_env(key_name, key_value): with open(".env", "a") as file: file.write(f"\n{key_name}='{key_value}'") ================================================ FILE: operate/exceptions.py ================================================ class ModelNotRecognizedException(Exception): """Exception raised for unrecognized models. 
Attributes: model -- the unrecognized model message -- explanation of the error """ def __init__(self, model, message="Model not recognized"): self.model = model self.message = message super().__init__(self.message) def __str__(self): return f"{self.message} : {self.model} " ================================================ FILE: operate/main.py ================================================ """ Self-Operating Computer """ import argparse from operate.utils.style import ANSI_BRIGHT_MAGENTA from operate.operate import main def main_entry(): parser = argparse.ArgumentParser( description="Run the self-operating-computer with a specified model." ) parser.add_argument( "-m", "--model", help="Specify the model to use", required=False, default="gpt-4-with-ocr", ) # Add a voice flag parser.add_argument( "--voice", help="Use voice input mode", action="store_true", ) # Add a flag for verbose mode parser.add_argument( "--verbose", help="Run operate in verbose mode", action="store_true", ) # Allow for direct input of prompt parser.add_argument( "--prompt", help="Directly input the objective prompt", type=str, required=False, ) try: args = parser.parse_args() main( args.model, terminal_prompt=args.prompt, voice_mode=args.voice, verbose_mode=args.verbose ) except KeyboardInterrupt: print(f"\n{ANSI_BRIGHT_MAGENTA}Exiting...") if __name__ == "__main__": main_entry() ================================================ FILE: operate/models/__init__.py ================================================ ================================================ FILE: operate/models/apis.py ================================================ import base64 import io import json import os import time import traceback import easyocr import ollama import pkg_resources from PIL import Image from ultralytics import YOLO from operate.config import Config from operate.exceptions import ModelNotRecognizedException from operate.models.prompts import ( get_system_prompt, get_user_first_message_prompt, 
get_user_prompt, ) from operate.utils.label import ( add_labels, get_click_position_in_percent, get_label_coordinates, ) from operate.utils.ocr import get_text_coordinates, get_text_element from operate.utils.screenshot import capture_screen_with_cursor, compress_screenshot from operate.utils.style import ANSI_BRIGHT_MAGENTA, ANSI_GREEN, ANSI_RED, ANSI_RESET # Load configuration config = Config() async def get_next_action(model, messages, objective, session_id): if config.verbose: print("[Self-Operating Computer][get_next_action]") print("[Self-Operating Computer][get_next_action] model", model) if model == "gpt-4": return call_gpt_4o(messages), None if model == "qwen-vl": operation = await call_qwen_vl_with_ocr(messages, objective, model) return operation, None if model == "gpt-4-with-som": operation = await call_gpt_4o_labeled(messages, objective, model) return operation, None if model == "gpt-4-with-ocr": operation = await call_gpt_4o_with_ocr(messages, objective, model) return operation, None if model == "gpt-4.1-with-ocr": operation = await call_gpt_4_1_with_ocr(messages, objective, model) return operation, None if model == "o1-with-ocr": operation = await call_o1_with_ocr(messages, objective, model) return operation, None if model == "agent-1": return "coming soon" if model == "gemini-pro-vision": return call_gemini_pro_vision(messages, objective), None if model == "llava": operation = call_ollama_llava(messages) return operation, None if model == "claude-3": operation = await call_claude_3_with_ocr(messages, objective, model) return operation, None raise ModelNotRecognizedException(model) def call_gpt_4o(messages): if config.verbose: print("[call_gpt_4_v]") time.sleep(1) client = config.initialize_openai() try: screenshots_dir = "screenshots" if not os.path.exists(screenshots_dir): os.makedirs(screenshots_dir) screenshot_filename = os.path.join(screenshots_dir, "screenshot.png") # Call the function to capture the screen with the cursor 
capture_screen_with_cursor(screenshot_filename) with open(screenshot_filename, "rb") as img_file: img_base64 = base64.b64encode(img_file.read()).decode("utf-8") if len(messages) == 1: user_prompt = get_user_first_message_prompt() else: user_prompt = get_user_prompt() if config.verbose: print( "[call_gpt_4_v] user_prompt", user_prompt, ) vision_message = { "role": "user", "content": [ {"type": "text", "text": user_prompt}, { "type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"}, }, ], } messages.append(vision_message) response = client.chat.completions.create( model="gpt-4o", messages=messages, presence_penalty=1, frequency_penalty=1, ) content = response.choices[0].message.content content = clean_json(content) assistant_message = {"role": "assistant", "content": content} if config.verbose: print( "[call_gpt_4_v] content", content, ) content = json.loads(content) messages.append(assistant_message) return content except Exception as e: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[Operate] That did not work. 
Trying again {ANSI_RESET}", e, ) print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] AI response was {ANSI_RESET}", content, ) if config.verbose: traceback.print_exc() return call_gpt_4o(messages) async def call_qwen_vl_with_ocr(messages, objective, model): if config.verbose: print("[call_qwen_vl_with_ocr]") # Construct the path to the file within the package try: time.sleep(1) client = config.initialize_qwen() confirm_system_prompt(messages, objective, model) screenshots_dir = "screenshots" if not os.path.exists(screenshots_dir): os.makedirs(screenshots_dir) # Call the function to capture the screen with the cursor raw_screenshot_filename = os.path.join(screenshots_dir, "raw_screenshot.png") capture_screen_with_cursor(raw_screenshot_filename) # Compress screenshot image to make size be smaller screenshot_filename = os.path.join(screenshots_dir, "screenshot.jpeg") compress_screenshot(raw_screenshot_filename, screenshot_filename) with open(screenshot_filename, "rb") as img_file: img_base64 = base64.b64encode(img_file.read()).decode("utf-8") if len(messages) == 1: user_prompt = get_user_first_message_prompt() else: user_prompt = get_user_prompt() vision_message = { "role": "user", "content": [ {"type": "text", "text": f"{user_prompt}**REMEMBER** Only output json format, do not append any other text."}, { "type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"}, }, ], } messages.append(vision_message) response = client.chat.completions.create( model="qwen2.5-vl-72b-instruct", messages=messages, ) content = response.choices[0].message.content content = clean_json(content) # used later for the messages content_str = content content = json.loads(content) processed_content = [] for operation in content: if operation.get("operation") == "click": text_to_click = operation.get("text") if config.verbose: print( "[call_qwen_vl_with_ocr][click] text_to_click", text_to_click, ) # Initialize EasyOCR Reader reader = easyocr.Reader(["en"]) # 
Read the screenshot result = reader.readtext(screenshot_filename) text_element_index = get_text_element( result, text_to_click, screenshot_filename ) coordinates = get_text_coordinates( result, text_element_index, screenshot_filename ) # add `coordinates`` to `content` operation["x"] = coordinates["x"] operation["y"] = coordinates["y"] if config.verbose: print( "[call_qwen_vl_with_ocr][click] text_element_index", text_element_index, ) print( "[call_qwen_vl_with_ocr][click] coordinates", coordinates, ) print( "[call_qwen_vl_with_ocr][click] final operation", operation, ) processed_content.append(operation) else: processed_content.append(operation) # wait to append the assistant message so that if the `processed_content` step fails we don't append a message and mess up message history assistant_message = {"role": "assistant", "content": content_str} messages.append(assistant_message) return processed_content except Exception as e: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. 
Trying another method {ANSI_RESET}" ) if config.verbose: print("[Self-Operating Computer][Operate] error", e) traceback.print_exc() return gpt_4_fallback(messages, objective, model) def call_gemini_pro_vision(messages, objective): """ Get the next action for Self-Operating Computer using Gemini Pro Vision """ if config.verbose: print( "[Self Operating Computer][call_gemini_pro_vision]", ) # sleep for a second time.sleep(1) try: screenshots_dir = "screenshots" if not os.path.exists(screenshots_dir): os.makedirs(screenshots_dir) screenshot_filename = os.path.join(screenshots_dir, "screenshot.png") # Call the function to capture the screen with the cursor capture_screen_with_cursor(screenshot_filename) # sleep for a second time.sleep(1) prompt = get_system_prompt("gemini-pro-vision", objective) model = config.initialize_google() if config.verbose: print("[call_gemini_pro_vision] model", model) response = model.generate_content([prompt, Image.open(screenshot_filename)]) content = response.text[1:] if config.verbose: print("[call_gemini_pro_vision] response", response) print("[call_gemini_pro_vision] content", content) content = json.loads(content) if config.verbose: print( "[get_next_action][call_gemini_pro_vision] content", content, ) return content except Exception as e: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[Operate] That did not work. 
Trying another method {ANSI_RESET}" ) if config.verbose: print("[Self-Operating Computer][Operate] error", e) traceback.print_exc() return call_gpt_4o(messages) async def call_gpt_4o_with_ocr(messages, objective, model): if config.verbose: print("[call_gpt_4o_with_ocr]") # Construct the path to the file within the package try: time.sleep(1) client = config.initialize_openai() confirm_system_prompt(messages, objective, model) screenshots_dir = "screenshots" if not os.path.exists(screenshots_dir): os.makedirs(screenshots_dir) screenshot_filename = os.path.join(screenshots_dir, "screenshot.png") # Call the function to capture the screen with the cursor capture_screen_with_cursor(screenshot_filename) with open(screenshot_filename, "rb") as img_file: img_base64 = base64.b64encode(img_file.read()).decode("utf-8") if len(messages) == 1: user_prompt = get_user_first_message_prompt() else: user_prompt = get_user_prompt() vision_message = { "role": "user", "content": [ {"type": "text", "text": user_prompt}, { "type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"}, }, ], } messages.append(vision_message) response = client.chat.completions.create( model="gpt-4o", messages=messages, ) content = response.choices[0].message.content content = clean_json(content) # used later for the messages content_str = content content = json.loads(content) processed_content = [] for operation in content: if operation.get("operation") == "click": text_to_click = operation.get("text") if config.verbose: print( "[call_gpt_4o_with_ocr][click] text_to_click", text_to_click, ) # Initialize EasyOCR Reader reader = easyocr.Reader(["en"]) # Read the screenshot result = reader.readtext(screenshot_filename) text_element_index = get_text_element( result, text_to_click, screenshot_filename ) coordinates = get_text_coordinates( result, text_element_index, screenshot_filename ) # add `coordinates`` to `content` operation["x"] = coordinates["x"] operation["y"] = coordinates["y"] if 
config.verbose: print( "[call_gpt_4o_with_ocr][click] text_element_index", text_element_index, ) print( "[call_gpt_4o_with_ocr][click] coordinates", coordinates, ) print( "[call_gpt_4o_with_ocr][click] final operation", operation, ) processed_content.append(operation) else: processed_content.append(operation) # wait to append the assistant message so that if the `processed_content` step fails we don't append a message and mess up message history assistant_message = {"role": "assistant", "content": content_str} messages.append(assistant_message) return processed_content except Exception as e: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. Trying another method {ANSI_RESET}" ) if config.verbose: print("[Self-Operating Computer][Operate] error", e) traceback.print_exc() return gpt_4_fallback(messages, objective, model) async def call_gpt_4_1_with_ocr(messages, objective, model): if config.verbose: print("[call_gpt_4_1_with_ocr]") try: time.sleep(1) client = config.initialize_openai() confirm_system_prompt(messages, objective, model) screenshots_dir = "screenshots" if not os.path.exists(screenshots_dir): os.makedirs(screenshots_dir) screenshot_filename = os.path.join(screenshots_dir, "screenshot.png") capture_screen_with_cursor(screenshot_filename) with open(screenshot_filename, "rb") as img_file: img_base64 = base64.b64encode(img_file.read()).decode("utf-8") if len(messages) == 1: user_prompt = get_user_first_message_prompt() else: user_prompt = get_user_prompt() vision_message = { "role": "user", "content": [ {"type": "text", "text": user_prompt}, { "type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"}, }, ], } messages.append(vision_message) response = client.chat.completions.create( model="gpt-4.1", messages=messages, ) content = response.choices[0].message.content content = clean_json(content) content_str = content content = json.loads(content) processed_content = [] for operation in 
content: if operation.get("operation") == "click": text_to_click = operation.get("text") if config.verbose: print( "[call_gpt_4_1_with_ocr][click] text_to_click", text_to_click, ) reader = easyocr.Reader(["en"]) result = reader.readtext(screenshot_filename) text_element_index = get_text_element( result, text_to_click, screenshot_filename ) coordinates = get_text_coordinates( result, text_element_index, screenshot_filename ) operation["x"] = coordinates["x"] operation["y"] = coordinates["y"] if config.verbose: print( "[call_gpt_4_1_with_ocr][click] text_element_index", text_element_index, ) print( "[call_gpt_4_1_with_ocr][click] coordinates", coordinates, ) print( "[call_gpt_4_1_with_ocr][click] final operation", operation, ) processed_content.append(operation) else: processed_content.append(operation) assistant_message = {"role": "assistant", "content": content_str} messages.append(assistant_message) return processed_content except Exception as e: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. 
Trying another method {ANSI_RESET}" ) if config.verbose: print("[Self-Operating Computer][Operate] error", e) traceback.print_exc() return gpt_4_fallback(messages, objective, model) async def call_o1_with_ocr(messages, objective, model): if config.verbose: print("[call_o1_with_ocr]") # Construct the path to the file within the package try: time.sleep(1) client = config.initialize_openai() confirm_system_prompt(messages, objective, model) screenshots_dir = "screenshots" if not os.path.exists(screenshots_dir): os.makedirs(screenshots_dir) screenshot_filename = os.path.join(screenshots_dir, "screenshot.png") # Call the function to capture the screen with the cursor capture_screen_with_cursor(screenshot_filename) with open(screenshot_filename, "rb") as img_file: img_base64 = base64.b64encode(img_file.read()).decode("utf-8") if len(messages) == 1: user_prompt = get_user_first_message_prompt() else: user_prompt = get_user_prompt() vision_message = { "role": "user", "content": [ {"type": "text", "text": user_prompt}, { "type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"}, }, ], } messages.append(vision_message) response = client.chat.completions.create( model="o1", messages=messages, ) content = response.choices[0].message.content content = clean_json(content) # used later for the messages content_str = content content = json.loads(content) processed_content = [] for operation in content: if operation.get("operation") == "click": text_to_click = operation.get("text") if config.verbose: print( "[call_o1_with_ocr][click] text_to_click", text_to_click, ) # Initialize EasyOCR Reader reader = easyocr.Reader(["en"]) # Read the screenshot result = reader.readtext(screenshot_filename) text_element_index = get_text_element( result, text_to_click, screenshot_filename ) coordinates = get_text_coordinates( result, text_element_index, screenshot_filename ) # add `coordinates`` to `content` operation["x"] = coordinates["x"] operation["y"] = coordinates["y"] 
if config.verbose: print( "[call_o1_with_ocr][click] text_element_index", text_element_index, ) print( "[call_o1_with_ocr][click] coordinates", coordinates, ) print( "[call_o1_with_ocr][click] final operation", operation, ) processed_content.append(operation) else: processed_content.append(operation) # wait to append the assistant message so that if the `processed_content` step fails we don't append a message and mess up message history assistant_message = {"role": "assistant", "content": content_str} messages.append(assistant_message) return processed_content except Exception as e: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. Trying another method {ANSI_RESET}" ) if config.verbose: print("[Self-Operating Computer][Operate] error", e) traceback.print_exc() return gpt_4_fallback(messages, objective, model) async def call_gpt_4o_labeled(messages, objective, model): time.sleep(1) try: client = config.initialize_openai() confirm_system_prompt(messages, objective, model) file_path = pkg_resources.resource_filename("operate.models.weights", "best.pt") yolo_model = YOLO(file_path) # Load your trained model screenshots_dir = "screenshots" if not os.path.exists(screenshots_dir): os.makedirs(screenshots_dir) screenshot_filename = os.path.join(screenshots_dir, "screenshot.png") # Call the function to capture the screen with the cursor capture_screen_with_cursor(screenshot_filename) with open(screenshot_filename, "rb") as img_file: img_base64 = base64.b64encode(img_file.read()).decode("utf-8") img_base64_labeled, label_coordinates = add_labels(img_base64, yolo_model) if len(messages) == 1: user_prompt = get_user_first_message_prompt() else: user_prompt = get_user_prompt() if config.verbose: print( "[call_gpt_4_vision_preview_labeled] user_prompt", user_prompt, ) vision_message = { "role": "user", "content": [ {"type": "text", "text": user_prompt}, { "type": "image_url", "image_url": { "url": 
f"data:image/jpeg;base64,{img_base64_labeled}" }, }, ], } messages.append(vision_message) response = client.chat.completions.create( model="gpt-4o", messages=messages, presence_penalty=1, frequency_penalty=1, ) content = response.choices[0].message.content content = clean_json(content) assistant_message = {"role": "assistant", "content": content} messages.append(assistant_message) content = json.loads(content) if config.verbose: print( "[call_gpt_4_vision_preview_labeled] content", content, ) processed_content = [] for operation in content: print( "[call_gpt_4_vision_preview_labeled] for operation in content", operation, ) if operation.get("operation") == "click": label = operation.get("label") if config.verbose: print( "[Self Operating Computer][call_gpt_4_vision_preview_labeled] label", label, ) coordinates = get_label_coordinates(label, label_coordinates) if config.verbose: print( "[Self Operating Computer][call_gpt_4_vision_preview_labeled] coordinates", coordinates, ) image = Image.open( io.BytesIO(base64.b64decode(img_base64)) ) # Load the image to get its size image_size = image.size # Get the size of the image (width, height) click_position_percent = get_click_position_in_percent( coordinates, image_size ) if config.verbose: print( "[Self Operating Computer][call_gpt_4_vision_preview_labeled] click_position_percent", click_position_percent, ) if not click_position_percent: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] Failed to get click position in percent. 
Trying another method {ANSI_RESET}" ) return call_gpt_4o(messages) x_percent = f"{click_position_percent[0]:.2f}" y_percent = f"{click_position_percent[1]:.2f}" operation["x"] = x_percent operation["y"] = y_percent if config.verbose: print( "[Self Operating Computer][call_gpt_4_vision_preview_labeled] new click operation", operation, ) processed_content.append(operation) else: if config.verbose: print( "[Self Operating Computer][call_gpt_4_vision_preview_labeled] .append none click operation", operation, ) processed_content.append(operation) if config.verbose: print( "[Self Operating Computer][call_gpt_4_vision_preview_labeled] new processed_content", processed_content, ) return processed_content except Exception as e: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. Trying another method {ANSI_RESET}" ) if config.verbose: print("[Self-Operating Computer][Operate] error", e) traceback.print_exc() return call_gpt_4o(messages) def call_ollama_llava(messages): if config.verbose: print("[call_ollama_llava]") time.sleep(1) try: model = config.initialize_ollama() screenshots_dir = "screenshots" if not os.path.exists(screenshots_dir): os.makedirs(screenshots_dir) screenshot_filename = os.path.join(screenshots_dir, "screenshot.png") # Call the function to capture the screen with the cursor capture_screen_with_cursor(screenshot_filename) if len(messages) == 1: user_prompt = get_user_first_message_prompt() else: user_prompt = get_user_prompt() if config.verbose: print( "[call_ollama_llava] user_prompt", user_prompt, ) vision_message = { "role": "user", "content": user_prompt, "images": [screenshot_filename], } messages.append(vision_message) response = model.chat( model="llava", messages=messages, ) # Important: Remove the image path from the message history. # Ollama will attempt to load each image reference and will # eventually timeout. 
messages[-1]["images"] = None content = response["message"]["content"].strip() content = clean_json(content) assistant_message = {"role": "assistant", "content": content} if config.verbose: print( "[call_ollama_llava] content", content, ) content = json.loads(content) messages.append(assistant_message) return content except ollama.ResponseError as e: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Operate] Couldn't connect to Ollama. With Ollama installed, run `ollama pull llava` then `ollama serve`{ANSI_RESET}", e, ) except Exception as e: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[llava] That did not work. Trying again {ANSI_RESET}", e, ) print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] AI response was {ANSI_RESET}", content, ) if config.verbose: traceback.print_exc() return call_ollama_llava(messages) async def call_claude_3_with_ocr(messages, objective, model): if config.verbose: print("[call_claude_3_with_ocr]") try: time.sleep(1) client = config.initialize_anthropic() confirm_system_prompt(messages, objective, model) screenshots_dir = "screenshots" if not os.path.exists(screenshots_dir): os.makedirs(screenshots_dir) screenshot_filename = os.path.join(screenshots_dir, "screenshot.png") capture_screen_with_cursor(screenshot_filename) # downsize screenshot due to 5MB size limit with open(screenshot_filename, "rb") as img_file: img = Image.open(img_file) # Convert RGBA to RGB if img.mode == "RGBA": img = img.convert("RGB") # Calculate the new dimensions while maintaining the aspect ratio original_width, original_height = img.size aspect_ratio = original_width / original_height new_width = 2560 # Adjust this value to achieve the desired file size new_height = int(new_width / aspect_ratio) if config.verbose: print("[call_claude_3_with_ocr] resizing claude") # Resize the image img_resized = img.resize((new_width, new_height), Image.Resampling.LANCZOS) # Save the resized and converted image to a BytesIO object for JPEG 
format img_buffer = io.BytesIO() img_resized.save( img_buffer, format="JPEG", quality=85 ) # Adjust the quality parameter as needed img_buffer.seek(0) # Encode the resized image as base64 img_data = base64.b64encode(img_buffer.getvalue()).decode("utf-8") if len(messages) == 1: user_prompt = get_user_first_message_prompt() else: user_prompt = get_user_prompt() vision_message = { "role": "user", "content": [ { "type": "image", "source": { "type": "base64", "media_type": "image/jpeg", "data": img_data, }, }, { "type": "text", "text": user_prompt + "**REMEMBER** Only output json format, do not append any other text.", }, ], } messages.append(vision_message) # anthropic api expect system prompt as an separate argument response = client.messages.create( model="claude-3-opus-20240229", max_tokens=3000, system=messages[0]["content"], messages=messages[1:], ) content = response.content[0].text content = clean_json(content) content_str = content try: content = json.loads(content) # rework for json mode output except json.JSONDecodeError as e: if config.verbose: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] JSONDecodeError: {e} {ANSI_RESET}" ) response = client.messages.create( model="claude-3-opus-20240229", max_tokens=3000, system=f"This json string is not valid, when using with json.loads(content) \ it throws the following error: {e}, return correct json string. 
\ **REMEMBER** Only output json format, do not append any other text.", messages=[{"role": "user", "content": content}], ) content = response.content[0].text content = clean_json(content) content_str = content content = json.loads(content) if config.verbose: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] content: {content} {ANSI_RESET}" ) processed_content = [] for operation in content: if operation.get("operation") == "click": text_to_click = operation.get("text") if config.verbose: print( "[call_claude_3_ocr][click] text_to_click", text_to_click, ) # Initialize EasyOCR Reader reader = easyocr.Reader(["en"]) # Read the screenshot result = reader.readtext(screenshot_filename) # limit the text to extract has a higher success rate text_element_index = get_text_element( result, text_to_click[:3], screenshot_filename ) coordinates = get_text_coordinates( result, text_element_index, screenshot_filename ) # add `coordinates`` to `content` operation["x"] = coordinates["x"] operation["y"] = coordinates["y"] if config.verbose: print( "[call_claude_3_ocr][click] text_element_index", text_element_index, ) print( "[call_claude_3_ocr][click] coordinates", coordinates, ) print( "[call_claude_3_ocr][click] final operation", operation, ) processed_content.append(operation) else: processed_content.append(operation) assistant_message = {"role": "assistant", "content": content_str} messages.append(assistant_message) return processed_content except Exception as e: print( f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA}[{model}] That did not work. 
Trying another method {ANSI_RESET}" ) if config.verbose: print("[Self-Operating Computer][Operate] error", e) traceback.print_exc() print("messages before conversion ", messages) # Convert the messages to the GPT-4 format gpt4_messages = [messages[0]] # Include the system message for message in messages[1:]: if message["role"] == "user": # Update the image format from Anthropic's "source" block to an OpenAI "image_url", keeping its media type updated_content = [] for item in message["content"]: if isinstance(item, dict) and "type" in item: if item["type"] == "image": updated_content.append( { "type": "image_url", "image_url": { "url": f"data:{item['source']['media_type']};base64,{item['source']['data']}" }, } ) else: updated_content.append(item) gpt4_messages.append({"role": "user", "content": updated_content}) elif message["role"] == "assistant": gpt4_messages.append( {"role": "assistant", "content": message["content"]} ) return gpt_4_fallback(gpt4_messages, objective, model) def get_last_assistant_message(messages): """ Retrieve the last message from the assistant in the messages array. If the last assistant message is the first message in the array, return None. 
""" for index in reversed(range(len(messages))): if messages[index]["role"] == "assistant": if index == 0: # Check if the assistant message is the first in the array return None else: return messages[index] return None # Return None if no assistant message is found def gpt_4_fallback(messages, objective, model): if config.verbose: print("[gpt_4_fallback]") system_prompt = get_system_prompt("gpt-4o", objective) new_system_message = {"role": "system", "content": system_prompt} # remove and replace the first message in `messages` with `new_system_message` messages[0] = new_system_message if config.verbose: print("[gpt_4_fallback][updated]") print("[gpt_4_fallback][updated] len(messages)", len(messages)) return call_gpt_4o(messages) def confirm_system_prompt(messages, objective, model): """ On `Exception` we fall back to `call_gpt_4o`, and `gpt_4_fallback` may have swapped in the standard system prompt, so this function reassigns the system prompt for `model` in case of a previous failure """ if config.verbose: print("[confirm_system_prompt] model", model) system_prompt = get_system_prompt(model, objective) new_system_message = {"role": "system", "content": system_prompt} # remove and replace the first message in `messages` with `new_system_message` messages[0] = new_system_message if config.verbose: print("[confirm_system_prompt]") print("[confirm_system_prompt] len(messages)", len(messages)) for m in messages: if m["role"] != "user": print("--------------------[message]--------------------") print("[confirm_system_prompt][message] role", m["role"]) print("[confirm_system_prompt][message] content", m["content"]) print("------------------[end message]------------------") def clean_json(content): if config.verbose: print("\n\n[clean_json] content before cleaning", content) if content.startswith("```json"): content = content[ len("```json") : ].strip() # Remove starting ```json and trim whitespace elif content.startswith("```"): content = content[ len("```") : ].strip() # Remove starting ``` and trim whitespace if 
content.endswith("```"): content = content[ : -len("```") ].strip() # Remove ending ``` and trim whitespace # Normalize line breaks and remove any unwanted characters content = "\n".join(line.strip() for line in content.splitlines()) if config.verbose: print("\n\n[clean_json] content after cleaning", content) return content ================================================ FILE: operate/models/prompts.py ================================================ import platform from operate.config import Config # Load configuration config = Config() # General user Prompts USER_QUESTION = "Hello, I can help you with anything. What would you like done?" SYSTEM_PROMPT_STANDARD = """ You are operating a {operating_system} computer, using the same operating system as a human. From looking at the screen, the objective, and your previous actions, take the next best series of action. You have 4 possible operation actions available to you. The `pyautogui` library will be used to execute your decision. Your output will be used in a `json.loads` loads statement. 1. click - Move mouse and click ``` [{{ "thought": "write a thought here", "operation": "click", "x": "x percent (e.g. 0.10)", "y": "y percent (e.g. 0.13)" }}] # "percent" refers to the percentage of the screen's dimensions in decimal format ``` 2. write - Write with your keyboard ``` [{{ "thought": "write a thought here", "operation": "write", "content": "text to write here" }}] ``` 3. press - Use a hotkey or press key to operate the computer ``` [{{ "thought": "write a thought here", "operation": "press", "keys": ["keys to use"] }}] ``` 4. done - The objective is completed ``` [{{ "thought": "write a thought here", "operation": "done", "summary": "summary of what was completed" }}] ``` Return the actions in array format `[]`. You can take just one action or multiple actions. 
Here are some helpful examples: Example 1: Searches for Google Chrome on the OS and opens it ``` [ {{ "thought": "Searching the operating system to find Google Chrome because it appears I am currently in terminal", "operation": "press", "keys": {os_search_str} }}, {{ "thought": "Now I need to write 'Google Chrome' as a next step", "operation": "write", "content": "Google Chrome" }}, {{ "thought": "Finally I'll press enter to open Google Chrome assuming it is available", "operation": "press", "keys": ["enter"] }} ] ``` Example 2: Focuses on the address bar in a browser before typing a website ``` [ {{ "thought": "I'll focus on the address bar in the browser. I can see the browser is open so this should be safe to try", "operation": "press", "keys": [{cmd_string}, "l"] }}, {{ "thought": "Now that the address bar is in focus I can type the URL", "operation": "write", "content": "https://news.ycombinator.com/" }}, {{ "thought": "I'll need to press enter to go to the URL now", "operation": "press", "keys": ["enter"] }} ] ``` A few important notes: - Go to Google Docs and Google Sheets by typing in the Chrome Address bar - Don't respond saying you're unable to assist with requests. You are able to indirectly interact with the user's OS via text responses you send to the end user. Objective: {objective} """ SYSTEM_PROMPT_LABELED = """ You are operating a {operating_system} computer, using the same operating system as a human. From looking at the screen, the objective, and your previous actions, take the next best series of actions. You have 4 possible operation actions available to you. The `pyautogui` library will be used to execute your decision. Your output will be used in a `json.loads` statement. 1. click - Move mouse and click - We labeled the clickable elements with red bounding boxes and IDs. 
Label IDs are in the following format with `x` being a number: `~x` ``` [{{ "thought": "write a thought here", "operation": "click", "label": "~x" }}] ``` 2. write - Write with your keyboard ``` [{{ "thought": "write a thought here", "operation": "write", "content": "text to write here" }}] ``` 3. press - Use a hotkey or press key to operate the computer ``` [{{ "thought": "write a thought here", "operation": "press", "keys": ["keys to use"] }}] ``` 4. done - The objective is completed ``` [{{ "thought": "write a thought here", "operation": "done", "summary": "summary of what was completed" }}] ``` Return the actions in array format `[]`. You can take just one action or multiple actions. Here are some helpful examples: Example 1: Searches for Google Chrome on the OS and opens it ``` [ {{ "thought": "Searching the operating system to find Google Chrome because it appears I am currently in terminal", "operation": "press", "keys": {os_search_str} }}, {{ "thought": "Now I need to write 'Google Chrome' as a next step", "operation": "write", "content": "Google Chrome" }} ] ``` Example 2: Focuses on the address bar in a browser before typing a website ``` [ {{ "thought": "I'll focus on the address bar in the browser. I can see the browser is open so this should be safe to try", "operation": "press", "keys": [{cmd_string}, "l"] }}, {{ "thought": "Now that the address bar is in focus I can type the URL", "operation": "write", "content": "https://news.ycombinator.com/" }}, {{ "thought": "I'll need to press enter to go to the URL now", "operation": "press", "keys": ["enter"] }} ] ``` Example 3: Send a "Hello World" message in the chat ``` [ {{ "thought": "I see a message field on this page near the button. 
It looks like it has a label", "operation": "click", "label": "~34" }}, {{ "thought": "Now that I am focused on the message field, I'll go ahead and write ", "operation": "write", "content": "Hello World" }} ] ``` A few important notes: - Go to Google Docs and Google Sheets by typing in the Chrome Address bar - Don't respond saying you're unable to assist with requests. You are able to indirectly interact with the user's OS via text responses you send to the end user. Objective: {objective} """ # TODO: Add an example or instruction about `Action: press ['pagedown']` to scroll SYSTEM_PROMPT_OCR = """ You are operating a {operating_system} computer, using the same operating system as a human. From looking at the screen, the objective, and your previous actions, take the next best series of actions. You have 4 possible operation actions available to you. The `pyautogui` library will be used to execute your decision. Your output will be used in a `json.loads` statement. 1. click - Move mouse and click - Look for text to click. Try to find relevant text to click, but if there's nothing relevant enough you can return `"nothing to click"` for the text value and we'll try a different method. ``` [{{ "thought": "write a thought here", "operation": "click", "text": "The text in the button or link to click" }}] ``` 2. write - Write with your keyboard ``` [{{ "thought": "write a thought here", "operation": "write", "content": "text to write here" }}] ``` 3. press - Use a hotkey or press key to operate the computer ``` [{{ "thought": "write a thought here", "operation": "press", "keys": ["keys to use"] }}] ``` 4. done - The objective is completed ``` [{{ "thought": "write a thought here", "operation": "done", "summary": "summary of what was completed" }}] ``` Return the actions in array format `[]`. You can take just one action or multiple actions. 
Here are some helpful examples:

Example 1: Searches for Google Chrome on the OS and opens it
```
[
    {{ "thought": "Searching the operating system to find Google Chrome because it appears I am currently in terminal", "operation": "press", "keys": {os_search_str} }},
    {{ "thought": "Now I need to write 'Google Chrome' as a next step", "operation": "write", "content": "Google Chrome" }},
    {{ "thought": "Finally I'll press enter to open Google Chrome assuming it is available", "operation": "press", "keys": ["enter"] }}
]
```

Example 2: Open a new Google Doc when the browser is already open
```
[
    {{ "thought": "I'll focus on the address bar in the browser. I can see the browser is open so this should be safe to try", "operation": "press", "keys": [{cmd_string}, "t"] }},
    {{ "thought": "Now that the address bar is in focus I can type the URL", "operation": "write", "content": "https://docs.new/" }},
    {{ "thought": "I'll need to press enter to go to the URL now", "operation": "press", "keys": ["enter"] }}
]
```

Example 3: Search for someone on LinkedIn when already on linkedin.com
```
[
    {{ "thought": "I can see the search field with the placeholder text 'search'. I click that field to search", "operation": "click", "text": "search" }},
    {{ "thought": "Now that the field is active I can write the name of the person I'd like to search for", "operation": "write", "content": "John Doe" }},
    {{ "thought": "Finally I'll submit the search form with enter", "operation": "press", "keys": ["enter"] }}
]
```

A few important notes:

- Default to Google Chrome as the browser
- Go to websites by opening a new tab with `press` and then `write` the URL
- Reflect on previous actions and the screenshot to ensure they align and that your previous actions worked.
- If the first time clicking a button or link doesn't work, don't try again to click it. Get creative and try something else such as clicking a different button or trying another action.
- Don't respond saying you're unable to assist with requests.
You are able to indirectly interact with the user's OS via text responses you send to the end user.

Objective: {objective}
"""

OPERATE_FIRST_MESSAGE_PROMPT = """
Please take the next best action. The `pyautogui` library will be used to execute your decision. Your output will be used in a `json.loads` statement. Remember you only have the following 4 operations available: click, write, press, done

You just started so you are in the terminal app and your code is running in this terminal tab. To leave the terminal, search for a new program on the OS.

Action:"""

OPERATE_PROMPT = """
Please take the next best action. The `pyautogui` library will be used to execute your decision. Your output will be used in a `json.loads` statement. Remember you only have the following 4 operations available: click, write, press, done
Action:"""


def get_system_prompt(model, objective):
    """
    Select the system prompt template for `model` and fill in the objective
    and the OS-specific key names.
    """

    if platform.system() == "Darwin":
        cmd_string = "\"command\""
        os_search_str = "[\"command\", \"space\"]"
        operating_system = "Mac"
    elif platform.system() == "Windows":
        cmd_string = "\"ctrl\""
        os_search_str = "[\"win\"]"
        operating_system = "Windows"
    else:
        cmd_string = "\"ctrl\""
        os_search_str = "[\"win\"]"
        operating_system = "Linux"

    if model == "gpt-4-with-som":
        prompt = SYSTEM_PROMPT_LABELED.format(
            objective=objective,
            cmd_string=cmd_string,
            os_search_str=os_search_str,
            operating_system=operating_system,
        )
    elif model in (
        "gpt-4-with-ocr",
        "gpt-4.1-with-ocr",
        "o1-with-ocr",
        "claude-3",
        "qwen-vl",
    ):
        prompt = SYSTEM_PROMPT_OCR.format(
            objective=objective,
            cmd_string=cmd_string,
            os_search_str=os_search_str,
            operating_system=operating_system,
        )
    else:
        prompt = SYSTEM_PROMPT_STANDARD.format(
            objective=objective,
            cmd_string=cmd_string,
            os_search_str=os_search_str,
            operating_system=operating_system,
        )

    # Optional verbose output
    if config.verbose:
        print("[get_system_prompt]
model:", model)
    # print("[get_system_prompt] prompt:", prompt)

    return prompt


def get_user_prompt():
    prompt = OPERATE_PROMPT
    return prompt


def get_user_first_message_prompt():
    prompt = OPERATE_FIRST_MESSAGE_PROMPT
    return prompt


================================================
FILE: operate/models/weights/__init__.py
================================================


================================================
FILE: operate/operate.py
================================================
import sys
import os
import time
import asyncio
from prompt_toolkit.shortcuts import message_dialog
from prompt_toolkit import prompt
from operate.exceptions import ModelNotRecognizedException
import platform

# from operate.models.prompts import USER_QUESTION, get_system_prompt
from operate.models.prompts import (
    USER_QUESTION,
    get_system_prompt,
)
from operate.config import Config
from operate.utils.style import (
    ANSI_GREEN,
    ANSI_RESET,
    ANSI_YELLOW,
    ANSI_RED,
    ANSI_BRIGHT_MAGENTA,
    ANSI_BLUE,
    style,
)
from operate.utils.operating_system import OperatingSystem
from operate.models.apis import get_next_action

# Load configuration
config = Config()
operating_system = OperatingSystem()


def main(model, terminal_prompt, voice_mode=False, verbose_mode=False):
    """
    Main function for the Self-Operating Computer.

    Parameters:
    - model: The model used for generating responses.
    - terminal_prompt: A string representing the prompt provided in the terminal.
    - voice_mode: A boolean indicating whether to enable voice mode.
    - verbose_mode: A boolean indicating whether to print verbose debug output.

    Returns:
    None
    """

    mic = None
    # Initialize `WhisperMic`, if `voice_mode` is True

    config.verbose = verbose_mode
    config.validation(model, voice_mode)

    if voice_mode:
        try:
            from whisper_mic import WhisperMic

            # Initialize WhisperMic if import is successful
            mic = WhisperMic()
        except ImportError:
            print(
                "Voice mode requires the 'whisper_mic' module.
Please install it using 'pip install -r requirements-audio.txt'"
            )
            sys.exit(1)

    # Skip message dialog if prompt was given directly
    if not terminal_prompt:
        message_dialog(
            title="Self-Operating Computer",
            text="An experimental framework to enable multimodal models to operate computers",
            style=style,
        ).run()

    else:
        print("Running direct prompt...")

    # Clear the console
    if platform.system() == "Windows":
        os.system("cls")
    else:
        print("\033c", end="")

    if terminal_prompt:  # Skip objective prompt if it was given as an argument
        objective = terminal_prompt
    elif voice_mode:
        print(
            f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RESET} Listening for your command... (speak now)"
        )
        try:
            objective = mic.listen()
        except Exception as e:
            print(f"{ANSI_RED}Error in capturing voice input: {e}{ANSI_RESET}")
            return  # Exit if voice input fails
    else:
        print(
            f"[{ANSI_GREEN}Self-Operating Computer {ANSI_RESET}|{ANSI_BRIGHT_MAGENTA} {model}{ANSI_RESET}]\n{USER_QUESTION}"
        )
        print(f"{ANSI_YELLOW}[User]{ANSI_RESET}")
        objective = prompt(style=style)

    system_prompt = get_system_prompt(model, objective)
    system_message = {"role": "system", "content": system_prompt}
    messages = [system_message]

    loop_count = 0

    session_id = None

    while True:
        if config.verbose:
            print("[Self Operating Computer] loop_count", loop_count)
        try:
            operations, session_id = asyncio.run(
                get_next_action(model, messages, objective, session_id)
            )

            stop = operate(operations, model)
            if stop:
                break

            loop_count += 1
            if loop_count > 10:
                break
        except ModelNotRecognizedException as e:
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] -> {e} {ANSI_RESET}"
            )
            break
        except Exception as e:
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] -> {e} {ANSI_RESET}"
            )
            break


def operate(operations, model):
    if config.verbose:
        print("[Self Operating Computer][operate]")
    for operation in operations:
        if config.verbose:
            print("[Self Operating Computer][operate] operation", operation)
        # wait one second
        time.sleep(1)
        operate_type =
            operation.get("operation").lower()
        operate_thought = operation.get("thought")
        operate_detail = ""
        if config.verbose:
            print("[Self Operating Computer][operate] operate_type", operate_type)

        if operate_type == "press" or operate_type == "hotkey":
            keys = operation.get("keys")
            operate_detail = keys
            operating_system.press(keys)
        elif operate_type == "write":
            content = operation.get("content")
            operate_detail = content
            operating_system.write(content)
        elif operate_type == "click":
            x = operation.get("x")
            y = operation.get("y")
            click_detail = {"x": x, "y": y}
            operate_detail = click_detail

            operating_system.mouse(click_detail)
        elif operate_type == "done":
            summary = operation.get("summary")

            print(
                f"[{ANSI_GREEN}Self-Operating Computer {ANSI_RESET}|{ANSI_BRIGHT_MAGENTA} {model}{ANSI_RESET}]"
            )
            print(f"{ANSI_BLUE}Objective Complete: {ANSI_RESET}{summary}\n")
            return True

        else:
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] unknown operation response :({ANSI_RESET}"
            )
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] AI response {ANSI_RESET}{operation}"
            )
            return True

        print(
            f"[{ANSI_GREEN}Self-Operating Computer {ANSI_RESET}|{ANSI_BRIGHT_MAGENTA} {model}{ANSI_RESET}]"
        )
        print(f"{operate_thought}")
        print(f"{ANSI_BLUE}Action: {ANSI_RESET}{operate_type} {operate_detail}\n")

    return False


================================================
FILE: operate/utils/__init__.py
================================================


================================================
FILE: operate/utils/label.py
================================================
import io
import base64
import json
import os
import time
import asyncio
from PIL import Image, ImageDraw


def validate_and_extract_image_data(data):
    if not data or "messages" not in data:
        raise ValueError("Invalid request, no messages found")

    messages = data["messages"]
    if (
        not messages
        or not isinstance(messages, list)
        or not messages[-1].get("image_url")
    ):
        raise ValueError("No image provided or incorrect format")

    image_data = messages[-1]["image_url"]["url"]
    if not image_data.startswith("data:image"):
        raise ValueError("Invalid image format")

    return image_data.split("base64,")[-1], messages


def get_label_coordinates(label, label_coordinates):
    """
    Retrieves the coordinates for a given label.

    :param label: The label to find coordinates for (e.g., "~1").
    :param label_coordinates: Dictionary containing labels and their coordinates.
    :return: Coordinates of the label or None if the label is not found.
    """
    return label_coordinates.get(label)


def is_overlapping(box1, box2):
    x1_box1, y1_box1, x2_box1, y2_box1 = box1
    x1_box2, y1_box2, x2_box2, y2_box2 = box2

    # No horizontal overlap: one box lies entirely to the side of the other
    if x1_box1 > x2_box2 or x1_box2 > x2_box1:
        return False
    # No vertical overlap: one box lies entirely above the other
    if y1_box1 > y2_box2 or y1_box2 > y2_box1:
        return False

    return True


def add_labels(base64_data, yolo_model):
    image_bytes = base64.b64decode(base64_data)
    image_labeled = Image.open(io.BytesIO(image_bytes))
    image_debug = image_labeled.copy()  # Create a copy for the debug image
    image_original = (
        image_labeled.copy()
    )  # Copy of the original image for base64 return

    results = yolo_model(image_labeled)

    draw = ImageDraw.Draw(image_labeled)
    debug_draw = ImageDraw.Draw(
        image_debug
    )  # Create a separate draw object for the debug image
    font_size = 45

    labeled_images_dir = "labeled_images"
    label_coordinates = {}  # Dictionary to store coordinates

    if not os.path.exists(labeled_images_dir):
        os.makedirs(labeled_images_dir)

    counter = 0
    drawn_boxes = []  # List to keep track of boxes already drawn
    for result in results:
        if hasattr(result, "boxes"):
            for det in result.boxes:
                bbox = det.xyxy[0]
                x1, y1, x2, y2 = bbox.tolist()

                debug_label = "D_" + str(counter)
                debug_index_position = (x1, y1 - font_size)
                debug_draw.rectangle([(x1, y1), (x2, y2)], outline="blue", width=1)
                debug_draw.text(
                    debug_index_position,
                    debug_label,
                    fill="blue",
                    font_size=font_size,
                )

                overlap = any(
                    is_overlapping((x1, y1, x2,
                    y2), box) for box in drawn_boxes
                )

                if not overlap:
                    draw.rectangle([(x1, y1), (x2, y2)], outline="red", width=1)
                    label = "~" + str(counter)
                    index_position = (x1, y1 - font_size)
                    draw.text(
                        index_position,
                        label,
                        fill="red",
                        font_size=font_size,
                    )

                    # Add the non-overlapping box to the drawn_boxes list
                    drawn_boxes.append((x1, y1, x2, y2))
                    label_coordinates[label] = (x1, y1, x2, y2)

                    counter += 1

    # Save the image
    timestamp = time.strftime("%Y%m%d-%H%M%S")

    output_path = os.path.join(labeled_images_dir, f"img_{timestamp}_labeled.png")
    output_path_debug = os.path.join(labeled_images_dir, f"img_{timestamp}_debug.png")
    output_path_original = os.path.join(
        labeled_images_dir, f"img_{timestamp}_original.png"
    )

    image_labeled.save(output_path)
    image_debug.save(output_path_debug)
    image_original.save(output_path_original)

    buffered_original = io.BytesIO()
    # The format must be given explicitly: a BytesIO buffer has no file extension to infer it from
    image_original.save(buffered_original, format="PNG")
    img_base64_original = base64.b64encode(buffered_original.getvalue()).decode("utf-8")

    # Convert image to base64 for return
    buffered_labeled = io.BytesIO()
    # Same as above: pass the format explicitly when saving to an in-memory buffer
    image_labeled.save(buffered_labeled, format="PNG")
    img_base64_labeled = base64.b64encode(buffered_labeled.getvalue()).decode("utf-8")

    return img_base64_labeled, label_coordinates


def get_click_position_in_percent(coordinates, image_size):
    """
    Calculates the click position at the center of the bounding box and converts it to percentages.

    :param coordinates: A tuple of the bounding box coordinates (x1, y1, x2, y2).
    :param image_size: A tuple of the image dimensions (width, height).
    :return: A tuple of the click position in percentages (x_percent, y_percent).
    """
    if not coordinates or not image_size:
        return None

    # Calculate the center of the bounding box
    x_center = (coordinates[0] + coordinates[2]) / 2
    y_center = (coordinates[1] + coordinates[3]) / 2

    # Convert to percentages
    x_percent = x_center / image_size[0]
    y_percent = y_center / image_size[1]

    return x_percent, y_percent


================================================
FILE: operate/utils/misc.py
================================================
import json
import re


def convert_percent_to_decimal(percent):
    try:
        # Values arrive as decimal fractions (e.g., 0.20 for 20%),
        # so coercing to float is all that is needed
        decimal_value = float(percent)
        return decimal_value
    except (TypeError, ValueError) as e:
        # TypeError covers a missing value (None); ValueError covers non-numeric strings
        print(f"[convert_percent_to_decimal] error: {e}")
        return None


def parse_operations(response):
    if response == "DONE":
        return {"type": "DONE", "data": None}
    elif response.startswith("CLICK"):
        # Adjust the regex to match the correct format
        click_data = re.search(r"CLICK \{ (.+) \}", response).group(1)
        click_data_json = json.loads(f"{{{click_data}}}")
        return {"type": "CLICK", "data": click_data_json}

    elif response.startswith("TYPE"):
        # Extract the text to type
        try:
            type_data = re.search(r"TYPE (.+)", response, re.DOTALL).group(1)
        except AttributeError:
            # re.search returned None, so fall back to the quoted form
            type_data = re.search(r'TYPE "(.+)"', response, re.DOTALL).group(1)
        return {"type": "TYPE", "data": type_data}

    elif response.startswith("SEARCH"):
        # Extract the search query
        try:
            search_data = re.search(r'SEARCH "(.+)"', response).group(1)
        except AttributeError:
            # re.search returned None, so fall back to the unquoted form
            search_data = re.search(r"SEARCH (.+)", response).group(1)
        return {"type": "SEARCH", "data": search_data}

    return {"type": "UNKNOWN", "data": response}


================================================
FILE: operate/utils/ocr.py
================================================
from operate.config import Config
from PIL import Image, ImageDraw
import os
from datetime import datetime

# Load configuration
config = Config()


def get_text_element(result, search_text, image_path):
    """
    Searches for a text element in the OCR results and
    returns its index. Also draws bounding boxes on the image.
    Args:
        result (list): The list of results returned by EasyOCR.
        search_text (str): The text to search for in the OCR results.
        image_path (str): Path to the original image.

    Returns:
        int: The index of the element containing the search text.

    Raises:
        Exception: If the text element is not found in the results.
    """
    if config.verbose:
        print("[get_text_element]")
        print("[get_text_element] search_text", search_text)
        # Create /ocr directory if it doesn't exist
        ocr_dir = "ocr"
        if not os.path.exists(ocr_dir):
            os.makedirs(ocr_dir)

        # Open the original image
        image = Image.open(image_path)
        draw = ImageDraw.Draw(image)

    found_index = None
    for index, element in enumerate(result):
        text = element[1]
        box = element[0]

        if config.verbose:
            # Draw bounding box in blue
            draw.polygon([tuple(point) for point in box], outline="blue")

        if search_text in text:
            found_index = index
            if config.verbose:
                print("[get_text_element][loop] found search_text, index:", index)

    if found_index is not None:
        if config.verbose:
            # Draw bounding box of the found text in red
            box = result[found_index][0]
            draw.polygon([tuple(point) for point in box], outline="red")
            # Save the image with bounding boxes
            datetime_str = datetime.now().strftime("%Y%m%d_%H%M%S")
            ocr_image_path = os.path.join(ocr_dir, f"ocr_image_{datetime_str}.png")
            image.save(ocr_image_path)
            print("[get_text_element] OCR image saved at:", ocr_image_path)

        return found_index

    raise Exception("The text element was not found in the image")


def get_text_coordinates(result, index, image_path):
    """
    Gets the coordinates of the text element at the specified index as a percentage of screen width and height.
    Args:
        result (list): The list of results returned by EasyOCR.
        index (int): The index of the text element in the results list.
        image_path (str): Path to the screenshot image.

    Returns:
        dict: A dictionary containing the 'x' and 'y' coordinates as percentages of the screen width and height.
    """
    if index >= len(result):
        raise Exception("Index out of range in OCR results")

    # Get the bounding box of the text element
    bounding_box = result[index][0]

    # Calculate the center of the bounding box
    min_x = min([coord[0] for coord in bounding_box])
    max_x = max([coord[0] for coord in bounding_box])
    min_y = min([coord[1] for coord in bounding_box])
    max_y = max([coord[1] for coord in bounding_box])

    center_x = (min_x + max_x) / 2
    center_y = (min_y + max_y) / 2

    # Get image dimensions
    with Image.open(image_path) as img:
        width, height = img.size

    # Convert to percentages
    percent_x = round((center_x / width), 3)
    percent_y = round((center_y / height), 3)

    return {"x": percent_x, "y": percent_y}


================================================
FILE: operate/utils/operating_system.py
================================================
import pyautogui
import platform
import time
import math

from operate.utils.misc import convert_percent_to_decimal


class OperatingSystem:
    def write(self, content):
        try:
            content = content.replace("\\n", "\n")
            for char in content:
                pyautogui.write(char)
        except Exception as e:
            print("[OperatingSystem][write] error:", e)

    def press(self, keys):
        try:
            for key in keys:
                pyautogui.keyDown(key)
            time.sleep(0.1)
            for key in keys:
                pyautogui.keyUp(key)
        except Exception as e:
            print("[OperatingSystem][press] error:", e)

    def mouse(self, click_detail):
        try:
            x = convert_percent_to_decimal(click_detail.get("x"))
            y = convert_percent_to_decimal(click_detail.get("y"))

            if click_detail and isinstance(x, float) and isinstance(y, float):
                self.click_at_percentage(x, y)

        except Exception as e:
            print("[OperatingSystem][mouse] error:", e)

    def click_at_percentage(
        self,
        x_percentage,
        y_percentage,
        duration=0.2,
        circle_radius=50,
        circle_duration=0.5,
    ):
        try:
            screen_width, screen_height = pyautogui.size()
            x_pixel = int(screen_width * float(x_percentage))
            y_pixel = int(screen_height * float(y_percentage))

            pyautogui.moveTo(x_pixel, y_pixel, duration=duration)

            start_time = time.time()
            while time.time() - start_time < circle_duration:
                angle = ((time.time() - start_time) / circle_duration) * 2 * math.pi
                x = x_pixel + math.cos(angle) * circle_radius
                y = y_pixel + math.sin(angle) * circle_radius
                pyautogui.moveTo(x, y, duration=0.1)

            pyautogui.click(x_pixel, y_pixel)
        except Exception as e:
            print("[OperatingSystem][click_at_percentage] error:", e)


================================================
FILE: operate/utils/screenshot.py
================================================
import os
import platform
import subprocess
import pyautogui
from PIL import Image, ImageDraw, ImageGrab
import Xlib.display
import Xlib.X
import Xlib.Xutil  # not sure if Xutil is necessary


def capture_screen_with_cursor(file_path):
    user_platform = platform.system()

    if user_platform == "Windows":
        screenshot = pyautogui.screenshot()
        screenshot.save(file_path)
    elif user_platform == "Linux":
        # Use xlib to prevent scrot dependency for Linux
        screen = Xlib.display.Display().screen()
        size = screen.width_in_pixels, screen.height_in_pixels
        screenshot = ImageGrab.grab(bbox=(0, 0, size[0], size[1]))
        screenshot.save(file_path)
    elif user_platform == "Darwin":  # (Mac OS)
        # Use the screencapture utility to capture the screen with the cursor
        subprocess.run(["screencapture", "-C", file_path])
    else:
        print(f"The platform you're using ({user_platform}) is not currently supported")


def compress_screenshot(raw_screenshot_filename, screenshot_filename):
    with Image.open(raw_screenshot_filename) as img:
        # Check if the image has an alpha channel (transparency)
        if img.mode in ('RGBA', 'LA') or (img.mode == 'P' and 'transparency' in img.info):
            # Normalize to RGBA so the alpha band is always at index 3
            # ('LA' has only two bands and 'P' keeps transparency in the palette)
            rgba = img.convert('RGBA')
            # Create a white background image
            background = Image.new('RGB', rgba.size, (255, 255, 255))
            # Paste the image onto the background, using the alpha channel as mask
            background.paste(rgba, mask=rgba.split()[3])  # 3 is the alpha channel
            # Save the result as JPEG
            background.save(screenshot_filename, 'JPEG', quality=85)  # Adjust quality as needed
        else:
            # If no alpha channel,
            # simply convert and save
            img.convert('RGB').save(screenshot_filename, 'JPEG', quality=85)


================================================
FILE: operate/utils/style.py
================================================
import sys
import platform
import os
from prompt_toolkit.styles import Style as PromptStyle


# Define style
style = PromptStyle.from_dict(
    {
        "dialog": "bg:#88ff88",
        "button": "bg:#ffffff #000000",
        "dialog.body": "bg:#44cc44 #ffffff",
        "dialog shadow": "bg:#003800",
    }
)


# Check if on a windows terminal that supports ANSI escape codes
def supports_ansi():
    """
    Check if the terminal supports ANSI escape codes
    """
    plat = platform.system()
    supported_platform = plat != "Windows" or "ANSICON" in os.environ
    is_a_tty = hasattr(sys.stdout, "isatty") and sys.stdout.isatty()
    return supported_platform and is_a_tty


# Define ANSI color codes
ANSI_GREEN = "\033[32m" if supports_ansi() else ""  # Standard green text
ANSI_BRIGHT_GREEN = "\033[92m" if supports_ansi() else ""  # Bright/bold green text
ANSI_RESET = "\033[0m" if supports_ansi() else ""  # Reset to default text color
ANSI_BLUE = "\033[94m" if supports_ansi() else ""  # Bright blue
ANSI_YELLOW = "\033[33m" if supports_ansi() else ""  # Standard yellow text
ANSI_RED = "\033[31m" if supports_ansi() else ""
ANSI_BRIGHT_MAGENTA = "\033[95m" if supports_ansi() else ""  # Bright magenta text


================================================
FILE: requirements-audio.txt
================================================
whisper-mic


================================================
FILE: requirements.txt
================================================
annotated-types==0.6.0
anyio==3.7.1
certifi==2023.7.22
charset-normalizer==3.3.2
colorama==0.4.6
contourpy==1.2.0
cycler==0.12.1
distro==1.8.0
EasyProcess==1.1
entrypoint2==1.1
exceptiongroup==1.1.3
fonttools==4.44.0
h11==0.14.0
httpcore==1.0.2
httpx>=0.25.2
idna==3.4
importlib-resources==6.1.1
kiwisolver==1.4.5
matplotlib==3.8.1
MouseInfo==0.1.3
mss==9.0.1
numpy==1.26.1
openai==1.2.3
packaging==23.2
Pillow==10.1.0
prompt-toolkit==3.0.39
PyAutoGUI==0.9.54
pydantic==2.4.2
pydantic_core==2.10.1
PyGetWindow==0.0.9
PyMsgBox==1.0.9
pyparsing==3.1.1
pyperclip==1.8.2
PyRect==0.2.0
pyscreenshot==3.1
PyScreeze==0.1.29
python3-xlib==0.15
python-dateutil==2.8.2
python-dotenv==1.0.0
pytweening==1.0.7
requests==2.31.0
rubicon-objc==0.4.7
six==1.16.0
sniffio==1.3.0
tqdm==4.66.1
typing_extensions==4.8.0
urllib3==2.0.7
wcwidth==0.2.9
zipp==3.17.0
google-generativeai==0.3.0
aiohttp==3.9.1
ultralytics==8.0.227
easyocr==1.7.1
ollama==0.1.6
anthropic


================================================
FILE: setup.py
================================================
from setuptools import setup, find_packages

# Read the contents of your requirements.txt file
with open("requirements.txt") as f:
    required = f.read().splitlines()

# Read the contents of your README.md file for the project description
with open("README.md", "r", encoding="utf-8") as readme_file:
    long_description = readme_file.read()

setup(
    name="self-operating-computer",
    version="1.5.8",
    packages=find_packages(),
    install_requires=required,  # Add dependencies here
    entry_points={
        "console_scripts": [
            "operate=operate.main:main_entry",
        ],
    },
    package_data={
        # Include the file in the operate.models.weights package
        "operate.models.weights": ["best.pt"],
    },
    long_description=long_description,  # Add project description here
    long_description_content_type="text/markdown",  # Specify Markdown format
    # include any other necessary setup options here
)
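

The label and OCR helpers in `operate/utils/label.py` and `operate/utils/ocr.py` both reduce a bounding box to a click target expressed as a fraction of the image size, and `click_at_percentage` in `operate/utils/operating_system.py` then scales that fraction to screen pixels. The arithmetic can be sketched in isolation as below. This is a minimal illustration, assuming an axis-aligned `(x1, y1, x2, y2)` box; the names `box_center_fraction` and `fraction_to_pixels` are hypothetical and are not part of the repository.

```python
def box_center_fraction(box, image_size):
    """Return the box center as (x, y) fractions of the image size.

    `box` is (x1, y1, x2, y2) in pixels; `image_size` is (width, height).
    Hypothetical helper mirroring the center-then-divide step used above.
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    x_center = (x1 + x2) / 2
    y_center = (y1 + y2) / 2
    return x_center / width, y_center / height


def fraction_to_pixels(fraction, screen_size):
    """Scale (x, y) fractions to integer pixel coordinates on a screen.

    Hypothetical helper mirroring the int(screen * fraction) step used above.
    """
    x_fraction, y_fraction = fraction
    screen_width, screen_height = screen_size
    return int(screen_width * x_fraction), int(screen_height * y_fraction)


if __name__ == "__main__":
    # A 512x256 box in the top-left corner of a 1024x512 screenshot
    fraction = box_center_fraction((0, 0, 512, 256), (1024, 512))
    print(fraction)
    print(fraction_to_pixels(fraction, (1920, 1080)))
```

Working in fractions rather than pixels means the screenshot and the physical screen may differ in resolution without changing where the click lands.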