Full Code of reiniscimurs/DRL-robot-navigation-IR-SIM for AI

master 31e1a4d511bb cached

81 files

44.5 MB

1.1M tokens

224 symbols

1 requests

Download .txt

Showing preview only (4,559K chars total). Download the full file or copy to clipboard to get everything.

Repository: reiniscimurs/DRL-robot-navigation-IR-SIM
Branch: master
Commit: 31e1a4d511bb
Files: 81
Total size: 44.5 MB

Directory structure:
gitextract_12_7tajc/

├── .devin/
│   └── wiki.json
├── .github/
│   └── workflows/
│       └── main.yml
├── .gitignore
├── README.md
├── docs/
│   ├── api/
│   │   ├── IR-SIM/
│   │   │   ├── ir-marl-sim.md
│   │   │   └── ir-sim.md
│   │   ├── Testing/
│   │   │   ├── test.md
│   │   │   └── testrnn.md
│   │   ├── Training/
│   │   │   ├── train.md
│   │   │   └── trainrnn.md
│   │   ├── Utils/
│   │   │   ├── replay_buffer.md
│   │   │   └── utils.md
│   │   ├── models/
│   │   │   ├── DDPG.md
│   │   │   ├── HCM.md
│   │   │   ├── MARL/
│   │   │   │   ├── Attention.md
│   │   │   │   └── TD3.md
│   │   │   ├── PPO.md
│   │   │   ├── RCPG.md
│   │   │   ├── SAC.md
│   │   │   ├── TD3.md
│   │   │   ├── __init__.md
│   │   │   └── cnntd3.md
│   │   └── robot_nav.md
│   └── index.md
├── mkdocs.yml
├── pyproject.toml
├── robot_nav/
│   ├── SIM_ENV/
│   │   ├── __init__.py
│   │   ├── marl_sim.py
│   │   ├── sim.py
│   │   └── sim_env.py
│   ├── __init__.py
│   ├── assets/
│   │   └── data.yml
│   ├── eval_points.yaml
│   ├── marl_test_random.py
│   ├── marl_test_single.py
│   ├── marl_train.py
│   ├── models/
│   │   ├── CNNTD3/
│   │   │   ├── CNNTD3.py
│   │   │   └── __init__.py
│   │   ├── DDPG/
│   │   │   ├── DDPG.py
│   │   │   └── __init__.py
│   │   ├── HCM/
│   │   │   ├── __init__.py
│   │   │   └── hardcoded_model.py
│   │   ├── MARL/
│   │   │   ├── Attention/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── g2anet.py
│   │   │   │   └── iga.py
│   │   │   ├── __init__.py
│   │   │   └── marlTD3/
│   │   │       ├── __init__.py
│   │   │       └── marlTD3.py
│   │   ├── PPO/
│   │   │   ├── PPO.py
│   │   │   └── __init__.py
│   │   ├── RCPG/
│   │   │   ├── RCPG.py
│   │   │   └── __init__.py
│   │   ├── SAC/
│   │   │   ├── SAC.py
│   │   │   ├── SAC_actor.py
│   │   │   ├── SAC_critic.py
│   │   │   ├── SAC_utils.py
│   │   │   └── __init__.py
│   │   ├── TD3/
│   │   │   ├── TD3.py
│   │   │   └── __init__.py
│   │   └── __init__.py
│   ├── replay_buffer.py
│   ├── rl_test.py
│   ├── rl_test_random.py
│   ├── rl_train.py
│   ├── rnn_test.py
│   ├── rnn_train.py
│   ├── rvo_test_random.py
│   ├── rvo_test_single.py
│   ├── utils.py
│   └── worlds/
│       ├── circle_world.yaml
│       ├── cross_world.yaml
│       ├── eval_world.yaml
│       ├── multi_robot_world.yaml
│       └── robot_world.yaml
└── tests/
    ├── __init__.py
    ├── test_data.yml
    ├── test_marl_world.yaml
    ├── test_model.py
    ├── test_sim.py
    ├── test_utils.py
    └── test_world.yaml

================================================
FILE CONTENTS
================================================

================================================
FILE: .devin/wiki.json
================================================
{
  "repo_notes": [
    {
      "content": ""
    }
  ],
  "pages": [
    {
      "title": "Overview",
      "purpose": "Introduce the DRL-robot-navigation-IR-SIM project, explaining its purpose as a Deep Reinforcement Learning system for robot navigation using the IR-SIM simulator, and providing a high-level architecture overview",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Getting Started",
      "purpose": "Guide users through installation using Poetry, running the main training script, and monitoring with TensorBoard",
      "parent": "Overview",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Core Training System",
      "purpose": "Explain the main training orchestration, including the training loop structure (epochs, episodes, steps), state-action-reward cycle, and integration with models and environments",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Simulation Environments",
      "purpose": "Document the SIM (single-agent) and MARL_SIM (multi-agent) environment wrappers around IR-SIM, including state observation processing, reward functions, and reset mechanics",
      "parent": "Core Training System",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Data Management and Replay Buffers",
      "purpose": "Describe the replay buffer implementations (ReplayBuffer, RolloutReplayBuffer, RolloutBuffer), the get_buffer factory function, and experience storage/sampling strategies for different algorithms",
      "parent": "Core Training System",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Single-Agent Models",
      "purpose": "Overview of the single-agent DRL model library, explaining the model taxonomy, evolutionary relationships (DDPG→TD3→CNNTD3→RCPG), and shared interfaces",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "TD3 and DDPG",
      "purpose": "Document the Twin Delayed DDPG algorithm and base DDPG, including network architectures, training loop with target smoothing and delayed updates, and state preparation for 10-dimensional states",
      "parent": "Single-Agent Models",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "CNNTD3",
      "purpose": "Explain the CNN-enhanced TD3 model designed for 185-dimensional laser scan data, detailing the 1D convolutional layers for spatial feature extraction and its role as the primary model in rl_train.py",
      "parent": "Single-Agent Models",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "RCPG",
      "purpose": "Document the Recurrent Convolutional Policy Gradient model that adds LSTM/GRU/RNN layers to CNNTD3 for temporal modeling, including its specialized RolloutReplayBuffer for sequence sampling",
      "parent": "Single-Agent Models",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "SAC",
      "purpose": "Describe the Soft Actor-Critic algorithm with entropy regularization, including the DiagGaussianActor with SquashedNormal distribution and DoubleQCritic architecture",
      "parent": "Single-Agent Models",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "PPO",
      "purpose": "Explain the Proximal Policy Optimization on-policy algorithm, its clipped objective, RolloutBuffer for episodic collection, and action standard deviation decay mechanism",
      "parent": "Single-Agent Models",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "HCM (Hardcoded Model)",
      "purpose": "Document the hard-coded navigation model used for collecting collision data and pre-training datasets, including its action computation logic and sample saving functionality",
      "parent": "Single-Agent Models",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Multi-Agent Reinforcement Learning",
      "purpose": "Introduce the MARL capabilities of the system, explaining the multi-agent simulation environment, attention-based coordination mechanisms, and training differences from single-agent approaches",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "MARL Simulation Environment",
      "purpose": "Detail the MARL_SIM class for multi-robot scenarios, including per-agent state processing, two-phase reward functions, connection matrices for communication topology, and conflict-free reset logic",
      "parent": "Multi-Agent Reinforcement Learning",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "marlTD3 with Attention Mechanisms",
      "purpose": "Explain the multi-agent TD3 implementation with hard-soft attention, the Actor and Critic architectures with attention encoders, In-Graph Attention and G2ANET, GoalAttentionLayer message passing, and auxiliary losses (BCE, entropy)",
      "parent": "Multi-Agent Reinforcement Learning",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Environment Configuration",
      "purpose": "Document the YAML configuration system for world setup, robot properties, sensor specifications (lidar2d), obstacle definitions, and evaluation point configurations",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Testing and Validation",
      "purpose": "Overview of the comprehensive testing infrastructure using pytest, including unit tests for models, environments, and utilities, with CI/CD integration via GitHub Actions",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Model Testing",
      "purpose": "Explain the test suite for validating model training capabilities, including parameterized tests for all algorithms, prefilled test data usage, and max bound variant testing",
      "parent": "Testing and Validation",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Simulation and Environment Testing",
      "purpose": "Document tests for SIM and MARL_SIM environments, including state retrieval, step execution, reset functionality, and the cossin utility function",
      "parent": "Testing and Validation",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Buffer and Utility Testing",
      "purpose": "Detail the testing framework for the get_buffer factory function, validating correct buffer type instantiation (ReplayBuffer, RolloutReplayBuffer, RolloutBuffer) based on model type",
      "parent": "Testing and Validation",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Advanced Features",
      "purpose": "Document advanced capabilities including max upper bound loss for Q-value regularization, pretraining from offline datasets, and model-specific architectural enhancements",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Max Upper Bound Loss",
      "purpose": "Explain the optional Q-value bounding mechanism available for TD3, DDPG, and CNNTD3 to address overestimation bias, including the theoretical maximum calculation and loss formulation",
      "parent": "Advanced Features",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Pretraining and Offline Learning",
      "purpose": "Document the Pretraining class for loading offline experience data from YAML files, the get_max_bound utility, and integration with the get_buffer factory for model initialization",
      "parent": "Advanced Features",
      "page_notes": [
        {
          "content": ""
        }
      ]
    },
    {
      "title": "Development and Deployment",
      "purpose": "Cover the development workflow including Poetry dependency management, project structure with pyproject.toml, CI/CD pipeline with GitHub Actions, and TensorBoard integration for training monitoring",
      "page_notes": [
        {
          "content": ""
        }
      ]
    }
  ]
}


================================================
FILE: .github/workflows/main.yml
================================================
name: Run Tests

on:
  push:
    branches: [master]
  pull_request:
    branches: [master]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install dependencies
        run: poetry install

      - name: Run tests
        run: poetry run pytest


================================================
FILE: .gitignore
================================================
/.venv/
/runs/
/site/


================================================
FILE: README.md
================================================
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/reiniscimurs/DRL-robot-navigation-IR-SIM)

**DRL Robot navigation in IR-SIM**

Deep Reinforcement Learning algorithm implementation for simulated robot navigation in IR-SIM. Using 2D laser sensor data
and information about the goal point a robot learns to navigate to a specified point in the environment.

![Example](https://github.com/reiniscimurs/DRL-robot-navigation-IR-SIM/blob/master/out.gif)

**Installation**

* Package versioning is managed with poetry \
`pip install poetry`
* Clone the repository \
`git clone https://github.com/reiniscimurs/DRL-robot-navigation.git`
* Navigate to the cloned location and install using poetry \
`poetry install`

**Training the model**

* Run the training by executing the train.py file \
`poetry run python robot_nav/rl_train.py`

* To open tensorbord, in a new terminal execute \
`tensorboard --logdir runs`



**Sources**

| Package |                          Description                          |                              Source | 
|:--------|:-------------------------------------------------------------:|------------------------------------:| 
| IR-SIM  |                 Light-weight robot simulator                  | https://github.com/hanruihua/ir-sim |
| PythonRobotics  | Python code collection of robotics algorithms (Path planning) | https://github.com/AtsushiSakai/PythonRobotics |


**Models**

| Model            |                                           Description                                           |                    Model                           Source | 
|:-----------------|:-----------------------------------------------------------------------------------------------:|----------------------------------------------------------:|
| TD3              |                      Twin Delayed Deep Deterministic Policy Gradient model                      | https://github.com/reiniscimurs/DRL-Robot-Navigation-ROS2 | 
| SAC              |                                     Soft Actor-Critic model                                     |                https://github.com/denisyarats/pytorch_sac | 
| PPO              |                               Proximal Policy Optimization model                                |            https://github.com/nikhilbarhate99/PPO-PyTorch | 
| DDPG             |                            Deep Deterministic Policy Gradient model                             |                                          Updated from TD3 | 
| CNNTD3           |                          TD3 model with 1D CNN encoding of laser state                          |                                                         - |
| RCPG             | Recurrent Convolution Policy Gradient - adding recurrence layers (lstm/gru/rnn) to CNNTD3 model |                                                         - |
| MARL: TD3-G2ANet |                     G2ANet attention encoder for TD3 model in MARL setting                      |                                      G2ANet adapted from https://github.com/starry-sky6688/MARL-Algorithms |
| MARL: TD3-IGS    |                 In-Graph Softmax attention model for TD3 model in MARL setting                  |                                                         - |

**Max Upper Bound Models**

Models that support the additional loss of Q values exceeding the maximal possible Q value in the episode. Q values that exceed this upper bound are used to calculate a loss for the model. This helps to control the overestimation of Q values in off-policy actor-critic networks.
To enable max upper bound loss set `use_max_bound = True` when initializing a model.

| Model  |  
|:-------|
| TD3    | 
| DDPG   | 
| CNNTD3 |

**MARL Models**

Multi agent RL setting with training multiple robots a the same time with a single policy. Implementation does not use sensor informaition and only exchanged graph messages are used for navigation.

Read more bout the In-Graph Softmax for MARL implementation [here](https://medium.com/@reinis_86651/learning-graphs-for-marl-on-the-fly-9fe44c356808).

Video of results:

[![EXPERIMENTS](https://img.youtube.com/vi/SGl7sil_dpo/0.jpg)](https://www.youtube.com/watch?v=SGl7sil_dpo)


================================================
FILE: docs/api/IR-SIM/ir-marl-sim.md
================================================
# MARL-IR-SIM

::: robot_nav.SIM_ENV.marl_sim
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/api/IR-SIM/ir-sim.md
================================================
# IR-SIM

::: robot_nav.SIM_ENV.sim
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/api/Testing/test.md
================================================
# Testing

::: robot_nav.test
    options:
      show_root_heading: true
      show_source: true

::: robot_nav.test_random
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/api/Testing/testrnn.md
================================================
# Testing RNN

::: robot_nav.test_rnn
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/api/Training/train.md
================================================
# Training

::: robot_nav.train
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/api/Training/trainrnn.md
================================================
# Training RNN

::: robot_nav.train_rnn
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/api/Utils/replay_buffer.md
================================================
# Replay/Rollout Buffer

::: robot_nav.replay_buffer
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/api/Utils/utils.md
================================================
# Utils

::: robot_nav.utils
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/api/models/DDPG.md
================================================
# DDPG

::: robot_nav.models.DDPG.DDPG
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/api/models/HCM.md
================================================
# Hardcoded Model

::: robot_nav.models.HCM.hardcoded_model
    options:
      show_root_heading: true
      show_source: true


================================================
FILE: docs/api/models/MARL/Attention.md
================================================
# Hard-Soft Attention

::: robot_nav.models.MARL.hardsoftAttention
    options:
      show_root_heading: true
      show_source: true


================================================
FILE: docs/api/models/MARL/TD3.md
================================================
# MARL TD3

::: robot_nav.models.MARL.marlTD3
    options:
      show_root_heading: true
      show_source: true


================================================
FILE: docs/api/models/PPO.md
================================================
# PPO

::: robot_nav.models.PPO.PPO
    options:
      show_root_heading: true
      show_source: true


================================================
FILE: docs/api/models/RCPG.md
================================================
# RCPG

::: robot_nav.models.RCPG.RCPG
    options:
      show_root_heading: true
      show_source: true


================================================
FILE: docs/api/models/SAC.md
================================================
# SAC

::: robot_nav.models.SAC.SAC
    options:
      show_root_heading: true
      show_source: true

::: robot_nav.models.SAC.SAC_actor
    options:
      show_root_heading: true
      show_source: true

::: robot_nav.models.SAC.SAC_critic
    options:
      show_root_heading: true
      show_source: true

::: robot_nav.models.SAC.SAC_utils
    options:
      show_root_heading: true
      show_source: true


================================================
FILE: docs/api/models/TD3.md
================================================
# TD3

::: robot_nav.models.TD3.TD3
    options:
      show_root_heading: true
      show_source: true


================================================
FILE: docs/api/models/__init__.md
================================================
# __init__

## Documentation

```
No documentation available for __init__.
```


================================================
FILE: docs/api/models/cnntd3.md
================================================
# CNNTD3

::: robot_nav.models.CNNTD3.CNNTD3
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/api/robot_nav.md
================================================
# Robot Navigation

Will be updated

::: robot_nav
    options:
      show_root_heading: true
      show_source: true

================================================
FILE: docs/index.md
================================================
# Welcome to DRL-robot-navigation-IR-SIM

**DRL Robot navigation in IR-SIM**

Deep Reinforcement Learning algorithm implementation for simulated robot navigation in IR-SIM. Using 2D laser sensor data
and information about the goal point a robot learns to navigate to a specified point in the environment.

**Installation**

* Package versioning is managed with poetry \
`pip install poetry`
* Clone the repository \
`git clone https://github.com/reiniscimurs/DRL-robot-navigation.git`
* Navigate to the cloned location and install using poetry \
`poetry install`

**Training the model**

* Run the training by executing the train.py file \
`poetry run python robot_nav/train.py`

* To open tensorbord, in a new terminal execute \
`tensorboard --logdir runs`



**Sources**

| Package |                          Description                          |                              Source | 
|:--------|:-------------------------------------------------------------:|------------------------------------:| 
| IR-SIM  |                 Light-weight robot simulator                  | https://github.com/hanruihua/ir-sim |
| PythonRobotics  | Python code collection of robotics algorithms (Path planning) | https://github.com/AtsushiSakai/PythonRobotics |


**Models**

| Model     |                                           Description                                           |                    Model                           Source | 
|:----------|:-----------------------------------------------------------------------------------------------:|----------------------------------------------------------:|
| TD3       |                      Twin Delayed Deep Deterministic Policy Gradient model                      | https://github.com/reiniscimurs/DRL-Robot-Navigation-ROS2 | 
| SAC       |                                     Soft Actor-Critic model                                     |                https://github.com/denisyarats/pytorch_sac | 
| PPO       |                               Proximal Policy Optimization model                                |            https://github.com/nikhilbarhate99/PPO-PyTorch | 
| DDPG      |                            Deep Deterministic Policy Gradient model                             |                                          Updated from TD3 | 
| CNNTD3    |                          TD3 model with 1D CNN encoding of laser state                          |                                                         - |
| RCPG      | Recurrent Convolution Policy Gradient - adding recurrence layers (lstm/gru/rnn) to CNNTD3 model |                                                         - |

**Max Upper Bound Models**

Models that support the additional loss of Q values exceeding the maximal possible Q value in the episode. Q values that exceed this upper bound are used to calculate a loss for the model. This helps to control the overestimation of Q values in off-policy actor-critic networks.
To enable max upper bound loss set `use_max_bound = True` when initializing a model.

| Model  |  
|:-------|
| TD3    | 
| DDPG   | 
| CNNTD3 |



================================================
FILE: mkdocs.yml
================================================
extra:
  python_path:
  - ./robot_nav
nav:
- Home: index.md
- API Reference:
    IR-SIM:
      - SIM: api/IR-SIM/ir-sim
      - MARL SIM: api/IR-SIM/ir-marl-sim
    Models:
      - DDPG: api/models/DDPG.md
      - TD3: api/models/TD3.md
      - CNNTD3: api/models/cnntd3.md
      - RCPG: api/models/RCPG.md
      - HCM: api/models/HCM.md
      - PPO: api/models/PPO.md
      - SAC: api/models/SAC.md
      - MARL:
          - HardSoft Attention: api/models/MARL/Attention
          - TD3: api/models/MARL/TD3
    Training:
      - Train: api/Training/train.md
      - Train RNN: api/Training/trainrnn.md
    Testing:
      - Test: api/Testing/test.md
      - Test RNN: api/Testing/testrnn.md
    Utils:
      - Replay Buffer: api/Utils/replay_buffer.md
      - Utils: api/Utils/utils.md
plugins:
- search
- mkdocstrings
site_name: DRL-robot-navigation-IR-SIM
theme:
  name: material


================================================
FILE: pyproject.toml
================================================
[project]
name = "robot-nav"
version = "0.1.0"
description = ""
authors = [
    {name = "reinis",email = "ireinisi@gmail.com"}
]
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "ir-sim[all] (>=2.3.0,<3.0.0)",
    "torch (>=2.5.1,<3.0.0)",
    "tensorboard (>=2.18.0,<3.0.0)",
    "tqdm (>=4.67.1,<5.0.0)",
    "pytest (>=8.3.4,<9.0.0)",
    "mkdocs (>=1.6.1,<2.0.0)",
    "mkdocstrings[python] (>=0.29.1,<0.30.0)",
    "mkdocs-material (>=9.6.11,<10.0.0)",
    "scipy (>=1.15.2,<2.0.0)",
    "torch-geometric (>=2.6.1,<3.0.0)",
]

[tool.pytest.ini_options]
testpaths = ["tests"]

[build-system]
requires = ["poetry-core>=2.0.0,<3.0.0"]
build-backend = "poetry.core.masonry.api"


================================================
FILE: robot_nav/SIM_ENV/__init__.py
================================================


================================================
FILE: robot_nav/SIM_ENV/marl_sim.py
================================================
import irsim
import numpy as np
import random
import torch

from robot_nav.SIM_ENV.sim_env import SIM_ENV


class MARL_SIM(SIM_ENV):
    """
    Simulation environment for multi-agent robot navigation using IRSim.

    This class extends the SIM_ENV and provides a wrapper for multi-robot
    simulation and interaction, supporting reward computation and custom reset logic.

    Attributes:
        env (object): IRSim simulation environment instance.
        robot_goal (np.ndarray): Current goal position(s) for the robots.
        num_robots (int): Number of robots in the environment.
        x_range (tuple): World x-range.
        y_range (tuple): World y-range.
    """

    def __init__(
        self,
        world_file="multi_robot_world.yaml",
        disable_plotting=False,
        reward_phase=1,
    ):
        """
        Initialize the MARL_SIM environment.

        Args:
            world_file (str, optional):
                Path to an IRSim YAML world configuration. Defaults to
                "multi_robot_world.yaml".
            disable_plotting (bool, optional):
                If True, disables all IRSim plotting and display windows.
                Useful for headless training. Defaults to False.
            reward_phase (int, optional):
                Selects the reward function variant used by `get_reward`.
                Supported values: {1, 2}. Defaults to 1.
        """
        display = False if disable_plotting else True
        self.env = irsim.make(
            world_file, disable_all_plot=disable_plotting, display=display
        )
        robot_info = self.env.get_robot_info(0)
        self.robot_goal = robot_info.goal
        self.num_robots = len(self.env.robot_list)
        self.x_range = self.env._world.x_range
        self.y_range = self.env._world.y_range
        self.reward_phase = reward_phase

    def step(self, action, connection, combined_weights=None):
        """
        Perform a simulation step for all robots using the provided actions and connections.

        Args:
            action (list):
                A list of length `num_robots`. Each element is
                `[linear_velocity, angular_velocity]` to be applied to the
                corresponding robot for this step. Units follow IRSim's
                kinematics setup (typically m/s and rad/s).
            connection (Tensor):
                A torch.Tensor of shape `(num_robots, num_robots-1)` containing
                *logits* that indicate pairwise connections per robot *without*
                the self-connection column. This is not used for logic here,
                but preserved for compatibility (see commented code); you may
                use it upstream to form `combined_weights`.
            combined_weights (Tensor or None, optional):
                A torch.Tensor of shape `(num_robots, num_robots)` **or**
                `(num_robots, num_robots-1)` that encodes visualization weights
                for robot-to-robot edges. When provided, edges from robot *i* to
                *j* are drawn with line width `weight * 2`. If you pass the
                `(num_robots, num_robots-1)` form, ensure indexing aligns with
                how you constructed it (self-column typically omitted).

        Returns:
            tuple:
                (
                    poses (list[list[float]]):
                        `[x, y, theta]` for each robot after the step.
                    distances (list[float]):
                        Euclidean distance from each robot to its goal.
                    coss (list[float]):
                        Cosine of the angle between robot heading and goal vector.
                    sins (list[float]):
                        Sine of the angle between robot heading and goal vector.
                    collisions (list[bool]):
                        Collision flags for each robot, as reported by IRSim.
                    goals (list[bool]):
                        Arrival flags for each robot. If True this step, a new
                        random goal was scheduled for that robot.
                    action (list):
                        Echo of the input `action`.
                    rewards (list[float]):
                        Per-robot rewards computed by `get_reward(...)`.
                    positions (list[list[float]]):
                        `[x, y]` positions for each robot after the step.
                    goal_positions (list[list[float]]):
                        `[x, y]` goal coordinates for each robot after any
                        arrival updates this step.
                )
        """
        self.env.step(action_id=[i for i in range(self.num_robots)], action=action)
        self.env.render()

        poses = []
        distances = []
        coss = []
        sins = []
        collisions = []
        goals = []
        rewards = []
        positions = []
        goal_positions = []
        robot_states = [
            [self.env.robot_list[i].state[0], self.env.robot_list[i].state[1]]
            for i in range(self.num_robots)
        ]
        for i in range(self.num_robots):

            robot_state = self.env.robot_list[i].state
            closest_robots = [
                np.linalg.norm(
                    [
                        robot_states[j][0] - robot_state[0],
                        robot_states[j][1] - robot_state[1],
                    ]
                )
                for j in range(self.num_robots)
                if j != i
            ]
            robot_goal = self.env.robot_list[i].goal
            goal_vector = [
                robot_goal[0].item() - robot_state[0].item(),
                robot_goal[1].item() - robot_state[1].item(),
            ]
            distance = np.linalg.norm(goal_vector)
            goal = self.env.robot_list[i].arrive
            pose_vector = [np.cos(robot_state[2]).item(), np.sin(robot_state[2]).item()]
            cos, sin = self.cossin(pose_vector, goal_vector)
            collision = self.env.robot_list[i].collision
            action_i = action[i]
            reward = self.get_reward(
                goal, collision, action_i, closest_robots, distance, self.reward_phase
            )

            position = [robot_state[0].item(), robot_state[1].item()]
            goal_position = [robot_goal[0].item(), robot_goal[1].item()]

            distances.append(distance)
            coss.append(cos)
            sins.append(sin)
            collisions.append(collision)
            goals.append(goal)
            rewards.append(reward)
            positions.append(position)
            poses.append(
                [robot_state[0].item(), robot_state[1].item(), robot_state[2].item()]
            )
            goal_positions.append(goal_position)

            if combined_weights is not None:
                i_weights = combined_weights[i].tolist()

            for j in range(self.num_robots):
                if combined_weights is not None:
                    weight = i_weights[j]
                    # else:
                    #     weight = 1
                    other_robot_state = self.env.robot_list[j].state
                    other_pos = [
                        other_robot_state[0].item(),
                        other_robot_state[1].item(),
                    ]
                    rx = [position[0], other_pos[0]]
                    ry = [position[1], other_pos[1]]
                    self.env.draw_trajectory(
                        np.array([rx, ry]), refresh=True, linewidth=weight * 2
                    )

            if goal:
                self.env.robot_list[i].set_random_goal(
                    obstacle_list=self.env.obstacle_list,
                    init=True,
                    range_limits=[
                        [self.x_range[0] + 1, self.y_range[0] + 1, -3.141592653589793],
                        [self.x_range[1] - 1, self.y_range[1] - 1, 3.141592653589793],
                    ],
                )

        return (
            poses,
            distances,
            coss,
            sins,
            collisions,
            goals,
            action,
            rewards,
            positions,
            goal_positions,
        )

    def reset(
        self,
        robot_state=None,
        robot_goal=None,
        random_obstacles=False,
        random_obstacle_ids=None,
    ):
        """
        Reset the simulation environment and optionally set robot and obstacle positions.

        Args:
            robot_state (list or None, optional): Initial robot state(s) as [[x], [y], [theta]].
                If None, random states are assigned ensuring minimum spacing between robots.
            robot_goal (list or None, optional): Fixed goal position(s) for robots. If None, random
                goals are generated within the environment boundaries.
            random_obstacles (bool, optional): If True, reposition obstacles randomly within bounds.
            random_obstacle_ids (list or None, optional): IDs of obstacles to randomize. Defaults to
                seven obstacles starting from index equal to the number of robots.

        Returns:
            tuple: (
                poses (list): List of [x, y, theta] for each robot,
                distances (list): Distance to goal for each robot,
                coss (list): Cosine of angle to goal for each robot,
                sins (list): Sine of angle to goal for each robot,
                collisions (list): All False after reset,
                goals (list): All False after reset,
                action (list): Initial action ([[0.0, 0.0], ...]),
                rewards (list): Rewards for initial state,
                positions (list): Initial [x, y] for each robot,
                goal_positions (list): Initial goal [x, y] for each robot,
            )
        """
        if robot_state is None:
            robot_state = [[random.uniform(3, 9)], [random.uniform(3, 9)], [0]]

        init_states = []
        for robot in self.env.robot_list:
            conflict = True
            while conflict:
                conflict = False
                robot_state = [
                    [random.uniform(3, 9)],
                    [random.uniform(3, 9)],
                    [random.uniform(-3.14, 3.14)],
                ]
                pos = [robot_state[0][0], robot_state[1][0]]
                for loc in init_states:
                    vector = [
                        pos[0] - loc[0],
                        pos[1] - loc[1],
                    ]
                    if np.linalg.norm(vector) < 0.6:
                        conflict = True
            init_states.append(pos)

            robot.set_state(
                state=np.array(robot_state),
                init=True,
            )

        if random_obstacles:
            if random_obstacle_ids is None:
                random_obstacle_ids = [i + self.num_robots for i in range(7)]
            self.env.random_obstacle_position(
                range_low=[self.x_range[0], self.y_range[0], -3.14],
                range_high=[self.x_range[1], self.y_range[1], 3.14],
                ids=random_obstacle_ids,
                non_overlapping=True,
            )

        for robot in self.env.robot_list:
            if robot_goal is None:
                robot.set_random_goal(
                    obstacle_list=self.env.obstacle_list,
                    init=True,
                    range_limits=[
                        [self.x_range[0] + 1, self.y_range[0] + 1, -3.141592653589793],
                        [self.x_range[1] - 1, self.y_range[1] - 1, 3.141592653589793],
                    ],
                )
            else:
                self.env.robot.set_goal(np.array(robot_goal), init=True)
        self.env.reset()
        self.robot_goal = self.env.robot.goal

        action = [[0.0, 0.0] for _ in range(self.num_robots)]
        con = torch.tensor(
            [[0.0 for _ in range(self.num_robots - 1)] for _ in range(self.num_robots)]
        )
        poses, distance, cos, sin, _, _, action, reward, positions, goal_positions = (
            self.step(action, con)
        )
        return (
            poses,
            distance,
            cos,
            sin,
            [False] * self.num_robots,
            [False] * self.num_robots,
            action,
            reward,
            positions,
            goal_positions,
        )

    @staticmethod
    def get_reward(goal, collision, action, closest_robots, distance, phase=1):
        """
        Compute the reward for a single robot based on goal progress, collisions,
        control effort, and proximity to other robots.

        Args:
            goal (bool): Whether the robot reached its goal.
            collision (bool): Whether the robot collided with an obstacle or another robot.
            action (list): [linear_velocity, angular_velocity] applied by the robot.
            closest_robots (list): Distances to other robots.
            distance (float): Current distance to the goal.
            phase (int, optional): Reward configuration (1 or 2). Default is 1.

        Returns:
            float: Computed scalar reward.

        Raises:
            ValueError: If an unknown reward phase is specified.
        """

        match phase:
            case 1:
                if goal:
                    return 100.0
                elif collision:
                    return -100.0 * 3 * action[0]
                else:
                    r_dist = 1.5 / distance
                    cl_pen = 0
                    for rob in closest_robots:
                        add = (1.25 - rob) ** 2 if rob < 1.25 else 0
                        cl_pen += add

                    return action[0] - 0.5 * abs(action[1]) - cl_pen + r_dist

            case 2:
                if goal:
                    return 70.0
                elif collision:
                    return -100.0 * 3 * action[0]
                else:
                    cl_pen = 0
                    for rob in closest_robots:
                        add = (3 - rob) ** 2 if rob < 3 else 0
                        cl_pen += add

                    return -0.5 * abs(action[1]) - cl_pen

            case _:
                raise ValueError("Unknown reward phase")


================================================
FILE: robot_nav/SIM_ENV/sim.py
================================================
import irsim
import numpy as np
import random

from robot_nav.SIM_ENV.sim_env import SIM_ENV


class SIM(SIM_ENV):
    """
    A simulation environment interface for robot navigation using IRSim.

    This class wraps around the IRSim environment and provides methods for stepping,
    resetting, and interacting with a mobile robot, including reward computation.

    Attributes:
        env (object): The simulation environment instance from IRSim.
        robot_goal (np.ndarray): The goal position of the robot.
    """

    def __init__(self, world_file="robot_world.yaml", disable_plotting=False):
        """
        Initialize the simulation environment.

        Args:
            world_file (str): Path to the world configuration YAML file.
            disable_plotting (bool): If True, disables rendering and plotting.
        """
        display = False if disable_plotting else True
        self.env = irsim.make(
            world_file, disable_all_plot=disable_plotting, display=display
        )
        robot_info = self.env.get_robot_info(0)
        self.robot_goal = robot_info.goal

    def step(self, lin_velocity=0.0, ang_velocity=0.1):
        """
        Perform one step in the simulation using the given control commands.

        Args:
            lin_velocity (float): Linear velocity to apply to the robot.
            ang_velocity (float): Angular velocity to apply to the robot.

        Returns:
            (tuple): Contains the latest LIDAR scan, distance to goal, cosine and sine of angle to goal,
                   collision flag, goal reached flag, applied action, and computed reward.
        """
        self.env.step(action_id=0, action=np.array([[lin_velocity], [ang_velocity]]))
        self.env.render()

        scan = self.env.get_lidar_scan()
        latest_scan = scan["ranges"]

        robot_state = self.env.get_robot_state()
        goal_vector = [
            self.robot_goal[0].item() - robot_state[0].item(),
            self.robot_goal[1].item() - robot_state[1].item(),
        ]
        distance = np.linalg.norm(goal_vector)
        goal = self.env.robot.arrive
        pose_vector = [np.cos(robot_state[2]).item(), np.sin(robot_state[2]).item()]
        cos, sin = self.cossin(pose_vector, goal_vector)
        collision = self.env.robot.collision
        action = [lin_velocity, ang_velocity]
        reward = self.get_reward(goal, collision, action, latest_scan)

        return latest_scan, distance, cos, sin, collision, goal, action, reward

    def reset(
        self,
        robot_state=None,
        robot_goal=None,
        random_obstacles=True,
        random_obstacle_ids=None,
    ):
        """
        Reset the simulation environment, optionally setting robot and obstacle states.

        Args:
            robot_state (list or None): Initial state of the robot as a list of [x, y, theta, speed].
            robot_goal (list or None): Goal state for the robot.
            random_obstacles (bool): Whether to randomly reposition obstacles.
            random_obstacle_ids (list or None): Specific obstacle IDs to randomize.

        Returns:
            (tuple): Initial observation after reset, including LIDAR scan, distance, cos/sin,
                   and reward-related flags and values.
        """
        if robot_state is None:
            robot_state = [[random.uniform(1, 9)], [random.uniform(1, 9)], [0]]

        self.env.robot.set_state(
            state=np.array(robot_state),
            init=True,
        )

        if random_obstacles:
            if random_obstacle_ids is None:
                random_obstacle_ids = [i + 1 for i in range(7)]
            self.env.random_obstacle_position(
                range_low=[0, 0, -3.14],
                range_high=[10, 10, 3.14],
                ids=random_obstacle_ids,
                non_overlapping=True,
            )

        if robot_goal is None:
            self.env.robot.set_random_goal(
                obstacle_list=self.env.obstacle_list,
                init=True,
                range_limits=[[1, 1, -3.141592653589793], [9, 9, 3.141592653589793]],
            )
        else:
            self.env.robot.set_goal(np.array(robot_goal), init=True)
        self.env.reset()
        self.robot_goal = self.env.robot.goal

        action = [0.0, 0.0]
        latest_scan, distance, cos, sin, _, _, action, reward = self.step(
            lin_velocity=action[0], ang_velocity=action[1]
        )
        return latest_scan, distance, cos, sin, False, False, action, reward

    @staticmethod
    def get_reward(goal, collision, action, laser_scan):
        """
        Calculate the reward for the current step.

        Args:
            goal (bool): Whether the goal has been reached.
            collision (bool): Whether a collision occurred.
            action (list): The action taken [linear velocity, angular velocity].
            laser_scan (list): The LIDAR scan readings.

        Returns:
            (float): Computed reward for the current state.
        """
        if goal:
            return 100.0
        elif collision:
            return -100.0
        else:
            r3 = lambda x: 1.35 - x if x < 1.35 else 0.0
            return action[0] - abs(action[1]) / 2 - r3(min(laser_scan)) / 2


================================================
FILE: robot_nav/SIM_ENV/sim_env.py
================================================
from abc import ABC, abstractmethod
import numpy as np


class SIM_ENV(ABC):
    @abstractmethod
    def step(self):
        raise NotImplementedError("step method must be implemented by subclass.")

    @abstractmethod
    def reset(self):
        raise NotImplementedError("reset method must be implemented by subclass.")

    @staticmethod
    def cossin(vec1, vec2):
        """
        Compute the cosine and sine of the angle between two 2D vectors.

        Args:
            vec1 (list): First 2D vector.
            vec2 (list): Second 2D vector.

        Returns:
            (tuple): (cosine, sine) of the angle between the vectors.
        """
        vec1 = vec1 / np.linalg.norm(vec1)
        vec2 = vec2 / np.linalg.norm(vec2)
        cos = np.dot(vec1, vec2)
        sin = vec1[0] * vec2[1] - vec1[1] * vec2[0]
        return cos, sin

    @staticmethod
    @abstractmethod
    def get_reward():
        raise NotImplementedError("get_reward method must be implemented by subclass.")


================================================
FILE: robot_nav/__init__.py
================================================


================================================
FILE: robot_nav/assets/data.yml
================================================
[File too large to display: 40.2 MB]

================================================
FILE: robot_nav/eval_points.yaml
================================================
robot:
  poses:
    - [[3], [4], [0], [0]]
    - [[8], [1], [1], [0]]
    - [[2], [6], [1], [0]]
    - [[7], [1], [0], [0]]
    - [[7], [6.5], [2], [0]]
    - [[9], [9], [3], [0]]
    - [[2], [9], [1], [0]]
    - [[3], [6], [3], [0]]
    - [[1], [7], [0], [0]]
    - [[5], [7], [3], [0]]
  goals:
    - [[8], [8], [0]]
    - [[2], [9], [0]]
    - [[7], [1], [0]]
    - [[7.2], [9], [0]]
    - [[1], [1], [0]]
    - [[5], [1], [0]]
    - [[7], [4], [0]]
    - [[9], [4], [0]]
    - [[1], [9], [0]]
    - [[5], [1], [0]]






      #robot_states = [np.array([[3],[4],[0],[0]]),
      #                    np.array([[8],[1],[1],[0]]),
      #                    np.array([[2],[6],[1],[0]]),
      #                    np.array([[7],[1],[0],[0]]),
      #                    np.array([[7],[6.5],[2],[0]]),
      #                    np.array([[9],[9],[3],[0]]),
      #                    np.array([[2], [9], [1], [0]]),
      #                    np.array([[3], [6], [3], [0]]),
      #                    np.array([[1], [7], [0], [0]]),
      #                    np.array([[5], [7], [3], [0]]),
      #                    ]
      #    robot_goals = [
      #        np.array([[8], [8], [0]]),
      #        np.array([[2], [9], [0]]),
      #        np.array([[7], [1], [0]]),
      #        np.array([[7.2], [9], [0]]),
      #        np.array([[1], [1], [0]]),
      #        np.array([[5], [1], [0]]),
      #        np.array([[7], [4], [0]]),
      #        np.array([[9], [4], [0]]),
      #        np.array([[1], [9], [0]]),
      #        np.array([[5], [1], [0]]),
      #    ]

================================================
FILE: robot_nav/marl_test_random.py
================================================
import statistics
from pathlib import Path

from tqdm import tqdm
import matplotlib.pyplot as plt
from robot_nav.models.MARL.marlTD3.marlTD3 import TD3

import torch
import numpy as np
from robot_nav.SIM_ENV.marl_sim import MARL_SIM


def outside_of_bounds(poses):
    """
    Check if any robot is outside the defined world boundaries.

    Args:
        poses (list): List of [x, y, theta] poses for each robot.

    Returns:
        bool: True if any robot is outside the 21x21 area centered at (6, 6), else False.
    """
    outside = False
    for pose in poses:
        norm_x = pose[0] - 6
        norm_y = pose[1] - 6
        if abs(norm_x) > 10.5 or abs(norm_y) > 10.5:
            outside = True
            break
    return outside


def main(args=None):
    """Main training function"""

    # ---- Hyperparameters and setup ----
    action_dim = 2  # number of actions produced by the model
    max_action = 1  # maximum absolute value of output actions
    state_dim = 11  # number of input values in the neural network (vector length of state input)
    device = torch.device(
        "cuda" if torch.cuda.is_available() else "cpu"
    )  # using cuda if it is available, cpu otherwise
    epoch = 1  # starting epoch number
    episode = 0
    max_steps = 600  # maximum number of steps in single episode
    steps = 0  # starting step number
    save_every = 5  # save the model every n training cycles
    test_scenarios = 1000

    # ---- Instantiate simulation environment and model ----
    sim = MARL_SIM(
        world_file="worlds/multi_robot_world.yaml",
        disable_plotting=True,
        reward_phase=2,
    )  # instantiate environment

    model = TD3(
        state_dim=state_dim,
        action_dim=action_dim,
        max_action=max_action,
        num_robots=sim.num_robots,
        device=device,
        save_every=save_every,
        load_model=True,
        model_name="TDR-MARL-test",
        load_model_name="TDR-MARL-train",
        load_directory=Path("robot_nav/models/MARL/marlTD3/checkpoint"),
        attention="igs",
    )  # instantiate a model

    connections = torch.tensor(
        [[0.0 for _ in range(sim.num_robots - 1)] for _ in range(sim.num_robots)]
    )

    # ---- Take initial step in environment ----
    poses, distance, cos, sin, collision, goal, a, reward, positions, goal_positions = (
        sim.step([[0, 0] for _ in range(sim.num_robots)], connections)
    )  # get the initial step state
    running_goals = 0
    running_reward = 0
    running_collisions = 0
    running_timesteps = 0

    reward_per_ep = []
    goals_per_ep = []
    col_per_ep = []
    lin_actions = []
    ang_actions = []
    entropy_list = []
    epsilon = 1e-6
    pbar = tqdm(total=test_scenarios)
    while episode < test_scenarios:
        state, terminal = model.prepare_state(
            poses, distance, cos, sin, collision, a, goal_positions
        )
        action, connection, combined_weights = model.get_action(
            np.array(state), False
        )  # get an action from the model

        combined_weights_norm = combined_weights / (
            combined_weights.sum(dim=-1, keepdim=True) + epsilon
        )

        # Entropy for analysis/logging
        entropy = (
            (
                -(combined_weights_norm * (combined_weights_norm + epsilon).log())
                .sum(dim=-1)
                .mean()
            )
            .data.cpu()
            .numpy()
        )
        entropy_list.append(entropy)

        a_in = [
            [(a[0] + 1) / 4, a[1]] for a in action
        ]  # clip linear velocity to [0, 0.5] m/s range

        (
            poses,
            distance,
            cos,
            sin,
            collision,
            goal,
            a,
            reward,
            positions,
            goal_positions,
        ) = sim.step(
            a_in, connection, combined_weights
        )  # get data from the environment
        running_goals += sum(goal)
        running_collisions += sum(collision) / 2
        running_reward += sum(reward)
        running_timesteps += 1
        for j in range(len(a_in)):
            lin_actions.append(a_in[j][0].item())
            ang_actions.append(a_in[j][1].item())
        outside = outside_of_bounds(poses)

        if (
            sum(collision) > 0.5 or steps == max_steps or outside
        ):  # reset environment of terminal state reached, or max_steps were taken
            (
                poses,
                distance,
                cos,
                sin,
                collision,
                goal,
                a,
                reward,
                positions,
                goal_positions,
            ) = sim.reset()
            reward_per_ep.append(running_reward)
            running_reward = 0
            goals_per_ep.append(running_goals)
            running_goals = 0
            col_per_ep.append(running_collisions)
            running_collisions = 0

            steps = 0
            episode += 1
            pbar.update(1)
        else:
            steps += 1

    reward_per_ep = np.array(reward_per_ep, dtype=np.float32)
    goals_per_ep = np.array(goals_per_ep, dtype=np.float32)
    col_per_ep = np.array(col_per_ep, dtype=np.float32)
    lin_actions = np.array(lin_actions, dtype=np.float32)
    ang_actions = np.array(ang_actions, dtype=np.float32)
    entropy = np.array(entropy_list, dtype=np.float32)
    avg_ep_reward = statistics.mean(reward_per_ep)
    avg_ep_reward_std = statistics.stdev(reward_per_ep)
    avg_ep_col = statistics.mean(col_per_ep)
    avg_ep_col_std = statistics.stdev(col_per_ep)
    avg_ep_goals = statistics.mean(goals_per_ep)
    avg_ep_goals_std = statistics.stdev(goals_per_ep)
    mean_lin_action = statistics.mean(lin_actions)
    lin_actions_std = statistics.stdev(lin_actions)
    mean_ang_action = statistics.mean(ang_actions)
    ang_actions_std = statistics.stdev(ang_actions)
    mean_entropy = statistics.mean(entropy)
    mean_entropy_std = statistics.stdev(entropy)

    print(f"avg_ep_reward: {avg_ep_reward}")
    print(f"avg_ep_reward_std: {avg_ep_reward_std}")
    print(f"avg_ep_col: {avg_ep_col}")
    print(f"avg_ep_col_std: {avg_ep_col_std}")
    print(f"avg_ep_goals: {avg_ep_goals}")
    print(f"avg_ep_goals_std: {avg_ep_goals_std}")
    print(f"mean_lin_action: {mean_lin_action}")
    print(f"lin_actions_std: {lin_actions_std}")
    print(f"mean_ang_action: {mean_ang_action}")
    print(f"ang_actions_std: {ang_actions_std}")
    print(f"mean_entropy: {mean_entropy}")
    print(f"mean_entropy_std: {mean_entropy_std}")
    print("..............................................")
    model.writer.add_scalar("test/avg_ep_reward", avg_ep_reward, epoch)
    model.writer.add_scalar("test/avg_ep_reward_std", avg_ep_reward_std, epoch)
    model.writer.add_scalar("test/avg_ep_col", avg_ep_col, epoch)
    model.writer.add_scalar("test/avg_ep_col_std", avg_ep_col_std, epoch)
    model.writer.add_scalar("test/avg_ep_goals", avg_ep_goals, epoch)
    model.writer.add_scalar("test/avg_ep_goals_std", avg_ep_goals_std, epoch)
    model.writer.add_scalar("test/mean_lin_action", mean_lin_action, epoch)
    model.writer.add_scalar("test/lin_actions_std", lin_actions_std, epoch)
    model.writer.add_scalar("test/mean_ang_action", mean_ang_action, epoch)
    model.writer.add_scalar("test/ang_actions_std", ang_actions_std, epoch)
    model.writer.add_scalar("test/mean_entropy", mean_entropy, epoch)
    model.writer.add_scalar("test/mean_entropy_std", mean_entropy_std, epoch)
    bins = 100
    model.writer.add_histogram("test/lin_actions", lin_actions, epoch, max_bins=bins)
    model.writer.add_histogram("test/ang_actions", ang_actions, epoch, max_bins=bins)

    counts, bin_edges = np.histogram(lin_actions, bins=bins)
    fig, ax = plt.subplots()
    ax.bar(
        bin_edges[:-1], counts, width=np.diff(bin_edges), align="edge", log=True
    )  # Log scale on y-axis
    ax.set_xlabel("Value")
    ax.set_ylabel("Frequency (Log Scale)")
    ax.set_title("Histogram with Log Scale")
    model.writer.add_figure("test/lin_actions_hist", fig)

    counts, bin_edges = np.histogram(ang_actions, bins=bins)
    fig, ax = plt.subplots()
    ax.bar(
        bin_edges[:-1], counts, width=np.diff(bin_edges), align="edge", log=True
    )  # Log scale on y-axis
    ax.set_xlabel("Value")
    ax.set_ylabel("Frequency (Log Scale)")
    ax.set_title("Histogram with Log Scale")
    model.writer.add_figure("test/ang_actions_hist", fig)


if __name__ == "__main__":
    main()


================================================
FILE: robot_nav/marl_test_single.py
================================================
import statistics
from pathlib import Path

import irsim
from tqdm import tqdm

from robot_nav.SIM_ENV.sim_env import SIM_ENV
from robot_nav.models.MARL.marlTD3.marlTD3 import TD3

import torch
import numpy as np


class SINGLE_SIM(SIM_ENV):
    """
    Simulation environment for multi-agent robot navigation using IRSim.

    This class extends the SIM_ENV and provides a wrapper for multi-robot
    simulation and interaction, supporting reward computation and custom reset logic.

    Attributes:
        env (object): IRSim simulation environment instance.
        robot_goal (np.ndarray): Current goal position(s) for the robots.
        num_robots (int): Number of robots in the environment.
        x_range (tuple): World x-range.
        y_range (tuple): World y-range.
    """

    def __init__(
        self,
        world_file="worlds/circle_world.yaml",
        disable_plotting=False,
        reward_phase=2,
    ):
        """
        Initialize the MARL_SIM environment.

        Args:
            world_file (str, optional): Path to the world configuration YAML file.
            disable_plotting (bool, optional): If True, disables IRSim rendering and plotting.
        """
        display = False if disable_plotting else True
        self.env = irsim.make(
            world_file, disable_all_plot=disable_plotting, display=display
        )
        self.num_robots = len(self.env.robot_list)
        self.x_range = self.env._world.x_range
        self.y_range = self.env._world.y_range
        self.reward_phase = reward_phase

    def step(self, action, connection, combined_weights=None):
        """
        Perform a simulation step for all robots using the provided actions and connections.

        Args:
            action (list): List of actions for each robot [[lin_vel, ang_vel], ...].
            connection (Tensor): Tensor of shape (num_robots, num_robots-1) containing logits indicating connections between robots.
            combined_weights (Tensor or None, optional): Optional weights for each connection, shape (num_robots, num_robots-1).

        Returns:
            tuple: (
                poses (list): List of [x, y, theta] for each robot,
                distances (list): Distance to goal for each robot,
                coss (list): Cosine of angle to goal for each robot,
                sins (list): Sine of angle to goal for each robot,
                collisions (list): Collision status for each robot,
                goals (list): Goal reached status for each robot,
                action (list): Actions applied,
                rewards (list): Rewards computed,
                positions (list): Current [x, y] for each robot,
                goal_positions (list): Goal [x, y] for each robot,
            )
        """
        act = [
            [0, 0] if self.env.robot_list[i].arrive else action[i]
            for i in range(len(action))
        ]
        self.env.step(action_id=[i for i in range(self.num_robots)], action=act)
        self.env.render()

        poses = []
        distances = []
        coss = []
        sins = []
        collisions = []
        goals = []
        rewards = []
        positions = []
        goal_positions = []
        robot_states = [
            [self.env.robot_list[i].state[0], self.env.robot_list[i].state[1]]
            for i in range(self.num_robots)
        ]
        for i in range(self.num_robots):

            robot_state = self.env.robot_list[i].state
            closest_robots = [
                np.linalg.norm(
                    [
                        robot_states[j][0] - robot_state[0],
                        robot_states[j][1] - robot_state[1],
                    ]
                )
                for j in range(self.num_robots)
                if j != i
            ]
            robot_goal = self.env.robot_list[i].goal
            goal_vector = [
                robot_goal[0].item() - robot_state[0].item(),
                robot_goal[1].item() - robot_state[1].item(),
            ]
            distance = np.linalg.norm(goal_vector)
            goal = self.env.robot_list[i].arrive
            pose_vector = [np.cos(robot_state[2]).item(), np.sin(robot_state[2]).item()]
            cos, sin = self.cossin(pose_vector, goal_vector)
            collision = self.env.robot_list[i].collision
            action_i = action[i]
            reward = self.get_reward(
                goal, collision, action_i, closest_robots, distance, self.reward_phase
            )

            position = [robot_state[0].item(), robot_state[1].item()]
            goal_position = [robot_goal[0].item(), robot_goal[1].item()]

            distances.append(distance)
            coss.append(cos)
            sins.append(sin)
            collisions.append(collision)
            goals.append(goal)
            rewards.append(reward)
            positions.append(position)
            poses.append(
                [robot_state[0].item(), robot_state[1].item(), robot_state[2].item()]
            )
            goal_positions.append(goal_position)

            i_probs = torch.sigmoid(
                connection[i]
            )  # connection[i] is logits for "connect" per pair

            i_connections = i_probs.tolist()
            i_connections.insert(i, 0)
            if combined_weights is not None:
                i_weights = combined_weights[i].tolist()
                i_weights.insert(i, 0)

            for j in range(self.num_robots):
                if i_connections[j] > 0.5:
                    if combined_weights is not None:
                        weight = i_weights[j]
                    else:
                        weight = 1
                    other_robot_state = self.env.robot_list[j].state
                    other_pos = [
                        other_robot_state[0].item(),
                        other_robot_state[1].item(),
                    ]
                    rx = [position[0], other_pos[0]]
                    ry = [position[1], other_pos[1]]
                    self.env.draw_trajectory(
                        np.array([rx, ry]), refresh=True, linewidth=weight
                    )

        return (
            poses,
            distances,
            coss,
            sins,
            collisions,
            goals,
            action,
            rewards,
            positions,
            goal_positions,
        )

    def reset(
        self,
        robot_state=None,
        robot_goal=None,
        random_obstacles=False,
        random_obstacle_ids=None,
    ):
        """
        Reset the simulation environment and optionally set robot and obstacle positions.

        Args:
            robot_state (list or None, optional): Initial state for robots as [x, y, theta, speed].
            robot_goal (list or None, optional): Goal position(s) for the robots.
            random_obstacles (bool, optional): If True, randomly position obstacles.
            random_obstacle_ids (list or None, optional): IDs of obstacles to randomize.

        Returns:
            tuple: (
                poses (list): List of [x, y, theta] for each robot,
                distances (list): Distance to goal for each robot,
                coss (list): Cosine of angle to goal for each robot,
                sins (list): Sine of angle to goal for each robot,
                collisions (list): All False after reset,
                goals (list): All False after reset,
                action (list): Initial action ([[0.0, 0.0], ...]),
                rewards (list): Rewards for initial state,
                positions (list): Initial [x, y] for each robot,
                goal_positions (list): Initial goal [x, y] for each robot,
            )
        """
        self.env.reset()
        self.robot_goal = self.env.robot.goal

        action = [[0.0, 0.0] for _ in range(self.num_robots)]
        con = torch.tensor(
            [[0.0 for _ in range(self.num_robots - 1)] for _ in range(self.num_robots)]
        )
        poses, distance, cos, sin, _, _, action, reward, positions, goal_positions = (
            self.step(action, con)
        )
        return (
            poses,
            distance,
            cos,
            sin,
            [False] * self.num_robots,
            [False] * self.num_robots,
            action,
            reward,
            positions,
            goal_positions,
        )

    @staticmethod
    def get_reward(goal, collision, action, closest_robots, distance, phase=2):
        """
        Calculate the reward for a robot given the current state and action.

        Args:
            goal (bool): Whether the robot reached its goal.
            collision (bool): Whether a collision occurred.
            action (list): [linear_velocity, angular_velocity] applied.
            closest_robots (list): Distances to the closest other robots.
            distance (float): Distance to the goal.
            phase (int, optional): Reward phase/function selector (default: 1).

        Returns:
            float: Computed reward.
        """

        match phase:
            case 1:
                if goal:
                    return 100.0
                elif collision:
                    return -100.0 * 3 * action[0]
                else:
                    r_dist = 1.5 / distance
                    cl_pen = 0
                    for rob in closest_robots:
                        add = 1.5 - rob if rob < 1.5 else 0
                        cl_pen += add

                    return action[0] - 0.5 * abs(action[1]) - cl_pen + r_dist

            case 2:
                if goal:
                    return 70.0
                elif collision:
                    return -100.0 * 3 * action[0]
                else:
                    cl_pen = 0
                    for rob in closest_robots:
                        add = (3 - rob) ** 2 if rob < 3 else 0
                        cl_pen += add

                    return -0.5 * abs(action[1]) - cl_pen

            case _:
                raise ValueError("Unknown reward phase")


def outside_of_bounds(poses):
    """
    Check if any robot is outside the defined world boundaries.

    Args:
        poses (list): List of [x, y, theta] poses for each robot.

    Returns:
        bool: True if any robot is outside the 21x21 area centered at (6, 6), else False.
    """
    outside = False
    for pose in poses:
        norm_x = pose[0] - 6
        norm_y = pose[1] - 6
        if abs(norm_x) > 10.5 or abs(norm_y) > 10.5:
            outside = True
            break
    return outside


def main(args=None):
    """Main training function"""

    # ---- Hyperparameters and setup ----
    action_dim = 2  # number of actions produced by the model
    max_action = 1  # maximum absolute value of output actions
    state_dim = 11  # number of input values in the neural network (vector length of state input)
    device = torch.device(
        "cuda" if torch.cuda.is_available() else "cpu"
    )  # using cuda if it is available, cpu otherwise
    epoch = 1  # starting epoch number
    episode = 0
    max_steps = 600  # maximum number of steps in single episode
    steps = 0  # starting step number
    save_every = 5  # save the model every n training cycles
    test_scenarios = 100

    # ---- Instantiate simulation environment and model ----
    sim = SINGLE_SIM(
        world_file="worlds/circle_world.yaml", disable_plotting=True, reward_phase=2
    )  # instantiate environment

    model = TD3(
        state_dim=state_dim,
        action_dim=action_dim,
        max_action=max_action,
        num_robots=sim.num_robots,
        device=device,
        save_every=save_every,
        load_model=True,
        model_name="TDR-MARL-test",
        load_model_name="TDR-MARL-train",
        load_directory=Path("robot_nav/models/MARL/marlTD3/checkpoint"),
        attention="igs",
    )  # instantiate a model

    connections = torch.tensor(
        [[0.0 for _ in range(sim.num_robots - 1)] for _ in range(sim.num_robots)]
    )

    # ---- Take initial step in environment ----
    poses, distance, cos, sin, collision, goal, a, reward, positions, goal_positions = (
        sim.step([[0, 0] for _ in range(sim.num_robots)], connections)
    )  # get the initial step state
    running_goals = 0
    running_reward = 0
    running_collisions = 0
    running_timesteps = 0
    ran_out_of_time = 0

    reward_per_ep = []
    goals_per_ep = []
    col_per_ep = []
    lin_actions = []
    ang_actions = []
    entropy_list = []
    timesteps_per_ep = []
    epsilon = 1e-6
    pbar = tqdm(total=test_scenarios)
    while episode < test_scenarios:
        state, terminal = model.prepare_state(
            poses, distance, cos, sin, collision, a, goal_positions
        )
        action, connection, combined_weights = model.get_action(
            np.array(state), False
        )  # get an action from the model

        combined_weights_norm = combined_weights / (
            combined_weights.sum(dim=-1, keepdim=True) + epsilon
        )

        entropy = (
            (
                -(combined_weights_norm * (combined_weights_norm + epsilon).log())
                .sum(dim=-1)
                .mean()
            )
            .data.cpu()
            .numpy()
        )
        entropy_list.append(entropy)

        a_in = [[(a[0] + 1) / 4, a[1]] for a in action]

        (
            poses,
            distance,
            cos,
            sin,
            collision,
            goal,
            a,
            reward,
            positions,
            goal_positions,
        ) = sim.step(
            a_in, connection, combined_weights
        )  # get data from the environment
        running_goals += sum(goal)
        running_collisions += sum(collision)
        running_reward += sum(reward)
        running_timesteps += 1
        for j in range(len(a_in)):
            lin_actions.append(a_in[j][0].item())
            ang_actions.append(a_in[j][1].item())
        outside = outside_of_bounds(poses)

        if (
            sum(collision) > 0.5 or steps == max_steps or int(sum(goal)) == len(goal)
        ):  # reset environment of terminal state reached, or max_steps were taken
            (
                poses,
                distance,
                cos,
                sin,
                collision,
                goal,
                a,
                reward,
                positions,
                goal_positions,
            ) = sim.reset()
            reward_per_ep.append(running_reward)
            running_reward = 0
            goals_per_ep.append(running_goals)
            running_goals = 0
            col_per_ep.append(running_collisions)
            running_collisions = 0
            timesteps_per_ep.append(running_timesteps)
            running_timesteps = 0
            if steps == max_steps:
                ran_out_of_time += 1

            steps = 0
            episode += 1
            pbar.update(1)
        else:
            steps += 1

    cols = sum(col_per_ep) / 2
    goals_per_ep = np.array(goals_per_ep, dtype=np.float32)
    col_per_ep = np.array(col_per_ep, dtype=np.float32)
    avg_ep_col = statistics.mean(col_per_ep)
    avg_ep_col_std = statistics.stdev(col_per_ep)
    avg_ep_goals = statistics.mean(goals_per_ep)
    avg_ep_goals_std = statistics.stdev(goals_per_ep)
    t_per_ep = np.array(timesteps_per_ep, dtype=np.float32)
    avg_ep_t = statistics.mean(t_per_ep)
    avg_ep_t_std = statistics.stdev(t_per_ep)

    print(f"avg_ep_col: {avg_ep_col}")
    print(f"avg_ep_col_std: {avg_ep_col_std}")
    print(f"success rate: {test_scenarios - cols}")
    print(f"avg_ep_goals: {avg_ep_goals}")
    print(f"avg_ep_goals_std: {avg_ep_goals_std}")
    print(f"avg_ep_t: {avg_ep_t}")
    print(f"avg_ep_t_std: {avg_ep_t_std}")
    print(f"ran out of time: {ran_out_of_time}")
    print("..............................................")


if __name__ == "__main__":
    main()


================================================
FILE: robot_nav/marl_train.py
================================================
from pathlib import Path

from robot_nav.models.MARL.marlTD3.marlTD3 import TD3

import torch
import numpy as np
from robot_nav.SIM_ENV.marl_sim import MARL_SIM
from utils import get_buffer


def outside_of_bounds(poses):
    """
    Check if any robot is outside the defined world boundaries.

    Args:
        poses (list): List of [x, y, theta] poses for each robot.

    Returns:
        bool: True if any robot is outside the 21x21 area centered at (6, 6), else False.
    """
    outside = False
    for pose in poses:
        norm_x = pose[0] - 6
        norm_y = pose[1] - 6
        if abs(norm_x) > 10.5 or abs(norm_y) > 10.5:
            outside = True
            break
    return outside


def main(args=None):
    """Main training function"""

    # ---- Hyperparameters and setup ----
    action_dim = 2  # number of actions produced by the model
    max_action = 1  # maximum absolute value of output actions
    state_dim = 11  # number of input values in the neural network (vector length of state input)
    device = torch.device(
        "cuda" if torch.cuda.is_available() else "cpu"
    )  # using cuda if it is available, cpu otherwise
    max_epochs = 160  # max number of epochs
    epoch = 1  # starting epoch number
    episode = 0  # starting episode number
    train_every_n = 10  # train and update network parameters every n episodes
    training_iterations = 80  # how many batches to use for single training cycle
    batch_size = 16  # batch size for each training iteration
    max_steps = 300  # maximum number of steps in single episode
    steps = 0  # starting step number
    load_saved_buffer = False  # whether to load experiences from assets/data.yml
    pretrain = False  # whether to use the loaded experiences to pre-train the model (load_saved_buffer must be True)
    pretraining_iterations = (
        10  # number of training iterations to run during pre-training
    )
    save_every = 5  # save the model every n training cycles

    # ---- Instantiate simulation environment and model ----
    sim = MARL_SIM(
        world_file="worlds/multi_robot_world.yaml",
        disable_plotting=True,
        reward_phase=1,
    )  # instantiate environment

    model = TD3(
        state_dim=state_dim,
        action_dim=action_dim,
        max_action=max_action,
        num_robots=sim.num_robots,
        device=device,
        save_every=save_every,
        load_model=False,
        model_name="TDR-MARL-train",
        load_model_name="saved_model",
        load_directory=Path("robot_nav/models/MARL/marlTD3/checkpoint"),
        attention="iga",
    )  # instantiate a model

    # ---- Setup replay buffer and initial connections ----
    replay_buffer = get_buffer(
        model,
        sim,
        load_saved_buffer,
        pretrain,
        pretraining_iterations,
        training_iterations,
        batch_size,
    )
    connections = torch.tensor(
        [[0.0 for _ in range(sim.num_robots - 1)] for _ in range(sim.num_robots)]
    )

    # ---- Take initial step in environment ----
    poses, distance, cos, sin, collision, goal, a, reward, positions, goal_positions = (
        sim.step([[0, 0] for _ in range(sim.num_robots)], connections)
    )  # get the initial step state
    running_goals = 0
    running_collisions = 0
    running_timesteps = 0

    # ---- Main training loop ----
    while epoch < max_epochs:  # train until max_epochs is reached
        state, terminal = model.prepare_state(
            poses, distance, cos, sin, collision, a, goal_positions
        )  # get state a state representation from returned data from the environment

        action, connection, combined_weights = model.get_action(
            np.array(state), True
        )  # get an action from the model

        a_in = [
            [(a[0] + 1) / 4, a[1]] for a in action
        ]  # clip linear velocity to [0, 0.5] m/s range

        (
            poses,
            distance,
            cos,
            sin,
            collision,
            goal,
            a,
            reward,
            positions,
            goal_positions,
        ) = sim.step(
            a_in, connection, combined_weights
        )  # get data from the environment
        running_goals += sum(goal)
        running_collisions += sum(collision)
        running_timesteps += 1
        next_state, terminal = model.prepare_state(
            poses, distance, cos, sin, collision, a, goal_positions
        )  # get a next state representation
        replay_buffer.add(
            state, action, reward, terminal, next_state
        )  # add experience to the replay buffer
        outside = outside_of_bounds(poses)
        if (
            any(terminal) or steps == max_steps or outside
        ):  # reset environment of terminal state reached, or max_steps were taken
            (
                poses,
                distance,
                cos,
                sin,
                collision,
                goal,
                a,
                reward,
                positions,
                goal_positions,
            ) = sim.reset()
            episode += 1
            if episode % train_every_n == 0:
                model.writer.add_scalar(
                    "run/avg_goal", running_goals / running_timesteps, epoch
                )
                model.writer.add_scalar(
                    "run/avg_collision", running_collisions / running_timesteps, epoch
                )
                running_goals = 0
                running_collisions = 0
                running_timesteps = 0
                epoch += 1
                model.train(
                    replay_buffer=replay_buffer,
                    iterations=training_iterations,
                    batch_size=batch_size,
                )  # train the model and update its parameters

            steps = 0
        else:
            steps += 1


if __name__ == "__main__":
    main()


================================================
FILE: robot_nav/models/CNNTD3/CNNTD3.py
================================================
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from numpy import inf
from torch.utils.tensorboard import SummaryWriter

from robot_nav.utils import get_max_bound


class Actor(nn.Module):
    """
    Actor network for the CNNTD3 agent.

    This network takes as input a state composed of laser scan data, goal position encoding,
    and previous action. It processes the scan through a 1D CNN stack and embeds the other
    inputs before merging all features through fully connected layers to output a continuous
    action vector.

    Args:
        action_dim (int): The dimension of the action space.

    Architecture:
        - 1D CNN layers process the laser scan data.
        - Fully connected layers embed the goal vector (cos, sin, distance) and last action.
        - Combined features are passed through two fully connected layers with LeakyReLU.
        - Final action output is scaled with Tanh to bound the values.
    """

    def __init__(self, action_dim):
        super(Actor, self).__init__()

        self.cnn1 = nn.Conv1d(1, 4, kernel_size=8, stride=4)
        self.cnn2 = nn.Conv1d(4, 8, kernel_size=8, stride=4)
        self.cnn3 = nn.Conv1d(8, 4, kernel_size=4, stride=2)

        self.goal_embed = nn.Linear(3, 10)
        self.action_embed = nn.Linear(2, 10)

        self.layer_1 = nn.Linear(36, 400)
        torch.nn.init.kaiming_uniform_(self.layer_1.weight, nonlinearity="leaky_relu")
        self.layer_2 = nn.Linear(400, 300)
        torch.nn.init.kaiming_uniform_(self.layer_2.weight, nonlinearity="leaky_relu")
        self.layer_3 = nn.Linear(300, action_dim)
        self.tanh = nn.Tanh()

    def forward(self, s):
        """
        Forward pass through the Actor network.

        Args:
            s (torch.Tensor): Input state tensor of shape (batch_size, state_dim).
                              The last 5 elements are [distance, cos, sin, lin_vel, ang_vel].

        Returns:
            (torch.Tensor): Action tensor of shape (batch_size, action_dim),
                          with values in range [-1, 1] due to tanh activation.
        """
        if len(s.shape) == 1:
            s = s.unsqueeze(0)
        laser = s[:, :-5]
        goal = s[:, -5:-2]
        act = s[:, -2:]
        laser = laser.unsqueeze(1)

        l = F.leaky_relu(self.cnn1(laser))
        l = F.leaky_relu(self.cnn2(l))
        l = F.leaky_relu(self.cnn3(l))
        l = l.flatten(start_dim=1)

        g = F.leaky_relu(self.goal_embed(goal))

        a = F.leaky_relu(self.action_embed(act))

        s = torch.concat((l, g, a), dim=-1)

        s = F.leaky_relu(self.layer_1(s))
        s = F.leaky_relu(self.layer_2(s))
        a = self.tanh(self.layer_3(s))
        return a


class Critic(nn.Module):
    """
    Critic network for the CNNTD3 agent.

    The Critic estimates Q-values for state-action pairs using two separate sub-networks
    (Q1 and Q2), as required by the TD3 algorithm. Each sub-network uses a combination of
    CNN-extracted features, embedded goal and previous action features, and the current action.

    Args:
        action_dim (int): The dimension of the action space.

    Architecture:
        - Shared CNN layers process the laser scan input.
        - Goal and previous action are embedded and concatenated.
        - Each Q-network uses separate fully connected layers to produce scalar Q-values.
        - Both Q-networks receive the full state and current action.
        - Outputs two Q-value tensors (Q1, Q2) for TD3-style training and target smoothing.
    """

    def __init__(self, action_dim):
        super(Critic, self).__init__()
        self.cnn1 = nn.Conv1d(1, 4, kernel_size=8, stride=4)
        self.cnn2 = nn.Conv1d(4, 8, kernel_size=8, stride=4)
        self.cnn3 = nn.Conv1d(8, 4, kernel_size=4, stride=2)

        self.goal_embed = nn.Linear(3, 10)
        self.action_embed = nn.Linear(2, 10)

        self.layer_1 = nn.Linear(36, 400)
        torch.nn.init.kaiming_uniform_(self.layer_1.weight, nonlinearity="leaky_relu")
        self.layer_2_s = nn.Linear(400, 300)
        torch.nn.init.kaiming_uniform_(self.layer_2_s.weight, nonlinearity="leaky_relu")
        self.layer_2_a = nn.Linear(action_dim, 300)
        torch.nn.init.kaiming_uniform_(self.layer_2_a.weight, nonlinearity="leaky_relu")
        self.layer_3 = nn.Linear(300, 1)
        torch.nn.init.kaiming_uniform_(self.layer_3.weight, nonlinearity="leaky_relu")

        self.layer_4 = nn.Linear(36, 400)
        torch.nn.init.kaiming_uniform_(self.layer_1.weight, nonlinearity="leaky_relu")
        self.layer_5_s = nn.Linear(400, 300)
        torch.nn.init.kaiming_uniform_(self.layer_5_s.weight, nonlinearity="leaky_relu")
        self.layer_5_a = nn.Linear(action_dim, 300)
        torch.nn.init.kaiming_uniform_(self.layer_5_a.weight, nonlinearity="leaky_relu")
        self.layer_6 = nn.Linear(300, 1)
        torch.nn.init.kaiming_uniform_(self.layer_6.weight, nonlinearity="leaky_relu")

    def forward(self, s, action):
        """
        Forward pass through both Q-networks of the Critic.

        Args:
            s (torch.Tensor): Input state tensor of shape (batch_size, state_dim).
                              The last 5 elements are [distance, cos, sin, lin_vel, ang_vel].
            action (torch.Tensor): Current action tensor of shape (batch_size, action_dim).

        Returns:
            (tuple):
                - q1 (torch.Tensor): First Q-value estimate (batch_size, 1).
                - q2 (torch.Tensor): Second Q-value estimate (batch_size, 1).
        """
        laser = s[:, :-5]
        goal = s[:, -5:-2]
        act = s[:, -2:]
        laser = laser.unsqueeze(1)

        l = F.leaky_relu(self.cnn1(laser))
        l = F.leaky_relu(self.cnn2(l))
        l = F.leaky_relu(self.cnn3(l))
        l = l.flatten(start_dim=1)

        g = F.leaky_relu(self.goal_embed(goal))

        a = F.leaky_relu(self.action_embed(act))

        s = torch.concat((l, g, a), dim=-1)

        s1 = F.leaky_relu(self.layer_1(s))
        self.layer_2_s(s1)
        self.layer_2_a(action)
        s11 = torch.mm(s1, self.layer_2_s.weight.data.t())
        s12 = torch.mm(action, self.layer_2_a.weight.data.t())
        s1 = F.leaky_relu(s11 + s12 + self.layer_2_a.bias.data)
        q1 = self.layer_3(s1)

        s2 = F.leaky_relu(self.layer_4(s))
        self.layer_5_s(s2)
        self.layer_5_a(action)
        s21 = torch.mm(s2, self.layer_5_s.weight.data.t())
        s22 = torch.mm(action, self.layer_5_a.weight.data.t())
        s2 = F.leaky_relu(s21 + s22 + self.layer_5_a.bias.data)
        q2 = self.layer_6(s2)
        return q1, q2


# CNNTD3 network
class CNNTD3(object):
    """
    CNNTD3 (Twin Delayed Deep Deterministic Policy Gradient with CNN-based inputs) agent for
    continuous control tasks.

    This class encapsulates the full implementation of the TD3 algorithm using neural network
    architectures for the actor and critic, with optional bounding for critic outputs to
    regularize learning. The agent is designed to train in environments where sensor
    observations (e.g., LiDAR) are used for navigation tasks.

    Args:
        state_dim (int): Dimension of the input state.
        action_dim (int): Dimension of the output action.
        max_action (float): Maximum magnitude of the action.
        device (torch.device): Torch device to use (CPU or GPU).
        lr (float): Learning rate for both actor and critic optimizers.
        save_every (int): Save model every N training iterations (0 to disable).
        load_model (bool): Whether to load a pre-trained model at initialization.
        save_directory (Path): Path to the directory for saving model checkpoints.
        model_name (str): Base name for the saved model files.
        load_directory (Path): Path to load model checkpoints from (if `load_model=True`).
        use_max_bound (bool): Whether to apply maximum Q-value bounding during training.
        bound_weight (float): Weight for the bounding loss term in total loss.
    """

    def __init__(
        self,
        state_dim,
        action_dim,
        max_action,
        device,
        lr=1e-4,
        save_every=0,
        load_model=False,
        save_directory=Path("robot_nav/models/CNNTD3/checkpoint"),
        model_name="CNNTD3",
        load_directory=Path("robot_nav/models/CNNTD3/checkpoint"),
        use_max_bound=False,
        bound_weight=0.25,
    ):
        # Initialize the Actor network
        self.device = device
        self.actor = Actor(action_dim).to(self.device)
        self.actor_target = Actor(action_dim).to(self.device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = torch.optim.Adam(params=self.actor.parameters(), lr=lr)

        # Initialize the Critic networks
        self.critic = Critic(action_dim).to(self.device)
        self.critic_target = Critic(action_dim).to(self.device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = torch.optim.Adam(params=self.critic.parameters(), lr=lr)

        self.action_dim = action_dim
        self.max_action = max_action
        self.state_dim = state_dim
        self.writer = SummaryWriter(comment=model_name)
        self.iter_count = 0
        if load_model:
            self.load(filename=model_name, directory=load_directory)
        self.save_every = save_every
        self.model_name = model_name
        self.save_directory = save_directory
        self.use_max_bound = use_max_bound
        self.bound_weight = bound_weight

    def get_action(self, obs, add_noise):
        """
        Selects an action for a given observation.

        Args:
            obs (np.ndarray): The current observation/state.
            add_noise (bool): Whether to add exploration noise to the action.

        Returns:
            (np.ndarray): The selected action.
        """
        if add_noise:
            return (
                self.act(obs) + np.random.normal(0, 0.2, size=self.action_dim)
            ).clip(-self.max_action, self.max_action)
        else:
            return self.act(obs)

    def act(self, state):
        """
        Computes the deterministic action from the actor network for a given state.

        Args:
            state (np.ndarray): Input state.

        Returns:
            (np.ndarray): Action predicted by the actor network.
        """
        # Function to get the action from the actor
        state = torch.Tensor(state).to(self.device)
        return self.actor(state).cpu().data.numpy().flatten()

    # training cycle
    def train(
        self,
        replay_buffer,
        iterations,
        batch_size,
        discount=0.99,
        tau=0.005,
        policy_noise=0.2,
        noise_clip=0.5,
        policy_freq=2,
        max_lin_vel=0.5,
        max_ang_vel=1,
        goal_reward=100,
        distance_norm=10,
        time_step=0.3,
    ):
        """
        Trains the CNNTD3 agent using sampled batches from the replay buffer.

        Args:
            replay_buffer (ReplayBuffer): Buffer storing environment transitions.
            iterations (int): Number of training iterations.
            batch_size (int): Size of each training batch.
            discount (float): Discount factor for future rewards.
            tau (float): Soft update rate for target networks.
            policy_noise (float): Std. dev. of noise added to target policy.
            noise_clip (float): Maximum value for target policy noise.
            policy_freq (int): Frequency of actor and target network updates.
            max_lin_vel (float): Maximum linear velocity for bounding calculations.
            max_ang_vel (float): Maximum angular velocity for bounding calculations.
            goal_reward (float): Reward value for reaching the goal.
            distance_norm (float): Normalization factor for distance in bounding.
            time_step (float): Time delta between steps.
        """
        av_Q = 0
        max_Q = -inf
        av_loss = 0
        for it in range(iterations):
            # sample a batch from the replay buffer
            (
                batch_states,
                batch_actions,
                batch_rewards,
                batch_dones,
                batch_next_states,
            ) = replay_buffer.sample_batch(batch_size)
            state = torch.Tensor(batch_states).to(self.device)
            next_state = torch.Tensor(batch_next_states).to(self.device)
            action = torch.Tensor(batch_actions).to(self.device)
            reward = torch.Tensor(batch_rewards).to(self.device).reshape(-1, 1)
            done = torch.Tensor(batch_dones).to(self.device).reshape(-1, 1)

            # Obtain the estimated action from the next state by using the actor-target
            next_action = self.actor_target(next_state)

            # Add noise to the action
            noise = (
                torch.Tensor(batch_actions)
                .data.normal_(0, policy_noise)
                .to(self.device)
            )
            noise = noise.clamp(-noise_clip, noise_clip)
            next_action = (next_action + noise).clamp(-self.max_action, self.max_action)

            # Calculate the Q values from the critic-target network for the next state-action pair
            target_Q1, target_Q2 = self.critic_target(next_state, next_action)

            # Select the minimal Q value from the 2 calculated values
            target_Q = torch.min(target_Q1, target_Q2)
            av_Q += torch.mean(target_Q)
            max_Q = max(max_Q, torch.max(target_Q))
            # Calculate the final Q value from the target network parameters by using Bellman equation
            target_Q = reward + ((1 - done) * discount * target_Q).detach()

            # Get the Q values of the basis networks with the current parameters
            current_Q1, current_Q2 = self.critic(state, action)

            # Calculate the loss between the current Q value and the target Q value
            loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

            if self.use_max_bound:
                max_bound = get_max_bound(
                    next_state,
                    discount,
                    max_ang_vel,
                    max_lin_vel,
                    time_step,
                    distance_norm,
                    goal_reward,
                    reward,
                    done,
                    self.device,
                )
                max_excess_Q1 = F.relu(current_Q1 - max_bound)
                max_excess_Q2 = F.relu(current_Q2 - max_bound)
                max_bound_loss = (max_excess_Q1**2).mean() + (max_excess_Q2**2).mean()
                # Add loss for Q values exceeding maximum possible upper bound
                loss += self.bound_weight * max_bound_loss

            # Perform the gradient descent
            self.critic_optimizer.zero_grad()
            loss.backward()
            self.critic_optimizer.step()

            if it % policy_freq == 0:
                # Maximize the actor output value by performing gradient descent on negative Q values
                # (essentially perform gradient ascent)
                actor_grad, _ = self.critic(state, self.actor(state))
                actor_grad = -actor_grad.mean()
                self.actor_optimizer.zero_grad()
                actor_grad.backward()
                self.actor_optimizer.step()

                # Use soft update to update the actor-target network parameters by
                # infusing small amount of current parameters
                for param, target_param in zip(
                    self.actor.parameters(), self.actor_target.parameters()
                ):
                    target_param.data.copy_(
                        tau * param.data + (1 - tau) * target_param.data
                    )
                # Use soft update to update the critic-target network parameters by infusing
                # small amount of current parameters
                for param, target_param in zip(
                    self.critic.parameters(), self.critic_target.parameters()
                ):
                    target_param.data.copy_(
                        tau * param.data + (1 - tau) * target_param.data
                    )

            av_loss += loss
        self.iter_count += 1
        # Write new values for tensorboard
        self.writer.add_scalar("train/loss", av_loss / iterations, self.iter_count)
        self.writer.add_scalar("train/avg_Q", av_Q / iterations, self.iter_count)
        self.writer.add_scalar("train/max_Q", max_Q, self.iter_count)
        if self.save_every > 0 and self.iter_count % self.save_every == 0:
            self.save(filename=self.model_name, directory=self.save_directory)

    def save(self, filename, directory):
        """
        Saves the current model parameters to the specified directory.

        Args:
            filename (str): Base filename for saved files.
            directory (Path): Path to save the model files.
        """
        Path(directory).mkdir(parents=True, exist_ok=True)
        torch.save(self.actor.state_dict(), "%s/%s_actor.pth" % (directory, filename))
        torch.save(
            self.actor_target.state_dict(),
            "%s/%s_actor_target.pth" % (directory, filename),
        )
        torch.save(self.critic.state_dict(), "%s/%s_critic.pth" % (directory, filename))
        torch.save(
            self.critic_target.state_dict(),
            "%s/%s_critic_target.pth" % (directory, filename),
        )

    def load(self, filename, directory):
        """
        Loads model parameters from the specified directory.

        Args:
            filename (str): Base filename for saved files.
            directory (Path): Path to load the model files from.
        """
        self.actor.load_state_dict(
            torch.load("%s/%s_actor.pth" % (directory, filename))
        )
        self.actor_target.load_state_dict(
            torch.load("%s/%s_actor_target.pth" % (directory, filename))
        )
        self.critic.load_state_dict(
            torch.load("%s/%s_critic.pth" % (directory, filename))
        )
        self.critic_target.load_state_dict(
            torch.load("%s/%s_critic_target.pth" % (directory, filename))
        )
        print(f"Loaded weights from: {directory}")

    def prepare_state(self, latest_scan, distance, cos, sin, collision, goal, action):
        """
        Prepares the environment's raw sensor data and navigation variables into
        a format suitable for learning.

        Args:
            latest_scan (list or np.ndarray): Raw scan data (e.g., LiDAR).
            distance (float): Distance to goal.
            cos (float): Cosine of heading angle to goal.
            sin (float): Sine of heading angle to goal.
            collision (bool): Collision status (True if collided).
            goal (bool): Goal reached status.
            action (list or np.ndarray): Last action taken [lin_vel, ang_vel].

        Returns:
            (tuple):
                - state (list): Normalized and concatenated state vector.
                - terminal (int): Terminal flag (1 if collision or goal, else 0).
        """
        latest_scan = np.array(latest_scan)

        inf_mask = np.isinf(latest_scan)
        latest_scan[inf_mask] = 7.0
        latest_scan /= 7

        # Normalize to [0, 1] range
        distance /= 10
        lin_vel = action[0] * 2
        ang_vel = (action[1] + 1) / 2
        state = latest_scan.tolist() + [distance, cos, sin] + [lin_vel, ang_vel]

        assert len(state) == self.state_dim
        terminal = 1 if collision or goal else 0

        return state, terminal


================================================
FILE: robot_nav/models/CNNTD3/__init__.py
================================================


================================================
FILE: robot_nav/models/DDPG/DDPG.py
================================================
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from numpy import inf
from torch.utils.tensorboard import SummaryWriter

from robot_nav.utils import get_max_bound


class Actor(nn.Module):
    """
    Actor network for the DDPG algorithm.

    This network maps input states to actions using a fully connected feedforward architecture.
    It uses Leaky ReLU activations in the hidden layers and a tanh activation at the output
    to ensure the output actions are in the range [-1, 1].

    Architecture:
        - Linear(state_dim → 400) + LeakyReLU
        - Linear(400 → 300) + LeakyReLU
        - Linear(300 → action_dim) + Tanh

    Args:
        state_dim (int): Dimension of the input state.
        action_dim (int): Dimension of the output action space.
    """

    def __init__(self, state_dim, action_dim):
        super(Actor, self).__init__()

        self.layer_1 = nn.Linear(state_dim, 400)
        torch.nn.init.kaiming_uniform_(self.layer_1.weight, nonlinearity="leaky_relu")
        self.layer_2 = nn.Linear(400, 300)
        torch.nn.init.kaiming_uniform_(self.layer_2.weight, nonlinearity="leaky_relu")
        self.layer_3 = nn.Linear(300, action_dim)
        self.tanh = nn.Tanh()

    def forward(self, s):
        """
        Forward pass of the actor network.

        Args:
            s (torch.Tensor): Input state tensor of shape (batch_size, state_dim).

        Returns:
            (torch.Tensor): Output action tensor of shape (batch_size, action_dim), scaled to [-1, 1].
        """
        s = F.leaky_relu(self.layer_1(s))
        s = F.leaky_relu(self.layer_2(s))
        a = self.tanh(self.layer_3(s))
        return a


class Critic(nn.Module):
    """
    Critic network for the DDPG algorithm.

    This network evaluates the Q-value of a given state-action pair. It separately processes
    state and action inputs through linear layers, combines them, and passes the result through
    another linear layer to predict a scalar Q-value.

    Architecture:
        - Linear(state_dim → 400) + LeakyReLU
        - Linear(400 → 300) [state branch]
        - Linear(action_dim → 300) [action branch]
        - Combine both branches, apply LeakyReLU
        - Linear(300 → 1) for Q-value output

    Args:
        state_dim (int): Dimension of the input state.
        action_dim (int): Dimension of the input action.
    """

    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()

        self.layer_1 = nn.Linear(state_dim, 400)
        torch.nn.init.kaiming_uniform_(self.layer_1.weight, nonlinearity="leaky_relu")
        self.layer_2_s = nn.Linear(400, 300)
        torch.nn.init.kaiming_uniform_(self.layer_2_s.weight, nonlinearity="leaky_relu")
        self.layer_2_a = nn.Linear(action_dim, 300)
        torch.nn.init.kaiming_uniform_(self.layer_2_a.weight, nonlinearity="leaky_relu")
        self.layer_3 = nn.Linear(300, 1)
        torch.nn.init.kaiming_uniform_(self.layer_3.weight, nonlinearity="leaky_relu")

    def forward(self, s, a):
        """
        Forward pass of the critic network.

        Args:
            s (torch.Tensor): State tensor of shape (batch_size, state_dim).
            a (torch.Tensor): Action tensor of shape (batch_size, action_dim).

        Returns:
            (torch.Tensor): Q-value tensor of shape (batch_size, 1).
        """
        s1 = F.leaky_relu(self.layer_1(s))
        self.layer_2_s(s1)
        self.layer_2_a(a)
        s11 = torch.mm(s1, self.layer_2_s.weight.data.t())
        s12 = torch.mm(a, self.layer_2_a.weight.data.t())
        s1 = F.leaky_relu(s11 + s12 + self.layer_2_a.bias.data)
        q = self.layer_3(s1)

        return q


# DDPG network
class DDPG(object):
    """
    Deep Deterministic Policy Gradient (DDPG) agent implementation.

    This class encapsulates the actor-critic learning framework using DDPG, which is suitable
    for continuous action spaces. It supports training, action selection, model saving/loading,
    and state preparation for a reinforcement learning agent, specifically designed for robot navigation.

    Args:
        state_dim (int): Dimension of the input state.
        action_dim (int): Dimension of the action space.
        max_action (float): Maximum action value allowed.
        device (torch.device): Computation device (CPU or GPU).
        lr (float): Learning rate for the optimizers. Default is 1e-4.
        save_every (int): Frequency of saving the model in training iterations. 0 means no saving. Default is 0.
        load_model (bool): Flag indicating whether to load a model from disk. Default is False.
        save_directory (Path): Directory to save the model checkpoints. Default is "robot_nav/models/DDPG/checkpoint".
        model_name (str): Name used for saving and TensorBoard logging. Default is "DDPG".
        load_directory (Path): Directory to load model checkpoints from. Default is "robot_nav/models/DDPG/checkpoint".
        use_max_bound (bool): Whether to enforce a learned upper bound on the Q-value. Default is False.
        bound_weight (float): Weight of the upper bound loss penalty. Default is 0.25.
    """

    def __init__(
        self,
        state_dim,
        action_dim,
        max_action,
        device,
        lr=1e-4,
        save_every=0,
        load_model=False,
        save_directory=Path("robot_nav/models/DDPG/checkpoint"),
        model_name="DDPG",
        load_directory=Path("robot_nav/models/DDPG/checkpoint"),
        use_max_bound=False,
        bound_weight=0.25,
    ):
        # Initialize the Actor network
        self.device = device
        self.actor = Actor(state_dim, action_dim).to(self.device)
        self.actor_target = Actor(state_dim, action_dim).to(self.device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = torch.optim.Adam(params=self.actor.parameters(), lr=lr)

        # Initialize the Critic networks
        self.critic = Critic(state_dim, action_dim).to(self.device)
        self.critic_target = Critic(state_dim, action_dim).to(self.device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = torch.optim.Adam(params=self.critic.parameters(), lr=lr)

        self.action_dim = action_dim
        self.max_action = max_action
        self.state_dim = state_dim
        self.writer = SummaryWriter(comment=model_name)
        self.iter_count = 0
        if load_model:
            self.load(filename=model_name, directory=load_directory)
        self.save_every = save_every
        self.model_name = model_name
        self.save_directory = save_directory
        self.use_max_bound = use_max_bound
        self.bound_weight = bound_weight

    def get_action(self, obs, add_noise):
        """
        Selects an action based on the observation.

        Args:
            obs (np.array): The current state observation.
            add_noise (bool): Whether to add exploration noise to the action.

        Returns:
            (np.array): Action selected by the actor network.
        """
        if add_noise:
            return (
                self.act(obs) + np.random.normal(0, 0.2, size=self.action_dim)
            ).clip(-self.max_action, self.max_action)
        else:
            return self.act(obs)

    def act(self, state):
        """
        Computes the action for a given state using the actor network.

        Args:
            state (np.array): Environment state.

        Returns:
            (np.array): Action values as output by the actor network.
        """
        state = torch.Tensor(state).to(self.device)
        return self.actor(state).cpu().data.numpy().flatten()

    # training cycle
    def train(
        self,
        replay_buffer,
        iterations,
        batch_size,
        discount=0.99,
        tau=0.005,
        policy_noise=0.2,
        noise_clip=0.5,
        policy_freq=2,
        max_lin_vel=0.5,
        max_ang_vel=1,
        goal_reward=100,
        distance_norm=10,
        time_step=0.3,
    ):
        """
        Trains the actor and critic networks using a replay buffer and soft target updates.

        Args:
            replay_buffer (ReplayBuffer): Replay buffer object with a sample_batch method.
            iterations (int): Number of training iterations.
            batch_size (int): Size of each training batch.
            discount (float): Discount factor for future rewards.
            tau (float): Soft update factor for target networks.
            policy_noise (float): Standard deviation of noise added to target policy.
            noise_clip (float): Maximum value to clip target policy noise.
            policy_freq (int): Frequency of actor and target updates.
            max_lin_vel (float): Maximum linear velocity, used in Q-bound calculation.
            max_ang_vel (float): Maximum angular velocity, used in Q-bound calculation.
            goal_reward (float): Reward given upon reaching goal.
            distance_norm (float): Distance normalization factor.
            time_step (float): Time step used in max bound calculation.
        """
        av_Q = 0
        max_Q = -inf
        av_loss = 0
        for it in range(iterations):
            # sample a batch from the replay buffer
            (
                batch_states,
                batch_actions,
                batch_rewards,
                batch_dones,
                batch_next_states,
            ) = replay_buffer.sample_batch(batch_size)
            state = torch.Tensor(batch_states).to(self.device)
            next_state = torch.Tensor(batch_next_states).to(self.device)
            action = torch.Tensor(batch_actions).to(self.device)
            reward = torch.Tensor(batch_rewards).to(self.device).reshape(-1, 1)
            done = torch.Tensor(batch_dones).to(self.device).reshape(-1, 1)

            # Obtain the estimated action from the next state by using the actor-target
            next_action = self.actor_target(next_state)

            # Add noise to the action
            noise = (
                torch.Tensor(batch_actions)
                .data.normal_(0, policy_noise)
                .to(self.device)
            )
            noise = noise.clamp(-noise_clip, noise_clip)
            next_action = (next_action + noise).clamp(-self.max_action, self.max_action)

            # Calculate the Q values from the critic-target network for the next state-action pair
            target_Q = self.critic_target(next_state, next_action)

            av_Q += torch.mean(target_Q)
            max_Q = max(max_Q, torch.max(target_Q))
            # Calculate the final Q value from the target network parameters by using Bellman equation
            target_Q = reward + ((1 - done) * discount * target_Q).detach()

            # Get the Q values of the basis networks with the current parameters
            current_Q = self.critic(state, action)

            # Calculate the loss between the current Q value and the target Q value
            loss = F.mse_loss(current_Q, target_Q)

            if self.use_max_bound:
                max_bound = get_max_bound(
                    next_state,
                    discount,
                    max_ang_vel,
                    max_lin_vel,
                    time_step,
                    distance_norm,
                    goal_reward,
                    reward,
                    done,
                    self.device,
                )
                max_excess_Q = F.relu(current_Q - max_bound)
                max_bound_loss = (max_excess_Q**2).mean()
                # Add loss for Q values exceeding maximum possible upper bound
                loss += self.bound_weight * max_bound_loss

            # Perform the gradient descent
            self.critic_optimizer.zero_grad()
            loss.backward()
            self.critic_optimizer.step()

            if it % policy_freq == 0:
                # Maximize the actor output value by performing gradient descent on negative Q values
                # (essentially perform gradient ascent)
                actor_grad = self.critic(state, self.actor(state))
                actor_grad = -actor_grad.mean()
                self.actor_optimizer.zero_grad()
                actor_grad.backward()
                self.actor_optimizer.step()

                # Use soft update to update the actor-target network parameters by
                # infusing small amount of current parameters
                for param, target_param in zip(
                    self.actor.parameters(), self.actor_target.parameters()
                ):
                    target_param.data.copy_(
                        tau * param.data + (1 - tau) * target_param.data
                    )
                # Use soft update to update the critic-target network parameters by infusing
                # small amount of current parameters
                for param, target_param in zip(
                    self.critic.parameters(), self.critic_target.parameters()
                ):
                    target_param.data.copy_(
                        tau * param.data + (1 - tau) * target_param.data
                    )

            av_loss += loss
        self.iter_count += 1
        # Write new values for tensorboard
        self.writer.add_scalar("train/loss", av_loss / iterations, self.iter_count)
        self.writer.add_scalar("train/avg_Q", av_Q / iterations, self.iter_count)
        self.writer.add_scalar("train/max_Q", max_Q, self.iter_count)
        if self.save_every > 0 and self.iter_count % self.save_every == 0:
            self.save(filename=self.model_name, directory=self.save_directory)

    def save(self, filename, directory):
        """
        Saves the model parameters to disk.

        Args:
            filename (str): Base filename for saving the model components.
            directory (str or Path): Directory where the model files will be saved.
        """
        Path(directory).mkdir(parents=True, exist_ok=True)
        torch.save(self.actor.state_dict(), "%s/%s_actor.pth" % (directory, filename))
        torch.save(
            self.actor_target.state_dict(),
            "%s/%s_actor_target.pth" % (directory, filename),
        )
        torch.save(self.critic.state_dict(), "%s/%s_critic.pth" % (directory, filename))
        torch.save(
            self.critic_target.state_dict(),
            "%s/%s_critic_target.pth" % (directory, filename),
        )

    def load(self, filename, directory):
        """
        Loads model parameters from disk.

        Args:
            filename (str): Base filename used for loading model components.
            directory (str or Path): Directory to load the model files from.
        """
        self.actor.load_state_dict(
            torch.load("%s/%s_actor.pth" % (directory, filename))
        )
        self.actor_target.load_state_dict(
            torch.load("%s/%s_actor_target.pth" % (directory, filename))
        )
        self.critic.load_state_dict(
            torch.load("%s/%s_critic.pth" % (directory, filename))
        )
        self.critic_target.load_state_dict(
            torch.load("%s/%s_critic_target.pth" % (directory, filename))
        )
        print(f"Loaded weights from: {directory}")

    def prepare_state(self, latest_scan, distance, cos, sin, collision, goal, action):
        """
        Processes raw sensor input and additional information into a normalized state representation.

        Args:
            latest_scan (list or np.array): Raw LIDAR or laser scan data.
            distance (float): Distance to the goal.
            cos (float): Cosine of the angle to the goal.
            sin (float): Sine of the angle to the goal.
            collision (bool): Whether a collision has occurred.
            goal (bool): Whether the goal has been reached.
            action (list or np.array): The action taken in the previous step.

        Returns:
            (tuple): (state vector, terminal flag)
        """
        latest_scan = np.array(latest_scan)

        inf_mask = np.isinf(latest_scan)
        latest_scan[inf_mask] = 7.0

        max_bins = self.state_dim - 5
        bin_size = int(np.ceil(len(latest_scan) / max_bins))

        # Initialize the list to store the minimum values of each bin
        min_values = []

        # Loop through the data and create bins
        for i in range(0, len(latest_scan), bin_size):
            # Get the current bin
            bin = latest_scan[i : i + min(bin_size, len(latest_scan) - i)]
            # Find the minimum value in the current bin and append it to the min_values list
            min_values.append(min(bin) / 7)

        # Normalize to [0, 1] range
        distance /= 10
        lin_vel = action[0] * 2
        ang_vel = (action[1] + 1) / 2
        state = min_values + [distance, cos, sin] + [lin_vel, ang_vel]

        assert len(state) == self.state_dim
        terminal = 1 if collision or goal else 0

        return state, terminal


================================================
FILE: robot_nav/models/DDPG/__init__.py
================================================


================================================
FILE: robot_nav/models/HCM/__init__.py
================================================


================================================
FILE: robot_nav/models/HCM/hardcoded_model.py
================================================
from math import atan2

import numpy as np
from numpy import clip
from torch.utils.tensorboard import SummaryWriter
import yaml


class HCM(object):
    """
    A class representing a Hard-Coded model (HCM) for a robot's navigation system.

    This class contains methods for generating actions based on the robot's state, preparing state
    representations, training (placeholder method), saving/loading models, and logging experiences.
    The method is suboptimal in order to collect collisions for pre-training of DRL models.

    Attributes:
        max_action (float): The maximum possible action value.
        state_dim (int): The dimension of the state representation.
        writer (SummaryWriter): The writer for logging purposes.
        iterator (int): A counter for tracking sample addition.
        save_samples (bool): Whether to save the samples to a file.
        max_added_samples (int): Maximum number of samples to be added to the saved file.
        file_location (str): The file location for saving samples.
    """

    def __init__(
        self,
        state_dim,
        max_action,
        save_samples,
        max_added_samples=10_000,
        file_location="robot_nav/assets/data.yml",
    ):
        """
        Initialize the HCM class with the provided configuration.

        Args:
            state_dim (int): The dimension of the state space.
            max_action (float): The maximum possible action value.
            save_samples (bool): Whether to save samples to a file.
            max_added_samples (int): The maximum number of samples to save.
            file_location (str): The file path for saving samples.
        """
        self.max_action = max_action
        self.state_dim = state_dim
        self.writer = SummaryWriter()
        self.iterator = 0
        self.save_samples = save_samples
        self.max_added_samples = max_added_samples
        self.file_location = file_location

    def get_action(self, state, add_noise):
        """
        Compute the action to be taken based on the current state of the robot.

        Args:
            state (list): The current state of the robot, including LIDAR scan, distance,
                          and other relevant features.
            add_noise (bool): Whether to add noise to the action for exploration.

        Returns:
            (list): The computed action [linear velocity, angular velocity].
        """
        sin = state[-3]
        cos = state[-4]
        angle = atan2(sin, cos)
        laser_nr = self.state_dim - 5
        limit = 1.5

        if min(state[4 : self.state_dim - 9]) < limit:
            state = state.tolist()
            idx = state[:laser_nr].index(min(state[:laser_nr]))
            if idx > laser_nr / 2:
                sign = -1
            else:
                sign = 1

            idx = clip(idx + sign * 5 * (limit / min(state[:laser_nr])), 0, laser_nr)

            angle = ((3.14 / (laser_nr)) * idx) - 1.57

        rot_vel = clip(angle, -1.0, 1.0)
        lin_vel = -abs(rot_vel / 2)
        return [lin_vel, rot_vel]

    # training cycle
    def train(
        self,
        replay_buffer,
        iterations,
        batch_size,
        discount=0.99999,
        tau=0.005,
        policy_noise=0.2,
        noise_clip=0.5,
        policy_freq=2,
    ):
        """
        Placeholder method for training the hybrid control model.

        Args:
            replay_buffer (object): The replay buffer containing experiences.
            iterations (int): The number of training iterations.
            batch_size (int): The batch size for training.
            discount (float): The discount factor for future rewards.
            tau (float): The soft update parameter for target networks.
            policy_noise (float): The noise added to actions during training.
            noise_clip (float): The clipping value for action noise.
            policy_freq (int): The frequency at which to update the policy.

        Note:
            This method is a placeholder and currently does nothing.
        """
        pass

    def save(self, filename, directory):
        """
        Placeholder method to save the current model state to a file.

        Args:
            filename (str): The name of the file where the model will be saved.
            directory (str): The directory where the file will be stored.

        Note:
            This method is a placeholder and currently does nothing.
        """
        pass

    def load(self, filename, directory):
        """
        Placeholder method to load a model state from a file.

        Args:
            filename (str): The name of the file to load the model from.
            directory (str): The directory where the model file is stored.

        Note:
            This method is a placeholder and currently does nothing.
        """
        pass

    def prepare_state(self, latest_scan, distance, cos, sin, collision, goal, action):
        """
        Prepare the state representation for the model based on the current environment.

        Args:
            latest_scan (list): The LIDAR scan data.
            distance (float): The distance to the goal.
            cos (float): The cosine of the robot's orientation angle.
            sin (float): The sine of the robot's orientation angle.
            collision (bool): Whether a collision occurred.
            goal (bool): Whether the goal has been reached.
            action (list): The action taken by the robot, [linear velocity, angular velocity].

        Returns:
            (tuple): A tuple containing the prepared state and a terminal flag (1 if terminal state, 0 otherwise).
        """
        latest_scan = np.array(latest_scan)

        inf_mask = np.isinf(latest_scan)
        latest_scan[inf_mask] = 7.0

        max_bins = self.state_dim - 5
        bin_size = int(np.ceil(len(latest_scan) / max_bins))

        # Initialize the list to store the minimum values of each bin
        min_values = []

        # Loop through the data and create bins
        for i in range(0, len(latest_scan), bin_size):
            # Get the current bin
            bin = latest_scan[i : i + min(bin_size, len(latest_scan) - i)]
            # Find the minimum value in the current bin and append it to the min_values list
            min_values.append(min(bin))
        state = min_values + [distance, cos, sin] + [action[0], action[1]]

        assert len(state) == self.state_dim
        terminal = 1 if collision or goal else 0

        self.iterator += 1
        if self.save_samples and self.iterator < self.max_added_samples:
            action = action if type(action) is list else action
            action = [float(a) for a in action]
            sample = {
                self.iterator: {
                    "latest_scan": latest_scan.tolist(),
                    "distance": distance.tolist(),
                    "cos": cos.tolist(),
                    "sin": sin.tolist(),
                    "collision": collision,
                    "goal": goal,
                    "action": action,
                }
            }
            with open(self.file_location, "a") as outfile:
                yaml.dump(sample, outfile, default_flow_style=False)

        return state, terminal


================================================
FILE: robot_nav/models/MARL/Attention/__init__.py
================================================


================================================
FILE: robot_nav/models/MARL/Attention/g2anet.py
================================================
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class G2ANet(nn.Module):
    """
    Graph-based attention network for multi-agent coordination.

    Encodes per-agent features, builds pairwise geometric edge features, computes
    hard (binary) attention and soft (weighted) attention, and decodes the
    concatenation of self and attended features.

    Args:
        embedding_dim (int): Dimensionality of the internal agent embedding.

    Attributes:
        hard_encoding (nn.Linear): Outputs logits for binary attention (2 classes).
        q (nn.Linear): Query projection for soft attention.
        k (nn.Linear): Key projection from edge features for soft attention.
        v (nn.Linear): Value projection from edge features for soft attention.
        hard_mlp (nn.Sequential): MLP over concatenated node/edge features for hard attention.
        decoding (nn.Linear): Final decoder mapping concatenated embeddings to output.
        embedding1 (nn.Linear): First layer for agent feature encoding.
        embedding2 (nn.Linear): Second layer for agent feature encoding.
    """

    def __init__(self, embedding_dim):
        super(G2ANet, self).__init__()
        self.embedding_dim = embedding_dim
        self.hard_encoding = nn.Linear(embedding_dim, 2)

        # Soft
        self.q = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.k = nn.Linear(10, embedding_dim, bias=False)
        self.v = nn.Linear(10, embedding_dim)

        self.hard_mlp = nn.Sequential(
            nn.Linear(embedding_dim + 7, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim),
        )

        self.decoding = nn.Linear(embedding_dim * 2, embedding_dim * 2)

        self.embedding1 = nn.Linear(5, 128)
        nn.init.kaiming_uniform_(self.embedding1.weight, nonlinearity="leaky_relu")
        self.embedding2 = nn.Linear(128, embedding_dim)
        nn.init.kaiming_uniform_(self.embedding2.weight, nonlinearity="leaky_relu")

    def encode_agent_features(self, embed):
        """
        Encode per-agent features with a two-layer MLP.

        Args:
            embed (Tensor): Raw agent features of shape (B*N, 5).

        Returns:
            Tensor: Encoded embeddings of shape (B*N, embedding_dim).
        """
        embed = F.leaky_relu(self.embedding1(embed))
        embed = F.leaky_relu(self.embedding2(embed))
        return embed

    def forward(self, embedding):
        """
        Compute hard and soft attentions and produce attended embeddings.

        Args:
            embedding (Tensor): Input tensor of shape (B, N, D), where D ≥ 11. Expected to
                include position (x,y), heading (cos,sin), agent features (for encoding),
                action, and goal.

        Returns:
            tuple:
                output (Tensor): Attended embedding, shape (B*N, 2*embedding_dim).
                hard_logits (Tensor): Hard attention logits (keep class index 1),
                    shape (B*N, N-1).
                unnorm_rel_dist (Tensor): Unnormalized pairwise distances,
                    shape (B*N, N-1, 1).
                mean_entropy (Tensor): Scalar mean entropy of combined (soft×hard) weights.
                hard_weights (Tensor): Binary hard attention mask, shape (B, N, N-1).
                comb_w (Tensor): Combined weights per (receiver, sender) after soft×hard,
                    shape (N, N*(N-1)).
        """
        if embedding.dim() == 2:
            embedding = embedding.unsqueeze(0)
        batch_size, n_agents, _ = embedding.shape

        # Extract sub-features
        embed = embedding[:, :, 4:9].reshape(batch_size * n_agents, -1)
        position = embedding[:, :, :2].reshape(batch_size, n_agents, 2)
        heading = embedding[:, :, 2:4].reshape(
            batch_size, n_agents, 2
        )  # assume (cos(θ), sin(θ))
        action = embedding[:, :, 7:9].reshape(batch_size, n_agents, 2)
        goal = embedding[:, :, -2:].reshape(batch_size, n_agents, 2)

        # Compute pairwise relative goal vectors (for each i,j)
        goal_j = goal.unsqueeze(1).expand(-1, n_agents, -1, -1)
        pos_i = position.unsqueeze(2)
        goal_rel_vec = goal_j - pos_i

        # Encode agent features
        agent_embed = self.encode_agent_features(embed)
        agent_embed = agent_embed.view(batch_size, n_agents, self.embedding_dim)

        # Prep for hard attention: compute all relative geometry for each agent pair
        h_i = agent_embed.unsqueeze(2)  # (B, N, 1, D)
        pos_i = position.unsqueeze(2)  # (B, N, 1, 2)
        pos_j = position.unsqueeze(1)  # (B, 1, N, 2)
        heading_i = heading.unsqueeze(2)  # (B, N, 1, 2)
        heading_j = heading.unsqueeze(1).expand(-1, n_agents, -1, -1)  # (B, 1, N, 2)

        # Compute relative vectors and distance
        rel_vec = pos_j - pos_i  # (B, N, N, 2)
        dx, dy = rel_vec[..., 0], rel_vec[..., 1]
        rel_dist = (
            torch.linalg.vector_norm(rel_vec, dim=-1, keepdim=True) / 12
        )  # (B, N, N, 1)

        # Relative angle in agent i's frame
        angle = torch.atan2(dy, dx) - torch.atan2(heading_i[..., 1], heading_i[..., 0])
        angle = (angle + np.pi) % (2 * np.pi) - np.pi
        rel_angle_sin = torch.sin(angle)
        rel_angle_cos = torch.cos(angle)

        # Other agent's heading
        heading_j_cos = heading_j[..., 0]  # (B, 1, N)
        heading_j_sin = heading_j[..., 1]  # (B, 1, N)

        # Edge features for hard attention
        edge_features = torch.cat(
            [
                rel_dist,  # (B, N, N, 1)
                rel_angle_cos.unsqueeze(-1),  # (B, N, N, 1)
                rel_angle_sin.unsqueeze(-1),  # (B, N, N, 1)
                heading_j_cos.unsqueeze(-1),  # (B, N, N, 1)
                heading_j_sin.unsqueeze(-1),  # (B, N, N, 1)
                action.unsqueeze(1).expand(-1, n_agents, -1, -1),  # (B, N, N, 2)
            ],
            dim=-1,
        )

        # Broadcast agent embedding for all pairs (except self-pairs)
        h_i_expanded = h_i.expand(-1, -1, n_agents, -1)

        # Remove self-pairs using mask
        mask = ~torch.eye(n_agents, dtype=torch.bool, device=embedding.device)
        h_i_flat = h_i_expanded[:, mask].reshape(
            batch_size * n_agents, n_agents - 1, self.embedding_dim
        )
        edge_flat = edge_features[:, mask].reshape(
            batch_size * n_agents, n_agents - 1, -1
        )

        # Concatenate agent embedding and edge features
        hard_input = torch.cat([h_i_flat, edge_flat], dim=-1)

        # Hard attention forward
        h_hard = self.hard_mlp(hard_input)
        hard_logits = self.hard_encoding(h_hard)
        hard_weights = F.gumbel_softmax(hard_logits, hard=False, tau=0.5, dim=-1)[
            ..., 1
        ].unsqueeze(2)
        hard_weights = hard_weights.view(batch_size, n_agents, n_agents - 1)
        # hard_weights = hard_weights.permute(1, 0, 2)

        unnorm_rel_dist = torch.linalg.vector_norm(rel_vec, dim=-1, keepdim=True)
        unnorm_rel_dist = unnorm_rel_dist[:, mask].reshape(
            batch_size * n_agents, n_agents - 1, 1
        )

        # ---- Soft attention computation ----
        attention_outputs = []
        entropy_list = []
        combined_w = []

        # Goal-relative polar features for soft attention
        goal_rel_dist = torch.linalg.vector_norm(goal_rel_vec, dim=-1, keepdim=True)
        goal_angle_global = torch.atan2(goal_rel_vec[..., 1], goal_rel_vec[..., 0])
        heading_angle = torch.atan2(heading_i[..., 1], heading_i[..., 0])
        goal_rel_angle = goal_angle_global - heading_angle
        goal_rel_angle = (goal_rel_angle + np.pi) % (2 * np.pi) - np.pi
        goal_rel_angle_cos = torch.cos(goal_rel_angle).unsqueeze(-1)
        goal_rel_angle_sin = torch.sin(goal_rel_angle).unsqueeze(-1)
        goal_polar = torch.cat(
            [goal_rel_dist, goal_rel_angle_cos, goal_rel_angle_sin], dim=-1
        )

        # Soft attention edge features (include goal polar)
        soft_edge_features = torch.cat([edge_features, goal_polar], dim=-1)

        q = self.q(agent_embed)
        epsilon = 1e-6

        # q = self.q(h_out).reshape(-1, self.args.n_agents, self.args.attention_dim)  # (batch_size, n_agents, args.attention_dim)
        k = self.k(soft_edge_features)
        v = F.relu(self.v(soft_edge_features))
        x = []
        for i in range(n_agents):
            q_i = q[:, i].view(-1, 1, self.embedding_dim)
            k_b = k[:, i, :, :]
            k_i = [
                k_b[:, j] for j in range(n_agents) if j != i
            ]  # keys of all other agents w.r.t agent i
            v_b = v[:, i, :, :]
            v_i = [
                v_b[:, j] for j in range(n_agents) if j != i
            ]  # values of all other agents w.r.t agent i

            k_i = torch.stack(
                k_i, dim=0
            )  # (n_agents - 1, batch_size, args.attention_dim)
            k_i = k_i.permute(1, 2, 0)  # -> (batch_size, embedding_dim, n_agents - 1)
            v_i = torch.stack(v_i, dim=0)
            v_i = v_i.permute(1, 2, 0)

            # (batch_size, 1, attention_dim) * (batch_size, attention_dim，n_agents - 1) = (batch_size, 1，n_agents - 1)
            score = torch.matmul(q_i, k_i)

            # scale then softmax for soft weights
            scaled_score = score / np.sqrt(self.embedding_dim)
            soft_weight = F.softmax(
                scaled_score, dim=-1
            )  # (batch_size，1, n_agents - 1)

            # combine soft and hard weights, then weighted sum of values
            x_i = (v_i * soft_weight * hard_weights[:, i, :].unsqueeze(1)).sum(dim=-1)
            x.append(x_i)

            combined_weights = soft_weight * hard_weights[:, i, :].unsqueeze(
                1
            )  # (B, 1, N-1)
            combined_w.append(combined_weights)

            # Normalize for entropy calculation
            combined_weights_norm = combined_weights / (
                combined_weights.sum(dim=-1, keepdim=True) + epsilon
            )

            # Entropy for analysis/logging
            entropy = (
                -(combined_weights_norm * (combined_weights_norm + epsilon).log())
                .sum(dim=-1)
                .mean()
            )
            entropy_list.append(entropy)

        # concatenate self and attended features per agent
        x = torch.stack(x, dim=1).reshape(
            -1, self.embedding_dim
        )  # (batch_size * n_agents, args.attention_dim)
        self_embed = agent_embed.reshape(-1, self.embedding_dim)
        final_input = torch.cat([self_embed, x], dim=-1)
        output = self.decoding(final_input)
        mean_entropy = torch.stack(entropy_list).mean()
        comb_w = torch.stack(combined_w, dim=1).reshape(n_agents, -1)

        return (
            output,
            hard_logits[..., 1],
            unnorm_rel_dist,
            mean_entropy,
            hard_weights,
            comb_w,
        )


================================================
FILE: robot_nav/models/MARL/Attention/iga.py
================================================
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import softmax


def entropy_from_attention(attn_weights, target_nodes, num_nodes, eps=1e-10):
    """
    Compute mean per-node entropy of incoming softmax attention distributions.

    Args:
        attn_weights (Tensor): Per-edge attention probabilities of shape (num_edges,).
        target_nodes (Tensor): Destination node indices for each edge of shape (num_edges,).
        num_nodes (int): Total number of nodes/agents.
        eps (float, optional): Small constant for numerical stability in log. Defaults to 1e-10.

    Returns:
        Tensor: Scalar tensor containing the mean entropy across nodes that have at least
            one incoming edge. Returns 0.0 if no nodes have incoming edges.
    """

    # Compute log(p), safe for zero
    attn_log = (attn_weights + eps).log()

    # For each edge, p*log(p)
    contrib = -(attn_weights * attn_log)  # [num_edges]

    # Sum contributions per node (i.e., sum over incoming edges)
    entropies = torch.zeros(num_nodes, device=attn_weights.device).index_add_(
        0, target_nodes, contrib
    )
    counts = torch.zeros(num_nodes, device=attn_weights.device).index_add_(
        0, target_nodes, torch.ones_like(attn_weights)
    )

    # (Optional) Only average over nodes with at least one incoming edge
    mask = counts > 0
    entropies = entropies[mask] / counts[mask]
    return (
        entropies.mean()
        if mask.any()
        else torch.tensor(0.0, device=attn_weights.device)
    )


class GoalAttentionLayer(MessagePassing):
    """
    Message-passing layer with learned attention over goal/edge features.

    This layer computes attention scores from concatenated node queries and
    transformed edge attributes, then aggregates value projections with 'add'
    aggregation.

    Args:
        node_dim (int): Dimensionality of node features.
        edge_dim (int): Dimensionality of edge attributes.
        out_dim (int): Output dimensionality for projected messages.
    """

    def __init__(self, node_dim, edge_dim, out_dim):
        super().__init__(aggr="add")  # Could use 'mean' or 'max' as well.
        self.q = torch.nn.Linear(node_dim, out_dim, bias=False)
        self.k = torch.nn.Linear(edge_dim, out_dim, bias=False)
        self.v = torch.nn.Linear(edge_dim, out_dim)
        self.attn_score_layer = nn.Sequential(
            nn.Linear(node_dim * 2, node_dim),
            nn.ReLU(),
            nn.Linear(node_dim, 1),
        )
        self._last_attn_weights = None

    def forward(self, x, edge_index, edge_attr):
        """
        Run attention-based message passing.

        Args:
            x (Tensor): Node features of shape (num_nodes, node_dim).
            edge_index (LongTensor): Edge indices of shape (2, num_edges)
                with format [source, target].
            edge_attr (Tensor): Edge attributes of shape (num_edges, edge_dim).

        Returns:
            tuple:
                out (Tensor): Updated node features of shape (num_nodes, out_dim).
                attn_weights (Tensor or None): Last per-edge softmax weights of
                    shape (num_edges,). None if no edges.
        """
        q = self.q(x)
        out = self.propagate(edge_index, x=q, edge_attr=edge_attr)
        return out, self._last_attn_weights

    def message(self, x_i, edge_attr, index, ptr, size_i):
        """
        Compute per-edge messages using attention weights.

        Args:
            x_i (Tensor): Target-node queries for each edge, shape (num_edges, out_dim).
            edge_attr (Tensor): Edge attributes for each edge, shape (num_edges, edge_dim).
            index (LongTensor): Target node indices per edge, shape (num_edges,).
            ptr: Unused (PyG internal).
            size_i: Unused (PyG internal).

        Returns:
            Tensor: Per-edge messages of shape (num_edges, out_dim).
        """
        k = F.leaky_relu(self.k(edge_attr))
        v = F.leaky_relu(self.v(edge_attr))
        attention_input = torch.cat([x_i, k], dim=-1)
        scores = self.attn_score_layer(attention_input).squeeze(-1)
        attn_weights = softmax(scores, index)  # index = target node index
        self._last_attn_weights = attn_weights.detach()
        return v * attn_weights.unsqueeze(-1)

    def aggregate(self, inputs, index, ptr=None, dim_size=None):
        """
        Aggregate messages per target node.

        Args:
            inputs (Tensor): Per-edge messages, shape (num_edges, out_dim).
            index (LongTensor): Target node indices per edge, shape (num_edges,).
            ptr: Unused (PyG internal).
            dim_size (int or None): Number of target nodes (optional).

        Returns:
            Tensor: Aggregated node features, shape (num_nodes, out_dim).
        """
        return super().aggregate(inputs, index, ptr, dim_size)


class Attention(nn.Module):
    """
    Multi-robot attention mechanism combining hard (binary) and soft (weighted) attentions.

    The module encodes agent features, derives pairwise geometric edge attributes,
    computes hard attention masks and soft attention weights, performs message passing,
    and decodes concatenated self/attended embeddings.

    Args:
        embedding_dim (int): Dimension of the agent embedding vector.

    Attributes:
        embedding1 (nn.Linear): First layer for agent feature encoding.
        embedding2 (nn.Linear): Second layer for agent feature encoding.
        hard_mlp (nn.Sequential): MLP over concatenated node/edge features for hard attention.
        hard_encoding (nn.Linear): Outputs logits for binary attention (2 classes).
        q (nn.Linear): Query projection for soft attention.
        k (nn.Linear): Key projection from edge features for soft attention.
        v (nn.Linear): Value projection from edge features for soft attention.
        attn_score_layer (nn.Sequential): Computes unnormalized scores for soft attention.
        decode_1 (nn.Linear): First decoder layer for concatenated embeddings.
        decode_2 (nn.Linear): Second decoder layer for concatenated embeddings.
        message_graph (GoalAttentionLayer): Graph attention/message-passing layer.
        node_update (nn.Sequential): Optional node-update MLP (unused in forward aggregation).
    """

    def __init__(self, embedding_dim):
        """
        Initialize the attention module.

        Args:
            embedding_dim (int): Output embedding dimension per agent.
        """
        super(Attention, self).__init__()
        self.embedding_dim = embedding_dim

        self.node_update = nn.Sequential(
            nn.Linear(embedding_dim * 2, embedding_dim),
            nn.LeakyReLU(),
            nn.Linear(embedding_dim, embedding_dim),
            nn.LeakyReLU(),
        )

        self.message_graph = GoalAttentionLayer(
            node_dim=embedding_dim, edge_dim=10, out_dim=embedding_dim
        )

        self.embedding1 = nn.Linear(5, 128)
        nn.init.kaiming_uniform_(self.embedding1.weight, nonlinearity="leaky_relu")
        self.embedding2 = nn.Linear(128, embedding_dim)
        nn.init.kaiming_uniform_(self.embedding2.weight, nonlinearity="leaky_relu")

        # Hard attention MLP with distance
        self.hard_mlp = nn.Sequential(
            nn.Linear(embedding_dim + 7, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim),
        )
        self.hard_encoding = nn.Linear(embedding_dim, 2)

        # Soft attention projections
        self.q = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.k = nn.Linear(10, embedding_dim, bias=False)
        self.v = nn.Linear(10, embedding_dim)

        # Soft attention score network (with polar other robot goal position)
        self.attn_score_layer = nn.Sequential(
            nn.Linear(embedding_dim * 2, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, 1),
        )

        # Decoder
        self.decode_1 = nn.Linear(embedding_dim * 2, embedding_dim * 2)
        nn.init.kaiming_uniform_(self.decode_1.weight, nonlinearity="leaky_relu")
        self.decode_2 = nn.Linear(embedding_dim * 2, embedding_dim * 2)
        nn.init.kaiming_uniform_(self.decode_2.weight, nonlinearity="leaky_relu")

    def encode_agent_features(self, embed):
        """
        Encode per-agent features with a two-layer MLP.

        Args:
            embed (Tensor): Raw agent features of shape (B*N, 5).

        Returns:
            Tensor: Encoded embeddings of shape (B*N, embedding_dim).
        """
        embed = F.leaky_relu(self.embedding1(embed))
        embed = F.leaky_relu(self.embedding2(embed))
        return embed

    def forward(self, embedding):
        """
        Compute hard and soft attentions and produce attended embeddings.

        Args:
            embedding (Tensor): Input tensor of shape (B, N, D), where D ≥ 11. The first
                channels are expected to include position (x,y), heading (cos,sin),
                agent features for encoding, action, and goal.

        Returns:
            tuple:
                att_embedding (Tensor): Attended embedding, shape (B*N, 2*embedding_dim).
                hard_logits (Tensor): Logits for hard attention (keep class index 1),
                    shape (B*N, N-1).
                unnorm_rel_dist (Tensor): Unnormalized pairwise distances, shape (B*N, N-1, 1).
                mean_entropy (Tensor): Scalar mean entropy of soft attention per batch.
                hard_weights (Tensor): Binary hard attention mask, shape (B, N, N).
                comb_w (Tensor): Combined soft weights per (receiver, sender),
                    shape (N, N*(N-1)).
        """
        if embedding.dim() == 2:
            embedding = embedding.unsqueeze(0)
        batch_size, n_agents, _ = embedding.shape

        # Extract sub-features
        embed = embedding[:, :, 4:9].reshape(batch_size * n_agents, -1)
        position = embedding[:, :, :2].reshape(batch_size, n_agents, 2)
        heading = embedding[:, :, 2:4].reshape(
            batch_size, n_agents, 2
        )  # assume (cos(θ), sin(θ))
        action = embedding[:, :, 7:9].reshape(batch_size, n_agents, 2)
        goal = embedding[:, :, -2:].reshape(batch_size, n_agents, 2)

        # Compute pairwise relative goal vectors (for each i,j)
        goal_j = goal.unsqueeze(1).expand(-1, n_agents, -1, -1)
        pos_i = position.unsqueeze(2)
        goal_rel_vec = goal_j - pos_i

        # Encode agent features
        agent_embed = self.encode_agent_features(embed)
        agent_embed = agent_embed.view(batch_size, n_agents, self.embedding_dim)

        # Prep for hard attention: compute all relative geometry for each agent pair
        h_i = agent_embed.unsqueeze(2)  # (B, N, 1, D)
        pos_i = position.unsqueeze(2)  # (B, N, 1, 2)
        pos_j = position.unsqueeze(1)  # (B, 1, N, 2)
        heading_i = heading.unsqueeze(2)  # (B, N, 1, 2)
        heading_j = heading.unsqueeze(1).expand(-1, n_agents, -1, -1)  # (B, 1, N, 2)

        # Compute relative vectors and distance
        rel_vec = pos_j - pos_i  # (B, N, N, 2)
        dx, dy = rel_vec[..., 0], rel_vec[..., 1]
        rel_dist = (
            torch.linalg.vector_norm(rel_vec, dim=-1, keepdim=True) / 12
        )  # (B, N, N, 1)

        # Relative angle in agent i's frame
        angle = torch.atan2(dy, dx) - torch.atan2(heading_i[..., 1], heading_i[..., 0])
        angle = (angle + np.pi) % (2 * np.pi) - np.pi
        rel_angle_sin = torch.sin(angle)
        rel_angle_cos = torch.cos(angle)

        # Other agent's heading
        heading_j_cos = heading_j[..., 0]  # (B, 1, N)
        heading_j_sin = heading_j[..., 1]  # (B, 1, N)

        # Edge features for hard attention
        edge_features = torch.cat(
            [
                rel_dist,  # (B, N, N, 1)
                rel_angle_cos.unsqueeze(-1),  # (B, N, N, 1)
                rel_angle_sin.unsqueeze(-1),  # (B, N, N, 1)
                heading_j_cos.unsqueeze(-1),  # (B, N, N, 1)
                heading_j_sin.unsqueeze(-1),  # (B, N, N, 1)
                action.unsqueeze(1).expand(-1, n_agents, -1, -1),  # (B, N, N, 2)
            ],
            dim=-1,
        )

        # Broadcast agent embedding for all pairs (except self-pairs)
        h_i_expanded = h_i.expand(-1, -1, n_agents, -1)

        h_i_flat = h_i_expanded.reshape(
            batch_size * n_agents, n_agents, self.embedding_dim
        )
        edge_flat = edge_features.reshape(batch_size * n_agents, n_agents, -1)

        # Concatenate agent embedding and edge features
        hard_input = torch.cat([h_i_flat, edge_flat], dim=-1)

        # Hard attention forward
        h_hard = self.hard_mlp(hard_input)
        hard_logits = self.hard_encoding(h_hard)
        hard_weights = F.gumbel_softmax(hard_logits, hard=False, tau=0.2, dim=-1)[
            ..., 1
        ].unsqueeze(2)
        hard_weights = hard_weights.view(batch_size, n_agents, n_agents)

        unnorm_rel_dist = torch.linalg.vector_norm(rel_vec, dim=-1, keepdim=True)
        unnorm_rel_dist = unnorm_rel_dist.reshape(batch_size * n_agents, n_agents, 1)

        # ---- Soft attention computation ----
        q = self.q(agent_embed).reshape(batch_size * n_agents, -1)

        attention_outputs = []
        entropy_list = []
        combined_w = []

        # Goal-relative polar features for soft attention
        goal_rel_dist = torch.linalg.vector_norm(goal_rel_vec, dim=-1, keepdim=True)
        goal_angle_global = torch.atan2(goal_rel_vec[..., 1], goal_rel_vec[..., 0])
        heading_angle = torch.atan2(heading_i[..., 1], heading_i[..., 0])
        goal_rel_angle = goal_angle_global - heading_angle
        goal_rel_angle = (goal_rel_angle + np.pi) % (2 * np.pi) - np.pi
        goal_rel_angle_cos = torch.cos(goal_rel_angle).unsqueeze(-1)
        goal_rel_angle_sin = torch.sin(goal_rel_angle).unsqueeze(-1)
        goal_polar = torch.cat(
            [goal_rel_dist, goal_rel_angle_cos, goal_rel_angle_sin], dim=-1
        )

        # Soft attention edge features (include goal polar)
        soft_edge_features = torch.cat([edge_features, goal_polar], dim=-1)

        attn_outputs = []

        for b in range(batch_size):
            edge_index_list = []
            edge_attr_list = []

            # Agent embeddings for this scenario
            node_feats = agent_embed[b]  # [n, emb_dim]
            soft_feats = soft_edge_features[b]  # [n, n, edge_dim]
            hard_mask = hard_weights[b]  # [n, n]

            # Only build edges where hard attention mask is active, no self-loops
            for i in range(n_agents):
                for j in range(n_agents):
                    if i != j and hard_mask[i, j] > 0.5:
                        edge_index_list.append([j, i])  # from j (src) to i (tgt)
                        edge_attr_list.append(soft_feats[i, j])

            # Edge tensors
            if edge_index_list:
                edge_index = torch.tensor(
                    edge_index_list, dtype=torch.long, device=embedding.device
                ).t()  # [2, num_edges]
                edge_attr = torch.stack(edge_attr_list, dim=0)  # [num_edges, edge_dim]
            else:
                edge_index = torch.zeros(
                    (2, 0), dtype=torch.long, device=embedding.device
                )
                edge_attr = torch.zeros(
                    (0, soft_feats.shape[-1]),
                    dtype=soft_feats.dtype,
                    device=embedding.device,
                )

            # Message passing: edge-to-node aggregation only
            attn_out, attn_weights = self.message_graph(
                node_feats, edge_index, edge_attr
            )  # [n, embedding_dim]
            attn_outputs.append(attn_out)
            batch_entropy = entropy_from_attention(
                attn_weights, edge_index[1], num_nodes=n_agents
            )
            entropy_list.append(batch_entropy)

            combined_weights = torch.zeros(
                (n_agents, n_agents), device=attn_weights.device
            )
            for idx in range(edge_index.shape[1]):
                j = edge_index[0, idx].item()  # sender/source
                i = edge_index[1, idx].item()  # receiver/target
                combined_weights[i, j] = attn_weights[idx]
            combined_w.append(combined_weights)

        # Re-stack results for full batch (shape: [batch_size, n_agents, embedding_dim])
        attn_stack = torch.stack(attn_outputs, dim=0).reshape(batch_size * n_agents, -1)
        self_embed = agent_embed.reshape(batch_size * n_agents, -1)

        # Concat original + attended for each agent
        concat_embed = torch.cat(
            [self_embed, attn_stack], dim=-1
        )  # [batch, n, 2*embedding_dim]

        x = F.leaky_relu(self.decode_1(concat_embed))
        att_embedding = F.leaky_relu(self.decode_2(x))
        mean_entropy = torch.stack(entropy_list).mean()
        comb_w = torch.stack(combined_w, dim=1).reshape(n_agents, -1)

        return (
            att_embedding,
            hard_logits[..., 1],
            unnorm_rel_dist,
            mean_entropy,
            hard_weights,
            comb_w,
        )


================================================
FILE: robot_nav/models/MARL/__init__.py
================================================


================================================
FILE: robot_nav/models/MARL/marlTD3/__init__.py
================================================


================================================
FILE: robot_nav/models/MARL/marlTD3/marlTD3.py
================================================
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from numpy import inf
from torch.utils.tensorboard import SummaryWriter

from robot_nav.models.MARL.Attention.g2anet import G2ANet
from robot_nav.models.MARL.Attention.iga import Attention


class Actor(nn.Module):
    """
    Policy network for multi-agent control with an attention encoder.

    The actor encodes inter-agent context (via IGA or G2ANet attention) and maps
    the attended embedding to continuous actions.

    Args:
        action_dim (int): Number of action dimensions per agent.
        embedding_dim (int): Dimensionality of the attention embedding.
        attention (str): Attention backend, one of {"igs", "g2anet"}.

    Attributes:
        attention (nn.Module): Attention encoder producing attended embeddings and
            diagnostics (hard logits, distances, entropy, masks, combined weights).
        policy_head (nn.Sequential): MLP mapping attended embeddings to actions in [-1, 1].
    """

    def __init__(self, action_dim, embedding_dim, attention):
        super().__init__()
        if attention == "igs":
            self.attention = Attention(embedding_dim)
        elif attention == "g2anet":
            self.attention = G2ANet(embedding_dim)  # ➊ edge classifier
        else:
            raise ValueError("unknown attention mechanism in Actor")

        # ➋ policy head (everything _after_ attention)
        self.policy_head = nn.Sequential(
            nn.Linear(embedding_dim * 2, 400),
            nn.LeakyReLU(),
            nn.Linear(400, 300),
            nn.LeakyReLU(),
            nn.Linear(300, action_dim),
            nn.Tanh(),
        )

    def forward(self, obs, detach_attn=False):
        """
        Run the actor to produce actions and attention diagnostics.

        Args:
            obs (Tensor): Observations of shape (B, N, obs_dim) or (N, obs_dim).
            detach_attn (bool, optional): If True, detaches attention features before
                the policy head (useful for target policy smoothing). Defaults to False.

        Returns:
            tuple:
                action (Tensor): Predicted actions, shape (B*N, action_dim).
                hard_logits (Tensor): Hard attention logits, shape (B*N, N-1).
                pair_d (Tensor): Unnormalized pairwise distances, shape (B*N, N-1, 1).
                mean_entropy (Tensor): Mean soft-attention entropy (scalar tensor).
                hard_weights (Tensor): Binary hard attention mask, shape (B, N, N).
                combined_weights (Tensor): Soft weights per (receiver, sender),
                    shape (N, N*(N-1)) for each batch item (batched internally).
        """
        attn_out, hard_logits, pair_d, mean_entropy, hard_weights, combined_weights = (
            self.attention(obs)
        )
        if detach_attn:  # used in the policy phase
            attn_out = attn_out.detach()
        action = self.policy_head(attn_out)
        return action, hard_logits, pair_d, mean_entropy, hard_weights, combined_weights


class Critic(nn.Module):
    """
    Twin Q-value critic with attention-based state encoding.

    Computes two independent Q estimates (Q1, Q2) from attended embeddings and
    concatenated actions, following the TD3 design.

    Args:
        action_dim (int): Number of action dimensions per agent.
        embedding_dim (int): Dimensionality of the attention embedding.
        attention (str): Attention backend, one of {"igs", "g2anet"}.

    Attributes:
        attention (nn.Module): Attention encoder producing attended embeddings and
            diagnostics (hard logits, distances, entropy, masks).
        layer_1..layer_6 (nn.Linear): MLP layers forming the twin Q networks.
    """

    def __init__(self, action_dim, embedding_dim, attention):
        super(Critic, self).__init__()
        self.embedding_dim = embedding_dim
        if attention == "igs":
            self.attention = Attention(embedding_dim)
        elif attention == "g2anet":
            self.attention = G2ANet(embedding_dim)  # ➊ edge classifier
        else:
            raise ValueError("unknown attention mechanism in Critic")

        self.layer_1 = nn.Linear(self.embedding_dim * 2, 400)
        torch.nn.init.kaiming_uniform_(self.layer_1.weight, nonlinearity="leaky_relu")

        self.layer_2_s = nn.Linear(400, 300)
        torch.nn.init.kaiming_uniform_(self.layer_2_s.weight, nonlinearity="leaky_relu")

        self.layer_2_a = nn.Linear(action_dim, 300)
        torch.nn.init.kaiming_uniform_(self.layer_2_a.weight, nonlinearity="leaky_relu")

        self.layer_3 = nn.Linear(300, 1)
        torch.nn.init.kaiming_uniform_(self.layer_3.weight, nonlinearity="leaky_relu")

        self.layer_4 = nn.Linear(self.embedding_dim * 2, 400)
        torch.nn.init.kaiming_uniform_(self.layer_4.weight, nonlinearity="leaky_relu")

        self.layer_5_s = nn.Linear(400, 300)
        torch.nn.init.kaiming_uniform_(self.layer_5_s.weight, nonlinearity="leaky_relu")

        self.layer_5_a = nn.Linear(action_dim, 300)
        torch.nn.init.kaiming_uniform_(self.layer_5_a.weight, nonlinearity="leaky_relu")

        self.layer_6 = nn.Linear(300, 1)
        torch.nn.init.kaiming_uniform_(self.layer_6.weight, nonlinearity="leaky_relu")

    def forward(self, embedding, action):
        """
        Compute twin Q values from attended embeddings and actions.

        Args:
            embedding (Tensor): Agent embeddings of shape (B, N, state_dim).
            action (Tensor): Actions of shape (B*N, action_dim).

        Returns:
            tuple:
                Q1 (Tensor): First Q-value estimate, shape (B*N, 1).
                Q2 (Tensor): Second Q-value estimate, shape (B*N, 1).
                mean_entropy (Tensor): Mean soft-attention entropy (scalar tensor).
                hard_logits (Tensor): Hard attention logits, shape (B*N, N-1).
                unnorm_rel_dist (Tensor): Unnormalized pairwise distances, shape (B*N, N-1, 1).
                hard_weights (Tensor): Binary hard attention mask, shape (B, N, N).
        """

        (
            embedding_with_attention,
            hard_logits,
            unnorm_rel_dist,
            mean_entropy,
            hard_weights,
            _,
        ) = self.attention(embedding)

        # Q1
        s1 = F.leaky_relu(self.layer_1(embedding_with_attention))
        s1 = F.leaky_relu(self.layer_2_s(s1) + self.layer_2_a(action))  # ✅ No .data
        q1 = self.layer_3(s1)

        # Q2
        s2 = F.leaky_relu(self.layer_4(embedding_with_attention))
        s2 = F.leaky_relu(self.layer_5_s(s2) + self.layer_5_a(action))  # ✅ No .data
        q2 = self.layer_6(s2)

        return q1, q2, mean_entropy, hard_logits, unnorm_rel_dist, hard_weights


class TD3(object):
    """
    Twin Delayed DDPG (TD3) agent for multi-agent continuous control.

    Wraps actor/critic networks, exploration policy, training loop, logging, and
    checkpointing utilities.

    Args:
        state_dim (int): Per-agent state dimension.
        action_dim (int): Per-agent action dimension.
        max_action (float): Action clip magnitude (actions are clipped to [-max_action, max_action]).
        device (torch.device): Target device for models and tensors.
        num_robots (int): Number of agents.
        lr_actor (float, optional): Actor learning rate. Defaults to 1e-4.
        lr_critic (float, optional): Critic learning rate. Defaults to 3e-4.
        save_every (int, optional): Save frequency in training iterations (0 = disabled). Defaults to 0.
        load_model (bool, optional): If True, loads weights on init. Defaults to False.
        save_directory (Path, optional): Directory for saving checkpoints.
        model_name (str, optional): Base filename for checkpoints. Defaults to "marlTD3".
        load_model_name (str or None, optional): Filename base to load. Defaults to None (uses model_name).
        load_directory (Path, optional): Directory to load checkpoints from.
        attention (str, optional): Attention backend, one of {"igs", "g2anet"}. Defaults to "igs".

    Attributes:
        actor (Actor): Policy network.
        actor_target (Actor): Target policy network.
        critic (Critic): Twin Q-value network.
        critic_target (Critic): Target critic network.
        actor_optimizer (torch.optim.Optimizer): Optimizer over attention + policy head.
        critic_optimizer (torch.optim.Optimizer): Optimizer over critic.
        writer (SummaryWriter): TensorBoard writer.
        iter_count (int): Training iteration counter for logging/saving.
        save_every (int): Save frequency.
        model_name (str): Base filename for checkpoints.
        save_directory (Path): Directory for saving checkpoints.
    """

    def __init__(
        self,
        state_dim,
        action_dim,
        max_action,
        device,
        num_robots,
        lr_actor=1e-4,
        lr_critic=3e-4,
        save_every=0,
        load_model=False,
        save_directory=Path("robot_nav/models/MARL/marlTD3/checkpoint"),
        model_name="marlTD3",
        load_model_name=None,
        load_directory=Path("robot_nav/models/MARL/marlTD3/checkpoint"),
        attention="igs",
    ):
        # Initialize the Actor network
        if attention not in ["igs", "g2anet"]:
            raise ValueError("unknown attention mechanism specified for TD3 model")
        self.num_robots = num_robots
        self.device = device
        self.actor = Actor(action_dim, embedding_dim=256, attention=attention).to(
            self.device
        )  # Using the updated Actor
        self.actor_target = Actor(
            action_dim, embedding_dim=256, attention=attention
        ).to(self.device)
        self.actor_target.load_state_dict(self.actor.state_dict())

        self.attn_params = list(self.actor.attention.parameters())
        self.policy_params = list(self.actor.policy_head.parameters())

        self.actor_optimizer = torch.optim.Adam(
            self.policy_params + self.attn_params, lr=lr_actor
        )  # TD3 policy

        self.critic = Critic(action_dim, embedding_dim=256, attention=attention).to(
            self.device
        )  # Using the updated Critic
        self.critic_target = Critic(
            action_dim, embedding_dim=256, attention=attention
        ).to(self.device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = torch.optim.Adam(
            params=self.critic.parameters(), lr=lr_critic
        )
        self.action_dim = action_dim
        self.max_action = max_action
        self.state_dim = state_dim
        self.writer = SummaryWriter(comment=model_name)
        self.iter_count = 0
        if load_model_name is None:
            load_model_name = model_name
        if load_model:
            self.load(filename=load_model_name, directory=load_directory)
        self.save_every = save_every
        self.model_name = model_name
        self.save_directory = save_directory

    def get_action(self, obs, add_noise):
        """
        Compute an action for the given observation, with optional exploration noise.

        Args:
            obs (np.ndarray): Observation array of shape (N, state_dim) or (B, N, state_dim).
            add_noise (bool): If True, adds Gaussian exploration noise and clips to bounds.

        Returns:
            tuple:
                action (np.ndarray): Actions reshaped to (N, action_dim).
                connection_logits (Tensor): Hard attention logits from the actor.
                combined_weights (Tensor): Soft attention weights per (receiver, sender).
        """
        action, connection, combined_weights = self.act(obs)
        if add_noise:
            noise = np.random.normal(0, 0.5, size=action.shape)
            noise = [n / 4 if i % 2 else n for i, n in enumerate(noise)]
            action = (action + noise).clip(-self.max_action, self.max_action)

        return action.reshape(-1, 2), connection, combined_weights

    def act(self, state):
        """
        Compute the deterministic action from the current policy.

        Args:
            state (np.ndarray): Observation array of shape (N, state_dim).

        Returns:
            tuple:
                action (np.ndarray): Flattened action vector of shape (N*action_dim,).
                connection_logits (Tensor): Hard attention logits from the actor.
                combined_weights (Tensor): Soft attention weights per (receiver, sender).
        """
        # Function to get the action from the actor
        state = torch.Tensor(state).to(self.device)
        # res = self.attention(state)
        action, connection, _, _, _, combined_weights = self.actor(state)
        return action.cpu().data.numpy().flatten(), connection, combined_weights

    # training cycle
    def train(
        self,
        replay_buffer,
        iterations,
        batch_size,
        discount=0.99,
        tau=0.005,
        policy_noise=0.2,
        noise_clip=0.5,
        policy_freq=2,
        bce_weight=0.01,
        entropy_weight=1,
        connection_proximity_threshold=4,
        max_grad_norm=7.0,
    ):
        """
        Run a TD3 training loop over sampled mini-batches.

        Args:
            replay_buffer: Buffer supporting ``sample_batch(batch_size)`` -> tuple of arrays.
            iterations (int): Number of gradient steps.
            batch_size (int): Mini-batch size.
            discount (float, optional): Discount factor γ. Defaults to 0.99.
            tau (float, optional): Target network update rate. Defaults to 0.005.
            policy_noise (float, optional): Std of target policy noise. Defaults to 0.2.
            noise_clip (float, optional): Clipping range for target noise. Defaults to 0.5.
            policy_freq (int, optional): Actor update period (in critic steps). Defaults to 2.
            bce_weight (float, optional): Weight for hard-connection BCE loss. Defaults to 0.01.
            entropy_weight (float, optional): Weight for attention entropy bonus. Defaults to 1.
            connection_proximity_threshold (float, optional): Distance threshold for the
                positive class when supervising hard connections. Defaults to 4.
            max_grad_norm (float, optional): Gradient clipping norm. Defaults to 7.0.

        Returns:
            None
        """
        av_Q = 0
        max_Q = -inf
        av_loss = 0
        av_critic_loss = 0
        av_critic_entropy = []
        av_actor_entropy = []
        av_actor_loss = 0
        av_critic_bce_loss = []
        av_actor_bce_loss = []

        for it in range(iterations):
            # sample a batch
            (
                batch_states,
                batch_actions,
                batch_rewards,
                batch_dones,
                batch_next_states,
            ) = replay_buffer.sample_batch(batch_size)

            state = (
                torch.Tensor(batch_states)
                .to(self.device)
                .view(batch_size, self.num_robots, self.state_dim)
            )
            next_state = (
                torch.Tensor(batch_next_states)
                .to(self.device)
                .view(batch_size, self.num_robots, self.state_dim)
            )
            action = (
                torch.Tensor(batch_actions)
                .to(self.device)
                .view(batch_size * self.num_robots, self.action_dim)
            )
            reward = (
                torch.Tensor(batch_rewards)
                .to(self.device)
                .view(batch_size * self.num_robots, 1)
            )
            done = (
                torch.Tensor(batch_dones)
                .to(self.device)
                .view(batch_size * self.num_robots, 1)
            )

            with torch.no_grad():
                next_action, _, _, _, _, _ = self.actor_target(
                    next_state, detach_attn=True
                )

            # --- Target smoothing ---
            noise = (
                torch.Tensor(batch_actions)
                .data.normal_(0, policy_noise)
                .to(self.device)
            ).reshape(-1, 2)
            noise = noise.clamp(-noise_clip, noise_clip)
            next_action = (next_action + noise).clamp(-self.max_action, self.max_action)

            # --- Target Q values ---
            target_Q1, target_Q2, _, _, _, _ = self.critic_target(
                next_state, next_action
            )
            target_Q = torch.min(target_Q1, target_Q2)
            av_Q += target_Q.mean()
            max_Q = max(max_Q, target_Q.max().item())
            target_Q = reward + ((1 - done) * discount * target_Q).detach()

            # --- Critic update ---
            (
                current_Q1,
                current_Q2,
                mean_entropy,
                hard_logits,
                unnorm_rel_dist,
                hard_weights,
            ) = self.critic(state, action)
            critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(
                current_Q2, target_Q
            )

            targets = (
                unnorm_rel_dist.flatten() < connection_proximity_threshold
            ).float()
            flat_logits = hard_logits.flatten()
            bce_loss = F.binary_cross_entropy_with_logits(flat_logits, targets)

            av_critic_bce_loss.append(bce_loss)

            total_loss = (
                critic_loss - entropy_weight * mean_entropy + bce_weight * bce_loss
            )
            av_critic_entropy.append(mean_entropy)

            self.critic_optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(self.critic.parameters(), max_grad_norm)
            self.critic_optimizer.step()

            av_loss += total_loss.item()
            av_critic_loss += critic_loss.item()

            # --- Actor update ---
            if it % policy_freq == 0:

                action, hard_logits, unnorm_rel_dist, mean_entropy, hard_weights, _ = (
                    self.actor(state, detach_attn=False)
                )
                targets = (
                    unnorm_rel_dist.flatten() < connection_proximity_threshold
                ).float()
                flat_logits = hard_logits.flatten()
                bce_loss = F.binary_cross_entropy_with_logits(flat_logits, targets)

                av_actor_bce_loss.append(bce_loss)

                actor_Q, _, _, _, _, _ = self.critic(state, action)
                actor_loss = -actor_Q.mean()
                total_loss = (
                    actor_loss - entropy_weight * mean_entropy + bce_weight * bce_loss
                )
                av_actor_entropy.append(mean_entropy)

                self.actor_optimizer.zero_grad()
                total_loss.backward()
                torch.nn.utils.clip_grad_norm_(self.policy_params, max_grad_norm)
                self.actor_optimizer.step()

                av_actor_loss += total_loss.item()

                # Soft update target networks
                for param, target_param in zip(
                    self.actor.parameters(), self.actor_target.parameters()
                ):
                    target_param.data.copy_(
                        tau * param.data + (1 - tau) * target_param.data
                    )

                for param, target_param in zip(
                    self.critic.parameters(), self.critic_target.parameters()
                ):
                    target_param.data.copy_(
                        tau * param.data + (1 - tau) * target_param.data
                    )

        self.iter_count += 1
        self.writer.add_scalar(
            "train/loss_total", av_loss / iterations, self.iter_count
        )
        self.writer.add_scalar(
            "train/critic_loss", av_critic_loss / iterations, self.iter_count
        )
        self.writer.add_scalar(
            "train/av_critic_entropy",
            sum(av_critic_entropy) / len(av_critic_entropy),
            self.iter_count,
        )
        self.writer.add_scalar(
            "train/av_actor_entropy",
            sum(av_actor_entropy) / len(av_actor_entropy),
            self.iter_count,
        )
        self.writer.add_scalar(
            "train/av_critic_bce_loss",
            sum(av_critic_bce_loss) / len(av_critic_bce_loss),
            self.iter_count,
        )
        self.writer.add_scalar(
            "train/av_actor_bce_loss",
            sum(av_actor_bce_loss) / len(av_actor_bce_loss),
            self.iter_count,
        )
        self.writer.add_scalar("train/avg_Q", av_Q / iterations, self.iter_count)
        self.writer.add_scalar("train/max_Q", max_Q, self.iter_count)

        self.writer.add_scalar(
            "train/actor_loss",
            av_actor_loss / (iterations // policy_freq),
            self.iter_count,
        )

        if self.save_every > 0 and self.iter_count % self.save_every == 0:
            self.save(filename=self.model_name, directory=self.save_directory)

    def save(self, filename, directory):
        """
        Saves the current model parameters to the specified directory.

        Args:
            filename (str): Base filename for saved files.
            directory (Path): Path to save the model files.
        """
        Path(directory).mkdir(parents=True, exist_ok=True)
        torch.save(self.actor.state_dict(), "%s/%s_actor.pth" % (directory, filename))
        torch.save(
            self.actor_target.state_dict(),
            "%s/%s_actor_target.pth" % (directory, filename),
        )
        torch.save(self.critic.state_dict(), "%s/%s_critic.pth" % (directory, filename))
        torch.save(
            self.critic_target.state_dict(),
            "%s/%s_critic_target.pth" % (directory, filename),
        )

    def load(self, filename, directory):
        """
        Loads model parameters from the specified directory.

        Args:
            filename (str): Base filename for saved files.
            directory (Path): Path to load the model files from.
        """
        self.actor.load_state_dict(
            torch.load("%s/%s_actor.pth" % (directory, filename))
        )
        self.actor_target.load_state_dict(
            torch.load("%s/%s_actor_target.pth" % (directory, filename))
        )
        self.critic.load_state_dict(
            torch.load("%s/%s_critic.pth" % (directory, filename))
        )
        self.critic_target.load_state_dict(
            torch.load("%s/%s_critic_target.pth" % (directory, filename))
        )
        print(f"Loaded weights from: {directory}")

    def prepare_state(
        self, poses, distance, cos, sin, collision, action, goal_positions
    ):
        """
        Convert raw environment outputs into per-agent state vectors.

        Args:
            poses (list): Per-agent poses [[x, y, theta], ...].
            distance (list): Per-agent distances to goal.
            cos (list): Per-agent cos(heading error to goal).
            sin (list): Per-agent sin(heading error to goal).
            collision (list): Per-agent collision flags (bool or {0,1}).
            action (list): Per-agent last actions [[lin_vel, ang_vel], ...].
            goal_positions (list): Per-agent goals [[gx, gy], ...].

        Returns:
            tuple:
                states (list): Per-agent state vectors (length == state_dim).
                terminal (list): Terminal flags (collision or goal), same length as states.
        """
        states = []
        terminal = []

        for i in range(self.num_robots):
            pose = poses[i]  # [x, y, theta]
            goal_pos = goal_positions[i]  # [goal_x, goal_y]
            act = action[i]  # [lin_vel, ang_vel]

            px, py, theta = pose
            gx, gy = goal_pos

            # Heading as cos/sin
            heading_cos = np.cos(theta)
            heading_sin = np.sin(theta)

            # Last velocity
            lin_vel = act[0] * 2  # Assuming original range [0, 0.5]
            ang_vel = (act[1] + 1) / 2  # Assuming original range [-1, 1]

            # Final state vector
            state = [
                px,
                py,
                heading_cos,
                heading_sin,
                distance[i] / 17,
                cos[i],
                sin[i],
                lin_vel,
                ang_vel,
                gx,
                gy,
            ]

            assert (
                len(state) == self.state_dim
            ), f"State length mismatch: expected {self.state_dim}, got {len(state)}"
            states.append(state)
            terminal.append(collision[i])

        return states, terminal


================================================
FILE: robot_nav/models/PPO/PPO.py
================================================
import torch
import torch.nn as nn
from torch.distributions import MultivariateNormal
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from pathlib import Path
from numpy import inf


class RolloutBuffer:
    """
    Buffer to store rollout data (transitions) for PPO training.

    Attributes:
        actions (list): Actions taken by the agent.
        states (list): States observed by the agent.
        logprobs (list): Log probabilities of the actions.
        rewards (list): Rewards received from the environment.
        state_values (list): Value estimates for the states.
        is_terminals (list): Flags indicating episode termination.
    """

    def __init__(self):
        """
        Initialize empty lists to store buffer elements.
        """
        self.actions = []
        self.states = []
        self.logprobs = []
        self.rewards = []
        self.state_values = []
        self.is_terminals = []

    def clear(self):
        """
        Clear all stored data from the buffer.
        """
        del self.actions[:]
        del self.states[:]
        del self.logprobs[:]
        del self.rewards[:]
        del self.state_values[:]
        del self.is_terminals[:]

    def add(self, state, action, reward, terminal, next_state):
        """
        Add a transition to the buffer. (Partial implementation.)

        Args:
            state (list or np.array): The current observed state.
            action (list or np.array): The action taken.
            reward (float): The reward received after taking the action.
            terminal (bool): Whether the episode terminated.
            next_state (list or np.array): The resulting state after taking the action.
        """
        self.states.append(state)
        self.rewards.append(reward)
        self.is_terminals.append(terminal)


class ActorCritic(nn.Module):
    """
    Actor-Critic neural network model for PPO.

    Attributes:
        actor (nn.Sequential): Policy network (actor) to output action mean.
        critic (nn.Sequential): Value network (critic) to predict state values.
        action_var (Tensor): Diagonal covariance matrix for action distribution.
        device (str): Device used for computation ('cpu' or 'cuda').
        max_action (float): Clipping range for action values.
    """

    def __init__(self, state_dim, action_dim, action_std_init, max_action, device):
        """
        Initialize the Actor and Critic networks.

        Args:
            state_dim (int): Dimension of the input state.
            action_dim (int): Dimension of the action space.
            action_std_init (float): Initial standard deviation of the action distribution.
            max_action (float): Maximum value allowed for an action (clipping range).
            device (str): Device to run the model on.
        """
        super(ActorCritic, self).__init__()

        self.device = device
        self.max_action = max_action

        self.action_dim = action_dim
        self.action_var = torch.full(
            (action_dim,), action_std_init * action_std_init
        ).to(self.device)
        # actor
        self.actor = nn.Sequential(
            nn.Linear(state_dim, 400),
            nn.Tanh(),
            nn.Linear(400, 300),
            nn.Tanh(),
            nn.Linear(300, action_dim),
            nn.Tanh(),
        )
        # critic
        self.critic = nn.Sequential(
            nn.Linear(state_dim, 400),
            nn.Tanh(),
            nn.Linear(400, 300),
            nn.Tanh(),
            nn.Linear(300, 1),
        )

    def set_action_std(self, new_action_std):
        """
        Set a new standard deviation for the action distribution.

        Args:
            new_action_std (float): New standard deviation.
        """
        self.action_var = torch.full(
            (self.action_dim,), new_action_std * new_action_std
        ).to(self.device)

    def forward(self):
        """
        Forward method is not implemented, as it's unused directly.

        Raises:
            NotImplementedError: Always raised when called.
        """
        raise NotImplementedError

    def act(self, state, sample):
        """
        Compute an action, its log probability, and the state value.

        Args:
            state (Tensor): Input state tensor.
            sample (bool): Whether to sample from the action distribution or use mean.

        Returns:
            (Tuple[Tensor, Tensor, Tensor]): Sampled (or mean) action, log probability, and state value.
        """
        action_mean = self.actor(state)
        cov_mat = torch.diag(self.action_var).unsqueeze(dim=0)
        dist = MultivariateNormal(action_mean, cov_mat)

        if sample:
            action = torch.clip(
                dist.sample(), min=-self.max_action, max=self.max_action
            )
        else:
            action = dist.mean
        action_logprob = dist.log_prob(action)
        state_val = self.critic(state)

        return action.detach(), action_logprob.detach(), state_val.detach()

    def evaluate(self, state, action):
        """
        Evaluate action log probabilities, entropy, and state values for given states and actions.

        Args:
            state (Tensor): Batch of states.
            action (Tensor): Batch of actions.

        Returns:
            (Tuple[Tensor, Tensor, Tensor]): Action log probabilities, state values, and distribution entropy.
        """
        action_mean = self.actor(state)

        action_var = self.action_var.expand_as(action_mean)
        cov_mat = torch.diag_embed(action_var).to(self.device)
        dist = MultivariateNormal(action_mean, cov_mat)

        # For Single Action Environments.
        if self.action_dim == 1:
            action = action.reshape(-1, self.action_dim)

        action_logprobs = dist.log_prob(action)
        dist_entropy = dist.entropy()
        state_values = self.critic(state)

        return action_logprobs, state_values, dist_entropy


class PPO:
    """
    Proximal Policy Optimization (PPO) implementation for continuous control tasks.

    Attributes:
        max_action (float): Maximum action value.
        action_std (float): Standard deviation of the action distribution.
        action_std_decay_rate (float): Rate at which to decay action standard deviation.
        min_action_std (float): Minimum allowed action standard deviation.
        state_dim (int): Dimension of the state space.
        gamma (float): Discount factor for future rewards.
        eps_clip (float): Clipping range for policy updates.
        device (str): Device for model computation ('cpu' or 'cuda').
        save_every (int): Interval (in iterations) for saving model checkpoints.
        model_name (str): Name used when saving/loading model.
        save_directory (Path): Directory to save model checkpoints.
        iter_count (int): Number of training iterations completed.
        buffer (RolloutBuffer): Buffer to store trajectories.
        policy (ActorCritic): Current actor-critic network.
        optimizer (torch.optim.Optimizer): Optimizer for actor and critic.
        policy_old (ActorCritic): Old actor-critic network for computing PPO updates.
        MseLoss (nn.Module): Mean squared error loss function.
        writer (SummaryWriter): TensorBoard summary writer.
    """

    def __init__(
        self,
        state_dim,
        action_dim,
        max_action,
        lr_actor=0.0003,
        lr_critic=0.001,
        gamma=0.99,
        eps_clip=0.2,
        action_std_init=0.6,
        action_std_decay_rate=0.015,
        min_action_std=0.1,
        device="cpu",
        save_every=10,
        load_model=False,
        save_directory=Path("robot_nav/models/PPO/checkpoint"),
        model_name="PPO",
        load_directory=Path("robot_nav/models/PPO/checkpoint"),
    ):
        self.max_action = max_action
        self.action_std = action_std_init
        self.action_std_decay_rate = action_std_decay_rate
        self.min_action_std = min_action_std
        self.state_dim = state_dim
        self.gamma = gamma
        self.eps_clip = eps_clip
        self.device = device
        self.save_every = save_every
        self.model_name = model_name
        self.save_directory = save_directory
        self.iter_count = 0

        self.buffer = RolloutBuffer()

        self.policy = ActorCritic(
            state_dim, action_dim, action_std_init, self.max_action, self.device
        ).to(device)
        self.optimizer = torch.optim.Adam(
            [
                {"params": self.policy.actor.parameters(), "lr": lr_actor},
                {"params": self.policy.critic.parameters(), "lr": lr_critic},
            ]
        )

        self.policy_old = ActorCritic(
            state_dim, action_dim, action_std_init, self.max_action, self.device
        ).to(device)
        self.policy_old.load_state_dict(self.policy.state_dict())
        if load_model:
            self.load(filename=model_name, directory=load_directory)

        self.MseLoss = nn.MSELoss()
        self.writer = SummaryWriter(comment=model_name)

    def set_action_std(self, new_action_std):
        """
        Set a new standard deviation for the action distribution.

        Args:
            new_action_std (float): New standard deviation value.
        """
        self.action_std = new_action_std
        self.policy.set_action_std(new_action_std)
        self.policy_old.set_action_std(new_action_std)

    def decay_action_std(self, action_std_decay_rate, min_action_std):
        """
        Decay the action standard deviation by a fixed rate, down to a minimum threshold.

        Args:
            action_std_decay_rate (float): Amount to reduce standard deviation by.
            min_action_std (float): Minimum value for standard deviation.
        """
        print(
            "--------------------------------------------------------------------------------------------"
        )
        self.action_std = self.action_std - action_std_decay_rate
        self.action_std = round(self.action_std, 4)
        if self.action_std <= min_action_std:
            self.action_std = min_action_std
            print(
                "setting actor output action_std to min_action_std : ", self.action_std
            )
        else:
            print("setting actor output action_std to : ", self.action_std)
        self.set_action_std(self.action_std)
        print(
            "--------------------------------------------------------------------------------------------"
        )

    def get_action(self, state, add_noise):
        """
        Sample an action using the current policy (optionally with noise), and store in buffer if noise is added.

        Args:
            state (array_like): Input state for the policy.
            add_noise (bool): Whether to sample from the distribution (True) or use the deterministic mean (False).

        Returns:
            (np.ndarray): Sampled action.
        """

        with torch.no_grad():
            state = torch.FloatTensor(state).to(self.device)
            action, action_logprob, state_val = self.policy_old.act(state, add_noise)

        if add_noise:
            # self.buffer.states.append(state)
            self.buffer.actions.append(action)
            self.buffer.logprobs.append(action_logprob)
            self.buffer.state_values.append(state_val)

        return action.detach().cpu().numpy().flatten()

    def train(self, replay_buffer, iterations, batch_size):
        """
        Train the policy and value function using PPO loss based on the stored rollout buffer.

        Args:
            replay_buffer (object): Placeholder for compatibility (not used).
            iterations (int): Number of epochs to optimize the policy per update.
            batch_size (int): Batch size (not used; training uses the whole buffer).
        """
        # Monte Carlo estimate of returns
        rewards = []
        discounted_reward = 0
        for reward, is_terminal in zip(
            reversed(self.buffer.rewards), reversed(self.buffer.is_terminals)
        ):
            if is_terminal:
                discounted_reward = 0
            discounted_reward = reward + (self.gamma * discounted_reward)
            rewards.insert(0, discounted_reward)

        # Normalizing the rewards
        rewards = torch.tensor(rewards, dtype=torch.float32).to(self.device)
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-7)

        # convert list to tensor
        assert len(self.buffer.actions) == len(self.buffer.states)

        states = [torch.tensor(st, dtype=torch.float32) for st in self.buffer.states]
        old_states = torch.squeeze(torch.stack(states, dim=0)).detach().to(self.device)
        old_actions = (
            torch.squeeze(torch.stack(self.buffer.actions, dim=0))
            .detach()
            .to(self.device)
        )
        old_logprobs = (
            torch.squeeze(torch.stack(self.buffer.logprobs, dim=0))
            .detach()
            .to(self.device)
        )
        old_state_values = (
            torch.squeeze(torch.stack(self.buffer.state_values, dim=0))
            .detach()
            .to(self.device)
        )

        # calculate advantages
        advantages = rewards.detach() - old_state_values.detach()

        av_state_values = 0
        max_state_value = -inf
        av_loss = 0
        # Optimize policy for K epochs
        for _ in range(iterations):
            # Evaluating old actions and values
            logprobs, state_values, dist_entropy = self.policy.evaluate(
                old_states, old_actions
            )

            # match state_values tensor dimensions with rewards tensor
            state_values = torch.squeeze(state_values)
            av_state_values += torch.mean(state_values)
            max_state_value = max(max_state_value, max(state_values))
            # Finding the ratio (pi_theta / pi_theta__old)
            ratios = torch.exp(logprobs - old_logprobs.detach())

            # Finding Surrogate Loss
            surr1 = ratios * advantages
            surr2 = (
                torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            )

            # final loss of clipped objective PPO
            loss = (
                -torch.min(surr1, surr2)
                + 0.5 * self.MseLoss(state_values, rewards)
                - 0.01 * dist_entropy
            )

            # take gradient step
            self.optimizer.zero_grad()
            loss.mean().backward()
            self.optimizer.step()
            av_loss += loss.mean()

        # Copy new weights into old policy
        self.policy_old.load_state_dict(self.policy.state_dict())
        # clear buffer
        self.buffer.clear()
        self.decay_action_std(self.action_std_decay_rate, self.min_action_std)
        self.iter_count += 1
        # Write new values for tensorboard
        self.writer.add_scalar("train/loss", av_loss / iterations, self.iter_count)
        self.writer.add_scalar(
            "train/avg_value", av_state_values / iterations, self.iter_count
        )
        self.writer.add_scalar("train/max_value", max_state_value, self.iter_count)
        if self.save_every > 0 and self.iter_count % self.save_every == 0:
            self.save(filename=self.model_name, directory=self.save_directory)

    def prepare_state(self, latest_scan, distance, cos, sin, collision, goal, action):
        """
        Convert raw sensor and navigation data into a normalized state vector for the policy.

        Args:
            latest_scan (list[float]): LIDAR scan data.
            distance (float): Distance to the goal.
            cos (float): Cosine of angle to the goal.
            sin (float): Sine of angle to the goal.
            collision (bool): Whether the robot has collided.
            goal (bool): Whether the robot has reached the goal.
            action (tuple[float, float]): Last action taken (linear and angular velocities).

        Returns:
            (tuple[list[float], int]): Processed state vector and terminal flag (1 if terminal, else 0).
        """
        latest_scan = np.array(latest_scan)

        inf_mask = np.isinf(latest_scan)
        latest_scan[inf_mask] = 7.0

        max_bins = self.state_dim - 5
        bin_size = int(np.ceil(len(latest_scan) / max_bins))

        # Initialize the list to store the minimum values of each bin
        min_values = []

        # Loop through the data and create bins
        for i in range(0, len(latest_scan), bin_size):
            # Get the current bin
            bin = latest_scan[i : i + min(bin_size, len(latest_scan) - i)]
            # Find the minimum value in the current bin and append it to the min_values list
            min_values.append(min(bin) / 7)

        # Normalize to [0, 1] range
        distance /= 10
        lin_vel = action[0] * 2
        ang_vel = (action[1] + 1) / 2
        state = min_values + [distance, cos, sin] + [lin_vel, ang_vel]

        assert len(state) == self.state_dim
        terminal = 1 if collision or goal else 0

        return state, terminal

    def save(self, filename, directory):
        """
        Save the current policy model to the specified directory.

        Args:
            filename (str): Base name of the model file.
            directory (Path): Directory to save the model to.
        """
        Path(directory).mkdir(parents=True, exist_ok=True)
        torch.save(
            self.policy_old.state_dict(), "%s/%s_policy.pth" % (directory, filename)
        )

    def load(self, filename, directory):
        """
        Load the policy model from a saved checkpoint.

        Args:
            filename (str): Base name of the model file.
            directory (Path): Directory to load the model from.
        """
        self.policy_old.load_state_dict(
            torch.load(
                "%s/%s_policy.pth" % (directory, filename),
                map_location=lambda storage, loc: storage,
            )
        )
        self.policy.load_state_dict(
            torch.load(
                "%s/%s_policy.pth" % (directory, filename),
                map_location=lambda storage, loc: storage,
            )
        )
        print(f"Loaded weights from: {directory}")


================================================
FILE: robot_nav/models/PPO/__init__.py
================================================


================================================
FILE: robot_nav/models/RCPG/RCPG.py
================================================
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from numpy import inf
from torch.utils.tensorboard import SummaryWriter


class Actor(nn.Module):
    """
    Actor network that outputs continuous actions for a given state input.

    Architecture:
        - Processes 1D laser scan inputs through 3 convolutional layers.
        - Embeds goal and previous action inputs using fully connected layers.
        - Combines all features and passes them through an RNN (GRU, LSTM, or RNN).
        - Outputs action values via a fully connected feedforward head with Tanh activation.

    Parameters
    ----------
    action_dim : int
        Dimensionality of the action space.
    rnn : str, optional
        Type of RNN layer to use ("lstm", "gru", or "rnn").
    """

    def __init__(self, action_dim, rnn="gru"):
        super(Actor, self).__init__()
        assert rnn in ["lstm", "gru", "rnn"], "Unsupported rnn type"

        self.cnn1 = nn.Conv1d(1, 4, kernel_size=8, stride=4)
        self.cnn2 = nn.Conv1d(4, 8, kernel_size=8, stride=4)
        self.cnn3 = nn.Conv1d(8, 4, kernel_size=4, stride=2)

        self.goal_embed = nn.Linear(3, 10)
        self.action_embed = nn.Linear(2, 10)

        if rnn == "lstm":
            self.rnn = nn.LSTM(
                input_size=36, hidden_size=36, num_layers=1, batch_first=True
            )
        elif rnn == "gru":
            self.rnn = nn.GRU(
                input_size=36, hidden_size=36, num_layers=1, batch_first=True
            )
        else:
            self.rnn = nn.RNN(
                input_size=36, hidden_size=36, num_layers=1, batch_first=True
            )

        self.layer_1 = nn.Linear(36, 400)
        torch.nn.init.kaiming_uniform_(self.layer_1.weight, nonlinearity="leaky_relu")
        self.layer_2 = nn.Linear(400, 300)
        torch.nn.init.kaiming_uniform_(self.layer_2.weight, nonlinearity="leaky_relu")
        self.layer_3 = nn.Linear(300, action_dim)
        self.tanh = nn.Tanh()

    def forward(self, s):
        if len(s.shape) == 2:
            s = s.unsqueeze(0)

        batch_n, hist_n, state_n = s.shape
        s = s.reshape(batch_n * hist_n, state_n)

        laser = s[:, :-5]
        goal = s[:, -5:-2]
        act = s[:, -2:]
        laser = laser.unsqueeze(1)

        l = F.leaky_relu(self.cnn1(laser))
        l = F.leaky_relu(self.cnn2(l))
        l = F.leaky_relu(self.cnn3(l))
        l = l.flatten(start_dim=1)

        g = F.leaky_relu(self.goal_embed(goal))

        a = F.leaky_relu(self.action_embed(act))

        s = torch.concat((l, g, a), dim=-1)

        s = s.reshape(batch_n, hist_n, -1)
        output, _ = self.rnn(s)
        last_output = output[:, -1, :]
        s = F.leaky_relu(self.layer_1(last_output))
        s = F.leaky_relu(self.layer_2(s))
        a = self.tanh(self.layer_3(s))
        return a


class Critic(nn.Module):
    """
    Critic network that estimates Q-values for state-action pairs.

    Architecture:
        - Processes the same input as the Actor (laser scan, goal, and previous action).
        - Uses two separate Q-networks (double Q-learning) for stability.
        - Each Q-network receives both the RNN-processed state and current action.

    Parameters
    ----------
    action_dim : int
        Dimensionality of the action space.
    rnn : str, optional
        Type of RNN layer to use ("lstm", "gru", or "rnn").
    """

    def __init__(self, action_dim, rnn="gru"):
        super(Critic, self).__init__()
        assert rnn in ["lstm", "gru", "rnn"], "Unsupported rnn type"

        self.cnn1 = nn.Conv1d(1, 4, kernel_size=8, stride=4)
        self.cnn2 = nn.Conv1d(4, 8, kernel_size=8, stride=4)
        self.cnn3 = nn.Conv1d(8, 4, kernel_size=4, stride=2)

        self.goal_embed = nn.Linear(3, 10)
        self.action_embed = nn.Linear(2, 10)

        if rnn == "lstm":
            self.rnn = nn.LSTM(
                input_size=36, hidden_size=36, num_layers=1, batch_first=True
            )
        elif rnn == "gru":
            self.rnn = nn.GRU(
                input_size=36, hidden_size=36, num_layers=1, batch_first=True
            )
        else:
            self.rnn = nn.RNN(
                input_size=36, hidden_size=36, num_layers=1, batch_first=True
            )

        self.layer_1 = nn.Linear(36, 40

Download .txt

gitextract_12_7tajc/

├── .devin/
│   └── wiki.json
├── .github/
│   └── workflows/
│       └── main.yml
├── .gitignore
├── README.md
├── docs/
│   ├── api/
│   │   ├── IR-SIM/
│   │   │   ├── ir-marl-sim.md
│   │   │   └── ir-sim.md
│   │   ├── Testing/
│   │   │   ├── test.md
│   │   │   └── testrnn.md
│   │   ├── Training/
│   │   │   ├── train.md
│   │   │   └── trainrnn.md
│   │   ├── Utils/
│   │   │   ├── replay_buffer.md
│   │   │   └── utils.md
│   │   ├── models/
│   │   │   ├── DDPG.md
│   │   │   ├── HCM.md
│   │   │   ├── MARL/
│   │   │   │   ├── Attention.md
│   │   │   │   └── TD3.md
│   │   │   ├── PPO.md
│   │   │   ├── RCPG.md
│   │   │   ├── SAC.md
│   │   │   ├── TD3.md
│   │   │   ├── __init__.md
│   │   │   └── cnntd3.md
│   │   └── robot_nav.md
│   └── index.md
├── mkdocs.yml
├── pyproject.toml
├── robot_nav/
│   ├── SIM_ENV/
│   │   ├── __init__.py
│   │   ├── marl_sim.py
│   │   ├── sim.py
│   │   └── sim_env.py
│   ├── __init__.py
│   ├── assets/
│   │   └── data.yml
│   ├── eval_points.yaml
│   ├── marl_test_random.py
│   ├── marl_test_single.py
│   ├── marl_train.py
│   ├── models/
│   │   ├── CNNTD3/
│   │   │   ├── CNNTD3.py
│   │   │   └── __init__.py
│   │   ├── DDPG/
│   │   │   ├── DDPG.py
│   │   │   └── __init__.py
│   │   ├── HCM/
│   │   │   ├── __init__.py
│   │   │   └── hardcoded_model.py
│   │   ├── MARL/
│   │   │   ├── Attention/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── g2anet.py
│   │   │   │   └── iga.py
│   │   │   ├── __init__.py
│   │   │   └── marlTD3/
│   │   │       ├── __init__.py
│   │   │       └── marlTD3.py
│   │   ├── PPO/
│   │   │   ├── PPO.py
│   │   │   └── __init__.py
│   │   ├── RCPG/
│   │   │   ├── RCPG.py
│   │   │   └── __init__.py
│   │   ├── SAC/
│   │   │   ├── SAC.py
│   │   │   ├── SAC_actor.py
│   │   │   ├── SAC_critic.py
│   │   │   ├── SAC_utils.py
│   │   │   └── __init__.py
│   │   ├── TD3/
│   │   │   ├── TD3.py
│   │   │   └── __init__.py
│   │   └── __init__.py
│   ├── replay_buffer.py
│   ├── rl_test.py
│   ├── rl_test_random.py
│   ├── rl_train.py
│   ├── rnn_test.py
│   ├── rnn_train.py
│   ├── rvo_test_random.py
│   ├── rvo_test_single.py
│   ├── utils.py
│   └── worlds/
│       ├── circle_world.yaml
│       ├── cross_world.yaml
│       ├── eval_world.yaml
│       ├── multi_robot_world.yaml
│       └── robot_world.yaml
└── tests/
    ├── __init__.py
    ├── test_data.yml
    ├── test_marl_world.yaml
    ├── test_model.py
    ├── test_sim.py
    ├── test_utils.py
    └── test_world.yaml

Download .txt

SYMBOL INDEX (224 symbols across 31 files)

FILE: robot_nav/SIM_ENV/marl_sim.py
  class MARL_SIM (line 9) | class MARL_SIM(SIM_ENV):
    method __init__ (line 24) | def __init__(
    method step (line 55) | def step(self, action, connection, combined_weights=None):
    method reset (line 207) | def reset(
    method get_reward (line 314) | def get_reward(goal, collision, action, closest_robots, distance, phas...

FILE: robot_nav/SIM_ENV/sim.py
  class SIM (line 8) | class SIM(SIM_ENV):
    method __init__ (line 20) | def __init__(self, world_file="robot_world.yaml", disable_plotting=Fal...
    method step (line 35) | def step(self, lin_velocity=0.0, ang_velocity=0.1):
    method reset (line 68) | def reset(
    method get_reward (line 124) | def get_reward(goal, collision, action, laser_scan):

FILE: robot_nav/SIM_ENV/sim_env.py
  class SIM_ENV (line 5) | class SIM_ENV(ABC):
    method step (line 7) | def step(self):
    method reset (line 11) | def reset(self):
    method cossin (line 15) | def cossin(vec1, vec2):
    method get_reward (line 34) | def get_reward():

FILE: robot_nav/marl_test_random.py
  function outside_of_bounds (line 13) | def outside_of_bounds(poses):
  function main (line 33) | def main(args=None):

FILE: robot_nav/marl_test_single.py
  class SINGLE_SIM (line 14) | class SINGLE_SIM(SIM_ENV):
    method __init__ (line 29) | def __init__(
    method step (line 51) | def step(self, action, connection, combined_weights=None):
    method reset (line 177) | def reset(
    method get_reward (line 231) | def get_reward(goal, collision, action, closest_robots, distance, phas...
  function outside_of_bounds (line 279) | def outside_of_bounds(poses):
  function main (line 299) | def main(args=None):

FILE: robot_nav/marl_train.py
  function outside_of_bounds (line 11) | def outside_of_bounds(poses):
  function main (line 31) | def main(args=None):

FILE: robot_nav/models/CNNTD3/CNNTD3.py
  class Actor (line 13) | class Actor(nn.Module):
    method __init__ (line 32) | def __init__(self, action_dim):
    method forward (line 49) | def forward(self, s):
  class Critic (line 85) | class Critic(nn.Module):
    method __init__ (line 104) | def __init__(self, action_dim):
    method forward (line 131) | def forward(self, s, action):
  class CNNTD3 (line 180) | class CNNTD3(object):
    method __init__ (line 205) | def __init__(
    method get_action (line 246) | def get_action(self, obs, add_noise):
    method act (line 264) | def act(self, state):
    method train (line 279) | def train(
    method save (line 418) | def save(self, filename, directory):
    method load (line 438) | def load(self, filename, directory):
    method prepare_state (line 460) | def prepare_state(self, latest_scan, distance, cos, sin, collision, go...

FILE: robot_nav/models/DDPG/DDPG.py
  class Actor (line 13) | class Actor(nn.Module):
    method __init__ (line 31) | def __init__(self, state_dim, action_dim):
    method forward (line 41) | def forward(self, s):
  class Critic (line 57) | class Critic(nn.Module):
    method __init__ (line 77) | def __init__(self, state_dim, action_dim):
    method forward (line 89) | def forward(self, s, a):
  class DDPG (line 112) | class DDPG(object):
    method __init__ (line 135) | def __init__(
    method get_action (line 176) | def get_action(self, obs, add_noise):
    method act (line 194) | def act(self, state):
    method train (line 208) | def train(
    method save (line 344) | def save(self, filename, directory):
    method load (line 364) | def load(self, filename, directory):
    method prepare_state (line 386) | def prepare_state(self, latest_scan, distance, cos, sin, collision, go...

FILE: robot_nav/models/HCM/hardcoded_model.py
  class HCM (line 9) | class HCM(object):
    method __init__ (line 27) | def __init__(
    method get_action (line 53) | def get_action(self, state, add_noise):
    method train (line 88) | def train(
    method save (line 117) | def save(self, filename, directory):
    method load (line 130) | def load(self, filename, directory):
    method prepare_state (line 143) | def prepare_state(self, latest_scan, distance, cos, sin, collision, go...

FILE: robot_nav/models/MARL/Attention/g2anet.py
  class G2ANet (line 7) | class G2ANet(nn.Module):
    method __init__ (line 29) | def __init__(self, embedding_dim):
    method encode_agent_features (line 52) | def encode_agent_features(self, embed):
    method forward (line 66) | def forward(self, embedding):

FILE: robot_nav/models/MARL/Attention/iga.py
  function entropy_from_attention (line 9) | def entropy_from_attention(attn_weights, target_nodes, num_nodes, eps=1e...
  class GoalAttentionLayer (line 48) | class GoalAttentionLayer(MessagePassing):
    method __init__ (line 62) | def __init__(self, node_dim, edge_dim, out_dim):
    method forward (line 74) | def forward(self, x, edge_index, edge_attr):
    method message (line 94) | def message(self, x_i, edge_attr, index, ptr, size_i):
    method aggregate (line 116) | def aggregate(self, inputs, index, ptr=None, dim_size=None):
  class Attention (line 132) | class Attention(nn.Module):
    method __init__ (line 158) | def __init__(self, embedding_dim):
    method encode_agent_features (line 210) | def encode_agent_features(self, embed):
    method forward (line 224) | def forward(self, embedding):

FILE: robot_nav/models/MARL/marlTD3/marlTD3.py
  class Actor (line 14) | class Actor(nn.Module):
    method __init__ (line 32) | def __init__(self, action_dim, embedding_dim, attention):
    method forward (line 51) | def forward(self, obs, detach_attn=False):
  class Critic (line 79) | class Critic(nn.Module):
    method __init__ (line 97) | def __init__(self, action_dim, embedding_dim, attention):
    method forward (line 131) | def forward(self, embedding, action):
  class TD3 (line 171) | class TD3(object):
    method __init__ (line 208) | def __init__(
    method get_action (line 268) | def get_action(self, obs, add_noise):
    method act (line 290) | def act(self, state):
    method train (line 310) | def train(
    method save (line 531) | def save(self, filename, directory):
    method load (line 551) | def load(self, filename, directory):
    method prepare_state (line 573) | def prepare_state(

FILE: robot_nav/models/PPO/PPO.py
  class RolloutBuffer (line 10) | class RolloutBuffer:
    method __init__ (line 23) | def __init__(self):
    method clear (line 34) | def clear(self):
    method add (line 45) | def add(self, state, action, reward, terminal, next_state):
  class ActorCritic (line 61) | class ActorCritic(nn.Module):
    method __init__ (line 73) | def __init__(self, state_dim, action_dim, action_std_init, max_action,...
    method set_action_std (line 111) | def set_action_std(self, new_action_std):
    method forward (line 122) | def forward(self):
    method act (line 131) | def act(self, state, sample):
    method evaluate (line 157) | def evaluate(self, state, action):
  class PPO (line 185) | class PPO:
    method __init__ (line 210) | def __init__(
    method set_action_std (line 264) | def set_action_std(self, new_action_std):
    method decay_action_std (line 275) | def decay_action_std(self, action_std_decay_rate, min_action_std):
    method get_action (line 300) | def get_action(self, state, add_noise):
    method train (line 324) | def train(self, replay_buffer, iterations, batch_size):
    method prepare_state (line 423) | def prepare_state(self, latest_scan, distance, cos, sin, collision, go...
    method save (line 468) | def save(self, filename, directory):
    method load (line 481) | def load(self, filename, directory):

FILE: robot_nav/models/RCPG/RCPG.py
  class Actor (line 11) | class Actor(nn.Module):
    method __init__ (line 29) | def __init__(self, action_dim, rnn="gru"):
    method forward (line 60) | def forward(self, s):
  class Critic (line 92) | class Critic(nn.Module):
    method __init__ (line 109) | def __init__(self, action_dim, rnn="gru"):
    method forward (line 151) | def forward(self, s, action):
  class RCPG (line 194) | class RCPG(object):
    method __init__ (line 227) | def __init__(
    method get_action (line 265) | def get_action(self, obs, add_noise):
    method act (line 288) | def act(self, state):
    method train (line 307) | def train(
    method save (line 426) | def save(self, filename, directory):
    method load (line 449) | def load(self, filename, directory):
    method prepare_state (line 474) | def prepare_state(self, latest_scan, distance, cos, sin, collision, go...

FILE: robot_nav/models/SAC/SAC.py
  class SAC (line 13) | class SAC(object):
    method __init__ (line 45) | def __init__(
    method save (line 146) | def save(self, filename, directory):
    method load (line 162) | def load(self, filename, directory):
    method train (line 181) | def train(self, replay_buffer, iterations, batch_size):
    method alpha (line 205) | def alpha(self):
    method get_action (line 212) | def get_action(self, obs, add_noise):
    method act (line 230) | def act(self, obs, sample=False):
    method update_critic (line 249) | def update_critic(self, obs, action, reward, next_obs, done, step):
    method update_actor_and_alpha (line 284) | def update_actor_and_alpha(self, obs, step):
    method update (line 329) | def update(self, replay_buffer, step, batch_size):
    method prepare_state (line 364) | def prepare_state(self, latest_scan, distance, cos, sin, collision, go...

FILE: robot_nav/models/SAC/SAC_actor.py
  class TanhTransform (line 10) | class TanhTransform(pyd.transforms.Transform):
    method __init__ (line 29) | def __init__(self, cache_size=1):
    method atanh (line 39) | def atanh(x):
    method __eq__ (line 51) | def __eq__(self, other):
    method _call (line 60) | def _call(self, x):
    method _inverse (line 72) | def _inverse(self, y):
    method log_abs_det_jacobian (line 86) | def log_abs_det_jacobian(self, x, y):
  class SquashedNormal (line 102) | class SquashedNormal(pyd.transformed_distribution.TransformedDistribution):
    method __init__ (line 109) | def __init__(self, loc, scale):
    method mean (line 125) | def mean(self):
  class DiagGaussianActor (line 138) | class DiagGaussianActor(nn.Module):
    method __init__ (line 153) | def __init__(self, obs_dim, action_dim, hidden_dim, hidden_depth, log_...
    method forward (line 165) | def forward(self, obs):
    method log (line 190) | def log(self, writer, step):

FILE: robot_nav/models/SAC/SAC_critic.py
  class DoubleQCritic (line 7) | class DoubleQCritic(nn.Module):
    method __init__ (line 21) | def __init__(self, obs_dim, action_dim, hidden_dim, hidden_depth):
    method forward (line 34) | def forward(self, obs, action):
    method log (line 56) | def log(self, writer, step):

FILE: robot_nav/models/SAC/SAC_utils.py
  function soft_update_params (line 12) | def soft_update_params(net, target_net, tau):
  function set_seed_everywhere (line 26) | def set_seed_everywhere(seed):
  function make_dir (line 40) | def make_dir(*path_parts):
  function weight_init (line 58) | def weight_init(m):
  class MLP (line 73) | class MLP(nn.Module):
    method __init__ (line 85) | def __init__(
    method forward (line 92) | def forward(self, x):
  function mlp (line 105) | def mlp(input_dim, hidden_dim, output_dim, hidden_depth, output_mod=None):
  function to_np (line 132) | def to_np(t):

FILE: robot_nav/models/TD3/TD3.py
  class Actor (line 13) | class Actor(nn.Module):
    method __init__ (line 31) | def __init__(self, state_dim, action_dim):
    method forward (line 41) | def forward(self, s):
  class Critic (line 57) | class Critic(nn.Module):
    method __init__ (line 75) | def __init__(self, state_dim, action_dim):
    method forward (line 96) | def forward(self, s, a):
  class TD3 (line 128) | class TD3(object):
    method __init__ (line 129) | def __init__(
    method get_action (line 190) | def get_action(self, obs, add_noise):
    method act (line 208) | def act(self, state):
    method train (line 222) | def train(
    method save (line 361) | def save(self, filename, directory):
    method load (line 381) | def load(self, filename, directory):
    method prepare_state (line 403) | def prepare_state(self, latest_scan, distance, cos, sin, collision, go...

FILE: robot_nav/replay_buffer.py
  class ReplayBuffer (line 8) | class ReplayBuffer(object):
    method __init__ (line 16) | def __init__(self, buffer_size, random_seed=123):
    method add (line 29) | def add(self, s, a, r, t, s2):
    method size (line 48) | def size(self):
    method sample_batch (line 57) | def sample_batch(self, batch_size):
    method return_buffer (line 80) | def return_buffer(self):
    method clear (line 95) | def clear(self):
  class RolloutReplayBuffer (line 103) | class RolloutReplayBuffer(object):
    method __init__ (line 110) | def __init__(self, buffer_size, random_seed=123, history_len=10):
    method add (line 126) | def add(self, s, a, r, t, s2):
    method size (line 147) | def size(self):
    method sample_batch (line 156) | def sample_batch(self, batch_size):
    method clear (line 216) | def clear(self):

FILE: robot_nav/rl_test.py
  function main (line 9) | def main(args=None):

FILE: robot_nav/rl_test_random.py
  function main (line 11) | def main(args=None):

FILE: robot_nav/rl_train.py
  function main (line 9) | def main(args=None):
  function evaluate (line 106) | def evaluate(model, epoch, sim, eval_episodes=10):

FILE: robot_nav/rnn_test.py
  function main (line 11) | def main(args=None):

FILE: robot_nav/rnn_train.py
  function main (line 12) | def main(args=None):
  function evaluate (line 126) | def evaluate(model, epoch, sim, history_len, max_steps, eval_episodes=10):

FILE: robot_nav/rvo_test_random.py
  class RVO_RANDOM_SIM (line 17) | class RVO_RANDOM_SIM(SIM_ENV):
    method __init__ (line 33) | def __init__(self, world_file="multi_robot_world.yaml", disable_plotti...
    method step (line 53) | def step(self):
    method reset (line 95) | def reset(
    method get_reward (line 180) | def get_reward(self):
  function outside_of_bounds (line 194) | def outside_of_bounds(poses):
  function main (line 216) | def main(args=None):

FILE: robot_nav/rvo_test_single.py
  class RVO_SINGLE_SIM (line 14) | class RVO_SINGLE_SIM(SIM_ENV):
    method __init__ (line 30) | def __init__(self, world_file="multi_robot_world.yaml", disable_plotti...
    method step (line 50) | def step(self):
    method reset (line 81) | def reset(
    method get_reward (line 113) | def get_reward(self):
  function outside_of_bounds (line 126) | def outside_of_bounds(poses):
  function main (line 148) | def main(args=None):

FILE: robot_nav/utils.py
  class Pretraining (line 11) | class Pretraining:
    method __init__ (line 22) | def __init__(
    method load_buffer (line 34) | def load_buffer(self):
    method train (line 88) | def train(
  function get_buffer (line 114) | def get_buffer(
  function get_max_bound (line 182) | def get_max_bound(

FILE: tests/test_model.py
  function test_models (line 29) | def test_models(model, state_dim):
  function test_max_bound_models (line 68) | def test_max_bound_models(model, state_dim):
  function test_marl_models (line 100) | def test_marl_models():

FILE: tests/test_sim.py
  function test_sim (line 16) | def test_sim():
  function test_marl_sim (line 33) | def test_marl_sim():
  function test_sincos (line 56) | def test_sincos():

FILE: tests/test_utils.py
  function test_buffer (line 12) | def test_buffer():
  function test_rollout_buffer (line 51) | def test_rollout_buffer():
  function test_ppo_buffer (line 90) | def test_ppo_buffer():

Download .json

Condensed preview — 81 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (4,750K chars).

[
  {
    "path": ".devin/wiki.json",
    "chars": 8535,
    "preview": "{\n  \"repo_notes\": [\n    {\n      \"content\": \"\"\n    }\n  ],\n  \"pages\": [\n    {\n      \"title\": \"Overview\",\n      \"purpose\": "
  },
  {
    "path": ".github/workflows/main.yml",
    "chars": 602,
    "preview": "name: Run Tests\n\non:\n  push:\n    branches: [master]\n  pull_request:\n    branches: [master]\n\njobs:\n  test:\n    runs-on: u"
  },
  {
    "path": ".gitignore",
    "chars": 22,
    "preview": "/.venv/\n/runs/\n/site/\n"
  },
  {
    "path": "README.md",
    "chars": 4226,
    "preview": "[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/reiniscimurs/DRL-robot-navigation-IR-SIM)\n\n**DRL "
  },
  {
    "path": "docs/api/IR-SIM/ir-marl-sim.md",
    "chars": 112,
    "preview": "# MARL-IR-SIM\n\n::: robot_nav.SIM_ENV.marl_sim\n    options:\n      show_root_heading: true\n      show_source: true"
  },
  {
    "path": "docs/api/IR-SIM/ir-sim.md",
    "chars": 102,
    "preview": "# IR-SIM\n\n::: robot_nav.SIM_ENV.sim\n    options:\n      show_root_heading: true\n      show_source: true"
  },
  {
    "path": "docs/api/Testing/test.md",
    "chars": 190,
    "preview": "# Testing\n\n::: robot_nav.test\n    options:\n      show_root_heading: true\n      show_source: true\n\n::: robot_nav.test_ran"
  },
  {
    "path": "docs/api/Testing/testrnn.md",
    "chars": 104,
    "preview": "# Testing RNN\n\n::: robot_nav.test_rnn\n    options:\n      show_root_heading: true\n      show_source: true"
  },
  {
    "path": "docs/api/Training/train.md",
    "chars": 98,
    "preview": "# Training\n\n::: robot_nav.train\n    options:\n      show_root_heading: true\n      show_source: true"
  },
  {
    "path": "docs/api/Training/trainrnn.md",
    "chars": 106,
    "preview": "# Training RNN\n\n::: robot_nav.train_rnn\n    options:\n      show_root_heading: true\n      show_source: true"
  },
  {
    "path": "docs/api/Utils/replay_buffer.md",
    "chars": 119,
    "preview": "# Replay/Rollout Buffer\n\n::: robot_nav.replay_buffer\n    options:\n      show_root_heading: true\n      show_source: true"
  },
  {
    "path": "docs/api/Utils/utils.md",
    "chars": 95,
    "preview": "# Utils\n\n::: robot_nav.utils\n    options:\n      show_root_heading: true\n      show_source: true"
  },
  {
    "path": "docs/api/models/DDPG.md",
    "chars": 105,
    "preview": "# DDPG\n\n::: robot_nav.models.DDPG.DDPG\n    options:\n      show_root_heading: true\n      show_source: true"
  },
  {
    "path": "docs/api/models/HCM.md",
    "chars": 127,
    "preview": "# Hardcoded Model\n\n::: robot_nav.models.HCM.hardcoded_model\n    options:\n      show_root_heading: true\n      show_source"
  },
  {
    "path": "docs/api/models/MARL/Attention.md",
    "chars": 134,
    "preview": "# Hard-Soft Attention\n\n::: robot_nav.models.MARL.hardsoftAttention\n    options:\n      show_root_heading: true\n      show"
  },
  {
    "path": "docs/api/models/MARL/TD3.md",
    "chars": 113,
    "preview": "# MARL TD3\n\n::: robot_nav.models.MARL.marlTD3\n    options:\n      show_root_heading: true\n      show_source: true\n"
  },
  {
    "path": "docs/api/models/PPO.md",
    "chars": 103,
    "preview": "# PPO\n\n::: robot_nav.models.PPO.PPO\n    options:\n      show_root_heading: true\n      show_source: true\n"
  },
  {
    "path": "docs/api/models/RCPG.md",
    "chars": 106,
    "preview": "# RCPG\n\n::: robot_nav.models.RCPG.RCPG\n    options:\n      show_root_heading: true\n      show_source: true\n"
  },
  {
    "path": "docs/api/models/SAC.md",
    "chars": 413,
    "preview": "# SAC\n\n::: robot_nav.models.SAC.SAC\n    options:\n      show_root_heading: true\n      show_source: true\n\n::: robot_nav.mo"
  },
  {
    "path": "docs/api/models/TD3.md",
    "chars": 103,
    "preview": "# TD3\n\n::: robot_nav.models.TD3.TD3\n    options:\n      show_root_heading: true\n      show_source: true\n"
  },
  {
    "path": "docs/api/models/__init__.md",
    "chars": 79,
    "preview": "# __init__\n\n## Documentation\n\n```\nNo documentation available for __init__.\n```\n"
  },
  {
    "path": "docs/api/models/cnntd3.md",
    "chars": 111,
    "preview": "# CNNTD3\n\n::: robot_nav.models.CNNTD3.CNNTD3\n    options:\n      show_root_heading: true\n      show_source: true"
  },
  {
    "path": "docs/api/robot_nav.md",
    "chars": 117,
    "preview": "# Robot Navigation\n\nWill be updated\n\n::: robot_nav\n    options:\n      show_root_heading: true\n      show_source: true"
  },
  {
    "path": "docs/index.md",
    "chars": 3100,
    "preview": "# Welcome to DRL-robot-navigation-IR-SIM\n\n**DRL Robot navigation in IR-SIM**\n\nDeep Reinforcement Learning algorithm impl"
  },
  {
    "path": "mkdocs.yml",
    "chars": 883,
    "preview": "extra:\n  python_path:\n  - ./robot_nav\nnav:\n- Home: index.md\n- API Reference:\n    IR-SIM:\n      - SIM: api/IR-SIM/ir-sim\n"
  },
  {
    "path": "pyproject.toml",
    "chars": 699,
    "preview": "[project]\nname = \"robot-nav\"\nversion = \"0.1.0\"\ndescription = \"\"\nauthors = [\n    {name = \"reinis\",email = \"ireinisi@gmail"
  },
  {
    "path": "robot_nav/SIM_ENV/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/SIM_ENV/marl_sim.py",
    "chars": 14332,
    "preview": "import irsim\nimport numpy as np\nimport random\nimport torch\n\nfrom robot_nav.SIM_ENV.sim_env import SIM_ENV\n\n\nclass MARL_S"
  },
  {
    "path": "robot_nav/SIM_ENV/sim.py",
    "chars": 5269,
    "preview": "import irsim\nimport numpy as np\nimport random\n\nfrom robot_nav.SIM_ENV.sim_env import SIM_ENV\n\n\nclass SIM(SIM_ENV):\n    \""
  },
  {
    "path": "robot_nav/SIM_ENV/sim_env.py",
    "chars": 1000,
    "preview": "from abc import ABC, abstractmethod\nimport numpy as np\n\n\nclass SIM_ENV(ABC):\n    @abstractmethod\n    def step(self):\n   "
  },
  {
    "path": "robot_nav/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/eval_points.yaml",
    "chars": 1585,
    "preview": "robot:\n  poses:\n    - [[3], [4], [0], [0]]\n    - [[8], [1], [1], [0]]\n    - [[2], [6], [1], [0]]\n    - [[7], [1], [0], ["
  },
  {
    "path": "robot_nav/marl_test_random.py",
    "chars": 8553,
    "preview": "import statistics\nfrom pathlib import Path\n\nfrom tqdm import tqdm\nimport matplotlib.pyplot as plt\nfrom robot_nav.models."
  },
  {
    "path": "robot_nav/marl_test_single.py",
    "chars": 15962,
    "preview": "import statistics\nfrom pathlib import Path\n\nimport irsim\nfrom tqdm import tqdm\n\nfrom robot_nav.SIM_ENV.sim_env import SI"
  },
  {
    "path": "robot_nav/marl_train.py",
    "chars": 5936,
    "preview": "from pathlib import Path\n\nfrom robot_nav.models.MARL.marlTD3.marlTD3 import TD3\n\nimport torch\nimport numpy as np\nfrom ro"
  },
  {
    "path": "robot_nav/models/CNNTD3/CNNTD3.py",
    "chars": 19807,
    "preview": "from pathlib import Path\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom num"
  },
  {
    "path": "robot_nav/models/CNNTD3/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/models/DDPG/DDPG.py",
    "chars": 17073,
    "preview": "from pathlib import Path\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom num"
  },
  {
    "path": "robot_nav/models/DDPG/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/models/HCM/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/models/HCM/hardcoded_model.py",
    "chars": 7291,
    "preview": "from math import atan2\n\nimport numpy as np\nfrom numpy import clip\nfrom torch.utils.tensorboard import SummaryWriter\nimpo"
  },
  {
    "path": "robot_nav/models/MARL/Attention/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/models/MARL/Attention/g2anet.py",
    "chars": 11028,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport numpy as np\n\n\nclass G2ANet(nn.Module):\n    \"\"\""
  },
  {
    "path": "robot_nav/models/MARL/Attention/iga.py",
    "chars": 17359,
    "preview": "import numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch_geometric.nn import Mes"
  },
  {
    "path": "robot_nav/models/MARL/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/models/MARL/marlTD3/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/models/MARL/marlTD3/marlTD3.py",
    "chars": 24747,
    "preview": "from pathlib import Path\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom num"
  },
  {
    "path": "robot_nav/models/PPO/PPO.py",
    "chars": 18338,
    "preview": "import torch\nimport torch.nn as nn\nfrom torch.distributions import MultivariateNormal\nimport numpy as np\nfrom torch.util"
  },
  {
    "path": "robot_nav/models/PPO/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/models/RCPG/RCPG.py",
    "chars": 19045,
    "preview": "from pathlib import Path\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom num"
  },
  {
    "path": "robot_nav/models/RCPG/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/models/SAC/SAC.py",
    "chars": 15470,
    "preview": "from pathlib import Path\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom statistics import mean\nim"
  },
  {
    "path": "robot_nav/models/SAC/SAC_actor.py",
    "chars": 5709,
    "preview": "import torch\nimport math\nfrom torch import nn\nimport torch.nn.functional as F\nfrom torch import distributions as pyd\n\nim"
  },
  {
    "path": "robot_nav/models/SAC/SAC_critic.py",
    "chars": 2135,
    "preview": "import torch\nfrom torch import nn\n\nimport robot_nav.models.SAC.SAC_utils as utils\n\n\nclass DoubleQCritic(nn.Module):\n    "
  },
  {
    "path": "robot_nav/models/SAC/SAC_utils.py",
    "chars": 3955,
    "preview": "import numpy as np\nimport torch\nfrom torch import nn\nfrom torch import distributions as pyd\nimport torch.nn.functional a"
  },
  {
    "path": "robot_nav/models/SAC/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/models/TD3/TD3.py",
    "chars": 18294,
    "preview": "from pathlib import Path\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom num"
  },
  {
    "path": "robot_nav/models/TD3/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/models/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "robot_nav/replay_buffer.py",
    "chars": 6844,
    "preview": "import random\nfrom collections import deque\nimport itertools\n\nimport numpy as np\n\n\nclass ReplayBuffer(object):\n    \"\"\"\n "
  },
  {
    "path": "robot_nav/rl_test.py",
    "chars": 3191,
    "preview": "from models.TD3.TD3 import TD3\n\nimport torch\nimport numpy as np\nfrom robot_nav.SIM_ENV.sim import SIM\nimport yaml\n\n\ndef "
  },
  {
    "path": "robot_nav/rl_test_random.py",
    "chars": 6460,
    "preview": "from robot_nav.models.SAC.SAC import SAC\nimport statistics\nimport numpy as np\nimport tqdm\nimport matplotlib.pyplot as pl"
  },
  {
    "path": "robot_nav/rl_train.py",
    "chars": 5557,
    "preview": "from robot_nav.models.CNNTD3.CNNTD3 import CNNTD3\n\nimport torch\nimport numpy as np\nfrom robot_nav.SIM_ENV.sim import SIM"
  },
  {
    "path": "robot_nav/rnn_test.py",
    "chars": 3494,
    "preview": "from collections import deque\n\nfrom robot_nav.models.RCPG.RCPG import RCPG\n\nimport torch\nimport numpy as np\nfrom robot_n"
  },
  {
    "path": "robot_nav/rnn_train.py",
    "chars": 6314,
    "preview": "from robot_nav.models.RCPG.RCPG import RCPG\nfrom collections import deque\n\nimport torch\nimport numpy as np\n\n\nfrom robot_"
  },
  {
    "path": "robot_nav/rvo_test_random.py",
    "chars": 10237,
    "preview": "import statistics\n\nfrom tqdm import tqdm\n\nimport torch\nimport numpy as np\n\n\nimport irsim\nimport numpy as np\nimport rando"
  },
  {
    "path": "robot_nav/rvo_test_single.py",
    "chars": 8160,
    "preview": "import statistics\n\nfrom tqdm import tqdm\n\n\nimport irsim\nimport numpy as np\nimport random\nimport torch\n\nfrom robot_nav.SI"
  },
  {
    "path": "robot_nav/utils.py",
    "chars": 9263,
    "preview": "from typing import List\nfrom tqdm import tqdm\nimport yaml\nimport torch\n\nfrom robot_nav.models.RCPG.RCPG import RCPG\nfrom"
  },
  {
    "path": "robot_nav/worlds/circle_world.yaml",
    "chars": 865,
    "preview": "world:\n  height: 12  # the height of the world\n  width: 12   # the height of the world\n  step_time: 0.3  # Calculate eac"
  },
  {
    "path": "robot_nav/worlds/cross_world.yaml",
    "chars": 1247,
    "preview": "world:\n  height: 12  # the height of the world\n  width: 12   # the height of the world\n  step_time: 0.3  # Calculate eac"
  },
  {
    "path": "robot_nav/worlds/eval_world.yaml",
    "chars": 1418,
    "preview": "world:\n  height: 10  # the height of the world\n  width: 10   # the height of the world\n  step_time: 0.3  # Hz calculate "
  },
  {
    "path": "robot_nav/worlds/multi_robot_world.yaml",
    "chars": 753,
    "preview": "world:\n  height: 12  # the height of the world\n  width: 12   # the height of the world\n  step_time: 0.3  # Calculate eac"
  },
  {
    "path": "robot_nav/worlds/robot_world.yaml",
    "chars": 1827,
    "preview": "world:\n  height: 10  # the height of the world\n  width: 10   # the height of the world\n  step_time: 0.3  # 10Hz calculat"
  },
  {
    "path": "tests/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tests/test_data.yml",
    "chars": 4214798,
    "preview": "1:\n  action:\n  - 0.0\n  - 0.0\n  collision: false\n  cos: 0.7071067811865476\n  distance: 9.899494936611665\n  goal: false\n  "
  },
  {
    "path": "tests/test_marl_world.yaml",
    "chars": 627,
    "preview": "world:\n  height: 12  # the height of the world\n  width: 12   # the height of the world\n  step_time: 1.0\n  sample_time: 1"
  },
  {
    "path": "tests/test_model.py",
    "chars": 3578,
    "preview": "from pathlib import Path\n\nimport torch\n\nfrom robot_nav.SIM_ENV.marl_sim import MARL_SIM\nfrom robot_nav.models.RCPG.RCPG "
  },
  {
    "path": "tests/test_sim.py",
    "chars": 1925,
    "preview": "import os\n\nimport pytest\nimport torch\n\nfrom robot_nav.SIM_ENV.marl_sim import MARL_SIM\nfrom robot_nav.SIM_ENV.sim import"
  },
  {
    "path": "tests/test_utils.py",
    "chars": 2601,
    "preview": "from pathlib import Path\n\nfrom robot_nav.models.SAC.SAC import SAC\nfrom robot_nav.models.PPO.PPO import PPO, RolloutBuff"
  },
  {
    "path": "tests/test_world.yaml",
    "chars": 1356,
    "preview": "world:\n  height: 10  # the height of the world\n  width: 10   # the height of the world\n  step_time: 1.0\n  sample_time: 1"
  }
]

// ... and 1 more files (download for full content)

About this extraction

This page contains the full source code of the reiniscimurs/DRL-robot-navigation-IR-SIM GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 81 files (44.5 MB), approximately 1.1M tokens, and a symbol index with 224 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo