Repository: jurgisp/memory-maze
Branch: main
Commit: 4030901cef3b
Files: 14
Total size: 63.2 KB

Directory structure:
gitextract_h395e807/

├── .gitignore
├── LICENSE
├── README.md
├── gui/
│   ├── recording.py
│   ├── requirements.txt
│   └── run_gui.py
├── memory_maze/
│   ├── __init__.py
│   ├── gym_wrappers.py
│   ├── helpers.py
│   ├── maze.py
│   ├── oracle.py
│   ├── tasks.py
│   └── wrappers.py
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
.*
!.gitignore

__pycache__/
*.egg-info

sandbox/
log/

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2022 jurgisp

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
**Status:** Stable release

[![PyPI](https://img.shields.io/pypi/v/memory-maze.svg)](https://pypi.python.org/pypi/memory-maze/#history)

# Memory Maze

Memory Maze is a 3D domain of randomized mazes designed for evaluating the long-term memory abilities of RL agents. Memory Maze isolates long-term memory from confounding challenges, such as exploration, and requires remembering several pieces of information: the positions of objects, the wall layout, and the agent's own position.

| Memory 9x9 | Memory 11x11 | Memory 13x13 | Memory 15x15 |
|------------|--------------|--------------|--------------|
| ![map-9x9](https://user-images.githubusercontent.com/3135115/177040204-fbf3b558-d063-49d3-9973-ae113137782f.png) | ![map-11x11](https://user-images.githubusercontent.com/3135115/177040184-16ccb614-b897-44db-ab2c-7ae66e14c007.png) | ![map-13x13](https://user-images.githubusercontent.com/3135115/177040164-d3edb11f-de6a-4c17-bce2-38e539639f40.png) | ![map-15x15](https://user-images.githubusercontent.com/3135115/177040126-b9a0f861-b15b-492c-9216-89502e8f8ae9.png) |

Key features:
- Online RL memory tasks (with baselines)
- Offline dataset for representation learning (with baselines)
- Verified that memory is the key challenge
- Challenging but solvable by human baseline
- Easy installation via a simple pip command
- Available `gym` and `dm_env` interfaces
- Supports headless and hardware rendering
- Interactive GUI for human players
- Hidden state information for probe evaluation

Also see the accompanying research paper: [Evaluating Long-Term Memory in 3D Mazes](https://arxiv.org/abs/2210.13383)

```
@article{pasukonis2022memmaze,
  title={Evaluating Long-Term Memory in 3D Mazes},
  author={Pasukonis, Jurgis and Lillicrap, Timothy and Hafner, Danijar},
  journal={arXiv preprint arXiv:2210.13383},
  year={2022}
}
```

## Installation

Memory Maze builds on the [`dm_control`](https://github.com/deepmind/dm_control) and [`mujoco`](https://github.com/deepmind/mujoco) packages, which are automatically installed as dependencies:

```sh
pip install memory-maze
```

## Play Yourself

Memory Maze allows you to play the levels in human mode. We used this mode for recording the human baseline scores. These are the instructions for launching the GUI:

```sh
# GUI dependencies
pip install gym pygame pillow imageio

# Launch with standard 64x64 resolution
python gui/run_gui.py

# Launch with higher 256x256 resolution
python gui/run_gui.py --env "memory_maze:MemoryMaze-9x9-HD-v0"
```

## Task Description

The task is based on a game known as scavenger hunt or treasure hunt:
- The agent starts in a randomly generated maze, which contains several objects of different colors.
- The agent is prompted to find the target object of a specific color, indicated by the border color in the observation image.
- Once the agent successfully finds and touches the correct object, it gets a +1 reward and the next random object is chosen as a target.
- If the agent touches an object of the wrong color, there is no effect.
- Throughout the episode, the maze layout and the locations of the objects do not change.
- The episode continues for a fixed amount of time, so the total episode reward equals the number of reached targets.

<p align="center"><img width="256" src="https://user-images.githubusercontent.com/3135115/177040240-847f0f0d-b20b-4652-83c3-a486f6f22c22.gif"></p>

An agent with long-term memory only has to explore each maze once (which is possible in a time much shorter than the length of an episode) and can afterwards follow the shortest path to each requested target, whereas an agent with no memory has to randomly wander through the maze to find each target.
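
To illustrate what "following the shortest path" means once the layout is known, here is a minimal sketch of breadth-first search on a toy grid. The grid encoding (`#` for walls) and the function name `shortest_path_length` are assumptions made for this example; this is not the code used by the environment or its oracle.

```python
from collections import deque


def shortest_path_length(layout, start, goal):
    """Breadth-first search on a grid; returns the number of moves or None.

    `layout` is a list of strings where '#' marks a wall (a hypothetical
    encoding chosen for this example, not the repository's format).
    """
    rows, cols = len(layout), len(layout[0])
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and layout[nr][nc] != '#' and (nr, nc) not in visited):
                visited.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # goal is unreachable


toy_maze = [
    "..#..",
    ".##..",
    ".....",
    "#.##.",
    "...#.",
]
print(shortest_path_length(toy_maze, start=(0, 0), goal=(0, 4)))
```

An agent without memory cannot run such a search, because it never retains the layout it has already seen.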

There are 4 size variations of the maze. The largest maze, 15x15, is designed to be challenging but solvable for humans (see benchmark results below), yet out of reach for state-of-the-art RL methods. The smaller sizes are provided as stepping stones, with 9x9 being solvable with current RL methods.

| Size | env_id | Objects | Episode steps | Mean human score | Mean max score |
|:---------:|-----------------------|:---:|:-----:|:----:|:----:|
| **9x9**   | `MemoryMaze-9x9-v0`   |  3  | 1000  | 26.4 | 34.8 |
| **11x11** | `MemoryMaze-11x11-v0` |  4  | 2000  | 44.3 | 58.0 |
| **13x13** | `MemoryMaze-13x13-v0` |  5  | 3000  | 55.5 | 74.5 |
| **15x15** | `MemoryMaze-15x15-v0` |  6  | 4000  | 67.7 | 87.7 |

The mazes are generated with [labmaze](https://github.com/deepmind/labmaze), the same algorithm as used by [DmLab-30](https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/dmlab30). The 9x9 corresponds to the [small](https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/dmlab30#goal-locations-small) variant and 15x15 corresponds to the [large](https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/dmlab30#goal-locations-large) variant.
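
As a quick worked example, the snippet below computes the ratio of the mean human score to the mean max score for each size, using only the numbers from the table above. The helper name `human_fraction_of_max` is made up for this illustration.

```python
# Values copied from the table above: (mean human score, mean max score).
scores = {
    '9x9': (26.4, 34.8),
    '11x11': (44.3, 58.0),
    '13x13': (55.5, 74.5),
    '15x15': (67.7, 87.7),
}


def human_fraction_of_max(size):
    """Return the mean human score as a fraction of the mean max score."""
    human, maximum = scores[size]
    return human / maximum


for size in scores:
    print(f'{size:>6}: {human_fraction_of_max(size):.1%}')
```

By this measure the human baseline reaches roughly three quarters of the max score at every maze size.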

## Gym Interface

You can create the environment using the [Gym](https://github.com/openai/gym) interface:

```python
# Install the dependency first:  pip install gym
import os

import gym

# Set this before creating the environment if you are getting
# the "Unable to load EGL library" error:
# os.environ['MUJOCO_GL'] = 'glfw'

env = gym.make('memory_maze:MemoryMaze-9x9-v0')
env = gym.make('memory_maze:MemoryMaze-11x11-v0')
env = gym.make('memory_maze:MemoryMaze-13x13-v0')
env = gym.make('memory_maze:MemoryMaze-15x15-v0')
```

**Troubleshooting:** if you are getting the "Unable to load EGL library" error, that is because we enable MuJoCo headless GPU rendering (`MUJOCO_GL=egl`) by default. If you are testing locally on your machine, you can enable windowed rendering instead (`MUJOCO_GL=glfw`). [Read here](https://github.com/deepmind/dm_control#rendering) about the different rendering options.
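
A small sketch of choosing the backend from Python follows. The helper `select_mujoco_gl` is hypothetical; the only facts it relies on are that `memory_maze/__init__.py` falls back to `egl` when `MUJOCO_GL` is unset, and that `egl`, `glfw` and `osmesa` are the values mentioned in this repository.

```python
import os


def select_mujoco_gl(backend):
    """Set MUJOCO_GL unless the user already exported a value."""
    allowed = {'egl', 'glfw', 'osmesa'}
    if backend not in allowed:
        raise ValueError(f'Unknown backend {backend!r}, expected one of {sorted(allowed)}')
    # setdefault keeps an existing value, so an exported variable wins.
    return os.environ.setdefault('MUJOCO_GL', backend)


# Must run before `memory_maze` is imported (for example before gym.make),
# because the package only falls back to 'egl' when the variable is unset.
backend = select_mujoco_gl('glfw')
print(f'MUJOCO_GL={backend}')
```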

The default environment has 64x64 image observations:

```python
>>> env.observation_space
Box(0, 255, (64, 64, 3), uint8)
```

There are 6 discrete actions:

```python
>>> env.action_space
Discrete(6)  # (noop, forward, left, right, forward_left, forward_right)
```
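
For orientation, a random-policy rollout might look like the sketch below. It assumes the 4-tuple `step()` return value that `GymWrapper` in this repository produces, and `run_random_episode` is a name invented for this example.

```python
def run_random_episode(env, max_steps=10_000):
    """Roll out a uniformly random policy and return the episode return."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = env.action_space.sample()  # one of the 6 discrete actions
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


# Usage, assuming memory-maze and gym are installed:
#   import gym
#   env = gym.make('memory_maze:MemoryMaze-9x9-v0')
#   print(run_random_episode(env))
```

Since the total episode reward equals the number of reached targets, the returned value can be compared directly with the scores in the table above.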

To create an environment with extra observations for debugging and probe analysis, add `ExtraObs` to the environment name, before the version suffix:

```python
>>> env = gym.make('memory_maze:MemoryMaze-9x9-ExtraObs-v0')
>>> env.observation_space
Dict(
    agent_dir: Box(-inf, inf, (2,), float64), 
    agent_pos: Box(-inf, inf, (2,), float64),
    image: Box(0, 255, (64, 64, 3), uint8),
    maze_layout: Box(0, 1, (9, 9), uint8),
    target_color: Box(-inf, inf, (3,), float64),
    target_pos: Box(-inf, inf, (2,), float64),
    target_vec: Box(-inf, inf, (2,), float64),
    targets_pos: Box(-inf, inf, (3, 2), float64),
    targets_vec: Box(-inf, inf, (3, 2), float64)
)
```

We also register [additional variants](memory_maze/__init__.py) of the environment that can be useful in certain scenarios.

## DeepMind Interface

You can create the environment using the [dm_env](https://github.com/deepmind/dm_env) interface:

```python
from memory_maze import tasks

env = tasks.memory_maze_9x9()
env = tasks.memory_maze_11x11()
env = tasks.memory_maze_13x13()
env = tasks.memory_maze_15x15()
```

Each observation is a dictionary that includes an `image` key:

```python
>>> env.observation_spec()
{
  'image': BoundedArray(shape=(64, 64, 3), ...)
}
```
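
Stepping through an episode with this interface could be sketched as follows. The sketch assumes discrete actions (so that `action_spec()` exposes `num_values`) and uses only the standard `dm_env` `TimeStep` methods; `run_random_episode_dm` is a hypothetical helper, not part of the package.

```python
import numpy as np


def run_random_episode_dm(env, rng=None):
    """Roll out a random policy on a dm_env-style environment."""
    if rng is None:
        rng = np.random.default_rng(0)
    num_actions = env.action_spec().num_values  # assumes discrete actions
    timestep = env.reset()
    total_reward = 0.0
    while not timestep.last():
        action = int(rng.integers(num_actions))
        timestep = env.step(action)
        total_reward += timestep.reward
    return total_reward


# Usage, assuming memory-maze is installed:
#   from memory_maze import tasks
#   env = tasks.memory_maze_9x9()
#   print(run_random_episode_dm(env))
```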

The constructor accepts a number of arguments, which can be used to tweak the environment:

```python
env = tasks.memory_maze_9x9(
    global_observables=True,
    image_only_obs=False,
    top_camera=False,
    camera_resolution=64,
    control_freq=4.0,
    discrete_actions=True,
)
```

## Offline Dataset

[**Dataset download here** (~100GB per dataset)](https://drive.google.com/drive/folders/1RcnkTZVwEHnAQeEuw7X8Y1RPSmrFLDFB)

We provide two datasets of experience collected from the Memory Maze environment: Memory Maze 9x9 (30M) and Memory Maze 15x15 (30M). Each dataset contains 30 thousand trajectories from Memory Maze 9x9 and 15x15 environments respectively, split into 29k trajectories for training and 1k for evaluation. All trajectories are 1000 steps long, so each dataset has 30M steps total.

The data is generated with a scripted policy that navigates to randomly chosen points in the maze under action noise. This choice of policy was made to generate diverse trajectories that explore the maze effectively and that form spatial loops, which can be important for learning long-term memory. We intentionally avoid recording data with a trained agent to ensure a diverse data distribution and to avoid dataset bias that could favor some methods over others. Because of this, the rewards are quite sparse in the data, occurring on average 1-2 times per trajectory.

Each trajectory is saved as an NPZ file with the following entries available:

| Key            | Shape              | Type   | Description                                   |
|----------------|--------------------|--------|-----------------------------------------------|
| `image`        | (64, 64, 3)        | uint8  | First-person view observation                 |
| `action`       | (6)                | binary | Last action, one-hot encoded                  |
| `reward`       | ()                 | float  | Last reward                                   |
| `maze_layout`  | (9, 9) or (15, 15) | binary | Maze layout (wall / no wall)                  |
| `agent_pos`    | (2)                | float  | Agent position in global coordinates          |
| `agent_dir`    | (2)                | float  | Agent orientation as a unit vector            |
| `targets_pos`  | (3, 2) or (6, 2)   | float  | Object locations in global coordinates        |
| `targets_vec`  | (3, 2) or (6, 2)   | float  | Object locations in agent-centric coordinates |
| `target_pos`   | (2)                | float  | Current target object location, global        |
| `target_vec`   | (2)                | float  | Current target object location, agent-centric |
| `target_color` | (3)                | float  | Current target object color RGB               |

You can load a trajectory using [`np.load()`](https://numpy.org/doc/stable/reference/generated/numpy.load.html) to obtain a dictionary of Numpy arrays as follows:

```python
import numpy as np

episode = np.load('trajectory.npz')
episode = {key: episode[key] for key in episode.keys()}

assert episode['image'].shape == (1001, 64, 64, 3)
assert episode['image'].dtype == np.uint8
```

All tensors have a leading time dimension, e.g. `image` tensor has shape (1001, 64, 64, 3). The tensor length is 1001 because there are 1000 steps (actions) in a trajectory, `image[0]` is the observation *before* the first action, and `image[-1]` is the observation *after* the last action.
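
The sketch below shows one way to turn such an episode into aligned transitions. It assumes that `action[t]` and `reward[t]` describe the step that led to `image[t]` (as the "last action" and "last reward" descriptions suggest), so that index 0 is only a placeholder. The function name and the synthetic episode are made up for this example; the synthetic episode is random data, not a real trajectory.

```python
import numpy as np


def episode_to_transitions(episode):
    """Split a (T+1)-length episode into T aligned transitions."""
    actions = episode['action'][1:].argmax(axis=-1)  # one-hot -> index
    return {
        'obs': episode['image'][:-1],
        'action': actions,
        'reward': episode['reward'][1:],
        'next_obs': episode['image'][1:],
    }


# Synthetic stand-in for a dataset file (random content, shortened to 5 steps).
steps = 5
rng = np.random.default_rng(0)
action_ids = rng.integers(6, size=steps)
fake_episode = {
    'image': rng.integers(0, 256, size=(steps + 1, 64, 64, 3), dtype=np.uint8),
    'action': np.concatenate([np.zeros((1, 6)), np.eye(6)[action_ids]]),
    'reward': np.zeros(steps + 1),
}
transitions = episode_to_transitions(fake_episode)
print({key: value.shape for key, value in transitions.items()})
```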

## Online RL Baselines

In our [research paper](https://arxiv.org/abs/2210.13383), we evaluate the model-free [IMPALA](https://github.com/google-research/seed_rl/tree/master/agents/vtrace) agent and the model-based [Dreamer](https://github.com/jurgisp/pydreamer) agent as baselines.

<p align="center">
  <img width="650" alt="baselines" src="https://user-images.githubusercontent.com/3135115/197349778-74073613-bf6c-449b-b5c2-07adf21030ff.png">
  <br/>
  <img width="650" alt="training" src="https://user-images.githubusercontent.com/3135115/197485498-60560934-2629-47b0-ada8-0484398800d0.png">
</p>

Here are videos of the learned behaviors:

**Memory 9x9 - Dreamer (TBTT)**

https://user-images.githubusercontent.com/3135115/197378287-4e413440-7097-4d11-8627-3d7fac0845f1.mp4

**Memory 9x9 - IMPALA (400M)**

https://user-images.githubusercontent.com/3135115/197378929-7fe3f374-c11c-409a-8a95-03feeb489330.mp4

**Memory 15x15 - Dreamer (TBTT)**

https://user-images.githubusercontent.com/3135115/197378324-fb99b496-dba8-4b00-ad80-2d6e19ba8acd.mp4

**Memory 15x15 - IMPALA (400M)**

https://user-images.githubusercontent.com/3135115/197378936-939e7615-9dad-4765-b0ef-a49c5a38fe28.mp4

## Offline Probing Baselines

Here we visualize probe predictions alongside trajectories of the offline dataset, as explained in [the paper](https://arxiv.org/abs/2210.13383). In these trajectories the agent just navigates to random points in the maze; it does *not* try to collect rewards.

Bottom-left: Object location predictions (x) versus the actual locations (o).

Bottom-right: Wall layout predictions (dark green = true positive, light green = true negative, light red = false positive, dark red = false negative).
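
The four color categories correspond to the cells of a binary confusion matrix. A minimal sketch of counting them for a predicted layout follows; treating the value 1 as the positive class is an assumption of this example, and `wall_confusion` is not a function from the repository.

```python
import numpy as np


def wall_confusion(predicted, actual):
    """Count per-cell outcomes for a binary layout prediction."""
    predicted = np.asarray(predicted, dtype=bool)
    actual = np.asarray(actual, dtype=bool)
    return {
        'true_positive': int(np.sum(predicted & actual)),
        'true_negative': int(np.sum(~predicted & ~actual)),
        'false_positive': int(np.sum(predicted & ~actual)),
        'false_negative': int(np.sum(~predicted & actual)),
    }


# Toy 3x3 layouts, invented for illustration only.
actual = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0]])
predicted = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])
counts = wall_confusion(predicted, actual)
accuracy = (counts['true_positive'] + counts['true_negative']) / actual.size
print(counts, f'accuracy={accuracy:.2f}')
```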

**Memory 9x9 Walls Objects - RSSM (TBTT)**

https://user-images.githubusercontent.com/3135115/197379227-775ec5bc-0780-4dcc-b7f1-660bc7cf95f1.mp4

**Memory 9x9 Walls Objects - Supervised oracle**

https://user-images.githubusercontent.com/3135115/197379235-a5ea0388-2718-4035-8bbc-064ecc9ea444.mp4

**Memory 15x15 Walls Objects - RSSM (TBTT)**

https://user-images.githubusercontent.com/3135115/197379245-fb96bd12-6ef5-481e-adc6-f119a39e8e43.mp4

**Memory 15x15 Walls Objects - Supervised oracle**

https://user-images.githubusercontent.com/3135115/197379248-26a8093e-8b54-443c-b154-e33e0383b5e4.mp4

## Questions

Please [open an issue][issues] on GitHub.

[issues]: https://github.com/jurgisp/memory-maze/issues


================================================
FILE: gui/recording.py
================================================
from datetime import datetime
from pathlib import Path

import gym
import imageio
import numpy as np

from PIL import Image


class SaveNpzWrapper(gym.Wrapper):

    def __init__(self, env, log_dir, video_fps=30, video_size=256, video_format='mp4'):
        env = ActionRewardResetWrapper(env)
        env = CollectWrapper(env)
        super().__init__(env)
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.video_fps = video_fps
        self.video_size = video_size
        self.video_format = video_format

    def step(self, action):
        obs, reward, done, info = self.env.step(action)  # type: ignore
        data = info.get('episode')
        if data:
            ep_id = info['episode_id']
            ep_reward = data['reward'].sum()
            ep_steps = len(data['reward']) - 1
            ep_name = f'{ep_id}-r{ep_reward:.0f}-{ep_steps:04}'
            self._save_npz(data, self.log_dir / f'{ep_name}.npz')
            if self.video_format:
                self._save_video(data, self.log_dir / f'{ep_name}.{self.video_format}')
        return obs, reward, done, info

    def _save_npz(self, data, path):
        with path.open('wb') as f:
            np.savez_compressed(f, **data)
        print(f'Saved {path}', {k: v.shape for k, v in data.items()})
    
    def _save_video(self, data, path):
        writer = imageio.get_writer(path, fps=self.video_fps)
        for frame in data['image']:
            img = Image.fromarray(frame)
            img = img.resize((self.video_size, self.video_size), resample=0)
            writer.append_data(np.array(img))
        writer.close()
        print(f'Saved {path}')


class CollectWrapper(gym.Wrapper):
    """Copied from pydreamer.envs.wrappers."""

    def __init__(self, env):
        super().__init__(env)
        self.env = env
        self.episode = []
        self.episode_id = ''

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.episode.append(obs.copy())
        if done:
            episode = {k: np.array([t[k] for t in self.episode]) for k in self.episode[0]}
            info['episode'] = episode
        info['episode_id'] = self.episode_id
        return obs, reward, done, info

    def reset(self):
        obs = self.env.reset()
        self.episode = [obs.copy()]
        self.episode_id = datetime.now().strftime('%Y%m%dT%H%M%S')
        return obs


class ActionRewardResetWrapper(gym.Wrapper):
    """Copied from pydreamer.envs.wrappers."""

    def __init__(self, env, no_terminal=False):
        super().__init__(env)
        self.env = env
        self.no_terminal = no_terminal
        # Handle environments with one-hot or discrete action, but collect always as one-hot
        self.action_size = env.action_space.n if hasattr(env.action_space, 'n') else env.action_space.shape[0]

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if isinstance(action, (int, np.integer)):  # also accept NumPy integers, e.g. from action_space.sample()
            action_vec = np.zeros(self.action_size)
            action_vec[action] = 1.0
        else:
            assert isinstance(action, np.ndarray) and action.shape == (self.action_size,), "Wrong one-hot action shape"
            action_vec = action
        obs['action'] = action_vec
        obs['reward'] = np.array(reward)
        obs['terminal'] = np.array(False if self.no_terminal or 'TimeLimit.truncated' in info or info.get('time_limit') else done)
        obs['reset'] = np.array(False)
        return obs, reward, done, info

    def reset(self):
        obs = self.env.reset()
        obs['action'] = np.zeros(self.action_size)
        obs['reward'] = np.array(0.0)
        obs['terminal'] = np.array(False)
        obs['reset'] = np.array(True)
        return obs


================================================
FILE: gui/requirements.txt
================================================
gym
pygame
pillow
imageio
imageio-ffmpeg


================================================
FILE: gui/run_gui.py
================================================
import argparse
import os
import sys
from collections import defaultdict

import gym
import numpy as np
import pygame
import pygame.freetype
from gym import spaces
from PIL import Image

from recording import SaveNpzWrapper

if 'MUJOCO_GL' not in os.environ:
    if "linux" in sys.platform:
        os.environ['MUJOCO_GL'] = 'osmesa' # Software rendering to avoid rendering interference with pygame
    else:
        os.environ['MUJOCO_GL'] = 'glfw'  # Windowed rendering

PANEL_LEFT = 250
PANEL_RIGHT = 250
FOCUS_HACK = False
RECORD_DIR = './log'
K_NONE = tuple()


def get_keymap(env):
    return {
        tuple(): 0,
        (pygame.K_UP, ): 1,
        (pygame.K_LEFT, ): 2,
        (pygame.K_RIGHT, ): 3,
        (pygame.K_UP, pygame.K_LEFT): 4,
        (pygame.K_UP, pygame.K_RIGHT): 5,
    }


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--env', type=str, default='memory_maze:MemoryMaze-9x9-v0')
    parser.add_argument('--size', type=int, nargs=2, default=(600, 600))
    parser.add_argument('--fps', type=int, default=6)
    parser.add_argument('--random', type=float, default=0.0)
    parser.add_argument('--noreset', action='store_true')
    parser.add_argument('--fullscreen', action='store_true')
    parser.add_argument('--nonoop', action='store_true', help='Pause instead of noop')
    parser.add_argument('--record', action='store_true')
    parser.add_argument('--record_mp4', action='store_true')
    parser.add_argument('--record_gif', action='store_true')
    args = parser.parse_args()
    render_size = args.size
    window_size = (render_size[0] + PANEL_LEFT + PANEL_RIGHT, render_size[1])

    print(f'Creating environment: {args.env}')
    env = gym.make(args.env, disable_env_checker=True)

    if isinstance(env.observation_space, spaces.Dict):
        print('Observation space:')
        for k, v in env.observation_space.spaces.items():  # type: ignore
            print(f'{k:>25}: {v}')
    else:
        print(f'Observation space:  {env.observation_space}')
    print(f'Action space:  {env.action_space}')

    if args.record:
        env = SaveNpzWrapper(
            env,
            RECORD_DIR,
            video_format='mp4' if args.record_mp4 else 'gif' if args.record_gif else None,
            video_fps=args.fps * 2)

    keymap = get_keymap(env)

    steps = 0
    return_ = 0.0
    episode = 0
    obs = env.reset()

    pygame.init()
    start_fullscreen = args.fullscreen or FOCUS_HACK
    screen = pygame.display.set_mode(window_size, pygame.FULLSCREEN if start_fullscreen else 0)
    if FOCUS_HACK and not args.fullscreen:
        # Hack: for some reason app window doesn't get focus when launching, so
        # we launch it as full screen and then exit full screen.
        pygame.display.toggle_fullscreen()
    clock = pygame.time.Clock()
    font = pygame.freetype.SysFont('Mono', 16)
    fontsmall = pygame.freetype.SysFont('Mono', 12)

    running = True
    paused = False
    speedup = False

    while running:

        # Rendering

        screen.fill((64, 64, 64))

        # Render image observation
        if isinstance(obs, dict):
            assert 'image' in obs, 'Expecting dictionary observation with obs["image"]'
            image = obs['image']  # type: ignore
        else:
            assert isinstance(obs, np.ndarray) and len(obs.shape) == 3, 'Expecting image observation'
            image = obs
        image = Image.fromarray(image)
        image = image.resize(render_size, resample=0)
        image = np.array(image)
        surface = pygame.surfarray.make_surface(image.transpose((1, 0, 2)))
        screen.blit(surface, (PANEL_LEFT, 0))

        # Render statistics
        lines = obs_to_text(obs, env, steps, return_)
        y = 5
        for line in lines:
            text_surface, rect = font.render(line, (255, 255, 255))
            screen.blit(text_surface, (16, y))
            y += font.size + 2  # type: ignore

        # Render keymap help
        lines = keymap_to_text(keymap)
        y = 5
        for line in lines:
            text_surface, rect = fontsmall.render(line, (255, 255, 255))
            screen.blit(text_surface, (render_size[0] + PANEL_LEFT + 16, y))
            y += fontsmall.size + 2  # type: ignore

        pygame.display.flip()
        clock.tick(args.fps if not speedup else 0)

        # Keyboard input

        pygame.event.pump()
        keys_down = defaultdict(bool)
        for event in pygame.event.get():
            if event.type == pygame.QUIT:  # Close
                running = False
            if event.type == pygame.KEYDOWN:
                keys_down[event.key] = True
        keys_hold = pygame.key.get_pressed()

        # Action keys
        action = keymap[K_NONE]  # noop, if no keys pressed
        for keys, act in keymap.items():
            if all(keys_hold[key] or keys_down[key] for key in keys):
                # The last keymap entry which has all keys pressed wins
                action = act

        # Special keys
        force_reset = False
        speedup = False
        if keys_down[pygame.K_ESCAPE]:  # Quit
            running = False
        if keys_down[pygame.K_SPACE]:  # Pause
            paused = not paused
        else:
            if action != keymap[K_NONE]:
                paused = False  # unpause on action press
        if keys_down[pygame.K_BACKSPACE]:  # Force reset
            force_reset = True
        if keys_hold[pygame.K_TAB]:
            speedup = True

        if paused:
            continue
        if action == keymap[K_NONE] and args.nonoop and not force_reset:
            continue

        # Environment step

        if args.random:
            if np.random.random() < args.random:
                action = env.action_space.sample()

        obs, reward, done, info = env.step(action)  # type: ignore
        # print({k: v for k, v in obs.items() if k != 'image'})
        steps += 1
        return_ += reward

        # Episode end

        if reward:
            print(f'reward: {reward}')
        if done or force_reset:
            print(f'Episode done - length: {steps}  return: {return_}')
            obs = env.reset()
            steps = 0
            return_ = 0.0
            episode += 1
            if done and args.record:
                # If recording, require relaunch for next episode
                running = False

    pygame.quit()


def obs_to_text(obs, env, steps, return_):
    kvs = []
    kvs.append(('## Stats ##', ''))
    kvs.append(('', ''))
    kvs.append(('step', steps))
    kvs.append(('return', return_))
    lines = [f'{k:<15} {v:>5}' for k, v in kvs]
    return lines


def keymap_to_text(keymap, verbose=False):
    kvs = []
    kvs.append(('## Commands ##', ''))
    kvs.append(('', ''))

    # mapped actions
    kvs.append(('forward', 'up arrow'))
    kvs.append(('left', 'left arrow'))
    kvs.append(('right', 'right arrow'))

    # special actions
    kvs.append(('', ''))
    kvs.append(('reset', 'backspace'))
    kvs.append(('pause', 'space'))
    kvs.append(('speed up', 'tab'))
    kvs.append(('quit', 'esc'))

    lines = [f'{k:<15} {v}' for k, v in kvs]
    return lines


if __name__ == '__main__':
    main()


================================================
FILE: memory_maze/__init__.py
================================================
import os

# NOTE: Env MUJOCO_GL=egl is necessary for headless hardware rendering on GPU,
# but breaks when running on a CPU machine. Alternatively set MUJOCO_GL=osmesa.
if 'MUJOCO_GL' not in os.environ:
    os.environ['MUJOCO_GL'] = 'egl'

from . import tasks

try:
    # Register gym environments, if gym is available

    from typing import Callable
    from functools import partial as f

    import dm_env
    import gym
    from gym.envs.registration import register

    from .gym_wrappers import GymWrapper

    def _make_gym_env(dm_task: Callable[..., dm_env.Environment], **kwargs):
        dmenv = dm_task(**kwargs)
        return GymWrapper(dmenv)

    sizes = {
        '9x9': tasks.memory_maze_9x9,
        '11x11': tasks.memory_maze_11x11,
        '13x13': tasks.memory_maze_13x13,
        '15x15': tasks.memory_maze_15x15,
    }

    for key, dm_task in sizes.items():
        # Image-only obs space
        register(id=f'MemoryMaze-{key}-v0', entry_point=f(_make_gym_env, dm_task, image_only_obs=True))  # Standard
        register(id=f'MemoryMaze-{key}-Vis-v0', entry_point=f(_make_gym_env, dm_task, image_only_obs=True, good_visibility=True))  # Easily visible targets
        register(id=f'MemoryMaze-{key}-HD-v0', entry_point=f(_make_gym_env, dm_task, image_only_obs=True, camera_resolution=256))  # High-res camera
        register(id=f'MemoryMaze-{key}-Top-v0', entry_point=f(_make_gym_env, dm_task, image_only_obs=True, camera_resolution=256, top_camera=True))  # Top-down camera
        
        # Extra global observables (dict obs space)
        register(id=f'MemoryMaze-{key}-ExtraObs-v0', entry_point=f(_make_gym_env, dm_task, global_observables=True))
        register(id=f'MemoryMaze-{key}-ExtraObs-Vis-v0', entry_point=f(_make_gym_env, dm_task, global_observables=True, good_visibility=True))
        register(id=f'MemoryMaze-{key}-ExtraObs-Top-v0', entry_point=f(_make_gym_env, dm_task, global_observables=True, camera_resolution=256, top_camera=True))
        
        # Oracle observables with shortest path shown
        register(id=f'MemoryMaze-{key}-Oracle-v0', entry_point=f(_make_gym_env, dm_task, image_only_obs=True, global_observables=True, show_path=True))
        register(id=f'MemoryMaze-{key}-Oracle-Top-v0', entry_point=f(_make_gym_env, dm_task, image_only_obs=True, global_observables=True, show_path=True, camera_resolution=256, top_camera=True))
        register(id=f'MemoryMaze-{key}-Oracle-ExtraObs-v0', entry_point=f(_make_gym_env, dm_task, global_observables=True, show_path=True))
        
        # High control frequency
        register(id=f'MemoryMaze-{key}-HiFreq-v0', entry_point=f(_make_gym_env, dm_task, image_only_obs=True, control_freq=40))
        register(id=f'MemoryMaze-{key}-HiFreq-Vis-v0', entry_point=f(_make_gym_env, dm_task, image_only_obs=True, control_freq=40, good_visibility=True))
        register(id=f'MemoryMaze-{key}-HiFreq-HD-v0', entry_point=f(_make_gym_env, dm_task, image_only_obs=True, control_freq=40, camera_resolution=256))

        # Six colors even for smaller mazes
        register(id=f'MemoryMaze-{key}-6CL-v0', entry_point=f(_make_gym_env, dm_task, randomize_colors=True, image_only_obs=True))
        register(id=f'MemoryMaze-{key}-6CL-Top-v0', entry_point=f(_make_gym_env, dm_task, randomize_colors=True, image_only_obs=True, camera_resolution=256, top_camera=True))
        register(id=f'MemoryMaze-{key}-6CL-ExtraObs-v0', entry_point=f(_make_gym_env, dm_task, randomize_colors=True, global_observables=True))
        

except ImportError:
    print('memory_maze: gym environments not registered.')
    raise


================================================
FILE: memory_maze/gym_wrappers.py
================================================
from typing import Any, Tuple
import numpy as np

import dm_env
import gym
from dm_env import specs
from gym import spaces


class GymWrapper(gym.Env):

    def __init__(self, env: dm_env.Environment):
        self.env = env
        self.action_space = _convert_to_space(env.action_spec())
        self.observation_space = _convert_to_space(env.observation_spec())

    def reset(self) -> Any:
        ts = self.env.reset()
        return ts.observation

    def step(self, action) -> Tuple[Any, float, bool, dict]:
        ts = self.env.step(action)
        assert not ts.first(), "dm_env.step() caused reset, reward will be undefined."
        assert ts.reward is not None
        done = ts.last()
        terminal = ts.last() and ts.discount == 0.0
        info = {}
        if done and not terminal:
            info['TimeLimit.truncated'] = True  # acme.GymWrapper understands this and converts back to dm_env.truncation()
        return ts.observation, ts.reward, done, info


def _convert_to_space(spec: Any) -> gym.Space:
    # Inverse of acme.gym_wrappers._convert_to_spec

    if isinstance(spec, specs.DiscreteArray):
        return spaces.Discrete(spec.num_values)

    if isinstance(spec, specs.BoundedArray):
        return spaces.Box(
            shape=spec.shape,
            dtype=spec.dtype,
            low=spec.minimum.item() if len(spec.minimum.shape) == 0 else spec.minimum,
            high=spec.maximum.item() if len(spec.maximum.shape) == 0 else spec.maximum)
    
    if isinstance(spec, specs.Array):
        return spaces.Box(
            shape=spec.shape,
            dtype=spec.dtype,
            low=-np.inf,
            high=np.inf)

    if isinstance(spec, tuple):
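        # Materialize as a tuple: some gym versions store the argument as-is, and a generator would be exhausted.
        return spaces.Tuple(tuple(_convert_to_space(s) for s in spec))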

    if isinstance(spec, dict):
        return spaces.Dict({key: _convert_to_space(value) for key, value in spec.items()})

    raise ValueError(f'Unexpected spec: {spec}')


================================================
FILE: memory_maze/helpers.py
================================================
from dm_env.specs import BoundedArray, DiscreteArray
import numpy as np

def sample_spec(space: BoundedArray) -> np.ndarray:
    if isinstance(space, DiscreteArray):
        return np.random.randint(space.num_values, size=space.shape)
    
    if isinstance(space, BoundedArray):
        return np.random.uniform(space.minimum, space.maximum, size=space.shape)
    
    raise NotImplementedError


================================================
FILE: memory_maze/maze.py
================================================
from typing import Optional
import functools
import string

import labmaze
import numpy as np
from dm_control import mjcf
from dm_control.composer.observation import observable as observable_lib
from dm_control.locomotion.arenas import covering, labmaze_textures, mazes
from dm_control.locomotion.props import target_sphere
from dm_control.locomotion.tasks import random_goal_maze
from dm_control.locomotion.walkers import jumping_ball
from labmaze import assets as labmaze_assets
from numpy.random import RandomState

DEFAULT_CONTROL_TIMESTEP = 0.025
DEFAULT_PHYSICS_TIMESTEP = 0.005

TARGET_COLORS = [
    np.array([170, 38, 30]) / 220,  # red
    np.array([99, 170, 88]) / 220,  # green
    np.array([39, 140, 217]) / 220,  # blue
    np.array([93, 105, 199]) / 220,  # purple
    np.array([220, 193, 59]) / 220,  # yellow
    np.array([220, 128, 107]) / 220,  # salmon
]


class RollingBallWithFriction(jumping_ball.RollingBallWithHead):

    def _build(self, roll_damping=5.0, steer_damping=20.0, **kwargs):
        super()._build(**kwargs)
        # Increase joint damping (acting as friction), so the movement feels more like traditional
        # first-person navigation control, without much acceleration/deceleration.
        self._mjcf_root.find('joint', 'roll').damping = roll_damping
        self._mjcf_root.find('joint', 'steer').damping = steer_damping


class MemoryMazeTask(random_goal_maze.NullGoalMaze):
    # Adapted from dm_control.locomotion.tasks.RepeatSingleGoalMaze

    def __init__(self,
                 walker,
                 maze_arena,
                 n_targets=3,
                 target_radius=0.3,
                 target_height_above_ground=0.0,
                 target_reward_scale=1.0,
                 target_randomize_colors=False,
                 enable_global_task_observables=False,
                 camera_resolution=64,
                 physics_timestep=DEFAULT_PHYSICS_TIMESTEP,
                 control_timestep=DEFAULT_CONTROL_TIMESTEP,
                 ):
        super().__init__(
            walker=walker,
            maze_arena=maze_arena,
            randomize_spawn_position=True,
            randomize_spawn_rotation=True,
            contact_termination=False,
            enable_global_task_observables=enable_global_task_observables,
            physics_timestep=physics_timestep,
            control_timestep=control_timestep
        )
        self.n_targets = n_targets
        self._target_radius = target_radius
        self._target_height_above_ground = target_height_above_ground
        self._target_reward_scale = target_reward_scale
        self._target_randomize_colors = target_randomize_colors

        self._targets = []
        self._target_colors = list(TARGET_COLORS)  # This contains all colors, not only n_targets
        self._create_targets()
        self._current_target_ix = 0
        self._rewarded_this_step = False
        self._targets_obtained = 0

        if enable_global_task_observables:
            # Add egocentric vectors to targets
            xpos_origin_callable = lambda phys: phys.bind(walker.root_body).xpos

            def _target_pos(physics, targets, index):
                return physics.bind(targets[index].geom).xpos

            for i in range(n_targets):
                # Absolute target position
                walker.observables.add_observable(
                    f'target_abs_{i}',
                    observable_lib.Generic(functools.partial(_target_pos, targets=self._targets, index=i)),
                )
                # Relative target position
                walker.observables.add_egocentric_vector(
                    f'target_rel_{i}',
                    observable_lib.Generic(functools.partial(_target_pos, targets=self._targets, index=i)),
                    origin_callable=xpos_origin_callable)

        self._task_observables = super().task_observables

        def _current_target_index(_):
            return self._current_target_ix

        def _current_target_color(_):
            return self._target_colors[self._current_target_ix]

        self._task_observables['target_index'] = observable_lib.Generic(_current_target_index)
        self._task_observables['target_index'].enabled = True
        self._task_observables['target_color'] = observable_lib.Generic(_current_target_color)
        self._task_observables['target_color'].enabled = True

        self._walker.observables.egocentric_camera.height = camera_resolution
        self._walker.observables.egocentric_camera.width = camera_resolution
        self._maze_arena.observables.top_camera.height = camera_resolution
        self._maze_arena.observables.top_camera.width = camera_resolution

    @property
    def task_observables(self):
        return self._task_observables

    @property
    def name(self):
        return 'memory_maze'

    def initialize_episode_mjcf(self, rng: RandomState):
        self._maze_arena.regenerate(rng)  # Bypass super().initialize_episode_mjcf(), because it ignores rng
        while True:
            if self._target_randomize_colors:
                # Recreate target objects with new colors
                self._create_targets(clear_existing=True, randomize_colors=True, rng=rng)
            ok = self._place_targets(rng)
            if not ok:
                # Could not place targets - regenerate the maze
                self._maze_arena.regenerate(rng)
                continue
            break
        self._pick_new_target(rng)

    def initialize_episode(self, physics, rng: RandomState):
        super().initialize_episode(physics, rng)
        self._rewarded_this_step = False
        self._targets_obtained = 0

    def after_step(self, physics, rng: RandomState):
        super().after_step(physics, rng)
        self._rewarded_this_step = False
        for i, target in enumerate(self._targets):
            if target.activated:
                if i == self._current_target_ix:
                    self._rewarded_this_step = True
                    self._targets_obtained += 1
                    self._pick_new_target(rng)
                target.reset(physics)  # Resets activated=False

    def should_terminate_episode(self, physics):
        return super().should_terminate_episode(physics)

    def get_reward(self, physics):
        if self._rewarded_this_step:
            return self._target_reward_scale
        return 0.0

    def _create_targets(self, clear_existing=False, randomize_colors=False, rng: Optional[RandomState] = None):
        if clear_existing:
            while self._targets:
                target = self._targets.pop()
                target.detach()  # Important to detach old targets, if creating new ones
        else:
            assert not self._targets, 'Targets already created.'

        if randomize_colors:
            assert rng is not None
            rng.shuffle(self._target_colors)

        for i in range(self.n_targets):
            color = self._target_colors[i]
            target = target_sphere.TargetSphere(
                radius=self._target_radius,
                height_above_ground=self._target_radius + self._target_height_above_ground,
                rgb1=tuple(color * 1.0),
                rgb2=tuple(color * 1.0),
            )
            self._targets.append(target)
            self._maze_arena.attach(target)

    def _place_targets(self, rng: RandomState) -> bool:
        possible_positions = list(self._maze_arena.target_positions)
        rng.shuffle(possible_positions)
        if len(possible_positions) < len(self._targets):
            # Too few rooms - need to regenerate the maze
            return False
        for target, pos in zip(self._targets, possible_positions):
            mjcf.get_attachment_frame(target.mjcf_model).pos = pos
        return True

    def _pick_new_target(self, rng: RandomState):
        while True:
            ix = rng.randint(len(self._targets))
            if self._targets[ix].activated:
                continue  # Skip the target that the agent is touching
            self._current_target_ix = ix
            break


class FixedWallTexture(labmaze_textures.WallTextures):
    """Selects a single texture instead of a collection to sample from."""

    def _build(self, style, texture_name):
        labmaze_textures = labmaze_assets.get_wall_texture_paths(style)
        self._mjcf_root = mjcf.RootElement(model='labmaze_' + style)
        self._textures = []
        if texture_name not in labmaze_textures:
            raise ValueError(f'`texture_name` should be one of {labmaze_textures.keys()}: got {texture_name}')
        texture_path = labmaze_textures[texture_name]
        self._textures.append(self._mjcf_root.asset.add(  # type: ignore
            'texture', type='2d', name=texture_name,
            file=texture_path.format(texture_name)))


class FixedFloorTexture(labmaze_textures.FloorTextures):
    """Selects a single texture instead of a collection to sample from."""

    def _build(self, style, texture_names):
        labmaze_textures = labmaze_assets.get_floor_texture_paths(style)
        self._mjcf_root = mjcf.RootElement(model='labmaze_' + style)
        self._textures = []
        if isinstance(texture_names, str):
            texture_names = [texture_names]
        for texture_name in texture_names:
            if texture_name not in labmaze_textures:
                raise ValueError(f'`texture_name` should be one of {labmaze_textures.keys()}: got {texture_name}')
            texture_path = labmaze_textures[texture_name]
            self._textures.append(self._mjcf_root.asset.add(  # type: ignore
                'texture', type='2d', name=texture_name,
                file=texture_path.format(texture_name)))


class MazeWithTargetsArena(mazes.MazeWithTargets):
    """Fork of mazes.RandomMazeWithTargets."""

    def _build(self,
               x_cells,
               y_cells,
               xy_scale=2.0,
               z_height=2.0,
               max_rooms=4,
               room_min_size=3,
               room_max_size=5,
               spawns_per_room=0,
               targets_per_room=0,
               max_variations=26,
               simplify=True,
               skybox_texture=None,
               wall_textures=None,
               floor_textures=None,
               aesthetic='default',
               name='random_maze',
               random_seed=None):
        assert random_seed is not None, "Expected to be set by tasks._memory_maze()"  # 0 is a valid seed
        super()._build(
            maze=TextMazeVaryingWalls(
                height=y_cells,
                width=x_cells,
                max_rooms=max_rooms,
                room_min_size=room_min_size,
                room_max_size=room_max_size,
                max_variations=max_variations,
                spawns_per_room=spawns_per_room,
                objects_per_room=targets_per_room,
                simplify=simplify,
                random_seed=random_seed),
            xy_scale=xy_scale,
            z_height=z_height,
            skybox_texture=skybox_texture,
            wall_textures=wall_textures,
            floor_textures=floor_textures,
            aesthetic=aesthetic,
            name=name)

    def regenerate(self, random_state):
        """Generates a new maze layout.

        Patch of MazeWithTargets.regenerate() which uses random_state.
        """
        self._maze.regenerate()
        # logging.debug('GENERATED MAZE:\n%s', self._maze.entity_layer)
        self._find_spawn_and_target_positions()

        if self._text_maze_regenerated_hook:
            self._text_maze_regenerated_hook()

        # Remove old texturing planes.
        for geom_name in self._texturing_geom_names:
            del self._mjcf_root.worldbody.geom[geom_name]
        self._texturing_geom_names = []

        # Remove old texturing materials.
        for material_name in self._texturing_material_names:
            del self._mjcf_root.asset.material[material_name]
        self._texturing_material_names = []

        # Remove old actual-wall geoms.
        self._maze_body.geom.clear()

        self._current_wall_texture = {
            wall_char: random_state.choice(wall_textures)  # PATCH: use random_state for wall textures
            for wall_char, wall_textures in self._wall_textures.items()
        }

        for wall_char in self._wall_textures:
            self._make_wall_geoms(wall_char)
        self._make_floor_variations()

    def _make_floor_variations(self, build_tile_geoms_fn=None):
        """Fork of mazes.MazeWithTargets._make_floor_variations().

        Makes the room floors different if possible, instead of sampling randomly.
        """
        _DEFAULT_FLOOR_CHAR = '.'

        main_floor_texture = self._floor_textures[0]
        if len(self._floor_textures) > 1:
            room_floor_textures = self._floor_textures[1:]
        else:
            room_floor_textures = [main_floor_texture]

        for i_var, variation in enumerate(_DEFAULT_FLOOR_CHAR + string.ascii_uppercase):
            if variation not in self._maze.variations_layer:
                break

            if build_tile_geoms_fn is None:
                # Break the floor variation down to odd-sized tiles.
                tiles = covering.make_walls(self._maze.variations_layer,
                                            wall_char=variation,
                                            make_odd_sized_walls=True)
            else:
                tiles = build_tile_geoms_fn(wall_char=variation)

            if variation == _DEFAULT_FLOOR_CHAR:
                variation_texture = main_floor_texture
            else:
                variation_texture = room_floor_textures[i_var % len(room_floor_textures)]

            for i, tile in enumerate(tiles):
                tile_mid = covering.GridCoordinates(
                    (tile.start.y + tile.end.y - 1) / 2,
                    (tile.start.x + tile.end.x - 1) / 2)
                tile_pos = np.array([(tile_mid.x - self._x_offset) * self._xy_scale,
                                     -(tile_mid.y - self._y_offset) * self._xy_scale,
                                     0.0])
                tile_size = np.array([(tile.end.x - tile_mid.x - 0.5) * self._xy_scale,
                                      (tile.end.y - tile_mid.y - 0.5) * self._xy_scale,
                                      self._xy_scale])
                if variation == _DEFAULT_FLOOR_CHAR:
                    tile_name = 'floor_{}'.format(i)
                else:
                    tile_name = 'floor_{}_{}'.format(variation, i)
                self._tile_geom_names[tile.start] = tile_name
                self._texturing_material_names.append(tile_name)
                self._texturing_geom_names.append(tile_name)
                material = self._mjcf_root.asset.add(
                    'material', name=tile_name, texture=variation_texture,
                    texrepeat=(2 * tile_size[[0, 1]] / self._xy_scale))
                self._mjcf_root.worldbody.add(
                    'geom', name=tile_name, type='plane', material=material,
                    pos=tile_pos, size=tile_size, contype=0, conaffinity=0)


class TextMazeVaryingWalls(labmaze.RandomMaze):
    """Augments standard generated labmaze with some walls marked with different chars."""

    def regenerate(self):
        super().regenerate()
        self._block_variations()

    def _block_variations(self):
        nblocks = 3
        wall_chars = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

        n = self.entity_layer.shape[0]
        ivar = 0
        for i in range(nblocks):
            for j in range(nblocks):
                i_from = i * n // nblocks
                i_to = (i + 1) * n // nblocks
                j_from = j * n // nblocks
                j_to = (j + 1) * n // nblocks
                self._change_block_char(i_from, i_to, j_from, j_to, wall_chars[ivar])
                ivar += 1

    def _change_block_char(self, i1, i2, j1, j2, char):
        grid = self.entity_layer
        i, j = np.where(grid[i1:i2, j1:j2] == '*')
        grid[i + i1, j + j1] = char


================================================
FILE: memory_maze/oracle.py
================================================
from collections import deque
from typing import List, Optional, Tuple
import numpy as np

from memory_maze.wrappers import ObservationWrapper


class PathToTargetWrapper(ObservationWrapper):
    """Find shortest path to target and indicate it on maze_layout. Used for Oracle."""

    def observation_spec(self):
        spec = self.env.observation_spec()
        assert isinstance(spec, dict)
        assert 'agent_pos' in spec
        assert 'target_pos' in spec
        assert 'maze_layout' in spec
        return spec

    def observation(self, obs):
        assert isinstance(obs, dict)
        # Find shortest path (in gridworld) from agent to target
        maze = obs['maze_layout']
        start = tuple(obs['agent_pos'].astype(int))
        finish = tuple(obs['target_pos'].astype(int))
        path = breadth_first_search(maze, start, finish)
        if path:
            for x, y in path:
                maze[y, x] = 2  # Update maze_layout observation
        return obs


class DrawMinimapWrapper(ObservationWrapper):
    """Show maze_layout as minimap in image observation. Used for Oracle."""

    def observation_spec(self):
        spec = self.env.observation_spec()
        assert isinstance(spec, dict)
        assert 'maze_layout' in spec
        assert 'image' in spec
        assert 'agent_dir' in spec
        return spec

    def observation(self, obs):
        from PIL import Image

        assert isinstance(obs, dict)
        maze = obs['maze_layout']
        x, y = obs['agent_pos']
        dx, dy = obs['agent_dir']
        angle = np.arctan2(dx, dy)
        N = maze.shape[0]
        SIZE = N * 2

        # Draw map
        map = np.zeros((N, N, 3), np.uint8)  # walls in black
        map[:, :] += (maze == 1)[..., None] * np.array([[[255, 255, 255]]], np.uint8)  # corridors in white
        map[:, :] += (maze == 2)[..., None] * np.array([[[0, 255, 0]]], np.uint8)  # path in green
        map[int(y), int(x)] = np.array([255, 0, 0], np.uint8)  # agent in red
        map = np.flip(map, 0)

        # Scale, rotate, translate
        mapimg = Image.fromarray(map)
        mapimg = mapimg.resize((SIZE, SIZE), resample=0)
        tx = (x - N / 2) / N * SIZE
        ty = - (y - N / 2) / N * SIZE
        mapimg = mapimg.transform(mapimg.size, 0,
                                  (1, 0, tx,
                                   0, 1, ty),
                                  resample=0)
        mapimg = mapimg.rotate(angle / np.pi * 180, resample=0)

        # Overlay minimap onto observation image top-right corner
        img = obs['image']
        img[:SIZE, -SIZE:] = img[:SIZE, -SIZE:] // 2 + np.array(mapimg) // 2
        return obs


def breadth_first_search(maze: np.ndarray, start: Tuple[int, int], finish: Tuple[int, int]) -> Optional[List[Tuple[int, int]]]:
    h, w = maze.shape

    queue = deque()
    visited = np.zeros(maze.shape, dtype=bool)
    backtrace = np.zeros(maze.shape + (2,), dtype=int)

    xs, ys = start
    queue.append((xs, ys))
    visited[ys, xs] = True

    while len(queue) > 0:
        x, y = queue.popleft()
        for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            x1 = x + dx
            y1 = y + dy
            if 0 <= x1 < w and 0 <= y1 < h and maze[y1, x1] and not visited[y1, x1]:
                queue.append((x1, y1))
                visited[y1, x1] = True
                backtrace[y1, x1, :] = np.array([x, y])
                if (x1, y1) == finish:
                    break

    xf, yf = finish
    if not visited[yf, xf]:
        return None

    path = []
    path.append((xf, yf))
    while (xf, yf) != start:
        xf, yf = backtrace[yf, xf]
        path.append((xf, yf))
    path.reverse()
    return path


================================================
FILE: memory_maze/tasks.py
================================================
import numpy as np
from dm_control import composer
from dm_control.locomotion.arenas import labmaze_textures

from memory_maze.maze import *
from memory_maze.oracle import DrawMinimapWrapper, PathToTargetWrapper
from memory_maze.wrappers import *

# Slow control (4Hz), so that agent without HRL has a chance.
# Native control would be ~20Hz, so this corresponds roughly to action_repeat=5.
DEFAULT_CONTROL_FREQ = 4.0


def memory_maze_9x9(**kwargs):
    """
    Maze based on DMLab30-explore_goal_locations_small
    {
        mazeHeight = 11,  # with outer walls
        mazeWidth = 11,
        roomCount = 4,
        roomMaxSize = 5,
        roomMinSize = 3,
    }
    """
    return _memory_maze(9, 3, 250, **kwargs)


def memory_maze_11x11(**kwargs):
    return _memory_maze(11, 4, 500, **kwargs)


def memory_maze_13x13(**kwargs):
    return _memory_maze(13, 5, 750, **kwargs)


def memory_maze_15x15(**kwargs):
    """
    Maze based on DMLab30-explore_goal_locations_large
    {
        mazeHeight = 17,  # with outer walls
        mazeWidth = 17,
        roomCount = 9,
        roomMaxSize = 3,
        roomMinSize = 3,
    }
    """
    return _memory_maze(15, 6, 1000, max_rooms=9, room_max_size=3, **kwargs)


def _memory_maze(
    maze_size,  # measured without exterior walls
    n_targets,
    time_limit,
    max_rooms=6,
    room_min_size=3,
    room_max_size=5,
    control_freq=DEFAULT_CONTROL_FREQ,
    discrete_actions=True,
    image_only_obs=False,
    target_color_in_image=True,
    global_observables=False,
    top_camera=False,
    good_visibility=False,
    show_path=False,
    camera_resolution=64,
    seed=None,
    randomize_colors=False,
):
    random_state = np.random.RandomState(seed)
    walker = RollingBallWithFriction(camera_height=0.3, add_ears=top_camera)
    arena = MazeWithTargetsArena(
        x_cells=maze_size + 2,  # inner size => outer size
        y_cells=maze_size + 2,
        xy_scale=2.0,
        z_height=1.5 if not good_visibility else 0.4,
        max_rooms=max_rooms,
        room_min_size=room_min_size,
        room_max_size=room_max_size,
        spawns_per_room=1,
        targets_per_room=1,
        floor_textures=FixedFloorTexture('style_01', ['blue', 'blue_bright']),
        wall_textures=dict({
            '*': FixedWallTexture('style_01', 'yellow'),  # default wall
        }, **{str(i): labmaze_textures.WallTextures('style_01') for i in range(10)}  # variations
        ),
        skybox_texture=None,
        random_seed=random_state.randint(2147483648),
    )

    task = MemoryMazeTask(
        walker=walker,
        maze_arena=arena,
        n_targets=n_targets,
        target_radius=0.6,
        target_height_above_ground=0.5 if good_visibility else -0.6,
        enable_global_task_observables=True,  # Always add to underlying env, but not always expose in RemapObservationWrapper
        control_timestep=1.0 / control_freq,
        camera_resolution=camera_resolution,
        target_randomize_colors=randomize_colors,
    )

    if top_camera:
        task.observables['top_camera'].enabled = True

    env = composer.Environment(
        time_limit=time_limit - 1e-3,  # subtract epsilon to make sure ep_length=time_limit*fps
        task=task,
        random_state=random_state,
        strip_singleton_obs_buffer_dim=True)

    obs_mapping = {
        'image': 'walker/egocentric_camera' if not top_camera else 'top_camera',
        'target_color': 'target_color',
    }
    if global_observables:
        env = TargetsPositionWrapper(env, task._maze_arena.xy_scale, task._maze_arena.maze.width, task._maze_arena.maze.height)
        env = AgentPositionWrapper(env, task._maze_arena.xy_scale, task._maze_arena.maze.width, task._maze_arena.maze.height)
        env = MazeLayoutWrapper(env)
        obs_mapping = dict(obs_mapping, **{
            'agent_pos': 'agent_pos',
            'agent_dir': 'agent_dir',
            'targets_vec': 'targets_vec',
            'targets_pos': 'targets_pos',
            'target_vec': 'target_vec',
            'target_pos': 'target_pos',
            'maze_layout': 'maze_layout',
        })

    env = RemapObservationWrapper(env, obs_mapping)

    if target_color_in_image:
        env = TargetColorAsBorderWrapper(env)

    if show_path:
        env = PathToTargetWrapper(env)
        env = DrawMinimapWrapper(env)

    if image_only_obs:
        assert target_color_in_image, 'Image-only observation only makes sense with target_color_in_image'
        env = ImageOnlyObservationWrapper(env)

    if discrete_actions:
        env = DiscreteActionSetWrapper(env, [
            np.array([0.0, 0.0]),  # noop
            np.array([-1.0, 0.0]),  # forward
            np.array([0.0, -1.0]),  # left
            np.array([0.0, +1.0]),  # right
            np.array([-1.0, -1.0]),  # forward + left
            np.array([-1.0, +1.0]),  # forward + right
        ])

    return env


================================================
FILE: memory_maze/wrappers.py
================================================


from typing import Any, Dict, List

import dm_env
import numpy as np
from dm_env import specs


class Wrapper(dm_env.Environment):
    """Base class for dm_env.Environment wrapper."""

    def __init__(self, env: dm_env.Environment):
        self.env = env

    def __getattr__(self, name):
        if name.startswith('__'):
            raise AttributeError(f'Attempted to get missing private attribute {name}')
        return getattr(self.env, name)

    def step(self, action) -> dm_env.TimeStep:
        return self.env.step(action)

    def reset(self) -> dm_env.TimeStep:
        return self.env.reset()

    def action_spec(self) -> Any:
        return self.env.action_spec()

    def discount_spec(self) -> Any:
        return self.env.discount_spec()

    def observation_spec(self) -> Any:
        return self.env.observation_spec()

    def reward_spec(self) -> Any:
        return self.env.reward_spec()

    def close(self):
        return self.env.close()


class ObservationWrapper(Wrapper):
    """Base class for observation wrapper."""

    def observation_spec(self):
        raise NotImplementedError

    def observation(self, obs: Any) -> Any:
        raise NotImplementedError

    def step(self, action) -> dm_env.TimeStep:
        step_type, discount, reward, observation = self.env.step(action)
        return dm_env.TimeStep(step_type, discount, reward, self.observation(observation))

    def reset(self) -> dm_env.TimeStep:
        step_type, discount, reward, observation = self.env.reset()
        return dm_env.TimeStep(step_type, discount, reward, self.observation(observation))


class RemapObservationWrapper(ObservationWrapper):
    """Select a subset of dictionary observation keys and rename them."""

    def __init__(self, env: dm_env.Environment, mapping: Dict[str, str]):
        super().__init__(env)
        self.mapping = mapping

    def observation_spec(self):
        spec = self.env.observation_spec()
        assert isinstance(spec, dict)
        return {key: spec[key_orig] for key, key_orig in self.mapping.items()}

    def observation(self, obs):
        assert isinstance(obs, dict)
        return {key: obs[key_orig] for key, key_orig in self.mapping.items()}


class TargetsPositionWrapper(ObservationWrapper):
    """Collects and postprocesses walker/target_rel_{i} relative position vectors into
    targets_vec (n_targets,2) tensor, and walker/target_abs_{i} absolute positions
    into targets_pos tensor."""

    def __init__(self, env: dm_env.Environment, maze_xy_scale, maze_width, maze_height):
        super().__init__(env)
        self.maze_xy_scale = maze_xy_scale
        self.center_ji = np.array([maze_width - 2.0, maze_height - 2.0]) / 2.0

        spec = self.env.observation_spec()
        assert isinstance(spec, dict)
        assert 'walker/target_rel_0' in spec
        assert 'walker/target_abs_0' in spec
        assert 'target_index' in spec

        i = 0
        while f'walker/target_rel_{i}' in spec:
            assert f'walker/target_abs_{i}' in spec
            i += 1

        self.n_targets = i

    def observation_spec(self):
        spec = self.env.observation_spec()
        assert isinstance(spec, dict)
        # All targets
        spec['targets_vec'] = specs.Array((self.n_targets, 2), float, 'targets_vec')
        spec['targets_pos'] = specs.Array((self.n_targets, 2), float, 'targets_pos')
        # Current target
        spec['target_vec'] = specs.Array((2,), float, 'target_vec')
        spec['target_pos'] = specs.Array((2,), float, 'target_pos')
        return spec

    def observation(self, obs):
        assert isinstance(obs, dict)
        # All targets
        x_rel = np.zeros((self.n_targets, 2))
        x_abs = np.zeros((self.n_targets, 2))
        for i in range(self.n_targets):
            x_rel[i] = obs[f'walker/target_rel_{i}'][:2] / self.maze_xy_scale
            x_abs[i] = obs[f'walker/target_abs_{i}'][:2] / self.maze_xy_scale + self.center_ji
        obs['targets_vec'] = x_rel
        obs['targets_pos'] = x_abs
        # Current target
        target_ix = int(obs['target_index'])
        obs['target_vec'] = x_rel[target_ix]
        obs['target_pos'] = x_abs[target_ix]
        return obs


class AgentPositionWrapper(ObservationWrapper):
    """Postprocesses absolute_position and absolute_orientation."""

    def __init__(self, env: dm_env.Environment, maze_xy_scale, maze_width, maze_height):
        super().__init__(env)
        self.maze_xy_scale = maze_xy_scale
        self.center_ji = np.array([maze_width - 2.0, maze_height - 2.0]) / 2.0

    def observation_spec(self):
        spec = self.env.observation_spec()
        # absolute_position and absolute_orientation should already be generated by the environment.
        assert isinstance(spec, dict) and 'absolute_position' in spec and 'absolute_orientation' in spec
        # Add agent_pos, measured in grid coordinates
        spec['agent_pos'] = specs.Array((2, ), float, 'agent_pos')
        # Add agent_dir as 2-vector
        spec['agent_dir'] = specs.Array((2, ), float, 'agent_dir')
        return spec

    def observation(self, obs):
        assert isinstance(obs, dict)
        walker_xy = obs['absolute_position'][:2]
        walker_ji = walker_xy / self.maze_xy_scale + self.center_ji
        # agent_pos, measured in grid coordinates, where bottom-left coordinate is (0.1,0.1),
        # and top-right coordinate for a 15x15 maze is (14.9,14.9)
        obs['agent_pos'] = walker_ji
        # Pick the orientation vector such that going forward moves agent_pos in the direction of agent_dir.
        obs['agent_dir'] = obs['absolute_orientation'][:2, 1]
        return obs


class MazeLayoutWrapper(ObservationWrapper):
    """Postprocesses maze_layout observation."""

    def observation_spec(self):
        spec = self.env.observation_spec()
        # maze_layout should already be generated by the environment
        assert isinstance(spec, dict) and 'maze_layout' in spec
        # Change char array to binary array, removing outer walls
        n, m = spec['maze_layout'].shape
        spec['maze_layout'] = specs.BoundedArray((n - 2, m - 2), np.uint8, 0, 1, 'maze_layout')
        return spec

    def observation(self, obs):
        assert isinstance(obs, dict)
        maze = obs['maze_layout']
        maze = maze[1:-1, 1:-1]  # Remove outer walls
        maze = np.flip(maze, 0)  # Flip vertical axis so that bottom-left is at maze[0,0]
        nonwalls = (maze == ' ') | (maze == 'P') | (maze == 'G')
        obs['maze_layout'] = nonwalls.astype(np.uint8)
        return obs
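

# Illustrative sketch (not part of the original module): the same char-to-binary
# conversion as MazeLayoutWrapper.observation, as a standalone function that can
# be tried on a small array. The name `layout_to_binary` and the '*' wall
# character in the example are assumptions; any character other than ' ', 'P'
# or 'G' is treated as a wall.
import numpy as np  # repeated here so the sketch stands alone


def layout_to_binary(maze):
    """Return a uint8 array with 1 for free cells and 0 for walls, bottom-left at [0, 0]."""
    inner = maze[1:-1, 1:-1]  # Remove outer walls
    inner = np.flip(inner, 0)  # Flip vertical axis
    nonwalls = (inner == ' ') | (inner == 'P') | (inner == 'G')
    return nonwalls.astype(np.uint8)

# Example: a 4x4 layout whose 2x2 interior is [['P', ' '], ['*', 'G']]
#   layout_to_binary(np.array([list('****'), list('*P *'), list('**G*'), list('****')]))
#   ->  [[0, 1], [1, 1]]   (rows flipped, so the wall ends up at [0, 0])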


class ImageOnlyObservationWrapper(ObservationWrapper):
    """Selects a single key of the dictionary observation and returns its value as the observation."""

    def __init__(self, env: dm_env.Environment, key: str = 'image'):
        super().__init__(env)
        self.key = key

    def observation_spec(self):
        spec = self.env.observation_spec()
        assert isinstance(spec, dict)
        return spec[self.key]

    def observation(self, obs):
        assert isinstance(obs, dict)
        return obs[self.key]


class DiscreteActionSetWrapper(Wrapper):
    """Changes the action space from continuous to discrete, using a given set of action vectors."""

    def __init__(self, env: dm_env.Environment, action_set: List[np.ndarray]):
        super().__init__(env)
        self.action_set = action_set

    def action_spec(self):
        return specs.DiscreteArray(len(self.action_set))

    def step(self, action) -> dm_env.TimeStep:
        return self.env.step(self.action_set[action])


class TargetColorAsBorderWrapper(ObservationWrapper):
    """MemoryMaze-specific wrapper that draws target_color as a border on the image."""

    def observation_spec(self):
        spec = self.env.observation_spec()
        assert isinstance(spec, dict)
        assert 'target_color' in spec
        return spec

    def observation(self, obs):
        assert isinstance(obs, dict)
        assert 'target_color' in obs and 'image' in obs
        target_color = obs['target_color']
        img = obs['image']
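        # Border width in pixels: 2 for a 64px-high image, 4 for 256px.
        B = int(2 * np.sqrt(img.shape[0] // 64))
        if B > 0:  # With B == 0 (height under 64px), img[:, -B:] would cover the whole image.
            border = target_color * 255 * 0.7
            img[:, :B] = border
            img[:, -B:] = border
            img[:B, :] = border
            img[-B:, :] = border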
        return obs
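

# Illustrative sketch (not part of the original module): the border-width
# formula from TargetColorAsBorderWrapper.observation, isolated so that its
# scaling is easy to check. The name `border_width` is an assumption.
import numpy as np  # repeated here so the sketch stands alone


def border_width(image_height):
    """Border thickness in pixels for an image of the given height."""
    return int(2 * np.sqrt(image_height // 64))

# The width grows with the square root of the resolution:
#   border_width(64)  -> 2
#   border_width(256) -> 4
# Heights below 64 give a width of 0.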


================================================
FILE: setup.py
================================================
from setuptools import setup
import pathlib

__version__ = "1.0.3"

setup(
    name="memory-maze",
    version=__version__,
    author="Jurgis Pasukonis",
    author_email="jurgisp@gmail.com",
    url="https://github.com/jurgisp/memory-maze",
    description="Memory Maze is an environment to benchmark memory abilities of RL agents",
    long_description=pathlib.Path('README.md').read_text(encoding='utf-8'),
    long_description_content_type='text/markdown',
    zip_safe=False,
    python_requires=">=3",
    packages=["memory_maze"],
    install_requires=[
        'dm_control'
    ],
)