MarcoPolo: A Next-Generation Framework for Reinforcement Learning

MarcoPolo is an advanced reinforcement learning (RL) framework designed to make the training of multiple artificial intelligence (AI) agents fast, accessible, and reproducible. It improves upon existing techniques to create a robust system in which agents learn and adapt efficiently across different environments and scenarios. This approach enables the training of generalized agents capable of zero-shot learning.

The Core Process

MarcoPolo operates through a core process of optimization, transfer, and evolution:

  1. Optimize: Train agents to perform better in their respective environments.
  2. Transfer: Move agents between environments to identify top performers and facilitate cross-learning.
  3. Evolve: Create new environments based on novelty and complexity, retiring older ones to keep the training dynamic and challenging. (A toy sketch of this cycle follows the list.)
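As a concrete, if deliberately toy, illustration of this optimize-transfer-evolve cycle, the Python sketch below treats each "agent" as a single parameter and each "environment" as a target value. The names, scoring rule, and mutation scheme are illustrative assumptions and do not reflect MarcoPolo's actual API.

```python
import random

# Toy, self-contained sketch of the optimize -> transfer -> evolve cycle;
# "agents" are single parameters and "environments" are target values.
# Purely illustrative; this is not MarcoPolo's API.

def train(agent, target, steps=50, lr=0.1):
    """Optimize: nudge the agent's parameter toward the environment's target."""
    for _ in range(steps):
        agent += lr * (target - agent)
    return agent

def score(agent, target):
    """Higher is better: negative distance from the target."""
    return -abs(target - agent)

def run_loop(n_agents=3, generations=5):
    agents = [random.uniform(-1.0, 1.0) for _ in range(n_agents)]
    envs = [random.uniform(-1.0, 1.0) for _ in range(n_agents)]
    for gen in range(generations):
        # 1. Optimize: train each agent in its paired environment.
        agents = [train(a, e) for a, e in zip(agents, envs)]
        # 2. Transfer: pair every environment with its best-scoring agent.
        agents = [max(agents, key=lambda a: score(a, e)) for e in envs]
        # 3. Evolve: environments an agent has solved mutate into harder ones.
        envs = [e + random.uniform(0.5, 1.0) if score(a, e) > -0.05 else e
                for a, e in zip(agents, envs)]
        mean = sum(score(a, e) for a, e in zip(agents, envs)) / n_agents
        print(f"generation {gen}: mean score = {mean:.3f}")

run_loop()
```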

Advantages:

  • Modularity: MarcoPolo’s design allows different algorithms, environments, and backends to be easily integrated with existing systems.
  • Scalability: Capable of handling large-scale training tasks, making it suitable for practical applications involving multiple agents.
  • Flexibility: Supports multi-agent simulations, enabling complex interactions and cooperative learning among agents.
  • Reproducibility: MarcoPolo is seed-safe, enabling accurate reproduction of simulations of interest, and creates checkpoints throughout the training regimen (a seed-safety sketch follows this list).

Real-World Use

MarcoPolo has been utilized by several research institutions such as UC Berkeley, Virginia Tech, AFRL, and the University of Dayton Research Institute, showcasing its effectiveness and adaptability in various research and development scenarios.

Its ability to increase the accessibility of multi-agent automated curriculum learning makes it a valuable tool for researchers and developers alike in pushing the boundaries of AI capabilities. For more detailed information and to access the open-source project, visit the MarcoPolo GitHub repository.

Technical Background

Reinforcement learning (RL) has been a promising subfield of AI for teaching models various kinds of problem solving. RL typically includes three main components: agents, the environment, and the reward structure. The environment represents the boundaries in which the simulations take place, whether that is a physical setting like space or something more abstract like a text document. The environment is populated with agents that can carry out actions. The “action space” describes the capabilities of the agents; for a satellite, it could include burn fuel, rotate, or do nothing. The reward function is designed to reward or penalize agents for various actions, such as reaching a waypoint within a time limit or being inefficient with fuel use, respectively. Once the reward function is defined, the agents are exposed to a set of environments in which they act within their action space to maximize reward. After the agents have interacted with enough environments and the reward function has been appropriately tuned to the task at hand, they begin to exhibit the desired behaviors. The agents are usually perceived as the intelligent part of the AI, but the coordination of all three components is required to develop their capabilities.
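To make those three components concrete, the toy environment below reuses the satellite example: the action space is {burn fuel, rotate, do nothing}, and the reward encourages reaching a waypoint while penalizing fuel use and wasted time. It loosely follows a Gymnasium-style reset/step interface as an illustration only; it is not code from MarcoPolo.

```python
import random

# Toy 1-D "satellite" environment illustrating the agent / environment /
# reward split. Not MarcoPolo code.

class SatelliteEnv:
    ACTIONS = ("burn_fuel", "rotate", "do_nothing")  # the action space

    def reset(self):
        self.position, self.heading, self.fuel = 0.0, 1, 10.0
        self.waypoint, self.steps = 5.0, 0
        return self._obs()

    def step(self, action):
        self.steps += 1
        if action == "burn_fuel" and self.fuel > 0:
            self.position += self.heading   # move along the current heading
            self.fuel -= 1.0                # at the cost of fuel
        elif action == "rotate":
            self.heading *= -1              # flip direction
        at_waypoint = abs(self.position - self.waypoint) < 0.5
        # Reward reaching the waypoint; penalize fuel burns and wasted time.
        reward = 10.0 if at_waypoint else 0.0
        reward -= 0.1 if action == "burn_fuel" else 0.0
        reward -= 0.01
        done = at_waypoint or self.steps >= 50 or self.fuel <= 0
        return self._obs(), reward, done

    def _obs(self):
        return (self.position, self.heading, self.fuel)

# A random agent interacting with the environment for one episode.
env = SatelliteEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(random.choice(SatelliteEnv.ACTIONS))
    total_reward += reward
print("episode return:", round(total_reward, 2))
```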

There is a weakness to the reinforcement learning process, though. When agents finish training and are deployed in a real scenario, there is a high chance of failure if that scenario is unlike any they previously experienced. Many RL approaches fall short here: environments not encountered during training often confuse agents trained with traditional RL, because the AI learned to solve the specific environments it saw in training instead of learning general problem-solving skills. For example, if agents are trained to operate in space but the environment does not accurately represent sunlight conditions, then the satellite agents will have no idea how to protect their sensors from UV damage. This failure to apply what is learned in one domain (simulation) to another (real life) is referred to as negative transfer.

We minimize the issue of negative transfer with Adaptive Curriculum Learning (ACL), an enhanced approach to RL that rewards the AI for building skills so that it is better equipped to handle unknowns. One of the main differences between traditional reinforcement learning and ACL is that in ACL the environments gradually evolve in complexity, so the AI can incrementally learn more complex skills as it solves each challenge. The team at Mobius Logic built MarcoPolo so that the AI faces challenges that start simple and slowly increase in difficulty. This is in line with the concept of “walk before you run,” where basics are established before more complicated concepts are introduced. A satellite agent may first learn to travel in straight lines, then develop more sophisticated movement patterns once obstacles are added. The greatest strength of this skill-based approach is that it can enable one-shot learning, where a highly skilled agent solves a completely new scenario in one attempt. One-shot learning minimizes the chance of failure in missions where the setting can be radically different from the training environment.
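A minimal sketch of this “walk before you run” idea: environment difficulty (here, the number of obstacles) is raised only once the agent's recent success rate clears a threshold. The function name, window size, and threshold are illustrative assumptions, not MarcoPolo's defaults.

```python
from collections import deque

# Toy adaptive-curriculum schedule: raise the environment difficulty (e.g.
# the number of obstacles) only after the agent demonstrates competence at
# the current level. Values are illustrative, not MarcoPolo defaults.

def adaptive_curriculum(run_episode, max_difficulty=5, window=20,
                        threshold=0.8, max_episodes=10_000):
    difficulty = 0                      # start with the simplest environment
    recent = deque(maxlen=window)       # rolling record of episode successes
    for _ in range(max_episodes):
        if difficulty >= max_difficulty:
            break
        recent.append(run_episode(difficulty))   # True on success, else False
        # Promote to a harder environment once the agent reliably succeeds.
        if len(recent) == window and sum(recent) / window >= threshold:
            difficulty += 1
            recent.clear()
    return difficulty
```

Here `run_episode` stands in for one round of training and evaluation at the given difficulty, returning True when the agent succeeds.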

MarcoPolo effectively implements ACL to produce highly skilled agents capable of handling tasks that were otherwise unapproachable with traditional RL. These capabilities expand further with the incorporation of multiple agents trained to work as a team. Traditional RL systems become exponentially more complicated as agents are added, but MarcoPolo handles their addition with ease.
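One common (though not the only) source of that exponential growth is the joint action space: a centralized controller over n agents, each with |A| individual actions, must reason over |A|^n joint actions. The arithmetic below simply illustrates that scaling; it is a general observation about multi-agent RL, not a description of MarcoPolo's internals.

```python
# Why naive multi-agent RL blows up: a centralized controller over n agents,
# each with |A| actions, faces |A|**n joint actions. Illustrative arithmetic.
actions_per_agent = 3  # e.g., burn fuel, rotate, do nothing
for n_agents in (1, 2, 5, 10):
    print(f"{n_agents} agents -> {actions_per_agent ** n_agents} joint actions")
```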
