This is a loosely organized crosspost of a project I did in Doina Precup's excellent course on Reinforcement Learning at McGill University.
Research into the use of Minecraft as a platform for building custom reinforcement learning environments. A detailed blog post is coming soon. Can Minecraft be a powerful alternative to classic control, Nvidia Isaac, and many other RL frameworks, given its open-ended nature and the possibilities for crafting custom environments?
Challenges:
It’s practical; see the minerl-parkour repo (a replicated study).
Theoretically possible via custom action and observation spaces in Malmo, but difficult to set up due to the lack of support for Malmo in current environments.
This paper introduces Minecraft as a reinforcement learning environment and aims to assess whether it is a useful platform for introductory RL education. The practicality of using Minecraft is analyzed through a meta-analysis of challenges and frameworks that use Minecraft as an environment, as well as a study on reducing the complexity of the game to allow experimentation on lower-end hardware. Due to training time limitations, this paper does not have concrete data on algorithm performance; for this, see the accompanying comparison of deep RL algorithms on Atari games, which reuses much of the code from this exploration.
At 238,000,000 copies sold, Minecraft is by far the best-selling video game in the world. It has been the subject of cutting-edge reinforcement learning research from top labs including OpenAI, Nvidia, Microsoft, and DeepMind. The top papers treating Minecraft as an environment for reinforcement learning have focused on training models to play the game based on human priors. This paper aims to evaluate the state of Minecraft as an environment for simple yet customizable RL experiments. For as long as Minecraft has been around, players have been constructing challenges for each other: SkyBlock, parkour maps, mazes, and more. This paper provides a meta-analysis of existing work in Minecraft-based reinforcement learning, as well as the results of various experiments in reducing observation and action spaces, reward engineering, environment building, and developer experience.
2016’s Project Malmo
2018’s marLo
2018’s MalmoEnv
2019-present MineRL
2021’s IGLU
2022’s MineDojo
This paper focuses on implementations using MineRL and MineDojo, as they are currently maintained and provide a robust set of environments and mostly up-to-date documentation.
The goal of this exploration is to follow in the spirit of Minecraft players by creating a challenge for the agent to solve. The challenge is meant to have the following properties:
As a naïve first attempt, MineDojo’s hunt cow environment was used.
import minedojo
env = minedojo.make(task_id="hunt_cow", image_size=(160, 256))
This default environment spawns the player in a plains biome, with a cow nearby. The player gets a sparse reward for killing the cow.
The Model
Based on existing solutions to Minecraft challenges, there seemed to be a consensus that PPO (proximal policy optimization) with a CNN as a function approximator is a robust choice. The first experiments were run using a custom-implemented PPO algorithm, with later tests using Stable Baselines implementations to eliminate the variable of an incorrectly implemented algorithm. Four-frame stacking was used for temporal awareness, and multi-instance parallel learning was used to speed up training when using Stable Baselines based models. A sketch of this setup is given below.
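As an illustration, here is a minimal sketch of the Stable Baselines side of this setup: parallel instances via SubprocVecEnv, four-frame stacking via VecFrameStack, and PPO with a CNN policy. The RGBOnly wrapper, the observation key, and the hyperparameters are illustrative assumptions rather than the exact code used in these experiments, and depending on library versions a gym/gymnasium compatibility shim may be needed.

import gym
import numpy as np
import minedojo
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv, VecFrameStack

class RGBOnly(gym.ObservationWrapper):
    # Expose only the RGB frame (HWC uint8) so SB3's CnnPolicy can be used.
    def __init__(self, env, height=160, width=256):
        super().__init__(env)
        self.observation_space = gym.spaces.Box(0, 255, (height, width, 3), np.uint8)

    def observation(self, obs):
        # Assumption: MineDojo returns a channels-first frame under obs["rgb"];
        # transpose to channels-last for Stable Baselines.
        return np.transpose(obs["rgb"], (1, 2, 0))

def make_hunt_cow_env():
    return RGBOnly(minedojo.make(task_id="hunt_cow", image_size=(160, 256)))

if __name__ == "__main__":
    venv = SubprocVecEnv([make_hunt_cow_env for _ in range(8)])  # parallel instances
    venv = VecFrameStack(venv, n_stack=4)                        # 4-frame stacking
    model = PPO("CnnPolicy", venv, verbose=1)                    # placeholder hyperparameters
    model.learn(total_timesteps=1_000_000)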
With the sparse reward environment, the model quite predictably failed to converge to any useful behaviour. Because the agent never received a reward early on, there was no signal to guide the policy, so the agent was essentially acting with no awareness of its task.
To fix this, some reward engineering was necessary. Based on example code, a custom environment was written which computes a dense reward from in-game signals as well as human knowledge about how the game is meant to be played.
Reducing the action space (removing inventory- and crafting-related actions and keeping only movement, camera movement, and attack) makes training more effective, since the model no longer has to learn to avoid unnecessary actions. Even this naive reduction of the action space let the model start making meaningful progress; a sketch of such a reduction is shown below.
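One way to implement this reduction, assuming MineDojo's 8-dimensional MultiDiscrete action vector (movement, strafe, jump, camera pitch, camera yaw, functional action, and two argument slots), is an ActionWrapper that exposes a handful of high-level choices and fills everything else with no-ops. The specific indices and bin values below are assumptions that should be checked against the MineDojo action-space documentation.

import gym
import numpy as np

class ReducedActions(gym.ActionWrapper):
    # Map a small Discrete action set onto the full MultiDiscrete action vector,
    # dropping inventory/crafting actions entirely. The index layout is an
    # assumption to verify against the MineDojo docs.
    ACTIONS = [
        {},            # 0: no-op
        {0: 1},        # 1: move forward
        {3: 11},       # 2: tilt camera up one bin
        {3: 13},       # 3: tilt camera down one bin
        {4: 11},       # 4: turn left one bin
        {4: 13},       # 5: turn right one bin
        {5: 3},        # 6: attack
    ]

    def __init__(self, env):
        super().__init__(env)
        self.action_space = gym.spaces.Discrete(len(self.ACTIONS))

    def action(self, act):
        full = np.zeros(self.env.action_space.shape, dtype=np.int64)
        full[3] = 12  # centre camera bins: value 12 assumed to mean "no camera movement"
        full[4] = 12
        for idx, val in self.ACTIONS[act].items():
            full[idx] = val
        return full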
The first iteration of the dense reward environment shapes the sparse kill reward with two in-game signals: a distance score that grows as entities get closer to the agent, and a reward for landing attacks that is larger when the entity being hit is a cow.
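A minimal sketch of that shaping wrapper is shown below. It assumes hypothetical helpers nearest_entity_distance(obs) and hit_entity(obs) standing in for however these signals are read out of the MineDojo observations, and the coefficients are placeholders rather than the values actually used.

import gym

class DenseHuntCowReward(gym.Wrapper):
    # Add dense shaping terms on top of the sparse kill reward.
    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.prev_dist = nearest_entity_distance(obs)  # hypothetical helper
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)

        # Reward closing the distance to the nearest entity (placeholder weight).
        dist = nearest_entity_distance(obs)
        reward += 0.1 * (self.prev_dist - dist)
        self.prev_dist = dist

        # Reward landing hits, with a larger bonus when the target is a cow.
        hit = hit_entity(obs)  # hypothetical helper: entity name or None
        if hit is not None:
            reward += 1.0 if hit == "cow" else 0.2

        return obs, reward, done, info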
After a day of training, this reward system converged to an interesting strategy: the agent would immediately look at the ground wherever it was and begin to dig a hole. The hole caused nearby entities to fall in, maximizing the distance score, and while the player stares at the ground it hits any entity standing in the same spot, earning more reward if that entity is a cow.
This strategy was sometimes effective, but different from the choices a human might make when playing the game. This demonstrates Minecraft's open-ended nature, and how we cannot necessarily expect humanlike behaviour. To make the agent behave more as expected, an additional reward signal was added to penalize looking straight up at the sky (which is all that some iterations of training would do) and to penalize looking down at the ground (which is only useful for the hole-digging strategy); a sketch of this penalty follows.
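In the shaping wrapper sketched earlier, this amounts to one extra term inside step(), written here with a hypothetical camera_pitch(obs) helper and placeholder thresholds and weights:

# Penalize extreme camera pitch (looking straight up or down).
pitch = camera_pitch(obs)  # hypothetical helper: degrees, 0 = level
if abs(pitch) > 45.0:
    reward -= 0.05 * (abs(pitch) - 45.0)  # placeholder penalty weight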
Unfortunately, the model trained on the augmented reward signal did not show signs of convergence before the project deadline cut training short. In an effort to boost training speed, the code was re-implemented to use stable-baselines3-based PPO with multiprocessing, using a custom wrapper around MineDojo.
This approach did not have sufficient training time to converge.
To improve performance, several approaches were taken:
Future work for this approach would be to further wrap the environment, giving an observation space consisting of the player coordinates (3 floats), a 9x9 grid of nearby voxel info (for terrain awareness), the current camera angle, and the coordinates of the nearest cow (a sketch of such a space is given below). Using an MLP, a model could be trained on these privileged observations, and its actions could then be transferred via imitation learning to a CNN-based model that takes the screen as input. Another approach that might allow the model to converge would be to bias it towards human-like actions through imitation learning on the MineRL dataset.
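As a rough illustration, such a privileged observation space might be declared as the Gym space below; the field names, bounds, and the 9x9 voxel encoding are assumptions for illustration only.

import numpy as np
from gym import spaces

# Hypothetical privileged observation space for an MLP policy.
privileged_obs_space = spaces.Dict({
    "player_pos":  spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),   # x, y, z
    "camera":      spaces.Box(low=np.array([-90.0, -180.0], dtype=np.float32),
                              high=np.array([90.0, 180.0], dtype=np.float32)),  # pitch, yaw
    "voxels":      spaces.Box(0, 4096, shape=(9, 9), dtype=np.int32),           # nearby block IDs
    "nearest_cow": spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),   # relative x, y, z
})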
Using vectorized environments from MineRL, which represent states and actions as vectors, we can more intelligently reduce the action space, so that the agent chooses from a set of actions which “make sense”.
This set of actions is defined using the human-generated data from the MineRL dataset, which was collected from human players over several months of running a Minecraft server. The dataset provides 2 GB of vectorized states and actions which, when clustered, yield a set of actions that make sense in the context of the given task (see the clustering sketch below). In this case, the task under study was chopping trees: the player spawns in a forest with an axe and must collect as much wood as possible.
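The clustering step, following the pattern used in the MineRL competition baselines, looks roughly like the sketch below. The environment name matches MineRL's obfuscated tree-chopping task, but the exact data-loading calls may differ between MineRL versions, and n_clusters is a placeholder.

import numpy as np
import minerl
from sklearn.cluster import KMeans

# Collect action vectors from the human demonstrations for the tree-chopping task.
data = minerl.data.make("MineRLTreechopVectorObf-v0", data_dir="data")
actions = []
for _, act, _, _, _ in data.batch_iter(batch_size=16, seq_len=32, num_epochs=1):
    actions.append(act["vector"].reshape(-1, act["vector"].shape[-1]))
actions = np.concatenate(actions)

# Cluster the action vectors; each centroid becomes one discrete action
# available to the agent.
kmeans = KMeans(n_clusters=32, random_state=0).fit(actions)
action_centroids = kmeans.cluster_centers_

# At rollout time the agent picks a cluster index i, and the environment is
# stepped with {"vector": action_centroids[i]}.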
In this task, our naive PPO agent failed to do anything, never receiving its sparse reward. On the other hand, an agent trained in the vectorized, k-means-clustered action space took actions which make sense in the context of the challenge, and performed better. With limited training time it also failed to converge, though it seems more likely to do so given more iterations.
An agent trained using imitation learning via behavioural cloning on the vectorized observation space performed even better; a sketch of this setup follows. Compared to an equivalent behavioural cloning agent trained on a non-vectorized environment, it is likely to be far faster to train. However, due to the limited time before submitting this assignment, these tests did not run to completion. What is known is that after less than an hour of training, the vectorized agent sees rewards similar to those the vanilla behavioural cloning agent reaches after several hours on the same hardware.
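A minimal sketch of this behavioural cloning step is shown below: a small PyTorch MLP maps the 64-dimensional vectorized observation to a distribution over the k-means clusters and is trained with cross-entropy against the cluster index of each human action. The network size, learning rate, and single-epoch loop are placeholders, and kmeans and data refer to the objects from the clustering sketch above.

import torch
import torch.nn as nn

# Small MLP policy over the k-means action clusters (sizes are placeholders).
policy = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, kmeans.n_clusters),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for obs, act, _, _, _ in data.batch_iter(batch_size=32, seq_len=32, num_epochs=1):
    # Flatten (batch, seq, dim) -> (batch * seq, dim) and label each human
    # action with the index of its nearest cluster centroid.
    obs_vec = torch.as_tensor(obs["vector"].reshape(-1, 64), dtype=torch.float32)
    labels = torch.as_tensor(
        kmeans.predict(act["vector"].reshape(-1, act["vector"].shape[-1]))
    ).long()

    loss = loss_fn(policy(obs_vec), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()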
It is possible that, due to its reduced action space, the vectorized agent may underperform the traditionally trained agent given a very long training time.
For this reason, the vectorized environment can be used to provide a quick expert baseline for new challenges, which can then be imitated by a model with access to the full action and observation spaces.
Most notable among Minecraft RL challenges are those hosted by MineRL, such as the MineRL BASALT challenge. These challenges rely on the MineRL dataset and environment, and have participants train models to perform tasks ranging from navigating simple terrain to obtaining diamonds. The most recent such challenge was MineRL BASALT in 2022. Over time, these challenges have grown increasingly complex, relying on advanced model architectures to produce results.
Earlier MineRL challenges are the focus of this paper, as they provide an avenue for exploring deep RL algorithms in a way that may be more engaging than traditional environments.
This paper leaves many of its learning objectives incomplete due to training time limitations. The extremely slow nature of Minecraft as an environment (training at around 30 fps with 8 concurrent environments) makes it challenging unless techniques are chosen carefully prior to experimentation. Because of this, a paper on deep RL algorithms in Atari games, written with Meilin Lyu, is also submitted for this assignment; it contains a more concrete comparison of algorithms, as well as a deeper analysis of techniques not seen in class.
One particularly interesting aspect of Minecraft as an environment is its customizability. Since it offers such a range of observations, actions, reward functions, and scenarios, it can be used to run ablation studies on which aspects of a game correlate with the performance of various RL approaches. Exploring the same challenge with dense vs. sparse rewards, RGB vs. voxel vs. vectorized observations, reduced vs. full vs. clustered action spaces, and more could yield useful information about the specific aspects of a game that make certain techniques more applicable.