
2025/02/23
Key Elements of MDP Reinforcement Learning
Games offer more than just simple entertainment; they provide an engaging experience where players explore worlds they have created. However, many games follow predetermined paths, which often means that players' explorations do not lead to true creation. I aim to overcome this limitation by developing dynamic game worlds where players and characters create together.
Exploration in Game Worlds
The world that players experience in games is like a living organism. Each player's choices and actions change the world and create new stories. Games like the Pokémon series and The Elder Scrolls have demonstrated the joy of such exploration. However, there is always a sense of something missing.
Exploring New Possibilities
Games like Dying Light 2 have marketed themselves by emphasizing player exploration. However, I want to take it a step further. I envision a game world created collaboratively by players and characters. While current technology may not fully realize this goal, this series aims to explore its potential and marketability.
Markov Decision Process (MDP)
Today, as the first topic in this series, I will introduce the Markov Decision Process (MDP), which forms the foundation of game AI. For those encountering it for the first time, it may seem challenging, but I will explain it based on my understanding.
Key Elements of MDP
An MDP is a framework for mathematically modeling decision-making problems, and it is primarily used to find optimal actions in an environment. Its most important elements are the state, the action, and the reward.
- State: Represents the current situation, indicating the environment in which the character is located within the game.
- Action: Refers to the various options a character can choose.
- Reward: The result obtained when a specific action is taken, playing a crucial role in the character's learning and growth.
There is also a fourth ingredient: the probability of transitioning to the next state. For example, if a character is standing right against a wall, the probability of moving through that wall is 0. These probabilities depend on the game's rules and environment. Ultimately, the goal is to represent the three elements of state, action, and reward in code, and the exact implementation will vary with the game's characteristics and objectives; a minimal sketch follows below.
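As a rough illustration, here is a minimal Python sketch of these elements for a tiny, made-up corridor world. The state names, transition probabilities, and reward values are all invented for this example and are not taken from any real game.

```python
# A minimal sketch of the core MDP elements (state, action, reward,
# transition probabilities) for a tiny made-up corridor world.
# All names and numbers here are illustrative only.

states = ["start", "middle", "goal"]
actions = ["forward", "backward"]

# transitions[state][action] -> list of (probability, next_state).
# Moving "backward" from "start" keeps the character at "start",
# which plays the role of "the probability of walking through a wall is 0".
transitions = {
    "start":  {"forward": [(1.0, "middle")], "backward": [(1.0, "start")]},
    "middle": {"forward": [(0.9, "goal"), (0.1, "middle")],
               "backward": [(1.0, "start")]},
    "goal":   {"forward": [(1.0, "goal")], "backward": [(1.0, "goal")]},
}

# rewards[state][action] -> immediate reward for taking that action there.
rewards = {
    "start":  {"forward": 0.0, "backward": 0.0},
    "middle": {"forward": 1.0, "backward": 0.0},
    "goal":   {"forward": 0.0, "backward": 0.0},
}
```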
These elements are usually already defined by the game itself.
In a racing game, the actions are the ways the car can change its direction and speed.
The states include being near the starting line, flipped over in the middle of the track, or approaching the finish line.
The reward is highest when the car reaches the finish line, so the car is controlled to get ever closer to it. A hypothetical mapping of these elements is sketched below.
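To make the mapping concrete, here is a hypothetical sketch of how a racing game's states, actions, and reward might look in code. The enum members and reward numbers are invented for illustration, not taken from any actual racing game.

```python
from enum import Enum, auto

# Hypothetical racing-game mapping of the MDP elements above.
# The state and action names and the reward numbers are invented.

class CarState(Enum):
    NEAR_START = auto()
    OVERTURNED = auto()
    NEAR_FINISH = auto()
    FINISHED = auto()

class CarAction(Enum):
    ACCELERATE = auto()
    BRAKE = auto()
    STEER_LEFT = auto()
    STEER_RIGHT = auto()

def reward(state: CarState) -> float:
    """Immediate reward: highest at the finish line, negative when flipped over."""
    if state == CarState.FINISHED:
        return 100.0
    if state == CarState.NEAR_FINISH:
        return 10.0
    if state == CarState.OVERTURNED:
        return -10.0
    return 0.0

print(reward(CarState.FINISHED))  # 100.0
```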
Now let's look at the variables, or elements, that we teach the computer to learn.
- Policy: The rule for selecting an action in a given state. For example, if the finish line is right in front of you, moving toward it is a good action, while moving left or backward is worse. The policy determines which action the agent takes, and in general a good policy selects actions that lead to high rewards.
- Value (V): Evaluates how good a specific state is, meaning the expected long-term reward when starting from that state and following the optimal policy.
- Q-value (Q): Evaluates how good it is to take a specific action in a specific state, meaning the expected long-term reward when taking that action and then following the optimal policy.
These three elements can be implemented differently depending on the methodology. We will revisit these methodologies later.
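As a preview, here is a rough sketch of one simple method, value iteration, for computing V, Q, and a greedy policy on the tiny corridor MDP defined earlier. The discount factor and iteration count are arbitrary choices made for illustration.

```python
# A rough sketch of value iteration on the tiny corridor MDP from the
# earlier snippet. The MDP is repeated here so the example runs on its own;
# all numbers are illustrative.

GAMMA = 0.9  # discount factor: how much future reward is worth today

states = ["start", "middle", "goal"]
actions = ["forward", "backward"]
transitions = {
    "start":  {"forward": [(1.0, "middle")], "backward": [(1.0, "start")]},
    "middle": {"forward": [(0.9, "goal"), (0.1, "middle")],
               "backward": [(1.0, "start")]},
    "goal":   {"forward": [(1.0, "goal")], "backward": [(1.0, "goal")]},
}
rewards = {
    "start":  {"forward": 0.0, "backward": 0.0},
    "middle": {"forward": 1.0, "backward": 0.0},
    "goal":   {"forward": 0.0, "backward": 0.0},
}

def q_value(V, s, a):
    """Q(s, a): immediate reward plus the discounted value of likely next states."""
    return rewards[s][a] + GAMMA * sum(p * V[s2] for p, s2 in transitions[s][a])

# Value iteration: repeatedly back up the best Q-value into V(s).
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: max(q_value(V, s, a) for a in actions) for s in states}

# The greedy policy simply picks the action with the highest Q-value in each state.
policy = {s: max(actions, key=lambda a: q_value(V, s, a)) for s in states}
print(V)       # "middle" ends up more valuable than "start"
print(policy)  # "forward" everywhere in this toy world
```

Different reinforcement learning methods differ mainly in how they estimate V and Q when the transition probabilities are not known in advance, which is where we will pick up later.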
Conclusion
These are the core concepts you meet in the early stages of reinforcement learning. Many implementation methods exist, and they open up possibilities limited only by our imagination. With these concepts, we can create more creative and immersive game worlds.