
Project 5: Nim with Q-learning

  1. How the program should run
  2. The Q-learning phase
  3. The game-playing phase
  4. Hints
  5. Testing your program
  6. Sample output
  7. At the end of the project
  8. Grading
  9. Submission instructions
  10. Challenges

In this project, you will design a program to play the game of Nim optimally. Nim is a 2-player game that starts with three piles of objects. Players alternate removing any number of objects from a single pile (a player may not remove objects from multiple piles on a single turn). The player who is forced to take the last object loses (in some variants, the player who takes the last object wins, but in our version, they lose).

Your program will allow the user to specify the starting configuration of how many objects are in each of the three piles, then use Q-learning to figure out the optimal strategy for both players simultaneously. After the Q-learning finishes, the program should let the user play as many games of Nim against the computer as they want, choosing each time whether to go first or let the computer go first. There are guaranteed winning strategies for certain starting configurations (e.g., with piles of 3-4-5 the first player has a guaranteed win if they play perfectly), and the computer should always win in those cases (assuming Q-learning simulated enough games to learn the Q-values well enough).

How the program should run

  • Prompt the user for three integers: the starting number of objects in each of the three piles. This is your initial board configuration.
  • Prompt the user for an integer \(n\), the number of simulated games to run via Q-learning.
  • Run \(n\) simulated games of Nim from the initial board configuration to construct a table of (hopefully, close to optimal) Q-values.
  • Print out the table of Q-values that were learned.
  • Allow the user to play unlimited games of Nim from the initial board configuration until the user wants to stop. The user should be asked whether to take the role of Player A (first player) or Player B (second player). The computer should take the other role. The computer player should play with an optimal strategy according to the Q-values learned during the Q-learning simulated games.
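
Putting the steps above together, the overall program might be organized roughly like this (a sketch only; `q_learn`, `print_q_table`, and `play_game` are placeholder names, not required ones):

```python
def main():
    # Step 1: initial board configuration
    piles = tuple(int(input(f"Objects in pile {i}: ")) for i in range(3))
    # Step 2: number of simulated games
    n = int(input("Number of simulated Q-learning games: "))
    # Steps 3-4: learn and print the Q-table
    q_table = q_learn(piles, n)
    print_q_table(q_table)
    # Step 5: play against the user until they quit
    while True:
        role = input("Play as A (first), B (second), or Q to quit: ").strip().upper()
        if role == "Q":
            break
        play_game(piles, q_table, human_player=role)

if __name__ == "__main__":
    main()
```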

The Q-learning phase

Use \(\gamma = 0.9\) and \(\alpha=1\). The reason we use \(\alpha=1\) is because in a deterministic environment (one with no randomness), Q-learning learns quickest with \(\alpha=1\) and still guarantees convergence to the correct Q* values. In a stochastic environment (where taking an action from a state may lead probabilistically to one of many next states), such as the Ebola environment we examined in class, the Q-learning algorithm must use an \(\alpha\) less than 1 in order to guarantee convergence to the correct Q* values. The Wikipedia article on Q-learning has a nice discussion of this.

In our environment, we call the first player “A” and the second player “B.” There are only two rewards in the whole game. On any move that causes player A to win, there is a reward of 1000, and on any move that causes B to win, there is a reward of -1000.

Each action that Player A and Player B take in the simulated games should be chosen randomly. (In the real world, the actions are often chosen according to a policy that balances exploration and exploitation, but for simplicity, we will always explore.) Recall that we need two update equations:

When Player A moves from state \(s\): \(Q[s, a] \gets Q[s, a] + \alpha \left[ r + \gamma \min_{a'} Q[s', a'] - Q[s, a] \right]\)

When Player B moves from state \(s\): \(Q[s, a] \gets Q[s, a] + \alpha \left[ r + \gamma \max_{a'} Q[s', a'] - Q[s, a] \right]\)

Note that we are still maximizing or minimizing over the action \(a'\) taken from state \(s'\); only the action \(a\) from state \(s\) is chosen at random. (If we pick \(a'\) randomly instead of maximizing or minimizing, we actually get a different reinforcement learning algorithm called SARSA.)
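
To make this concrete, here is a minimal sketch of the Q-learning phase under a few assumptions of ours (not requirements): Q-values live in a plain dictionary keyed by (player, piles, action) tuples, actions are (pile_index, count) pairs, and the ±1000 reward is attached to the final move of each simulated game. The helper names `legal_actions`, `q_update`, and `simulate_one_game` are just illustrative.

```python
import random

GAMMA, ALPHA = 0.9, 1.0

def legal_actions(piles):
    """All (pile_index, count) moves available from this board."""
    return [(i, k) for i, p in enumerate(piles) for k in range(1, p + 1)]

def q_update(q, player, piles, action, reward, next_piles, game_over):
    """One Q-learning update for `action` taken by `player` from `piles`."""
    key = (player, piles, action)
    if game_over:
        target = reward                      # terminal move: no successor state
    else:
        next_player = "B" if player == "A" else "A"
        successors = [q.get((next_player, next_piles, a), 0.0)
                      for a in legal_actions(next_piles)]
        # B minimizes from the state reached after A's move, and vice versa
        best = min(successors) if next_player == "B" else max(successors)
        target = reward + GAMMA * best
    q[key] = q.get(key, 0.0) + ALPHA * (target - q.get(key, 0.0))

def simulate_one_game(q, start_piles):
    """Play one fully random game, updating Q-values after every move."""
    piles, player = tuple(start_piles), "A"
    while sum(piles) > 0:
        i, k = action = random.choice(legal_actions(piles))   # always explore
        next_piles = tuple(p - k if j == i else p for j, p in enumerate(piles))
        game_over = sum(next_piles) == 0
        # whoever takes the last object loses, so the *other* player wins:
        # +1000 if that winner is A, -1000 if it is B
        reward = (1000 if player == "B" else -1000) if game_over else 0
        q_update(q, player, piles, action, reward, next_piles, game_over)
        piles, player = next_piles, ("B" if player == "A" else "A")
```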

You can use the printed table of Q-values to debug your program. One useful check: once the values have converged to the optimal Q-values, if Player A has a guaranteed winning strategy from some state, the table should contain at least one state-action combination with a positive Q-value for that state. If there is no such strategy (i.e., Player A always loses when Player B plays perfectly), all of the Q-values for that state will be negative.
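
For instance, with the tuple-keyed dictionary sketched above, a quick check like the following (a hypothetical helper, not something you are required to write) tells you whether the learned table thinks Player A can force a win from a given state:

```python
def a_has_winning_move(q, piles):
    """True if, with A to move from `piles`, at least one action has a
    positive learned Q-value."""
    return any(value > 0
               for (player, state, _), value in q.items()
               if player == "A" and state == tuple(piles))
```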

The game-playing phase

After all \(n\) simulated games have been played, the user is given a chance to play against the computer either as Player A or Player B. Here, the computer should not move randomly; it should use the optimal policy derived from the table of Q-values:

For Player A: \(\pi(s) = \arg \max_{a} Q[s, a]\)

For Player B: \(\pi(s) = \arg \min_{a} Q[s, a]\)
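
Concretely, with the (player, piles, action) keys and `legal_actions` helper from the earlier sketch, move selection might look like this (illustrative only):

```python
def computer_move(q, player, piles):
    """Greedy move from the learned table: argmax for Player A, argmin for B."""
    actions = legal_actions(piles)
    value = lambda a: q.get((player, piles, a), 0.0)
    return max(actions, key=value) if player == "A" else min(actions, key=value)
```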

Hints

This project has a few quirks. The biggest hurdle is changing the Q-learning update equation to handle the fact that we have two players that are playing against each other, which we have done above. Make sure you understand why we make those changes.

You can store the table of Q-values using a map, dictionary, or hash table. To do this, you will need to store each state-action pair as a single item that can be mapped to the Q-value for that pair. This can be done in many ways, but one straightforward idea is to use a 6-character string for a state-action pair, as follows.
The first character is either “A” or “B” for the player whose turn it is. The next three characters are the numbers of the objects left in each pile (we’ll assume there will never be more than nine objects in a pile). The last two characters represent the action taken: the number of the pile (0, 1, or 2) followed by the number of objects to remove from that pile.

For example, if we start with piles of 3, 4, and 5 objects, and the first player (A) chooses to remove 1 object from pile zero (the first pile), that would be the state-action combination “A34501.” Our board now has 2, 4, and 5 objects left, and now it’s B’s turn. If B decides to remove 2 objects from the second pile (pile 1), this would be represented by the string “B24512.”

This representation of state-action pairs allows us to store the table of Q-values easily as a mapping from strings to doubles.
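
If you go with the string representation, the encoding and its inverse might look like this (a sketch; `encode` and `decode` are just illustrative names, and piles are assumed to hold at most nine objects):

```python
def encode(player, piles, action):
    """e.g. encode("A", (3, 4, 5), (0, 1)) -> "A34501"."""
    pile_index, count = action
    return f"{player}{piles[0]}{piles[1]}{piles[2]}{pile_index}{count}"

def decode(key):
    """Inverse of encode(): "B24512" -> ("B", (2, 4, 5), (1, 2))."""
    player = key[0]
    piles = tuple(int(c) for c in key[1:4])
    action = (int(key[4]), int(key[5]))
    return player, piles, action
```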

An alternate representation is a two-level map (a map of maps), where the first level maps a state like "A345" to a map containing actions and their corresponding Q-values. This makes the bookkeeping a little more difficult, but it makes it easier to find, for example, which actions are available from a given state.
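
A tiny sketch of that nested layout (illustrative only):

```python
# state string -> {action string: Q-value}, e.g. q["A345"]["01"] is the
# Q-value for the state-action pair "A34501"
q = {}
q.setdefault("A345", {})["01"] = 0.0

def actions_from(q, state):
    """All actions (with their Q-values) recorded for a given state."""
    return q.get(state, {})
```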

Testing your program

Under the initial board 3-4-5, the computer should always win if it goes first. Similarly for 0-1-2, 1-2-4 and 2-3-4. Under 2-4-6, the computer should always win if it goes second. Similarly for 1-2-3.

Sample output

Here are some different tables of Q-values and sample output for playing the games:

The 0-1-2 game

In this game, Player A is guaranteed to win if they play perfectly. Notice that among the state-action pairs for the opening move (A012xx), there is only one positive Q-value. This indicates that Player A must make that specific move to guarantee a win; otherwise it opens the door for Player B to possibly win.

The 1-2-3 game

Player B is guaranteed to win here (with perfect play), no matter what Player A does on the opening move. You can deduce this because all of the opening moves for Player A (A123xx) have negative Q-values. This is why the computer will always win if allowed to go second with this board configuration.

You should make sure your program still works for larger board sizes as specified under “testing your program” above.

At the end of the project

  • As you are preparing to submit the project, please prepare a file (.txt, PDF, or Word doc is fine) answering the following questions:
    1. What bugs and conceptual difficulties did you encounter? How did you overcome them? What did you learn?
    2. Describe whatever help (if any) that you received. Don’t include readings, lectures, and exercises, but do include any help from other sources, such as websites or people (including classmates and friends) and attribute them by name.
    3. Describe any serious problems you encountered while writing the program.
    4. Mention any challenges that you did.
    5. List any other feedback you have. Feel free to provide any feedback on how much you learned from doing the assignment, and whether you enjoyed doing it.
  • Please also add a comment at the top of your program stating your name and a pledge that you have followed the honor code and collaboration policy for this project. This can be as simple as writing “I have neither given nor received unauthorized aid on this program.” You can find the collaboration policy on the syllabus.

Grading

  • You will be graded on correctness of your program, including the formatting (this makes it easier to grade).
  • Your project will also be graded on the appropriateness and efficiency of the algorithms you choose, and on coding style.

Submission instructions

Through Canvas, upload all your source code files and your file with the answers to the questions above.

Challenges

  • Q-learning is often a slow algorithm, so one thing we can do to speed it up is to use the same (state, action) pair in the Q-table to represent a set of related states and actions. For instance, if when playing Nim starting from 3-4-5 you encounter the state 1-2-1, it is equivalent to the states 1-1-2 and 2-1-1 up to a reordering of the piles. You can change the Q-learning algorithm to “merge” these states together (for instance, by always using the representation of a state with the pile sizes in ascending order). Try making your code take advantage of this; a sketch of one possible canonicalization appears after this list. Note that you will have to switch the actions as well; for instance, changing the (state, action) pair (A121, 01) to (A112, 11).

  • Use 2-player Q-learning to solve a larger game such as Connect Four. You already have a lot of the starter code for this from the minimax project, and solving a scaled-down version such as 3x3x3 should be doable. How big a board can you get to before training starts taking too long?
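
For the state-merging challenge above, the canonicalization might look like the sketch below. Note that this version maps a move onto the first equal-sized pile in the sorted state, so the canonical action it produces can differ from the (A112, 11) example in the bullet, but any consistent convention works. Remember to canonicalize the successor state as well when computing the min/max in the update.

```python
def canonicalize(piles, action):
    """Return an equivalent (piles, action) with the pile sizes sorted ascending.
    e.g. ((1, 2, 1), (0, 1)) -> ((1, 1, 2), (0, 1)): remove 1 object from a
    pile of size 1."""
    pile_index, count = action
    order = sorted(range(len(piles)), key=lambda i: piles[i])  # sorted slot -> original index
    sorted_piles = tuple(piles[i] for i in order)
    new_index = order.index(pile_index)                        # where that pile landed
    # among equal-sized piles, always point at the first one so that
    # equivalent moves share a single Q-table entry
    new_index = sorted_piles.index(sorted_piles[new_index])
    return sorted_piles, (new_index, count)
```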