Counterfactual Regret Minimization in Poker

The phrase "Counterfactual Regret Minimization" may sound overly sophisticated to insert into a discussion of poker strategy. (Indeed, it probably is.) But luckily, the concept to which it refers can be explained more simply, and it may prove quite useful both to those struggling to play poker profitably and to already successful players.

Counterfactual Regret Minimization, or "CFR," refers to an algorithm used to train computer programs to play near-perfect poker. It bears a striking resemblance to the heuristic strategies (that is, those based on experience) that human players employ when they try to learn poker by playing.

When winning poker pros are asked how they have managed to be consistently successful, their answers usually come in two varieties, referring to two different learning systems: heuristic and analytical.

Those who jump right in at the poker table and learn the game by trial and error are adopting the heuristic approach. Meanwhile those who prefer first to analyze the game's mechanics and strictly apply its mathematical principles at the tables, regardless of what their "intuition" might whisper into their ears, are taking the analytical approach.

Obviously, most highly accomplished players will employ both learning methods. That said, most of the poker pros — especially the old schoolers and the live grinders — prefer to learn heuristically by playing a high volume and thus developing an intuitively correct strategy, one which many of them cannot even explain in their own words.

When it comes to experience-based learning, computers have a huge upper hand over humans. They can acquire in a matter of minutes the experience a human player will not manage to accumulate in a lifetime. This exact type of brute force was employed by researchers at the University of Alberta in Canada in 2007, in pursuit of a single goal: to teach an artificial intelligence (AI) to crush at poker, running on a supercomputer able to perform quadrillions of calculations per second.

The machine had a theoretically simple task to perform: play poker against itself until it mastered the game. The start was utterly dull, as the AI began by randomly "clicking buttons": checking, folding, calling, betting, or raising, and working through the whole range of possible bet sizes. In other words, it executed every possible action at any given moment in a hand.

The task was to sift through these options with the sole goal of extracting the most profit from the opponent. After each play, the AI would store the result, compare it with the past outcomes of the other decisions available in the same situation, and evaluate the best move in that given context.
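To make that bookkeeping concrete, here is a minimal sketch in Python of a program that records the result of every action tried in every situation and then looks up whichever action has averaged the best outcome. The situation label and payoffs are invented for the example; the real program tracked vastly more detail.

```python
from collections import defaultdict

class OutcomeTable:
    """Record the outcome of every action tried in every situation."""

    def __init__(self):
        # (situation, action) -> [total_profit, times_tried]
        self.stats = defaultdict(lambda: [0.0, 0])

    def record(self, situation, action, profit):
        entry = self.stats[(situation, action)]
        entry[0] += profit
        entry[1] += 1

    def best_action(self, situation, actions):
        """Return the action with the best average result so far."""
        def average(action):
            total, count = self.stats[(situation, action)]
            return total / count if count else 0.0
        return max(actions, key=average)


table = OutcomeTable()
spot = "flush draw, facing a pot-sized bet"       # invented situation label
table.record(spot, "call", -40)                   # missed the draw, lost the call
table.record(spot, "fold", 0)                     # folded, lost nothing further
table.record(spot, "call", 120)                   # hit the draw, won a big pot
print(table.best_action(spot, ["fold", "call"]))  # -> "call" on this tiny sample
```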

The specific situations were grouped into clusters, an "abstraction" that treats similar spots as a single one, so that the algorithm could converge on an accurate strategy without having to handle every possible situation separately. This technique is similar to the (frankly rare) real-life situation of a human player learning poker without a teacher or any instructional materials or information about the game. Such a player would simply sit at the table and blindly try to choose among options based solely on the financial outcome, learning the game by trial and error (and with massive investments of money and time).
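As a rough illustration of this kind of grouping, the sketch below collapses many slightly different spots into a handful of buckets based on a hand-strength estimate. The strength scores and bucket boundaries are invented for the example; real abstractions are far more elaborate.

```python
def bucket(hand_strength: float) -> str:
    """Map a hand-strength estimate in [0, 1] to a coarse bucket label."""
    if hand_strength < 0.25:
        return "trash"
    if hand_strength < 0.50:
        return "marginal"
    if hand_strength < 0.80:
        return "strong"
    return "monster"

# Thousands of slightly different spots land in the same bucket, so experience
# gathered in any one of them transfers to all the others.
print(bucket(0.12), bucket(0.47), bucket(0.93))   # trash marginal monster
```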

But unlike a human player, a computer can play serious poker with nothing at stake, and billions of hands of it, too. The researchers let the supercomputer run at huge speed through innumerable situations, until the two virtual opponents learned to defend themselves perfectly and broke even over a huge sample size.

They reached the "Nash equilibrium" or that sweet spot theorized by mathematician John Nash at which both players are behaving unexploitably — i.e. "perfect" — by employing a "Game Theory Optimal" strategy. (That's right — the famous "GTO" many poker players love to mention as a casual brag while having a beer with their peers.)

Without further ado, let's look a little more closely at the phrase "Counterfactual Regret Minimization," the name of the algorithm used to teach the computer to play poker.

"Counterfactual" is a conditional which expresses the potential outcome of something that did not happen. For example: "If I hadn't dropped Computer Science I would've become a wealthy developer." Or, to give an example from poker: "If I hadn't called the raise on the river from that elderly lady on my left with my top two pair, I would still be in the damn tournament."

"Regret Minimization" refers to the strategy used by the computer when it follows the directive only to consider decisions that in the past have caused the lowest dose of regret. Put in financial terms, such a directive might be stated "in my next decision, I should especially weigh the past decisions that had averaged the highest profits." This directive is even referred to as "positive regret" in scientific terms.

As a caveat, it is worth noting that a wrong decision could accidentally yield a positive result, whereas a correct move could similarly create a negative result. As poker players, you know all about that — the right move doesn't always win, and the wrong move doesn't always lose.

But this truth about individual hands doesn't apply when it comes to big data, either throughout an entire poker career or during those billions of simulations run by the supercomputer. After enough time, the wrong approach will lose money and the correct one will at least break even against an equal opponent.

Now, how does all of this apply to what we humans experience at the tables, and to how poker players learn to make better decisions and become better players?

The feeling of regret is probably one of the most frequent states of mind a poker player will experience. When properly punished, the errors in our game cause immediate financial damage in the form of lost pots and/or eliminations from tournaments. "I couldn't get that hand out of my mind for days," we often hear.

A punished error is immediately followed by regret, whether it be a reasoning error (ignoring a portion of the villain's range or adding unrealistic combos into that range), a psychological glitch (tilt), or simply running out of time (mostly after the introduction of the shot clock).

We instinctively imagine reeling it all back and instead proceeding correctly, but nothing can be done anymore. The regrets are, as they say, futile.

But that's not true in the case of an AI mind. There, the regrets are not only not futile, but they are the very basis of the learning process. And this can be quite a lesson for us human players. After a bad decision has ended in an elimination or a huge lost pot, it's better not to let the regret pass without use, but to take note of the error and make sure we won't commit it again in a similar situation.

Probably the most common example is the tilt caused by getting knocked back to "square one" (or worse) in tournaments. Being left with 20 blinds or so after losing a huge pot during the early stage of an event looks like the end of our journey. We recall how solid our 133-BB stack used to be, how hard it had been for us to build it, and how foolishly we just lost it. The next step (often, and unfortunately) is to loosen up our game, wishing for a coinflip — even as an underdog — or anything to get our chips back.

Obviously, this type of approach ends with a frustrating elimination most of the time. And the regret strikes right as we stand up to leave the table. This is the very moment we have to put our regret to good use, the best time for learning. It's the moment we should imprint in our minds that we will restrain our frustration and play wisely with "only" 20 big blinds in a future similar situation.

But the best use of regret is to foresee it before the situation arises. The AI's CFR algorithm can help us especially right before an important decision, when we feel we are on the verge of doing something stupid. That's the moment we should ask ourselves: "Fold, call, or raise all in: which of these options will most likely cause me to regret my decision?" After asking ourselves this question, we then eliminate the options we intuitively feel will spur this annoying feeling of regret.

To sum up such advice in a single phrase — listen to your intuition, especially when it gives desperate signals that you are about to do something foolish that you'll later regret.


Iulian Doroftei
