How an ace-hole AI bot built by Facebook, CMU boffins whipped a table of human poker pros
This code doesn't need coolers to win: Two-core machine used to outplay world-beating elites
Analysis Artificially intelligent software can comfortably outmatch human poker pros and amateurs in one-on-one matches, that much is known.
Now, for the first time, an AI bot has been built that can beat human professionals on six-player no-limit Hold’em tables, no less, and it has been described in an academic paper published on Thursday in Science.
The cyber-shark, dubbed Pluribus, learned how to play the popular card game by playing trillions of games against itself repeatedly over eight days. When it played 10,000 hands against five elite poker pro humans – including World Series of Poker tournament champions Chris Ferguson and Michael Gagliano – it won decisively.
If the AI bot had been playing for real money rather than play chips, it would have reaped an average of $1,000 per hour playing six-player no-limit Hold’em against the pros, according to Noam Brown, first author of the paper and a research scientist at Facebook AI Research. The software, from what we can tell, played $50/$100 no-limit Hold’em cash games with $10,000 buy-ins.
At the heart of Pluribus is a self-play algorithm known as counterfactual regret minimization (CFR), which is used by other poker bots as well.
The software essentially plays against multiple copies of itself to gradually improve its skill. A table of virtual players is created and given randomized strategies. For every iteration of the algorithm, one player is chosen to be the so-called traverser.
After each simulated hand between the gang in an iteration, the code reviews how well the traverser played, and whether it could have done any better against its virtual opponents, given their known individual strategies. The algorithm calculates the traverser's counterfactual regret, or in other words, how much the traverser regretted not making a move that would have turned out to be beneficial. At the end of the iteration, this counterfactual regret is used to update the traverser's strategy so that it is more likely in future to take an action that it previously regretted not taking. And then it's on to the next iteration and another traverser is picked.
It's quite smart, because it means the software learns the hard way from when it, say, should have called when it had a good hand, or raised when it needed to force out other players, or folded when the price of calling wasn't worth the risk.
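To make the regret idea concrete, here is a minimal sketch of regret matching, the update rule that sits at the core of CFR. It is a simplification and not Pluribus's actual code: the three actions and the chip values are made up, and a real implementation tracks regrets at every decision point in the game rather than at one.

```python
def regret_matching(regrets):
    """Turn cumulative regrets into a strategy: play each action in
    proportion to its positive regret, or uniformly if none is positive."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)


def update_regrets(regrets, action_values, strategy):
    """Add each action's regret: how much better that action would have
    done than the expected value of the strategy actually played."""
    expected = sum(p * v for p, v in zip(strategy, action_values))
    return [r + (v - expected) for r, v in zip(regrets, action_values)]


# Hypothetical decision point with three actions: fold, call, raise.
regrets = [0.0, 0.0, 0.0]
strategy = regret_matching(regrets)  # no regrets yet, so play uniformly

# Made-up value of each action in one simulated hand, in chips.
action_values = [0.0, 150.0, -300.0]

regrets = update_regrets(regrets, action_values, strategy)
new_strategy = regret_matching(regrets)  # now leans towards calling
```

In this toy hand the traverser regrets not calling the most, so the updated strategy calls far more often, folds occasionally, and stops raising.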
At the tables
After Pluribus was trained as described, it took on the human poker experts. The software refined its decisions during play by monitoring how its flesh-and-blood opponents played. It would consider four strategies during the game: one where it stuck to a pre-computed strategy called the "blueprint," one where it leaned towards aggressively raising more often than not, one where it played like a nit and folded more often, and one where it tended to be a calling station.
When it looked at which cards had been dealt on the table, it ignored the two cards it was privately holding – its hole cards – and instead ran through all the possible combinations of cards it could be holding, given the community cards on the table, and determined which action it would take for each of them.
Crucially, it was programmed to balance its actions so that it wasn't always giving away the strength or weakness of its hand (always raising with aces, or always folding anything below three of a kind, for instance).
Then it looked at its actual hole cards and went with the action it had assigned to that combination. This ensured it bluffed, was aggressive, was trapping, and so on, in a balanced way that wasn't obvious or predictable. This approach seemed to work well whether Pluribus faced five human players or sat at a table with four other bots and a human. When it played against four other bots and a human for over 5,000 hands, it still won convincingly.
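The decide-for-every-hand-first approach can be sketched roughly as follows. Everything here is hypothetical: the three holdings, the probabilities, and the function names are invented for illustration, and the real bot works over every possible pair of hole cards.

```python
import random

# Hypothetical strategy over three possible holdings; the probabilities
# are invented for illustration and are not taken from Pluribus.
RANGE_STRATEGY = {
    "AA":  {"raise": 0.8, "call": 0.2, "fold": 0.0},  # mostly raise, sometimes trap
    "KQs": {"raise": 0.3, "call": 0.6, "fold": 0.1},
    "72o": {"raise": 0.1, "call": 0.0, "fold": 0.9},  # occasional bluff
}


def raise_frequency(range_strategy):
    """Average raise probability, assuming every holding is equally likely."""
    return sum(s["raise"] for s in range_strategy.values()) / len(range_strategy)


def choose_action(hole_cards, range_strategy, rng):
    """Only now look at the real hand and sample its pre-assigned action."""
    probs = range_strategy[hole_cards]
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


rng = random.Random(0)  # seeded so the sketch is repeatable
action = choose_action("72o", RANGE_STRATEGY, rng)
overall_raise = raise_frequency(RANGE_STRATEGY)
```

Because the strategy is fixed across the whole range before the real cards are consulted, a raise here does not tell an opponent whether it came from aces or from a bluff.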
“It was incredibly fascinating getting to play against the poker bot and seeing some of the strategies it chose,” said Michael Gagliano, who won a World Series of Poker bracelet in 2016. “There were several plays that humans simply are not making at all, especially relating to its bet sizing.”
Chris "Jesus" Ferguson, a known tight multi-bracelet player, added: "Pluribus is a very hard opponent to play against. It’s really hard to pin him down on any kind of hand. He’s also very good at making thin value bets on the river. He’s very good at extracting value out of his good hands."
Now, let's get a couple of things out of the way: one, yes, this wasn't for real money, which means the humans had nothing to lose and thus may have played differently, although $10,000 is not a lot to these guys anyway, real or not. Two, while the humans are head and shoulders above the vast majority of poker players, and have won millions upon millions of dollars, they are not the very, very best in the world. We're thinking Phil Ivey, Daniel Negreanu, Fedor Holz, Erik Seidel, Justin Bonomo, and so on. By that we mean, this software has not defeated the entirety of humanity.
With that aside, this is pretty cool tech: it can see off fierce professionals.
No GPUs needed
Pluribus was trained on a server with 64 CPU cores and ran for a total of 12,400 CPU core hours over eight days. It required less than 512GB of memory. Its masterminds at Facebook and Carnegie Mellon University (CMU) reckoned if they had rented computing resources via public cloud instances it would have cost under $150 to train.
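As a back-of-the-envelope check on those figures, the sums below use only the numbers quoted above. The variable names are ours, and the per-core-hour rate is simply what the stated cost ceiling implies, not a price quoted by any cloud provider.

```python
cores = 64
days = 8

# Core hours if all 64 cores ran flat out for exactly eight days.
full_utilization_core_hours = cores * 24 * days  # 12,288

reported_core_hours = 12_400  # figure quoted for training
cost_ceiling_usd = 150        # "under $150"

# Upper bound on the rental price per core hour implied by those two numbers.
implied_rate_usd = cost_ceiling_usd / reported_core_hours
print(f"{full_utilization_core_hours} core hours at full tilt")
print(f"under ${implied_rate_usd:.4f} per core hour")
```

The reported 12,400 core hours sits a shade above what 64 cores can do in exactly eight days, so "eight days" is presumably rounded, and the cost ceiling works out to a little over a cent per core hour.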
After training, Pluribus ran on a system with two CPUs and less than 128GB of memory when it played against its human opponents. It typically took anywhere from one to 33 seconds to perform a search operation for each action in a game.
“Some experts in the field have worried that future AI research will be dominated by large teams with access to millions of dollars in computing resources. We believe Pluribus is powerful evidence that novel approaches that require only modest resources can drive cutting-edge AI research,” Brown said.
Reducing the complexity in Poker
Pluribus is not too different from its predecessors DeepStack and Libratus. It still uses the CFR algorithm, but no longer relies on calculating the Nash equilibrium, a solution proposed in game theory that finds the optimum stable strategy, where there is no incentive to deviate from the equilibrium if the other opponents don’t either. For example, the Nash equilibrium in a game of rock-paper-scissors is to randomly pick between the three options assuming that that’s what your opponent is doing too.
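The rock-paper-scissors case is small enough to check by hand. The sketch below, with names of our own choosing, scores a win as +1, a loss as -1, and a tie as 0, and shows that the uniform random strategy breaks even against anything, while a predictable strategy can be exploited.

```python
# Payoff to the row player; rows and columns are rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],   # rock ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper beats rock, ties paper, loses to scissors
    [-1, 1, 0],   # scissors loses to rock, beats paper, ties scissors
]


def expected_payoff(mine, theirs):
    """Expected payoff when both players pick moves from the given mixes."""
    return sum(
        mine[i] * theirs[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )


uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]
always_paper = [0.0, 1.0, 0.0]

# The uniform mix breaks even whatever the opponent does.
even_vs_rock = expected_payoff(uniform, always_rock)
even_vs_paper = expected_payoff(uniform, always_paper)

# A predictable player loses every round to the right counter.
rock_vs_paper = expected_payoff(always_rock, always_paper)
```

Since no opponent can do better than break even against the uniform mix, neither side gains by deviating from it, which is what makes it an equilibrium.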
Computing the Nash equilibrium is fine if there’s only one other person to play against. But as soon as the game contains three or more players, it becomes too much for a computer to handle. Instead, Pluribus combines the poker knowledge it gained from self-play with a search algorithm that needs to consider only a few moves ahead rather than the whole game.
The search process is further simplified to reduce the complexity. Not every action needs to be considered, and similar decision points in the game are bucketed together and treated as identical. The researchers describe this as abstraction, and Pluribus uses it when considering what actions it should take and what information is available.
“Action abstraction reduces the number of different actions the AI needs to consider. No-limit Texas hold’em normally allows any whole-dollar bet between $100 and $10,000. However, in practice there is little difference between betting $200 and betting $201. To reduce the complexity of forming a strategy, Pluribus only considers a few different bet sizes at any given decision point,” they wrote in the paper.
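A rough illustration of action abstraction: snap any bet to the nearest of a handful of permitted sizes. The pot fractions and the function name below are hypothetical, picked only to show the idea, and are not the sizes Pluribus actually uses.

```python
def abstract_bet(bet, pot, fractions=(0.5, 1.0, 2.0)):
    """Map a bet to the closest allowed size.

    The allowed sizes are expressed as fractions of the pot; these
    particular fractions are assumptions made for this sketch.
    """
    allowed = [round(pot * f) for f in fractions]
    return min(allowed, key=lambda size: abs(size - bet))


pot = 400
same_a = abstract_bet(200, pot)  # snaps to 200
same_b = abstract_bet(201, pot)  # also snaps to 200
bigger = abstract_bet(700, pot)  # snaps to 800
```

Bets of $200 and $201 land on the same abstract action, so the bot needs only one strategy for both.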
When it comes down to information abstraction, although a ten-high straight and a nine-high straight are different sets of cards, they result in similar strategies. The AI bot will group these together and consider them as identical so it doesn’t have to compute a separate strategy for each scenario.
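Information abstraction can be pictured as a lookup keyed on a coarse description of the hand rather than the exact cards. The bucketing rule below is invented for illustration; the article does not say how Pluribus actually groups hands.

```python
def bucket(hand_category, high_card):
    """Hypothetical bucketing rule: keep the hand category but coarsen the
    top card's rank (2-14, ace high) into two bands."""
    band = "low" if high_card <= 10 else "high"
    return (hand_category, band)


# Exact hands the bot might hold, as (category, rank of highest card).
hands = [("straight", 9), ("straight", 10), ("straight", 14), ("flush", 13)]

# One strategy slot per bucket instead of one per exact hand.
buckets = {bucket(category, high) for category, high in hands}

nine_high = bucket("straight", 9)
ten_high = bucket("straight", 10)
```

Four distinct hands collapse into three buckets here, and the nine-high and ten-high straights share one, so they share a strategy.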
Poker is a complicated problem to solve since each player cannot see what cards other players have during a hand, making it an imperfect information game. By using abstraction, the game’s complexity is reduced and it can play against multiple opponents effectively.
Online poker probably isn't ruined?
The code won’t be public, thank goodness, so poker enthusiasts won’t be able to spin up their own AI master bots to try and make a quick buck online. In fact, since the main parts of the code were written at CMU in a lab led by computer science professor Tuomas Sandholm, the license for the software actually belongs to two companies he founded: Strategic Machines, and Strategy Robot.
Facebook helped build on top of that code for research purposes. Sandholm told The Register that although Pluribus is used for poker, it’d be applicable to similar scenarios with imperfect information.
Strategic Machines is looking at applying the technology to a range of industries, including gaming, finance, and healthcare. Strategy Robot is military focused and targets areas like intelligence and security.
“The additional piece of code that Facebook has been collaborating with me on is poker specific so that additional piece cannot be used for defense applications,” Sandholm added. ®