Public-Belief Real-Time Search for Production Poker AI
Subtitle: Methodology, architecture, and evaluation framework for the Poker Skill decision engine
Authors: Poker Skill Decision Systems
Date: 2026-05-01
Abstract
This paper describes a public-belief real-time search methodology used by the Poker Skill decision engine to produce auditable recommendations in no-limit Texas Hold’em. The approach combines three components: a blueprint policy that provides baseline action probabilities derived from offline self-play, a public-belief reconstructor that converts the public history of an observed hand into a posterior over the hidden private hands the opponent could be holding, and a bounded real-time search that refines a decision at action time using that belief as input. Each recommendation produces a decision receipt that captures the reasoning artifact for after-the-fact review. The decision engine itself runs on a trusted backend boundary so that decision authority is not exposed to mobile clients. The paper outlines the conceptual framework, the design constraints that shape it, the evaluation dimensions that matter for a production poker assistant, and the explicit limitations of the approach.
Executive Summary
Production poker AI faces three intertwined problems: hidden information, large action spaces, and the need to produce decisions on a clock. A pure offline policy is fast but cannot reason about the specific hand history in front of it. A pure online solver is expressive but unbounded in latency and difficult to audit. The Poker Skill decision engine bridges these through public-belief real-time search.
The blueprint policy is trained offline and gives a fast, defensible default action distribution at any decision point. At action time, the public-belief reconstructor reads the public history of the current hand and computes a posterior over the opponent’s private hand that is consistent with the blueprint and with the observed action sequence. A bounded real-time search then takes that belief, builds a small subgame around the current decision, and refines the action recommendation under a fixed iteration budget. The output is a decision receipt: a structured record of inputs, intermediate reasoning, the recommended action, and the bound conditions that produced it.
The decision engine runs entirely behind a trusted backend boundary. The mobile practice app does not hold decision authority, does not see the raw belief, and does not perform search; it sends a structured hand-state request and receives a recommendation. This boundary is the primary safeguard against tampering, leakage, and replay forgery.
The architecture is deliberately scoped. It does not claim solved poker, superhuman play, or equilibrium guarantees for the full game. It describes an engineering approach to producing auditable, time-bounded poker decisions in production: use an offline baseline as a prior, reconstruct the opponent range implied by public action history, refine the current decision under a bounded search budget, and preserve the decision trail in a receipt.
Problem Statement
No-limit Texas Hold’em is a partially observable, sequential, multi-player game with a continuous bet-sizing dimension and a large effective action space. Each player is dealt private cards that the opponent never directly observes; over the course of a hand, public cards and public actions are revealed, and players take turns committing chips under uncertainty. The complete information set at any decision point is therefore a combination of the public history and the player’s own private hand, and a sound decision must reason about all opponent private hands consistent with the public history.
Three complications follow.
First, the strategy space is enormous. Even after standard bet-size discretization, the game tree of a single hand has many decision nodes, and the cross-product of opponent ranges with future public-card distributions is larger still. Tabular approaches do not fit; abstractions are required.
Second, the action time budget is real. A practical poker assistant has tens to a few hundred milliseconds per recommendation. Bounded latency is a feature, not a side effect. An algorithm that produces a strong recommendation in unbounded time is not, by itself, useful in production.
Third, the recommendation must be auditable. For coaching, training, and post-hand review, the reasoning behind a recommendation is as important as the recommendation itself. A black-box action without an inspectable trail is a poor coaching signal and an unreviewable production artifact.
These three pressures rule out both naive offline policies and naive online solvers. They motivate a hybrid: an offline-trained baseline that is fast and defensible, an inference step that converts public history into a working belief, and a bounded online refinement that produces a recommendation and an artifact within a fixed time budget.
Production Constraints for Poker AI
Several non-negotiable constraints shape the architecture.
Bounded latency. The decision engine must produce a recommendation under an iteration budget chosen so that the worst-case recommendation time is acceptable for the surface that consumes it. Iteration budget rather than wall-clock SLA is the primary control surface; latency is a derived measurement.
Determinism and reproducibility. Given the same inputs, blueprint, search seed, and iteration budget, the decision engine must produce the same recommendation. Reproducibility is required so that decision receipts replay to identical outputs and so that audits can recompute results from the receipt alone.
Auditability. Every recommendation must yield a structured record sufficient to reconstruct the reasoning. The record must hold the inputs, the bound conditions, the intermediate values that drove the decision, and the recommended action.
Trust boundary. Decision authority does not run on the client. The mobile practice app sends a structured hand-state request over an authenticated channel and receives a recommendation; the search and the belief never leave the trusted backend boundary.
Separation of policy from infrastructure. The blueprint, the reconstructor, and the search are conceptually separable so that one can change without forcing changes in the others. This is what makes the architecture reviewable; it is also what makes future evolution of any single component tractable.
Measurement discipline. Architecture and runtime quality are different kinds of evidence. Component contracts, trust boundaries, and receipt structure can be evaluated by inspection; lift, latency, posterior calibration, and trace fidelity require measured runtime evaluation.
Related Work
The Poker Skill decision engine is an engineering integration of ideas the imperfect-information games community has developed over roughly two decades. That lineage matters because the system uses field-standard concepts such as blueprint policies, public-belief states, regret minimization, and bounded subgame search. The system does not claim a contribution to the academic state of the art in any of the directions below.
Counterfactual regret minimization and its variants. Counterfactual Regret Minimization (CFR), introduced by Zinkevich et al. (2007), is the workhorse algorithm for approximating Nash equilibria in extensive-form games with imperfect information. Lanctot et al. (2009) extended CFR with sampling, producing Monte Carlo CFR (MCCFR), which made the technique tractable on larger trees by avoiding full traversal. Tammelin (2014) introduced CFR+, a substantially faster variant that was central to essentially weakly solving heads-up limit Texas hold’em (Bowling et al., 2015). The blueprint policy referenced throughout this paper inherits from this lineage: it is an offline-trained baseline whose construction draws on techniques from this body of work. The system applies existing techniques; it does not claim a contribution to equilibrium-finding.
Subgame solving and continual re-solving. Imperfect-information games cannot in general be decomposed and solved subgame-by-subgame the way perfect-information games can; a refined local strategy must be reconciled with the strategy of the surrounding game. The depth-limited, belief-conditioned subgame approach used at action time owes its conceptual shape to two strands of work. DeepStack (Moravcik et al., 2017) introduced continual re-solving with neural value functions for heads-up no-limit Texas hold’em, demonstrating that decisions could be refined locally at action time using a learned terminal-value model rather than a fully solved abstraction. Brown and Sandholm (2017) introduced safe and nested subgame solving with explicit guarantees about how a refined local strategy interacts with the surrounding strategy of the original game. The bounded real-time search described in this paper uses a depth-limited subgame anchored by a blueprint and a working belief; it does not target a Nash equilibrium of the full game and does not inherit the safety guarantees of safe subgame solving.
Superhuman poker systems. Libratus (Brown & Sandholm, 2018) defeated top human professionals in heads-up no-limit Texas hold’em using a blueprint-plus-subgame-solving-plus-self-improvement architecture; Pluribus (Brown & Sandholm, 2019) extended a related approach to six-player no-limit Texas hold’em and likewise defeated top humans. The Poker Skill decision engine does not claim superhuman performance and is not a reimplementation of either system. It uses vocabulary such as “blueprint”, “real-time search”, and “subgame” that those works have made standard.
Public-belief states and combined RL plus search. The idea that the right state for reasoning in an imperfect-information game is not a player’s information state but a distribution over private states consistent with the public history is older than its current name; it is implicit in CFR-style range tracking, in DeepStack’s range-based re-solving, and in safe subgame solving. Brown et al. (2020) made the formulation explicit with ReBeL, Recursive Belief-based Learning, which combines self-play reinforcement learning with search over public belief states and which provably converges to a Nash equilibrium in two-player zero-sum games. Schmid et al. (2023) generalized the construction in Student of Games, a single learning algorithm that handles both perfect- and imperfect-information games. The public-belief reconstructor described in this paper is a Bayesian update over the opponent’s private hand given the public history under a fixed blueprint prior; it is a working belief sufficient for downstream search, not a learned recursive belief representation, and it does not carry the convergence properties of the works above.
Evaluation in imperfect-information games. Quantitative evaluation of agents in imperfect-information games is not straightforward. Lower-bound exploitability via local best-response analysis (Lisy & Bowling, 2017) provides a smoke-test signal even where exact best-response computation is intractable. AIVAT (Burch et al., 2018) reduces the variance of head-to-head match estimates by exploiting known terminal and chance-node distributions. The runtime evaluation methodology described elsewhere in this paper does not yet rely on either tool, but both are appropriate references for what a thorough external evaluation could look like.
Position of the system. The blueprint, the public-belief construction, and the bounded subgame search are applications of public ideas to a production decision engine. The practical contribution is the integration: the trusted backend boundary that holds the engine, the decision-receipt artifact written at action time, and the separation between inspectable architecture and measured runtime behavior. None of those substitute for the prior work on which they rest.
System Architecture
The decision engine is composed of four logical components: the blueprint policy, the public-belief reconstructor, the bounded real-time search, and the decision-receipt writer. These run inside a single trusted backend service that exposes a request/response interface to authorized callers.
Blueprint policy. The blueprint is an offline-trained policy that maps an information set, abstracted appropriately, to a distribution over actions, in the sense the term is used in production poker AI systems (Brown & Sandholm, 2018). It is not optimal; it is a fast, defensible default that the rest of the system uses as a prior. It is the answer to the question “in the absence of any extra reasoning, what is a reasonable thing to do here?”
Public-belief reconstructor. Given the public history of the current hand, the reconstructor computes a posterior distribution over the opponent’s private hand. This posterior is the working belief. It is computed with Bayes’ rule: a prior over the opponent’s private hands is multiplied by the likelihood of the observed action sequence under the blueprint and then normalized. The output is a probability vector over hand classes, normalized over the support consistent with the public information.
Bounded real-time search. Given the working belief and the current decision point, the search builds a finite subgame rooted at that decision and runs a fixed number of refinement iterations against it. Output is a refined action distribution and the supporting intermediate quantities. The search is deliberately bounded; it is not a solver and does not target equilibrium.
Decision-receipt writer. The receipt writer aggregates the inputs, the working belief summary, the bound conditions, the search output, and the recommended action into a structured record suitable for after-the-fact review. The receipt does not contain implementation internals; it contains the conceptual quantities that a reviewer would use to ask “would I have made the same call?”
These four components are sequenced for every recommendation. The blueprint is consulted to seed the search and to compute the prior used by the reconstructor. The reconstructor produces the working belief. The search consumes the belief and the blueprint and produces a refined recommendation. The receipt writer captures everything.
The arrangement gives each component a single, well-defined job. The blueprint owns “what would I do without thinking?” The reconstructor owns “what does the opponent’s action history say about their hand?” The search owns “given that, what should I actually do here?” The receipt writer owns “what did we just decide and why?”
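The sequencing described above can be sketched as a small pipeline. This is a minimal illustration, not the engine's interface: the names `Receipt`, `recommend`, `toy_reconstruct`, and `toy_search` are hypothetical, and the blueprint is assumed to be captured inside the reconstructor and search callables rather than passed explicitly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Belief = Dict[str, float]        # hand class -> probability
ActionDist = Dict[str, float]    # action -> probability


@dataclass(frozen=True)
class Receipt:
    """Hypothetical receipt shape, reduced to the fields this sketch needs."""
    public_history: Tuple[str, ...]
    belief: Belief
    iteration_budget: int
    action_dist: ActionDist
    recommended: str


def recommend(
    public_history: Tuple[str, ...],
    reconstruct: Callable[[Tuple[str, ...]], Belief],
    search: Callable[[Belief, int], ActionDist],
    iteration_budget: int,
) -> Receipt:
    """Run reconstructor -> search -> receipt writer for one decision."""
    belief = reconstruct(public_history)
    action_dist = search(belief, iteration_budget)
    # Sort the action names first so ties break the same way on every run.
    recommended = max(sorted(action_dist), key=lambda a: action_dist[a])
    return Receipt(public_history, belief, iteration_budget, action_dist, recommended)


# Toy stand-ins (assumptions): real components would consult the blueprint.
def toy_reconstruct(history: Tuple[str, ...]) -> Belief:
    if "raise" in history:
        return {"strong": 0.7, "weak": 0.3}
    return {"strong": 0.5, "weak": 0.5}


def toy_search(belief: Belief, budget: int) -> ActionDist:
    p_fold = belief["strong"]
    return {"fold": p_fold, "call": 1.0 - p_fold}


receipt = recommend(("raise",), toy_reconstruct, toy_search, iteration_budget=100)
print(receipt.recommended)
```

Passing the components in as callables mirrors the separation of concerns: any one of them can be swapped without touching the sequencing code.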
Public-Belief Reconstruction
A central conceptual move, with substantial public lineage in imperfect-information game solving (Moravcik et al., 2017; Brown et al., 2020), is that the working belief at any decision point is recoverable from the public history of the hand alone. The public history consists of the actions taken so far by each player, the public cards revealed so far, and the betting structure of the game; it does not contain anyone’s private cards.
The reconstructor treats the blueprint as a prior over what each player would have done at each prior decision in the hand, conditional on the various private hands they could have held. Bayes’ rule then says that the posterior over the opponent’s private hand, given the observed public history, is proportional to the prior probability of holding that hand multiplied by the likelihood that the blueprint would have produced the observed action sequence given that hand. Normalizing over the support of consistent hands yields a posterior.
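Written out, the update takes the following form. The notation is introduced here for illustration and is not taken from the engine: $h$ is an opponent private hand, $c$ the public cards, $H(c)$ the set of hands consistent with those cards, $P_0$ the prior over hands, $\sigma_{\mathrm{bp}}$ the blueprint's action probability, $s_t$ the public state at step $t$, and $\mathcal{T}_{\mathrm{opp}}$ the set of steps at which the opponent acted.

```latex
P(h \mid a_{1:T}, c)
  = \frac{P_0(h \mid c)\,\prod_{t \in \mathcal{T}_{\mathrm{opp}}} \sigma_{\mathrm{bp}}(a_t \mid h, s_t)}
         {\sum_{h' \in H(c)} P_0(h' \mid c)\,\prod_{t \in \mathcal{T}_{\mathrm{opp}}} \sigma_{\mathrm{bp}}(a_t \mid h', s_t)}
```

Only the opponent's own actions enter the product, because the other player's actions do not depend on the opponent's private hand once the public state is fixed.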
In practice this is implemented as a step-by-step update: at each node in the public history, the conditional likelihood of the observed action under the blueprint is folded into the running posterior, and at chance nodes the support is restricted to hands consistent with the public cards. The result at the current decision point is a working belief: a probability vector over the opponent’s possible private hands, shaped by every public action so far.
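The step-by-step update can be sketched as follows. This is a toy under stated assumptions: the three-hand universe, the `toy_blueprint` table, and the function names are hypothetical, and the blueprint likelihood is simplified to depend only on the hand and the action rather than on the full public state.

```python
from typing import Dict, FrozenSet, Mapping, Tuple

Hand = Tuple[str, str]            # two private cards, e.g. ("As", "Ah")
Belief = Dict[Hand, float]        # hand -> probability


def _normalize(weights: Mapping[Hand, float]) -> Belief:
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights.values())
    if total <= 0.0:
        raise ValueError("no hand is consistent with the observed history")
    return {hand: w / total for hand, w in weights.items()}


def fold_in_action(
    belief: Belief, action: str, blueprint: Mapping[Hand, Mapping[str, float]]
) -> Belief:
    """Multiply each hand's mass by the blueprint likelihood of the action."""
    return _normalize(
        {hand: p * blueprint[hand].get(action, 0.0) for hand, p in belief.items()}
    )


def restrict_to_board(belief: Belief, board: FrozenSet[str]) -> Belief:
    """At a chance node, drop hands that contain a revealed public card."""
    return _normalize(
        {hand: p for hand, p in belief.items() if not board.intersection(hand)}
    )


# Toy universe of three opponent hands with a uniform prior (assumption).
AA, KK, TRASH = ("As", "Ah"), ("Kd", "Kc"), ("7s", "2h")
toy_blueprint = {
    AA: {"raise": 0.8, "call": 0.2},
    KK: {"raise": 0.6, "call": 0.4},
    TRASH: {"raise": 0.2, "call": 0.8},
}
prior = {AA: 1 / 3, KK: 1 / 3, TRASH: 1 / 3}

after_raise = fold_in_action(prior, "raise", toy_blueprint)
after_flop = restrict_to_board(after_raise, frozenset({"Kd", "9c", "4h"}))
print(after_raise)
print(after_flop)
```

In this toy, the raise shifts mass toward the stronger hands, and the flop containing Kd removes the pocket kings that hold that card, after which the remaining mass is renormalized.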
Two properties of this belief are worth naming. First, it is internally consistent with the blueprint: an observer who assumes the opponent plays the blueprint, and who sees the observed public history, arrives at exactly this posterior. Second, it is fully recoverable from public information; the reconstructor never needs the opponent’s private cards to compute it. That is the property that makes downstream search well-defined: the search reasons about the opponent’s possible hands from the same vantage point a human reviewer has.
The fidelity of the belief to actual opponent behavior is, of course, only as good as the blueprint. If the blueprint is a poor model of how a particular opponent plays, the posterior will be miscalibrated. The architecture does not claim otherwise; it claims that, given the chosen blueprint, the belief is the right belief.
Bounded Real-Time Search
With the working belief in hand, the search constructs a subgame rooted at the current decision and refines the action recommendation under a fixed iteration budget.
The subgame is a finite, abstracted game tree extending forward from the current decision for a bounded number of decision rounds. Bet sizes are discretized; chance nodes are sampled or summarized; terminal nodes are evaluated using a learned or heuristic value function consistent with the blueprint, in the spirit of depth-limited subgame solving with learned terminal values (Moravcik et al., 2017). The subgame is constructed so that the working belief sits naturally at the root: the opponent’s private hand is unknown, but its distribution is the posterior the reconstructor just computed.
Inside the subgame, the search runs a regret-style refinement loop (Zinkevich et al., 2007; Tammelin, 2014) for a fixed number of iterations. Each iteration updates the regret estimates and the resulting action distribution at the root. The loop is anytime in the sense that stopping early still yields a valid distribution, although an individual iteration is not guaranteed to improve it. The fixed iteration budget controls compute, bounds latency, and supports reproducibility: for the same inputs, seed, and budget, the search returns the same answer every time.
The search’s output is twofold: a refined action distribution at the root, and a set of intermediate quantities including per-action expected value estimates, the iteration-by-iteration trajectory of the recommendation, and a summary of the subgame’s structural assumptions. The first is the recommendation. The second is the raw material for the decision receipt.
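A regret-style loop with a fixed iteration budget can be illustrated on a toy zero-sum matrix game. This is a sketch of plain regret matching in self-play, not the engine's search: `TOY_PAYOFF` stands in for a belief-weighted subgame, the function names are hypothetical, and blueprint anchoring and chance sampling are omitted.

```python
from typing import List, Sequence, Tuple


def regret_matching(regrets: Sequence[float]) -> List[float]:
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


def refine(
    payoff: Sequence[Sequence[float]], iterations: int
) -> Tuple[List[float], List[float], List[List[float]]]:
    """payoff[i][j] is the hero's payoff for hero action i, opponent action j."""
    if iterations < 1:
        raise ValueError("iteration budget must be at least 1")
    n, m = len(payoff), len(payoff[0])
    hero_regret, opp_regret = [0.0] * n, [0.0] * m
    hero_sum, opp_sum = [0.0] * n, [0.0] * m
    trajectory: List[List[float]] = []
    for t in range(1, iterations + 1):
        hero = regret_matching(hero_regret)
        opp = regret_matching(opp_regret)
        # Expected value of each pure action against the other side's current mix.
        hero_ev = [sum(payoff[i][j] * opp[j] for j in range(m)) for i in range(n)]
        opp_ev = [-sum(payoff[i][j] * hero[i] for i in range(n)) for j in range(m)]
        hero_value = sum(hero[i] * hero_ev[i] for i in range(n))
        opp_value = sum(opp[j] * opp_ev[j] for j in range(m))
        for i in range(n):
            hero_regret[i] += hero_ev[i] - hero_value
            hero_sum[i] += hero[i]
        for j in range(m):
            opp_regret[j] += opp_ev[j] - opp_value
            opp_sum[j] += opp[j]
        # The running average strategy is the anytime output of the loop.
        trajectory.append([s / t for s in hero_sum])
    avg_opp = [s / iterations for s in opp_sum]
    action_ev = [sum(payoff[i][j] * avg_opp[j] for j in range(m)) for i in range(n)]
    return trajectory[-1], action_ev, trajectory


TOY_PAYOFF = [[2.0, -1.0], [-1.0, 1.0]]   # hypothetical 2x2 subgame
root_dist, action_ev, trajectory = refine(TOY_PAYOFF, iterations=2000)
print(root_dist, action_ev, len(trajectory))
```

The three return values correspond to the refined root distribution, the per-action expected value estimates, and the iteration-by-iteration trajectory. Because this toy uses no sampling, it is deterministic without a seed; a sampled variant would need the seed fixed as well.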
A few characteristics are worth emphasizing.
The search is bounded, not optimal. It does not target a Nash equilibrium of the original game and does not claim to find one. With a finite iteration budget on a finite, abstracted subgame, the output is a refinement, not a fixed point.
The search uses the blueprint as both a prior and a regularizer. Fully unanchored search on a small subgame can drift into pathological strategies; using the blueprint to seed and to anchor keeps the refinement in the neighborhood of a defensible policy.
The search is reproducible. Given the same belief, blueprint, subgame definition, search seed, and iteration budget, the search produces identical output. This is what makes decision receipts replayable.
Decision Receipts and Replayability
A decision receipt is a structured record produced for every recommendation. It is the durable artifact that lets a reviewer answer “what did we decide, and why?” without re-running the engine.
A receipt contains, at minimum: the public history of the hand at the decision point; a summary of the working belief, including top mass, entropy, and a sample of high-mass hand classes; the blueprint and search settings in effect, including the iteration budget and search seed, described at an implementation-neutral level; the refined action distribution; the recommended action; and a small set of human-readable explanation fields generated from the intermediate quantities of the search.
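A receipt with these fields might be assembled as in the sketch below. The field names, the `settings` keys, and the reading of "top mass" as the probability of the single most likely hand class are assumptions made for illustration.

```python
import json
import math
from typing import Any, Dict, List


def summarize_belief(belief: Dict[str, float], top_k: int = 3) -> Dict[str, Any]:
    """Top mass, Shannon entropy in bits, and the highest-mass hand classes."""
    # Sort by descending mass, then by name so ties are ordered deterministically.
    ranked = sorted(belief.items(), key=lambda kv: (-kv[1], kv[0]))
    entropy_bits = -sum(p * math.log2(p) for p in belief.values() if p > 0.0)
    return {
        "top_mass": ranked[0][1],
        "entropy_bits": entropy_bits,
        "top_classes": [name for name, _ in ranked[:top_k]],
    }


def build_receipt(
    public_history: List[str],
    belief: Dict[str, float],
    settings: Dict[str, Any],
    action_dist: Dict[str, float],
) -> Dict[str, Any]:
    """Collect the conceptual quantities of one decision into a plain record."""
    recommended = max(sorted(action_dist), key=lambda a: action_dist[a])
    return {
        "public_history": public_history,
        "belief_summary": summarize_belief(belief),
        "settings": settings,
        "action_distribution": action_dist,
        "recommended_action": recommended,
        "explanation": (
            f"{recommended} has the highest refined probability "
            f"({action_dist[recommended]:.2f})"
        ),
    }


toy_belief = {"AA": 0.5, "KK": 0.25, "QQ": 0.25}
receipt = build_receipt(
    public_history=["hero_bet", "villain_raise"],
    belief=toy_belief,
    settings={"blueprint_label": "bp-example", "iteration_budget": 200, "search_seed": 7},
    action_dist={"fold": 0.1, "call": 0.6, "raise": 0.3},
)
receipt_json = json.dumps(receipt, sort_keys=True)
print(receipt_json)
```

Serializing with sorted keys keeps the stored form stable, which helps when receipts are compared or aggregated later.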
The receipt is written before the recommendation is returned and does not depend on side effects in the consuming surface. It is intended for asynchronous review, replay, and aggregation. A reviewer can read a single receipt and reconstruct the reasoning chain; a batch of receipts can be scanned to surface anomalies.
Two properties of receipts matter for operations.
First, replayability. A receipt contains everything needed to recompute the recommendation. Given the engine version, the receipt is sufficient input to produce the same output. This makes the engine testable against historical traffic and lets a reviewer ask “would the current engine still recommend this?” without rerunning a full session.
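One way to check replayability is to store a fingerprint of the output in the receipt and compare it against a fresh run. The sketch below assumes a deterministic engine callable and JSON-serializable inputs and outputs; `fingerprint`, `write_receipt`, `replay_matches`, and the toy engines are hypothetical names.

```python
import hashlib
import json
from typing import Any, Callable, Dict

Engine = Callable[[Dict[str, Any]], Dict[str, float]]


def fingerprint(value: Any) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra spaces)."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def write_receipt(inputs: Dict[str, Any], engine: Engine, engine_version: str) -> Dict[str, Any]:
    """Run the engine once and record inputs, output, and output fingerprint."""
    output = engine(inputs)
    return {
        "engine_version": engine_version,
        "inputs": inputs,
        "output": output,
        "output_fingerprint": fingerprint(output),
    }


def replay_matches(receipt: Dict[str, Any], engine: Engine) -> bool:
    """Recompute from the receipt's inputs and compare fingerprints."""
    return fingerprint(engine(receipt["inputs"])) == receipt["output_fingerprint"]


# Toy engines (assumptions): stand-ins for two engine versions.
def toy_engine_v1(inputs: Dict[str, Any]) -> Dict[str, float]:
    return {"call": 0.25, "fold": 0.75} if "raise" in inputs["history"] else {"call": 1.0, "fold": 0.0}


def toy_engine_v2(inputs: Dict[str, Any]) -> Dict[str, float]:
    return {"call": 0.5, "fold": 0.5}


stored = write_receipt({"history": ["raise"], "iteration_budget": 200}, toy_engine_v1, "v1")
print(replay_matches(stored, toy_engine_v1), replay_matches(stored, toy_engine_v2))
```

A mismatch against a newer engine is the signal that it would no longer make the same recommendation. Exact fingerprint comparison assumes floating-point results are bit-identical across runs, which may not hold across different hardware or library versions.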
Second, audit boundary. The receipt is conceptually structured but does not expose implementation internals; it surfaces the belief, the bound conditions, and the recommended action at the abstraction at which a poker reviewer would think. This keeps the audit surface stable as the implementation evolves.
Receipts are also the unit of accumulation: aggregating receipts over time yields a population view of where the engine is confident, where the belief is sharp versus diffuse, and where recommendations cluster on the borderline between actions. That population view feeds back into blueprint review and search-budget tuning.
Mobile/Backend Serving Boundary
The Poker Skill mobile practice app is the primary surface for receiving recommendations during practice and review. The decision engine does not run on the mobile client.
The client sends a structured hand-state request over an authenticated, server-terminated channel. The request describes the hand: stack sizes, position, public actions so far, public cards if any, and any session-level setting that affects the recommendation. The request does not contain decision authority.
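A backend-side parser for such a request might look like the sketch below. The field names and validation rules are assumptions, and session-level settings are left out for brevity. The point it illustrates is that unknown fields are rejected, so a client cannot smuggle in a belief or a recommendation.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

# Hypothetical field set; a real request schema may differ.
ALLOWED_FIELDS = frozenset({"stacks", "position", "actions", "board"})


@dataclass(frozen=True)
class HandStateRequest:
    stacks: Tuple[int, ...]     # chips per seat
    position: str               # hero's seat label
    actions: Tuple[str, ...]    # public actions so far, in order
    board: Tuple[str, ...]      # public cards, empty before the flop


def parse_request(payload: Dict[str, Any]) -> HandStateRequest:
    """Validate an untrusted payload and convert it to an immutable request."""
    unknown = set(payload) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    missing = ALLOWED_FIELDS - set(payload)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    stacks = tuple(int(s) for s in payload["stacks"])
    if len(stacks) < 2 or any(s < 0 for s in stacks):
        raise ValueError("stacks must list at least two non-negative values")
    board = tuple(str(c) for c in payload["board"])
    if len(board) not in (0, 3, 4, 5) or len(set(board)) != len(board):
        raise ValueError("board must hold 0, 3, 4, or 5 distinct cards")
    return HandStateRequest(
        stacks=stacks,
        position=str(payload["position"]),
        actions=tuple(str(a) for a in payload["actions"]),
        board=board,
    )


valid_payload = {
    "stacks": [9500, 10500],
    "position": "button",
    "actions": ["raise", "call"],
    "board": ["Kd", "9c", "4h"],
}
request = parse_request(valid_payload)
print(request)
```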
The trusted backend boundary terminates the request, runs the decision engine, writes a decision receipt, and returns the recommendation. The blueprint, the reconstructor, and the search live entirely on the backend. The working belief is never serialized to the client.
This boundary is the primary integrity guarantee. The client does not have the model, does not see the belief, does not perform search, and cannot fabricate or alter recommendations. Replay of decisions is anchored to backend receipts, not to client logs. If the client is tampered with, the worst it can do is misrepresent the hand state; it cannot misrepresent the decision because the decision is computed elsewhere. This is not a complete defense against a hostile client, but it is the right place to draw the line.
The boundary also stabilizes operational characteristics. Latency, throughput, blueprint labels, and search-budget changes are managed centrally. A blueprint update is handled behind the trust boundary, not through the mobile client. The client is responsible only for collecting the hand state and rendering the recommendation.
Evaluation Framework
The system should be evaluated along architectural and runtime axes, because each answers a different question.
Architecture. Architectural evaluation asks whether the contracts between components are coherent, whether the public-belief construction is sound under its stated assumptions, whether the receipt captures the quantities needed for replay, and whether the trust boundary is placed correctly. This review can identify gaps in coverage, ambiguous contracts, missing receipt fields, and trust-boundary violations.
Runtime behavior. Runtime evaluation asks how the recommendation changes as the iteration budget grows, how latency behaves under representative load, how calibrated and concentrated the working belief is on representative hands, how readable the decision traces are, and whether repeated runs against identical inputs produce identical recommendations and receipts.
The main runtime dimensions are:
- Iteration-vs-quality lift. How does recommendation quality change as the search iteration budget increases against a controlled set of decision points?
- Latency distribution. What is the wall-clock distribution of recommendation latency at the chosen iteration budget under representative load?
- Belief concentration. On a controlled set of hands, where does the posterior put its mass, and how does that mass move as the public history accumulates?
- Trace fidelity. Are decision traces readable and faithful to the search’s actual progression?
- Run reproducibility. Do runs against identical inputs produce identical recommendations and receipts at the level of granularity a reviewer needs?
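Two of these dimensions, latency distribution and run reproducibility, lend themselves to a small measurement harness. The sketch below is illustrative only: the helper names are hypothetical, `engine` is any callable that maps a request to a recommendation, and the nearest-rank percentile is one of several reasonable definitions.

```python
import math
import time
from typing import Any, Callable, List, Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not samples or not 0.0 < q <= 100.0:
        raise ValueError("need at least one sample and 0 < q <= 100")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def measure_latency_ms(engine: Callable[[Any], Any], request: Any, runs: int) -> List[float]:
    """Wall-clock latency of repeated calls, in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        engine(request)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def is_reproducible(engine: Callable[[Any], Any], request: Any, runs: int) -> bool:
    """True when every repeated call returns a result equal to the first."""
    first = engine(request)
    return all(engine(request) == first for _ in range(runs - 1))


# Toy engine (assumption): a deterministic stand-in for the decision engine.
def toy_engine(request: Any) -> dict:
    return {"call": 0.6, "fold": 0.4}


latencies = measure_latency_ms(toy_engine, {"history": []}, runs=50)
print(percentile(latencies, 50), percentile(latencies, 95))
print(is_reproducible(toy_engine, {"history": []}, runs=5))
```

Measured latencies depend on hardware and load, so the percentiles are only meaningful when collected under conditions representative of the deployment.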
Limitations
Not solved poker. Poker is not solved and the Poker Skill decision engine does not solve it. The blueprint is not optimal, the search is bounded, and the working belief is only as good as the blueprint that anchors it.
Not superhuman. No claim is made about the decision engine playing above expert human level, or above any specific reference player or system. Such a claim would require a designed match against a defined opponent population with appropriate sample sizes.
Not equilibrium-guaranteed. The bounded search does not target or attain a Nash equilibrium of the original game. It produces a refinement of the blueprint anchored on the current public history; it does not produce a fixed point.
Not validated at six-max. The architectural framing extends conceptually to multi-way play, but six-max validation has not been performed and is not claimed. Multi-way play introduces additional complications, including folded-player belief modeling, harder subgame definitions, and more chance nodes, that are out of scope for this document.
Latency is budget-dependent. Latency is derived from the iteration budget, subgame shape, hardware class, and runtime conditions. Any deployment has to choose a budget that fits the surface consuming the recommendation.
Blueprint dependence. The working belief is conditional on the blueprint being a reasonable model of opponent play. Against players who deviate sharply from the blueprint, the belief is miscalibrated and the search is anchored to a poor prior.
Abstraction loss. The subgame is finite and abstracted: bet sizes are discretized, chance nodes are sampled or summarized, and terminal values are estimated. These choices trade fidelity for tractability and add error that is not captured by reproducibility alone.
References
Bowling, M., Burch, N., Johanson, M., and Tammelin, O. (2015). Heads-up limit hold’em poker is solved. Science, 347(6218), 145-149. doi:10.1126/science.1259433.
Brown, N., Bakhtin, A., Lerer, A., and Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020). arXiv:2007.13544.
Brown, N., and Sandholm, T. (2017). Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017).
Brown, N., and Sandholm, T. (2018). Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374), 418-424. doi:10.1126/science.aao1733.
Brown, N., and Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science, 365(6456), 885-890. doi:10.1126/science.aay2400.
Burch, N., Schmid, M., Moravcik, M., and Bowling, M. (2018). AIVAT: A new variance reduction technique for agent evaluation in imperfect information games. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). arXiv:1612.06915.
Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. (2009). Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22 (NeurIPS 2009).
Lisy, V., and Bowling, M. (2017). Equilibrium approximation quality of current no-limit poker bots. In AAAI-17 Workshop on Computer Poker and Imperfect Information Games. arXiv:1612.07547.
Moravcik, M., Schmid, M., Burch, N., Lisy, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. (2017). DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337), 508-513. doi:10.1126/science.aam6960.
Schmid, M., Moravcik, M., Burch, N., Kadlec, R., Davidson, J., Waugh, K., et al. (2023). Student of Games: A unified learning algorithm for both perfect and imperfect information games. Science Advances, 9(46), eadg3256. doi:10.1126/sciadv.adg3256.
Tammelin, O. (2014). Solving large imperfect information games using CFR+. arXiv:1407.5042.
Zinkevich, M., Johanson, M., Bowling, M., and Piccione, C. (2007). Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20 (NeurIPS 2007).