#import "../common.typ": * #import "@preview/prooftrees:0.1.0": * #show: doc => conf("Probabilistic Programming", doc) #import "@preview/simplebnf:0.1.0": * #let prob = "probability" #let tt = "tt" #let ff = "ff" #let tru = "true" #let fls = "false" #let ret = "return" #let flip = "flip" #let real = "real" #let Env = "Env" #let Bool = "Bool" #let dbp(x) = $db(#x)_p$ #let dbe(x) = $db(#x)_e$ #let sharpsat = $\#"SAT"$ #let sharpP = $\#"P"$ - Juden Pearl - Probabilistic graphical models *Definition.* Probabilistic programs are programs that denote #prob distributions. Example: ``` x <- flip 1/2 x <- flip 1/2 return x if y ``` $x$ is a random variable that comes from a coin flip. Instead of having a value, the output is a _function_ that equals $ db( #[``` x <- flip 1/2 x <- flip 1/2 return x if y ```] ) = cases(tt mapsto 3/4, ff mapsto 1/4) $ tt is "semantic" true and ff is "semantic" false Sample space $Omega = {tt, ff}$, #prob distribution on $Omega$ -> [0, 1] Semantic brackets $db(...)$ === TinyPPL Syntax $ #bnf( Prod( $p$, annot: $sans("Pure program")$, { Or[$x$][_variable_] Or[$ifthenelse(p, p, p)$][_conditional_] Or[$p or p$][_conjunction_] Or[$p and p$][_disjunction_] Or[$tru$][_true_] Or[$fls$][_false_] Or[$e$ $e$][_application_] }), ) $ $ #bnf( Prod( $e$, annot: $sans("Probabilistic program")$, { Or[$x arrow.l e, e$][_assignment_] Or[$ret p$][_return_] Or[$flip real$][_random_] }), ) $ Semantics of pure terms - $dbe(p) : Env -> Bool$ - $dbe(x) ([x mapsto tt]) = tt$ - $dbe(tru) (rho) = tt$ - $dbe(p_1 and p_2) (rho) = dbe(p_1) (rho) and dbe(p_2) (rho)$ - the second $and$ is a "semantic" $and$ env is a mapping from identifiers to B Semantics of probabilistic terms - $dbe(e) : Env -> ({tt, ff} -> [0, 1])$ - $dbe(flip 1/2) (rho) = [tt mapsto 1/2, ff mapsto 1/2]$ - $dbe(ret p) (rho) = v mapsto cases(1 "if" dbp(p) (rho) = v, 0 "else")$ // - $dbe(x <- e_1 \, e_2) (rho) = dbp(e_2) (rho ++ [x mapsto dbp(e_1)])$ - $dbe(x <- e_1 \, e_2) (rho) = v' mapsto sum_(v in {tt, ff}) dbe(e_1) (rho) (v) times db(e_2) (rho [x mapsto v])(v')$ - "monadic semantics" of PPLs - https://homepage.cs.uiowa.edu/~jgmorrs/eecs762f19/papers/ramsay-pfeffer.pdf - https://www.sciencedirect.com/science/article/pii/S1571066119300246 Getting back a probability distribution === Tractability What is the complexity class of computing these? #sharpP - Input: boolean formula $phi$ - Output: number of solutions to $phi$ - $sharpsat(x or y) = 3$ https://en.wikipedia.org/wiki/Toda%27s_theorem This language is actually incredibly intractable. There is a reduction from TinyPPL to #sharpsat Reduction: - Given a formula like $phi = (x or y) and (y or z)$ - Write a program where each variable is assigned a $flip 1/2$: $x <- flip 1\/2 \ y <- flip 1\/2 \ z <- flip 1\/2 \ ret (x or y) and (y or z)$ How hard is this? $#sharpsat (phi) = 2^("# vars") times db("encoded program") (emptyset) (tt)$ *Question.* Why do we care about the computational complexity of our denotational semantics? _Answer._ Gives us a lower bound on our operational semantics. *Question.* What's the price of adding features like product/sum types? _Answer._ Any time you add a syntactic construct, it comes at a price. === Systems in the wild - Stan https://en.wikipedia.org/wiki/Stan_(software) - https://www.tensorflow.org/probability Google - https://pyro.ai/ Uber === Tiny Cond Observe, can prune worlds that can't happen. Un-normalized semantics where the probabilities of things don't sum to 1. 
#let observe = "observe" $ db(observe p \; e) (rho) (v) = cases( db(e) (rho) (v) "if" db(p) (rho) = tt, 0 "else", ) $ To normalize, divide by the sum $ db(e)(rho)(v) = frac(db(e) (rho) (v), db(e) (rho) (tt) + db(e) (rho) (ff)) $ == Operational sampling semantics Expectation of a random variable, $EE$ $ EE_Pr[f] = sum^N_w Pr(w) times f(w) \ approx 1/N sum^N_(w tilde Pr) f(w) $ Approximate with a finite sample, where $w$ ranges over the sample To prove our runtime strategy sound, we're going to relate it to an _expectation_. https://en.wikipedia.org/wiki/Concentration_inequality === Big step semantics #let bigstep(sigma, e, v) = $#sigma tack.r #e arrow.b.double #v$ $ sigma tack.r angle.l e . rho angle.r arrow.b.double v $ $tack.r$ read as "in the context of" We externalized the randomness and put it all in $sigma$ This is like pattern matching: $ v :: bigstep(sigma, flip 1/2, v )$ == Lecture 3 === Rejection sampling === Search $x <- flip 1/2 \ y <- ifthenelse(x, flip 1/3, flip 1/4) \ z <- ifthenelse(y, flip 1/6, flip 1/7) \ ret z$ You can draw a search tree of probabilities. Add up the probabilities to get the probability that a program returns a specific value. You can share $z$ since it doesn't depend directly on $x$. This builds a *binary decision diagram*. === Knowledge compilation https://en.wikipedia.org/wiki/Knowledge_compilation Relationship between hardness of propositional reasoning tasks and its syntax of the formula SAT for DNF is easy. What kinds of structure enables efficient reasoning? ==== Succinctness $cal(L)_1$ is more succinct than $cal(L)_2$ if it's efficient (polynomial-time) to translate (in a semantics-preserving way) programs written in $cal(L)_2$ to programs written in $cal(L)_1$ ==== Canonicity of BDDs There is only 1 structural BDD for any particular formula.