Typical distributions

| Name | Param | PMF | Mean | Var |
| --- | --- | --- | --- | --- |
| Bernoulli | $p$ | $P(X=1)=p,\ P(X=0)=q$ | $p$ | $pq$ |
| Binomial | $n,p$ | $\binom{n}{k}p^k q^{n-k}$ | $np$ | $npq$ |
| First Success (FS) | $p$ | $pq^{k-1}$ | $1/p$ | $q/p^2$ |
| Geometric | $p$ | $pq^k$ | $q/p$ | $q/p^2$ |
| Negative Binomial | $r,p$ | $\binom{r+n-1}{r-1}p^r q^n$ | $rq/p$ | $rq/p^2$ |
| Hypergeometric | $w,b,n$ | $\frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}}$ | $\mu=\frac{nw}{w+b}$ | $\left(\frac{w+b-n}{w+b-1}\right)n\frac{\mu}{n}\left(1-\frac{\mu}{n}\right)$ |
| Poisson | $\lambda$ | $\frac{e^{-\lambda}\lambda^k}{k!}$ | $\lambda$ | $\lambda$ |

(Throughout, $q=1-p$.)
| Name | Param | PDF | Mean | Var |
| --- | --- | --- | --- | --- |
| Uniform | $a<b$ | $\frac{1}{b-a}$ for $x\in(a,b)$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
| Normal | $\mu,\sigma^2$ | $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$ | $\mu$ | $\sigma^2$ |
| Exponential | $\lambda$ | $\lambda e^{-\lambda x}$ for $x>0$ | $1/\lambda$ | $1/\lambda^2$ |
| Gamma | $a,\lambda$ | $\Gamma(a)^{-1}(\lambda x)^a e^{-\lambda x}x^{-1}$ | $a/\lambda$ | $a/\lambda^2$ |
| Beta | $a,b$ | $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1}$ | $\mu=\frac{a}{a+b}$ | $\frac{\mu(1-\mu)}{a+b+1}$ |
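A few table entries can be spot-checked numerically. The following is a minimal sketch, not part of the original notes, assuming numpy/scipy are available; the parameter values are arbitrary.

```python
# Minimal sketch: compare a few table entries with scipy.stats (illustrative values).
from scipy import stats

n, p = 10, 0.3
q = 1 - p
print(stats.binom(n, p).mean(), n * p)        # Binomial mean: np
print(stats.binom(n, p).var(), n * p * q)     # Binomial variance: npq
print(stats.poisson(2.5).mean(), stats.poisson(2.5).var())  # both equal lambda

a, b = 2.0, 5.0
mu = a / (a + b)
print(stats.beta(a, b).var(), mu * (1 - mu) / (a + b + 1))  # Beta variance
```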
Definition of Conditional Probability: $P(A\mid B) = \frac{P(A\cap B)}{P(B)}$
Chain Rule
$$P(A_1, A_2) = P(A_1)P(A_2\mid A_1)$$
$$P(A_1, \cdots, A_n) = P(A_1)P(A_2\mid A_1)P(A_3\mid A_1, A_2) \cdots P(A_n\mid A_1, \cdots, A_{n-1})$$
Bayes' Rule
$$P(A\mid B) = \frac{P(B\mid A)P(A)}{P(B)}$$
The Law of Total Probability (LOTP)
Let $A_1, \cdots, A_n$ be a partition of the sample space $S$ with $P(A_i)>0$ for all $i$. Then
$$P(B) = \sum_{i=1}^{n} P(B\mid A_i)P(A_i)$$
Inference & Bayes’ Rule (其实就是把LOTP和贝叶斯公式套一起)
P ( A i ∣ B ) = P ( A i ) P ( B ∣ A i ) P ( A 1 ) P ( B ∣ A 1 ) + ⋯ + P ( A n ) P ( B ∣ A n ) P(A_i\mid B) = \frac{P(A_i)P(B\mid A_i)}{P(A_1)P(B\mid A_1) + \cdots + P(A_n)P(B\mid A_n)} P ( A i  ∣ B ) = P ( A 1  ) P ( B ∣ A 1  ) + ⋯ + P ( A n  ) P ( B ∣ A n  ) P ( A i  ) P ( B ∣ A i  )  
P ( A ∣ B , E ) = P ( B ∣ A , E ) P ( A ∣ E ) P ( B ∣ E ) P(A\mid B, E) = \frac{P(B\mid A, E)P(A\mid E)}{P(B\mid E)} P ( A ∣ B , E ) = P ( B ∣ E ) P ( B ∣ A , E ) P ( A ∣ E )  
LOTP with extra conditioning P ( B ∣ E ) = ∑ i = 1 n P ( B ∣ A i , E ) P ( A i , E ) P(B\mid E) = \sum_{i=1}^{n} P(B\mid A_i, E)P(A_i, E) P ( B ∣ E ) = i = 1 ∑ n  P ( B ∣ A i  , E ) P ( A i  , E ) 
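A small worked example of the inference formula, not from the original notes: the prior and likelihood numbers below are made-up assumptions for a two-hypothesis (disease testing) setting.

```python
# Hedged example: LOTP + Bayes' rule with illustrative, made-up numbers.
prior = {"D": 0.01, "not_D": 0.99}        # P(A_i)
likelihood = {"D": 0.95, "not_D": 0.02}   # P(B | A_i), B = "test positive"

# LOTP: P(B) = sum_i P(B | A_i) P(A_i)
p_B = sum(likelihood[a] * prior[a] for a in prior)

# Inference form of Bayes' rule: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
posterior = {a: likelihood[a] * prior[a] / p_B for a in prior}
print(p_B)        # ~0.0293
print(posterior)  # P(D | positive) ~ 0.32
```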
Independence: events $A$ and $B$ are independent if and only if
$$P(A\cap B) = P(A)P(B)$$
If $P(A)>0$ and $P(B)>0$, this is equivalent to
$$P(A\mid B) = P(A), \qquad P(B\mid A) = P(B)$$
Covariance Definition
$$\mathrm{Cov}(X,Y) = E\big((X-EX)(Y-EY)\big) = E(XY) - E(X)E(Y)$$
Properties
$$\mathrm{Cov}(X,X) = \mathrm{Var}(X)$$
$$\mathrm{Cov}(X,Y) = \mathrm{Cov}(Y,X)$$
$$\mathrm{Cov}(X,c) = 0 \quad \text{for any constant } c$$
$$\mathrm{Cov}(aX, Y) = a\,\mathrm{Cov}(X,Y)$$
$$\mathrm{Cov}(X+Y,Z) = \mathrm{Cov}(X,Z) + \mathrm{Cov}(Y,Z)$$
$$\mathrm{Cov}(X+Y, Z+W) = \mathrm{Cov}(X,Z) + \mathrm{Cov}(X,W) + \mathrm{Cov}(Y,Z) + \mathrm{Cov}(Y,W)$$
$$\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)$$
For $n$ r.v.s $X_1, \cdots, X_n$:
$$\mathrm{Var}(X_1 + \cdots + X_n) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n) + 2 \sum_{i<j}\mathrm{Cov}(X_i, X_j)$$
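These identities are easy to verify on simulated data. A minimal sketch, assuming numpy is available (the construction of $Y$ is an arbitrary choice just to make $X$ and $Y$ correlated):

```python
# Sanity check of Cov bilinearity and Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X,Y).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)
Y = 0.5 * X + rng.normal(size=100_000)   # correlated with X by construction

def cov(u, v):
    return np.mean(u * v) - np.mean(u) * np.mean(v)

print(cov(X, X), np.var(X))                                  # Cov(X,X) = Var(X)
print(cov(3 * X, Y), 3 * cov(X, Y))                          # Cov(aX,Y) = a Cov(X,Y)
print(np.var(X + Y), np.var(X) + np.var(Y) + 2 * cov(X, Y))  # variance of a sum
```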
Correlation Definition
$$\mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}$$
Properties
$X$ and $Y$ are uncorrelated if $\mathrm{Cov}(X,Y) = 0$, or equivalently $\mathrm{Corr}(X,Y) = 0$.
Independent $\Rightarrow$ uncorrelated (the converse does not hold in general).
$$-1 \leq \mathrm{Corr}(X,Y) \leq 1$$
Multinomial Joint PMF: $X\sim \mathrm{Mult}_k(n, p)$
$$P(X_1=n_1, \cdots, X_k=n_k) = \frac{n!}{n_1!n_2!\cdots n_k!}\,p_1^{n_1}\cdots p_k^{n_k}$$
for $n_1 + \cdots + n_k = n$.
Multinomial Marginal:
$$X\sim \mathrm{Mult}_k(n,p) \ \Rightarrow\ X_j \sim \mathrm{Bin}(n, p_j)$$
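A simulation sketch of the marginal result, not from the original notes; the values of $n$ and $p$ are arbitrary assumptions, and numpy/scipy are assumed available.

```python
# Sample from a Multinomial and check that one coordinate behaves like Bin(n, p_j).
import numpy as np
from scipy import stats

n, p = 20, [0.2, 0.5, 0.3]
rng = np.random.default_rng(1)
samples = rng.multinomial(n, p, size=200_000)   # rows are (X_1, X_2, X_3)

x1 = samples[:, 0]
print(x1.mean(), n * p[0])                            # ~ Bin(n, p_1) mean
print(x1.var(), n * p[0] * (1 - p[0]))                # ~ Bin(n, p_1) variance
print(np.mean(x1 == 4), stats.binom(n, p[0]).pmf(4))  # pointwise PMF check
```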
Definition: A random vector $X=(X_1, \cdots, X_k)$ is Multivariate Normal if every linear combination of its components $X_j$ is Normally distributed.
That is, $t_1 X_1 + \cdots + t_k X_k$ is Normal for any choice of constants $t_1, \cdots, t_k$.
When $k=2$, this is called the Bivariate Normal (BVN) distribution.
Theorem: If $(X_1, X_2, X_3)$ is Multivariate Normal, then so is the subvector $(X_1, X_2)$.
Theorem :
…
PDF of $g(X)$ (change of variables, one dimension): Let $X$ be a continuous r.v. with PDF $f_X$, and let $Y=g(X)$, where $g$ is differentiable and strictly monotone. Then the PDF of $Y$ is
$$f_Y(y) = f_X(x) \left\lvert \frac{\mathrm{d}x}{\mathrm{d}y} \right\rvert, \qquad x = g^{-1}(y)$$
Let $X=(X_1, \cdots, X_n)$ be a continuous random vector with joint PDF $f_X(x)$, and let $Y=g(X)$, where $g$ is an invertible, differentiable transformation with $y=g(x)$ and partial derivatives $\frac{\partial x_i}{\partial y_j}$.
Jacobian matrix:
$$\frac{\partial x}{\partial y} = \begin{pmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_n}\\ \vdots & \vdots & & \vdots\\ \frac{\partial x_n}{\partial y_1} & \frac{\partial x_n}{\partial y_2} & \cdots & \frac{\partial x_n}{\partial y_n} \end{pmatrix}$$
Then the joint PDF of $Y$ is
$$f_Y(y) = f_X(x) \left\lvert \frac{\partial x}{\partial y} \right\rvert$$
where $\left\lvert \frac{\partial x}{\partial y} \right\rvert$ denotes the absolute value of the determinant of the Jacobian.
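A minimal numerical illustration of the one-dimensional formula, not from the original notes: the choice $X\sim\mathrm{Expo}(1)$ with $Y=X^2$ is an assumption for demonstration only, and numpy/scipy are assumed available.

```python
# Change of variables check: X ~ Expo(1), Y = X**2, so x = sqrt(y), |dx/dy| = 1/(2*sqrt(y)).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y_samples = rng.exponential(1.0, size=500_000) ** 2

def f_Y(y):
    x = np.sqrt(y)
    return stats.expon.pdf(x) * (1.0 / (2.0 * np.sqrt(y)))  # f_X(x) |dx/dy|

# Compare the formula against a crude histogram estimate on a few points.
grid = np.array([0.25, 1.0, 2.25])
hist_est = [np.mean((y_samples > g - 0.05) & (y_samples < g + 0.05)) / 0.1 for g in grid]
print(f_Y(grid))
print(hist_est)
```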
Theorem (convolution, discrete case): If $X, Y$ are independent discrete r.v.s and $T=X+Y$, then
$$P(T=t) = \sum_x P(Y=t-x)P(X=x) = \sum_y P(X=t-y)P(Y=y)$$
Theorem (convolution, continuous case): If $X, Y$ are independent continuous r.v.s and $T=X+Y$, then
$$f_T(t) = \int_{-\infty}^{\infty} f_Y(t-x)f_X(x)\,dx = \int_{-\infty}^{\infty} f_X(t-y)f_Y(y)\,dy$$
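A tiny worked instance of the discrete convolution formula, not from the original notes: two fair dice are assumed purely for illustration.

```python
# PMF of T = X + Y for two independent fair dice via the convolution sum.
pmf_X = {k: 1 / 6 for k in range(1, 7)}
pmf_Y = {k: 1 / 6 for k in range(1, 7)}

pmf_T = {}
for t in range(2, 13):
    pmf_T[t] = sum(pmf_X[x] * pmf_Y.get(t - x, 0.0) for x in pmf_X)

print(pmf_T[7])             # 6/36 ~ 0.1667
print(sum(pmf_T.values()))  # ~1.0, a valid PMF
```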
CDF & PDF of order statistics 顺序统计量的PDF和CDF :X 1 , ⋯   , X n X_1, \cdots, X_n X 1  , ⋯ , X n  F F F f f f X ( j ) X_{(j)} X ( j )  
P ( X ( j ) ≤ x ) = ∑ k = j n ( n k ) F ( x ) k ( 1 − F ( X ) ) n − k P(X_{(j)}\leq x) = \sum_{k=j}^{n} \binom{n}{k} F(x)^k (1-F(X))^{n-k} P ( X ( j )  ≤ x ) = k = j ∑ n  ( k n  ) F ( x ) k ( 1 − F ( X ) ) n − k 
f X ( j ) ( x ) = n ( n − 1 j − 1 ) f ( x ) F ( x ) j − 1 ( 1 − F ( x ) ) n − j f_{X_{(j)}}(x) = n \binom{n-1}{j-1} f(x) F(x)^{j-1} (1-F(x))^{n-j} f X ( j )   ( x ) = n ( j − 1 n − 1  ) f ( x ) F ( x ) j − 1 ( 1 − F ( x ) ) n − j 
Joint PDF of the order statistics: for $x_1 < x_2 < \cdots < x_n$,
$$f_{X_{(1)}, X_{(2)},\cdots, X_{(n)}}(x_1, \cdots, x_n) = n! \prod_{i=1}^{n} f(x_i)$$
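A simulation sketch of the CDF formula, not from the original notes, assuming numpy/scipy; $n$, $j$, and $x$ are arbitrary choices, and the Uniform(0,1) case is used so that $F(x)=x$.

```python
# Check P(X_(j) <= x) for n i.i.d. Uniform(0,1) draws against simulation.
import numpy as np
from scipy import stats

n, j, x = 5, 2, 0.4
# Formula: sum_{k=j}^{n} C(n,k) F(x)^k (1-F(x))^(n-k), with F(x) = x here.
cdf_formula = sum(stats.binom.pmf(k, n, x) for k in range(j, n + 1))

rng = np.random.default_rng(3)
samples = np.sort(rng.uniform(size=(200_000, n)), axis=1)
cdf_sim = np.mean(samples[:, j - 1] <= x)     # j-th smallest is column j-1

print(cdf_formula, cdf_sim)
# For Uniform(0,1), X_(j) ~ Beta(j, n-j+1), so this should also agree:
print(stats.beta(j, n - j + 1).cdf(x))
```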
Related Identity (connecting the discrete and continuous cases)
Theorem: For $0<p<1$ and integers $0\leq k < n$,
$$\sum_{j=0}^{k} \binom{n}{j} p^j (1-p)^{n-j} = \frac{n!}{k! (n-k-1)!}\int_p^1 x^k (1-x)^{n-k-1}\, dx$$
Beta distribution: $X\sim \mathrm{Beta}(a,b)$ with $a>0$, $b>0$ has PDF (for $0<x<1$)
$$f(x) = \frac{1}{\beta(a,b)} x^{a-1} (1-x)^{b-1}$$
where the beta function is
$$\beta(a,b) = \int_{0}^{1} x^{a-1} (1-x)^{b-1}\,dx$$
$$\beta(a,b) = \frac{(a-1)!(b-1)!}{(a+b-1)!} = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
(the factorial form holds when $a$ and $b$ are positive integers).
$$\int_0^1 \binom{n}{k} x^k (1-x)^{n-k}\, dx = \frac{1}{n+1}$$
for any integers $k$ and $n$ with $0\leq k \leq n$.
Beta–Binomial conjugacy: suppose we observe $k$ successes in $n$ trials and want to estimate the success probability $p$ (rather than only reporting the point estimate $\hat{p}$).
If the prior is $p\sim \mathrm{Beta}(a,b)$, then the posterior is $p\sim \mathrm{Beta}(a+k, b+n-k)$, with posterior mean $E(p\mid \text{data}) = \frac{a+k}{a+b+n}$.
In other words, if the prior is Beta and the data are conditionally Binomial given $p$, the posterior is again Beta: the Beta distribution is the conjugate prior of the Binomial.
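A short sketch of the conjugate update, not from the original notes; the prior parameters and the data below are illustrative assumptions, and scipy is assumed available.

```python
# Beta-Binomial conjugate update: prior Beta(a, b), k successes in n trials,
# posterior Beta(a+k, b+n-k) with mean (a+k)/(a+b+n).
from scipy import stats

a, b = 2.0, 2.0          # prior (illustrative choice)
n, k = 10, 7             # observed data: 7 successes in 10 trials

post = stats.beta(a + k, b + n - k)
print(post.mean(), (a + k) / (a + b + n))   # both ~0.64
print(post.interval(0.95))                  # a 95% posterior credible interval
```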
Gamma function $\Gamma(\cdot)$: for $a>0$,
$$\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\, dx$$
Properties:
$\Gamma(a+1) = a\Gamma(a)$ for $a>0$, and $\Gamma(n) = (n-1)!$ for any positive integer $n$. Gamma distribution: $Y\sim \mathrm{Gamma}(a, \lambda)$ with $a>0$, $\lambda>0$ has PDF
$$f(y) = \frac{1}{\Gamma(a)} (\lambda y)^a e^{-\lambda y} \frac{1}{y}, \qquad y>0$$
$$\mathrm{Gamma}(1, \lambda) = \mathrm{Expo}(\lambda)$$
Gamma as a convolution of Exponentials:
If $X_1, \cdots, X_n$ are i.i.d. $\mathrm{Expo}(\lambda)$, then $X_1 + \cdots + X_n \sim \mathrm{Gamma}(n, \lambda)$.
That is, the $\mathrm{Gamma}(n,\lambda)$ distribution can be viewed as the convolution (sum) of $n$ i.i.d. Exponential distributions.
Beta-Gamma connection (bank-post office story):
If $X\sim \mathrm{Gamma}(a, \lambda)$ and $Y\sim \mathrm{Gamma}(b, \lambda)$ are independent, then
$$X+Y \sim \mathrm{Gamma}(a+b, \lambda), \qquad \frac{X}{X+Y} \sim \mathrm{Beta}(a, b)$$
and they are independent.
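A simulation sketch of the story, not from the original notes: the shape parameters are arbitrary assumptions, $\lambda=1$ is used for convenience, and numpy is assumed available.

```python
# Bank-post office check: X ~ Gamma(a,1), Y ~ Gamma(b,1) independent implies
# X+Y ~ Gamma(a+b,1), X/(X+Y) ~ Beta(a,b), and the two are uncorrelated in the sample.
import numpy as np

rng = np.random.default_rng(4)
a, b = 3.0, 5.0
X = rng.gamma(shape=a, scale=1.0, size=300_000)
Y = rng.gamma(shape=b, scale=1.0, size=300_000)

T, W = X + Y, X / (X + Y)
print(T.mean(), a + b)            # Gamma(a+b, 1) has mean a+b
print(W.mean(), a / (a + b))      # Beta(a, b) has mean a/(a+b)
print(np.corrcoef(T, W)[0, 1])    # close to 0, consistent with independence
```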
Marginal distributions via conditioning (hybrid forms):
$X$ discrete, $Y$ discrete: $P(X=x) = \sum_y P(X=x\mid Y=y) P(Y=y)$
$X$ discrete, $Y$ continuous: $P(X=x) = \int_{-\infty}^{\infty} P(X=x\mid Y=y) f_Y(y)\, dy$
$X$ continuous, $Y$ discrete: $f_X(x) = \sum_y f_X(x\mid Y=y) P(Y=y)$
$X$ continuous, $Y$ continuous: $f_X(x) = \int_{-\infty}^{\infty} f_{X\mid Y} (x\mid y) f_Y(y)\, dy$
Bayes' rule (hybrid forms):
$X$ discrete, $Y$ discrete: $P(Y=y\mid X=x) = \frac{P(X=x\mid Y=y)P(Y=y)}{P(X=x)}$
$X$ discrete, $Y$ continuous: $f_Y(y\mid X=x) = \frac{P(X=x\mid Y=y)f_Y(y)}{P(X=x)}$
$X$ continuous, $Y$ discrete: $P(Y=y\mid X=x) = \frac{f_X(x\mid Y=y)P(Y=y)}{f_X(x)}$
$X$ continuous, $Y$ continuous: $f_{Y\mid X} (y\mid x) = \frac{f_{X\mid Y}(x\mid y) f_Y(y)}{f_X(x)}$
MAP rule: given the observed value $x$, the MAP rule selects the value $\hat{\theta}$ of $\theta$ that maximizes the posterior, $p_{\Theta\mid X} (\theta\mid x)$ in the discrete case or $f_{\Theta\mid X} (\theta\mid x)$ in the continuous case.
Conditional expectation given an event:
$$E(Y\mid A) = \sum_y y\, P(Y=y\mid A) \qquad\text{or}\qquad E(Y\mid A) = \int_{-\infty}^{\infty} y f(y\mid A)\, dy$$
Interpretation: over many repetitions of the experiment, $E(Y\mid A)$ is the long-run average of $Y$ restricted to those repetitions in which $A$ occurs.
LOTE (law of total expectation): for a partition $A_1,\cdots,A_n$ of the sample space,
$$E(Y) = \sum_{i=1}^{n} E(Y\mid A_i) P(A_i)$$
Definition: Conditional expectation given an r.v.
$$g(x) = E(Y\mid X=x)$$
Then $E(Y\mid X) = g(X)$: it is itself a random variable, and it is a function of $X$.
Dropping what's independent:
If $X$ and $Y$ are independent, then $E(Y\mid X) = E(Y)$.
Taking out what's known:
For any function $h$, $E(h(X)Y\mid X) = h(X) E(Y\mid X)$.
Linearity:
$$E(Y_1 + Y_2\mid X) = E(Y_1\mid X) + E(Y_2 \mid X)$$
Adam's Law (iterated expectation, the "nesting doll" rule):
$$E(E(Y\mid X)) = E(Y)$$
Adam's Law with extra conditioning:
$$E(E(Y\mid X, Z)\mid Z) = E(Y\mid Z)$$
$$E(E(X\mid Z, Y)\mid Y) = E(X\mid Y)$$
Conditional variance:
$$\mathrm{Var}(Y\mid X) = E[(Y-E(Y\mid X))^2\mid X]$$
$$\mathrm{Var}(Y\mid X) = E(Y^2\mid X) - (E(Y\mid X))^2$$
Eve's Law (law of total variance):
$$\mathrm{Var}(Y) = E(\mathrm{Var}(Y\mid X)) + \mathrm{Var}(E(Y\mid X))$$
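A simulation sketch verifying Adam's and Eve's laws, not from the original notes; the hierarchical model below (Poisson then Binomial) is an illustrative assumption, and numpy is assumed available.

```python
# Check Adam's and Eve's laws for X ~ Pois(5) and Y | X ~ Bin(X, 0.3).
import numpy as np

rng = np.random.default_rng(5)
X = rng.poisson(5, size=500_000)
Y = rng.binomial(X, 0.3)

E_Y_given_X = 0.3 * X            # E(Y|X) for Bin(X, p) with p = 0.3
Var_Y_given_X = X * 0.3 * 0.7    # Var(Y|X)

print(Y.mean(), E_Y_given_X.mean())                       # Adam: E(E(Y|X)) = E(Y)
print(Y.var(), Var_Y_given_X.mean() + E_Y_given_X.var())  # Eve: total variance
```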
The linear regression model uses a single explanatory variable $X$ to predict a response variable $Y$, assuming that the conditional expectation of $Y$ is linear in $X$: $E(Y\mid X) = a+bX$.
An equivalent way to express this is to write $Y=a+bX+\epsilon$, where $\epsilon$ is an error term with $E(\epsilon\mid X)=0$.
$$\begin{cases} a = E(Y) - bE(X) = E(Y) - \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}\cdot E(X) \\ b = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} \end{cases}$$
The LLSE (linear least squares estimate) of $Y$ given $X$, written $L[Y\mid X]$, is the linear function $a+bX$ that minimizes $E[(Y-a-bX)^2]$:
$$L[Y\mid X] = E(Y) + \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} (X-E(X))$$
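A minimal sketch computing the LLSE coefficients from simulated data, not from the original notes; the data-generating line ($a=1$, $b=3$) is an arbitrary assumption, and numpy is assumed available.

```python
# LLSE from sample moments: b = Cov(X,Y)/Var(X), a = E(Y) - b E(X).
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(2.0, 1.0, size=200_000)
Y = 1.0 + 3.0 * X + rng.normal(0.0, 0.5, size=200_000)   # true line: a=1, b=3

b = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)
a = Y.mean() - b * X.mean()
print(a, b)   # approximately 1 and 3
```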
The MMSE (minimum mean square error) estimate of $Y$ given $X$ is
$$g(X) = E(Y\mid X)$$
Orthogonality: the estimation error is orthogonal to every function of $X$,
$$Y-E(Y\mid X) \perp h(X), \qquad\text{i.e.}\qquad E\big((Y-E(Y\mid X)) \cdot h(X)\big) = 0$$
Theorem:
(a) For any function $\phi(\cdot)$, $E\big((Y-E(Y\mid X)) \cdot \phi(X)\big) = 0$.
(b) Moreover, if the function $g(X)$ satisfies $E\big((Y-g(X)) \cdot \phi(X)\big) = 0$ for every function $\phi$, then $g(X) = E(Y\mid X)$.
Theorem: Let $X$ and $Y$ be jointly Gaussian. Then the MMSE and LLSE coincide:
$$E[Y\mid X] = L[Y\mid X] = E(Y) + \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} (X-E(X))$$
The MLE maximizes the joint probability (or density) of the observed data over $\theta$: $\hat{\theta}_n = \arg\max_{\theta} P_X(x_1, \cdots, x_n; \theta)$. The observations of the $X_i$ are collected in $x = (x_1, \cdots, x_n)$.
Log-likelihood function:
$$\log\left[P_X(x_1, \cdots, x_n;\theta)\right] = \log \prod_{i=1}^{n} P_{X_i} (x_i;\theta) = \sum_{i=1}^{n} \log\left[ P_{X_i}(x_i;\theta) \right]$$
$$\log\left[f_X(x_1, \cdots, x_n;\theta)\right] = \log \prod_{i=1}^{n} f_{X_i} (x_i;\theta) = \sum_{i=1}^{n} \log\left[ f_{X_i}(x_i;\theta) \right]$$
MLE under the independent case:
$$\hat{\theta}_n = \arg\max_{\theta} \sum_{i=1}^{n} \log\left[P_{X_i}(x_i;\theta) \right] \qquad\text{or}\qquad \hat{\theta}_n = \arg\max_{\theta} \sum_{i=1}^{n} \log\left[f_{X_i}(x_i;\theta) \right]$$
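A small MLE sketch, not from the original notes: an i.i.d. $\mathrm{Expo}(\lambda)$ model is assumed purely for illustration, and numpy/scipy are assumed available. The numerical maximizer should agree with the closed-form MLE $\hat{\lambda} = 1/\bar{x}$.

```python
# MLE for i.i.d. Expo(lambda): log-likelihood n*log(lambda) - lambda*sum(x).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
x = rng.exponential(scale=1 / 2.5, size=10_000)   # true lambda = 2.5

def neg_log_lik(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / x.mean())   # numerical MLE vs. closed-form MLE
```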
Central Limit Theorem (CLT): Let $X_1, \cdots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. As $n\to\infty$,
$$\sqrt{n} \left(\frac{\overline{X_n} - \mu}{\sigma} \right) \to \mathcal{N}(0,1)$$
in distribution. In words, the CDF of the left-hand side approaches the CDF of the standard normal distribution.
For large $n$, $\overline{X_n}$ is approximately $\mathcal{N}(\mu, \sigma^2/n)$, and the sum $n\overline{X_n}$ is approximately $\mathcal{N}(n\mu, n\sigma^2)$. (That is, when $n$ is large, the sample mean is approximately normal no matter what the distribution of the $X_i$ is.)
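A simulation sketch of the CLT, not from the original notes: $\mathrm{Expo}(1)$ summands (so $\mu=\sigma=1$) and the sample sizes are arbitrary assumptions, and numpy/scipy are assumed available.

```python
# Standardized sample means of Expo(1) draws compared with the standard normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, reps = 200, 100_000
means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)   # Expo(1): mu = sigma = 1
Z = np.sqrt(n) * (means - 1.0) / 1.0

print(np.mean(Z <= 1.0), stats.norm.cdf(1.0))   # ~0.84 on both sides
print(Z.mean(), Z.std())                        # ~0 and ~1
```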
Poisson convergence to normal
Let $Y\sim \mathrm{Pois}(n)$; it can be viewed as a sum of $n$ i.i.d. $\mathrm{Pois}(1)$ r.v.s, so for large $n$, approximately $Y\sim \mathcal{N}(n, n)$.
Gamma convergence to normal
Let $Y\sim \mathrm{Gamma}(n, \lambda)$; it is a sum of $n$ i.i.d. $\mathrm{Expo}(\lambda)$ r.v.s, so for large $n$, approximately $Y\sim \mathcal{N}(\frac{n}{\lambda}, \frac{n}{\lambda^2})$.
Binomial convergence to normal
Let $Y\sim \mathrm{Bin}(n,p)$; it is a sum of $n$ i.i.d. $\mathrm{Bern}(p)$ r.v.s, so for large $n$, approximately $Y\sim \mathcal{N}(np, np(1-p))$.
With the continuity correction ($\Phi$ is the standard normal CDF):
$$P(Y=k) = P\Big(k-\tfrac{1}{2} < Y < k+\tfrac{1}{2}\Big) \approx \Phi\left(\frac{k + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{k - \frac{1}{2} - np}{\sqrt{np(1-p)}}\right)$$
$$P(k\leq Y\leq I) = P\Big(k-\tfrac{1}{2} < Y < I+\tfrac{1}{2}\Big) \approx \Phi\left(\frac{I + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{k - \frac{1}{2} - np}{\sqrt{np(1-p)}}\right)$$
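A numerical check of the continuity correction, not from the original notes; the values of $n$, $p$, and $k$ are arbitrary assumptions, and scipy/numpy are assumed available.

```python
# Normal approximation to Bin(n, p) with continuity correction vs. the exact PMF.
import numpy as np
from scipy import stats

n, p, k = 100, 0.3, 28
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = stats.binom.pmf(k, n, p)
approx = stats.norm.cdf((k + 0.5 - mu) / sd) - stats.norm.cdf((k - 0.5 - mu) / sd)
print(exact, approx)   # the two values should be close
```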
Let $X_1, \cdots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. The sample mean $\overline{X_n} = \frac{1}{n}\sum_{j=1}^n X_j$ has mean $\mu$ and variance $\sigma^2 / n$.
The sample mean is itself a random variable, and its variance goes to zero as $n$ tends to infinity.
Strong Law of Large Numbers (SLLN): the sample mean $\overline{X_n}$ converges to $\mu$ as $n\to\infty$ with probability $1$, i.e. $P(\overline{X_n}\to\mu) = 1$.
Weak Law of Large Numbers (WLLN): for all $\epsilon > 0$, $P(\lvert \overline{X_n} - \mu \rvert > \epsilon) \to 0$ as $n\to\infty$.
Cauchy-Schwarz inequality: for any r.v.s $X, Y$,
$$\lvert E(XY)\rvert \leq \sqrt{E(X^2)E(Y^2)}$$
A consequence (via Cauchy-Schwarz):
$$P(X=0)\leq \frac{\mathrm{Var}(X)}{E(X^2)}$$
Convexity: a function $f$ is convex if for $0\leq \lambda_1, \lambda_2 \leq 1$ with $\lambda_1 + \lambda_2 = 1$ and any $x_1, x_2$ in its domain,
$$f(\lambda_1 x_1 + \lambda_2 x_2) \leq \lambda_1 f(x_1) + \lambda_2 f(x_2)$$
Jensen's inequality: let $X$ be an r.v. If $g$ is convex, then $E(g(X))\geq g(E(X))$; if $g$ is concave, then $E(g(X))\leq g(E(X))$. Equality holds if and only if $g(X) = a+bX$ for some constants $a, b$.
Entropy: for a discrete r.v. $X$ taking $n$ values with probabilities $p_1,\cdots,p_n$, the entropy of $X$ is
$$H(X) = \sum_{j=1}^{n} p_j \log_2 \left(\frac{1}{p_j}\right)$$
Using Jensen's inequality one can show that the entropy is maximized when $X$ is uniform.
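A tiny numerical illustration of that remark, not from the original notes; the PMFs below are arbitrary assumptions and numpy is assumed available.

```python
# Among PMFs on n values, the uniform PMF attains the maximum entropy log2(n).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log2(1/0) is taken to be 0
    return np.sum(p * np.log2(1.0 / p))

n = 4
print(entropy(np.ones(n) / n), np.log2(n))   # uniform: entropy = log2(n) = 2
print(entropy([0.7, 0.1, 0.1, 0.1]))         # a non-uniform PMF gives a smaller value
```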