Importance sampling
Clash Royale CLAN TAG#URR8PPP
In statistics, importance sampling is a general technique for estimating properties of a particular distribution, while only having samples generated from a different distribution than the distribution of interest. It is related to umbrella sampling in computational physics. Depending on the application, the term may refer to the process of sampling from this alternative distribution, the process of inference, or both.
Contents
1 Basic theory
2 Application to probabilistic inference
3 Application to simulation
3.1 Mathematical approach
3.2 Conventional biasing methods
3.2.1 Scaling
3.2.2 Translation
3.3 Effects of system complexity
3.4 Evaluation of importance sampling
3.5 Variance cost function
3.6 Multiple and adaptive importance sampling
4 See also
5 Notes
6 References
7 External links
Basic theory
Let X:Ω→Rdisplaystyle X:Omega to mathbb R be a random variable in some probability space (Ω,F,P)displaystyle (Omega ,mathcal F,P). We wish to estimate the expected value of X under P, denoted E[X;P]. If we have statistically independent random samples x1,…,xndisplaystyle x_1,ldots ,x_n, generated according to P, then an empirical estimate of E[X;P] is
- E^n[X;P]=1n∑i=1nxidisplaystyle widehat mathbf E _n[X;P]=frac 1nsum _i=1^nx_i
and the precision of this estimate depends on the variance of X,
- var[E^n;P]=var[X;P]/ndisplaystyle operatorname var [widehat mathbf E _n;P]=operatorname var [X;P]/n
The basic idea of importance sampling is to sample the states from a different distribution to lower the variance of the estimation of E[X;P], or when sampling from P is difficult.
This is accomplished by first choosing a random variable L≥0displaystyle Lgeq 0 such that E[L;P]=1 and that P-almost everywhere L(ω)≠0displaystyle L(omega )neq 0.
With the variate L we define a probability P(L)displaystyle P^(L) that satisfies
- E[X;P]=E[XL;P(L)].displaystyle mathbf E [X;P]=mathbf E left[frac XL;P^(L)right].
The variable X/L will thus be sampled under P(L) to estimate E[X;P] as above and this estimation is improved when
var[XL;P(L)]<var[X;P]displaystyle operatorname var left[frac XL;P^(L)right]<operatorname var [X;P].
When X is of constant sign over Ω, the best variable L would clearly be L∗=XE[X;P]≥0displaystyle L^*=frac Xmathbf E [X;P]geq 0, so that X/L* is the searched constant E[X;P] and a single sample under P(L*) suffices to give its value. Unfortunately we cannot take that choice, because E[X;P] is precisely the value we are looking for! However this theoretical best case L* gives us an insight into what importance sampling does:
- ∀a∈R,P(L∗)(X∈[a;a+da])=∫ω∈X∈[a;a+da]X(ω)E[X;P]dP(ω)=1E[X;P]aP(X∈[a;a+da])displaystyle beginalignedforall ain mathbb R ,;P^(L^*)(Xin [a;a+da])&=int _omega in Xin [a;a+da]frac X(omega )E[X;P]dP(omega )\&=frac 1E[X;P];a,P(Xin [a;a+da])endaligned
to the right, aP(X∈[a;a+da])displaystyle a,P(Xin [a;a+da]) is one of the infinitesimal elements that sum up to E[X;P]:
- E[X;P]=∫a=−∞+∞aP(X∈[a;a+da])displaystyle E[X;P]=int _a=-infty ^+infty a,P(Xin [a;a+da])
therefore, a good probability change P(L) in importance sampling will redistribute the law of X so that its samples' frequencies are sorted directly according to their weights in E[X;P]. Hence the name "importance sampling."
Importance sampling is often used as a Monte Carlo integrator.
When Pdisplaystyle P is the uniform distribution and Ω=Rdisplaystyle Omega =mathbb R , E[X;P] corresponds to the integral of the real function X:R→Rdisplaystyle X:mathbb R to mathbb R .
Application to probabilistic inference
Such methods are frequently used to estimate posterior densities or expectations in state and/or parameter estimation problems in probabilistic models that are too hard to treat analytically, for example in Bayesian networks.
Application to simulation
Importance sampling is a variance reduction technique that can be used in the Monte Carlo method. The idea behind importance sampling is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. If these "important" values are emphasized by sampling more frequently, then the estimator variance can be reduced. Hence, the basic methodology in importance sampling is to choose a distribution which "encourages" the important values. This use of "biased" distributions will result in a biased estimator if it is applied directly in the simulation. However, the simulation outputs are weighted to correct for the use of the biased distribution, and this ensures that the new importance sampling estimator is unbiased. The weight is given by the likelihood ratio, that is, the Radon–Nikodym derivative of the true underlying distribution with respect to the biased simulation distribution.
The fundamental issue in implementing importance sampling simulation is the choice of the biased distribution which encourages the important regions of the input variables. Choosing or designing a good biased distribution is the "art" of importance sampling. The rewards for a good distribution can be huge run-time savings; the penalty for a bad distribution can be longer run times than for a general Monte Carlo simulation without importance sampling.
Consider Xdisplaystyle X to be the sample and f(X)g(X)displaystyle frac f(X)g(X) to be the likelihood ratio, where fdisplaystyle f is the probability density (mass) function of the desired distribution and gdisplaystyle g is the probability density (mass) function of the biased/proposal/sample distribution. Then the problem can be characterized by choosing the sample distribution gdisplaystyle g that minimizes the variance of the scaled sample:
g∗=mingVarg(Xf(X)g(X))displaystyle g^*=min _goperatorname Var _gleft(Xfrac f(X)g(X)right).
It can be shown that the following distribution minimizes the above variance:[1]
- g∗(X)=|X|f(X)∫|x|f(x)dx.displaystyle g^*(X)=frac Xx.
It is easy to see that when X≥0displaystyle Xgeq 0, this variance becomes 0.
Mathematical approach
Consider estimating by simulation the probability ptdisplaystyle p_t, of an event X≥t displaystyle Xgeq t , where Xdisplaystyle X is a random variable with distribution Fdisplaystyle F and probability density function f(x)=F′(x)displaystyle f(x)=F'(x),, where prime denotes derivative. A Kdisplaystyle K-length independent and identically distributed (i.i.d.) sequence Xidisplaystyle X_i, is generated from the distribution Fdisplaystyle F, and the number ktdisplaystyle k_t of random variables that lie above the threshold tdisplaystyle t are counted. The random variable ktdisplaystyle k_t is characterized by the Binomial distribution
- P(kt=k)=(Kk)ptk(1−pt)K−k,k=0,1,…,K.displaystyle P(k_t=k)=K choose kp_t^k(1-p_t)^K-k,,quad quad k=0,1,dots ,K.
One can show that E[kt/K]=ptdisplaystyle operatorname E [k_t/K]=p_t, and var[kt/K]=pt(1−pt)/Kdisplaystyle operatorname var [k_t/K]=p_t(1-p_t)/K, so in the limit K→∞displaystyle Kto infty we are able to obtain ptdisplaystyle p_t. Note that the variance is low if pt≈1displaystyle p_tapprox 1. Importance sampling is concerned with the determination and use of an alternate density function f∗displaystyle f_*,(for Xdisplaystyle X), usually referred to as a biasing density, for the simulation experiment. This density allows the event X≥t displaystyle Xgeq t to occur more frequently, so the sequence lengths Kdisplaystyle K gets smaller for a given estimator variance. Alternatively, for a given Kdisplaystyle K, use of the biasing density results in a variance smaller than that of the conventional Monte Carlo estimate. From the definition of ptdisplaystyle p_t,, we can introduce f∗displaystyle f_*, as below.
- pt=E[1(X≥t)]=∫1(x≥t)f(x)f∗(x)f∗(x)dx=E∗[1(X≥t)W(X)]displaystyle beginalignedp_t&=E[1(Xgeq t)]\&=int 1(xgeq t)frac f(x)f_*(x)f_*(x),dx\&=E_*[1(Xgeq t)W(X)]endaligned
where
- W(⋅)≡f(⋅)f∗(⋅)displaystyle W(cdot )equiv frac f(cdot )f_*(cdot )
is a likelihood ratio and is referred to as the weighting function. The last equality in the above equation motivates the estimator
- p^t=1K∑i=1K1(Xi≥t)W(Xi),Xi∼f∗displaystyle hat p_t=frac 1K,sum _i=1^K1(X_igeq t)W(X_i),,quad quad X_isim f_*
This is the importance sampling estimator of ptdisplaystyle p_t, and is unbiased. That is, the estimation procedure is to generate i.i.d. samples from f∗displaystyle f_*, and for each sample which exceeds tdisplaystyle t,, the estimate is incremented by the weight Wdisplaystyle W, evaluated at the sample value. The results are averaged over Kdisplaystyle K, trials. The variance of the importance sampling estimator is easily shown to be
- var∗p^t=1Kvar∗[1(X≥t)W(X)]=1KE∗[1(X≥t)2W2(X)]−pt2=1KE[1(X≥t)W(X)]−pt2displaystyle beginalignedoperatorname var _*hat p_t&=frac 1Koperatorname var _*[1(Xgeq t)W(X)]\&=frac 1KleftE_*[1(Xgeq t)^2W^2(X)]-p_t^2right\&=frac 1KleftE[1(Xgeq t)W(X)]-p_t^2rightendaligned
Now, the importance sampling problem then focuses on finding a biasing density f∗displaystyle f_*, such that the variance of the importance sampling estimator is less than the variance of the general Monte Carlo estimate. For some biasing density function, which minimizes the variance, and under certain conditions reduces it to zero, it is called an optimal biasing density function.
Conventional biasing methods
Although there are many kinds of biasing methods, the following two methods are most widely used in the applications of importance sampling.
Scaling
Shifting probability mass into the event region X≥t displaystyle Xgeq t by positive scaling of the random variable Xdisplaystyle X, with a number greater than unity has the effect of increasing the variance (mean also) of the density function. This results in a heavier tail of the density, leading to an increase in the event probability. Scaling is probably one of the earliest biasing methods known and has been extensively used in practice. It is simple to implement and usually provides conservative simulation gains as compared to other methods.
In importance sampling by scaling, the simulation density is chosen as the density function of the scaled random variable aXdisplaystyle aX,, where usually a>1displaystyle a>1 for tail probability estimation. By transformation,
- f∗(x)=1af(xa)displaystyle f_*(x)=frac 1afbigg (frac xabigg ),
and the weighting function is
- W(x)=af(x)f(x/a)displaystyle W(x)=afrac f(x)f(x/a),
While scaling shifts probability mass into the desired event region, it also pushes mass into the complementary region X<tdisplaystyle X<t, which is undesirable. If Xdisplaystyle X, is a sum of ndisplaystyle n, random variables, the spreading of mass takes place in an ndisplaystyle n, dimensional space. The consequence of this is a decreasing importance sampling gain for increasing ndisplaystyle n,, and is called the dimensionality effect.
A modern version of importance sampling by scaling is e.g. so-called sigma-scaled sampling (SSS) which is running multiple MC analysis with different scaling factors. In opposite to many other high yield estimation methods (like worst-case distances WCD) SSS does not suffer much from the dimensionality problem. Also addressing multiple MC outputs causes no degradation in efficiency. On the other hand, as WCD, SSS is only designed for Gaussian statistical variables, and in opposite to WCD, the SSS method is not designed to provide accurate statistical corners.
Translation
Another simple and effective biasing technique employs translation of the density function (and hence random variable) to place much of its probability mass in the rare event region. Translation does not suffer from a dimensionality effect and has been successfully used in several applications relating to simulation of digital communication systems. It often provides better simulation gains than scaling. In biasing by translation, the simulation density is given by
- f∗(x)=f(x−c),c>0displaystyle f_*(x)=f(x-c),quad c>0,
where cdisplaystyle c, is the amount of shift and is to be chosen to minimize the variance of the importance sampling estimator.
Effects of system complexity
The fundamental problem with importance sampling is that designing good biased distributions becomes more complicated as the system complexity increases. Complex systems are the systems with long memory since complex processing of a few inputs is much easier to handle. This dimensionality or memory can cause problems in three ways:
- long memory (severe intersymbol interference (ISI))
- unknown memory (Viterbi decoders)
- possibly infinite memory (adaptive equalizers)
In principle, the importance sampling ideas remain the same in these situations, but the design becomes much harder. A successful approach to combat this problem is essentially breaking down a simulation into several smaller, more sharply defined subproblems. Then importance sampling strategies are used to target each of the simpler subproblems. Examples of techniques to break the simulation down are conditioning and error-event simulation (EES) and regenerative simulation.
Evaluation of importance sampling
In order to identify successful importance sampling techniques, it is useful to be able to quantify the run-time savings due to the use of the importance sampling approach. The performance measure commonly used is σMC2/σIS2displaystyle sigma _MC^2/sigma _IS^2,, and this can be interpreted as the speed-up factor by which the importance sampling estimator achieves the same precision as the MC estimator. This has to be computed empirically since the estimator variances are not likely to be analytically possible when their mean is intractable. Other useful concepts in quantifying an importance sampling estimator are the variance bounds and the notion of asymptotic efficiency. One related measure is the so-called Effective Sample Size (ESS).[2]
Variance cost function
Variance is not the only possible cost function for a simulation, and other cost functions, such as the mean absolute deviation, are used in various statistical applications. Nevertheless, the variance is the primary cost function addressed in the literature, probably due to the use of variances in confidence intervals and in the performance measure σMC2/σIS2displaystyle sigma _MC^2/sigma _IS^2,.
An associated issue is the fact that the ratio σMC2/σIS2displaystyle sigma _MC^2/sigma _IS^2, overestimates the run-time savings due to importance sampling since it does not include the extra computing time required to compute the weight function. Hence, some people evaluate the net run-time improvement by various means. Perhaps a more serious overhead to importance sampling is the time taken to devise and program the technique and analytically derive the desired weight function.
Multiple and adaptive importance sampling
When different proposal distributions, gn(x)displaystyle g_n(x) , n=1,…,N,displaystyle n=1,ldots ,N, are jointly used for drawing the samples x1,⋯,xN,displaystyle x_1,cdots ,x_N, different proper weighting functions can be employed (e.g., see [3][4][5][6]). In an adaptive setting, the proposal distributions, gn,t(x)displaystyle g_n,t(x) , n=1,…,N,displaystyle n=1,ldots ,N, and t=1,…,T,displaystyle t=1,ldots ,T, are updated each iteration tdisplaystyle t of the adaptive importance sampling algorithm. Hence, since a population of proposal densities is used, several suitable combinations of sampling and weighting schemes can be employed.[7][8][9][10][11][12][13]
See also
- Monte Carlo method
- Variance reduction
- Stratified sampling
- Recursive stratified sampling
- VEGAS algorithm
Particle filter — a sequential Monte Carlo method, which uses importance sampling- Auxiliary field Monte Carlo
- Rejection sampling
Notes
^ Rubinstein, R. Y., & Kroese, D. P. (2011). Simulation and the Monte Carlo method (Vol. 707). John Wiley & Sons.
^ Martino, Luca; Elvira, Víctor; Louzada, Francisco. "Effective sample size for importance sampling based on discrepancy measures". Signal Processing. 131: 386–401. arXiv:1602.03572. doi:10.1016/j.sigpro.2016.08.025..mw-parser-output cite.citationfont-style:inherit.mw-parser-output qquotes:"""""""'""'".mw-parser-output code.cs1-codecolor:inherit;background:inherit;border:inherit;padding:inherit.mw-parser-output .cs1-lock-free abackground:url("//upload.wikimedia.org/wikipedia/commons/thumb/6/65/Lock-green.svg/9px-Lock-green.svg.png")no-repeat;background-position:right .1em center.mw-parser-output .cs1-lock-limited a,.mw-parser-output .cs1-lock-registration abackground:url("//upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Lock-gray-alt-2.svg/9px-Lock-gray-alt-2.svg.png")no-repeat;background-position:right .1em center.mw-parser-output .cs1-lock-subscription abackground:url("//upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Lock-red-alt-2.svg/9px-Lock-red-alt-2.svg.png")no-repeat;background-position:right .1em center.mw-parser-output .cs1-subscription,.mw-parser-output .cs1-registrationcolor:#555.mw-parser-output .cs1-subscription span,.mw-parser-output .cs1-registration spanborder-bottom:1px dotted;cursor:help.mw-parser-output .cs1-hidden-errordisplay:none;font-size:100%.mw-parser-output .cs1-visible-errorfont-size:100%.mw-parser-output .cs1-subscription,.mw-parser-output .cs1-registration,.mw-parser-output .cs1-formatfont-size:95%.mw-parser-output .cs1-kern-left,.mw-parser-output .cs1-kern-wl-leftpadding-left:0.2em.mw-parser-output .cs1-kern-right,.mw-parser-output .cs1-kern-wl-rightpadding-right:0.2em
^ Veach, Eric; Guibas, Leonidas J. (1995-01-01). "Optimally Combining Sampling Techniques for Monte Carlo Rendering". Proceedings of the 22Nd Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH '95. New York, NY, USA: ACM: 419–428. doi:10.1145/218380.218498. ISBN 0-89791-701-4.
^ Owen, Art; Associate, Yi Zhou (2000-03-01). "Safe and Effective Importance Sampling". Journal of the American Statistical Association. 95 (449): 135–143. doi:10.1080/01621459.2000.10473909. ISSN 0162-1459.
^ Elvira, V.; Martino, L.; Luengo, D.; Bugallo, M.F. (2015-10-01). "Efficient Multiple Importance Sampling Estimators". IEEE Signal Processing Letters. 22 (10): 1757–1761. arXiv:1505.05391. Bibcode:2015ISPL...22.1757E. doi:10.1109/LSP.2015.2432078. ISSN 1070-9908.
^ Elvira, Víctor; Martino, Luca; Luengo, David; Bugallo, Mónica F. "Improving population Monte Carlo: Alternative weighting and resampling schemes". Signal Processing. 131: 77–91. arXiv:1607.02758. doi:10.1016/j.sigpro.2016.07.012.
^ Cappé, O.; Guillin, A.; Marin, J. M.; Robert, C. P. (2004-12-01). "Population Monte Carlo". Journal of Computational and Graphical Statistics. 13 (4): 907–929. doi:10.1198/106186004X12803. ISSN 1061-8600.
^ Martino, L.; Elvira, V.; Luengo, D.; Corander, J. (2017-05-01). "Layered adaptive importance sampling". Statistics and Computing. 27 (3): 599–623. arXiv:1505.04732. doi:10.1007/s11222-016-9642-5. ISSN 0960-3174.
^ Cappé, Olivier; Douc, Randal; Guillin, Arnaud; Marin, Jean-Michel; Robert, Christian P. (2008-04-25). "Adaptive importance sampling in general mixture classes". Statistics and Computing. 18 (4): 447–459. arXiv:0710.4242. doi:10.1007/s11222-008-9059-x. ISSN 0960-3174.
^ Cornuet, Jean-Marie; Marin, Jean-Michel; Mira, Antonietta; Robert, Christian P. (2012-12-01). "Adaptive Multiple Importance Sampling". Scandinavian Journal of Statistics. 39 (4): 798–812. arXiv:0907.1254. doi:10.1111/j.1467-9469.2011.00756.x. ISSN 1467-9469.
^ Martino, L.; Elvira, V.; Luengo, D.; Corander, J. (2015-08-01). "An Adaptive Population Importance Sampler: Learning From Uncertainty". IEEE Transactions on Signal Processing. 63 (16): 4422–4437. Bibcode:2015ITSP...63.4422M. doi:10.1109/TSP.2015.2440215. ISSN 1053-587X.
^ Bugallo, Mónica F.; Martino, Luca; Corander, Jukka (2015-12-01). "Adaptive importance sampling in signal processing". Digital Signal Processing. Special Issue in Honour of William J. (Bill) Fitzgerald. 47: 36–49. doi:10.1016/j.dsp.2015.05.014.
^ Bugallo, M. F.; Elvira, V.; Martino, L.; Luengo, D.; Miguez, J.; Djuric, P. M. (July 2017). "Adaptive Importance Sampling: The past, the present, and the future". IEEE Signal Processing Magazine. 34 (4): 60–79. Bibcode:2017ISPM...34...60B. doi:10.1109/msp.2017.2699226. ISSN 1053-5888.
References
Arouna, Bouhari (2004). "Adaptative Monte Carlo Method, A Variance Reduction Technique". Monte Carlo Methods and Their Applications. 10 (1): 1–24. doi:10.1515/156939604323091180.
Bucklew, James Antonio (2004). Introduction to Rare Event Simulation. New York: Springer-Verlag.
Doucet, A.; de Freitas, N.; Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Springer. ISBN 978-0-387-95146-1.
Ferrari, M.; Bellini, S. (2001). "Importance Sampling simulation of turbo product codes". The IEEE International Conference on Communications. 9: 2773–2777. doi:10.1109/ICC.2001.936655.
Mazonka, Oleg (2016). "Easy as Pi: The Importance Sampling Method" (PDF). Journal of Reference. 16.
Oberg, Tommy (2001). Modulation, Detection, and Coding. New York: John Wiley & Sons.
Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007). "Section 7.9.1 Importance Sampling". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.
Ripley, B. D. (1987). Stochastic Simulation. Wiley & Sons.
Smith, P. J.; Shafi, M.; Gao, H. (1997). "Quick simulation: A review of importance sampling techniques in communication systems". IEEE J. Select. Areas Commun. 15 (4): 597–613. doi:10.1109/49.585771.
Srinivasan, R. (2002). Importance sampling – Applications in communications and detection. Berlin: Springer-Verlag.
External links
Sequential Monte Carlo Methods (Particle Filtering) homepage on University of Cambridge
Introduction to importance sampling in rare-event simulations European journal of Physics. PDF document.
Adaptive monte carlo methods for rare event simulation: adaptive monte carlo methods for rare event simulations Winter Simulation Conference