How to speed up evaluation of expression (multiple variables) to gain speed? Compile?

I obtained some long expression and need to use them with a lot of different parameters inside a simulation. However, evaluation of the expressions is pretty slow, so that I would like to speed it up. The first (and only) thing that came to my mind was Compile. But, unfortunately, speed is not improved significantly.

Compile

This is an example of the shortest expression (unable to paste the larger ones because they exceed the limits on pastebin):

fun[tt_, sig_, time_, expon_, bShift_, lam_] = Uncompress@Import["https://pastebin.com/raw/nj83nT2b"]; funComp = Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real, fun[tt, sig, time, expon, bShift, lam]]; fun[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming (* 0.0836853, 0.045048 *) funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming (* 0.0718691, 0.045048 *)

As you can see, Compile basically has no effect. The bigger issue is that the harmful expressions I am dealing with rather take 10s to evaluate and I'd like to see one or more orders of magnitude speed up (if possible). I hope that the small example here is essentially limited by the same effect as the bigger expressions, so that it serves as a proper toy example.

Compile

Background on fun: It essentially is composed of sums and products of Gaussians and their derivatives to higher orders (also powers of them).

fun

Is there a way to significantly speed up the evaluation process? Potentially by more advanced usage of Compile or some other trick? Currently this is a serious bottleneck in my simulations.

Compile

You haven't generated the expression for the function fun by hand, have you? Please give us the code that produces this expression. If it contains many Sum: That's great because the Sums can be more efficiently compiled as this humongous symbolic expression. Plus one might use vectorized code instead of Compile which may perform even better.
– Henrik Schumacher
Aug 23 at 9:13

fun

Sum

Compile

@HenrikSchumacher Sorry, with "obtained" I mean from external sources. The derivation does not explicitly contain Sum in the sense of using the corresponding MMA function. Indeed, the humongous expression was obtained "by hand" in the form of manually adding/multiplying Gaussians with certain parameters. Was "vectorized code" explicitly referring to a possible solution of Sums were used or is it a general idea?
– Lukas
Aug 23 at 9:19

Sum

Well, vectorization often boils down to vector-vector and matrix-vector products, so many sums can be replace by vectorized code. But also many built-in arithmetic operations are already vectorized (in particular, they have the attribute Listable). For example, if a is a list of machine-precision real or complex numbers (and a should also be a packed array), Exp[a] will be compute much faster than Exp /@ a or Table[ Exp[x],x,a]. And Total[Exp[a]] is mush faster than Sum[Exp[x], x, a].
– Henrik Schumacher
Aug 23 at 9:22

Listable

a

Exp[a]

Exp /@ a

Table[ Exp[x],x,a]

Total[Exp[a]]

Sum[Exp[x], x, a]

Hm. What is the external source? If it is wirtten in another programming language, maybe its code can be processed to a simple symbolic Mathematica expression (and a list of replacement rules for the constants).
– Henrik Schumacher
Aug 23 at 9:31

@HenrikSchumacher External source is someone who is working with me. I will get in touch to see if things can be boiled down so that vectorization may be used. But the answer you provide below already helps significantly, so a major step towards bearable evaluation times can be made!
– Lukas
Aug 23 at 10:04

1 Answer
1

The code for fun[tt, sig, time, expon, bShift, lam]] was not correctly inlined into the compiled function funComp. Most robust ways is to use With as follows. I also apply N to avoid some integer to double typecasts that would otherwise take place at runtime.

fun[tt, sig, time, expon, bShift, lam]]

funComp

With

N

funComp2 = With[code = N[fun[tt, sig, time, expon, bShift, lam]], Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real, code, CompilationTarget -> "C" ] ];

(Compilation takes longer now because now there happens a lot.)

Resulting timings on my machine:

fun[3.4, 10./4, 10., 3, 1.3456, Sqrt[2.]] // AbsoluteTiming funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming

0.072316, 0.045048

0.000197, 0.045048

PS. You might get a further minor performance boost by changing expon, _Integer to expon, _Real because there are still some integer to double typecasts left in the C code. But that was hardly measurable with the simple test I tried so far.

expon, _Integer

expon, _Real

PPS. You might probably want to apply this function to large lists of inputs. The the following way to compile the function (with options RuntimeAttributes -> Listable, Parallelization -> True) might be your friend:

RuntimeAttributes -> Listable, Parallelization -> True

funComp3 = With[code = N[fun[tt, sig, time, expon, bShift, lam]], Compile[tt, _Real, sig, _Real, time, _Real, expon, _Real, bShift, _Real, lam, _Real, code, CompilationTarget -> "C", RuntimeAttributes -> Listable, Parallelization -> True ] ];

Here some timing examples.

a = RandomReal[-1, 1, 100]; r1 = fun[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First r2 = fun[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First r3 = funComp3[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First r4 = funComp3[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First errors = Max[Abs[r1 - r2]], Max[Abs[r2 - r3]], Max[Abs[r3 - r4]]

9.128

0.151

0.00040

0.000047

2.60209*10^-18, 6.07153*10^-18, 0.

This is brilliant! Thank you very much for the detailed analysis. It helps a lot!
– Lukas
Aug 23 at 10:02

You're welcome!
– Henrik Schumacher
Aug 23 at 10:11

Wow, that's quite an improvment! I am a little bit confused tho which function is which. Is the very first code block supposed to define funComp, not funComp2? Or the funComp is the OP's function, and you have two examples with timing results for two different funComp2, as in the 1st and 3rd code sections (they differ by Listable and Parallel)?
– kkm
Aug 29 at 7:25

funComp

funComp2

funComp

funComp2

Listable

Parallel

@kkm Well observed! I corrected that and also replaced AbsoluteTiming by RepeatedTiming. The latter takes much longer to compute a timing estimate but it is also much more accurate, in particular with very small timings. In order to clarify: The Parallelization->True gets only active if (i) the compiled function has also the attribute Listable and (ii) the function is also called in the appropriate way: For r3, funComp3 is Mapped over the input vector; this does not exploit the Listable property while r4 profits from both listability and (a bit) from parallelization.
– Henrik Schumacher
Aug 29 at 7:39

AbsoluteTiming

RepeatedTiming

Parallelization->True

Listable

r3

funComp3

Map

Listable

r4

It's clear now which is which, thanks! As for the parallel matrix ops, If MMA uses MKL automatic parallelization, then it depends a lot of CPU architecture, problem size, L3 cache, but the most on the present alignment of stars. I've experienced bizarre results with a pure C++ programs, like a VM (Hyper-V) benefitting from parallel computation greatly, being on the order of a few times faster, while its own host physical computer experiencing some degradation on the same problem. Same OS, same MKL version, same program and task, and VM computed faster than its host. Black magic ,no less!
– kkm
Aug 29 at 7:48

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Dfyjkt