How to speed up evaluation of expression (multiple variables) to gain speed? Compile?
How to speed up evaluation of expression (multiple variables) to gain speed? Compile?
I obtained some long expression and need to use them with a lot of different parameters inside a simulation. However, evaluation of the expressions is pretty slow, so that I would like to speed it up. The first (and only) thing that came to my mind was Compile
. But, unfortunately, speed is not improved significantly.
Compile
This is an example of the shortest expression (unable to paste the larger ones because they exceed the limits on pastebin):
fun[tt_, sig_, time_, expon_, bShift_, lam_] = Uncompress@Import["https://pastebin.com/raw/nj83nT2b"];
funComp = Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real, fun[tt, sig, time, expon, bShift, lam]];
fun[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* 0.0836853, 0.045048 *)
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* 0.0718691, 0.045048 *)
As you can see, Compile
basically has no effect. The bigger issue is that the harmful expressions I am dealing with rather take 10s to evaluate and I'd like to see one or more orders of magnitude speed up (if possible). I hope that the small example here is essentially limited by the same effect as the bigger expressions, so that it serves as a proper toy example.
Compile
Background on fun
: It essentially is composed of sums and products of Gaussians and their derivatives to higher orders (also powers of them).
fun
Is there a way to significantly speed up the evaluation process? Potentially by more advanced usage of Compile
or some other trick? Currently this is a serious bottleneck in my simulations.
Compile
fun
Sum
Sum
Compile
@HenrikSchumacher Sorry, with "obtained" I mean from external sources. The derivation does not explicitly contain
Sum
in the sense of using the corresponding MMA function. Indeed, the humongous expression was obtained "by hand" in the form of manually adding/multiplying Gaussians with certain parameters. Was "vectorized code" explicitly referring to a possible solution of Sum
s were used or is it a general idea?– Lukas
Aug 23 at 9:19
Sum
Sum
Well, vectorization often boils down to vector-vector and matrix-vector products, so many sums can be replace by vectorized code. But also many built-in arithmetic operations are already vectorized (in particular, they have the attribute
Listable
). For example, if a
is a list of machine-precision real or complex numbers (and a should also be a packed array), Exp[a]
will be compute much faster than Exp /@ a
or Table[ Exp[x],x,a]
. And Total[Exp[a]]
is mush faster than Sum[Exp[x], x, a]
.– Henrik Schumacher
Aug 23 at 9:22
Listable
a
Exp[a]
Exp /@ a
Table[ Exp[x],x,a]
Total[Exp[a]]
Sum[Exp[x], x, a]
Hm. What is the external source? If it is wirtten in another programming language, maybe its code can be processed to a simple symbolic Mathematica expression (and a list of replacement rules for the constants).
– Henrik Schumacher
Aug 23 at 9:31
@HenrikSchumacher External source is someone who is working with me. I will get in touch to see if things can be boiled down so that vectorization may be used. But the answer you provide below already helps significantly, so a major step towards bearable evaluation times can be made!
– Lukas
Aug 23 at 10:04
1 Answer
1
The code for fun[tt, sig, time, expon, bShift, lam]]
was not correctly inlined into the compiled function funComp
. Most robust ways is to use With
as follows. I also apply N
to avoid some integer to double typecasts that would otherwise take place at runtime.
fun[tt, sig, time, expon, bShift, lam]]
funComp
With
N
funComp2 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real,
code,
CompilationTarget -> "C"
]
];
(Compilation takes longer now because now there happens a lot.)
Resulting timings on my machine:
fun[3.4, 10./4, 10., 3, 1.3456, Sqrt[2.]] // AbsoluteTiming
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
0.072316, 0.045048
0.000197, 0.045048
PS. You might get a further minor performance boost by changing expon, _Integer
to expon, _Real
because there are still some integer to double typecasts left in the C code. But that was hardly measurable with the simple test I tried so far.
expon, _Integer
expon, _Real
PPS. You might probably want to apply this function to large lists of inputs. The the following way to compile the function (with options RuntimeAttributes -> Listable, Parallelization -> True
) might be your friend:
RuntimeAttributes -> Listable, Parallelization -> True
funComp3 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
Compile[tt, _Real, sig, _Real, time, _Real, expon, _Real, bShift, _Real, lam, _Real,
code,
CompilationTarget -> "C",
RuntimeAttributes -> Listable,
Parallelization -> True
]
];
Here some timing examples.
a = RandomReal[-1, 1, 100];
r1 = fun[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r2 = fun[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
r3 = funComp3[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r4 = funComp3[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
errors = Max[Abs[r1 - r2]], Max[Abs[r2 - r3]], Max[Abs[r3 - r4]]
9.128
0.151
0.00040
0.000047
2.60209*10^-18, 6.07153*10^-18, 0.
This is brilliant! Thank you very much for the detailed analysis. It helps a lot!
– Lukas
Aug 23 at 10:02
You're welcome!
– Henrik Schumacher
Aug 23 at 10:11
Wow, that's quite an improvment! I am a little bit confused tho which function is which. Is the very first code block supposed to define
funComp
, not funComp2
? Or the funComp
is the OP's function, and you have two examples with timing results for two different funComp2
, as in the 1st and 3rd code sections (they differ by Listable
and Parallel
)?– kkm
Aug 29 at 7:25
funComp
funComp2
funComp
funComp2
Listable
Parallel
@kkm Well observed! I corrected that and also replaced
AbsoluteTiming
by RepeatedTiming
. The latter takes much longer to compute a timing estimate but it is also much more accurate, in particular with very small timings. In order to clarify: The Parallelization->True
gets only active if (i) the compiled function has also the attribute Listable
and (ii) the function is also called in the appropriate way: For r3
, funComp3
is Map
ped over the input vector; this does not exploit the Listable
property while r4
profits from both listability and (a bit) from parallelization.– Henrik Schumacher
Aug 29 at 7:39
AbsoluteTiming
RepeatedTiming
Parallelization->True
Listable
r3
funComp3
Map
Listable
r4
It's clear now which is which, thanks! As for the parallel matrix ops, If MMA uses MKL automatic parallelization, then it depends a lot of CPU architecture, problem size, L3 cache, but the most on the present alignment of stars. I've experienced bizarre results with a pure C++ programs, like a VM (Hyper-V) benefitting from parallel computation greatly, being on the order of a few times faster, while its own host physical computer experiencing some degradation on the same problem. Same OS, same MKL version, same program and task, and VM computed faster than its host. Black magic ,no less!
– kkm
Aug 29 at 7:48
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
You haven't generated the expression for the function
fun
by hand, have you? Please give us the code that produces this expression. If it contains manySum
: That's great because theSum
s can be more efficiently compiled as this humongous symbolic expression. Plus one might use vectorized code instead ofCompile
which may perform even better.– Henrik Schumacher
Aug 23 at 9:13