Using std::chrono::steady_clock to benchmark code in a thread/async
Suppose I have lots of computations that I want to run (and benchmark CPU time) in multiple threads. As a toy example:
#include <chrono>
#include <future>
#include <iostream>
#include <vector>

using unit_t = std::chrono::nanoseconds;

unit_t::rep expensive_computation()
{
    auto start = std::chrono::steady_clock::now();
    // Something time-consuming here...
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<unit_t>(end - start).count();
    return duration;
}

int main()
{
    std::vector<std::future<unit_t::rep>> computations;
    for (int i = 0; i < 100; i++)
        computations.push_back(std::async(expensive_computation));

    for (size_t i = 0; i < computations.size(); i++)
    {
        auto duration = computations[i].get();
        std::cout << "#" << i << " took " << duration << "ns" << std::endl;
    }
}
I'm concerned that since steady_clock is monotonic across threads, the underlying clock ticks per process and not per thread (if any thread is scheduled, the clock ticks for all threads). This would mean that if a thread were sleeping, steady_clock would still be ticking for it and this time would incorrectly be included in the duration for that thread. Is my suspicion correct? Or does steady_clock only tick for CPU time within a thread?
Put another way, is this approach a safe way to independently time lots of computations (such that no CPU time spent on one thread will affect the duration of another thread)? Or do I need to spin off separate processes for each computation to make the steady_clock only tick when the computation is running/scheduled?
edit: I also recognize that spinning up more threads than cores may be an inefficient approach to this problem (although I don't particularly care about computation throughput; I just want them all, as a group, to complete in the fastest time). I suspect in practice I'd need to maintain a small, bounded set of threads in flight (say, capped at the number of cores) and only start new computations as a core becomes available. But this shouldn't have an impact on the timing I care about above; it should only affect the wall-clock time.
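Roughly, something like the sketch below is what I have in mind (just an illustration; the cap via std::thread::hardware_concurrency() and the harvest-the-oldest-future order are arbitrary choices on my part, and expensive_computation is the function from above):

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

using unit_t = std::chrono::nanoseconds;
unit_t::rep expensive_computation();  // same as above

int main()
{
    const std::size_t total = 100;
    // Cap the number of computations in flight at the core count (fall back to 1).
    const std::size_t max_in_flight = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::future<unit_t::rep>> in_flight;
    std::vector<unit_t::rep> durations;
    std::size_t launched = 0;

    while (durations.size() < total)
    {
        // Top up the in-flight set.
        while (launched < total && in_flight.size() < max_in_flight)
        {
            in_flight.push_back(std::async(std::launch::async, expensive_computation));
            ++launched;
        }
        // Harvest the oldest future (blocks until that computation finishes).
        durations.push_back(in_flight.front().get());
        in_flight.erase(in_flight.begin());
    }

    for (std::size_t i = 0; i < durations.size(); ++i)
        std::cout << "#" << i << " took " << durations[i] << "ns\n";
}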
I think you should measure on a 100% idle machine. In this case it doesn't really matter which clock you use (if you measure things on a loaded machine, your benchmark could be inaccurate, even if you measure thread time: HyperThreading, cache usage, etc. could affect results).
– geza
Aug 26 at 15:26
A future-ready check before the future.get() should improve the code. At ns resolution, which is also in the CPU domain, a check of the system's minimum clock resolution may be essential.
– seccpur
Aug 26 at 15:27
@geza This is a very good point. I'm currently working to secure an environment like that. One of the concerns I'd have even on a 100% idle machine is that some computations take significantly longer than others (and those would perhaps be more impacted by the load, potentially artificially increasing their time). But I suspect this is largely unavoidable.
– Bailey Parker
Aug 26 at 15:37
@seccpur What do you mean by a "future ready check"? Does future.get() not call future.wait()? In the toy example, I'm not particularly concerned with the latency of results (just the overall runtime). Also, the ns resolution was an artifact of this being a toy example. My real benchmark resolution will probably be μs or ms.
– Bailey Parker
Aug 26 at 15:39
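For reference, one way to read "a future ready check" is to poll the future without blocking and only call get() once it reports ready. A minimal sketch (the is_ready helper is purely illustrative, not something from <future> or from the comment author):

#include <chrono>
#include <future>

// Returns true if the future's result is available, without blocking.
template <class T>
bool is_ready(const std::future<T>& f)
{
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}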
@BaileyParker:"One of the concerns I'd have even on a 100% idle machine is that some computations take significantly longer than others". What do you mean by this? (just a note: remember to turn off CPU frequency scaling).
– geza
Aug 26 at 15:49
2 Answers
The standard specifies that steady_clock models physical time (as opposed to CPU time).
From [time.clock.steady]:
Objects of class steady_clock represent clocks for which values of time_point never decrease as physical time advances and for which values of time_point advance at a steady rate relative to real time. That is, the clock may not be adjusted.
That being said, how well an implementation models physical time is a quality-of-implementation (QOI) issue. Nevertheless, your code looks fine to me.
Should your experiments prove unsatisfactory, clients of <chrono> can also author their own custom clocks, which will have first-class status within the <chrono> library.
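For instance, such a custom clock could wrap a per-thread CPU-time source. A minimal sketch, assuming a POSIX system where clock_gettime with CLOCK_THREAD_CPUTIME_ID is available (the name thread_cpu_clock and the design are only one possible approach, not something prescribed by the answer):

#include <chrono>
#include <time.h>  // POSIX: clock_gettime, CLOCK_THREAD_CPUTIME_ID

// Sketch of a <chrono>-compatible clock that reports CPU time consumed by
// the calling thread rather than physical time.
struct thread_cpu_clock
{
    using duration   = std::chrono::nanoseconds;
    using rep        = duration::rep;
    using period     = duration::period;
    using time_point = std::chrono::time_point<thread_cpu_clock>;
    static constexpr bool is_steady = true;  // never decreases for the calling thread

    static time_point now() noexcept
    {
        timespec ts{};
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);  // CPU time of this thread
        return time_point(duration(ts.tv_sec * 1000000000LL + ts.tv_nsec));
    }
};

// Usage mirrors steady_clock: call thread_cpu_clock::now() at the start and
// end of the computation, then duration_cast the difference as before.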
Oops, seems like I misunderstood here then. I will try it out with the real computations and see if this is satisfactory (although determining if there is interference will likely prove tricky). While researching, I did come across pthread_getcpuclockid, so perhaps a custom clock wrapping that could be more accurate? As @geza notes in the comments though, my goal of having times equivalent to serial execution is likely unachievable given caching, hyperthreading, etc. Thanks for your insight!
– Bailey Parker
Aug 26 at 15:33
This would mean that if a thread were sleeping, steady_clock would
still be ticking for it and this time would incorrectly be included in
the duration for that thread.
That won't be incorrect, though, as the standard specifies that class std::chrono::steady_clock measures physical time, not CPU time or any other time. See [time.clock.steady]:
Objects of class steady_clock represent clocks for which values of time_point never decrease as physical time advances and for which values of time_point advance at a steady rate relative to real time ...
That said, your code looks fine in that it will give you the time measured for each thread's run. Is there a reason for you to want to measure CPU time here? If so, let me know in the comments.
Thanks for the standard links. I will read up! Physical time is perhaps okay (in my specific case, the computations do no I/O). My real goal here is that the timings produced should be equivalent to what I'd get if all the computations had been run and timed serially. As long as this is the case, that is fine.
– Bailey Parker
Aug 26 at 15:26
@BaileyParker Yes, this is the case. Glad I could help.
– SkepticalEmpiricist
Aug 26 at 15:27
@BaileyParker Judging from your comment on the other answer, I may not have fully understood your requirement for accuracy here. The code in your post isn't mathematically equivalent to running the same tasks serially; it's only equivalent in terms of measuring the total time to execute all the tasks.
– SkepticalEmpiricist
Aug 26 at 16:05