Using std::chrono::steady_clock to benchmark code in a thread/async

Suppose I have lots of computations that I want to run (and benchmark CPU time) in multiple threads. As a toy example:


#include <chrono>
#include <future>
#include <iostream>
#include <vector>

using unit_t = std::chrono::nanoseconds;

unit_t::rep expensive_computation()
{
    auto start = std::chrono::steady_clock::now();
    // Something time-consuming here...
    auto end = std::chrono::steady_clock::now();

    auto duration = std::chrono::duration_cast<unit_t>(end - start).count();

    return duration;
}

int main()
{
    std::vector<std::future<unit_t::rep>> computations;

    for (int i = 0; i < 100; i++) {
        computations.push_back(std::async(expensive_computation));
    }

    for (size_t i = 0; i < computations.size(); i++) {
        auto duration = computations[i].get();
        std::cout << "#" << i << " took " << duration << "ns" << std::endl;
    }
}


I'm concerned that since steady_clock is monotonic across threads, the underlying clock ticks per process and not per thread (if any thread is scheduled, the clock ticks for all threads). This would mean that if a thread were sleeping, steady_clock would still be ticking for it, and that time would incorrectly be included in that thread's duration. Is my suspicion correct? Or does steady_clock tick only during a thread's own CPU time?


Put another way, is this approach a safe way to independently time lots of computations (such that no CPU time spent on one thread will affect the duration of another thread)? Or do I need to spin off separate processes for each computation to make the steady_clock only tick when the computation is running/scheduled?



edit: I also recognize that spinning up more threads than cores may be an inefficient approach to this problem (although I don't particularly care about computation throughput; I just want them all, as a group, to complete in the fastest time). I suspect in practice I'd need to maintain a small, bounded list of threads in flight (say, capped at the number of cores) and only start new computations as a core becomes available. But this shouldn't have an impact on the timing I care about above; it should only affect the wall-clock time.





I think you should measure on a 100% idle machine. In this case it doesn't really matter which clock you use (if you measure things on a loaded machine, your benchmark could be inaccurate, even if you measure thread time: HyperThreading, cache usage, etc. could affect results).
– geza
Aug 26 at 15:26





Checking that the future is ready before calling future.get() would improve the code. Also, at ns resolution, which is down in the CPU's domain, checking the system clock's minimum resolution may be essential.
– seccpur
Aug 26 at 15:27






@geza This is a very good point. I'm currently working to secure an environment like that. One of the concerns I'd have even on a 100% idle machine is that some computations take significantly longer than others (and those would perhaps be more impacted by the load, potentially artificially increasing their time). But I suspect this is largely unavoidable.
– Bailey Parker
Aug 26 at 15:37





@seccpur What do you mean by a "future ready check?" Does future.get() not call future.wait()? In the toy example, I'm not particularly concerned with the latency of results (just the overall runtime). Also, the ns resolution was an artifact of this being a toy example. My real benchmark resolution will probably be μs or ms.
– Bailey Parker
Aug 26 at 15:39




@BaileyParker:"One of the concerns I'd have even on a 100% idle machine is that some computations take significantly longer than others". What do you mean by this? (just a note: remember to turn off CPU frequency scaling).
– geza
Aug 26 at 15:49





2 Answers



The standard specifies that steady_clock model physical time (as opposed to CPU time).



From [time.clock.steady]:



Objects of class steady_clock represent clocks for which values of time_point never decrease as physical time advances and for which values of time_point advance at a steady rate relative to real time. That is, the clock may not be adjusted.



That being said, how well an implementation models physical time is a quality-of-implementation (QOI) issue. Nevertheless, your code looks fine to me.



Should your experimentations prove unsatisfactory, clients of <chrono> can also author their own custom clocks that will have first class status within the <chrono> library.





Oops, seems like I misunderstood here then. I will try it out with the real computations and see if this is satisfactory (although determining if there is interference will likely prove tricky). While researching, I did come across pthread_getcpuclockid, so perhaps a custom clock wrapping that could be more accurate? As @geza notes in the comments though, my goal of having times equivalent to serial execution is likely unachievable given caching, hyperthreading, etc. Thanks for your insight!
– Bailey Parker
Aug 26 at 15:33



This would mean that if a thread were sleeping, steady_clock would
still be ticking for it and this time would incorrectly be included in
the duration for that thread.



That wouldn't be incorrect, though, because the standard specifies that std::chrono::steady_clock measures physical time, not CPU time or any other kind of time. See here under [time.clock.steady]:



Objects of class steady_clock represent clocks for which values of
time_point never decrease as physical time advances and for which
values of time_point advance at a steady rate relative to real time ...



That said, your code looks fine in that it will give you the time measured for each thread run. Is there a reason for you to want to measure CPU time here? If so then let me know in the comments.





Thanks for the standard links. I will read up! Physical time perhaps is okay (in my specific case, computations do no I/O). My real goal here is that the timings produced should be equivalent to if all the computations had been run and timed serially. As long as this is the case, then that is fine.
– Bailey Parker
Aug 26 at 15:26






@BaileyParker Yes, this is the case. Glad I could help.
– SkepticalEmpiricist
Aug 26 at 15:27






@BaileyParker Seeing your comment on the other answer, I may not have fully understood your requirement for accuracy here. The code in your post isn't mathematically equivalent to running the same tasks serially; it's only equivalent in terms of the total time to execute all the tasks.
– SkepticalEmpiricist
Aug 26 at 16:05






