
>clock_gettime(CLOCK_MONOTONIC, &ts);

IIRC this call has an overhead of a few ns. Isn't that close to, or even higher than, the time it takes to perform the action being measured on every loop? If so, the author is largely just measuring clock_gettime. This can be confirmed on the author's system by calling clock_gettime in the loop without the measured action (movntdqa / cache flush) and comparing the results. An alternative approach is to call clock_gettime before and after the loop rather than on every iteration, then divide by the iteration count to get an average.



So long as the overhead is predictable (which is an interesting assumption), it'll just add a constant offset and a bit of noise, and won't affect the frequency analysis.


Taking an average is the wrong technique here. The author is trying to measure variance between equivalent code due to hardware events. I mean, on average, DRAM access is very fast! But sometimes it's not, and that's what the article is about.

And in any case, on a modern x86 system that call (which goes through the vDSO rather than making a syscall) just reads the TSC and scales the result. That's going to be reliable and predictable, and much faster than the DRAM refresh excursions being measured.


clock_gettime overhead is on the order of ~20 ns (or about 60 cycles) on most systems implementing VDSO gettime, and is quite stable. The author is trying to measure something on the order of 75 ns.


I've actually been measuring the clock_gettime(CLOCK_REALTIME) vDSO call lately, and even when it's hot (i.e., already in the L1I cache) it still takes ~350 ticks as measured by rdtscp. I even have an open Stack Overflow question on this.

https://stackoverflow.com/questions/53252050/why-does-the-ca...

How are you getting 60 cycles?


I have measured it several times in various places with fairly consistent results. Of course, if you are on a platform which doesn't offer VDSO for your clock, or which disables or virtualizes `rdtsc` then the results could be much longer.

One of the places I measure it is in uarch-bench [1], where running `uarch-bench --clock-overhead` produces this output:

    ----- Clock Stats --------
                                                      Resolution (ns)               Runtime (ns)
                           Name                        min/  med/  avg/  max         min/  med/  avg/  max
                     StdClockAdapt<system_clock>      25.0/ 27.0/ 27.0/ 29.0        27.1/ 27.4/ 27.6/ 30.6
                     StdClockAdapt<steady_clock>      25.0/ 26.0/ 26.9/ 94.0        27.0/ 27.0/ 27.1/ 32.6
            StdClockAdapt<high_resolution_clock>      26.0/ 27.0/ 27.0/ 28.0        27.1/ 27.5/ 27.7/ 30.0
                  GettimeAdapter<CLOCK_REALTIME>      25.0/ 26.0/ 25.7/ 27.0        25.1/ 25.5/ 25.6/ 48.3
           GettimeAdapter<CLOCK_REALTIME_COARSE>       0.0/  0.0/  0.0/  0.0         7.2/  7.3/  7.3/  7.3
                 GettimeAdapter<CLOCK_MONOTONIC>      24.0/ 25.0/ 25.5/ 27.0        24.7/ 24.7/ 24.9/ 27.2
          GettimeAdapter<CLOCK_MONOTONIC_COARSE>       0.0/  0.0/  0.0/  0.0         7.0/  7.2/  7.2/  7.3
             GettimeAdapter<CLOCK_MONOTONIC_RAW>     355.0/358.0/357.8/361.0       357.4/358.2/358.1/360.5
        GettimeAdapter<CLOCK_PROCESS_CPUTIME_ID>     432.0/437.0/436.4/440.0       434.7/436.0/436.2/440.9
         GettimeAdapter<CLOCK_THREAD_CPUTIME_ID>     422.0/426.0/426.1/431.0       424.6/427.1/427.2/430.4
                  GettimeAdapter<CLOCK_BOOTTIME>     363.0/365.0/365.3/368.0       364.2/364.5/364.7/367.7
                                       DumbClock       0.0/  0.0/  0.0/  0.0         0.0/  0.0/  0.0/  0.0

The Runtime column shows the cost. Ignoring DumbClock (which is a dummy inline implementation returning constant zero), note that the clocks basically fall into three groups: around 7 ns, 25-27 ns, and 300-400 ns.

The 7 ns group are those that are implemented just by reading a shared memory location, and don't need any rdtsc call at all. The downside, of course, is that this location is only updated periodically (usually during the scheduler tick), so the resolution is limited.

The 25ish ns group are those that are implemented in the VDSO - they need to do an rdtsc call, which is maybe half the time, and then do some math to turn this into a usable time. Note that CLOCK_REALTIME falls into this group on my system.

The 300+ ns group are those that need a system call. This used to be ~100 ns until Spectre and Meltdown mitigations happened. Some of these cannot easily be implemented in the VDSO (e.g., those that return process-specific data), and some could be but simply haven't been.

For what it's worth, I wasn't able to reproduce your results from the SO question. Using your own test program (only modified to print the time per call), running it with no sleep and 10000 loops gives:

    $ ./clockt 0 10 10000
    init run 15256
    trial 0 took 659834 (65 cycles per call)
    trial 1 took 659674 (65 cycles per call)
    trial 2 took 659578 (65 cycles per call)
    trial 3 took 659550 (65 cycles per call)
    trial 4 took 659548 (65 cycles per call)
    trial 5 took 659556 (65 cycles per call)
    trial 6 took 659552 (65 cycles per call)
    trial 7 took 659556 (65 cycles per call)
    trial 8 took 659546 (65 cycles per call)
    trial 9 took 659544 (65 cycles per call)
On my 2.6 GHz system, 65 cycles corresponds to 25 ns, so those results are exactly consistent with the uarch-bench results shown above. So either your system is weird, or you weren't running enough loops, or ... I'm not sure.

[1] https://github.com/travisdowns/uarch-bench



