Re: [PATCH] vfs: replace current_kernel_time64 with ktime equivalent

Arnd Bergmann <arnd@xxxxxxxx> · Wed, 20 Jun 2018 18:14:50 +0200

On Wed, Jun 20, 2018 at 5:40 PM, Andi Kleen <ak@xxxxxxxxxxxxxxx> wrote:
> Arnd Bergmann <arnd@xxxxxxxx> writes:
>>
>> I traced the original addition of the current_kernel_time() call to set
>> the nanosecond fields back to linux-2.5.48, where Andi Kleen added a
>> patch with subject "nanosecond stat timefields". This adds the original
>> call to current_kernel_time and the truncation to the resolution of the
>> file system, but makes no mention of the intended accuracy.  At the time,
>> we had a do_gettimeofday() interface that on some architectures could
>> return a microsecond-resolution timestamp, but there was no interface
>> for getting an accurate timestamp in nanosecond resolution, neither inside
>> the kernel nor from user space. This makes me suspect that the use of
>> coarse timestamps was never really a conscious decision but instead
>> a result of whatever API was available 16 years ago.
>
> Kind of. VFS/system calls are expensive enough that you need multiple us
> in and out so us resolution was considered good enough.

To clarify: current_kernel_time() has at best millisecond resolution, not
microsecond, since tkr_mono.xtime_nsec only gets updated during the
timer tick.

Has that time scale changed over the past 16 years as CPUs got faster
(and system call entry slowed down again with recent changes)?

I tried a simple test in the shell, on tmpfs here, and saw:

$ for i in `seq -w 100000` ; do > $i ; done
$ stat * | less | grep Modify | uniq -c | head
    601 Modify: 2018-06-20 18:04:48.794314629 +0200
    920 Modify: 2018-06-20 18:04:48.798314691 +0200
    936 Modify: 2018-06-20 18:04:48.802314753 +0200
    937 Modify: 2018-06-20 18:04:48.806314816 +0200
    901 Modify: 2018-06-20 18:04:48.810314878 +0200
    929 Modify: 2018-06-20 18:04:48.814314940 +0200
    931 Modify: 2018-06-20 18:04:48.818315002 +0200
    894 Modify: 2018-06-20 18:04:48.822315064 +0200
    952 Modify: 2018-06-20 18:04:48.826315128 +0200
    898 Modify: 2018-06-20 18:04:48.830315190 +0200

which indicates that the result of ktime_get_coarse_real_ts64()
gets updated every four milliseconds here (matching the
CONFIG_HZ_250 setting in my running kernel), and that
we can create around 900 files during that time that each
get the same timestamp (strace shows 10 system calls for
each new file). Trying the same on btrfs, I get around 260
files per jiffy.
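
For anyone who wants to repeat the measurement, here is a rough sketch of
it as a script, together with the arithmetic behind the per-file numbers.
The helper names count_by_mtime and per_file_us are made up for this
example, the scratch directory comes from mktemp (so it is only on tmpfs
if TMPDIR points there), and "stat -c" assumes GNU coreutils.

```shell
#!/bin/sh

# Hypothetical helper: create N empty files in a scratch directory and
# print how many of them share each modification timestamp.
count_by_mtime() {
    n=$1
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        i=0
        while [ "$i" -lt "$n" ]; do
            : > "f$i"
            i=$((i + 1))
        done
        # GNU stat: %y prints the mtime including nanoseconds.
        find . -type f -exec stat -c '%y' {} + | sort | uniq -c
    )
    rm -rf "$dir"
}

# Hypothetical helper: average time per file in microseconds, given the
# tick rate (CONFIG_HZ) and the number of files that share one timestamp.
per_file_us() {
    awk -v hz="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1000000 / hz / n }'
}
```

With the numbers above, "per_file_us 250 900" gives 4.44, so roughly
0.44 us per system call at 10 calls per file on tmpfs, and
"per_file_us 250 260" gives 15.38 for the btrfs case.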

> Also if you do this change you really need to do some benchmarks,
> especially on setups without lazy atime. This might potentially
> cause a lot more inode flushes.

Good point. On the other hand, there may be some reasons to
do it even if there is a noticeable overhead, in cases where we
actually want hires timestamps, so perhaps this could be
a mount option.

     Arnd