Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

Jeff Layton <jlayton@xxxxxxxxxx> · Thu, 19 Oct 2023 18:41:09 -0400

On Fri, 2023-10-20 at 00:00 +0200, Thomas Gleixner wrote:
> Jeff!
> 
> On Wed, Oct 18 2023 at 13:41, Jeff Layton wrote:
> > +void ktime_get_mg_fine_ts64(struct timespec64 *ts)
> > +{
> > +	struct timekeeper *tk = &tk_core.timekeeper;
> > +	unsigned long flags;
> > +	u32 nsecs;
> > +
> > +	WARN_ON(timekeeping_suspended);
> > +
> > +	raw_spin_lock_irqsave(&timekeeper_lock, flags);
> > +	write_seqcount_begin(&tk_core.seq);
> 
> Depending on the usage scenario, this will end up as a scalability issue
> which affects _all_ of timekeeping.
> 
> The usage of timekeeper_lock and the sequence count has been carefully
> crafted to be as non-contended as possible. We went a great length to
> optimize that because the ktime_get*() functions are really hotpath all
> over the place.
> 

> Exposing such an interface which wreckages that is a recipe for disaster
> down the road. It might be a non-issue today, but once we hit the
> bottleneck of that global lock, we are up the creek without a
> paddle.
> 

Thanks for taking the explanation, Thomas. That's understandable, and
that was my main worry with this set. I'll look at doing this another
way given your feedback. I just started by plumbing this into the
timekeeping code since that seemed like the most obvious place to do it.

I think it's probably still possible to do this by caching the values
returned by the timekeeper at the vfs layer, but there seems to be some
reticence to the basic idea that I don't quite understand yet.

> Well not really, but all we can do then is fall back to
> ktime_get_real(). So let me ask the obvious question:
> 
>      Why don't we do that right away?
> 
> Many moons ago when we added ktime_get_real_coarse() the main reason was
> that reading the time from the underlying hardware was insanely
> expensive.
> 
> Many moons later this is not true anymore, except for the stupid case
> where the BIOS wreckaged the TSC, but that's a hopeless case for
> performance no matter what. Optimizing for that would be beyond stupid.
> 
> I'm well aware that ktime_get_real_coarse() is still faster than
> ktime_get_real() in micro-benchmarks, i.e. 5ns vs. 15ns on the four
> years old laptop I'm writing this.
> 
> Many moons ago it was in the ballpark of 40ns vs. 5us due to TSC being
> useless and even TSC read was way more expensive (factor 8-10x IIRC) in
> comparison. That really mattered for FS, but does todays overhead still
> make a difference in the real FS use case scenario?
> 
> I'm not in the position of running meaningful FS benchmarks to analyze
> that, but I think the delta between ktime_get_real_coarse() and
> ktime_get_real() on contemporary hardware is small enough that it
> justifies this question.
> 
> The point is that both functions have pretty much the same D-cache
> pattern because they access the same data in the very same
> cacheline. The only difference is the actual TSC read and the extra
> conversion, but that's it. The TSC read has been massively optimized by
> the CPU vendors. I know that the ARM64 counter has been optimized too,
> though I have no idea about PPC64 and S390, but I would be truly
> surprised if they didn't optimize the hell out of it because time read
> is really used heavily both in kernel and user space.
> 
> Does anyone have numbers on contemporary hardware to shed some light on
> that in the context of FS and the problem at hand?

That was sort of my suspicion and it's good to have confirmation that
fetching a fine-grained timespec64 from the timekeeper is cheap. It
looked that way when I was poking around in there, but I wasn't sure
whether it was always the case.

It turns out however that the main benefit of using a coarse-grained
timestamp is that it allows the file system to skip a lot of inode
metadata updates.

The way it works today is that when we go to update the timestamp on an
inode, we check whether they have made any visible change, and we dirty
the inode metadata if so. This means that we only really update the
inode on disk once per jiffy or so when an inode is under heavy writes.

The idea with this set is to only use fine-grained timestamps when
someone is actively fetching them via getattr. When the mtime or ctime
is viewed via getattr, we mark the inode and then the following
timestamp update will get a fine-grained timestamp (unless the coarse-
grained clock has already ticked).

That allows us to keep the number of inode updates down to a bare
minimum, but still allows an observer to always see a change in the
timestamp when there have been changes to the inode.

Thanks again for the review! For the next iteration I (probably) won't
need to touch the timekeeper.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>