Re: [PATCH v3 0/6] Generic per-sb io stats

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 3 Mar 2022 08:50:58 +0200

> > Not that simple.
> > First of all alloc_super() is used for all sorts of internal kernel sb
> > (e.g. pipes) that really don't need those stats.
>
> Doesn't change anything - it still should be entirely set up and
> managed by alloc_super/deactivate_locked_super.
>
> If it really has to be selected by filesystem, alloc_super() has
> a fstype passes to it and you can put a falg in the fstype to say
> this is supported. Then filesystems only need to set a feature flag
> to enable it, not have to manage allocation/freeing of something
> that only the core VFS code uses.
>

That sounds good to me.

It will still leave the decision per fs maintainer whether to support
sb iostats or not, but at least it won't cluttle mountstats with less
interesting stats from the many pseudo filesystem mounts.

I wish I had a good heuristic for which filesystems iostats should
be enabled. I was thinking for blockdev fs and for fs with private BDI
the iostats make more sense, because those fs are more likely to be
io intensive. Then I would not initialize iostats in alloc_super() but in
vfs_get_tree() as my test patch does.

Overlayfs does not have a private BDI yet [1], but overlayfs can use
an fstype flag to opt-in for iostats as you suggested.

[1] https://lore.kernel.org/linux-unionfs/20210923130814.140814-2-cgxu519@xxxxxxxxxxxx/

> > Second, counters can have performance impact.
>
> So does adding branches for feature checks that nobody except some
> special case uses.
>
> But if the counters have perf overhead, then fix the counter
> implementation to have less overhead.
>
> > Depending on the fs, overhead may or may not be significant.
> > I used the attached patch for testing and ran some micro benchmarks
> > on tmpfs (10M small read/writes).
> > The patch hacks -omand for enabling iostats [*]
> >
> > The results were not great. up to 20% slower when io size > default
> > batch size (32).
> > Increasing the counter batch size for rchar/wchar to 1K fixed this
> > micro benchmark,
>
> Why do you think that is? Think about it: what size IO did you test?
> I bet it was larger than 32 bytes and so it was forcing the
> default generic percpu counter implementation to take a spin lock on
> every syscall.  Yes?

Yes. I know why that is.

>
> Which means this addition will need to use a custom batch size for
> *all* filesystems, and it will have to be substantially larger than
> PAGE_SIZE because we need to amortise the cost of the global percpu
> counter update across multiple IOs, not just one. IOWs, the counter
> batch size likely needs to be set to several megabytes so that we
> don't need to take the per-cpu spinlock in every decent sized IO
> that applications issue.
>
> So this isn't a negative against enabling the feature for all
> superblocks - you just discovered a general problem because you
> hadn't considered the impact of the counter implementation on
> high performance, high concurrency IO. overlay does get used in such
> environments hence if the implementation isn't up to spec for
> filesystems like XFS, tmpfs, etc that overlay might sit on top of
> then it's not good enough for overlay, either.
>

I very much agree with you here. The batch size should be increased
to several MB, but that brings me back to your comment in the beginning
of this thread that I should use percpu_counter_read_positive() instead of
percpu_counter_sum_positive().

It's true that stats don't need to be accurate, but those stats are pretty
useful for sanity tests and testing some simple operations without ever
seeing the counters increase can be very confusing to users, especially
if megabytes are missing.

My instinct here is that I should use percpu_counter_sum_positive()
for mountstats and if someone reads this file too often, they should
pay the price for it. It is also not hard at all to spot the process and
function to blame for the extra CPU cycles, so unless this change is
going to regress a very common use case, I don't think this is going
to be a real life problem.

> > but it may not be a one size fits all situation.
> > So I'd rather be cautious and not enable the feature unconditionally.
>
> But now that you've realised that the counter configs need to be
> specifically tuned for the use case and the perf overhead is pretty
> much a non-issue, what's the argument against enabling it for
> all superblocks?
>

One argument still is the bloat of the mountstats file.
If people do not want iostats for pseudo fs I do not want to force
feed it to them.

I think the blockdev+private BDI+explicit opt-in by fstype flag
is a good balance.

Thanks,
Amir.