On Sat, Mar 05, 2022 at 11:18:34PM -0500, Theodore Ts'o wrote:
> On Sat, Mar 05, 2022 at 06:04:15PM +0200, Amir Goldstein wrote:
> > 
> > Dave Chinner asked why the io stats should not be enabled for all
> > filesystems. That change seems too bold for me so instead, I included
> > an extra patch to auto-enable per-sb io stats for blockdev filesystems.
> 
> Perhaps something to consider is allowing users to be able to enable
> or disable I/O stats on a per-mount basis?
> 
> Consider if a potential future user of this feature has servers with
> one or two 256-core AMD Epyc chips, and suppose that they have
> several thousand iSCSI mounted file systems containing various
> software packages for use by Kubernetes jobs. (Or even several
> thousand mounted overlay file systems.....)
> 
> The size of the percpu counter is going to be *big* on a large CPU
> count machine, and the iostats structure has 5 of these per-cpu
> counters, so if you have one for every single mounted file system,
> even if the CPU slowdown isn't significant, the non-swappable kernel
> memory overhead might be quite large.

A percpu counter on a 256 core machine is ~1kB. Adding 5kB to the
struct superblock isn't a big deal for a machine of this size, even if
you have thousands of superblocks - we're talking a few *megabytes* of
extra memory in a machine that would typically have hundreds of GB of
RAM.

Seriously, the memory overhead of the per-cpu counters is noise
compared to the memory footprint of, say, the stacks that need to be
allocated for every background worker thread the filesystem needs.

Yeah, I know, we have ~175 per-cpu stats counters per XFS superblock
(we already cover the 4 counters Amir is proposing to add as generic
SB counters), and we have half a dozen dedicated worker threads per
mount. Yet systems still function just fine when there are thousands
of XFS filesystems and thousands of CPUs.

Seriously, a small handful of per-cpu counters that capture
information for all superblocks is not a big deal. Small systems will
have relatively little overhead, and large systems have the memory to
handle it.

> So maybe a VFS-level mount option, say, "iostats" and "noiostats", and
> some kind of global option indicating whether the default should be
> iostats being enabled or disabled? Bonus points if iostats can be
> enabled or disabled after the initial mount via a remount operation.

Can we please just avoid mount options for stuff like this? It'll just
never get tested unless it defaults to on, and then almost no-one will
ever turn it off, because why would you bother tweaking something that
has no noticeable impact but can give useful insights into the
workload that is running?

I don't care one way or another here because this is essentially
duplicating something we've had in XFS for 20+ years. What I want to
avoid is blowing out the test matrix even further. Adding optional
features has a cost in terms of testing time, so if a feature is only
rarely going to be turned on then we shouldn't add it at all. If it's
only rarely going to be turned off, OTOH, then we should just make it
ubiquitous and available for everything so that it's always tested.

Hence, AFAICT, the only real option for yes/no support is the Kconfig
option. If the kernel builder turns it on, it is on for everything,
otherwise it is off for everything.
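
For concreteness, here is a rough sketch of what a Kconfig-gated set of
per-sb counters could look like. This is illustrative only, not Amir's
actual patch: the CONFIG_FS_IOSTATS symbol, the sb_iostats/sb_iostat
names, the helpers and the specific stat fields are all made up for the
example; the real series may count and name these differently.

#include <linux/percpu_counter.h>

/* Hypothetical stat indices - the real series may differ. */
enum sb_iostat {
	SB_IOSTAT_RCHAR,	/* bytes read */
	SB_IOSTAT_WCHAR,	/* bytes written */
	SB_IOSTAT_SYSCR,	/* read calls */
	SB_IOSTAT_SYSCW,	/* write calls */
	SB_IOSTAT_NR,
};

#ifdef CONFIG_FS_IOSTATS	/* hypothetical Kconfig symbol */
struct sb_iostats {
	/*
	 * percpu_counter keeps a 4 byte slot per possible CPU, so the
	 * per-cpu storage is ~1kB per counter on a 256 CPU machine,
	 * i.e. a few kB per superblock in total.
	 */
	struct percpu_counter	counters[SB_IOSTAT_NR];
};

static inline int sb_iostats_init(struct sb_iostats *st)
{
	int i, err;

	for (i = 0; i < SB_IOSTAT_NR; i++) {
		err = percpu_counter_init(&st->counters[i], 0, GFP_KERNEL);
		if (err) {
			/* Unwind the counters we already set up. */
			while (--i >= 0)
				percpu_counter_destroy(&st->counters[i]);
			return err;
		}
	}
	return 0;
}

static inline void sb_iostats_add(struct sb_iostats *st,
				  enum sb_iostat item, s64 amount)
{
	/* Cheap per-cpu add in the fast path, summed only when read. */
	percpu_counter_add(&st->counters[item], amount);
}
#else	/* !CONFIG_FS_IOSTATS: everything compiles away */
struct sb_iostats { };
static inline int sb_iostats_init(struct sb_iostats *st) { return 0; }
static inline void sb_iostats_add(struct sb_iostats *st,
				  enum sb_iostat item, s64 amount) { }
#endif

Something like this keeps the fast path to a single per-cpu add, the
memory cost is bounded at roughly 4 bytes of per-cpu storage per
possible CPU per counter (which is where the ~1kB-per-counter number
above comes from), and when the Kconfig symbol is off the whole thing
compiles away for everyone instead of becoming another axis of the
test matrix.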

> I could imagine some people only being interested to enable iostats on
> certain file systems, or certain classes of block devices --- so they
> might want it enabled on some ext4 file systems which are attached to
> physical devices, but not on the N thousand iSCSI or nbd mounts that
> also happen to be using ext4.

That seems ... fairly contrived.

Block device IO stats are not turned on and off based on the block
device type - they are generic. Network device stats are not turned on
and off based on the network device - they are generic. Why should
per-filesystem IO stats be special and different to everything else?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx