Re: [PATCH 12/13] fsverity: remove system-wide workqueue

On Wed, Apr 24, 2024 at 11:05:20AM -0700, Darrick J. Wong wrote:
> On Thu, Apr 04, 2024 at 11:14:07PM -0400, Eric Biggers wrote:
> > On Fri, Mar 29, 2024 at 05:35:48PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > 
> > > Now that we've made the verity workqueue per-superblock, we don't need
> > > the systemwide workqueue.  Get rid of the old implementation.
> > 
> > This commit message needs to be rephrased because this commit isn't just
> > removing unused code.  It's also converting ext4 and f2fs over to the new
> > workqueue type.  (Maybe these two parts belong as separate patches?)
> 
> Yes, will fix that.
> 
> > Also, if there are any changes in the workqueue flags that are being used for
> > ext4 and f2fs, that needs to be documented.
> 
> Hmm.  The current codebase does this:
> 
> 	fsverity_read_workqueue = alloc_workqueue("fsverity_read_queue",
> 						  WQ_HIGHPRI,
> 						  num_online_cpus());
> 
> Looking at commit f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from
> fsverity read workqueue"), I guess you want a bound workqueue so that
> the CPU that handles the readahead ioend will also handle the verity
> validation?

That was the intent of that commit.  Hashing is fast enough on modern CPUs that
it seemed best to just do the work on the CPU that the interrupt comes in on,
instead of taking the performance hit of migrating the work to a different CPU.
The cost of CPU migration was measured to be very significant on arm64.
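
To make that concrete, here is a minimal sketch of the bound arrangement
(everything except the workqueue API itself is hypothetical, not the actual
fs/verity code):

    #include <linux/bio.h>
    #include <linux/cpumask.h>
    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/workqueue.h>

    /* Hypothetical per-bio context; the real fs/verity code has its own. */
    struct verity_work {
            struct work_struct work;
            struct bio *bio;
    };

    static struct workqueue_struct *verity_wq;

    static int __init verity_wq_init(void)
    {
            /* No WQ_UNBOUND: work items run on the CPU that queued them. */
            verity_wq = alloc_workqueue("fsverity_read_queue", WQ_HIGHPRI,
                                        num_online_cpus());
            return verity_wq ? 0 : -ENOMEM;
    }

    static void verity_read_endio(struct bio *bio)
    {
            struct verity_work *vw = bio->bi_private;

            /*
             * On a bound workqueue, queue_work() normally targets the local
             * CPU's pool, so the hashing runs where the completion arrived.
             */
            queue_work(verity_wq, &vw->work);
    }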

However, new information has come in indicating that the switch to a bound
workqueue is harmful in some cases.  Specifically, on some systems all storage
interrupts happen on the same CPU, and if the fsverity (or dm-verity -- this
applies there too) work takes too long due to having to read a lot of Merkle
tree blocks, too much work ends up queued up on that one CPU.

So this is an area that's under active investigation.  For example, some
improvements to unbound workqueues were upstreamed in v6.6, and I'll be taking a
look at those and checking whether switching back to an appropriately-configured
unbound workqueue would help.
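
Purely as a speculative sketch of what "appropriately-configured" might mean --
nothing in this series or in fs/verity does this today -- the workqueue could
go back to WQ_UNBOUND and expose the new affinity knobs:

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *verity_wq;

    static int __init verity_wq_init(void)
    {
            /*
             * Unbound again, but with WQ_SYSFS so the affinity scope knobs
             * added in v6.6 (affinity_scope and affinity_strict under
             * /sys/devices/virtual/workqueue/fsverity_read_queue/) can keep
             * workers near the submitting CPU's cache or NUMA node.
             */
            verity_wq = alloc_workqueue("fsverity_read_queue",
                                        WQ_UNBOUND | WQ_SYSFS, 0);
            return verity_wq ? 0 : -ENOMEM;
    }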

> Why do you set max_active to num_online_cpus()?  Is that because the
> verity hash is (probably?) being computed on the CPUs, and there's only
> so many of those to go around, so there's little point in making more?
> Or is it to handle systems with more than WQ_DFL_ACTIVE (~256) CPUs?
> Maybe there's a different reason?
> 
> If you add more CPUs to the system later, does this now constrain the
> number of CPUs that can be participating in verity validation?  Why not
> let the system try to process as many read ioends as are ready to be
> processed, rather than introducing a constraint here?

I'm afraid that not much thought has been put into the value of max_active.
dm-crypt has long used an unbound workqueue with max_active=num_online_cpus(),
and I think I had just copied it from there.  Then commit f959325e6ac3 just
didn't change it when removing WQ_UNBOUND.

According to Documentation/core-api/workqueue.rst, max_active is a per-CPU
attribute.  So I doubt that num_online_cpus() makes sense regardless of whether
the workqueue is bound or unbound.  It probably should just use the default
value of 256 (which is selected by passing max_active=0).
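
In other words, something like the following (just a sketch of the allocation
call, not a tested change):

    /* max_active = 0 selects the default, WQ_DFL_ACTIVE (currently 256). */
    fsverity_read_workqueue = alloc_workqueue("fsverity_read_queue",
                                              WQ_HIGHPRI, 0);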

> As for WQ_HIGHPRI, I wish Dave or Andrey would chime in on why this
> isn't appropriate for XFS.  I think they have a reason for this, but the
> most I can do is speculate that it's to avoid blocking other things in
> the system.

I think it's been shown that neither option (WQ_HIGHPRI or !WQ_HIGHPRI) is best
in all cases.  Ideally, the work for each individual I/O request would get
scheduled at a priority that is appropriate for that particular I/O.  This same
problem shows up for dm-crypt, dm-verity, etc.

So if you have a reason to use !WQ_HIGHPRI for now, that seems reasonable, but
really all these subsystems need to be much smarter about scheduling their work.
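
As a hypothetical illustration (the helper name and signature below are made
up, not the interface added by this series), a per-superblock allocation could
simply leave the flag choice to the filesystem:

    #include <linux/fs.h>
    #include <linux/types.h>
    #include <linux/workqueue.h>

    /* Hypothetical helper, not part of this series. */
    static struct workqueue_struct *
    example_alloc_verity_wq(struct super_block *sb, bool highpri)
    {
            /* Each filesystem picks the flags that suit its completion path. */
            return alloc_workqueue("fsverity/%s", highpri ? WQ_HIGHPRI : 0,
                                   0, sb->s_id);
    }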

- Eric



