Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)

Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> · Wed, 1 Jun 2022 11:48:33 -0700 (PDT)

On Tue, 31 May 2022, Keith Busch wrote:
> On Tue, May 31, 2022 at 04:04:12PM -0700, Eric Wheeler wrote:
> > 
> >   * Write-through: write is done synchronously both to the cache and to 
> >     the backing store.
> > 
> >   * Write-back (also called write-behind): initially, writing is done only 
> >     to the cache. The write to the backing store is postponed until the 
> >     modified content is about to be replaced by another cache block.
> >   [ https://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies ]
> > 
> > 
> > So the kernel's notion of "write through" meaning "Drop FLUSH/FUA" sounds 
> > like the industry meaning of "write-back" as defined above; conversely, 
> > the kernel's notion of "write back" sounds like the industry definition of 
> > "write-through"
> > 
> > Is there a well-meaning rationale for the kernel's concept of "write 
> > through" to be different than what end users have been conditioned to 
> > understand?
> 
> I think we all agree what "write through" vs "write back" mean. I'm just not
> sure what's the source of the disconnect with the kernel's behavior.
> 
>   A "write through" device persists data before completing a write operation.
> 
>   Flush/FUA says to write data to persistence before completing the operation.
> 
> You don't need both. Flush/FUA should be a no-op to a "write through" device
> because the data is synchronously committed to the backing store automatically.

OK, I think I'm starting to understand the rationale. Thank you for your 
patience while I wrap my head around it. So, using a RAID controller 
cache as an example:

1. A RAID controller with a _non-volatile_ "writeback" cache (from the 
   controller's perspective, i.e., _with_ battery) is a "write through" 
   device as far as the kernel is concerned, because the controller will 
   return the write as complete as soon as it is in the persistent cache.

2. A RAID controller with a _volatile_ "writeback" cache (from the 
   controller's perspective, i.e., _without_ battery) is a "write back" 
   device as far as the kernel is concerned, because the controller will 
   return the write as complete as soon as it is in the cache, but the 
   cache is not persistent!  So in that case flush/FUA is necessary.
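
To check that I have the mapping right, here is a minimal Python sketch of 
the two cases above. The `ControllerCache` type and both function names are 
hypothetical, made up for illustration. This is not kernel code, only how I 
currently understand the rule:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerCache:
    """Hypothetical model of a RAID controller cache (illustration only)."""
    mode: str         # controller's own label: "writeback" or "writethrough"
    persistent: bool  # True if the cache survives power loss (e.g. battery)


def kernel_view(cache: ControllerCache) -> str:
    """Return how the kernel should classify the device.

    Assumption: the kernel cares only about whether a completed write can
    still be lost, not about what the controller calls its mode.
    """
    if cache.mode == "writeback" and not cache.persistent:
        return "write back"      # completed writes may sit in volatile cache
    return "write through"       # completed writes are already persistent


def needs_flush_fua(cache: ControllerCache) -> bool:
    """Flush/FUA only matters when the kernel sees a "write back" device."""
    return kernel_view(cache) == "write back"


case1 = ControllerCache(mode="writeback", persistent=True)   # with battery
case2 = ControllerCache(mode="writeback", persistent=False)  # without battery
print(kernel_view(case1), needs_flush_fua(case1))
print(kernel_view(case2), needs_flush_fua(case2))
```

Both cases use the same controller label ("writeback"), yet the kernel's 
classification differs, which is exactly where the naming confusion comes 
from.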

I think it is rare that someone would configure a RAID controller as 
writeback (in the controller) when the cache is volatile (i.e., without 
battery), but it is an interesting way to dissect this to understand the 
rationale behind the value choices for the `queue/write_cache` flag in sysfs.
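
For anyone following along, the flag can be inspected from userspace. A 
minimal Python sketch, assuming the attribute lives at 
/sys/block/<dev>/queue/write_cache and contains either "write back" or 
"write through". The function names and the "sda" device are placeholders 
of mine:

```python
from pathlib import Path

VALID = ("write back", "write through")


def parse_write_cache(text: str) -> str:
    """Normalize the content of queue/write_cache and validate it."""
    value = text.strip()
    if value not in VALID:
        raise ValueError(f"unexpected write_cache value: {value!r}")
    return value


def read_write_cache(dev: str, sysfs_root: str = "/sys/block") -> str:
    """Read the kernel's view of the write cache for a block device."""
    path = Path(sysfs_root) / dev / "queue" / "write_cache"
    return parse_write_cache(path.read_text())


if __name__ == "__main__":
    # "sda" is only an example device name.
    try:
        print(read_write_cache("sda"))
    except (OSError, ValueError) as exc:
        print(f"could not read write_cache: {exc}")
```

Note that writing to this file only changes the kernel's view of the 
device; it does not reconfigure the device itself.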

So please correct me here if I'm wrong: theoretically, a RAID controller 
with a volatile writeback cache is "safe" in terms of any flush/FUA 
behavior, assuming the controller respects those ops in writeback mode.  
For example, ext4's journal is probably consistent after a crash, even if 
2GB of cached data might be lost (assuming FUA and not FLUSH is being 
used for metadata; I don't actually know ext4's implementation there).

I would guess that most end users are going to expect queue/write_cache to 
match their RAID controller's naming convention.  If they see "write 
through" when they know their controller is in writeback w/battery, then 
they might reasonably expect the flag to show "write back", too.  If they 
then force it to "write back", they lose the performance benefit of the 
kernel dropping flush/FUA.
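
As a toy illustration of that last point (my own simplification, not how 
the block layer is actually written): if the kernel treats the queue as 
"write through" it drops flush/FUA before they reach the controller, and 
if it is forced to "write back" every one of them is passed down. The name 
`flushes_sent_to_device` is made up:

```python
def flushes_sent_to_device(write_cache: str, flush_requests: int) -> int:
    """Toy model: count flush/FUA requests that reach the controller.

    Assumption: with "write through" the kernel drops every flush/FUA,
    and with "write back" it forwards every one of them.
    """
    if write_cache == "write through":
        return 0
    if write_cache == "write back":
        return flush_requests
    raise ValueError(f"unexpected write_cache value: {write_cache!r}")


# 1000 fsync-style flushes against a battery-backed controller:
print(flushes_sent_to_device("write through", 1000))  # 0
print(flushes_sent_to_device("write back", 1000))     # 1000
```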

Given that, and considering that end users who configure RAID controllers 
do not commonly understand the flush/FUA intricacies and what really 
constitutes "write back" vs "write through" from the kernel's perspective, 
perhaps it would be a good idea to add more documentation around 
write_cache here:

  https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt

What do you think?

--
Eric Wheeler