Re: Questions about filesystems from SQLite author presentation

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 6 Jan 2020 21:15:18 +1100

On Mon, Jan 06, 2020 at 07:24:53AM +0000, Sitsofe Wheeler wrote:
> At Linux Plumbers 2019 Dr Richard Hipp presented a talk about SQLite
> (https://youtu.be/-oP2BOsMpdo?t=5525 ). One of the slides was titled
> "Things to discuss"
> (https://sqlite.org/lpc2019/doc/trunk/slides/sqlite-intro.html/#6 )
> and had a few questions:
> 
> 1. Reliable ways to discover detailed filesystem properties
> 2. fbarrier()
> 3. Notify the OS about unused regions in the database file
> 
> For 1. I think Jan Kara said that supporting it was undesirable for
> details like just how much additional fsync were needed due to
> competing constraints (https://youtu.be/-oP2BOsMpdo?t=6063 ). Someone
> mentioned there was a patch for fsinfo to discover if you were on a
> network filesystem
> (https://www.youtube.com/watch?v=-oP2BOsMpdo&feature=youtu.be&t=5525
> )...
> For 2. there was a talk by MySQL dev Sergei Golubchik (
> https://youtu.be/-oP2BOsMpdo?t=1219 ) talking about how barriers had
> been taken out and was there a replacement. In
> https://youtu.be/-oP2BOsMpdo?t=1731 Chris Mason(?) seems to suggest
> that the desired effect could be achieved with io_uring chaining.

Even though it wasn't explicitly mentioned, I'm pretty sure that
those "write barriers" for ordering groups of writes against other
groups of writes are intended to be used for data integrity
purposes.

The problem is that data integrity writes also require any
uncommitted filesystem metadata to be written to disk in the correct
order along with the data. i.e. you can write to the log file, but
if the transactions run during that write to allocate space and/or
convert it to written space have not been committed to the journal,
then the data is not on stable storage and so data completion
ordering cannot be relied on for integrity related operations.

This is why write ordering always comes back to "you need to use
fdatasync(), O_DSYNC or RWF_DSYNC" - it is the only way to guarantee
the integrity of the initial data write(s) right down to the
hardware before starting the new dependent write(s).
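
As a rough sketch of that ordering (the helper names and the
journal/database split are made up for illustration, they are not
taken from SQLite), the first group of writes is made durable with
fdatasync() before the dependent write is issued:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer at the given offset, retrying short writes. */
static int write_all(int fd, const char *buf, size_t len, off_t off)
{
    while (len > 0) {
        ssize_t n = pwrite(fd, buf, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -errno;
        }
        buf += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/*
 * Hypothetical helper: make the journal record durable before the
 * database page that depends on it is written.
 * Returns 0 on success or a negative errno.
 */
int journal_then_update(int journal_fd, const char *rec, size_t rec_len,
                        off_t rec_off, int db_fd, const char *page,
                        size_t page_len, off_t page_off)
{
    int rc = write_all(journal_fd, rec, rec_len, rec_off);
    if (rc)
        return rc;

    /* The "barrier": data plus the metadata needed to read it back. */
    if (fdatasync(journal_fd) < 0)
        return -errno;

    rc = write_all(db_fd, page, page_len, page_off);
    if (rc)
        return rc;

    if (fdatasync(db_fd) < 0)
        return -errno;
    return 0;
}
```

Opening the journal with O_DSYNC, or passing RWF_DSYNC to pwritev2(),
should give the same guarantee per write without the separate
fdatasync() call.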

Hence AIO_FSYNC and now chained operations in io_uring, which allow
fsync to be issued asynchronously and be used as a "write barrier"
between groups of order dependent IOs...
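
A minimal sketch of the AIO variant, assuming a kernel new enough to
accept IOCB_CMD_FDSYNC through io_submit() (older kernels return
EINVAL). It uses the raw native AIO syscalls rather than a library,
the function names are invented for the example, and it waits for
each completion in line where a real application would go back to
its event loop. With io_uring, the same write -> fsync -> write
sequence would instead be submitted as linked SQEs so the kernel
does the ordering:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Submit one iocb and wait for its completion. Returns the completion
 * result (bytes written, or 0 for a sync) or a negative errno.
 */
static long submit_and_wait(aio_context_t ctx, struct iocb *cb)
{
    struct iocb *list[1] = { cb };
    struct io_event ev;

    if (syscall(SYS_io_submit, ctx, 1L, list) != 1)
        return -errno;
    while (syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, NULL) != 1) {
        if (errno != EINTR)
            return -errno;
    }
    return (long)ev.res;
}

static void prep(struct iocb *cb, int fd, int opcode, const void *buf,
                 size_t len, off_t off)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = (uint32_t)fd;
    cb->aio_lio_opcode = (uint16_t)opcode;
    cb->aio_buf = (uint64_t)(uintptr_t)buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;
}

/*
 * Hypothetical helper: write A, fdatasync via AIO, then write B.
 * B is only submitted once A is known to be on stable storage.
 */
int write_sync_write(int fd, const void *a, size_t a_len, off_t a_off,
                     const void *b, size_t b_len, off_t b_off)
{
    aio_context_t ctx = 0;
    struct iocb cb;
    long res;
    int rc = 0;

    if (syscall(SYS_io_setup, 4L, &ctx) < 0)
        return -errno;

    prep(&cb, fd, IOCB_CMD_PWRITE, a, a_len, a_off);
    res = submit_and_wait(ctx, &cb);
    if (res != (long)a_len) {
        rc = res < 0 ? (int)res : -EIO;   /* short write: give up */
        goto out;
    }

    prep(&cb, fd, IOCB_CMD_FDSYNC, NULL, 0, 0);   /* the "barrier" */
    res = submit_and_wait(ctx, &cb);
    if (res != 0) {
        rc = res < 0 ? (int)res : -EIO;
        goto out;
    }

    prep(&cb, fd, IOCB_CMD_PWRITE, b, b_len, b_off);
    res = submit_and_wait(ctx, &cb);
    if (res != (long)b_len)
        rc = res < 0 ? (int)res : -EIO;
out:
    syscall(SYS_io_destroy, ctx);
    return rc;
}
```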

> For 3. it sounded like Jan Kara was saying there wasn't anything at
> the moment (hypothetically you could introduce a call that marked the
> extents as "unwritten" but it doesn't sound like you can do that

You can do that with fallocate() - FALLOC_FL_ZERO_RANGE will mark
the unused range as unwritten in XFS, or you can just punch a hole
to free the unused space with FALLOC_FL_PUNCH_HOLE...
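
Roughly, and with wrapper names made up for the example, the two
calls look like this. Not every filesystem implements both modes, so
callers should expect EOPNOTSUPP and fall back to doing nothing:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/*
 * Free the blocks backing [off, off + len). The file size does not
 * change and the range reads back as zeroes afterwards.
 * PUNCH_HOLE must always be combined with KEEP_SIZE.
 */
int release_range(int fd, off_t off, off_t len)
{
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  off, len) < 0)
        return -errno;
    return 0;
}

/*
 * Keep the space allocated but make the range read back as zeroes;
 * on XFS this converts the range to unwritten extents.
 */
int zero_range_keep_space(int fd, off_t off, off_t len)
{
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
                  off, len) < 0)
        return -errno;
    return 0;
}
```

Punching is the one to use if the goal is to hand space back to the
filesystem; zeroing keeps the allocation around for reuse by the
database.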

> today) and even if you wanted to use something like TRIM it wouldn't
> be worth it unless you were trimming a large (gigabytes) amount of
> data (https://youtu.be/-oP2BOsMpdo?t=6330 ).

Punch the space out, then run a periodic background fstrim so the
filesystem can issue efficient TRIM commands over free space...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx