Re: Questions about filesystems from SQLite author presentation

On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@xxxxxxxxx> wrote:
>
> At Linux Plumbers 2019 Dr Richard Hipp presented a talk about SQLite
> (https://youtu.be/-oP2BOsMpdo?t=5525 ). One of the slides was titled
> "Things to discuss"
> (https://sqlite.org/lpc2019/doc/trunk/slides/sqlite-intro.html/#6 )
> and had a few questions:
>
[...]
>
> However, there were even more questions in the briefing paper
> (https://sqlite.org/lpc2019/doc/trunk/briefing.md and search for '?')
> that couldn't be asked due to limited time. Does anyone know the
> answer to the extended questions and whether the above is the right
> deduction for the questions that were asked?
>

As Jan said, there is a difference between the answer to "what is the
current behavior" and "what are filesystem developers willing to commit
as behavior that will remain the same in the future", but I will try to provide
some answers to your questions.

> If a power loss occurs at about the same time that a file is being extended
> with new data, will the file be guaranteed to contain valid data after reboot,
> or might the extended area of the file contain all zeros or all ones or
> arbitrary content? In other words, is the file data always committed to disk
> ahead of the file size?

While that statement is generally true in practice (ever since ext3
data=ordered...), you have no such guarantee. Getting such a guarantee
would require a new API like O_ATOMIC.

> If a power loss occurs at about the same time as a file truncation, is it possible
> that the truncated area of the file will contain arbitrary data after reboot?
> In other words, is the file size guaranteed to be committed to disk before the
> data sections are released?

That statement is generally true for filesystems that claim to be crash
consistent. And the filesystems that do not claim to be crash consistent
provide no guarantees at all w.r.t. power loss, so it's not worth talking
about them in this context.

> If a write occurs on one or two bytes of a file at about the same time as a power
> loss, are other bytes of the file guaranteed to be unchanged after reboot?
> Or might some other bytes within the same sector have been modified as well?

I don't see how other bytes could change in this scenario, but I don't
know if the hardware provides this guarantee. Maybe someone else knows
the answer.

> When you create a new file, write to it, and fdatasync() successfully, is it also
> necessary to open and fsync() the containing directory in order to ensure that the
> file will still be there following reboot from a power loss?

There is no guarantee that the file will be there after power loss
without fsync() of the containing directory. In practice, with current
upstream xfs and ext4 the file will be there after reboot, because at
the moment fdatasync() of a new file implies a journal flush, which also
commits the file creation.
With current upstream btrfs the file may not be there after reboot.

I tried to promote a new API that provides a weaker guarantee
at LSF/MM 2019 [1][2]. The idea is an API for an application that does
not need durability - it doesn't care whether the new file is there or
not after power loss, but if the file is there, its data should be valid.

I do not know if sqlite could potentially use such an API. If there is
a potential use, I did not find it. Specifically, the proposed API DOES
NOT have the semantics of fbarrier() mentioned in the sqlite briefing doc.

[See more about fdatasync() at the bottom of my reply...]

> Has a file been unlinked or renamed since it was opened?
> (SQLite accomplishes this now by remembering the device and inode numbers
> obtained from fstat() and comparing them against the results of subsequent stat()
> calls against the original filename. Is there a more efficient way to do this?)

name_to_handle_at() is a better way to make sure that a file with the
same name wasn't replaced by another, because inode numbers get
recycled frequently in create/delete workloads.

> Has a particular file been created since the most recent reboot?

statx(2) exposes "birth time" (STATX_BTIME) which some filesystems
support depending on how they were formatted (e.g. ext4 inode size).
In any case, statx reports if btime info is available or not.

> Is it possible (or helpful) to tell the filesystem that the content of a particular file
> does not need to survive reboot?

Not that I know of.

> Is it possible (or helpful) to tell the filesystem that a particular file can be
> unlinked upon reboot?

Not that I know of.

> Is it possible (or helpful) to tell the filesystem about parts of the database
> file that are currently unused and that the filesystem can zero-out without
> harming the database?

As Dave already replied, FALLOC_FL_ZERO_RANGE.

[...more about fdatasync()]

One thing that I think is worth mentioning, which I discussed at LSF [3],
is the cost of requiring application developers to use the strictest
API (i.e. fsync()) because filesystem developers don't want to commit
to new APIs -

When the same filesystem hosts two different workloads:
1. sqlite with many frequent small transaction commits
2. Creating many small files with no need for durability (e.g. untar)

Both workloads may in practice hurt each other on many filesystems.
The frequent fdatasync() calls from sqlite will sometimes cause journal
flushes, which flush more than sqlite needs, take more time to commit,
and slow down the other metadata-intensive workload.

Ext4 is trying to address this issue without extending the API [4].
XFS was a bit better than ext4 at avoiding unneeded journal flushes,
but those could still take place. Btrfs is generally better in this regard
(fdatasync() effects are quite isolated to the file).

So how can sqlite developers help to improve the situation?
If you ask me, I would suggest providing benchmark results from
mixed workloads, like the one I described above.

If you can demonstrate the negative effects that frequent fdatasync()
calls on a single sqlite db have on the system performance as a whole,
then there is surely something that could be done to fix the problem.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@xxxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@xxxxxxxxx/
[3] https://lwn.net/Articles/788938/
[4] https://lore.kernel.org/linux-ext4/20191001074101.256523-1-harshadshirwadkar@xxxxxxxxx/


