On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering <lennart@xxxxxxxxxxxxxx> wrote:

> You want to optimize write pattersn I understand, i.e. minimize
> iops. Hence start with profiling iops, i.e. what defrag actually costs
> and then weight that agains the reduced access time when accessing the
> files. In particular on rotating media.

A nodatacow journal on Btrfs is no different from a journal on ext4 or XFS, so I don't understand why you think you *also* need to defragment the file, but only on Btrfs. You cannot do better than you already are with a nodatacow file. That file isn't going to get any more fragmented in use than it was at creation.

If you want to do better, stop appending in 8MB increments. Every append is another extent. Since the journal files apparently max out at 128MB before they are rotated, why aren't they created at 128MB from the very start? That has a decent chance of getting you a file with 1-4 extents, and it's never going to have more extents than that.

Presumably keeping the currently active journal unfragmented matters more than the archived ones, because searches hit recent events more than old ones. Right? So if you're going to say fragmentation matters at all, maybe stop intentionally fragmenting the active journal? Just fallocate the maximum size it's ever going to reach, right off the bat, no matter what file system it's on. Once that 128MB journal is full, leave it alone and rotate to a new 128MB file. The append is what's making them fragmented.

But it gets worse. The way systemd-journald submits the journals for defragmentation makes them more fragmented than just leaving them alone.

https://drive.google.com/file/d/1FhffN4WZZT9gZTnG5VWongWJgPG_nlPF/view?usp=sharing

All of those archived files have more fragments (post defrag) than they had while they were active. And here is the FIEMAP for the 96MB file, which has 92 fragments:

https://drive.google.com/file/d/1Owsd5DykNEkwucIPbKel0qqYyS134-tB/view?usp=sharing

I don't know whether it's a bug in the target size sd-journald submits or a bug in Btrfs, but it doesn't really matter: there is no benefit to defragmenting nodatacow journals that were fallocated at creation.

If you want an optimization that's actually useful on Btrfs, /var/log/journal/ could be a nested subvolume. That would prevent any snapshot taken above it from turning the nodatacow journals into datacow journals, which does significantly increase fragmentation (the same thing would happen with a reflink copy on XFS, for that matter).

> No, but doing this once in a big linear stream when the journal is
> archived might not be so bad if then later on things are much faster
> to access for all future because the files aren't fragmented.

OK, well, in practice it's worse than doing nothing, so I'm suggesting doing nothing.

> Somehow I think you are missing what I am asking for: some data that
> actually shows your optimization is worth it: i.e. that leaving the
> files fragment doesn't hurt access to the journal badly, and that the
> number of iops is substantially lowered at the same time.

I don't get the iops thing at all. What we care about in this case is latency. A least-noticeable latency of around 150ms seems reasonable as a starting point; that's roughly where users start to perceive a delay between a key press and the character appearing.
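Concretely, the 10ms check below amounts to running bcc-tools' fileslower in one terminal while journalctl reads the journals back in another. A rough sketch (the tool path is where Fedora's bcc-tools package installs it, adjust for your distro; the argument is the threshold in milliseconds):

    # Terminal 1: report any synchronous file read or write that takes
    # longer than 10 ms, system-wide, while the journals are read back
    # in another terminal.
    sudo /usr/share/bcc/tools/fileslower 10

Any read during the journal replay that crosses the threshold shows up as a line with the process, file name, and latency. Silence means nothing took even 10ms.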
However, if I check for 10ms latency (using bcc-tools fileslower) when reading all of the above journals at once:

$ sudo journalctl -D /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager

Not a single report. None. Nothing took even 10ms. And those journals are more fragmented than your 20-in-a-100MB-file example.

I don't have any hard drives to test this on. They're what, 10% of the market at this point? The best you can do there is the same as on SSD. And you can't depend on sysfs to restrict defragmentation to rotational media only; too many fragile media claim to be rotating.

By the way, I use Btrfs on an SD card in a Raspberry Pi Zero, of all things. The cards last longer than with other file systems because of the net lower write amplification from native compression. I wouldn't be surprised if the cards failed sooner if I weren't using compression. But who knows, maybe Btrfs write amplification compared to ext4's and XFS's constant journaling ends up being a wash. There are a number of embedded use cases for Btrfs as well. Is compressed F2FS better? Probably. It has a solution for the wandering trees problem, but also no snapshots or data checksumming. I don't think any of that is super relevant to the overall topic, though; I just offer it as a counter-argument to the claim that Btrfs isn't appropriate for small, cheap storage devices.

> The thing is that we tend to have few active files and many archived
> files, and since we interleave stuff our access patterns are pretty
> bad already, so we don't want to spend even more time on paying for
> extra bad access patterns becuase the archived files are fragment.

Right. So pick a size for the journal file; I don't really care what it is, but they seem to get upwards of 128MB, so just use that. Make a 128MB file from the very start, fallocate it, and then, when it's full, rotate and create a new one. Stop the anti-pattern of tacking on 8MB increments. And stop defragmenting them. That is the best scenario for HDDs, USB sticks, and NVMe alike. (A rough sketch of that creation sequence is below my signature.)

Looking at the two original commits, I think they were always in conflict with each other, having landed within months of each other. They are independent ways of dealing with the same problem, and only one of them is needed. The better of the two is fallocate+nodatacow, which makes the journals behave the same as on ext4, where you also don't defragment them.

--
Chris Murphy
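For concreteness, here is the creation sequence I keep describing, done from the shell. This is only a sketch of the idea, not what journald does internally (journald would do the equivalent from C with fallocate() and the nocow file attribute), and the file name and size are just for illustration:

    # Create an empty file and mark it nodatacow before it contains any
    # data; on Btrfs, chattr +C only takes effect on empty files.
    touch demo.journal
    chattr +C demo.journal

    # Preallocate the full 128MB in one step instead of growing the file
    # in 8MB appends as it fills.
    fallocate -l 128M demo.journal

    # Verify: lsattr should show the 'C' (nocow) attribute, and filefrag
    # (which uses FIEMAP) should report only a handful of extents.
    lsattr demo.journal
    filefrag demo.journal

Write into that preallocated space, never grow past it, rotate to a fresh file when it's full, and there is nothing left for a defrag pass to improve.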