Re: consider dropping defrag of journals on btrfs

On Fri, Feb 5, 2021 at 8:23 AM Phillip Susi <phill@xxxxxxxxxxxx> wrote:

> Chris Murphy writes:
>
> > But it gets worse. The way systemd-journald is submitting the journals
> > for defragmentation is making them more fragmented than just leaving
> > them alone.
>
> Wait, doesn't it just create a new file, fallocate the whole thing, copy
> the contents, and delete the original?

Same inode, so no. As to the logic, I don't know. I'll ask upstream to
document it.
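
For illustration, the user-space equivalent of that same-inode defrag
is roughly the following (a sketch, not journald's literal code path;
the machine-id directory is a placeholder):

$ sudo btrfs filesystem defragment \
      /var/log/journal/<machine-id>/system.journal   # <machine-id> is hypothetical

The file is defragmented in place: btrfs rewrites its extents under
the same inode, rather than copying the contents to a new file.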

> How can that possibly make
> fragmentation *worse*?

I'm only seeing this pattern with journald journals, and
BTRFS_IOC_DEFRAG. But I'm also seeing it with all archived journals.

Meanwhile, active journals exhibit no different pattern from ext4 and
xfs, no worse fragmentation.

Consider other storage technologies where COW and snapshots come into
play. For example, anything based on device-mapper thin provisioning
is going to run into these same issues. How physical extents are
allocated isn't up to the file system. Duplicate a file and delete the
original, and you might get a more fragmented file as well. The
physical layout is entirely decoupled from the file system: the file
system could tell you "no fragmentation" and yet the data is highly
fragmented on the backing store, or vice versa. These problems are not
unique to Btrfs.
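
As a concrete illustration of that decoupling, with hypothetical file
names (works on Btrfs, or on XFS with reflink support):

$ cp --reflink=always data.bin data.copy   # shares extents, allocates no new data
$ rm data.bin                              # data.copy keeps the old physical layout
$ filefrag data.copy                       # same extent count data.bin had

The copy inherits whatever physical layout the original had, however
fragmented, even though nothing ever "fragmented" the copy itself.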

Is there a VFS API for handling these issues? Should there be? I really
don't think any application, including journald, should have to
micromanage these kinds of things on a case-by-case basis. General
problems like this need general solutions.


> > All of those archived files have more fragments (post defrag) than
> > they had when they were active. And here is the FIEMAP for the 96MB
> > file which has 92 fragments.
>
> How the heck did you end up with nearly 1 frag per MB?

I didn't do anything special; it's a default configuration. I'll ask
the Btrfs developers about it. Maybe it's one of those FIEMAP artifacts
I mentioned previously. Or maybe it's not actually that badly
fragmented from the drive's point of view, since the drive is going to
reorder reads anyway to service them more efficiently.
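
For anyone who wants to inspect their own journals, filefrag prints
the same FIEMAP data (the machine-id directory is a placeholder):

$ sudo filefrag -v /var/log/journal/<machine-id>/system.journal

-v lists each extent's logical and physical offsets, so you can see
whether the reported "fragments" are actually physically adjacent.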


> > If you want an optimization that's actually useful on Btrfs,
> > /var/log/journal/ could be a nested subvolume. That would prevent any
> > snapshots above from turning the nodatacow journals into datacow
> > journals, which does significantly increase fragmentation (it would in
> > the exact same case if it were a reflink copy on XFS for that matter).
>
> Wouldn't that mean that when you take snapshots, they don't include the
> logs?

That's a snapshot/rollback regime design and policy question.

If you snapshot the subvolume that contains the journals, the journals
will be in the snapshot. The user space tools have no option for
recursive snapshots, so snapshotting stops at subvolume boundaries. If
you want the journals snapshotted, their enclosing subvolume has to be
snapshotted explicitly.
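
A minimal sketch of the nested-subvolume setup (assumes /var/log is on
Btrfs; stop journald first so the directory can be swapped):

$ sudo systemctl stop systemd-journald
$ sudo mv /var/log/journal /var/log/journal.old
$ sudo btrfs subvolume create /var/log/journal         # nested subvolume
$ sudo cp -a /var/log/journal.old/. /var/log/journal/  # carry over old journals
$ sudo rm -rf /var/log/journal.old
$ sudo systemctl start systemd-journald

From then on, a snapshot of the parent subvolume stops at the
/var/log/journal boundary.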


> That seems like an anti-feature that violates the principle of
> least surprise.  If I make a snapshot of my root, I *expect* it to
> contain my logs.

You can only roll back what you snapshot. If you snapshot a root
without excluding the journals, then when you roll back, you roll back
the journals too. That's data loss.

(open)SUSE has a snapshot/rollback regime configured and enabled by
default out of the box. Logs are excluded from it, same as the
bootloader. (Although I'll also note they default to volatile systemd
journals, and use rsyslogd for persistent logs.) Fedora meanwhile does
have persistent journald journals in the root subvolume, but no
snapshot/rollback regime enabled out of the box. I'm inclined to have
the journals excluded, not so much to avoid COW of the nodatacow
journals as to avoid a discontinuity in the journals upon rollback.


>
> > I don't get the iops thing at all. What we care about in this case is
> > latency. A least noticeable latency of around 150ms seems reasonable
> > as a starting point, that's where users realize a delay between a key
> > press and a character appearing. However, if I check for 10ms latency
> > (using bcc-tools fileslower) when reading all of the above journals at
> > once:
> >
> > $ sudo journalctl -D
> > /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
> >
> > Not a single report. None. Nothing took even 10ms. And those journals
> > are more fragmented than your 20 in a 100MB file.
> >
> > I don't have any hard drives to test this on. This is what, 10% of the
> > market at this point? The best you can do there is the same as on SSD.
>
> The above sounded like great data, but not if it was done on SSD.

Right. But I also can't disable the defragmentation in order to do a
proper test on an HDD.
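
For reference, the latency check mentioned above was roughly this (a
sketch; assumes bcc-tools is installed with the scripts under
/usr/share/bcc/tools, as on Fedora):

$ sudo /usr/share/bcc/tools/fileslower 10 &   # report file I/O slower than 10 ms
$ sudo journalctl -D /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager > /dev/null

No output from fileslower means no read or write crossed the 10ms
threshold.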


> > You can't depend on sysfs to conditionally do defragmentation on only
> > rotational media, too many fragile media claim to be rotating.
>
> It sounds like you are arguing that it is better to do the wrong thing
> on all SSDs rather than do the right thing on ones that aren't broken.

No, I'm suggesting there isn't currently a reliable way to isolate
defragmentation to just HDDs.
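
To be concrete, this is the sysfs flag in question (sda is a
hypothetical device name; too many devices misreport it, as noted
above):

$ cat /sys/block/sda/queue/rotational   # 1 = claims rotational, 0 = non-rotational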

-- 
Chris Murphy


