On Thu, 04.02.21 12:51, Chris Murphy (lists@xxxxxxxxxxxxxxxxx) wrote:

> On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering
> <lennart@xxxxxxxxxxxxxx> wrote:
> >
> > You want to optimize write patterns I understand, i.e. minimize
> > iops. Hence start with profiling iops, i.e. what defrag actually
> > costs, and then weigh that against the reduced access time when
> > accessing the files. In particular on rotating media.
>
> A nodatacow journal on Btrfs is no different than a journal on ext4 or
> xfs. So I don't understand why you think you *also* need to defragment
> the file, only on Btrfs. You cannot do better than you already are
> with a nodatacow file. That file isn't going to get any more fragmented
> in use than it was at creation.

You know, we issue the btrfs ioctl under the assumption that if the
file is already perfectly defragmented it's a NOP. Are you suggesting
it isn't a NOP in that case?

> If you want to do better, maybe stop appending in 8MB increments?
> Every time you append it's another extent. Since apparently the
> journal files can max out at 128MB before they are rotated, why aren't
> they created 128MB from the very start? That would have a decent
> chance of getting you a file that's 1-4 extents, and it's not going to
> have more extents than that.

You know, there are certainly "perfect" ways to adjust our writing
scheme to match some specific file system on some specific storage
matching some specific user pattern. Thing is though, what might be
ideal for some fs and some user might be terrible for another fs or
another user. We try to find some compromise in the middle that might
not result in "perfect" behaviour everywhere, but at least in
reasonable behaviour.

> Presumably the currently active journal not being fragmented is more
> important than archived journals, because searches will happen on
> recent events more than old events. Right?

Nope. We always interleave stuff. We currently open all journal files
in parallel: the system one and the per-user ones, the current ones
and the archived ones.

> So if you're going to say
> fragmentation matters at all, maybe stop intentionally fragmenting the
> active journal?

We are not *intentionally* fragmenting. Please don't argue on that
level. Not helpful, man.

> Just fallocate the max size it's going to be right off
> the bat? Doesn't matter what file system it is. Once that 128MB
> journal is full, leave it alone, and rotate to a new 128M file. The
> append is what's making them fragmented.

I don't think that makes much sense: we rotate and start new files for
a multitude of reasons, such as size overrun, time jumps, abnormal
shutdown and so on. If we'd always leave a fully allocated file around,
people would hate us...

The 8M increase is a middle ground: we don't allocate space for each
log message, and we don't allocate space for everything at once. We
allocate medium-sized chunks at a time.

Also, we vacuum old journals when allocating and the size constraints
are hit, i.e. if we detect that adding 8M to journal file X would mean
the space used by all journals together would be above the configured
disk usage limits, we'll delete the oldest journal files we can, until
we can allocate 8M again. And we do this each time. If we'd allocate
the full file every time, this would mean we'd likely remove ~256M of
logs whenever we start a new file. And that's just shitty behaviour.
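[Editor's note: to make the chunked allocation scheme described above
concrete, here is a minimal C sketch. The names (CHUNK_SIZE, MAX_SIZE,
grow_for_append) and the bookkeeping via an "allocated" counter are
hypothetical, invented for illustration; this is not journald's actual
code, and the vacuuming step is only marked with a comment.]

    #include <fcntl.h>
    #include <stdint.h>

    #define CHUNK_SIZE (8ULL * 1024 * 1024)   /* grow in 8M increments */
    #define MAX_SIZE   (128ULL * 1024 * 1024) /* rotate when full */

    /* Ensure there is room to append 'need' more bytes after offset
     * 'end', growing the file by one 8M chunk if necessary. Assumes a
     * single append never exceeds CHUNK_SIZE. Returns 0 on success,
     * -1 when the caller should rotate to a new file instead. */
    int grow_for_append(int fd, uint64_t end, uint64_t need,
                        uint64_t *allocated) {
            if (end + need <= *allocated)
                    return 0;             /* still inside the last chunk */

            uint64_t new_size = *allocated + CHUNK_SIZE;
            if (new_size > MAX_SIZE)
                    return -1;            /* size limit hit: rotate */

            /* The real code would first vacuum the oldest archived
             * journal files here, if these extra 8M would push the
             * total journal disk usage above the configured limits. */
            if (posix_fallocate(fd, 0, (off_t) new_size) != 0)
                    return -1;

            *allocated = new_size;
            return 0;
    }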
> But it gets worse.
> The way systemd-journald is submitting the journals
> for defragmentation is making them more fragmented than just leaving
> them alone.

Sounds like a bug in btrfs? systemd is not the place to hack around
btrfs bugs.

> If you want an optimization that's actually useful on Btrfs,
> /var/log/journal/ could be a nested subvolume. That would prevent any
> snapshots above from turning the nodatacow journals into datacow
> journals, which does significantly increase fragmentation (it would in
> the exact same case if it were a reflink copy on XFS, for that
> matter).

Not sure what the point of that would be... At least when systemd does
snapshots (i.e. systemd-nspawn --template= and so on) they are of
course recursive, so what'd be the point of doing a subvolume there?

> > Somehow I think you are missing what I am asking for: some data that
> > actually shows your optimization is worth it: i.e. that leaving the
> > files fragmented doesn't hurt access to the journal badly, and that
> > the number of iops is substantially lowered at the same time.
>
> I don't get the iops thing at all. What we care about in this case is
> latency. A least noticeable latency of around 150ms seems reasonable
> as a starting point, that's where users realize a delay between a key
> press and a character appearing. However, if I check for 10ms latency
> (using bcc-tools fileslower) when reading all of the above journals at
> once:
>
> $ sudo journalctl -D
> /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
>
> Not a single report. None. Nothing took even 10ms. And those journals
> are more fragmented than your 20 in a 100MB file.

Now use rotating media... Of course random access latency doesn't
matter as much on SSD.

> And the best of the two is fallocate+nodatacow which makes the
> journals behave the same as on ext4 where you also don't do
> defragmentation.

We'd probably defrag on ext4 during archival too, if it got us
anything.

Please provide profiling data showing that even on rotating media
defrag doesn't matter. Please work with the btrfs people if you think
defrag is broken. Please work with the btrfs people if you think the
FIEMAP ioctls are broken.

I mean, this is kinda what I am getting here: "On Chris' nvme, btrfs
defrag is broken, please don't do it, and oh, there's no working way
to detect if a file is fragmented, fiemap is broken too. And no, I am
not giving you any profiling data to go by, I just say so without
profiling anything, and I don't have any rotating media, and fuck
rotating media. And I am a big believer in btrfs, but apparently
everything is broken, and please make sure journald uses a write
pattern that might suck for everything else but works fine on my
personal nvme storage."

I mean, we can certainly change all this around. But give me some
basic, reasonable profiling data about latencies and iops and stuff.
Otherwise it's armchair optimization, i.e. premature optimization.

Lennart

--
Lennart Poettering, Berlin
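[Editor's note: on the "detect whether a file is fragmented" point
above, a minimal sketch of counting a file's extents via the FIEMAP
ioctl follows, as one way to measure fragmentation rather than guess.
This is illustrative only, not journald code; error handling is kept
deliberately simple.]

    #include <fcntl.h>
    #include <linux/fiemap.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
            if (argc != 2) {
                    fprintf(stderr, "Usage: %s FILE\n", argv[0]);
                    return 1;
            }

            int fd = open(argv[1], O_RDONLY|O_CLOEXEC);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* With fm_extent_count == 0 the kernel only reports the
             * number of extents in fm_mapped_extents, without copying
             * the extent records themselves (this is how filefrag
             * gets its count, too). */
            struct fiemap fm = {
                    .fm_start = 0,
                    .fm_length = FIEMAP_MAX_OFFSET,
                    .fm_flags = FIEMAP_FLAG_SYNC,
                    .fm_extent_count = 0,
            };

            if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
                    perror("FS_IOC_FIEMAP");
                    close(fd);
                    return 1;
            }

            printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
            close(fd);
            return 0;
    }

Running this on an archived journal file before and after defrag would
give a concrete before/after number to argue about.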