Re: consider dropping defrag of journals on btrfs

On Fri, 05.02.21 16:06, Dave Howorth (systemd@xxxxxxxxxxxxxx) wrote:

> On Fri, 5 Feb 2021 16:23:02 +0100
> Lennart Poettering <lennart@xxxxxxxxxxxxxx> wrote:
> > I don't think that makes much sense: we rotate and start new files for
> > a multitude of reasons, such as size overrun, time jumps, abnormal
> > shutdown and so on. If we'd always leave a fully allocated file around
> > people would hate us...
>
> I'm not sure about that. The file is eventually going to grow to 128 MB
> so if there isn't space for it, I might as well know right now as
> later. And it's not like the space will be available for anything else,
> it's left free for exactly this log file.

Let's say you assign 500M of space to journald. If you allocate 128M at a
time, the effective unused space is anything between 1M and 255M, leaving
you with only around 250M of actual logs. It's probably surprising that
you only end up with about half the logs when you asked for 500M. I'd
claim that's really shitty behaviour.
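
To spell out one possible worst case behind those figures (assuming both
the just-rotated file and the brand-new active file are fully preallocated
at 128M while holding almost no log data yet):

    allocated but nearly empty:  ~128M (just-rotated file) + ~128M (new active file) ≈ 255M
    space left for actual logs:  500M - ~255M ≈ 245M, i.e. only about half of what you configured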

> Or are you talking about left over files after some exceptional event
> that are only part full? If so, then just deallocate the unwanted empty
> space from them after you've recovered from the exceptional event.

Nah, it doesn't work like this: if a journal file isn't marked clean,
i.e. was left in some half-written state, we won't touch it, but just
archive it and start a new one. We don't know how much was correctly
written and how much wasn't, hence we can't sensibly truncate it. The
kernel, after all, is entirely free to decide in which order it syncs
written blocks to disk, and hence it quite often happens that stuff at
the end got synced while stuff in the middle didn't.
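
Schematically the rule is just this (a minimal illustrative sketch, not
the real journald code; the type, the state names and the helpers here
are all invented):

/* Minimal sketch of the "never reuse an unclean file" rule above.
 * Entirely illustrative: names, types and states are invented here
 * and do not match the real journal file format. */
#include <stdio.h>

typedef enum { FILE_STATE_CLEAN, FILE_STATE_DIRTY } file_state_t;

typedef struct {
        const char  *path;
        file_state_t state;   /* whatever the previous run left behind */
} journal_file_t;

/* Decide what to do with the file found on disk at startup. */
static void setup_active_file(journal_file_t *found) {
        if (found->state != FILE_STATE_CLEAN) {
                /* We can't know which blocks were synced before the crash,
                 * so we never truncate or repair it -- just archive as-is. */
                printf("archiving %s, starting a new journal file\n", found->path);
                return;
        }
        printf("continuing to write to %s\n", found->path);
}

int main(void) {
        journal_file_t f = { "system.journal", FILE_STATE_DIRTY };
        setup_active_file(&f);
        return 0;
}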

> > Also, we vacuum old journals when allocating and the size constraints
> > are hit. i.e. if we detect that adding 8M to journal file X would mean
> > the space used by all journals together would be above the configured
> > disk usage limits we'll delete the oldest journal files we can, until
> > we can allocate 8M again. And we do this each time. If we'd allocate
> > the full file all the time this means we'll likely remove ~256M of
> > logs whenever we start a new file. And that's just shitty behaviour.
>
> No it's not; it's exactly what happens most of the time, because all
> the old log files are exactly the same size because that's why they
> were rolled over. So freeing just one of those gives exactly the right
> size space for the new log file. I don't understand why you would want
> to free two?

Because of fs metadata, and because we don't always write files in
full. In fact, we often do not, because we start a new file *before*
it would grow beyond the threshold. This means it's typically not
enough to delete a single old file to get the space we need for a full
new one; we usually need to delete two.

Actually, it's even worse: btrfs lies in "df": it only updates its
counters with uncontrolled latency, hence we might actually delete more
than necessary.
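
To make the allocate-and-vacuum dance concrete, here is a toy sketch of
the "make room before growing the active file by 8M" loop (illustrative
C only, not the actual journald code; the 500M budget and all file sizes
are made-up numbers):

/* Toy model of the "vacuum before growing a journal file" behaviour
 * described above. Purely illustrative: none of this is journald's
 * actual code, and all sizes are made-up example numbers. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE (8ULL << 20)     /* journal files grow in 8M steps */
#define MAX_USE    (500ULL << 20)   /* hypothetical total budget: 500M */

/* Archived journal file sizes, oldest first (toy data). */
static uint64_t archived[] = { 128ULL << 20, 128ULL << 20, 96ULL << 20 };
static size_t n_archived = sizeof(archived) / sizeof(archived[0]);
static uint64_t total_used = (128ULL + 128 + 96 + 144) << 20; /* incl. active file */

/* Delete oldest archived files until one more CHUNK_SIZE fits under MAX_USE. */
static bool make_room_for_chunk(void) {
        size_t i = 0;

        while (total_used + CHUNK_SIZE > MAX_USE) {
                if (i >= n_archived)
                        return false;               /* nothing left to vacuum */
                printf("vacuuming oldest archived file (%llu MiB)\n",
                       (unsigned long long)(archived[i] >> 20));
                total_used -= archived[i++];        /* "delete" the oldest file */
        }
        return true;
}

int main(void) {
        if (make_room_for_chunk()) {
                total_used += CHUNK_SIZE;           /* grow the active file by 8M */
                printf("allocated 8 MiB more, total use now %llu MiB\n",
                       (unsigned long long)(total_used >> 20));
        }
        return 0;
}

With these example numbers, fitting one more 8M chunk under the 500M
budget means throwing away a whole 128M archived file, which is the
over-deletion effect described above.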

Lennart

--
Lennart Poettering, Berlin