On Thu, 04.02.21 12:51, Chris Murphy (lists@xxxxxxxxxxxxxxxxx) wrote:

> On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering
> <lennart@xxxxxxxxxxxxxx> wrote:
> >
> > You want to optimize write patterns I understand, i.e. minimize
> > iops. Hence start with profiling iops, i.e. what defrag actually
> > costs, and then weigh that against the reduced access time when
> > accessing the files. In particular on rotating media.
>
> A nodatacow journal on Btrfs is no different than a journal on ext4 or
> xfs. So I don't understand why you think you *also* need to defragment
> the file, only on Btrfs. You cannot do better than you already are
> with a nodatacow file. That file isn't going to get any more fragmented
> in use than it was at creation.

You know, we issue the btrfs ioctl under the assumption that if the
file is already perfectly defragmented it's a NOP. Are you suggesting
it isn't a NOP in that case?

> If you want to do better, maybe stop appending in 8MB increments?
> Every time you append it's another extent. Since apparently the
> journal files can max out at 128MB before they are rotated, why aren't
> they created 128MB from the very start? That would have a decent
> chance of getting you a file that's 1-4 extents, and it's not going to
> have more extents than that.

You know, there are certainly "perfect" ways to adjust our writing
scheme to match some specific file system on some specific storage
matching some specific user pattern. Thing is though, what might be
ideal for some fs and some user might be terrible for another fs or
another user. We try to find some compromise in the middle that might
not result in "perfect" behaviour everywhere, but at least in
reasonable behaviour.

> Presumably the currently active journal not being fragmented is more
> important than archived journals, because searches will happen on
> recent events more than old events. Right?

Nope. We always interleave stuff. We currently open all journal files
in parallel: the system one and the per-user ones, the current ones
and the archived ones.

> So if you're going to say
> fragmentation matters at all, maybe stop intentionally fragmenting the
> active journal?

We are not *intentionally* fragmenting. Please don't argue on that
level. Not helpful, man.

> Just fallocate the max size it's going to be right off
> the bat? Doesn't matter what file system it is. Once that 128MB
> journal is full, leave it alone, and rotate to a new 128M file. The
> append is what's making them fragmented.

I don't think that makes much sense: we rotate and start new files for
a multitude of reasons, such as size overrun, time jumps, abnormal
shutdown and so on. If we'd always leave a fully allocated file around,
people would hate us...

The 8M increase is a middle ground: we don't allocate space for each
log message, and we don't allocate space for everything at once. We
allocate medium-sized chunks at a time.

Also, we vacuum old journals when allocating and the size constraints
are hit, i.e. if we detect that adding 8M to journal file X would mean
the space used by all journals together would be above the configured
disk usage limits, we'll delete the oldest journal files we can, until
we can allocate 8M again. And we do this each time. If we'd allocate
the full file every time, this would mean we'd likely remove ~256M of
logs whenever we start a new file. And that's just shitty behaviour.
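[Editor's note: to make the chunked allocation scheme described above
concrete, here is a minimal C sketch. The names (CHUNK_SIZE, MAX_SIZE,
grow_for_append) and the bookkeeping via an "allocated" counter are
hypothetical, invented for illustration; this is not journald's actual
code, and the vacuuming step is only marked with a comment.]

    #include <fcntl.h>
    #include <stdint.h>

    #define CHUNK_SIZE (8ULL * 1024 * 1024)   /* grow in 8M increments */
    #define MAX_SIZE   (128ULL * 1024 * 1024) /* rotate when full */

    /* Ensure there is room to append 'need' more bytes after offset
     * 'end', growing the file by one 8M chunk if necessary. Assumes a
     * single append never exceeds CHUNK_SIZE. Returns 0 on success,
     * -1 when the caller should rotate to a new file instead. */
    int grow_for_append(int fd, uint64_t end, uint64_t need,
                        uint64_t *allocated) {
            if (end + need <= *allocated)
                    return 0;             /* still inside the last chunk */

            uint64_t new_size = *allocated + CHUNK_SIZE;
            if (new_size > MAX_SIZE)
                    return -1;            /* size limit hit: rotate */

            /* The real code would first vacuum the oldest archived
             * journal files here, if these extra 8M would push the
             * total journal disk usage above the configured limits. */
            if (posix_fallocate(fd, 0, (off_t) new_size) != 0)
                    return -1;

            *allocated = new_size;
            return 0;
    }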
> But it gets worse.
> The way systemd-journald is submitting the journals
> for defragmentation is making them more fragmented than just leaving
> them alone.

Sounds like a bug in btrfs? systemd is not the place to hack around
btrfs bugs.

> If you want an optimization that's actually useful on Btrfs,
> /var/log/journal/ could be a nested subvolume. That would prevent any
> snapshots above from turning the nodatacow journals into datacow
> journals, which does significantly increase fragmentation (it would in
> the exact same case if it were a reflink copy on XFS, for that
> matter).

Not sure what the point of that would be... At least when systemd does
snapshots (i.e. systemd-nspawn --template= and so on) they are of
course recursive, so what'd be the point of doing a subvolume there?

> > Somehow I think you are missing what I am asking for: some data that
> > actually shows your optimization is worth it: i.e. that leaving the
> > files fragmented doesn't hurt access to the journal badly, and that
> > the number of iops is substantially lowered at the same time.
>
> I don't get the iops thing at all. What we care about in this case is
> latency. A least noticeable latency of around 150ms seems reasonable
> as a starting point, that's where users realize a delay between a key
> press and a character appearing. However, if I check for 10ms latency
> (using bcc-tools fileslower) when reading all of the above journals at
> once:
>
> $ sudo journalctl -D
> /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
>
> Not a single report. None. Nothing took even 10ms. And those journals
> are more fragmented than your 20 in a 100MB file.

Now use rotating media... Of course random access latency doesn't
matter as much on SSD.

> And the best of the two is fallocate+nodatacow which makes the
> journals behave the same as on ext4 where you also don't do
> defragmentation.

We'd probably defrag on ext4 during archival too, if it got us
anything.

Please provide profiling data showing that even on rotating media
defrag doesn't matter. Please work with the btrfs people if you think
defrag is broken. Please work with the btrfs people if you think the
FIEMAP ioctls are broken.

I mean, this is kinda what I am getting here: "On Chris' nvme, btrfs
defrag is broken, please don't do it, and oh, there's no working way
to detect if a file is fragmented, fiemap is broken too. And no, I am
not giving you any profiling data to go by, I just say so without
profiling anything, and I don't have any rotating media, and fuck
rotating media. And I am a big believer in btrfs, but apparently
everything is broken, and please make sure journald uses a write
pattern that might suck for everything else but works fine on my
personal nvme storage."

I mean, we can certainly change all this around. But give me some
basic, reasonable profiling data about latencies and iops and stuff.
Otherwise it's armchair optimization, i.e. premature optimization.

Lennart

--
Lennart Poettering, Berlin
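[Editor's note: on the "detect whether a file is fragmented" point
above, a minimal sketch of counting a file's extents via the FIEMAP
ioctl follows, as one way to measure fragmentation rather than guess.
This is illustrative only, not journald code; error handling is kept
deliberately simple.]

    #include <fcntl.h>
    #include <linux/fiemap.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
            if (argc != 2) {
                    fprintf(stderr, "Usage: %s FILE\n", argv[0]);
                    return 1;
            }

            int fd = open(argv[1], O_RDONLY|O_CLOEXEC);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* With fm_extent_count == 0 the kernel only reports the
             * number of extents in fm_mapped_extents, without copying
             * the extent records themselves (this is how filefrag
             * gets its count, too). */
            struct fiemap fm = {
                    .fm_start = 0,
                    .fm_length = FIEMAP_MAX_OFFSET,
                    .fm_flags = FIEMAP_FLAG_SYNC,
                    .fm_extent_count = 0,
            };

            if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
                    perror("FS_IOC_FIEMAP");
                    close(fd);
                    return 1;
            }

            printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
            close(fd);
            return 0;
    }

Running this on an archived journal file before and after defrag would
give a concrete before/after number to argue about.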