On Fri, Feb 5, 2021 at 8:23 AM Phillip Susi <phill@xxxxxxxxxxxx> wrote:
>
> Chris Murphy writes:
>
> > But it gets worse. The way systemd-journald is submitting the journals
> > for defragmentation is making them more fragmented than just leaving
> > them alone.
>
> Wait, doesn't it just create a new file, fallocate the whole thing, copy
> the contents, and delete the original?

Same inode, so no. As to the logic, I don't know. I'll ask upstream to
document it.

> How can that possibly make
> fragmentation *worse*?

I'm only seeing this pattern with journald journals, and
BTRFS_IOC_DEFRAG. But I'm also seeing it with all archived journals.
Meanwhile, active journals show the same pattern as on ext4 and xfs,
with no worse fragmentation.

Consider other storage technologies where COW and snapshots come into
play. For example, anything based on device-mapper thin provisioning is
going to run into these issues. How it allocates physical extents isn't
up to the file system. Duplicate a file and delete the original, and you
might get a more fragmented file as well. The physical layout is
entirely decoupled from the file system: the file system could tell you
"no fragmentation" and yet the data is highly fragmented, or vice versa.

These problems are not unique to Btrfs. Is there a VFS API for handling
these issues? Should there be? I really don't think any application,
including journald, should have to micromanage these kinds of things on
a case-by-case basis. General problems like this need general solutions.

> > All of those archived files have more fragments (post defrag) than
> > they had when they were active. And here is the FIEMAP for the 96MB
> > file which has 92 fragments.
>
> How the heck did you end up with nearly 1 frag per mb?

I didn't do anything special; it's a default configuration. I'll ask
Btrfs developers about it. Maybe it's one of those artifacts of FIEMAP
I mentioned previously.
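
For scale, here is the arithmetic behind "nearly 1 frag per mb". This is
only a sketch: the function name is hypothetical, and the inputs are the
96MB and 92 fragments quoted above, taken at face value from FIEMAP.

```python
def fragment_stats(size_mb: float, fragments: int) -> tuple:
    """Return (fragments per MB, average fragment size in MB)."""
    if size_mb <= 0 or fragments <= 0:
        raise ValueError("size_mb and fragments must be positive")
    return fragments / size_mb, size_mb / fragments


# Figures from the thread: a 96MB archived journal with 92 fragments.
per_mb, avg_mb = fragment_stats(96, 92)
print(f"{per_mb:.2f} fragments/MB, {avg_mb:.2f} MB average fragment")
# prints "0.96 fragments/MB, 1.04 MB average fragment"
```

So the average fragment is about 1 MB, assuming the FIEMAP extent count
reflects the real layout.
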
Maybe it's not that badly fragmented from the point of view of a drive
that's going to reorder reads anyway to be more efficient.

> > If you want an optimization that's actually useful on Btrfs,
> > /var/log/journal/ could be a nested subvolume. That would prevent any
> > snapshots above from turning the nodatacow journals into datacow
> > journals, which does significantly increase fragmentation (it would in
> > the exact same case if it were a reflink copy on XFS for that matter).
>
> Wouldn't that mean that when you take snapshots, they don't include the
> logs?

That's a snapshot/rollback regime design and policy question. If you
snapshot the subvolume that contains the journals, the journals will be
in the snapshot. The user space tools do not have an option for
recursive snapshots, so snapshotting does end at subvolume boundaries.
If you want the journals snapshotted, then their enclosing subvolume
would need to be snapshotted.

> That seems like an anti feature that violates the principal of
> least surprise. If I make a snapshot of my root, I *expect* it to
> contain my logs.

You can only roll back what you snapshot. If you snapshot a root
without excluding the journals and then roll back, you roll back the
journals too. That's data loss.

(open)SUSE has a snapshot/rollback regime configured and enabled by
default out of the box. Logs are excluded from it, same as the
bootloader. (Although I'll also note they default to volatile systemd
journals, and use rsyslogd for persistent logs.) Fedora meanwhile does
have persistent journald journals in the root subvolume, but there's no
snapshot/rollback regime enabled out of the box. I'm inclined to have
them excluded, not so much to avoid COW of the nodatacow journals as to
avoid discontinuity in the journals upon rollback.

>
> > I don't get the iops thing at all. What we care about in this case is
> > latency.
> > A least noticeable latency of around 150ms seems reasonable
> > as a starting point, that's where users realize a delay between a key
> > press and a character appearing. However, if I check for 10ms latency
> > (using bcc-tools fileslower) when reading all of the above journals at
> > once:
> >
> > $ sudo journalctl -D /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
> >
> > Not a single report. None. Nothing took even 10ms. And those journals
> > are more fragmented than your 20 in a 100MB file.
> >
> > I don't have any hard drives to test this on. This is what, 10% of the
> > market at this point? The best you can do there is the same as on SSD.
>
> The above sounded like great data, but not if it was done on SSD.

Right. But also I can't disable the defragmentation in order to do a
proper test on HDD.

> > You can't depend on sysfs to conditionally do defragmentation on only
> > rotational media, too many fragile media claim to be rotating.
>
> It sounds like you are arguing that it is better to do the wrong thing
> on all SSDs rather than do the right thing on ones that aren't broken.

No, I'm suggesting there isn't currently a way to isolate
defragmentation to just HDDs.

--
Chris Murphy
_______________________________________________
systemd-devel mailing list
systemd-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
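
For reference, the sysfs check discussed above amounts to reading the
kernel's per-device rotational flag. A minimal sketch, not what journald
does: the function name, the `sysfs_root` parameter (added so the check
can be pointed at a test directory), and the device name `sda` are all
assumptions for illustration.

```python
from pathlib import Path
from typing import Optional


def is_rotational(device: str, sysfs_root: str = "/sys/block") -> Optional[bool]:
    """Return the rotational flag the kernel reports for a block device.

    True means the device claims to be rotational, False means it does
    not, and None means the flag could not be read.
    """
    flag = Path(sysfs_root) / device / "queue" / "rotational"
    try:
        return flag.read_text().strip() == "1"
    except OSError:
        return None


if __name__ == "__main__":
    # "sda" is a hypothetical device name; substitute a real one.
    print(is_rotational("sda"))
```

The function can only report what the device claims, so it carries
exactly the limitation described above: media that wrongly claim to be
rotating are indistinguishable from real HDDs.
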