>>> Phillip Susi <phill@xxxxxxxxxxxx> wrote on 05.02.2021 at 16:02 in message
<87a6si5yjq.fsf@xxxxxxxxxxxxxxxx>:
> Chris Murphy writes:
>
>> But it gets worse. The way systemd-journald is submitting the journals
>> for defragmentation is making them more fragmented than just leaving
>> them alone.
>
> Wait, doesn't it just create a new file, fallocate the whole thing, copy
> the contents, and delete the original? How can that possibly make
> fragmentation *worse*?
>
>> All of those archived files have more fragments (post defrag) than
>> they had when they were active. And here is the FIEMAP for the 96MB
>> file which has 92 fragments.
>
> How the heck did you end up with nearly 1 frag per MB?

I didn't follow the thread closely, but there was a happy mix of IOPS and
fragments in it (and no bandwidth). Still I wonder: isn't it the concept of
BtrFS that writes end up fragmented when there is no contiguous free space?
The idea was *not* to spend time trying to find a good place to write to,
but to use the next available one.
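(As an aside: extent counts like the FIEMAP numbers Chris quotes can be
reproduced with "filefrag", or directly with the FIEMAP ioctl. A minimal
sketch of the latter, my illustration rather than any real tool, with
error handling kept short:

    /* fragcount.c - print the number of extents backing a file.
     * Build with: cc -o fragcount fragcount.c */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* With fm_extent_count == 0 the kernel returns no extent data;
         * it only fills fm_mapped_extents with the total extent count. */
        struct fiemap fm;
        memset(&fm, 0, sizeof(fm));
        fm.fm_length = FIEMAP_MAX_OFFSET;
        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) { perror("FIEMAP"); return 1; }

        printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
        close(fd);
        return 0;
    }

Run against an archived journal file this prints roughly the same extent
count that filefrag reports.)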
>
>> If you want an optimization that's actually useful on Btrfs,
>> /var/log/journal/ could be a nested subvolume. That would prevent any

Actually I still didn't get the benefit of a BtrFS subvolume, but that's a
different topic: don't all writes end up in a single storage pool anyway?

>> snapshots above from turning the nodatacow journals into datacow
>> journals, which does significantly increase fragmentation (it would in
>> the exact same case if it were a reflink copy on XFS for that matter).
>
> Wouldn't that mean that when you take snapshots, they don't include the
> logs? That seems like an anti-feature that violates the principle of
> least surprise. If I make a snapshot of my root, I *expect* it to
> contain my logs.
>
>> I don't get the iops thing at all. What we care about in this case is
>> latency. A least noticeable latency of around 150ms seems reasonable
>> as a starting point; that's where users realize a delay between a key
>> press and a character appearing. However, if I check for 10ms latency
>> (using bcc-tools fileslower) when reading all of the above journals at
>> once:
>>
>> $ sudo journalctl -D
>> /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
>>
>> Not a single report. None. Nothing took even 10ms. And those journals
>> are more fragmented than your 20 in a 100MB file.
>>
>> I don't have any hard drives to test this on. This is what, 10% of the
>> market at this point? The best you can do there is the same as on SSD.
>
> The above sounded like great data, but not if it was done on SSD. Of
> course it doesn't cause latency on an SSD. I don't know about market
> trends, but I stopped trusting my data to SSDs a few years ago, when my
> ext4 fs kept being corrupted and it appeared that the FTL of the drive
> was randomly swapping the contents of different sectors around: I found
> things like the contents of a text file in a block of the inode table
> or a directory.
>
>> You can't depend on sysfs to conditionally do defragmentation on only
>> rotational media, too many fragile media claim to be rotating.

Probably to keep software from breaking... ;-)

> It sounds like you are arguing that it is better to do the wrong thing
> on all SSDs rather than do the right thing on the ones that aren't
> broken.
>
>> Looking at the two original commits, I think they were always in
>> conflict with each other, happening within months of each other. They
>> are independent ways of dealing with the same problem, where only one
>> of them is needed. And the best of the two is fallocate+nodatacow,
>> which makes the journals behave the same as on ext4, where you also
>> don't do defragmentation.
>
> This makes sense.
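(To make that concrete: here is a rough sketch of what fallocate+nodatacow
amounts to when a journal file is created on BtrFS. This is my
illustration, not the actual systemd code; the key detail is that the
NOCOW attribute can only be set while the file is still empty:

    /* nocow_create.c - create a preallocated NOCOW file.
     * Build with: cc -o nocow_create nocow_create.c */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(void)
    {
        int fd = open("test.journal", O_RDWR | O_CREAT | O_EXCL, 0640);
        if (fd < 0) { perror("open"); return 1; }

        /* Equivalent of "chattr +C": must happen before any data is
         * written, because BtrFS honors NOCOW only on empty files. */
        int attr = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0) { perror("GETFLAGS"); return 1; }
        attr |= FS_NOCOW_FL;
        if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0)
            perror("SETFLAGS"); /* non-fatal: fs may not support NOCOW */

        /* Preallocate 8 MiB so appended records land in contiguous,
         * already-reserved space instead of scattered new extents. */
        if (fallocate(fd, 0, 0, 8 * 1024 * 1024) < 0) { perror("fallocate"); return 1; }

        close(fd);
        return 0;
    }

With NOCOW set, BtrFS overwrites blocks in place, much like ext4 does, so
the preallocated extent layout survives the append-heavy write pattern and
there is nothing left to defragment afterwards.)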