On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering <lennart@xxxxxxxxxxxxxx> wrote:

> You want to optimize write pattersn I understand, i.e. minimize
> iops. Hence start with profiling iops, i.e. what defrag actually costs
> and then weight that agains the reduced access time when accessing the
> files. In particular on rotating media.

A nodatacow journal on Btrfs is no different from a journal on ext4 or XFS, so I don't understand why you think you *also* need to defragment the file, but only on Btrfs. You cannot do better than you already are with a nodatacow file. That file isn't going to get any more fragmented in use than it was at creation.

If you want to do better, stop appending in 8MB increments. Every append is another extent. Since the journal files apparently max out at 128MB before they are rotated, why aren't they created at 128MB from the very start? That has a decent chance of getting you a file with 1-4 extents, and it's never going to have more extents than that.

Presumably keeping the currently active journal unfragmented matters more than the archived ones, because searches hit recent events more than old ones. Right? So if you're going to say fragmentation matters at all, maybe stop intentionally fragmenting the active journal? Just fallocate the maximum size it's ever going to reach, right off the bat, no matter what file system it's on. Once that 128MB journal is full, leave it alone and rotate to a new 128MB file. The append is what's making them fragmented.

But it gets worse. The way systemd-journald submits the journals for defragmentation makes them more fragmented than just leaving them alone.

https://drive.google.com/file/d/1FhffN4WZZT9gZTnG5VWongWJgPG_nlPF/view?usp=sharing

All of those archived files have more fragments (post defrag) than they had while they were active. And here is the FIEMAP for the 96MB file, which has 92 fragments:

https://drive.google.com/file/d/1Owsd5DykNEkwucIPbKel0qqYyS134-tB/view?usp=sharing

I don't know whether it's a bug in the target size sd-journald submits or a bug in Btrfs, but it doesn't really matter: there is no benefit to defragmenting nodatacow journals that were fallocated at creation.

If you want an optimization that's actually useful on Btrfs, /var/log/journal/ could be a nested subvolume. That would prevent any snapshot taken above it from turning the nodatacow journals into datacow journals, which does significantly increase fragmentation (the same thing would happen with a reflink copy on XFS, for that matter).

> No, but doing this once in a big linear stream when the journal is
> archived might not be so bad if then later on things are much faster
> to access for all future because the files aren't fragmented.

OK, well, in practice it's worse than doing nothing, so I'm suggesting doing nothing.

> Somehow I think you are missing what I am asking for: some data that
> actually shows your optimization is worth it: i.e. that leaving the
> files fragment doesn't hurt access to the journal badly, and that the
> number of iops is substantially lowered at the same time.

I don't get the iops thing at all. What we care about in this case is latency. A least-noticeable latency of around 150ms seems reasonable as a starting point; that's roughly where users start to perceive a delay between a key press and the character appearing.
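Concretely, the 10ms check below amounts to running bcc-tools' fileslower in one terminal while journalctl reads the journals back in another. A rough sketch (the tool path is where Fedora's bcc-tools package installs it, adjust for your distro; the argument is the threshold in milliseconds):

    # Terminal 1: report any synchronous file read or write that takes
    # longer than 10 ms, system-wide, while the journals are read back
    # in another terminal.
    sudo /usr/share/bcc/tools/fileslower 10

Any read during the journal replay that crosses the threshold shows up as a line with the process, file name, and latency. Silence means nothing took even 10ms.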
However, if I check for 10ms latency (using bcc-tools fileslower) when reading all of the above journals at once:

$ sudo journalctl -D /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager

Not a single report. None. Nothing took even 10ms. And those journals are more fragmented than your 20-in-a-100MB-file example.

I don't have any hard drives to test this on. They're what, 10% of the market at this point? The best you can do there is the same as on SSD. And you can't depend on sysfs to restrict defragmentation to rotational media only; too many fragile media claim to be rotating.

By the way, I use Btrfs on an SD card in a Raspberry Pi Zero, of all things. The cards last longer than with other file systems because of the net lower write amplification from native compression. I wouldn't be surprised if the cards failed sooner if I weren't using compression. But who knows, maybe Btrfs write amplification compared to ext4's and XFS's constant journaling ends up being a wash. There are a number of embedded use cases for Btrfs as well. Is compressed F2FS better? Probably. It has a solution for the wandering trees problem, but also no snapshots or data checksumming. I don't think any of that is super relevant to the overall topic, though; I just offer it as a counter-argument to the claim that Btrfs isn't appropriate for small, cheap storage devices.

> The thing is that we tend to have few active files and many archived
> files, and since we interleave stuff our access patterns are pretty
> bad already, so we don't want to spend even more time on paying for
> extra bad access patterns becuase the archived files are fragment.

Right. So pick a size for the journal file; I don't really care what it is, but they seem to get upwards of 128MB, so just use that. Make a 128MB file from the very start, fallocate it, and then, when it's full, rotate and create a new one. Stop the anti-pattern of tacking on 8MB increments. And stop defragmenting them. That is the best scenario for HDDs, USB sticks, and NVMe alike. (A rough sketch of that creation sequence is below my signature.)

Looking at the two original commits, I think they were always in conflict with each other, having landed within months of each other. They are independent ways of dealing with the same problem, and only one of them is needed. The better of the two is fallocate+nodatacow, which makes the journals behave the same as on ext4, where you also don't defragment them.

--
Chris Murphy
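For concreteness, here is the creation sequence I keep describing, done from the shell. This is only a sketch of the idea, not what journald does internally (journald would do the equivalent from C with fallocate() and the nocow file attribute), and the file name and size are just for illustration:

    # Create an empty file and mark it nodatacow before it contains any
    # data; on Btrfs, chattr +C only takes effect on empty files.
    touch demo.journal
    chattr +C demo.journal

    # Preallocate the full 128MB in one step instead of growing the file
    # in 8MB appends as it fills.
    fallocate -l 128M demo.journal

    # Verify: lsattr should show the 'C' (nocow) attribute, and filefrag
    # (which uses FIEMAP) should report only a handful of extents.
    lsattr demo.journal
    filefrag demo.journal

Write into that preallocated space, never grow past it, rotate to a fresh file when it's full, and there is nothing left for a defrag pass to improve.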