On Fri, Mar 26, 2021 at 4:00 PM Roberto Ragusa <mail@xxxxxxxxxxxxxxxx> wrote:

> Well, there is no reason for fsync to block everything else.

In practice, it does. There's only one thing happening at a time with a HDD, so while the write is flushing, nothing else is going to get either a read or a write in. That's why it's not a great idea to build up a lot of dirty data and hit it with fsync, all day long, unless you're a server whose task it is to do that particular workload. Mixed workloads are much harder, and that's what we have on the desktop.

> The meaning of fsync is that process is telling "I will not proceed until
> you tell me this file has reached the disk", and that is a hint to
> the kernel to begin writing with the objective to let the process
> get unstuck.

And what if you have two callers of fsync? If the first one has 10 seconds of writeback to do on fsync, what happens to the fsync of another caller? It's going to have to wait 10 seconds *plus* the time for its own writeback. This is why you want maybe a second or two of writeback, and programs that aren't recklessly hammering their files with fsync just because they think that's the only way they're ever going to get on disk.

> Indeed fsync doesn't mean "hey, I am in emergency mode, stop everything else
> because my stuff is important".

If you have multiple aggressive writers calling any kind of sync, you have the potential for contention. When concurrent writes and syncs happen through a single point-in-time writer, what should happen? Btrfs can actually do a better job of this because it'll tend to aggregate those random writes into sequential writes, interleaving them. The problem with that interleaving comes at read time, because now there's fragmentation.
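The fsync contract described above ("I will not proceed until you tell me this file has reached the disk") can be sketched in a few lines of Python. This is just an illustration of the semantics, not code from the thread; the path and function name are made up:

```python
import os
import tempfile

def durable_write(path, data):
    """Write data and block until the kernel reports it has reached stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        # os.fsync() blocks this process until the file's dirty pages (and
        # metadata) are flushed. On a busy HDD, every queued write ahead of
        # us adds to this latency -- which is the contention discussed above.
        os.fsync(fd)
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "bookmark.db")
durable_write(path, b"hello")
print(os.path.getsize(path))  # 5
```

Doing this on a UI thread means the UI stalls for however long the flush takes, which is exactly the problem with fsync-happy desktop applications.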
The fragmentation problem is sometimes overstated because contiguity isn't as important as proximity: you can have nearby blocks resulting in low read latency even if they aren't contiguous, and quite a lot of engineering has gone into making drives and drive controllers do that work. But it can't overcome significantly different placement of a file's blocks. If all of them need to be read and they're far apart due to prior interleaved writes, you'll see seek latency go up. There's no free lunch; there are tradeoffs for everything. There's a reason for 15k RPM hard drives, after all.

> So a good filesystem on a good kernel will correctly apply priorities,
> fairness etc. to let other processes do their I/O.

They can do their own buffered writes while fsync is happening. Concurrent fsyncs mean competition. If the data being fsync'd is small, then it's not likely to get noticed until you have 3 or more contenders (for spinning media), enough to get to around 150ms of latency, which a person will notice. Noticed, not necessarily annoyed. But if you have even one process producing a lot of anonymous pages and fsyncing frequently, you're going to see tens of seconds of contention for a device that cannot do simultaneous writes and will be very reluctant to interleave other I/O.

> You are not irreversibly queueing 40G to the drive, the drive is going to
> get small operations (e.g. 1000 blocks) and there is a chance for
> the kernel to insert other I/O in the flow.

Sure, and what takes 5 seconds as a dedicated fsync now becomes 20 seconds when you add a bunch of reads to it, and both the reads and writes will be noticeably slower than they were when they weren't in contention. This is why multiqueue, low-latency drives are vastly better for mixed workloads.
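To make the "concurrent fsyncs mean competition" point concrete, here is a rough timing harness: two threads each write a megabyte and fsync at the same time, so both flushes contend for the one device. This is an illustrative sketch only; absolute numbers depend entirely on the device, and on a fast SSD or tmpfs the contention may be invisible:

```python
import os
import tempfile
import threading
import time

def writer(path, size, results, idx):
    """Write `size` bytes, then time how long the fsync takes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, os.urandom(size))
        start = time.monotonic()
        os.fsync(fd)  # both threads reach this at roughly the same time
        results[idx] = time.monotonic() - start
    finally:
        os.close(fd)

tmp = tempfile.mkdtemp()
results = [0.0, 0.0]
threads = [
    threading.Thread(target=writer,
                     args=(os.path.join(tmp, f"f{i}"), 1 << 20, results, i))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print([f"{r:.4f}s" for r in results])
```

On a spinning disk, scaling the size up and adding more threads makes the per-caller fsync latency grow well past what any single caller would see alone, which is the 10-seconds-plus-your-own-writeback scenario described earlier.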
> But there are two issues to consider:
> 1) there could be huge "irreversible" queues somewhere; this problem
> is similar to bufferbloat for network packets, but I do not think I/O
> suffers too much, considering there are no intermediate nodes in the
> middle
> 2) there must not be shortcomings in the filesystem code; for example,
> ext3 ordered mode was flushing everything when asked to flush a 1kB file;
> I don't know about ext4, I don't know about btrfs

Btrfs flushes just the files in the directory being fsync'd, but there can be contention on the tree log, which is used to make fsyncs performant. There is a tree log per subvolume. I doubt that separating the two workloads into separate subvolumes will help in this case; it doesn't sound like either workload is really that significant, but I don't know what's going on, so it could be worth a try. But I'd say if you're fiddling with things on this level, it's important to be really rigorous and only apply one change at a time; otherwise it's impossible to know what made things better or worse.

Note that while subvolumes are mostly like directories, they are separate namespaces with separate file descriptors; stat will show them as different devices, and they have their own pool of inodes. You can't create hardlinks across subvolumes. (You can create reflinks across subvolumes, but there is a VFS limitation that forbids reflinks that cross mount points.)

> In summary:
> - if a process calls fsync and then complains about having to wait
> to get unblocked, it is just creating its own problem (are you doing fsync
> of big things in your UI thread?)
> - if a process gets heavily delayed because another process is doing
> fsync the kernel is not doing its job in terms of fairness

It very much depends on the workload. And there's also cgroup io.latency to consider as well. The desktop is in a better position to make decisions on what's more important: UI/UX responsiveness is often more important than performance.
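Since stat reports each subvolume as a different device, a program can handle the no-hardlinks-across-subvolumes rule by catching EXDEV. A hypothetical helper (not from the thread; run on a plain filesystem here, it will simply succeed):

```python
import errno
import os
import tempfile

def try_hardlink(src, dst):
    """Attempt a hardlink, reporting when src and dst sit on different
    devices -- e.g. two btrfs subvolumes, which stat shows as separate
    st_dev values and on which os.link() fails with EXDEV."""
    try:
        os.link(src, dst)
        return "linked"
    except OSError as e:
        if e.errno == errno.EXDEV:
            return "cross-device: copy instead (or reflink within one mount)"
        raise

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "a")
open(src, "w").close()
print(try_hardlink(src, os.path.join(tmp, "b")))  # same filesystem here: "linked"
print(os.stat(src).st_nlink)  # 2
```

The same EXDEV error is what the VFS returns for a reflink attempt that crosses mount points, even between subvolumes where reflinks are otherwise allowed.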
> I can agree that reality may not be ideal, but I don't like this
> attitude of "dropping the ball" by disabling caching here and there
> because (provocative paradox) web browser authors are using fsync
> for bookmarks and cookies DB in the UI thread.

No one has suggested disabling caching. Reducing the *time* to start writeback is what was suggested, and we don't even know if that matters anymore. I wasn't even the one who first suggested it; it goes back to Linus saying the defaults are crazy 10 years ago.

> NOTE:
> I know about eatmydata, I've used it sometimes.
> There is also this nice trick for programs stupidly doing too many fsync:
> systemd-nspawn --system-call-filter='~sync:0 fsync:0'

That is awesome! Way easier to deal with than eatmydata.

--
Chris Murphy
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure