On Fri, Mar 26, 2021 at 4:00 PM Roberto Ragusa <mail@xxxxxxxxxxxxxxxx> wrote:

> Well, there is no reason for fsync to block everything else.

In practice, it does. There's only one thing happening at a time with a HDD, so while the write is flushing, nothing else is going to get either a read or a write in. That's why it's not a great idea to build up a lot of dirty data and hit it with fsync, all day long, unless you're a server whose task it is to do that particular workload. Mixed workloads are much harder, and that's what we have on the desktop.

> The meaning of fsync is that process is telling "I will not proceed until
> you tell me this file has reached the disk", and that is a hint to
> the kernel to begin writing with the objective to let the process
> get unstuck.

And what if you have two callers of fsync? If the first one has 10 seconds of writeback to do on fsync, what happens to the fsync of another caller? It's going to have to wait 10 seconds *plus* the time for its own writeback. This is why you want maybe a second or two of writeback, and programs that aren't recklessly hammering their files with fsync just because they think that's the only way they're ever going to get on disk.

> Indeed fsync doesn't mean "hey, I am in emergency mode, stop everything else
> because my stuff is important".

If you have multiple aggressive writers calling any kind of sync, you have the potential for contention. When concurrent writes and syncs happen through a single point-in-time writer, what should happen? Btrfs can actually do a better job of this because it'll tend to aggregate those random writes into sequential writes, interleaving them. The problem with that interleaving comes at read time, because now there's fragmentation.
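The fsync contract described above ("I will not proceed until you tell me this file has reached the disk") can be sketched in a few lines of Python. This is just an illustration of the semantics, not code from the thread; the path and function name are made up:

```python
import os
import tempfile

def durable_write(path, data):
    """Write data and block until the kernel reports it has reached stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        # os.fsync() blocks this process until the file's dirty pages (and
        # metadata) are flushed. On a busy HDD, every queued write ahead of
        # us adds to this latency -- which is the contention discussed above.
        os.fsync(fd)
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "bookmark.db")
durable_write(path, b"hello")
print(os.path.getsize(path))  # 5
```

Doing this on a UI thread means the UI stalls for however long the flush takes, which is exactly the problem with fsync-happy desktop applications.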
The fragmentation problem is sometimes overstated because contiguity isn't as important as proximity: you can have nearby blocks resulting in low read latency even if they aren't contiguous, and quite a lot of engineering has gone into making drives and drive controllers do that work. But it can't overcome significantly different placement of a file's blocks. If all of them need to be read and they're far apart due to prior interleaved writes, you'll see seek latency go up. There's no free lunch; there are tradeoffs for everything. There's a reason for 15k RPM hard drives, after all.

> So a good filesystem on a good kernel will correctly apply priorities,
> fairness etc. to let other processes do their I/O.

They can do their own buffered writes while fsync is happening. Concurrent fsyncs mean competition. If the data being fsync'd is small, then it's not likely to get noticed until you have 3 or more contenders (for spinning media), enough to get to around 150ms of latency, which a person will notice. Noticed, not necessarily annoyed. But if you have even one process producing a lot of anonymous pages and fsyncing frequently, you're going to see tens of seconds of contention for a device that cannot do simultaneous writes and will be very reluctant to interleave other I/O.

> You are not irreversibly queueing 40G to the drive, the drive is going to
> get small operations (e.g. 1000 blocks) and there is a chance for
> the kernel to insert other I/O in the flow.

Sure, and what takes 5 seconds as a dedicated fsync now becomes 20 seconds when you add a bunch of reads to it, and both the reads and writes will be noticeably slower than they were when they weren't in contention. This is why multiqueue, low-latency drives are vastly better for mixed workloads.
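To make the "concurrent fsyncs mean competition" point concrete, here is a rough timing harness: two threads each write a megabyte and fsync at the same time, so both flushes contend for the one device. This is an illustrative sketch only; absolute numbers depend entirely on the device, and on a fast SSD or tmpfs the contention may be invisible:

```python
import os
import tempfile
import threading
import time

def writer(path, size, results, idx):
    """Write `size` bytes, then time how long the fsync takes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, os.urandom(size))
        start = time.monotonic()
        os.fsync(fd)  # both threads reach this at roughly the same time
        results[idx] = time.monotonic() - start
    finally:
        os.close(fd)

tmp = tempfile.mkdtemp()
results = [0.0, 0.0]
threads = [
    threading.Thread(target=writer,
                     args=(os.path.join(tmp, f"f{i}"), 1 << 20, results, i))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print([f"{r:.4f}s" for r in results])
```

On a spinning disk, scaling the size up and adding more threads makes the per-caller fsync latency grow well past what any single caller would see alone, which is the 10-seconds-plus-your-own-writeback scenario described earlier.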
> But there are two issues to consider:
> 1) there could be huge "irreversible" queues somewhere; this problem
> is similar to bufferbloat for network packets, but I do not think I/O
> suffers too much, considering there are no intermediate nodes in the
> middle
> 2) there must not be shortcomings in the filesystem code; for example,
> ext3 ordered mode was flushing everything when asked to flush a 1kB file;
> I don't know about ext4, I don't know about btrfs

Btrfs flushes just the files in the directory being fsync'd, but there can be contention on the tree log, which is used to make fsyncs performant. There is a tree log per subvolume. I doubt that separating the two workloads into separate subvolumes will help in this case; it doesn't sound like either workload is really that significant, but I don't know what's going on, so it could be worth a try. But I'd say if you're fiddling with things on this level, it's important to be really rigorous and only apply one change at a time; otherwise it's impossible to know what made things better or worse.

Note that while subvolumes are mostly like directories, they are separate namespaces with separate file descriptors; stat will show them as different devices, and they have their own pool of inodes. You can't create hardlinks across subvolumes. (You can create reflinks across subvolumes, but there is a VFS limitation that forbids reflinks that cross mount points.)

> In summary:
> - if a process calls fsync and then complains about having to wait
> to get unblocked, it is just creating its own problem (are you doing fsync
> of big things in your UI thread?)
> - if a process gets heavily delayed because another process is doing
> fsync the kernel is not doing its job in terms of fairness

It very much depends on the workload. And there's also cgroup io.latency to consider as well. The desktop is in a better position to make decisions on what's more important: UI/UX responsiveness is often more important than performance.
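Since stat reports each subvolume as a different device, a program can handle the no-hardlinks-across-subvolumes rule by catching EXDEV. A hypothetical helper (not from the thread; run on a plain filesystem here, it will simply succeed):

```python
import errno
import os
import tempfile

def try_hardlink(src, dst):
    """Attempt a hardlink, reporting when src and dst sit on different
    devices -- e.g. two btrfs subvolumes, which stat shows as separate
    st_dev values and on which os.link() fails with EXDEV."""
    try:
        os.link(src, dst)
        return "linked"
    except OSError as e:
        if e.errno == errno.EXDEV:
            return "cross-device: copy instead (or reflink within one mount)"
        raise

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "a")
open(src, "w").close()
print(try_hardlink(src, os.path.join(tmp, "b")))  # same filesystem here: "linked"
print(os.stat(src).st_nlink)  # 2
```

The same EXDEV error is what the VFS returns for a reflink attempt that crosses mount points, even between subvolumes where reflinks are otherwise allowed.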
> I can agree that reality may not be ideal, but I don't like this
> attitude of "dropping the ball" by disabling caching here and there
> because (provocative paradox) web browser authors are using fsync
> for bookmarks and cookies DB in the UI thread.

No one has suggested disabling caching. Reducing the *time* to start writeback is what was suggested, and we don't even know if that matters anymore. I wasn't even the one who first suggested it; it goes back to Linus saying the defaults are crazy 10 years ago.

> NOTE:
> I know about eatmydata, I've used it sometimes.
> There is also this nice trick for programs stupidly doing too many fsync:
> systemd-nspawn --system-call-filter='~sync:0 fsync:0'

That is awesome! Way easier to deal with than eatmydata.

--
Chris Murphy
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure