Re: btrfs system slow down with 100GB file

On Thu, Mar 25, 2021 at 6:39 AM Richard Shaw <hobbes1069@xxxxxxxxx> wrote:
>
> On Wed, Mar 24, 2021 at 11:05 PM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:

>> Append writes are the same on overwriting and cow file systems. You
>> might get slightly higher iowait because datacow means datasum which
>> means more metadata to write. But that's it. There's no data to COW if
>> it's just appending to a file. And metadata writes are always COW.
>
>
> Hmm... While still annoying (chrome locking up because it can't read/write to its cache in my /home), my desk chair benchmarking says that it was definitely better as nodatacow. Now that I think about it, during the initial sync I'm likely getting the blocks out of order, which would explain things a bit more. I'm not too worried about nodatasum for this file, as the nature of the blockchain is to be able to detect errors (intentional or accidental) already, and it should be self-correcting.

Is this information in a database? What kind? There are COW-friendly
databases (e.g. RocksDB, or SQLite with WAL enabled) and comparatively
COW-unfriendly ones, so it may be that setting the file to nodatacow
helps. If there are also multiple sources of frequent syncing, that
can exacerbate things.
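
If you want to test nodatacow on just that file, note that chattr +C
only takes effect on a new, empty file, so the existing file has to be
copied into one that already has the attribute set. A minimal sketch,
with whatever writes the file stopped first (using the data.mdb name
from later in this thread):

$ touch data.mdb.nocow
$ chattr +C data.mdb.nocow                    # mark the empty file nodatacow
$ lsattr data.mdb.nocow                       # should show the 'C' attribute
$ cp --reflink=never data.mdb data.mdb.nocow  # real copy, not a reflink
$ mv data.mdb.nocow data.mdb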

You can attach strace to both processes and see whether either or both
of them are doing any kind of sync(), and at what interval. I'm not
certain whether bcc-tools biosnoop shows all kinds of sync. It would
probably be useful to know both which sync-related syscalls are being
used by the two programs (chrome and whatever is writing to the large
file), and also their concurrent effect on bios, using biosnoop.
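
For example (the PIDs are placeholders, and on Fedora the bcc tools
live under /usr/share/bcc/tools):

$ sudo strace -f -tt -e trace=sync,fsync,fdatasync,syncfs,sync_file_range -p <pid>
$ sudo /usr/share/bcc/tools/biosnoop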

>
>> You could install bcc-tools and run btrfsslower with the same
>> (exclusive) workload with datacow and nodatacow to see if latency is
>> meaningfully higher with datacow but I don't expect that this is a
>> factor.
>
>
> That's an interesting tool. So I don't want to post all of it here as it could have some private info in it but I'd be willing to share it privately.

TIME(s)     COMM           PID    DISK    T SECTOR     BYTES  LAT(ms)

There's no file content displayed in any case.
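
If it helps keep the output short, btrfsslower can be told to log only
operations above a latency threshold (the 10 ms cutoff here is
arbitrary):

$ sudo /usr/share/bcc/tools/btrfsslower 10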

>
> One interesting output now is that the blockchain file is almost constantly getting written to, but since it's synced it's only getting appended to (my guess), and I'm not noticing any "chair benchmark" issues. However, one of the writes did take 1.8s, while most of them were a few hundred ms or less.

1.8s is quite a lot of latency. It could be the result of a flush
delay due to a lot of dirty data, and while that flush is happening
it's not going to be easily or quickly preempted by some other process
demanding that its data be written right now. Btrfs is quite adept at
taking multiple write streams from many processes and merging them
into sequential writes. Even when the writes are random, Btrfs tends
to make them sequential. This is thwarted by sync(), which is a demand
to write a specific file's outstanding data and metadata right now. It
sets up all kinds of seek behavior, as the data must be written, then
the metadata, then the superblock.
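
One way to see how much dirty data is pending, and to cap it so that
flushes happen in smaller, more frequent batches (the values below are
only illustrative, not a recommendation):

$ grep -E 'Dirty|Writeback' /proc/meminfo
$ sudo sysctl -w vm.dirty_background_bytes=268435456   # start background writeback around 256 MiB
$ sudo sysctl -w vm.dirty_bytes=1073741824             # throttle writers above roughly 1 GiB dirty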


>
> I'm pretty sure that's exactly what's happening. But is there a better I/O scheduler for traditional hard disks? Currently I have:
>
> $ cat /sys/block/sda/queue/scheduler
> mq-deadline kyber [bfq] none

I still don't know anything about the workload, so I can only
speculate. BFQ is biased toward reads and is targeted at the desktop
use case; mq-deadline is biased toward writes and is targeted at the
server use case. This workload is perhaps more server-like, in that
the Chrome writes, like Firefox's, go to SQLite databases. Firefox
enables WAL, but I don't see that Chrome does (not sure).

You could try mq-deadline.
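
The switch is non-persistent and can be made at runtime; afterwards
the selected scheduler shows up in brackets, something like:

$ echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none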


> $ ls -sh data.mdb
> 101G data.mdb
>
> A large bittorrent download should also be similar since you don't get the parts in order, but perhaps it's smart enough to allocate all the space on the front end?

That's up to the application that owns the file and is writing to it.
There's going to be a seek hit no matter what, because they're both
written and read out of order. And while they might be database files,
they aren't active databases, so it's a different write pattern.
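
If you're curious how fragmented the file actually ended up, filefrag
will report the extent count (the -v output for a 100GB file will be
long, hence the head):

$ filefrag data.mdb
$ filefrag -v data.mdb | head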



-- 
Chris Murphy