> On 21 May 2019, at 15:20, Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
>
>
>
>> On 21 May 2019, at 13:25, Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
>>
>>
>>
>>> On 20 May 2019, at 12:19, Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
>>>
>>>
>>>
>>>> On 18 May 2019, at 22:50, Srivatsa S. Bhat <srivatsa@xxxxxxxxxxxxx> wrote:
>>>>
>>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>>> I've addressed these issues in my last batch of improvements for BFQ,
>>>>> which landed in the upcoming 5.2. If you give it a try and still see
>>>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>>>> for you.
>>>>>
>>>>
>>>> Hi Paolo,
>>>>
>>>> Thank you for looking into this!
>>>>
>>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>>> didn't see any improvement:
>>>>
>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>>
>>>> With mq-deadline, I get:
>>>>
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>>>
>>>> With bfq, I get:
>>>>
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>>>
>>> Hi Srivatsa,
>>> thanks for reproducing this on mainline. I seem to have reproduced a
>>> bonsai-tree version of this issue.
>>
>> Hi again Srivatsa,
>> I've analyzed the trace and found the cause of the loss of throughput
>> on my side. To find out whether it is the same cause as on your side,
>> I've prepared a script that executes your test and takes a trace during
>> the test. If that's ok with you, could you please
>> - change the value of the DEVS parameter in the attached script, if
>>   needed
>> - execute the script
>> - send me the trace file that the script will leave in your working
>>   dir
>>
> Sorry, I forgot to add that I also need you to first apply the attached
> patch (it will make BFQ generate the log I need).
>

Sorry again :) This time for attaching one more patch. It is basically a
blind fix attempt, based on what I see in my VM.

So, instead of only sending me a trace, could you please:
1) apply this new patch on top of the one I attached in my previous email
2) repeat your test and report the results
3) regardless of whether BFQ's performance improves, take a trace with my
   script (I've attached a new version that doesn't risk printing an
   annoying error message, as the previous one did)
(a rough end-to-end sketch of these steps follows the attachments below)

Thanks,
Paolo
Attachment:
dsync_test.sh
Description: Binary data
Attachment:
0001-block-bfq-boost-injection.patch.gz
Description: GNU Zip compressed data
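For reference, the whole sequence on the test machine would look roughly
like the following. This is only a sketch: the kernel tree path, the use
of gunzip/git am, and the rebuild/reboot steps are my assumptions; the
patch and script names are the ones attached in this thread, and the
device (sda) is the one used in the earlier test output.

  # 1) apply both patches (the logging patch from the previous email,
  #    then this new one on top) and rebuild/boot the patched kernel
  cd ~/linux
  gunzip 0001-block-bfq-add-logs-and-BUG_ONs.patch.gz \
         0001-block-bfq-boost-injection.patch.gz
  git am 0001-block-bfq-add-logs-and-BUG_ONs.patch \
         0001-block-bfq-boost-injection.patch
  make -j"$(nproc)" && make modules_install && make install
  # reboot into the patched kernel before testing

  # 2) repeat the test with bfq selected on the device under test
  echo bfq > /sys/block/sda/queue/scheduler
  dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync

  # 3) edit DEVS in dsync_test.sh if needed, run it, and send back the
  #    trace file it leaves in the working directory
  chmod +x dsync_test.sh
  ./dsync_test.sh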
> Thanks,
> Paolo
>
> <0001-block-bfq-add-logs-and-BUG_ONs.patch.gz>
>
>> Looking forward to your trace,
>> Paolo
>>
>> <dsync_test.sh>
>>
>>> Before digging into the block trace, I'd like to ask you for some
>>> feedback.
>>>
>>> First, in my test, the total throughput of the disk happens to be
>>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>>> scheduler. I guess this massive overhead is normal with dsync, but
>>> I'd like to know whether it is about the same on your side. This will
>>> help me understand whether I'll actually be analyzing the same
>>> problem as yours.
>>>
>>> Second, the commands I used follow. Do they implement your test case
>>> correctly?
>>>
>>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>>> [mq-deadline] bfq none
>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>> 10000+0 records in
>>> 10000+0 records out
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 14.6892 s, 349 kB/s
>>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>> 10000+0 records in
>>> 10000+0 records out
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 20.1953 s, 254 kB/s
>>>
>>> Thanks,
>>> Paolo
>>>
>>>> Please let me know if any more info about my setup might be helpful.
>>>>
>>>> Thank you!
>>>>
>>>> Regards,
>>>> Srivatsa
>>>> VMware Photon OS
>>>>
>>>>>
>>>>>> On 18 May 2019, at 00:16, Srivatsa S. Bhat <srivatsa@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> One of my colleagues noticed up to a 10x - 30x drop in I/O
>>>>>> throughput when running the following command with the CFQ I/O
>>>>>> scheduler:
>>>>>>
>>>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>>>>
>>>>>> Throughput with CFQ: 60 KB/s
>>>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>>>>
>>>>>> I spent some time looking into it and found that this is caused by
>>>>>> an undesirable interaction between 4 different components:
>>>>>>
>>>>>> - blkio cgroup controller enabled
>>>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>>>> - dd running on ext4, in any blkio cgroup other than that of jbd2
>>>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>>>>
>>>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>>>> system.slice to run system services (and docker) under it, and a
>>>>>> separate blkio cgroup called user.slice for user processes. So, when
>>>>>> dd is invoked, it runs under user.slice.
>>>>>>
>>>>>> The dd command above includes the dsync flag, which performs an
>>>>>> fdatasync after every write to the output file. Since dd is writing
>>>>>> to a file on ext4, jbd2 will be active, committing transactions
>>>>>> corresponding to those fdatasync requests from dd. (In other words,
>>>>>> dd depends on jbd2 in order to make forward progress.) But jbd2,
>>>>>> being a kernel thread, runs in the root blkio cgroup, as opposed to
>>>>>> dd, which runs under user.slice.
>>>>>>
>>>>>> Now, if the I/O scheduler in use for the underlying block device is
>>>>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>>>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>>>>> Therefore, every time CFQ switches between processing requests from
>>>>>> dd vs jbd2, this 8ms idle time is injected, which slows down the
>>>>>> overall throughput tremendously!
>>>>>>
>>>>>> To verify this theory, I tried various experiments, and in all
>>>>>> cases, the 4 pre-conditions mentioned above were necessary to
>>>>>> reproduce this performance drop. For example, if I used an XFS
>>>>>> filesystem (which doesn't use a separate kthread like jbd2 for
>>>>>> journaling), or if I dd'ed directly to a block device, I couldn't
>>>>>> reproduce the performance issue. Similarly, running dd in the root
>>>>>> blkio cgroup (where jbd2 runs) also gets full performance, as does
>>>>>> using the noop or deadline I/O schedulers, or even CFQ itself with
>>>>>> slice_idle and group_idle set to zero.
>>>>>>
>>>>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>>>>> both with virtualized storage as well as with disk pass-through,
>>>>>> backed by a rotational hard disk in both cases. The same problem was
>>>>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>>>>>
>>>>>> Searching for earlier discussions of this problem, I found an old
>>>>>> thread on LKML that encountered this behavior [1], as well as a
>>>>>> docker github issue [2] with similar symptoms (mentioned later in
>>>>>> the thread).
>>>>>>
>>>>>> So, I'm curious to know if this is a well-understood problem and if
>>>>>> anybody has any thoughts on how to fix it.
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>>
>>>>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>>>>
>>>>>> [2]. https://github.com/moby/moby/issues/21485
>>>>>>      https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>>>>
>>>>>> Regards,
>>>>>> Srivatsa
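For anyone reproducing Srivatsa's report above, a quick back-of-envelope
check (my arithmetic, not from the report): each of the 10000 dsync'ed
512-byte writes forces a switch between dd's queue and jbd2's queue, and
with up to 8 ms of idling per switch that alone accounts for roughly
10000 * 8 ms = 80 s, which is in the same ballpark as the 84.8 s /
60.4 kB/s seen above with bfq versus ~4 s with mq-deadline. The
workarounds mentioned in the report translate into something like the
sketch below; sdX is a placeholder for the device under test, and the
iosched paths are the standard CFQ tunables under the legacy block layer.

  # confirm cfq is in use on the device under test
  cat /sys/block/sdX/queue/scheduler

  # workaround from the report: disable CFQ's inter-queue/inter-group
  # idling entirely
  echo 0 > /sys/block/sdX/queue/iosched/slice_idle
  echo 0 > /sys/block/sdX/queue/iosched/group_idle

  # alternative workaround from the report: run dd in the same (root)
  # blkio cgroup as jbd2, instead of user.slice
  echo $$ > /sys/fs/cgroup/blkio/cgroup.procs
  dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync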