On Tue, Nov 26, 2019 at 05:15:33PM +0800, Ming Lei wrote:
> On Tue, Nov 26, 2019 at 08:46:07AM +0100, Andrea Vai wrote:
> > Il giorno mar, 26/11/2019 alle 10.32 +0800, Ming Lei ha scritto:
> > > On Mon, Nov 25, 2019 at 07:51:33PM +0100, Andrea Vai wrote:
> > > > Il giorno lun, 25/11/2019 alle 23.15 +0800, Ming Lei ha scritto:
> > > > > On Mon, Nov 25, 2019 at 03:58:34PM +0100, Andrea Vai wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > What to try next?
> > > > >
> > > > > 1) cat /sys/kernel/debug/block/$DISK/hctx0/flags
> > > >
> > > > result:
> > > >
> > > > alloc_policy=FIFO SHOULD_MERGE|2
> > > >
> > > > > 2) echo 128 > /sys/block/$DISK/queue/nr_requests and run your
> > > > > 1GB copy test again.
> > > >
> > > > done, and still fails. What to try next?
> > >
> > > I just ran a 256M cp test
> >
> > I would like to point out that 256MB is a file size that usually
> > doesn't trigger the issue (I don't know if it matters, sorry).
>
> OK.
>
> I tested 256M because IO timeout is often triggered in case of
> qemu-ehci, and it is a long-term issue. When the disk is set up via
> xhci-qemu, the max request size is increased to 1MB from 120KB, and
> the IO pattern changes too. When the disk is connected via uhci-qemu,
> the transfer is too slow (1MB/s) because the max endpoint size is too
> small.
>
> However, I just waited 16min and collected the whole 1GB IO log with
> the disk connected over uhci-qemu, but the sector of each data IO is
> still in order.
>
> > Another piece of info I can provide is about another strange
> > behavior I noticed: yesterday I ran the test two times (as usual
> > with 1GB file size), which took 2370s and 1786s, and a third test
> > was still running when I stopped it. Then I started another set of
> > 100 trials and let them run tonight; the first 10 trials took
> > around 1000s each, then the times gradually decreased to ~300s and
> > finally settled around 200s, with some trials below 70-80s. That is
> > to say, times are extremely variable, and for the first time I
> > noticed a sort of "performance increase" over time.
>
> The 'cp' test is buffered IO; can you reproduce the slowdown every
> time by running the copy just after a fresh mount of the USB disk?
>
> > > to one USB storage device on the patched kernel, and the WRITE
> > > data IO is really in ascending order. The filesystem is ext4,
> > > mounted without '-o sync'. From the previous discussion, that
> > > looks like exactly your test setting. The order can be observed
> > > via the following script:
> > >
> > > #!/bin/sh
> > > MAJ=$1
> > > MIN=$2
> > > MAJ=$(( $MAJ << 20 ))
> > > DEV=$(( $MAJ | $MIN ))
> > > /usr/share/bcc/tools/trace -t -C \
> > > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'
> > >
> > > $MAJ & $MIN can be retrieved via lsblk for your USB storage disk.
> > >
> > > So I think we need to check if the patch is applied correctly
> > > first.
> > >
> > > If your kernel tree is managed via git,
> >
> > yes it is,
> >
> > > please post 'git diff'.
> >
> > attached. Is it correctly patched? thanks.
>
> Yeah, it should be correct, except that the change on
> __blk_mq_delay_run_hw_queue() is duplicated.
>
> > > Otherwise, share your kernel version with us,
> >
> > btw, it is 5.4.0+
> >
> > > and I will send you a patch backported to that kernel version.
> > >
> > > Meantime, you can collect the IO order log via the above script
> > > as you did last time, then send us the log.
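In case the invocation isn't obvious, here is one example of running
the script above, assuming it is saved as trace_io.sh and the USB disk
shows up as /dev/sdb (both names are only examples; check yours with
lsblk):

# look up MAJ:MIN for the disk (here they come back as 8:16)
lsblk -d -o NAME,MAJ:MIN /dev/sdb

# start tracing (bcc needs root) before the copy begins, and stop it
# with ctrl-c after the copy finishes
sudo sh trace_io.sh 8 16 > trace.log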
> >
> > ok, will try; is it just required to run it for a short period of
> > time (say, some seconds) during the copy, or should I run it before
> > the beginning (or before the mount?) and terminate it after the end
> > of the copy? (Please note that in the latter case a large amount of
> > time (and data, I suppose) would be involved, because, as said, to
> > be sure the problem triggers I have to use a large file... but we
> > can try to better understand and tune this. If it can help, you can
> > get an ods file with the complete statistics at [1] (look at the
> > "prove_nov19" sheet)).

The data won't be very big: each line covers 120KB, so ~10K lines are
enough to cover a 1GB transfer, and a ~300KB compressed file should
hold the whole trace.

Also use the following trace script this time:

#!/bin/sh
MAJ=$1
MIN=$2
MAJ=$(( $MAJ << 20 ))
DEV=$(( $MAJ | $MIN ))
/usr/share/bcc/tools/trace -t -C \
't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'

Thanks,
Ming
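P.S. If you want to sanity-check the device filter: the MAJ/MIN
arithmetic in the script builds the same dev value that the block
tracepoints report, i.e. (major << 20) | minor. With 8:16 as example
numbers (the usual major:minor of /dev/sdb):

# (8 << 20) | 16 = 8388624, the value compared against args->dev
echo $(( (8 << 20) | 16 ))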