On Wed, Nov 27, 2019 at 10:39:40AM +0100, Andrea Vai wrote: > Il giorno mer, 27/11/2019 alle 10.05 +0800, Ming Lei ha scritto: > > > > > > It can be workaround via the following change: > > > > /lib/modules/5.4.0+/build/include/generated/autoconf.h: > > > > //#define CONFIG_CC_HAS_ASM_INLINE 1 > > Thanks, it worked, trace attached. Produced by: start the trace script > (with the pendrive already plugged), wait some seconds, run the test > (1 trial, 1 GB), wait for the test to finish, stop the trace. > > The copy took 2659 seconds, roughly as already seen before. Thanks for collecting the log. >From the log, some of write IOs are out-of-order, such as, the 1st one is 378880. 16.41240 2 266 266 kworker/2:1H block_rq_issue b'W' 370656 240 16.41961 3 485 485 kworker/3:1H block_rq_issue b'W' 378880 240 16.73729 2 266 266 kworker/2:1H block_rq_issue b'W' 370896 240 17.71161 2 266 266 kworker/2:1H block_rq_issue b'W' 379120 240 18.02344 2 266 266 kworker/2:1H block_rq_issue b'W' 371136 240 18.94314 3 485 485 kworker/3:1H block_rq_issue b'W' 379360 240 19.25624 2 266 266 kworker/2:1H block_rq_issue b'W' 371376 240 IO latency is increased a lot since the 1st out-of-order request(usb storage HBA is single queue depth, one request can be issued only if the previous issued request is completed). The reason is that there are two kind of tasks which inserts rq to device. One is the 'cp' process, the other is kworker/u8:*. The out-of-order happens during the two task's interleaving. Under such situation, I believe that the old legacy IO path may not guarantee the order too. In blk_queue_bio(), after get_request() allocates one request, the queue lock is released. And request is actually inserted & issued from blk_flush_plug_list() under the branch of 'if (plug)'. If requests are from two tasks, then request is inserted/issued from two plug list, and no order can be guaranteed. In my test, except for several requests from the beginning, all other requests are inserted via the kworker thread(guess it is writeback wq), that is why I can't observe the issue in my test. As Schmid suggested, you may run the same test on old kernel with legacy io path, and see if the performance is still good. Also, could you share the following info about your machine? So that I can build my VM guest in this setting for reproducing your situation (requests are inserted from two types of threads). - lscpu - free -h - lsblk -d $USB_DISK - exact commands for mount the disk, and running the copy operation Thanks, Ming