Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]

Jens Axboe <axboe@xxxxxxxxx> · Tue, 06 Jan 2015 09:15:53 -0700

On 01/06/2015 09:05 AM, Mike Snitzer wrote:
On Tue, Jan 06 2015 at  4:31am -0500,
Bart Van Assche <bvanassche@xxxxxxx> wrote:

On 01/05/15 22:35, Mike Snitzer wrote:
On Fri, Jan 02 2015 at 12:53pm -0500,
Bart Van Assche <bvanassche@xxxxxxx> wrote:
Thanks, my tests confirm that this patch indeed fixes the issue I had
reported. Unfortunately this doesn't mean that the blk-mq multipath code
is already working perfectly. Most of the time I/O requests are
processed within the expected time but sometimes I/O processing takes
much more time than what I expected:

# /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
0.02
# /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
0.02
# /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
8.68

However, if I run the same command on the underlying device it always
completes within the expected time.

I don't have very large blk-mq devices, but I can work on that.
How large is the blk-mq device in question?

Also, how much memory does the system have?  Is memory fragmented at
all?  With this change the requests are cloned using memory allocated
from block core's blk_get_request (rather than a dedicated mempool in DM
core).

Any chance you could use 'perf record' to try to analyze where the
kernel is spending its time?

Hello Mike,

The device used in this test was a tmpfs file with a size of 16 MB. That
file had been created as follows: dd if=/dev/zero of=/dev/vdisk bs=1M
count=16. The initiator and target systems did have enough memory to keep
this tmpfs file in RAM all the time (32 GB and 4 GB respectively).

For the runs that took much longer than expected the CPU load was low.
This probably means that the system was waiting for one or another I/O
timer to expire. The output triggered by "echo w > /proc/sysrq-trigger"
during a run that took longer than expected was as follows:

SysRq : Show Blocked State
   task                        PC stack   pid father
kdmwork-253:0   D ffff8807c1fd3b78     0 10396      2 0x00000000
  ffff8807c1fd3b78 ffff88083b6b6cc0 0000000000012ec0 ffff8807c1fd3fd8
  0000000000012ec0 ffff880824225aa0 ffff88083b6b6cc0 ffff88081b0cb2c0
  ffff88085fc537c8 ffff8807c1fd3c98 ffff8807f7a99d70 ffffe8ffffc43bc0
Call Trace:
  [<ffffffff814d5230>] io_schedule+0xa0/0x130
  [<ffffffff8125a3f7>] bt_get+0x117/0x1b0
  [<ffffffff81256580>] ? blk_mq_queue_enter+0x30/0x2a0
  [<ffffffff81094cf0>] ? prepare_to_wait_event+0x110/0x110
  [<ffffffff8125a76f>] blk_mq_get_tag+0x9f/0xd0
  [<ffffffff8125591b>] __blk_mq_alloc_request+0x1b/0x210
  [<ffffffff812571c9>] blk_mq_alloc_request+0x139/0x150
  [<ffffffff8124c16e>] blk_get_request+0x2e/0xe0
  [<ffffffff8109a60d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffffa07f7d0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
  [<ffffffffa07f7d6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
  [<ffffffffa039dbb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
  [<ffffffff8109a53d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
  [<ffffffff81075cbe>] kthread_worker_fn+0x7e/0x1b0
  [<ffffffff81075c40>] ? __init_kthread_worker+0x60/0x60
  [<ffffffff81075bc8>] kthread+0xf8/0x110
  [<ffffffff81075ad0>] ? kthread_create_on_node+0x210/0x210
  [<ffffffff814dacac>] ret_from_fork+0x7c/0xb0
  [<ffffffff81075ad0>] ? kthread_create_on_node+0x210/0x210

Jens,

This stack trace confirms my suspicion that switching DM-multipath over
to allocating clone requests via blk_get_request (rather than using a
dedicated mempool in DM core) is the cause of the slowdown that Bart has
experienced.

Given blk_mq_get_tag() looks to be the culprit is there anything we can
do to speed up blk-mq request allocation?  I'm currently using
GFP_KERNEL when calling blk_get_request().

blk-mq request allocation is pretty much as optimized/fast as it can be. 
The slowdown must be due to one of two reasons:

- A bug related to running out of requests, perhaps a missing queue run 
or something like that.
- A smaller number of available requests, due to the requested queue depth.

Looking at Barts results, it looks like it's usually fast, but sometimes 
very slow. That would seem to indicate it's option #1 above that is the 
issue. Bart, since this seems to wait for quite a bit, would it be 
possible to cat the 'tags' file for that queue when it is stuck like that?

--
Jens Axboe

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel