While running a heavy I/O load on multipath (dual-ported) SSDs attached to
a SAS3008 adapter (mpt3sas driver), we are seeing I/Os aborted and tasks
stuck in blk_complete_request(). This sometimes leads to a BUG_ON in
blk_start_request(). It appears that two completions are being performed
on the same I/O, and that the second completion races with re-use of the
request for a new I/O.

I saw this upstream commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17-rc3&id=9961c9bbf2b43acaaf030a0fbabc9954d937ad8c
which addresses the case where the normal completion occurs before the
abort completion. What I am seeing appears to be the opposite ordering:
the abort completion occurs first, and the normal completion arrives
afterwards (because tasks are delayed in blk_complete_request()). I have
not found any commit that fixes this second case.
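
To make the ordering concrete, here is a minimal userspace sketch of the
race and of one way a late completion could be rejected. It is not kernel
code and not a proposed patch: struct sketch_request, sketch_init(),
sketch_start(), sketch_complete() and the "generation" counter are all
hypothetical names made up for illustration, and plain C11 atomics stand in
for whatever the block layer actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request slot; NOT the kernel's struct request. */
struct sketch_request {
    atomic_flag completed;    /* set by whichever completion path wins */
    unsigned int generation;  /* bumped each time the slot is re-used */
    int completions;          /* how many times completion work really ran */
};

void sketch_init(struct sketch_request *rq)
{
    assert(rq != NULL);
    atomic_flag_clear(&rq->completed);
    rq->generation = 0;
    rq->completions = 0;
}

/* Re-use the slot for a new I/O and return the generation of that I/O. */
unsigned int sketch_start(struct sketch_request *rq)
{
    rq->generation++;
    atomic_flag_clear(&rq->completed);
    return rq->generation;
}

/*
 * Called by both the abort path and the normal completion path, each passing
 * the generation it believes it is completing. Returns true only for the
 * first caller of the current generation; a duplicate or stale completion
 * is dropped.
 */
bool sketch_complete(struct sketch_request *rq, unsigned int gen)
{
    if (gen != rq->generation)
        return false;   /* slot already re-used for a newer I/O */
    if (atomic_flag_test_and_set(&rq->completed))
        return false;   /* the other path already completed this I/O */
    rq->completions++;
    return true;
}
```

In the case I am describing, the abort path calls sketch_complete() first
and wins, the slot is re-used via sketch_start(), and only then does the
delayed normal completion show up carrying the old generation. Note that in
this sketch the generation check and the flag update are two separate steps,
so it only illustrates the ordering; a real fix would have to close that
window as well.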

Of course, tasks being delayed like this is a concern in its own right,
and that is being worked on separately. But it seems that this alternate
double-completion case has been overlooked.

Does everyone concur that this second case needs to be addressed? Is
there a proposed fix?

Thanks,
Doug

FYI, the system is a Power9 running RHEL-ALT 7.5, with two SAS3008
adapters connected to an IBM EXP24SX SAS Storage Enclosure holding 24
HUSMM8040ASS201 drives. FIO was used to drive the I/O load.