On 07-07-17 06:26, Nicholas A. Bellinger wrote:
> On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
>> So abort_tasks can still be observed, but they do not result in a
>> non-functional, not-quite-PANICked machine anymore.
> Thanks a lot for the bug report and your continuous testing to get this
> resolved. The patch is queued up in target-pending/for-next with your
> Tested-by, and will be CC'd to stable so the older v4.x.y and v3.x.y
> kernels get this fix as well.
> Thanks again.
I'm afraid we may not be quite there yet after all...
So we had the other two machines run an md check this weekend
as well, again with a ridiculously high sync_speed_max:
Jul 9 04:00:01 myhost kernel: [661309.794774] md: data-check of RAID array md0
Jul 9 04:00:01 myhost kernel: [661309.799173] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
Jul 9 04:00:01 myhost kernel: [661309.805219] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
Jul 9 04:00:01 myhost kernel: [661309.815194] md: using 128k window, over a total of 3252682752k.
Jul 9 04:00:42 myhost kernel: [661351.076391] qla2xxx/21:00:00:24:ff:4b:8f:58: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:02:01 myhost kernel: [661429.985082] qla2xxx/21:00:00:24:ff:4b:9e:19: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:04:24 myhost kernel: [661573.395245] qla2xxx/50:01:43:80:28:ca:86:36: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:04:57 myhost kernel: [661605.837694] qla2xxx/50:01:43:80:28:ca:86:e6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:09:19 myhost kernel: [661868.261211] qla2xxx/21:00:00:24:ff:54:9e:ab: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:13:17 myhost kernel: [662105.788459] ABORT_TASK: Found referenced qla2xxx task_tag: 1175332
Jul 9 04:13:17 myhost kernel: [662105.794794] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332
Jul 9 04:13:17 myhost kernel: [662105.990584] ABORT_TASK: Found referenced qla2xxx task_tag: 1175380
Jul 9 04:13:18 myhost kernel: [662106.510403] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380
Jul 9 04:13:20 myhost kernel: [662108.988526] ABORT_TASK: Found referenced qla2xxx task_tag: 1175620
Jul 9 04:13:31 myhost kernel: [662119.684969] ABORT_TASK: Found referenced qla2xxx task_tag: 1211140
Jul 9 04:16:42 myhost kernel: [662310.617910] qla2xxx/21:00:00:24:ff:92:bf:43: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:18:00 myhost kernel: [662389.415853] qla2xxx/21:00:00:24:ff:54:a1:33: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:18:22 myhost kernel: [662411.066461] qla2xxx/21:00:00:24:ff:92:bf:59: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:20:23 myhost kernel: [662531.852833] qla2xxx/21:00:00:24:ff:3c:d0:94: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 07:00:28 myhost kernel: [672137.325166] md: md0: data-check done.
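Note that in the log above, task tags 1175620 and 1211140 have a "Found referenced" line but no matching "Sending TMR_..." line. Whether that is related to the stuck kworkers is only a guess on my part. For longer logs, the pairing can be checked mechanically; below is a rough Python sketch, assuming the message wording stays exactly as in this excerpt. The function name unanswered_aborts and the sample text are made up for illustration.

```python
import re

# \s+ between words tolerates log lines that were wrapped by a mail client.
FOUND_RE = re.compile(
    r"ABORT_TASK:\s+Found\s+referenced\s+\S+\s+task_tag:\s+(\d+)")
SENT_RE = re.compile(
    r"ABORT_TASK:\s+Sending\s+(TMR_\w+)\s+for\s+ref_tag:\s+(\d+)")


def unanswered_aborts(log_text):
    """Return task tags with a 'Found' line but no logged TMR response."""
    found = [int(tag) for tag in FOUND_RE.findall(log_text)]
    answered = {int(tag) for _response, tag in SENT_RE.findall(log_text)}
    return [tag for tag in found if tag not in answered]


# Hypothetical sample, trimmed from the ABORT_TASK lines in the log.
sample = """\
ABORT_TASK: Found referenced qla2xxx task_tag: 1175332
ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332
ABORT_TASK: Found referenced qla2xxx task_tag: 1175380
ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380
ABORT_TASK: Found
referenced qla2xxx task_tag: 1175620
ABORT_TASK: Found referenced qla2xxx task_tag: 1211140
"""

if __name__ == "__main__":
    print(unanswered_aborts(sample))
```

A missing response line only shows that nothing was logged, not necessarily that the abort never completed.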
The machine in question was still responsive (it was accepting
SSH logins); however, it seemed the VMware hosts weren't seeing
the volume anymore (presumably due to the heavy IO on the backend).
Also, several (12?) kworkers seemed stuck in a D state.
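Should this happen again, the D-state tasks could be listed more precisely than by eyeballing top. A minimal Python sketch follows, assuming a Linux /proc where each /proc/<pid>/stat record starts with "pid (comm) state". The helper names parse_stat, d_state_tasks and read_proc_stats are made up for illustration.

```python
import glob


def parse_stat(stat_text):
    """Split one /proc/<pid>/stat record into (pid, comm, state).

    comm may itself contain spaces or parentheses, so the name is taken
    between the first '(' and the last ')'.
    """
    lpar = stat_text.index("(")
    rpar = stat_text.rindex(")")
    pid = int(stat_text[:lpar])
    comm = stat_text[lpar + 1:rpar]
    state = stat_text[rpar + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(stat_texts):
    """Return (pid, comm) for every record in uninterruptible sleep (D)."""
    tasks = []
    for text in stat_texts:
        pid, comm, state = parse_stat(text)
        if state == "D":
            tasks.append((pid, comm))
    return tasks


def read_proc_stats():
    """Yield the contents of each /proc/<pid>/stat file that can be read."""
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                yield fh.read()
        except OSError:
            continue  # the process exited while we were scanning


if __name__ == "__main__":
    for pid, comm in d_state_tasks(read_proc_stats()):
        print(pid, comm)
```

This only covers processes and not individual threads, which should be enough for kworkers. On kernels that expose it, reading /proc/<pid>/stack as root for each reported pid would show where the task is blocked.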
When my colleague tried to reboot the machine, it got
(presumably) stuck on
/usr/bin/targetctl clear
After which it was forcefully rebooted :)
Sorry we don't have any more detailed info at this point.
We haven't been able to reproduce this on a
different machine yet :(
Regards,
Pascal de Bruijn