On 2019-05-20 12:05 p.m., Martin K. Petersen wrote:
James,
Please. What I'm interested in is whether this is simply a bug in the
array firmware, in which case the fix is sufficient, or whether
there's some problem with the parser, like mismatched expectations
over added trailing nulls or something.
Our support folks have been looking at this for a while. We have seen
problems with devices from several vendors, to the extent that I gave up
the idea of blacklisting all of them.
I am collecting "bad" SES pages from these devices. I have added support
for RECEIVE DIAGNOSTICS to scsi_debug and added a bunch of deliberately
broken SES pages so we could debug this.
Patches ??
It appears to be very common for devices to return inconsistent or
invalid data. So pretty much all of the ses.c parsing needs to have
sanity checking heuristics added to prevent KASAN hiccups.
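
As a rough illustration of the kind of sanity checking meant here, the
sketch below walks the descriptors of an Additional Element Status dpage
and refuses to read past either the page length claimed in the header or
the buffer actually received. This is not the ses.c code: the helper
names and constants are hypothetical, and rejecting (rather than
clamping) an over-long page is just one possible policy.

```c
#include <stddef.h>
#include <stdint.h>

#define SES_PAGE_HDR_LEN 4   /* page code, flags byte, 2-byte page length */
#define SES_AES_HDR_LEN  8   /* page header plus 4-byte generation code */

/*
 * Hypothetical helper: return the total page length claimed by the
 * header, or -1 if the header is missing or claims more bytes than
 * were actually received.
 */
static long ses_page_valid_len(const uint8_t *buf, size_t buf_len)
{
    size_t claimed;

    if (buf == NULL || buf_len < SES_PAGE_HDR_LEN)
        return -1;
    /* PAGE LENGTH is big-endian and excludes the 4 header bytes */
    claimed = (((size_t)buf[2] << 8) | buf[3]) + SES_PAGE_HDR_LEN;
    if (claimed > buf_len)
        return -1;
    return (long)claimed;
}

/*
 * Hypothetical helper: count the descriptors in an Additional Element
 * Status dpage, returning -1 as soon as any descriptor would run past
 * the end of the page.
 */
static int ses_count_aes_descriptors(const uint8_t *buf, size_t buf_len)
{
    long page_len = ses_page_valid_len(buf, buf_len);
    size_t off = SES_AES_HDR_LEN;
    int count = 0;

    if (page_len < SES_AES_HDR_LEN)
        return -1;
    while (off < (size_t)page_len) {
        size_t remaining = (size_t)page_len - off;
        size_t desc_len;

        if (remaining < 2)      /* need the 2-byte descriptor header */
            return -1;
        /* byte 1 holds the descriptor length minus 2 */
        desc_len = (size_t)buf[off + 1] + 2;
        if (desc_len > remaining)
            return -1;
        off += desc_len;
        count++;
    }
    return count;
}
```

Every length taken from the device is compared against what is left in
the buffer before it is used as an offset, which is the check that keeps
KASAN quiet when a device returns an inconsistent page.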
And it is not just SES device implementations that were broken. The
relationship between the Additional Element Status diagnostic page (dpage)
and the Enclosure Status dpage was under-specified in SES-2 and that
led to the EIIOE field being introduced during the SES-3 revisions.
And the meaning of EIIOE was tweaked several times *** before SES-3 was
standardized. Anyone interested in the adventures of EIIOE can see
the code of sg_ses.c in sg3_utils. The sg_ses utility is many times
more complex than anything else in the sg3_utils package.
And that complexity led me to suspect that the Linux SES driver was
broken. It should be 3 or 4 times larger than it is! It simply doesn't
do enough checking.
So yes Martin, you are on the right track.
Doug Gilbert
BTW the NVMe Management Interface folks have decided to use SES-3 for
NVMe enclosure management rather than invent their own can of worms :-)
*** For example, EIIOE started life as a 1-bit field, but two cases
weren't enough, so it became a 2-bit field and now uses all
four possibilities.
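
For readers who want to see where that 2-bit field sits, here is a
minimal sketch of decoding the first bytes of an Additional Element
Status descriptor. The byte and bit positions are my reading of SES-3
and of sg_ses.c and should be checked against the standard; the struct
and function names are hypothetical. What each of the four EIIOE values
means is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the decoded descriptor header fields */
struct aes_desc_hdr {
    int invalid;        /* byte 0, bit 7 */
    int eip;            /* byte 0, bit 4: element index present */
    int proto_id;       /* byte 0, bits 3..0 */
    int eiioe;          /* byte 2, bits 1..0 (only when EIP is set) */
    int element_index;  /* byte 3 (only when EIP is set), else -1 */
};

/*
 * Decode the header of one descriptor of 'len' bytes.
 * Returns 0 on success, -1 if the descriptor is too short for the
 * fields it claims to carry.
 */
static int aes_parse_hdr(const uint8_t *d, size_t len,
                         struct aes_desc_hdr *out)
{
    if (d == NULL || out == NULL || len < 2)
        return -1;
    out->invalid = !!(d[0] & 0x80);
    out->eip = !!(d[0] & 0x10);
    out->proto_id = d[0] & 0x0f;
    out->eiioe = 0;
    out->element_index = -1;
    if (out->eip) {
        if (len < 4)    /* EIIOE and ELEMENT INDEX need bytes 2 and 3 */
            return -1;
        out->eiioe = d[2] & 0x03;   /* two bits, four possible values */
        out->element_index = d[3];
    }
    return 0;
}
```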