On 9/24/20 12:21 AM, Brian Bunker wrote:
> We have tried our patch here and it works. We have not tried our patch at the
> customer site where they hit the crash. Since they hit the BUG_ON line, which we
> can see in the logs we have, we expect that removing the race as we did
> would avoid the crash. We also remove the BUG_ONs in our patch so it can’t hit
> the same crash. If there is another similar race, a null pointer dereference could
> still happen with our patch. I saw you had a patch to only use the value if the
> pointer is not null. That would also work to stop the crash, but it would hide the
> race that the BUG_ON was helpful in finding.
>
> Trying our fix at the customer site would be more difficult for us since the
> crashing operating system belongs to Oracle. That is why you see their patch for
> the same issue. Our interest in getting this fixed goes beyond this customer, since
> more Linux vendors will inherit this code as they move to newer kernel versions, and
> we are reliant on ALUA. We hope to catch it here.
>
> Should I put together a patch with the h->sdev set to null removed from the
> detach function along with the synchronize_rcu and removing the BUG_ON, or
> did you want me to diff against your checkin where you have already removed
> the BUG_ON?

No need, I already sent a patch attached to another mail to the Oracle
folks.
Guess I'll be sending an 'official' patch now, seeing that I have
confirmation.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer