Re: [PATCH v3] mpt3sas: correct reset of smid while clearing scsi tracker

Sreekanth Reddy <sreekanth.reddy@xxxxxxxxxxxx> · Tue, 7 Aug 2018 21:46:47 +0530

On Tue, Aug 7, 2018 at 8:27 PM, Bart Van Assche <Bart.VanAssche@xxxxxxx> wrote:
> On Tue, 2018-08-07 at 20:03 +0530, Sreekanth Reddy wrote:
>> The main intention of this patch to reset the smid to zero after
>> resetting the corresponding smid entry's chain_offset to zero. While
>> posting "mpt3sas: Fix calltrace observed while running IO & reset"
>> patch I have done manual mistake that I reset the smid to zero before
>> resetting the chain_offset which is wrong. Due to which driver
>> always's reset the chain_offset of zero th  entry of chain_lookup[]
>> table which is wrong and hence I am correcting it with this patch.
>>
>> After that you asked me to add memory barriers and hence I have added
>> memory barriers in subsequent versions of the patch.
>
> Hello Sreekanth,
>
> Please answer the questions I already asked you twice instead of sidestepping
> these.

Here I am copying your quires from your previous mails and I am
replaying inline..

In mpt3sas_base_get_smid_scsiio() I found the following code:

              struct scsiio_tracker *request = scsi_cmd_priv(scmd);
              u16 smid = scmd->request->tag + 1;
              request->smid = smid;

That confirms what you wrote about smid being unique for each I/O request.

However, this also raises new questions. As far as I can see all code
in the I/O path that accesses the chain_lookup[] array uses index smid
- 1.

[Sreekanth] Our IOC firmware always requires smid to be >=1, zero is
illegal smid for the IOC firmware. Hence while framing MPT SCSI IO
request driver always make sure that smid to be >=1, so it uses smid
as tag + 1.
where as chain_lookup[] starts with index zero, So driver is accessing
this chain_lookup[] with smid-1 index.

The patch at the start of this e-mail thread modifies two functions, namely

mpt3sas_remove_dead_ioc_func() and mpt3sas_base_clear_st(). Since the
chain_lookup[] array is indexed with the smid and hence contains
request- private data, how is it even possible that these functions
race against each other?
[Sreekanth] This patch modifies in two functions namely
'_base_get_chain_buffer_tracker()' & 'mpt3sas_base_clear_st()' not in
the mpt3sas_remove_dead_ioc_func(). Their is no race between
mpt3sas_remove_dead_ioc_func() & mpt3sas_base_clear_st(). Also their
is no race between _base_get_chain_buffer_tracker() &
mpt3sas_base_clear_st(), since for a particular IO
_base_get_chain_buffer_tracker() is called during IO submission time
for getting required chain buffer and mpt3sas_base_clear_st() is
called to free the used chain buffers in the IO completion path. Also
we have pre-allocated required chain buffers for each IO request and
hence no need to main any list and their is no race between these two
functions.

 Is there perhaps code that accesses the chain_lookup[] array and that
is called from another context than the I/O completion path?
[Sreekanth] No, it is called only from IO submission time.

Is the patch at the start of this e-mail thread the result of a
theoretical concern or of a real failure? In the latter case, which
sequence of actions triggered the failure? The answer to this question
should already have been mentioned in the description of v1 of this
patch
[Sreekanth] The main intention of this patch to reset the smid to zero
after resetting the corresponding smid entry's chain_offset to zero in
chain_lookup[]. While posting "mpt3sas: Fix calltrace observed while
running IO & reset" patch I have done manual mistake that I reset the
smid to zero before resetting the chain_offset which is wrong. Due to
which driver always's reset the chain_offset of zero th  entry of
chain_lookup[] table which is wrong and we may see that scmd's will be
returned from scsih_qcmd() with "host busy" status due to
unavailability of chain buffers (as driver is resetting the wrong
smid's i.e. zero th smid entry's chain_offset instead of correct smid
entry's chain_offset) and IO's will be in hung state .  So, I am
correcting it with this patch.
After that you asked me to add memory barriers and hence I have added
memory barriers in subsequent versions of the patch.

How can this patch be necessary if there are no races between I/O
submission and I/O completion?
[Sreekanth] As I explained above in this patch I am resetting the smid
variable to zero only after resetting the smid's chain_offset in
chain_lookup[] table. Due to manual mistake earlier I am resetting the
smid variable to zero before resetting the correct smid entry's
chain_offset in the chain_lookup[] table, which is wrong and I am
correcting it with this patch.

 Changing the order of two memory writes only makes sense if there is
a race condition involved. Since the block layer allocates a request
tag before scsih_qcmd() is called and since mpt3sas_base_free_smid()
is called before scmd->scsi_done(scmd) in _scsih_io_done(), the
request tag is freed after the smid has been freed when I/O completes
in a normal way. So there shouldn't be a race condition in that
scenario.
[Sreekanth] yes their is no race condition in this scenario.

By the way, you have not answered my question about why development of this
patch started. Does this patch address a theoretical concern or a real bug?
[Sreekanth] As I explained above while submitting "mpt3sas: Fix
calltrace observed while running IO & reset" patch I have made a
manual mistake that I am resetting the smid variable in wrong line in
mpt3sas_base_clear_st() function and I am correcting it with this
patch.

In the latter case, which scenario triggers the addressed bug?
[Sreekanth] Due to this manual mistake IO will be in hung state as
driver will return the scmd's with host busy status due to
unavailability of chain buffers after running IO for some time.

If you want this patch to go upstream you should be able to explain
why you think it is
useful, and that description should be included in the patch description.
[Sreekanth] In the patch description I mentioned that I have done a
manual mistake and I am correcting with this patch.

Hope I have answered all of your quires.

Thanks,
Sreekanth

>
> Thanks,
>
> Bart.
>