Re: [Open-FCoE] System crashes with increased drive count

On Wed, 2014-06-04 at 15:01 -0700, Vasu Dev wrote:
> On Wed, 2014-06-04 at 15:21 -0700, Nicholas A. Bellinger wrote:
> > On Wed, 2014-06-04 at 11:45 -0700, Jun Wu wrote:
> > > The test setup includes one host and one target. The target exposes 10
> > > hard drives (or 10 LUNs) on one fcoe port. The single initiator runs
> > > 10 fio processes simultaneously against the 10 target drives through
> > > fcoe vn2vn. This is a simple configuration that other people may also
> > > want to try.
> > > 
> > > >Exchange 0x6e4 is aborted and then the target is still sending
> > > >frames; the latter should not occur, but setting up the abort with a
> > > >0 msec timeout does not look correct either, and it differs from the
> > > >8000 ms on the initiator side.
> > > 
> > > Should the target stop sending frames after an abort? I still see a
> > > lot of 0 msec messages on the target side. Is this something that
> > > should be addressed?
> > > 
> > > >Reducing retries could help narrow down whether early aborts are the
> > > >cause here; can you try with REC disabled on the initiator side using
> > > >this change?
> > > 
> > > By disabling REC, have you confirmed that the early aborts are the
> > > cause? Is the abort caused by the 0 msec timeout?
> > > 
> > 
> > The 0 msec timeout still looks really suspicious..
> > 
> > IIRC, these timeout values are exchanged in the FLOGI request packet,
> > and/or in a separate Read Timeout Value (RTV) packet..
> > 
> > It might be worthwhile to track down where these zero timeout values
> > are coming from, as it might be an indication of what's wrong.
> > 
> > How about the following patch to dump these values..?
> > 
> > Also just curious, have you tried running these two hosts in
> > point-to-point mode without the switch to see if the same types of
> > issues occur..? It might be useful to help isolate the problem space a
> > bit.
> > 
> > Vasu, any other ideas here..?
> > 
> 
> Your patch is good for debugging the 0 msec value; however, this may
> not be the issue, since these come from incoming abort processing, and
> by then the I/O is already aborted and would cause seq_send failures,
> as I explained in my other response.
> 
> Nab, shall tcm_fc take some action on seq_send failures toward the
> target core that could help slow down the host request rate above the
> fcoe transport ? 
> 

So there are two options here..

One is to simply return -EAGAIN or -ENOMEM from the ->queue_data_in() or
->queue_status() callbacks to notify target-core to delay + retry sending
the data-in or response.
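
For illustration, here's a minimal sketch of option one, loosely modeled
on ft_queue_data_in() in drivers/target/tcm_fc/tfc_io.c.  The
ft_get_lport() and ft_build_data_in_frame() helpers are hypothetical and
the frame-building loop is elided; the point is only the error-return
convention back to target-core:

static int ft_queue_data_in_sketch(struct se_cmd *se_cmd)
{
	struct ft_cmd *cmd = container_of(se_cmd, struct ft_cmd, se_cmd);
	struct fc_seq *seq = cmd->seq;
	struct fc_lport *lport = ft_get_lport(cmd);	/* hypothetical helper */
	struct fc_frame *fp;
	int error;

	fp = ft_build_data_in_frame(cmd);		/* hypothetical helper */
	if (!fp)
		return -ENOMEM;

	error = lport->tt.seq_send(lport, seq, fp);
	if (error) {
		pr_err_ratelimited("%s: seq_send() returned %d\n",
				   __func__, error);
		/*
		 * Propagating -ENOMEM (or -EAGAIN) to target-core makes
		 * it put the command on its internal queue-full list and
		 * retry ->queue_data_in() later, instead of losing the
		 * DATA-IN frame under transmit-path pressure.
		 */
		return -ENOMEM;
	}
	return 0;
}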

The second is to go ahead and set SAM_STAT_TASK_SET_FULL status once
->seq_send() fails for data-in, and immediately attempt to send the
response with the non-GOOD status.  If ->seq_send() for the response
subsequently fails, then keep the SAM_STAT_TASK_SET_FULL status and
return -EAGAIN to force target-core to requeue + resend only the
response packet.
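
And a sketch of option two under the same assumptions, with the
hypothetical ft_send_data_in_frames() standing in for the actual
frame-transmit loop:

static int ft_queue_data_in_tsf_sketch(struct se_cmd *se_cmd)
{
	int error = ft_send_data_in_frames(se_cmd);	/* hypothetical */

	if (error) {
		/* Ask the initiator to back off its queue depth. */
		se_cmd->scsi_status = SAM_STAT_TASK_SET_FULL;
		error = ft_queue_status(se_cmd);
		/*
		 * If even the response frame cannot be sent, keep the
		 * TASK_SET_FULL status and return -EAGAIN so target-core
		 * requeues and retries only ->queue_status() later.
		 */
		if (error)
			return -EAGAIN;
	}
	return 0;
}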

Once the initiator receives SAM_STAT_TASK_SET_FULL status, it will lower
its queue depth and retry the command once a new queue slot is available
in the LLD..  Another option would be to return BUSY status instead,
which will make the initiator retry the original command without
attempting to lower the outstanding queue_depth.
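
To make that distinction concrete, here's a rough sketch (not the actual
scsi-ml disposition code) of how the two statuses differ on the
initiator side:

static int disposition_sketch(u8 sam_status, struct scsi_device *sdev)
{
	switch (sam_status) {
	case SAM_STAT_TASK_SET_FULL:
		/* ramp the LUN queue depth down before retrying */
		scsi_track_queue_full(sdev, sdev->queue_depth - 1);
		/* fall through: retry like BUSY */
	case SAM_STAT_BUSY:
		return ADD_TO_MLQUEUE;	/* requeue and retry the command */
	default:
		return SUCCESS;
	}
}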

So I've got a few (untested) patches to implement this in tcm_fc for the
data-in seq_send() failure case, and will be sending them out shortly.
Please review + test.

Separate from these patches: after reviewing the logs, is the amount of
outstanding I/O on the network the main culprit here..?  Considering
that the upstream fcoe initiator is using cmd_per_lun=3, even with 10
LUNs on a single endpoint the total number of outstanding I/Os is still
pretty low (on the order of 10 * 3 = 30 commands)..

--nab





