Re: [Open-FCoE] System crashes with increased drive count

On Thu, 2014-06-19 at 17:17 -0700, Jun Wu wrote:
> On Thu, Jun 19, 2014 at 10:00 AM, Vasu Dev <vasu.dev@xxxxxxxxxxxxxxx> wrote:
> > On Thu, 2014-06-12 at 18:20 -0700, Jun Wu wrote:
> >> On Thu, Jun 12, 2014 at 5:43 PM, Vasu Dev <vasu.dev@xxxxxxxxxxxxxxx> wrote:
> >> > On Thu, 2014-06-12 at 15:18 -0700, Jun Wu wrote:
> >> >> On Wed, Jun 11, 2014 at 11:19 AM, Vasu Dev <vasu.dev@xxxxxxxxxxxxxxx> wrote:
> >> >> > On Tue, 2014-06-10 at 19:40 -0700, Jun Wu wrote:
> >> >> >> On Tue, Jun 10, 2014 at 3:38 PM, Vasu Dev <vasu.dev@xxxxxxxxxxxxxxx> wrote:
> >> >> >> > On Tue, 2014-06-10 at 09:46 -0700, Jun Wu wrote:
> >> >> >> >> This is a Supermicro chassis with redundant power supplies. We see the
> >> >> >> >> same failures with both SSDs and HDDs.
> >> >> >> >> The same tests pass with non-fcoe protocols, i.e. iSCSI or AoE.
> >> >> >> >>
> >> >> >> >
> >> >> >> > Are the iSCSI or AoE tests run with the same TCM core kernel and the same
> >> >> >> > target and host NICs/switch?
> >> >> >>
> >> >> >> We tested AoE with the same hardware/switch and test setup. AoE works,
> >> >> >> except that it is not an enterprise protocol and it doesn't provide the
> >> >> >> performance we need. It doesn't use TCM.
> >> >> >>
> >> >> >
> >> >> > You had fcoe working with a lower queue depth, and that should be yielding
> >> >> > lower performance much like AoE; besides, AoE is not using TCM, so it is not
> >> >> > a correct comparison. What about iSCSI, is that using TCM?
> >> >>
> >> >> We didn't use TCM for the iSCSI test either.
> >> >>
> >> >
> >> > Too many variables to compare or get any help in isolating issues here.
> >> >
> >> >> >> >
> >> >> >> > What NICs are in your chassis? As I mentioned before, "DCB and PFC PAUSE
> >> >> >> > typically used and required by fcoe", but you are using plain PAUSE and, as
> >> >> >> > you mentioned, the switch cannot be eliminated; these could affect FCoE more
> >> >> >> > than other protocols. So can you ensure the IO errors are not due to frame
> >> >> >> > losses without DCB/PFC in your setup?
> >> >> >>
> >> >> >> The NIC is:
> >> >> >> [root@poc1 log]# lspci | grep 82599
> >> >> >> 08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> >> >> >> SFI/SFP+ Network Connection (rev 01)
> >> >> >>
> >> >> >> The issue should not be caused by frame losses. The systems work fine
> >> >> >> with other protocols.
> >> >> >
> >> >> > FCoE is less tolerant than the others to packet losses and latency variations,
> >> >> > since it aims for more FC-like deterministic fabric performance; therefore
> >> >> > no-drop ethernet is a must for FCoE, unlike the others in the comparison. For
> >> >> > instance, iSCSI adapts its tx window to frame losses, but there is no such
> >> >> > mechanism in FCoE. Thus you cannot conclude that there are no frame losses
> >> >> > just because the others work in the setup; iSCSI and AoE should work fine
> >> >> > without no-drop ethernet, PAUSE or PFC. So can you confirm there are no frame
> >> >> > losses using "ethtool -S ethX"?
> >> >> >
> >> >>
> >> >> I ran the same test again, that is 10 fio sessions from one initiator
> >> >> to 10 drives on the target via vn2vn. After I saw the following
> >> >> messages
> >> >>   poc2 kernel: ft_queue_data_in: Failed to send frame
> >> >> ffff8800ae64ba00, xid <0x389>, remaining 196608, lso_max <0x10000>
> >> >>
> >> >> I checked
> >> >>   ethtool -S p4p1 | grep error
> >> >> and
> >> >>   ethtool -S p4p1 | grep control
> >> >> on both initiator and target, the numbers were all zero.
> >> >>
> >> >> So the abort should not be caused by frame losses.
> >> >
> >> > You mentioned below that the frames-missed count went up, so there are packet
> >> > drops. An abort would be due either to loss or to a response not arriving in
> >> > time, so it could still be due to frame loss. BTW, is this after increasing
> >> > the timeout as suggested above, with REC disabled?
> >>
> >> In this particular test, I saw multiple "Failed to send frame"
> >> messages first, and even after quite a while no frame losses were
> >> reported. It seems to me that frame loss was not the cause of the
> >> abort, at least in this case.
> >> Yes, the kernel used for the test has REC disabled.
> >>
> >
> > In that case it takes 10 seconds before an abort is issued, so I don't know
> > what would hold IO that long in your test setup without any frame loss. You
> > might want to trace IO end to end or profile the system for bottlenecks.
> > As far as the FCoE host stack goes, it is significantly optimized; I push over
> > 2M IOPS with just a single dual-port 82599ES 10-Gigabit adapter, so the
> > bottleneck is possibly beyond the host interface, anywhere from the switch to
> > the backend, and therefore profiling or eliminating some setup elements could
> > help you locate it.
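For example (a rough sketch, not the only way; it assumes blktrace and perf are
installed, /dev/sdt is one of the affected devices mentioned in this thread, and
the 30-second window is arbitrary):

  # block-layer trace of one affected device while fio runs
  blktrace -d /dev/sdt -w 30 -o sdt
  blkparse -i sdt | less
  # system-wide CPU profile over the same window
  perf record -a -g -- sleep 30
  perf report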
> >
> 
> For the 2M IOPS test, what IO pattern did you run? How many target
> drives do you have?

I used two RAM-based soft SANblaze FCoE targets on a Romley E5-2690.

> We are using a Supermicro chassis, so the 10 SSD drives are all local
> to the target machine.
> We are also able to get around 1M IOPS with a single port at 4KB IO
> size.

That is good.

>  The issue seems to be associated with large IO size.
> 

PFC/PAUSE becomes more relevant with large IO because those workloads are IO
bound, with the 10Gig link saturated while the CPUs are mostly idle, or at
least not pegged; if the latter is not the case then you have some other
bottleneck not related to no-drop ethernet. In any case it would result in
aborts, as discussed before and in the same instances again below.
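A quick way to confirm that split (a sketch assuming the sysstat tools are
installed) is to watch per-CPU load and per-interface throughput while the
large-IO run is active:

  # per-CPU utilization, sampled every second
  mpstat -P ALL 1
  # per-interface rx/tx throughput in kB/s
  sar -n DEV 1

If the 10Gig interface sits near line rate while the CPUs stay largely idle,
the workload is IO bound as described above.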

> I repeated the one-initiator, one-target test with 10 target SSD drives
> again with different IO patterns:
> 1. 1MB io size, sequential write. There were no issues.
> 2. 1MB io size, random write. No issues.
> 3. 1MB io size, sequential read. Within one minute, I saw the following
> message on the target, which indicates an abort from the initiator:
>        Jun 19 13:32:33 poc2 kernel: [ 2820.617015] ft_queue_data_in:
> Failed to send frame ffff8806133a8000, xid <0x86a>, remaining 458752,
> lso_max <0x10000>
>      iostat -x 1 showed 0 iops for multiple drives.
> 4. 1MB io size, random read. This is the least stable one. Immediately
> iostat -x 1 showed zero iops for 8 out of the 10 target drives. The
> following messages printed out on the initiator side:
>        Jun 19 15:03:46 poc1 kernel: [  617.837398] sd 7:0:0:8: [sdt]
> Unhandled error code
>        Jun 19 15:03:46 poc1 kernel: [  617.837403] sd 7:0:0:8: [sdt]
>        Jun 19 15:03:46 poc1 kernel: [  617.837406] Result:
> hostbyte=DID_ERROR driverbyte=DRIVER_OK
>        Jun 19 15:03:46 poc1 kernel: [  617.837408] sd 7:0:0:8: [sdt] CDB:
>        Jun 19 15:03:46 poc1 kernel: [  617.837409] Read(10): 28 00 1c
> 0c d4 00 00 04 00 00
>        Jun 19 15:03:46 poc1 kernel: [  617.837418] end_request: I/O
> error, dev sdt, sector 470602752
> 
> There were no frame losses or PAUSE frames sent during any of the above
> tests, even though each test lasted quite long, around 20-30
> minutes.
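For reference, a minimal fio sketch approximating the least stable pattern
above, 1MB random read (the device name, iodepth and runtime here are
assumptions, not the exact job used in these tests):

  fio --name=randread-1m --filename=/dev/sdt --rw=randread --bs=1M \
      --ioengine=libaio --iodepth=32 --direct=1 --runtime=1800 --time_based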
> 

I suppose you are checking at all access points, I mean at the switch end as
well as at the NIC. I typically see PFC from the switch under stress; possibly
your switch has larger port buffers and so is not generating any PAUSE.
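A quick way to check this at the NIC end (a sketch assuming the interface is
p4p1 as in your earlier logs):

  # is rx/tx pause negotiated on the link?
  ethtool -a p4p1
  # pause and drop counters
  ethtool -S p4p1 | egrep -i 'flow_control|pause|missed|drop'

The switch-side counters would still need to be read from the switch CLI.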

> > Again repeating what I mentioned before: the switch you cannot eliminate from
> > your setup is running without DCB/PFC, so either find a way to skip that hop
> > or dig further into the switch for possible drops, extended PAUSE etc. If you
> > could tell me the switch in use then I might try to find more on it, though
> > that all goes beyond the Open-FCoE stack.
> >
> 
> The switch model is the Arista 7050S-52. PFC/Pause frames are not
> enabled on the switch. There is only one switch between the initiator
> and the target.

I don't have that one here, but typically "show interfaces .." dumps
useful stats for any frame underruns, CRC errors etc.

//Vasu

> Thanks,
> 
> Jun
> 
> >
> >> >
> >> >>
> >> >> >> >
> >> >> >> > While aborts are possibly issued at the target due to zero timeout values,
> >> >> >> > you could avoid them completely by increasing the scsi timeout and disabling
> >> >> >> > REC as discussed before.
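As a concrete example of the timeout part only (a sketch; sdX and the 60-second
value are placeholders, and REC in these tests was disabled in the kernel build
itself rather than through a runtime knob):

  # raise the SCSI command timeout for one device from the default 30s
  echo 60 > /sys/block/sdX/device/timeout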
> >> >> >> >
> >> >> >
> >> >> > Now that I know ixgbe (82599) is in use, try a few more things in addition
> >> >> > to the suggestions above (a shell sketch follows this list):
> >> >> > 1) Disable the irq balancer.
> >> >> > 2) Find your ethX interrupts through "cat /proc/interrupts | grep ethX".
> >> >> > Identifying the fcoe ones among them is tricky; they may be labeled with fcoe,
> >> >> > but if not, identify them through interrupt activity while fcoe traffic is on.
> >> >> > A total of 8 fcoe interrupts are used; pin them across the first eight CPUs
> >> >> > used in your workloads.
> >> >> > 3) Increase the ring sizes from the default 512 to 2K or 4K, just a hunch in
> >> >> > case frames are dropped due to longer PAUSE or congestion in your setup.
> >> >> > 4) Also monitor ethX stats besides the fcoe hostX stats for anything that
> >> >> > stands out as odd at "/sys/class/fc_host/hostX/statistics/".
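A shell sketch of the steps above (the interface name p4p1, host9 and the IRQ
number 75 are assumptions taken from this thread or made up for illustration;
read the real vectors from /proc/interrupts and repeat the pinning for each of
the eight fcoe vectors):

  # 1) stop the irq balancer so manual pinning sticks
  service irqbalance stop
  # 2) list the interrupt vectors for the interface
  cat /proc/interrupts | grep p4p1
  # pin one vector to CPU0 (hex CPU mask); repeat per fcoe vector
  echo 1 > /proc/irq/75/smp_affinity
  # 3) grow the rx/tx descriptor rings from 512 to 4096
  ethtool -G p4p1 rx 4096 tx 4096
  # 4) watch NIC stats and fc_host stats while traffic runs
  ethtool -S p4p1 | egrep -i 'drop|miss|error|flow_control'
  grep . /sys/class/fc_host/host9/statistics/*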
> >> >> >
> >> >> >
> >> >>
> >> >> We disabled the irq balancer, found the interrupts and pinned them, and
> >> >> increased the ring sizes to 4K. It seems that these changes allow the
> >> >> test to run longer, but the target eventually hung. In this test we
> >> >> saw non-zero tx_flow_control_xon/off and rx_flow_control_xon/off
> >> >> numbers, which indicate PAUSE is working. We did see frame losses
> >> >> (rx_missed_errors) in this test. So PAUSE frames were issued,
> >> >
> >> > The NIC sent pause frames out, but the switch still not stopping suggests
> >> > that pause is not enabled on the switch side, leading to rx_missed_errors;
> >> > again, a back-to-back setup would have helped here, as Nab suggested before.
> >> >
> >> >> but
> >> >> ultimately it didn't work.
> >> >>
> >> >> Is it reasonable to expect PAUSE frames to be sufficient for end-to-end
> >> >> flow control between nodes?
> >> >
> >> > PAUSE/PFC is link level, but it would spread in the case of multiple nodes;
> >> > that depends on the switch in use, so check with the vendor.
> >> >
> >> >> If not, should the issue be dealt with at the
> >> >> mid level by, for example, issuing BUSY to manage flow control between
> >> >> nodes? Or does there need to be some management at a higher level, by
> >> >> limiting outstanding commands? What is the best way to manage this?
> >> >>
> >> >
> >> > The stack should handle this, provided PAUSE is working and frames are not
> >> > dropped at L2.
> >> >
> >> >> We are flooding the network link, using all SSD drives and pushing the
> >> >> boundaries:)
> >> >>
> >> >
> >> > Yeap, I doubt anyone has done enough stress on TCM FC so far; the host SW is
> >> > mostly tested, at least at Intel, against real FC/FCoE targets.
> >> >
> >> > //Vasu
> >> >
> >> >> Thanks,
> >> >> Jun
> >> >>
> >> >> > <snip>
> >> >> >>
> >> >> >> Is the following cmd_per_lun fcoe related? Its default value is 3, and
> >> >> >> it doesn't allow me to change it.
> >> >> >> /sys/devices/pci0000:00/0000:00:05.0/0000:08:00.0/net/p4p1/ctlr_2/host9/scsi_host/host9/cmd_per_lun
> >> >> >
> >> >> > I think this doesn't matter once the device queue depth is adjusted to 32,
> >> >> > and that one can be changed. I mean, cmd_per_lun is used at scsi host alloc
> >> >> > as the initial queue depth; later the scsi device queue depth is adjusted to
> >> >> > 32 through the slave_alloc callback, and that can be tuned
> >> >> > at /sys/block/sdX/device/queue_depth as you did before, but not cmd_per_lun.
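For example (a sketch; sdX and the value 32 are placeholders):

  # per-device queue depth, adjustable at runtime
  cat /sys/block/sdX/device/queue_depth
  echo 32 > /sys/block/sdX/device/queue_depth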
> >> >> >
> >> >> > //Vasu
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >


--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



