On Fri, 2016-03-11 at 12:21 +0000, Benjamin ESTRABAUD wrote:
> On 11/03/16 11:45, Benjamin ESTRABAUD wrote:
> > On 10/03/16 21:41, Mike Christie wrote:
> >> On 03/08/2016 12:57 PM, Benjamin ESTRABAUD wrote:
> >>> Hi,
> >>>
> >>> I've been facing a small issue lately when working with 128 iSCSI
> >>> targets on a single system with multiple iSCSI initiators connecting
> >>> to it (3 or 4 initiators). If I remove the iSCSI targets, or even just
> >>> the ACLs, from the system even temporarily, I get flooded by thousands
> >>> upon thousands of connection failures from the hosts trying to log in
> >>> to the system, with messages like:
> >>>
> >>> [  923.560908] iSCSI Initiator Node: iqn.1994-05.com.redhat:c87d91366225
> >>> is not authorized to access iSCSI target portal group: 1.
> >>> [  923.561124] iSCSI Login negotiation failed.
> >>>
> >>> These are fine in small numbers, but with all of the hosts combined
> >>> with so many targets the system gets overwhelmed: the kernel logger
> >>> gets flooded and the network portal thread's CPU usage ramps up to
> >>> close to 100%.
> >>>
> >>> Is there a way to limit those, say for instance by adding a delay to
> >>> the login attempt processing that gradually increases with each login
> >>> failure?
> >>>
> >>> I can control the hosts, but that is not always a workable solution
> >>> (sometimes the initiators have been improperly configured and I end
> >>> up back in the same situation).
> >>>
> >>
> >> Is this with Linux hosts only? We can implement something on the
> >> initiator side, like other OSes have, where for these types of errors
> >> it will stop retrying the relogin and the user then has to relogin
> >> manually later. We used to do that by default, but hit issues. We can
> >> just make the behavior a config option.
> >
> > So far yes, I've only seen this with a RHEL 7.1 host with multipath
> > enabled on all the iSCSI LUNs.
> > A user would take the RAID offline on the
> > target, causing the associated volume and LIO targets to be taken
> > offline. Because this is a symmetrical dual-controller system, one
> > path of the multipath device becomes unresponsive (I/Os submitted to
> > it will time out) and the other path, to the side which hosted the
> > RAID, will return iSCSI login errors. It seems that iscsid, or I/Os
> > submitted to the iSCSI device by multipathd (path-checker I/Os,
> > application I/Os, etc.), triggers a login request to the target very
> > frequently. Since we have between 64 and 128 targets, it ends up
> > generating a DoS.
> >
> > Having this behaviour back as a config option would greatly help, as
> > in this particular example automatic recovery is unlikely (somebody
> > took the storage offline on the target side, and it may not come back).
> >
> Actually, tweaking "node.session.iscsi.DefaultTime2Wait = 2" seems to
> help a lot.

DefaultTime2Wait controls the delay of initiator session + connection
reconnect. Note this parameter is negotiated during login, and can also
be set in a per-TargetName+TargetPortalGroupTag context.

> Changing this from "2" to "10" caused the iSCSI session on "yanked"
> targets to attempt to reconnect five times less often, which is more
> or less what I was looking for (mitigating the number of connection
> attempts). Were you talking about a way for the iSCSI initiator to
> eventually give up, or a way to change the frequency at which the
> initiator retries?

node.session.timeo.replacement_timeout controls the timeout for
completing outstanding I/O (with error status) from open-iscsi back to
scsi-core, and is separate from iSCSI session reconnect.
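For anyone landing on this thread later, a sketch of how these two knobs
are actually set on the open-iscsi side: globally via /etc/iscsi/iscsid.conf
(applies to node records created after the change), or per node record with
iscsiadm. The target IQN and portal address below are placeholders, not
values from this thread.

```shell
# Global defaults in /etc/iscsi/iscsid.conf (picked up when node records
# are created by discovery):
#   node.session.iscsi.DefaultTime2Wait = 10       # slow down reconnect attempts
#   node.session.timeo.replacement_timeout = 120   # how long to queue I/O before
#                                                  # failing it up to scsi-core

# Per-target override on an existing node record (IQN/portal are examples):
iscsiadm -m node -T iqn.2003-01.org.example:target0 -p 192.168.1.10:3260 \
         -o update -n node.session.iscsi.DefaultTime2Wait -v 10

# Verify what the record now contains:
iscsiadm -m node -T iqn.2003-01.org.example:target0 -p 192.168.1.10:3260 \
         | grep -E 'DefaultTime2Wait|replacement_timeout'
```

Note that since DefaultTime2Wait is negotiated at login, the value the
session actually uses also depends on what the target offers.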