On Fri, 2016-03-11 at 12:21 +0000, Benjamin ESTRABAUD wrote:
> On 11/03/16 11:45, Benjamin ESTRABAUD wrote:
> > On 10/03/16 21:41, Mike Christie wrote:
> >> On 03/08/2016 12:57 PM, Benjamin ESTRABAUD wrote:
> >>> Hi,
> >>>
> >>> I've been facing a small issue lately when working with 128 iSCSI
> >>> targets on a single system with multiple iSCSI initiators connecting
> >>> to it (3 or 4 initiators). If I remove the iSCSI targets, or even just
> >>> the ACLs, from the system even temporarily, I get flooded by thousands
> >>> upon thousands of connection failures from the hosts trying to log in
> >>> to the system, with messages like:
> >>>
> >>> [  923.560908] iSCSI Initiator Node: iqn.1994-05.com.redhat:c87d91366225
> >>> is not authorized to access iSCSI target portal group: 1.
> >>> [  923.561124] iSCSI Login negotiation failed.
> >>>
> >>> These are fine in small numbers, but with all of the hosts combined
> >>> with so many targets the system gets overwhelmed: the kernel logger
> >>> gets flooded and the network portal thread's CPU usage ramps up to
> >>> close to 100%.
> >>>
> >>> Is there a way to limit those, say for instance by adding a delay to
> >>> the login attempt processing that gradually increases with each login
> >>> failure?
> >>>
> >>> I can control the hosts, but that is not always a workable solution
> >>> (sometimes the initiators have been improperly configured and I end
> >>> up back in the same situation).
> >>>
> >>
> >> Is this with Linux hosts only? We can implement something on the
> >> initiator side, like other OSes have, where for these types of errors
> >> it will stop retrying the relogin and the user then has to relogin
> >> manually later. We used to do that by default, but hit issues. We can
> >> just make the behavior a config option.
> >
> > So far yes, I've only seen this with a RHEL 7.1 host with multipath
> > enabled on all the iSCSI LUNs.
> > A user would take the RAID offline on the
> > target, causing the associated volume and LIO targets to be taken
> > offline. Because this is a symmetrical dual-controller system, one
> > path of the multipath device becomes unresponsive (I/Os submitted to
> > it will time out) and the other path, to the side which hosted the
> > RAID, will return iSCSI login errors. It seems that iscsid, or I/Os
> > submitted to the iSCSI device by multipathd (path-checker I/Os,
> > application I/Os, etc.), triggers a login request to the target very
> > frequently. Since we have between 64 and 128 targets, it ends up
> > generating a DoS.
> >
> > Having this behaviour back as a config option would greatly help, as
> > in this particular example automatic recovery is unlikely (somebody
> > took the storage offline on the target side, and it may not come back).
> >
> Actually, tweaking "node.session.iscsi.DefaultTime2Wait = 2" seems to
> help a lot.

DefaultTime2Wait controls the delay of initiator session + connection
reconnect. Note this parameter is negotiated during login, and can also
be set in a per-TargetName+TargetPortalGroupTag context.

> Changing this from "2" to "10" caused the iSCSI session on "yanked"
> targets to attempt to reconnect five times less often, which is more
> or less what I was looking for (mitigating the number of connection
> attempts). Were you talking about a way for the iSCSI initiator to
> eventually give up, or a way to change the frequency at which the
> initiator retries?

node.session.timeo.replacement_timeout controls the timeout for
completing outstanding I/O (with error status) from open-iscsi back to
scsi-core, and is separate from iSCSI session reconnect.
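For anyone landing on this thread later, a sketch of how these two knobs
are actually set on the open-iscsi side: globally via /etc/iscsi/iscsid.conf
(applies to node records created after the change), or per node record with
iscsiadm. The target IQN and portal address below are placeholders, not
values from this thread.

```shell
# Global defaults in /etc/iscsi/iscsid.conf (picked up when node records
# are created by discovery):
#   node.session.iscsi.DefaultTime2Wait = 10       # slow down reconnect attempts
#   node.session.timeo.replacement_timeout = 120   # how long to queue I/O before
#                                                  # failing it up to scsi-core

# Per-target override on an existing node record (IQN/portal are examples):
iscsiadm -m node -T iqn.2003-01.org.example:target0 -p 192.168.1.10:3260 \
         -o update -n node.session.iscsi.DefaultTime2Wait -v 10

# Verify what the record now contains:
iscsiadm -m node -T iqn.2003-01.org.example:target0 -p 192.168.1.10:3260 \
         | grep -E 'DefaultTime2Wait|replacement_timeout'
```

Note that since DefaultTime2Wait is negotiated at login, the value the
session actually uses also depends on what the target offers.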