[ +Cc Hannes ] On Thu, Aug 09, 2018 at 04:41:24PM +0200, ard wrote: > Hi, > > On Thu, Aug 09, 2018 at 12:01:30PM +0200, ard wrote: > > This to determine if we have a single regression in just the > > login handling or both. > As a matter of fact, I think this will not work on vn2vn: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/libfc?id=386b97b43c0c9e0d878eec7ea1db16af22b036ae > scsi: libfc: Rework PRLI handling > <theorising> > As it clearly rejects the PRLI and eventually sends a LOGO when > no store is associated. > But as san setup goes: it first attaches the network driver to > the fcoe layer and hence is store less. Then it attaches stores > to export. But at that moment all possible initiators already got > a LOGO > </theorising> > > But looking back at the logs, it is not all that bad as what I > said. It's not erroring ad infinitum, it's: > > root@antec:~/logs# grep "vn_add rport 00c76e\|kmemleak" 2018-08-08-kern.log|cut -d\ -f9-|uniq -c > 1 2.577320] kmemleak: Kernel memory leak detector initialized > 1 2.577350] kmemleak: Automatic memory scanning thread started > 1 136.452894] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 1 host10: fip: vn_add rport 00c76e new state 0 > 1 host10: fip: vn_add rport 00c76e old state 0 > 8 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 50 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 4 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 2 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 50 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 50 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 52 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 50 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 50 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 50 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 55 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 52 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 46 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 50 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 46 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 36 host10: fip: vn_add rport 00c76e old state 4 > 1 kmemleak: 50 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 1 kmemleak: 36 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > 1 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > > From the last few lines we can clearly see that de vn_add rport ... old state 4 > coincides with a kmemleak 10 minutes later. especially 50 and 36. Now the exact > dump follows for that. > > Now to get back to what is the difference: > This is a working login with a 4.9: > > Aug 8 10:53:15 localhost kernel: [ 14.929451] host10: fip: vn_add rport 0004e0 old state 0 > Aug 8 10:53:21 localhost kernel: [ 23.143274] host10: fip: beacon from rport 4e0 > Aug 8 10:53:21 localhost kernel: [ 23.143275] host10: fip: beacon expired for rport 4e0 > Aug 8 10:53:21 localhost kernel: [ 23.143276] host10: rport 0004e0: Login to port > Aug 8 10:53:21 localhost kernel: [ 23.143277] host10: rport 0004e0: Entered FLOGI state from Init state > Aug 8 10:53:21 localhost kernel: [ 23.143294] host10: fip: els_send op 7 d_id 4e0 > Aug 8 10:53:21 localhost kernel: [ 23.143301] host10: fip: beacon from rport 4e0 > Aug 8 10:53:21 localhost kernel: [ 23.143920] host10: rport 0004e0: Received a FLOGI accept > Aug 8 10:53:21 localhost kernel: [ 23.143921] host10: rport 0004e0: Port entered PLOGI state from FLOGI state > Aug 8 10:53:21 localhost kernel: [ 23.143956] host10: rport 0004e0: Received a PLOGI accept > Aug 8 10:53:21 localhost kernel: [ 23.143958] host10: rport 0004e0: Port entered PRLI state from PLOGI state > Aug 8 10:53:21 localhost kernel: [ 23.143982] host10: rport 0004e0: Received a PRLI accept > Aug 8 10:53:21 localhost kernel: [ 23.143984] host10: rport 0004e0: PRLI spp_flags = 0x0 spp_type 0x20 > Aug 8 10:53:21 localhost kernel: [ 23.143985] host10: rport 0004e0: Error -6 in state PRLI, retrying > Aug 8 10:53:21 localhost kernel: [ 23.144434] host10: rport 0004e0: Received PRLI request while in state PRLI > Aug 8 10:53:21 localhost kernel: [ 23.144445] host10: rport 0004e0: PRLI rspp type 8 active 1 passive 0 > Aug 8 10:53:21 localhost kernel: [ 23.749492] host10: rport 0004e0: Received RTV request > Aug 8 10:53:23 localhost kernel: [ 25.204559] host10: rport 0004e0: Port timeout, state PRLI > Aug 8 10:53:23 localhost kernel: [ 25.347400] host10: rport 0004e0: Port entered PRLI state from PRLI state > Aug 8 10:53:23 localhost kernel: [ 25.393058] host10: rport 0004e0: Received a PRLI accept > Aug 8 10:53:23 localhost kernel: [ 25.398412] host10: rport 0004e0: PRLI spp_flags = 0x21 spp_type 0x8 > Aug 8 10:53:23 localhost kernel: [ 25.404853] host10: rport 0004e0: Port entered RTV state from PRLI state > Aug 8 10:53:23 localhost kernel: [ 25.450291] host10: rport 0004e0: Received a RTV reject > Aug 8 10:53:23 localhost kernel: [ 25.455620] host10: rport 0004e0: Port is Ready > Aug 8 10:53:23 localhost kernel: [ 25.460185] host10: rport 0004e0: work event 1 > Aug 8 10:53:23 localhost kernel: [ 25.464672] host10: rport 0004e0: No rport! > Aug 8 10:53:23 localhost kernel: [ 25.469577] host10: rport 0004e0: callback ev 1 > Aug 8 10:53:23 localhost kernel: [ 25.474152] host10: fip: vn_rport_callback 4e0 event 1 > Aug 8 10:53:30 localhost kernel: [ 31.850891] host10: fip: beacon from rport 4e0 > > This is the error retry with a 4.16 unpatched: > > Aug 8 11:33:50 localhost kernel: [ 2452.565524] host10: fip: vn_claim_notify: send reply to c76e > Aug 8 11:33:50 localhost kernel: [ 2452.571392] host10: fip: vn_add rport 00c76e new state 0 > Aug 8 11:33:50 localhost kernel: [ 2452.576898] host10: fip: vn_claim_notify: send reply to c76e > Aug 8 11:33:50 localhost kernel: [ 2452.582605] host10: fip: vn_add rport 00c76e old state 0 > Aug 8 11:33:57 localhost kernel: [ 2458.854081] host10: fip: beacon from rport c76e > Aug 8 11:33:57 localhost kernel: [ 2458.858622] host10: fip: beacon expired for rport c76e > Aug 8 11:33:57 localhost kernel: [ 2458.863760] host10: rport 00c76e: Login to port > Aug 8 11:33:57 localhost kernel: [ 2458.868455] host10: rport 00c76e: Entered FLOGI state from Init state > Aug 8 11:33:57 localhost kernel: [ 2458.875177] host10: fip: els_send op 7 d_id c76e > Aug 8 11:33:57 localhost kernel: [ 2458.886089] host10: fip: beacon from rport c76e > Aug 8 11:33:57 localhost kernel: [ 2458.896069] host10: rport 00c76e: Received a FLOGI accept > Aug 8 11:33:57 localhost kernel: [ 2458.901716] host10: rport 00c76e: Port entered PLOGI state from FLOGI state > Aug 8 11:33:57 localhost kernel: [ 2458.934161] host10: rport 00c76e: Received a PLOGI accept > Aug 8 11:33:57 localhost kernel: [ 2458.939651] host10: rport 00c76e: Port entered PRLI state from PLOGI state > Aug 8 11:33:57 localhost kernel: [ 2458.985427] host10: rport 00c76e: Received a PRLI reject > Aug 8 11:33:57 localhost kernel: [ 2458.990817] host10: rport 00c76e: PRLI ELS rejected, reason 5 expl 0 > Aug 8 11:33:57 localhost kernel: [ 2458.997255] host10: rport 00c76e: Error 5 in state PRLI, retrying > Aug 8 11:33:59 localhost kernel: [ 2461.060167] host10: rport 00c76e: Port timeout, state PRLI > Aug 8 11:33:59 localhost kernel: [ 2461.065688] host10: rport 00c76e: Port entered PRLI state from PRLI state > Aug 8 11:33:59 localhost kernel: [ 2461.111048] host10: rport 00c76e: Received a PRLI reject > Aug 8 11:33:59 localhost kernel: [ 2461.116388] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e > Aug 8 11:33:59 localhost kernel: [ 2461.122927] host10: rport 00c76e: Error 5 in state PRLI, retrying > Aug 8 11:34:01 localhost kernel: [ 2463.172152] host10: rport 00c76e: Port timeout, state PRLI > Aug 8 11:34:01 localhost kernel: [ 2463.177696] host10: rport 00c76e: Port entered PRLI state from PRLI state > Aug 8 11:34:01 localhost kernel: [ 2463.224578] host10: rport 00c76e: Received a PRLI reject > Aug 8 11:34:01 localhost kernel: [ 2463.230023] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e > Aug 8 11:34:01 localhost kernel: [ 2463.236765] host10: rport 00c76e: Error 5 in state PRLI, retrying > Aug 8 11:34:03 localhost kernel: [ 2465.283937] host10: rport 00c76e: Port timeout, state PRLI > Aug 8 11:34:03 localhost kernel: [ 2465.289424] host10: rport 00c76e: Port entered PRLI state from PRLI state > Aug 8 11:34:03 localhost kernel: [ 2465.335150] host10: rport 00c76e: Received a PRLI reject > Aug 8 11:34:03 localhost kernel: [ 2465.340540] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e > Aug 8 11:34:03 localhost kernel: [ 2465.347068] host10: rport 00c76e: Error -5 in state PRLI, retries 3 > Aug 8 11:34:03 localhost kernel: [ 2465.353411] host10: rport 00c76e: Port sending LOGO from PRLI state > Aug 8 11:34:03 localhost kernel: [ 2465.359760] host10: fip: els_send op 9 d_id c76e > Aug 8 11:34:05 localhost kernel: [ 2467.046011] host10: fip: beacon from rport c76e > Aug 8 11:34:05 localhost kernel: [ 2467.050549] host10: fip: beacon from rport c76e > Aug 8 11:34:09 localhost kernel: [ 2470.857393] host10: fip: vn_claim_notify: send reply to c76e > Aug 8 11:34:09 localhost kernel: [ 2470.863225] host10: fip: vn_add rport 00c76e old state 4 > Aug 8 11:34:09 localhost kernel: [ 2470.868764] host10: fip: vn_claim_notify: send reply to c76e > Aug 8 11:34:09 localhost kernel: [ 2470.874463] host10: fip: vn_add rport 00c76e old state 4 > Aug 8 11:34:13 localhost kernel: [ 2475.238121] host10: fip: beacon from rport c76e > Aug 8 11:34:13 localhost kernel: [ 2475.242660] host10: fip: beacon expired for rport c76e > Aug 8 11:34:13 localhost kernel: [ 2475.247804] host10: rport 00c76e: port already started > > Repeat: > > Aug 8 11:34:13 localhost kernel: [ 2475.253076] host10: fip: beacon from rport c76e > Aug 8 11:34:21 localhost kernel: [ 2483.429903] host10: fip: beacon from rport c76e > Aug 8 11:34:21 localhost kernel: [ 2483.434500] host10: fip: beacon from rport c76e > Aug 8 11:34:29 localhost kernel: [ 2491.621938] host10: fip: beacon from rport c76e > Aug 8 11:34:29 localhost kernel: [ 2491.626478] host10: fip: beacon from rport c76e > > This part contains a memory leak after a timeout: > Aug 8 11:34:33 localhost kernel: [ 2495.433102] host10: fip: vn_claim_notify: send reply to c76e > Aug 8 11:34:33 localhost kernel: [ 2495.438842] host10: fip: vn_add rport 00c76e old state 4 > Aug 8 11:34:33 localhost kernel: [ 2495.444266] host10: fip: vn_claim_notify: send reply to c76e > Aug 8 11:34:33 localhost kernel: [ 2495.450120] host10: fip: vn_add rport 00c76e old state 4 > Aug 8 11:34:38 localhost kernel: [ 2499.813722] host10: fip: beacon from rport c76e > Aug 8 11:34:38 localhost kernel: [ 2499.818256] host10: fip: beacon expired for rport c76e > Aug 8 11:34:38 localhost kernel: [ 2499.823402] host10: rport 00c76e: port already started > > You can find the raw kmemleak and kern.log here: > https://github.com/hardkernel/linux/issue_comments#issuecomment-411772096 > > And the some filtered logs a bit more up. > > The logs when it all went ok: > https://github.com/hardkernel/linux/issues/360#issuecomment-411335647 > Especially: > Aug 7 17:54:30 localhost kernel: [ 15.540607] host10: fip: claim resp from from rport c76e - state VNMP_UP > Aug 7 17:54:30 localhost kernel: [ 15.548710] host10: fip: vn_add rport 00c76e new state 0 > Aug 7 17:54:30 localhost kernel: [ 15.555463] host10: fip: claim resp from from rport c76e - state VNMP_UP > Aug 7 17:54:30 localhost kernel: [ 15.563584] host10: fip: vn_add rport 00c76e old state 0 > Aug 7 17:54:30 localhost kernel: [ 15.696327] host10: fip: beacon expired for rport c76e > Aug 7 17:54:30 localhost kernel: [ 15.702975] host10: rport 00c76e: Login to port > Aug 7 17:54:30 localhost kernel: [ 15.707617] host10: rport 00c76e: Entered FLOGI state from Init state > Aug 7 17:54:30 localhost kernel: [ 15.714224] host10: fip: els_send op 7 d_id c76e > Aug 7 17:54:30 localhost kernel: [ 15.736741] host10: rport 00c76e: Received a FLOGI accept > Aug 7 17:54:30 localhost kernel: [ 15.742572] host10: rport 00c76e: Port entered PLOGI state from FLOGI state > Aug 7 17:54:30 localhost kernel: [ 15.749695] host10: rport 00c76e: Received a PLOGI accept > Aug 7 17:54:30 localhost kernel: [ 15.749696] host10: rport 00c76e: Port entered PRLI state from PLOGI state > Aug 7 17:54:30 localhost kernel: [ 15.750112] host10: rport 00c76e: Received a PRLI reject > Aug 7 17:54:30 localhost kernel: [ 15.750113] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e > Aug 7 17:54:30 localhost kernel: [ 15.750114] host10: rport 00c76e: Error 5 in state PRLI, retrying > Aug 7 17:54:30 localhost kernel: [ 15.750360] host10: rport 00c76e: Received PRLI request while in state PRLI > Aug 7 17:54:30 localhost kernel: [ 15.750371] host10: rport 00c76e: PRLI rspp type 8 active 1 passive 0 > Aug 7 17:54:30 localhost kernel: [ 15.750570] host10: rport 00c76e: Received RTV request > Aug 7 17:54:30 localhost kernel: [ 17.758897] host10: rport 00c76e: Port timeout, state PRLI > Aug 7 17:54:30 localhost kernel: [ 17.764559] host10: rport 00c76e: Port entered PRLI state from PRLI state > Aug 7 17:54:30 localhost kernel: [ 17.812081] host10: rport 00c76e: Received a PRLI accept > Aug 7 17:54:31 localhost kernel: [ 17.818809] host10: rport 00c76e: PRLI spp_flags = 0x21 spp_type 0x8 > Aug 7 17:54:31 localhost kernel: [ 17.826634] host10: rport 00c76e: Port entered RTV state from PRLI state > Aug 7 17:54:31 localhost kernel: [ 17.878821] host10: rport 00c76e: Received a RTV accept > Aug 7 17:54:31 localhost kernel: [ 17.885478] host10: rport 00c76e: Port is Ready > Aug 7 17:54:31 localhost kernel: [ 17.891513] host10: rport 00c76e: work event 1 > Aug 7 17:54:31 localhost kernel: [ 17.897386] host10: rport 00c76e: No rport! > Aug 7 17:54:31 localhost kernel: [ 17.903698] host10: rport 00c76e: callback ev 1 > Aug 7 17:54:31 localhost kernel: [ 17.909081] host10: fip: vn_rport_callback c76e event 1 > Aug 7 17:54:34 localhost kernel: [ 20.992106] host10: rport 00c76e: Received ADISC request > > So in the "working" case I see that the other node (4.16 unpatched) accepts our > PRLI, if I interpret that right, and in the other case it > doesn't. Same system, different boot order though, but I guess > the boot order doesn't matter. > When it doesn't work, it gets or sends a mutual LOGO and gets into a weird state. > > Anyway: as far as I interpreted it, in the vn2vn situation alle > enodes should know all other rports to prevent a double rport > assignment. I have no experience with FC itself though, so I > don't know what the normal state is but sending a LOGO due to a > failure to set up, is that ok? > Or is the vn2vn rport maintenance on another layer? > > The 4.9 kernel is still to be baked... > I expect it to yield minor kmemleaks, until I start toggling the > FCoE. > > Regards, > Ard van Breemen > > PS: I removed some cross spamming from the Cc:, I hope that's ok. Full quote for reference. This looks like the PRLI rework from Hannes introduced the memleaks here. Hannes can you have a look at it? -- Johannes Thumshirn Storage jthumshirn@xxxxxxx +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850