> > -----Original Message-----
> > From: FUJITA Tomonori [mailto:fujita.tomonori@xxxxxxxxxxxxx]
> > Sent: Sunday, April 08, 2012 9:14 PM
> > To: Goodman, Brad
> > Cc: stgt@xxxxxxxxxxxxxxx; alexandern@xxxxxxxxxxxx
> > Subject: Re: Infinite Loop on 1.0.26?
> >
> > On Sat, 7 Apr 2012 21:53:23 -0400
> > <Brad.Goodman@xxxxxxx> wrote:
> >
> > >> > I have never seen this type of behavior ever, on prior
> > >> > versions. Barring that investigation when/if this happens again -
> > >> > I just wanted to see if this was a "known" issue, or anyone had
> > >> > ever seen anything like this before. Is this new? Any ideas?
> > >>
> > >> As far as I know, it's new. What the last tgt version worked well
> > >> for you?
> > >
> > > We have done a decent amount of testing with versions 1.0.14 and
> > > 1.0.23.
> > >
> > > Our testing with 1.0.26 has been fairly short (I'm guessing under
> > > 10 minutes actual total run-time).
> > >
> > > However, in prior versions our testing has been limited to a
> > > maximum of two initiators, whereas our 1.0.26 testing has been with
> > > a maximum of 8 initiators. In both cases, again, testing has been
> > > specific to iSER.
> >
> > Can you perform the same test against the old versions?

> I have done some more testing and have not seen this exact bug
> reproduced; however, I have seen other issues happen with 1.0.26 which
> I believe may explain what was happening.
>
> When we reproduced this bug with 1.0.26, it appears as though it did
> not necessarily happen during our intensive data testing - but at some
> other time around it. I wasn't sure quite when, but it may have been
> AFTER the testing, possibly associated with other activities, such as
> adding/removing initiators, etc.
>
> In other testing (though I have not seen this exact bug), I have seen
> cases where, if one were to accidentally use a pre-1.0.26 version of
> tgtadm to talk to tgtd, it appears as though memory leaks occur. I have
> seen this manifest itself in several different, reproducible ways. For
> example, I could create a target, then query which targets exist, and
> the name of the target I had just created would show up as garbled. I
> would then try to add a LUN to the target, and it would say the LUN was
> "in use", although it didn't even exist, etc.
>
> So, I would conclude that there is a decent chance that I had been
> using an older version of tgtadm, which may have caused this problem.
> One of my engineers on the project had told me at one point that there
> appeared to be differences in [the data structures associated with]
> tgtadm communication in the newer 1.0.26, and there may be some
> compatibility issues. Thus, I believe I am seeing just that.
>
> I would possibly advise:
>
> 1. That such data structures, which could potentially change to the
> point of incompatibility, be stamped with some sort of "version
> number", so that messages sent with incompatible versions may be
> rejected.
>
> 2. Safeguards against the types of (apparent) buffer leaks that may
> happen when bad or incompatible data is sent.
>
> Either way, I will still keep a watchful eye for issues, but am
> willing to lay this issue to rest for now.
>
> Thanks for your time and attention,
>
> Brad Goodman
> EMC

Sorry for the craziness - but FURTHER testing indicates that this STILL
happens with 1.0.26! We do have a bit more info, though:

First, it did happen during "steady-state" data testing - meaning we
were just running traffic from some initiators.
There were no initiators being logged in or out, and no tgtadm commands
being issued, when this happened.

Second, when it did happen, it was very hard to notice. CPU usage would
spike on the target (on all tgtd processes), but this did NOT have an
effect on performance (see discussion below).

For background, our testing was using eight different tgtd instances,
each with one target which had a NULL backing-store device and an AIO
backing device. (There is some debate over which of the two devices we
were actually exercising when the problem arose.) We were testing
exclusively over iSER.

When this problem happened (100% CPU time on all tgtd processes - 85%
system / 15% user), tgtd seemed as though it was constantly calling
epoll_wait, and rather than blocking on something, it would always
immediately return 1. Further investigation showed that sometimes it
would return 2. When 2 was returned, it would immediately service other
devices, like the timerfd file descriptor (which seemed to fire every
500 ms), and sometimes the ibverbs-event fd. But the VAST majority of
the time - and this is telling in and of itself - it didn't make *any*
syscall in its handling of whatever fd it was getting. (I.e. it didn't
go off and read a file descriptor in response to epoll_wait apparently
telling it that an fd had an event on it.) It seemed akin to a spurious
interrupt - constantly (falsely) firing for no reason, maybe. (A minimal
sketch of how a level-triggered epoll loop can spin exactly this way is
appended below.)

This could have been for a few reasons:

1. epoll_wait is broken (doubtful)
2. Some underlying driver/service was broken, constantly notifying
   epoll_wait
3. Something weird was happening which constantly needed actual
   attention

HOWEVER - and this is the big one - what event could have fired that
tgtd would then handle in a way that required no syscall at all? (I
don't know the answer.) Some sort of service (RDMA?) which could have
put data into user memory directly, so that a syscall would not be
required to poll the result? I don't know if such a mechanism exists or
is used here. (A sketch of the ibverbs completion-handling pattern -
which does involve exactly that kind of user-memory polling - is also
appended below.)

As to the "second" point above: this behavior was not directly
noticeable during periods of heavy traffic because, even though it
consumed a lot of CPU cycles to constantly poll events which apparently
did NOT exist, when actual events DO exist the poll does what it
normally needs to, and does not seem to be preempted by the "false"
events. Therefore, no reduction in throughput is seen. This state only
seems to be detected once we STOP our normal testing and notice that
tgtd is still consuming 100% of each CPU it is on. (I think I may have
indicated the contrary in a prior message - that communication to tgtd
was impacted while in this state. In this latest test, that certainly
was not the case.)

We didn't have a version of tgtd built at the time with sufficient
debug information for us to do any inspection other than with strace.
We are in the process of testing with versions prior to 1.0.26, and of
having debug tooling and instrumentation on hand for the code we will
continue to run.

Thanks,

-BKG

P.S. This doesn't change what I said about the problems with
incompatibilities between tgtd and tgtadm - with versions newer than
and prior to 1.0.25 (or 1.0.26?).
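[Appended sketch #1 - illustrative only, not stgt code.] A minimal,
self-contained example of the epoll_wait pattern described above: if a
level-triggered fd is ready and the handler never drains it (or is a
no-op for that event), epoll_wait() returns 1 immediately on every
iteration and the loop burns 100% CPU while making no other syscalls
for that fd. The eventfd here is just a stand-in for whatever fd tgtd
was actually being woken on.

/* Level-triggered epoll busy-loop demonstration (hypothetical). */
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

int main(void)
{
    int epfd = epoll_create1(0);
    int efd  = eventfd(1, 0);            /* counter is non-zero, so the fd starts readable */

    struct epoll_event ev;
    memset(&ev, 0, sizeof(ev));
    ev.events  = EPOLLIN;                /* level-triggered: no EPOLLET */
    ev.data.fd = efd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

    for (;;) {
        struct epoll_event out[8];
        int n = epoll_wait(epfd, out, 8, -1);   /* returns 1 immediately, every time */
        printf("epoll_wait -> %d\n", n);
        /* The "handler" never read()s efd, so the fd stays readable and
         * epoll_wait never blocks again - strace shows nothing but
         * back-to-back epoll_wait calls. */
    }
    return 0;
}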
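[Appended sketch #2 - a guess at the mechanism, not the actual
stgt/iSER code.] On the "no syscall needed" question: with libibverbs,
the thing epoll watches is the completion channel fd, but the
completions themselves are read with ibv_poll_cq(), which walks a
completion queue mapped into user memory and makes no syscall. If a
handler polled the CQ but never drained the channel fd with
ibv_get_cq_event() (a read() under the hood), the fd would stay
readable and a level-triggered epoll loop would keep firing with no
visible syscall - which is at least consistent with the strace pattern
above. The sketch below only shows the normal handling pattern; it
needs an RDMA device to run and omits QP setup and posted work.

/* ibverbs completion-channel handling pattern (hypothetical sketch). */
#include <stdio.h>
#include <infiniband/verbs.h>

void handle_cq_fd_ready(struct ibv_comp_channel *chan)
{
    struct ibv_cq *cq;
    void *cq_ctx;

    /* Step 1: drain the channel fd (a read() syscall under the hood).
     * Skipping this leaves the fd readable forever. */
    if (ibv_get_cq_event(chan, &cq, &cq_ctx))
        return;
    ibv_ack_cq_events(cq, 1);

    /* Step 2: re-arm notification before polling so no completion is
     * missed between the poll below and the next epoll_wait(). */
    ibv_req_notify_cq(cq, 0);

    /* Step 3: poll the CQ - pure user-space memory reads, no syscall. */
    struct ibv_wc wc[16];
    int n;
    while ((n = ibv_poll_cq(cq, 16, wc)) > 0)
        for (int i = 0; i < n; i++)
            printf("wr_id %llu status %d\n",
                   (unsigned long long)wc[i].wr_id, wc[i].status);
}

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_comp_channel *chan = ibv_create_comp_channel(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, chan, 0);
    ibv_req_notify_cq(cq, 0);

    /* chan->fd is what an epoll loop would watch; when it is reported
     * readable, handle_cq_fd_ready(chan) runs.  (Not called here, since
     * with no posted work it would simply block in ibv_get_cq_event.) */

    ibv_destroy_cq(cq);
    ibv_destroy_comp_channel(chan);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}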