On Tue, 2013-03-26 at 18:51 -0700, Andrew Morton wrote: > On Tue, 26 Mar 2013 21:44:12 -0400 Peter Hurley <peter@xxxxxxxxxxxxxxxxxx> wrote: > > > On Tue, 2013-03-26 at 12:43 -0700, akpm@xxxxxxxxxxxxxxxxxxxx wrote: > > > The patch titled > > > Subject: revert "ipc: don't allocate a copy larger than max" > > > has been added to the -mm tree. Its filename is > > > revert-ipc-dont-allocate-a-copy-larger-than-max.patch > > > > > > Before you just go and hit "reply", please: > > > a) Consider who else should be cc'ed > > > b) Prefer to cc a suitable mailing list as well > > > c) Ideally: find the original patch on the mailing list and do a > > > reply-to-all to that, adding suitable additional cc's > > > > > > *** Remember to use Documentation/SubmitChecklist when testing your code *** > > > > > > The -mm tree is included into linux-next and is updated > > > there every 3-4 working days > > > > > > ------------------------------------------------------ > > > From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > > > Subject: revert "ipc: don't allocate a copy larger than max" > > > > > > Revert 88b9e456b164. Dave has confirmed that this was causing oopses > > > during trinity testing. > > > > No, he didn't. > > hm, so perhaps he didn't run the test for long enough? > > > Here's a copy of Dave Jones's original report [1] on this very same bug > > in linux-next on Feb 19, __6 days before__ I even submitted the series > > that fixes this bug. > > > > Note that the faulting instruction is __identical__ to Dave's most > > recent report on 3.9-rc4: > > Well bah. Why wasn't I told this (sufficiently clearly for it to sink in)? > > > My recommendation is to either: > > 1) apply my entire 'ipc MSG_COPY fixes' series > > --or-- > > 2) revert the entire ipc MSG_COPY implementation that introduced this > > bug to begin with. > > urgh. I really don't want to merge a pile of patches, one of which we > think fixes a bug which we don't understand for reasons we don't > understand. > > Not only does this generally suck, but it also creates a nightmare for > maintainers of 3.8.x kernels - what patch should they merge to fix that > bug? (I didn't think MSG_COPY was in 3.8 because the first time we were hitting this was in linux-next in Feb). > I'm about to disappear for 1.5 weeks. Stanislav, someone, please let's > get this sorted out! I just figured out how the queue is being corrupted and why my series fixes it. With MSG_COPY set, the queue scan can exit with the local variable 'msg' pointing to a real msg if the msg_counter never reaches the copy_number. The remaining execution looks like this: if (!IS_ERR(msg)) { .... if (msgflg & MSG_COPY) goto out_unlock; .... out_unlock: msg_unlock(msq); break; } } if (IS_ERR(msg)) .... bufsz = msg_handler(); free_msg(msg); <<---- msg never unlinked Since the msg should not have been found (because it failed the match criteria), the if (!IS_ERR(msg)) clause should never have executed. That's why my refactor fixes resolve this; because msg is not inadvertently treated as a found msg. But let's be honest; the real bug here is the poor structure of this function that even made this possible. The deepest nesting executes a goto to a label in the middle of an if clause. Yuck! No wonder this thing's fragile. So my recommendation still stands. The series that fixes this has been getting tested in linux-next for a month. Fixing this some other way is just asking for more trouble. But why not just revert MSG_COPY altogether for 3.9? Regards, Peter Hurley -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html