----- Original Message -----
> From: "Dave Jones" <davej@xxxxxxxxxxxxxxxxx>
> To: "Jan Stancek" <jstancek@xxxxxxxxxx>
> Cc: trinity@xxxxxxxxxxxxxxx
> Sent: Friday, 9 September, 2016 3:32:36 PM
> Subject: Re: [bug] child processes stall forever and don't get killed
>
> On Fri, Sep 09, 2016 at 06:30:16AM -0400, Jan Stancek wrote:
> > Hi,
> >
> > I'm running v1.6-643-gecea2b06d5f3 on RHEL7.3 and I'm seeing an issue
> > where all child processes stall and none of them is getting killed.
> > They are usually in syscalls like read, recv, nanosleep, etc.
> >
> > I suspect this commit introduced the problem, because any syscall
> > that has started but not completed is now considered to "make progress":
> >
> > commit ecf6dfd83d4c886d78d4605163cb8c3f1728db62
> > Author: Dave Jones <davej@xxxxxxxxxxxxxxxxx>
> > Date: Fri Aug 12 15:05:01 2016 -0400
> >
> >     if we haven't done a syscall yet, treat child as "making progress".
> >
> >     Chances are that we haven't been scheduled because some other
> >     children are hogging the cpu.
> >
> > I'm seeing more or less the opposite of what the commit above says.
> > Most CPUs are idle, because N-1 children are stuck in recv/read/...
> > and the last child manages to keep going. Then by chance it also hits
> > a syscall that doesn't complete and the system stays idle
> > (after ~1 hour I gave up waiting).
>
> Need to think some more on this, but as a quick guess...
> try replacing the <= BEFORE with < BEFORE

I've started a new test with the commit above reverted, and that looks
good so far: no stalls after 1 hour. Previously it stalled after ~20-30
minutes. I noticed the stalls when the syscall stat messages (those which
show the number of iterations) stopped appearing.

>
> I'll try and find some time to look into this soon. I'm surprised I
> haven't also seen it happen though. How many CPUs & how many child
> processes ?

Anywhere from 2-8 CPUs and 8-32 children, on x86_64, ppc64le and s390x
systems (RHEL7.3 Beta). It usually happened within 20-30 minutes.

Regards,
Jan

>
> Dave