RE: [PATCH v2 0/6] Force pipes to flush immediately on NonStop platform

"Randall S. Becker" <rsbecker@xxxxxxxxxxxxx> · Tue, 23 Jan 2018 15:46:18 -0500

On January 23, 2018 1:13 PM, Junio C Hamano wrote:
> "Randall S. Becker" <rsbecker@xxxxxxxxxxxxx> writes:
> 
> >> IOW, I do not see it explained clearly why this change is needed on
> >> any single platform---so "that issue may be shared by others, too"
> >> is a bit premature thing for me to listen to and understand, as "that
> >> issue" is quite unclear to me.
> >
> > v4 might be a little better. The issue seems to be specific to NonStop
> > that it's PIPE mechanism needs to have setbuf(pipe,NULL) called for
> > git to be happy.  The default behaviour appears to be different on
> > NonStop from other platforms from our testing. We get hung up waiting
> > on pipes unless this is done.
> 
> I am afraid that that is not a "diagnosis" enough to allow us moving forward.
> We get hung up because...?  When the process that has the other end of
> pipe open exits, NonStop does not close the pipe properly?  Or NonStop
> does not flush the data buffered in the pipe?
> Would it help if a compat wrapper that does setbuf(fd, NULL) immediately
> before closing the fd, or some other more targetted mechanism, is used only
> on NonStop, for example?  Potentially megabytes of data can pass thru a
> pipe, and if the platform bug affects only at the tail end of the transfer,
> marking the pipe not to buffer at all at the beginning is too big a hammer to
> work it around.  With the explanation given so far, this still smells more like
> "we have futzed around without understanding why, and this happens to
> work."  It may be good enough for your purpose of making progress (after
> all, I'd imagine that you'd need to work this around one way or another to
> hunt for and fix more issues on the platform), but it does not sound like "we
> know what the problem is, and this is the best workaround for that", which is
> what we want if it wants to become part of the official codebase.

I am retesting without setbuf(NULL), but this is unlikely to be enlightening. The test cases do not adequately simulate the configuration in which my team originally encountered the problem, which requires a guarantee that the source and destination come through different logical CPUs. We never encountered the issue in the test suite, only when end users got hold of it. We had two distinct problems: one was the revent=0 related hang (already solved), and the other was a stream flush problem. The two are related but distinct. The platform support group insisted that we should have the setbuf(NULL) in place for interprocess communication in the form used here. I'm worried about losing this, but I am completely aware that it is far too heavy for other platforms (hence the __TANDEM guard in wrapper.c). If the form of the wrapper should be different, I would be happy to comply.

I have a much longer explanation about the platform message stack structure, but that doesn't belong here. Happy to respond privately if requested.

Cheers,
Randall

-- Brief whoami:
 NonStop developer since approximately 211288444200000000
 UNIX developer since approximately 421664400
-- In my real life, I talk too much.