On Wed, Aug 10, 2022 at 07:39:34AM +0200, René Scharfe wrote:

> > So it's weird that you'd see EAGAIN in this instance. Either the
> > underlying write() is refusing to do a partial write (and just returning
> > an error with EAGAIN in the first place), or the poll emulation is wrong
> > (telling us the descriptor is ready for writing when it isn't).
>
> You're right, Windows' write needs two corrections. The helper below
> reports what happens when we feed a pipe with writes of different sizes.
> On Debian on WSL 2 (Windows Subsystem for Linux) it says:
> [...]

Thanks for digging into this further. What you found makes sense to me
and explains what we're seeing.

> The two corrections mentioned above together with the enable_nonblock()
> implementation for Windows (and the removal of "false") suffice to let
> t3701 pass when started directly, but it still hangs when running the
> whole test suite using prove.

Interesting. I wish there was an easy way for me to poke at this, too. I
tried installing the Git for Windows SDK under wine, but unsurprisingly
it did not get very far. Possibly I could try connecting to a running CI
instance, but the test did not seem to fail there! (Plus I'd have to
figure out how to do that... ;) ).

> I don't have time to investigate right now, but I still don't
> understand how xwrite() can possibly work against a non-blocking pipe.
> It loops on EAGAIN, which is bad if the only way forward is to read
> from a different fd to allow the other process to drain the pipe
> buffer so that xwrite() can write again. I suspect pump_io_round()
> must not use xwrite() and should instead handle EAGAIN by skipping to
> the next fd.

Right, it's susceptible to looping forever in such a case. _But_ a
blocking write is likewise susceptible to blocking forever. In either
case, we're relying on the reading side to pull some bytes out of the
pipe so we can make forward progress.

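To make that dependence on the reader concrete, here is a minimal POSIX
sketch. It is not code from Git: the helper names set_nonblock(),
writable_now(), fill_pipe() and drain_pipe() are made up for this
illustration, and it assumes a conforming write()/poll() pair, which is
exactly what the Windows emulation under discussion does not provide.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode (0 on success). */
static int set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* 1 if poll() says fd is writable right now, 0 if not, -1 on error. */
static int writable_now(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int ret = poll(&pfd, 1, 0); /* timeout 0: just ask, never wait */

	if (ret < 0)
		return -1;
	return ret > 0 && (pfd.revents & POLLOUT);
}

/*
 * Write to a non-blocking fd until the kernel refuses with EAGAIN.
 * Returns the number of bytes accepted, or -1 on an unexpected error.
 */
static long fill_pipe(int fd)
{
	char buf[4096] = { 0 };
	long total = 0;

	for (;;) {
		ssize_t n = write(fd, buf, sizeof(buf));

		if (n > 0)
			total += n;
		else if (n < 0 && errno == EINTR)
			continue;
		else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
			return total; /* full: only a reader can help now */
		else
			return -1;
	}
}

/* Read and discard up to len bytes; returns how many were drained. */
static long drain_pipe(int fd, long len)
{
	char buf[4096];
	long total = 0;

	assert(len >= 0);
	while (total < len) {
		size_t want = sizeof(buf);
		ssize_t n;

		if ((long)want > len - total)
			want = (size_t)(len - total);
		n = read(fd, buf, want);
		if (n > 0)
			total += n;
		else if (n < 0 && errno == EINTR)
			continue;
		else
			break; /* EOF or a real error: stop early */
	}
	return total;
}
```

With a pipe() whose write end has been through set_nonblock(), the
expected sequence is: fill_pipe() returns once write() reports EAGAIN,
writable_now() on the write end is then 0, and it becomes 1 again only
after drain_pipe() has pulled bytes out of the read end. A caller that
retries write() on EAGAIN without anybody draining the other side spins
at that middle step; a blocking writer would sleep there instead.
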
The key thing is that pump_io() is careful never to initiate a write()
unless poll() has just told us that the descriptor is ready for writing.

If something unexpected happens there (i.e., the descriptor is not
really ready), a blocking descriptor is going to be stuck. And with
xwrite(), we're similarly stuck (just looping instead of blocking).

Without xwrite(), a non-blocking one _could_ be better off, because that
EAGAIN would make it up to pump_io(). But what is it supposed to do? I
guess it could go back into its main loop and hope that whatever bug
caused the mismatch between poll() and write() goes away.

But even that would not have fixed the problem here on Windows. From my
understanding, mingw_write() in this case would never write _any_ bytes.
So we'd never make forward progress, and just loop writing 0 bytes and
returning EAGAIN over and over.

So I dunno. We could try to be a bit more defensive about non-blocking
descriptors by avoiding xwrite() in this instance, but it only helps for
a particular class of weird OS behavior/bugs. I'd prefer to see a real
case that it would help before moving in that direction.

-Peff