Re: pread() over NFS (again) [1.5.5.4]

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Sun, 29 Jun 2008 20:32:03 -0400

Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> wrote:
> On Thu, 2008-06-26 at 22:57 -0400, J. Bruce Fields wrote:
> > On Thu, Jun 26, 2008 at 04:38:40PM -0700, Junio C Hamano wrote:
> > > logank@xxxxxxxx writes:
> > > 
> > > > On Jun 26, 2008, at 1:56 PM, Junio C Hamano wrote:
> > > >
> > > >>> "The file shouldn't be short unless someone truncated it, or there
> > > >>> is a bug in index-pack.  Neither is very likely, but I don't think
> > > >>> we would want to retry pread'ing the same block forever.
> > > >>
> > > >> I don't think we would want to retry even once.  Return value of 0
> > > >> from
> > > >> pread is defined to be an EOF, isn't it?
> > > >
> > > > No, it seems to be a simple error-out in this case. We have 2.4.20
> > > > systems with nfs-utils 0.3.3 and used to frequently get the same error
> > > > while pushing. I made a similar change back in February and haven't
> > > > had a problem since:
> > > >
> > > > diff --git a/index-pack.c b/index-pack.c
> > > > index 5ac91ba..85c8bdb 100644
> > > > --- a/index-pack.c
> > > > +++ b/index-pack.c
> > > > @@ -313,7 +313,14 @@ static void *get_data_from_pack(struct
> > > > object_entry *obj)
> > > > 	src = xmalloc(len);
> > > > 	data = src;
> > > > 	do {
> > > > +		// It appears that if multiple threads read across NFS, the
> > > > +		// second read will fail. I know this is awful, but we wait for
> > > > +		// a little bit and try again.
> > > > 		ssize_t n = pread(pack_fd, data + rdy, len - rdy, from + rdy);
> > > > +		if (n <= 0) {
> > > > +			sleep(1);
> > > > +			n = pread(pack_fd, data + rdy, len - rdy, from + rdy);
> > > > +		}
> > > > 		if (n <= 0)
> > > > 			die("cannot pread pack file: %s", strerror(errno));
> > > > 		rdy += n;
> > > >
> > > > I use a sleep request since it seems less likely that the other thread
> > > > will have an outstanding request after a second of waiting.
> > > 
> > > Gaah.  Don't we have NFS experts in house?  Bruce, perhaps?
> > 
> > Trond, you don't have any idea why a 2.6.9-42.0.8.ELsmp client (2.4.28
> > server) might be returning spurious 0's from pread()?
> > 
> > Seems like everything is happening from that one client--the file isn't
> > being simultaneously accessed from the server or from another client.
> 
> Is the file only being read, or could there be a simultaneous write to
> the same file? I'm surmising this could be an effect resulting from
> simultaneous cache invalidations: prior to Linux 2.6.20 or so, we
> weren't rigorously following the VFS/VM rules for page locking, and so
> page cache invalidation in particular could have some curious
> side-effects.

The file was created and opened O_CREAT|O_EXCL|O_RDWR, by this
process, written linearly using write(2), without any lseeks.
We kept the file descriptor open and starting issuing pread(2)
calls for earlier offsets we had alread written.  One of those
kicks back EOF far too early (and results in this bug report).

Note the only accesses we are using is write(2) and pread(2), and
once we start reading we don't ever go back to writing.  The pread(2)
calls are typically issued in ascending offsets, and we read each
position only once.  This is to try and take advantage of any
read-ahead the kernel may be able to do.  The pread(2) calls are
rarely (if ever) on a block/page boundary.

Nobody else should know about this file. Its written to a temporary
name and no other well behaved processes would attempt to read
the file until it gets closed and renamed to its final destination.
We haven't reached that far in the processing when we get this error,
so there should be only one file descriptor open on the inode, and
its the same one that wrote the data.

-- 
Shawn.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html