J. Bruce Fields wrote:
> On Fri, Jan 09, 2009 at 01:40:47PM +1100, Greg Banks wrote:
>> J. Bruce Fields wrote:
>>>  static ssize_t
>>>  cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
>>>  {
>>> -	struct cache_reader *rp = filp->private_data;
>>> -	struct cache_request *rq;
>>> +	struct cache_request *rq = filp->private_data;
>>>  	struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;
>>> +	struct list_head *queue = &cd->queue;
>>>  	int err;
>>>
>>>  	if (count == 0)
>>> @@ -711,60 +703,45 @@ cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
>>>  	mutex_lock(&queue_io_mutex); /* protect against multiple concurrent
>>>  				      * readers on this file */
>>>
>> Ah, so you still have a single global lock which is serialising all
>> reads and writes to all caches.
>
> Yes, making this per-cd seems sensible (though if the problem is
> typically a single cache (auth_unix) then I don't know how significant
> a help it is).

The usual pattern of traffic I see (on a SLES10 platform with the older
set of export caches) when the first NFS packet arrives during a client
mount is:

 - upcall to mountd via the auth.unix.ip cache
 - mountd writes pre-emptively to the nfsd.export and nfsd.expkey caches
 - mountd writes its reply to the auth.unix.ip cache

So it's not just a single cache.  However, I have no measurements of
any performance improvement.  Based on earlier experience I believe the
elapsed mount time to be dominated by the latency of the forward and
reverse DNS lookups that mountd does, so the improvement is probably
small.

>> I think partial reads are Just Too Hard to do properly, i.e. without
>> risk of racy message corruption under all combinations of message size
>> and userspace behaviour.  In particular I think your code will corrupt
>> upcall data if multiple userspace threads race to do partial reads on
>> the same struct file (as rpc.mountd is doing at SGI's customer sites).
>
> Yes, but what mountd's doing is just dumb, as far as I can tell; is
> there any real reason not to just keep a separate open for each thread?

None at all.  The current rpc.mountd behaviour is a historical accident
of the "look, we can put a fork() here and everything will Just Work"
variety.  I was hoping to avoid changes to the current userspace
behaviour to limit deployment hassles with shipping a fix, but
ultimately it can be changed.

> If we just tell userland to keep a number of opens equal to the number
> of concurrent upcalls it wants to handle, then all of this becomes
> very easy.

If we put that requirement on userspace, and partial reads are still
necessary, then your approach of using filp->private_data as a parking
spot for requests currently being read would be the right way to go
(there's a rough sketch of what I mean further down).

> Forget sharing a single struct file between tasks that do too-small
> reads: we should make sure that we don't oops, but that's all.

We should definitely not oops :-)  Consistently delivering correct
messages to userspace would be nice too.

> So, the somewhat depressing situation with spkm3, which was to be the
> public-key-based gss mechanism for nfs: we (citi) implemented it (modulo
> this problem and maybe one or two others), but found some problems along
> the way that required revising the spec.

This is frequently the way with specs :-/

> [...]
> So: the immediate pressure for larger upcalls is probably gone.

Sweet.
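
To make the parking-spot idea concrete, here's roughly what I picture:
a per-cd mutex instead of the global queue_io_mutex, with
filp->private_data holding whichever request a partially-completed read
is still working through.  This is an untested sketch only; the io_mutex
field and the *ppos offset handling are illustrative, not taken from
your patch.

static ssize_t
cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
{
	struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;
	struct cache_request *rq;
	ssize_t n;

	if (count == 0)
		return 0;

	mutex_lock(&cd->io_mutex);	/* per-cd, not the global mutex */

	rq = filp->private_data;	/* partially-read request, if any */
	if (rq == NULL) {
		if (list_empty(&cd->queue)) {
			mutex_unlock(&cd->io_mutex);
			return 0;	/* the real code would block here */
		}
		rq = list_first_entry(&cd->queue, struct cache_request, q.list);
		filp->private_data = rq;	/* park it for later reads */
	}

	n = min_t(size_t, count, rq->len - *ppos);
	if (copy_to_user(buf, rq->buf + *ppos, n)) {
		mutex_unlock(&cd->io_mutex);
		return -EFAULT;
	}
	*ppos += n;
	if (*ppos >= rq->len) {
		filp->private_data = NULL;	/* fully consumed: unpark */
		/* ... dequeue rq and release it ... */
	}

	mutex_unlock(&cd->io_mutex);
	return n;
}

The property I care about is that the parked request is never shared
between opens, so a slow or partial reader on one struct file can't
corrupt the stream of messages seen by another.
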
> That said, I think it's easy enough to handle just the case of multiple
> reads on unshared struct files that it might make sense to keep that
> piece.
>
>> A smaller issue is that you keep a single list and use the value of the
>> CACHE_PENDING bit to tell the difference between states. [...]
>
> OK.  When exactly do they get moved between lists?

Immediately after being copied out to userspace in cache_read(); see the
P.S. for a sketch.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.
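
P.S.  In code form, the two-list arrangement would look something like
the untested sketch below.  The list and helper names (io_mutex,
to_read, to_write, and the two functions) are mine for illustration,
they're not from any posted patch.

struct cache_detail {
	/* ... existing fields ... */
	struct mutex		io_mutex;
	struct list_head	to_read;	/* queued, not yet read by the daemon */
	struct list_head	to_write;	/* read out, awaiting the daemon's reply */
};

/*
 * Called from cache_read(), with cd->io_mutex held, immediately after
 * copy_to_user() has pushed the whole request up to the daemon.
 */
static void cache_request_read(struct cache_detail *cd, struct cache_request *rq)
{
	list_move_tail(&rq->q.list, &cd->to_write);
}

/*
 * Called when the daemon writes its reply: the request is looked up on
 * cd->to_write (never on cd->to_read), dequeued and freed.
 */
static void cache_request_answered(struct cache_detail *cd, struct cache_request *rq)
{
	list_del(&rq->q.list);
	kfree(rq->buf);
	kfree(rq);
}

With that split, which list a request sits on tells you its state
directly, instead of inferring it from the CACHE_PENDING bit.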