On Fri, Jan 09, 2009 at 01:40:47PM +1100, Greg Banks wrote:
> J. Bruce Fields wrote:
> > Appended below if it's of
> > any use to you to compare.  (Happy to apply either your patch or mine
> > depending which looks better; I haven't tried to compare carefully.)
> >
>
> Ok, I'll take a look at yours.

Thanks for doing this, by the way.

> >  static ssize_t
> >  cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
> >  {
> > -	struct cache_reader *rp = filp->private_data;
> > -	struct cache_request *rq;
> > +	struct cache_request *rq = filp->private_data;
> >  	struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;
> > +	struct list_head *queue = &cd->queue;
> >  	int err;
> >
> >  	if (count == 0)
> > @@ -711,60 +703,45 @@ cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
> >  	mutex_lock(&queue_io_mutex); /* protect against multiple concurrent
> >  				      * readers on this file */
>
> Ah, so you still have a single global lock which is serialising all
> reads and writes to all caches.

Yes, making this per-cd seems sensible (though if the problem is
typically a single cache (auth_unix) then I don't know how significant a
help it is).

> Also, I think you'd want to sample filp->private_data after taking
> queue_io_mutex.

Whoops, yes.  (There's a sketch of both changes at the end of this mail.)

> > +	if (rq == NULL) {
> >  		mutex_unlock(&queue_io_mutex);
> > -		BUG_ON(rp->offset);
>
> Good riddance to that BUG_ON().
>
> > -		return 0;
> > +		return -EAGAIN;
> >  	}
> > -	rq = container_of(rp->q.list.next, struct cache_request, q.list);
> > -	BUG_ON(rq->q.reader);
> > -	if (rp->offset == 0)
> > -		rq->readers++;
> > -	spin_unlock(&queue_lock);
> >
> > -	if (rp->offset == 0 && !test_bit(CACHE_PENDING, &rq->item->flags)) {
> > +	if (!test_bit(CACHE_PENDING, &rq->item->flags))
> >  		err = -EAGAIN;
> > -		spin_lock(&queue_lock);
> > -		list_move(&rp->q.list, &rq->q.list);
> > -		spin_unlock(&queue_lock);
> > -	} else {
> > -		if (rp->offset + count > rq->len)
> > -			count = rq->len - rp->offset;
> > +	else {
> > +		if (rq->offset + count > rq->len)
> > +			count = rq->len - rq->offset;
> >  		err = -EFAULT;
> > -		if (copy_to_user(buf, rq->buf + rp->offset, count))
> > +		if (copy_to_user(buf, rq->buf + rq->offset, count))
> >  			goto out;
>
> Ok, so you try to keep partial read support but move the offset into the
> upcall request itself.  Interesting idea.
>
> I think partial reads are Just Too Hard to do properly, i.e. without
> risk of racy message corruption under all combinations of message size
> and userspace behaviour.  In particular I think your code will corrupt
> upcall data if multiple userspace threads race to do partial reads on
> the same struct file (as rpc.mountd is doing at SGI's customer sites).

Yes, but what mountd's doing is just dumb, as far as I can tell; is
there any real reason not to just keep a separate open for each thread?

If we just tell userland to keep a number of opens equal to the number
of concurrent upcalls it wants to handle, then all of this becomes very
easy (see the second sketch at the end of this mail).  Forget sharing a
single struct file between tasks that do too-small reads: we should
make sure that we don't oops, but that's all.

> For the same reasons I think the FIONREAD support is utterly pointless
> (even though I preserved it).
>
> But I still don't understand how this 100K message size number for gssd
> can happen?  If that's really necessary then the design equation changes
> considerably.  This seems to be the most significant difference between
> our patches.
So, the somewhat depressing situation with spkm3, which was to be the
public-key-based gss mechanism for nfs: we (citi) implemented it (modulo
this problem and maybe one or two others), but found some problems along
the way that required revising the spec.  I think there might have been
an implementation for one other platform too.  But the ietf seems to
have given up on spkm3, and instead is likely to pursue a new mechanism,
pku2u.  Nobody's implementing that yet.

Consulting my local expert, it sounds like pku2u will have some more
reasonable bound on init-sec-ctx token sizes.  (Not sure if it'll be
under a page, without checking, but it shouldn't be much more than that,
anyway.)

So: the immediate pressure for larger upcalls is probably gone.  And
anyway more changes would be required (the ascii-hex encoding and
upcall-downcall matching are very wasteful in this case, etc.), so we'd
likely end up using a different mechanism.

That said, I think it's easy enough to handle just the case of multiple
reads on unshared struct files that it might make sense to keep that
piece.

> A smaller issue is that you keep a single list and use the value of the
> CACHE_PENDING bit to tell the difference between states.  I think this
> could be made to work; however my technique of using two lists makes
> most of the code even simpler at the small cost of having to do two list
> searches in queue_loose().

OK.  When exactly do they get moved between lists?  I'll take a look at
the code.

--b.
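
p.s.  To be concrete about the locking: something like the below is
roughly what I had in mind for the per-cd mutex, with private_data only
sampled once the mutex is held.  Completely untested sketch -- the
queue_io_mutex field in cache_detail is made up here, and the rest of
cache_read() is elided:

struct cache_detail {
	...
	struct list_head	queue;
	struct mutex		queue_io_mutex;	/* per-cache, replaces the global one */
	...
};

static ssize_t
cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
{
	struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;
	struct cache_request *rq;

	if (count == 0)
		return 0;

	/* serializes readers of this cache only, not every cache: */
	mutex_lock(&cd->queue_io_mutex);
	/* only look at private_data once we hold the mutex: */
	rq = filp->private_data;
	if (rq == NULL) {
		mutex_unlock(&cd->queue_io_mutex);
		return -EAGAIN;
	}
	...
}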
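
p.p.s.  And in case "a separate open for each thread" wasn't clear, all
I'm suggesting on the userspace side is something like this (also just a
sketch: the channel path is only an example, and handle_request() is a
stand-in for mountd's real parse-and-reply code):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* stand-in for whatever parses the request and write()s the reply: */
extern void handle_request(int fd, const char *buf, size_t len);

/* each of mountd's threads runs this with its own fd, so no two threads
 * ever share a struct file (or a partial-read offset) in the kernel: */
void *upcall_worker(void *unused)
{
	char buf[8192];
	int fd = open("/proc/net/rpc/auth.unix.ip/channel", O_RDWR);
	ssize_t n;

	(void)unused;
	if (fd < 0)
		return NULL;
	for (;;) {
		n = read(fd, buf, sizeof(buf));	/* one whole request per read */
		if (n <= 0) {
			if (n < 0 && (errno == EAGAIN || errno == EINTR))
				continue;
			break;
		}
		handle_request(fd, buf, n);	/* reply goes back on the same fd */
	}
	close(fd);
	return NULL;
}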