On Mon, Feb 25, 2013 at 4:01 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote: > On Fri, Feb 22, 2013 at 8:31 PM, Yan, Zheng <zheng.z.yan@xxxxxxxxx> wrote: >> On 02/23/2013 02:54 AM, Gregory Farnum wrote: >>> I haven't spent that much time in the kernel client, but this patch >>> isn't working out for me. In particular, I'm pretty sure we need to >>> preserve this: >>> >>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c >>>> index 5d5c32b..b9d8417 100644 >>>> --- a/fs/ceph/caps.c >>>> +++ b/fs/ceph/caps.c >>>> @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want, >>>> } >>>> have = __ceph_caps_issued(ci, &implemented); >>>> >>>> - /* >>>> - * disallow writes while a truncate is pending >>>> - */ >>>> - if (ci->i_truncate_pending) >>>> - have &= ~CEPH_CAP_FILE_WR; >>>> - >>>> if ((have & need) == need) { >>>> /* >>>> * Look at (implemented & ~have & not) so that we keep waiting >>> >>> Because if there's a pending truncate, we really can't write. You do >>> handle it in the particular case of doing buffered file writes, but >>> these caps are a marker of permission, and the client shouldn't have >>> write permission to a file until it's up to date on the truncates. Or >>> am I misunderstanding something? >> >> pending vmtruncate is only relevant to buffered write case. If client doesn't have 'b' cap, >> the page cache is empty, and __ceph_do_pending_vmtruncate is no-op. For buffered write, >> this patch only affects situation that clients receives truncate request from MDS in the >> middle of write. > It looks to me like truncates can get queued for later, so that's not the case? > And how could the client receive a truncate while in the middle of > writing? Either it's got the write caps (in which case nobody else can > truncate), or it shouldn't be writing...I suppose if somebody enables > lazyIO that would do it, but again in that case I imagine we'd rather > do the truncate before writing: > >> I think it's better to do vmtruncate after write finishes. > What if you're writing past the truncate point? Then your written data > would be truncated off even though it should have been a file > extension. > > >>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c >>>> index a1e5b81..bf7849a 100644 >>>> --- a/fs/ceph/file.c >>>> +++ b/fs/ceph/file.c >>>> @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov, >>>> dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n", >>>> inode, ceph_vinop(inode), pos, (unsigned)len, inode); >>>> again: >>>> - __ceph_do_pending_vmtruncate(inode); >>> >>> There doesn't seem to be any kind of replacement for this? We need to >>> do any pending truncates before reading or we might get stale data >>> back. >> >> generic_file_aio_read checks i_size when coping data to user buffer, so the user program can't >> get stale data. This __ceph_do_pending_vmtruncate is not protected by i_mutex, it's a potential >> bug, that's the reason I remove it. > > Oh, you're right about the locking (I think, anyway?). However, i_size > doesn't protect us — we might for instance have truncated to zero and > then back up to the original file size. :( But that doesn't look like > it's handled correctly anyway (racy) — am I misreading that badly or > is the infrastructure in fact broken in that case? Sage pointed out in standup today that we only send a truncate message for truncate downs, and he thinks that the more complicated cases (truncate to zero, write out 2KB, truncate to 1KB) are okay thanks to the capabilities bits. I haven't thought it through but that seems plausible to me. -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html