Re: [PATCH 3/3] ceph: fix vmtruncate deadlock

Gregory Farnum <greg@xxxxxxxxxxx> · Tue, 26 Feb 2013 10:19:26 -0800



On Mon, Feb 25, 2013 at 4:01 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Fri, Feb 22, 2013 at 8:31 PM, Yan, Zheng <zheng.z.yan@xxxxxxxxx> wrote:
>> On 02/23/2013 02:54 AM, Gregory Farnum wrote:
>>> I haven't spent that much time in the kernel client, but this patch
>>> isn't working out for me. In particular, I'm pretty sure we need to
>>> preserve this:
>>>
>>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>>> index 5d5c32b..b9d8417 100644
>>>> --- a/fs/ceph/caps.c
>>>> +++ b/fs/ceph/caps.c
>>>> @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>>>>         }
>>>>         have = __ceph_caps_issued(ci, &implemented);
>>>>
>>>> -       /*
>>>> -        * disallow writes while a truncate is pending
>>>> -        */
>>>> -       if (ci->i_truncate_pending)
>>>> -               have &= ~CEPH_CAP_FILE_WR;
>>>> -
>>>>         if ((have & need) == need) {
>>>>                 /*
>>>>                  * Look at (implemented & ~have & not) so that we keep waiting
>>>
>>> Because if there's a pending truncate, we really can't write. You do
>>> handle it in the particular case of doing buffered file writes, but
>>> these caps are a marker of permission, and the client shouldn't have
>>> write permission to a file until it's up to date on the truncates. Or
>>> am I misunderstanding something?
>>
>> pending vmtruncate is only relevant to buffered write case. If client doesn't have 'b' cap,
>> the page cache is empty, and __ceph_do_pending_vmtruncate is no-op. For buffered write,
>> this patch only affects situation that clients receives truncate request from MDS in the
>> middle of write.
> It looks to me like truncates can get queued for later, so that's not the case?
> And how could the client receive a truncate while in the middle of
> writing? Either it's got the write caps (in which case nobody else can
> truncate), or it shouldn't be writing...I suppose if somebody enables
> lazyIO that would do it, but again in that case I imagine we'd rather
> do the truncate before writing:
>
>> I think it's better to do vmtruncate after write finishes.
> What if you're writing past the truncate point? Then your written data
> would be truncated off even though it should have been a file
> extension.
>
>
>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>> index a1e5b81..bf7849a 100644
>>>> --- a/fs/ceph/file.c
>>>> +++ b/fs/ceph/file.c
>>>> @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
>>>>         dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
>>>>              inode, ceph_vinop(inode), pos, (unsigned)len, inode);
>>>>  again:
>>>> -       __ceph_do_pending_vmtruncate(inode);
>>>
>>> There doesn't seem to be any kind of replacement for this? We need to
>>> do any pending truncates before reading or we might get stale data
>>> back.
>>
>> generic_file_aio_read checks i_size when coping data to user buffer, so the user program can't
>> get stale data. This __ceph_do_pending_vmtruncate is not protected by i_mutex, it's a potential
>> bug, that's the reason I remove it.
>
> Oh, you're right about the locking (I think, anyway?). However, i_size
> doesn't protect us — we might for instance have truncated to zero and
> then back up to the original file size. :( But that doesn't look like
> it's handled correctly anyway (racy) — am I misreading that badly or
> is the infrastructure in fact broken in that case?

Sage pointed out in standup today that we only send a truncate message
for truncate downs, and he thinks that the more complicated cases
(truncate to zero, write out 2KB, truncate to 1KB) are okay thanks to
the capabilities bits. I haven't thought it through but that seems
plausible to me.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html