Re: [PATCH 3/3] ceph: fix vmtruncate deadlock

Gregory Farnum <greg@xxxxxxxxxxx> · Mon, 25 Feb 2013 16:01:12 -0800

On Fri, Feb 22, 2013 at 8:31 PM, Yan, Zheng <zheng.z.yan@xxxxxxxxx> wrote:
> On 02/23/2013 02:54 AM, Gregory Farnum wrote:
>> I haven't spent that much time in the kernel client, but this patch
>> isn't working out for me. In particular, I'm pretty sure we need to
>> preserve this:
>>
>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>> index 5d5c32b..b9d8417 100644
>>> --- a/fs/ceph/caps.c
>>> +++ b/fs/ceph/caps.c
>>> @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>>>         }
>>>         have = __ceph_caps_issued(ci, &implemented);
>>>
>>> -       /*
>>> -        * disallow writes while a truncate is pending
>>> -        */
>>> -       if (ci->i_truncate_pending)
>>> -               have &= ~CEPH_CAP_FILE_WR;
>>> -
>>>         if ((have & need) == need) {
>>>                 /*
>>>                  * Look at (implemented & ~have & not) so that we keep waiting
>>
>> Because if there's a pending truncate, we really can't write. You do
>> handle it in the particular case of doing buffered file writes, but
>> these caps are a marker of permission, and the client shouldn't have
>> write permission to a file until it's up to date on the truncates. Or
>> am I misunderstanding something?
>
> pending vmtruncate is only relevant to buffered write case. If client doesn't have 'b' cap,
> the page cache is empty, and __ceph_do_pending_vmtruncate is no-op. For buffered write,
> this patch only affects situation that clients receives truncate request from MDS in the
> middle of write.
It looks to me like truncates can get queued for later, so that's not the case?
And how could the client receive a truncate while in the middle of
writing? Either it's got the write caps (in which case nobody else can
truncate), or it shouldn't be writing...I suppose if somebody enables
lazyIO that would do it, but again in that case I imagine we'd rather
do the truncate before writing:

> I think it's better to do vmtruncate after write finishes.
What if you're writing past the truncate point? Then your written data
would be truncated off even though it should have been a file
extension.

>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index a1e5b81..bf7849a 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
>>>         dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
>>>              inode, ceph_vinop(inode), pos, (unsigned)len, inode);
>>>  again:
>>> -       __ceph_do_pending_vmtruncate(inode);
>>
>> There doesn't seem to be any kind of replacement for this? We need to
>> do any pending truncates before reading or we might get stale data
>> back.
>
> generic_file_aio_read checks i_size when coping data to user buffer, so the user program can't
> get stale data. This __ceph_do_pending_vmtruncate is not protected by i_mutex, it's a potential
> bug, that's the reason I remove it.

Oh, you're right about the locking (I think, anyway?). However, i_size
doesn't protect us — we might for instance have truncated to zero and
then back up to the original file size. :( But that doesn't look like
it's handled correctly anyway (racy) — am I misreading that badly or
is the infrastructure in fact broken in that case?

As I said, I don't spend much time in the kernel client, but I looked
around a bit more and this is what I came up with. Education, if
nothing else. ;)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html