Xiubo Li <xiubli@xxxxxxxxxx> writes:

> On 6/2/23 19:29, Luís Henriques wrote:
>> xiubli@xxxxxxxxxx writes:
>>
>>> From: Xiubo Li <xiubli@xxxxxxxxxx>
>>>
>>> sync_filesystem() will flush all the dirty buffers, submit the osd
>>> reqs to the osdc, and then block to wait for all the reqs to finish.
>>> But when the reqs' replies come, the reqs are removed from the osdc
>>> just before the req->r_callback()s are called, which means
>>> sync_filesystem() may be woken up while the req->r_callback()s are
>>> still running.
>>>
>>> This is buggy when the waiter requires the req->r_callback()s to
>>> release some resources before continuing.  So we need to make sure
>>> the req->r_callback()s are called before removing the reqs from the
>>> osdc.
>>>
>>> WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
>>> CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1
>>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
>>> RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
>>> RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
>>> RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
>>> RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
>>> RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
>>> R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
>>> R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
>>> FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Call Trace:
>>>  <TASK>
>>>  generic_shutdown_super+0x47/0x120
>>>  kill_anon_super+0x14/0x30
>>>  ceph_kill_sb+0x36/0x90 [ceph]
>>>  deactivate_locked_super+0x29/0x60
>>>  cleanup_mnt+0xb8/0x140
>>>  task_work_run+0x67/0xb0
>>>  exit_to_user_mode_prepare+0x23d/0x240
>>>  syscall_exit_to_user_mode+0x25/0x60
>>>  do_syscall_64+0x40/0x80
>>>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> RIP: 0033:0x7fd83dc39e9b
>>>
>>> We need to increase the blocker counter to make sure all the osd
>>> requests' callbacks have finished before calling kill_anon_super()
>>> when unmounting.
>>
>> (Sorry for taking so long to reply to this patch!  And I still have a
>> few others in the queue!)
>>
>> I've been looking at this patch and at patch "ceph: drop the messages
>> from MDS when unmounting", but I still can't say whether they are
>> correct or not.  They seem to be working, but they don't _look_ right.
>>
>> For example, mdsc->stopping is being used by ceph_dec_stopping_blocker()
>> and ceph_inc_stopping_blocker() for setting the ceph_mds_client state,
>> but the old usage of that field (e.g. in ceph_mdsc_pre_umount()) is
>> being kept.  Is that correct?  Maybe, but it looks wrong: the old usage
>> isn't protected by the spinlock and doesn't use the atomic counter.
>
> This is okay: the "stopping_lock" spinlock is not trying to protect
> "stopping" itself, it's trying to protect "stopping_blockers" and the
> order in which "fsc->mdsc->stopping" and "fsc->mdsc->stopping_blockers"
> are accessed.  If you try without it you can reproduce the issue again.
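
OK.  Just to check that I'm reading it correctly, the pattern I have in
mind is roughly the sketch below.  This is my own simplification, not the
patch code: the struct, the function names and the completion are made up
and only loosely modelled on what the patch adds, so take it with a grain
of salt.

#include <linux/types.h>
#include <linux/spinlock.h>
#include <linux/atomic.h>
#include <linux/completion.h>

/*
 * Stand-in for ceph_mds_client; spin_lock_init(), atomic_set(..., 0) and
 * init_completion() are assumed to have run at mount time.
 */
struct blocker_state {
	spinlock_t	stopping_lock;
	int		stopping;		/* set once the umount has started */
	atomic_t	stopping_blockers;
	struct completion stopping_waiter;
};

/* Called from the message/OSD callback paths before grabbing an inode. */
static bool inc_stopping_blocker(struct blocker_state *s)
{
	spin_lock(&s->stopping_lock);
	if (s->stopping) {		/* umount in progress: refuse new work */
		spin_unlock(&s->stopping_lock);
		return false;
	}
	atomic_inc(&s->stopping_blockers);
	spin_unlock(&s->stopping_lock);
	return true;
}

/* Called when the callback is done with the inode. */
static void dec_stopping_blocker(struct blocker_state *s)
{
	spin_lock(&s->stopping_lock);
	if (atomic_dec_and_test(&s->stopping_blockers) && s->stopping)
		complete_all(&s->stopping_waiter);
	spin_unlock(&s->stopping_lock);
}

If that's the idea, then the lock is really only there to make the
"stopping" check and the counter update atomic with respect to each other:
once the kill_sb side sets "stopping" under the lock, no new blockers can
show up, and the last dec wakes the waiter.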
>
>>
>> Another example: in patch "ceph: drop the messages from MDS when
>> unmounting" we're adding calls to ceph_inc_stopping_blocker() in
>> ceph_handle_caps(), ceph_handle_quota(), and ceph_handle_snap().  Are
>> those the only places where this is needed?  Obviously not, because
>> this new patch is adding extra calls in the read/write paths.  But
>> maybe there are more places...?
>
> I have gone through all the related code, and these should be all the
> places that grab the inode and need this.  You can confirm that.
>
>> And another one: changing ceph_inc_stopping_blocker() to accept NULL to
>> distinguish between mds and osd requests makes things look even more
>> hacky :-(
>
> This can be improved.
>
> Let me update it to make it easier to read.
>
>>
>> On the other hand, I've been testing these patches thoroughly and
>> couldn't see any issues with them.  And although I'm still not
>> convinced that they won't deadlock in some corner cases, I don't have a
>> better solution either for the problem they're solving.
>>
>> FWIW you can add my:
>>
>> Tested-by: Luís Henriques <lhenriques@xxxxxxx>
>>
>> to this patch (the other one already has it), but I'll need to spend
>> more time to see if there are better solutions.
>
> Thanks Luis for your test and review.

One final question: would it be better to use wait_for_completion_timeout()
in ceph_kill_sb()?  I'm mostly worried about the read/write paths, of
course.  For example, the complexity of ceph_writepages_start() makes it
easy to introduce a bug where we end up with a ceph_inc_stopping_blocker()
without a matching ceph_dec_stopping_blocker().  On the other hand, adding
a timeout won't really solve anything, as we would still get the fscrypt
warning in the unlikely event of a timeout; but maybe that's more
acceptable than blocking the umount operation forever.  So I'm not really
sure it's worth it.

Cheers,
--
Luís
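
P.S. Just to make the question concrete, the kind of thing I had in mind
for the kill_sb side is sketched below.  Again, this is only a rough
sketch on top of the made-up "struct blocker_state" from above (the
function name and the 60-second timeout are arbitrary choices of mine, not
something taken from the patch):

#include <linux/jiffies.h>
#include <linux/printk.h>

/* kill_sb side: stop new blockers, then wait for the pending ones. */
static void wait_for_stopping_blockers(struct blocker_state *s)
{
	unsigned long timeleft;
	bool wait;

	spin_lock(&s->stopping_lock);
	s->stopping = 1;		/* no new blockers after this point */
	wait = !!atomic_read(&s->stopping_blockers);
	spin_unlock(&s->stopping_lock);

	if (wait) {
		timeleft = wait_for_completion_timeout(&s->stopping_waiter,
						       msecs_to_jiffies(60 * 1000));
		if (!timeleft)
			pr_warn("timed out waiting for pending callbacks\n");
	}
}

With wait_for_completion() instead of wait_for_completion_timeout() we
would block forever if a dec is ever missed; with the timeout we eventually
give up and just hit the fscrypt warning, which is the trade-off I was
trying to describe above.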