Xiubo Li <xiubli@xxxxxxxxxx> writes:

> On 6/2/23 19:29, Luís Henriques wrote:
>> xiubli@xxxxxxxxxx writes:
>>
>>> From: Xiubo Li <xiubli@xxxxxxxxxx>
>>>
>>> sync_filesystem() will flush all the dirty buffers, submit the osd
>>> reqs to the osdc, and then block to wait for all the reqs to finish.
>>> But when the reqs' replies come, the reqs are removed from the osdc
>>> just before the req->r_callback()s are called, which means
>>> sync_filesystem() may be woken up while the req->r_callback()s are
>>> still running.
>>>
>>> This is buggy when the waiter requires the req->r_callback()s to
>>> release some resources before continuing.  So we need to make sure
>>> the req->r_callback()s are called before removing the reqs from the
>>> osdc.
>>>
>>> WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
>>> CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1
>>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
>>> RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
>>> RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
>>> RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
>>> RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
>>> RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
>>> R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
>>> R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
>>> FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Call Trace:
>>>  <TASK>
>>>  generic_shutdown_super+0x47/0x120
>>>  kill_anon_super+0x14/0x30
>>>  ceph_kill_sb+0x36/0x90 [ceph]
>>>  deactivate_locked_super+0x29/0x60
>>>  cleanup_mnt+0xb8/0x140
>>>  task_work_run+0x67/0xb0
>>>  exit_to_user_mode_prepare+0x23d/0x240
>>>  syscall_exit_to_user_mode+0x25/0x60
>>>  do_syscall_64+0x40/0x80
>>>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> RIP: 0033:0x7fd83dc39e9b
>>>
>>> We need to increase the blocker counter to make sure all the osd
>>> requests' callbacks have finished before calling kill_anon_super()
>>> when unmounting.
>>
>> (Sorry for taking so long to reply to this patch!  And I still have a
>> few others in the queue!)
>>
>> I've been looking at this patch and at patch "ceph: drop the messages
>> from MDS when unmounting", but I still can't say whether they are
>> correct or not.  They seem to be working, but they don't _look_ right.
>>
>> For example, mdsc->stopping is being used by ceph_dec_stopping_blocker()
>> and ceph_inc_stopping_blocker() for setting the ceph_mds_client state,
>> but the old usage of that field (e.g. in ceph_mdsc_pre_umount()) is
>> being kept.  Is that correct?  Maybe, but it looks wrong: the old usage
>> isn't protected by the spinlock and doesn't use the atomic counter.
>
> This is okay: the "stopping_lock" spinlock is not trying to protect
> "stopping" itself, it's trying to protect "stopping_blockers" and the
> order in which "fsc->mdsc->stopping" and "fsc->mdsc->stopping_blockers"
> are accessed.  If you try without it you can reproduce the issue again.
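
OK.  Just to check that I'm reading it correctly, the pattern I have in
mind is roughly the sketch below.  This is my own simplification, not the
patch code: the struct, the function names and the completion are made up
and only loosely modelled on what the patch adds, so take it with a grain
of salt.

#include <linux/types.h>
#include <linux/spinlock.h>
#include <linux/atomic.h>
#include <linux/completion.h>

/*
 * Stand-in for ceph_mds_client; spin_lock_init(), atomic_set(..., 0) and
 * init_completion() are assumed to have run at mount time.
 */
struct blocker_state {
	spinlock_t	stopping_lock;
	int		stopping;		/* set once the umount has started */
	atomic_t	stopping_blockers;
	struct completion stopping_waiter;
};

/* Called from the message/OSD callback paths before grabbing an inode. */
static bool inc_stopping_blocker(struct blocker_state *s)
{
	spin_lock(&s->stopping_lock);
	if (s->stopping) {		/* umount in progress: refuse new work */
		spin_unlock(&s->stopping_lock);
		return false;
	}
	atomic_inc(&s->stopping_blockers);
	spin_unlock(&s->stopping_lock);
	return true;
}

/* Called when the callback is done with the inode. */
static void dec_stopping_blocker(struct blocker_state *s)
{
	spin_lock(&s->stopping_lock);
	if (atomic_dec_and_test(&s->stopping_blockers) && s->stopping)
		complete_all(&s->stopping_waiter);
	spin_unlock(&s->stopping_lock);
}

If that's the idea, then the lock is really only there to make the
"stopping" check and the counter update atomic with respect to each other:
once the kill_sb side sets "stopping" under the lock, no new blockers can
show up, and the last dec wakes the waiter.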
>
>>
>> Another example: in patch "ceph: drop the messages from MDS when
>> unmounting" we're adding calls to ceph_inc_stopping_blocker() in
>> ceph_handle_caps(), ceph_handle_quota(), and ceph_handle_snap().  Are
>> those the only places where this is needed?  Obviously not, because
>> this new patch is adding extra calls in the read/write paths.  But
>> maybe there are more places...?
>
> I have gone through all the related code, and these should be all the
> places that grab the inode and need this.  You can confirm that.
>
>> And another one: changing ceph_inc_stopping_blocker() to accept NULL to
>> distinguish between mds and osd requests makes things look even more
>> hacky :-(
>
> This can be improved.
>
> Let me update it to make it easier to read.
>
>>
>> On the other hand, I've been testing these patches thoroughly and
>> couldn't see any issues with them.  And although I'm still not
>> convinced that they won't deadlock in some corner cases, I don't have a
>> better solution either for the problem they're solving.
>>
>> FWIW you can add my:
>>
>> Tested-by: Luís Henriques <lhenriques@xxxxxxx>
>>
>> to this patch (the other one already has it), but I'll need to spend
>> more time to see if there are better solutions.
>
> Thanks Luis for your test and review.

One final question: would it be better to use wait_for_completion_timeout()
in ceph_kill_sb()?  I'm mostly worried about the read/write paths, of
course.  For example, the complexity of ceph_writepages_start() makes it
easy to introduce a bug where we end up with a ceph_inc_stopping_blocker()
without a matching ceph_dec_stopping_blocker().  On the other hand, adding
a timeout won't really solve anything, as we would still get the fscrypt
warning in the unlikely event of a timeout; but maybe that's more
acceptable than blocking the umount operation forever.  So I'm not really
sure it's worth it.

Cheers,
--
Luís
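
P.S. Just to make the question concrete, the kind of thing I had in mind
for the kill_sb side is sketched below.  Again, this is only a rough
sketch on top of the made-up "struct blocker_state" from above (the
function name and the 60-second timeout are arbitrary choices of mine, not
something taken from the patch):

#include <linux/jiffies.h>
#include <linux/printk.h>

/* kill_sb side: stop new blockers, then wait for the pending ones. */
static void wait_for_stopping_blockers(struct blocker_state *s)
{
	unsigned long timeleft;
	bool wait;

	spin_lock(&s->stopping_lock);
	s->stopping = 1;		/* no new blockers after this point */
	wait = !!atomic_read(&s->stopping_blockers);
	spin_unlock(&s->stopping_lock);

	if (wait) {
		timeleft = wait_for_completion_timeout(&s->stopping_waiter,
						       msecs_to_jiffies(60 * 1000));
		if (!timeleft)
			pr_warn("timed out waiting for pending callbacks\n");
	}
}

With wait_for_completion() instead of wait_for_completion_timeout() we
would block forever if a dec is ever missed; with the timeout we eventually
give up and just hit the fscrypt warning, which is the trade-off I was
trying to describe above.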