On Sat, Jan 28, 2023 at 4:12 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
>
>
> On 26/01/2023 22:04, Ilya Dryomov wrote:
> > On Thu, Jan 26, 2023 at 2:03 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> >>
> >> On 24/01/2023 20:32, Luís Henriques wrote:
> >>> On Tue, Jan 24, 2023 at 06:26:46PM +0800, Xiubo Li wrote:
> >>>> On 23/01/2023 18:15, Luís Henriques wrote:
> >>>>> On Sun, Jan 22, 2023 at 09:57:46PM +0800, Xiubo Li wrote:
> >>>>>> Hi Luis,
> >>>>>>
> >>>>>> On 20/01/2023 18:36, Luís Henriques wrote:
> >>>>>>> On Thu, Jan 19, 2023 at 04:27:09PM +0000, Luís Henriques wrote:
> >>>>>>>> On Wed, Dec 21, 2022 at 05:30:31PM +0800, xiubli@xxxxxxxxxx wrote:
> >>>>>>>>> From: Xiubo Li <xiubli@xxxxxxxxxx>
> >>>>>>>>>
> >>>>>>>>> When unmounting it will just wait for the inflight requests to be
> >>>>>>>>> finished, but just before the sessions are closed the kclient could
> >>>>>>>>> still receive the caps/snaps/lease/quota msgs from the MDS. All these
> >>>>>>>>> msgs need to hold some inodes, which will cause ceph_kill_sb() to fail
> >>>>>>>>> to evict the inodes in time.
> >>>>>>>>>
> >>>>>>>>> If encryption is enabled the kernel generates a warning when removing
> >>>>>>>>> the encryption keys while the skipped inodes still hold the keyring:
> >>>>>>>> Finally (sorry for the delay!) I managed to look into the 6.1 rebase. It
> >>>>>>>> does look good, but I started hitting the WARNING added by patch:
> >>>>>>>>
> >>>>>>>>   [DO NOT MERGE] ceph: make sure all the files successfully put before unmounting
> >>>>>>>>
> >>>>>>>> This patch seems to be working but I'm not sure we really need the extra
> >>>>>>> OK, looks like I jumped the gun here: I still see the warning with your
> >>>>>>> patch.
> >>>>>>>
> >>>>>>> I've done a quick hack and the patch below seems to fix it. But again, it
> >>>>>>> will impact performance. I'll see if I can figure out something else.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> --
> >>>>>>> Luís
> >>>>>>>
> >>>>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> >>>>>>> index 2cd134ad02a9..bdb4efa0f9f7 100644
> >>>>>>> --- a/fs/ceph/file.c
> >>>>>>> +++ b/fs/ceph/file.c
> >>>>>>> @@ -2988,6 +2988,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >>>>>>>          return ret;
> >>>>>>>  }
> >>>>>>> +static int ceph_flush(struct file *file, fl_owner_t id)
> >>>>>>> +{
> >>>>>>> +        struct inode *inode = file_inode(file);
> >>>>>>> +        int ret;
> >>>>>>> +
> >>>>>>> +        if ((file->f_mode & FMODE_WRITE) == 0)
> >>>>>>> +                return 0;
> >>>>>>> +
> >>>>>>> +        ret = filemap_write_and_wait(inode->i_mapping);
> >>>>>>> +        if (ret)
> >>>>>>> +                ret = filemap_check_wb_err(file->f_mapping, 0);
> >>>>>>> +
> >>>>>>> +        return ret;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>>  const struct file_operations ceph_file_fops = {
> >>>>>>>          .open = ceph_open,
> >>>>>>>          .release = ceph_release,
> >>>>>>> @@ -3005,4 +3020,5 @@ const struct file_operations ceph_file_fops = {
> >>>>>>>          .compat_ioctl = compat_ptr_ioctl,
> >>>>>>>          .fallocate = ceph_fallocate,
> >>>>>>>          .copy_file_range = ceph_copy_file_range,
> >>>>>>> +        .flush = ceph_flush,
> >>>>>>>  };
> >>>>>>>
> >>>>>> This will only fix the second crash case in
> >>>>>> https://tracker.ceph.com/issues/58126#note-6, but not the case in
> >>>>>> https://tracker.ceph.com/issues/58126#note-5.
> >>>>>>
> >>>>>> This issue could be triggered with "test_dummy_encryption" and with
> >>>>>> xfstests-dev's generic/124. You can have a try.
> >>>>> OK, thanks. I'll give it a try. BTW, my local reproducer was
> >>>>> generic/132, not generic/124. I'll let you know if I find something.
> >>>> Hi Luis,
> >>>>
> >>>> I added some logs and found that when doing the aio_write, it will be
> >>>> split into many aio requests, and when the last req finishes it will call
> >>>> "writepages_finish()", which will iput() the inode and release the last
> >>>> refcount of the inode.
> >>>>
> >>>> But it seems complete_all() is called before req->r_callback() has
> >>>> totally finished:
> >>>>
> >>>> <4>[500400.268200] writepages_finish 0000000060940222 rc 0
> >>>> <4>[500400.268476] writepages_finish 0000000060940222 rc 0   <===== the last osd req->r_callback()
> >>>> <4>[500400.268515] sync_fs (blocking)   <===== unmounting begin
> >>>> <4>[500400.268526] sync_fs (blocking) done
> >>>> <4>[500400.268530] kill_sb after sync_filesystem 00000000a01a1cf4   <=== the sync_filesystem() will be called, I just added it in ceph code but the VFS will call it again in "kill_anon_super()"
> >>>> <4>[500400.268539] ceph_evict_inode:682, dec inode 0000000044f12aa7
> >>>> <4>[500400.268626] sync_fs (blocking)
> >>>> <4>[500400.268631] sync_fs (blocking) done
> >>>> <4>[500400.268634] evict_inodes inode 0000000060940222, i_count = 1, was skipped!   <=== skipped
> >>>> <4>[500400.268642] fscrypt_destroy_keyring: mk 00000000baf04977 mk_active_refs = 2
> >>>> <4>[500400.268651] ceph_evict_inode:682, dec inode 0000000060940222   <==== evict the inode in the req->r_callback()
> >>>>
> >>>> Locally my VM is not working and I couldn't run the test for now. Could
> >>>> you help test the following patch?
> >>> So, running generic/132 in my test environment with your patch applied to
> >>> the 'testing' branch, I still see the WARNING (backtrace pasted below).
> >>> I'll try to help dig a bit more into this issue in the next few days.
> >>>
> >>> Cheers,
> >>> --
> >>> Luís
> >>>
> >>>
> >>> [ 102.713299] ceph: test_dummy_encryption mode enabled
> >>> [ 121.807203] evict_inodes inode 000000000d85998d, i_count = 1, was skipped!
> >>> [ 121.809055] fscrypt_destroy_keyring: mk_active_refs = 2
> >>> [ 121.810439] ------------[ cut here ]------------
> >>> [ 121.811937] WARNING: CPU: 1 PID: 2671 at fs/crypto/keyring.c:244 fscrypt_destroy_keyring+0x109/0x110
> >>> [ 121.814243] CPU: 1 PID: 2671 Comm: umount Not tainted 6.2.0-rc2+ #23
> >>> [ 121.815810] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> >>> [ 121.818588] RIP: 0010:fscrypt_destroy_keyring+0x109/0x110
> >>> [ 121.819925] Code: 00 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 8b 77 38 48 c7 c7 18 40 c0 81 e8 e2 b2 51 00 41 8b 47 38 83 f8 01 74 9e <0f> 0b eb 9a 0f 1f 00 0f0
> >>> [ 121.824469] RSP: 0018:ffffc900039f3e20 EFLAGS: 00010202
> >>> [ 121.825908] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000001
> >>> [ 121.827660] RDX: 0000000000000000 RSI: ffff88842fc9b180 RDI: 00000000ffffffff
> >>> [ 121.829425] RBP: ffff888102a99000 R08: 0000000000000000 R09: c0000000ffffefff
> >>> [ 121.831188] R10: ffffc90000093e00 R11: ffffc900039f3cd8 R12: 0000000000000000
> >>> [ 121.833408] R13: ffff888102a9a948 R14: ffffffff823a66b0 R15: ffff8881048f6c00
> >>> [ 121.835186] FS: 00007f2442206840(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
> >>> [ 121.838205] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [ 121.840386] CR2: 00007f0a289daf70 CR3: 000000011446a000 CR4: 00000000000006a0
> >>> [ 121.842190] Call Trace:
> >>> [ 121.843125]  <TASK>
> >>> [ 121.843945]  generic_shutdown_super+0x42/0x130
> >>> [ 121.845656]  kill_anon_super+0xe/0x30
> >>> [ 121.847081]  ceph_kill_sb+0x3d/0xa0
> >>> [ 121.848418]  deactivate_locked_super+0x34/0x60
> >>> [ 121.850166]  cleanup_mnt+0xb8/0x150
> >>> [ 121.851498]  task_work_run+0x61/0x90
> >>> [ 121.852863]  exit_to_user_mode_prepare+0x147/0x170
> >>> [ 121.854741]  syscall_exit_to_user_mode+0x20/0x40
> >>> [ 121.856489]  do_syscall_64+0x48/0x80
> >>> [ 121.857887]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >>> [ 121.859659] RIP: 0033:0x7f2442444a67
> >>> [ 121.860922] Code: 24 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 018
> >>> [ 121.866942] RSP: 002b:00007ffc56533fc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> >>> [ 121.869305] RAX: 0000000000000000 RBX: 00007f2442579264 RCX: 00007f2442444a67
> >>> [ 121.871400] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 00005612dcd02b10
> >>> [ 121.873512] RBP: 00005612dccfd960 R08: 0000000000000000 R09: 00007f2442517be0
> >>> [ 121.875492] R10: 00005612dcd03ea0 R11: 0000000000000246 R12: 0000000000000000
> >>> [ 121.877448] R13: 00005612dcd02b10 R14: 00005612dccfda70 R15: 00005612dccfdb90
> >>> [ 121.879418]  </TASK>
> >>> [ 121.880039] ---[ end trace 0000000000000000 ]---
> >>>
> >> Okay.
> >>
> >> I found that sync_filesystem() will wait for the last osd request,
> >> but the sequence of deleting the osd req from the osdc and calling
> >> req->r_callback() seems incorrect.
> >>
> >> The following patch should fix it:
> >>
> >>
> >> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> >> index 78b622178a3d..a3b4c5cabb80 100644
> >> --- a/net/ceph/osd_client.c
> >> +++ b/net/ceph/osd_client.c
> >> @@ -2532,12 +2532,12 @@ static void finish_request(struct ceph_osd_request *req)
> >>          ceph_msg_revoke_incoming(req->r_reply);
> >>  }
> >>
> >> -static void __complete_request(struct ceph_osd_request *req)
> >> +static void __complete_request(struct ceph_osd_request *req, bool callback)
> >>  {
> >>          dout("%s req %p tid %llu cb %ps result %d\n", __func__, req,
> >>               req->r_tid, req->r_callback, req->r_result);
> >>
> >> -        if (req->r_callback)
> >> +        if (callback && req->r_callback)
> >>                  req->r_callback(req);
> >>          complete_all(&req->r_completion);
> >>          ceph_osdc_put_request(req);
> >> @@ -2548,7 +2548,7 @@ static void complete_request_workfn(struct work_struct *work)
> >>  {
> >>          struct ceph_osd_request *req =
> >>                  container_of(work, struct ceph_osd_request, r_complete_work);
> >>
> >> -        __complete_request(req);
> >> +        __complete_request(req, true);
> >>  }
> >>
> >>  /*
> >> @@ -3873,11 +3873,13 @@ static void handle_reply(struct ceph_osd *osd, struct ceph_msg *msg)
> >>          WARN_ON(!(m.flags & CEPH_OSD_FLAG_ONDISK));
> >>          req->r_version = m.user_version;
> >>          req->r_result = m.result ?: data_len;
> >> +        if (req->r_callback)
> >> +                req->r_callback(req);
> >>          finish_request(req);
> >>          mutex_unlock(&osd->lock);
> >>          up_read(&osdc->lock);
> >>
> >> -        __complete_request(req);
> >> +        __complete_request(req, false);
> > Hi Xiubo,
> >
> > I haven't looked into the actual problem that you're trying to fix here,
> > but this patch seems wrong and very unlikely to fly. The messenger
> > invokes the OSD request callback outside of the osd->lock and osdc->lock
> > critical sections on purpose, to avoid deadlocks. This goes way back
> > and is also consistent with how Objecter behaves in userspace.
>
> Hi Ilya,
>
> This is just a draft patch here.
>
> I didn't test other cases yet and only tested the issue here in cephfs,
> and it succeeded.
>
> The root cause is a sequencing issue that makes sync_filesystem() fail
> to wait for the last osd request: the waiter is woken up before the last
> req->r_callback() has finished. For more detail please see my comment in
> https://tracker.ceph.com/issues/58126?issue_count=405&issue_position=3&next_issue_id=58082&prev_issue_id=58564#note-7.

Hi Xiubo,

This seems to boil down to an expectation mismatch. The diagram in
the tracker comment seems to point the finger at ceph_osdc_sync()
because it appears to return while req->r_callback is still running.
This can indeed be the case, but note that ceph_osdc_sync() has never
waited for any higher-level callback -- req->r_callback could schedule
some delayed work which the OSD client would know nothing about, after
all. ceph_osdc_sync() just waits for a safe/commit reply from an OSD
(and nowadays all replies are safe -- unsafe/ack replies aren't really
a thing anymore).

Thanks,

                Ilya
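
To make the expectation mismatch concrete, here is a small userspace analogy.
This is illustration only, not Ceph code: the names reply_handler,
callback_drop_ref, reply_seen and inode_refs are invented for the example.
The "sync" side is woken by the reply notification alone, the way
ceph_osdc_sync() only waits for the commit reply, while the callback that
drops the last reference (standing in for writepages_finish() -> iput())
runs afterwards:

/* ordering-demo.c: userspace analogy of the race discussed above.
 * Build with: gcc -pthread ordering-demo.c -o ordering-demo
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t replied = PTHREAD_COND_INITIALIZER;
static int reply_seen;          /* stands in for the OSD commit reply      */
static int inode_refs = 1;      /* reference held by the in-flight write   */

/* Stands in for req->r_callback (writepages_finish -> iput). */
static void callback_drop_ref(void)
{
        usleep(1000);                   /* callback is still doing work     */
        pthread_mutex_lock(&lock);
        inode_refs--;                   /* the last reference goes away here */
        pthread_mutex_unlock(&lock);
}

static void *reply_handler(void *arg)
{
        (void)arg;

        /* Wake the waiter as soon as the "reply" is in... */
        pthread_mutex_lock(&lock);
        reply_seen = 1;
        pthread_cond_signal(&replied);
        pthread_mutex_unlock(&lock);

        /* ...and only then run the higher-level callback. */
        callback_drop_ref();
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, reply_handler, NULL);

        /* "sync": wait for the reply only, like ceph_osdc_sync(). */
        pthread_mutex_lock(&lock);
        while (!reply_seen)
                pthread_cond_wait(&replied, &lock);

        /* "evict_inodes": the callback may not have dropped the ref yet. */
        printf("after sync: inode_refs = %d%s\n", inode_refs,
               inode_refs ? "  (would be skipped!)" : "");
        pthread_mutex_unlock(&lock);

        pthread_join(t, NULL);
        return 0;
}

On most runs this prints "inode_refs = 1" after the sync, the same shape as
the "i_count = 1, was skipped!" line in the log above, which is why the
discussion keeps coming back to who should be responsible for waiting for
the callback rather than for the OSD reply.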