Re: [PATCH v2] ceph: drop the messages from MDS when unmouting

Xiubo Li <xiubli@xxxxxxxxxx> · Sat, 28 Jan 2023 11:11:55 +0800

On 26/01/2023 22:04, Ilya Dryomov wrote:
On Thu, Jan 26, 2023 at 2:03 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:

On 24/01/2023 20:32, Luís Henriques wrote:
On Tue, Jan 24, 2023 at 06:26:46PM +0800, Xiubo Li wrote:
On 23/01/2023 18:15, Luís Henriques wrote:
On Sun, Jan 22, 2023 at 09:57:46PM +0800, Xiubo Li wrote:
Hi Luis,

On 20/01/2023 18:36, Luís Henriques wrote:
On Thu, Jan 19, 2023 at 04:27:09PM +0000, Luís Henriques wrote:
On Wed, Dec 21, 2022 at 05:30:31PM +0800, xiubli@xxxxxxxxxx wrote:
From: Xiubo Li <xiubli@xxxxxxxxxx>

When unmounting it will just wait for the inflight requests to be
finished, but just before the sessions are closed the kclient still
could receive the caps/snaps/lease/quota msgs from MDS. All these
msgs need to hold some inodes, which will cause ceph_kill_sb() failing
to evict the inodes in time.

If encrypt is enabled the kernel generate a warning when removing
the encrypt keys when the skipped inodes still hold the keyring:
Finally (sorry for the delay!) I managed to look into the 6.1 rebase.  It
does look good, but I started hitting the WARNING added by patch:

      [DO NOT MERGE] ceph: make sure all the files successfully put before unmounting

This patch seems to be working but I'm not sure we really need the extra
OK, looks like I jumped the gun here: I still see the warning with your
patch.

I've done a quick hack and the patch below sees fix it.  But again, it
will impact performance.  I'll see if I can figure out something else.

Cheers,
--
Luís

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 2cd134ad02a9..bdb4efa0f9f7 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -2988,6 +2988,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
           return ret;
     }
+static int ceph_flush(struct file *file, fl_owner_t id)
+{
+ struct inode *inode = file_inode(file);
+ int ret;
+
+ if ((file->f_mode & FMODE_WRITE) == 0)
+         return 0;
+
+ ret = filemap_write_and_wait(inode->i_mapping);
+ if (ret)
+         ret = filemap_check_wb_err(file->f_mapping, 0);
+
+ return ret;
+}
+
     const struct file_operations ceph_file_fops = {
           .open = ceph_open,
           .release = ceph_release,
@@ -3005,4 +3020,5 @@ const struct file_operations ceph_file_fops = {
           .compat_ioctl = compat_ptr_ioctl,
           .fallocate      = ceph_fallocate,
           .copy_file_range = ceph_copy_file_range,
+ .flush = ceph_flush,
     };

This will only fix the second crash case in
https://tracker.ceph.com/issues/58126#note-6, but not the case in
https://tracker.ceph.com/issues/58126#note-5.

This issue could be triggered with "test_dummy_encryption" and with
xfstests-dev's generic/124. You can have a try.
OK, thanks.  I'll give it a try.  BTW, my local reproducer was
generic/132, not generic/124.  I'll let you know if I find something.
Hi Luis,

I added some logs and found that when doing the aio_write, it will split to
many aio requests and when the last req finished it will call the
"writepages_finish()", which will iput() the inode and release the last
refcount of the inode.

But it seems the complete_all() is called without the req->r_callback() is
totally finished:

<4>[500400.268200] writepages_finish 0000000060940222 rc 0
<4>[500400.268476] writepages_finish 0000000060940222 rc 0 <===== the last
osd req->r_callback()
<4>[500400.268515] sync_fs (blocking)  <===== unmounting begin
<4>[500400.268526] sync_fs (blocking) done
<4>[500400.268530] kill_sb after sync_filesystem 00000000a01a1cf4   <=== the
sync_filesystem() will be called, I just added it in ceph code but the VFS
will call it again in "kill_anon_super()"
<4>[500400.268539] ceph_evict_inode:682, dec inode 0000000044f12aa7
<4>[500400.268626] sync_fs (blocking)
<4>[500400.268631] sync_fs (blocking) done
<4>[500400.268634] evict_inodes inode 0000000060940222, i_count = 1, was
skipped!    <=== skipped
<4>[500400.268642] fscrypt_destroy_keyring: mk 00000000baf04977
mk_active_refs = 2
<4>[500400.268651] ceph_evict_inode:682, dec inode 0000000060940222   <====
evict the inode in the req->r_callback()

Locally my VM is not working and I couldn't run the test for now. Could you
help test the following patch ?
So, running generic/132 in my test environment with your patch applied to
the 'testing' branch I still see the WARNING (pasted the backtrace
bellow).  I'll try help to dig a bit more on this issue in the next few
days.

Cheers,
--
Luís


[  102.713299] ceph: test_dummy_encryption mode enabled
[  121.807203] evict_inodes inode 000000000d85998d, i_count = 1, was skipped!
[  121.809055] fscrypt_destroy_keyring: mk_active_refs = 2
[  121.810439] ------------[ cut here ]------------
[  121.811937] WARNING: CPU: 1 PID: 2671 at fs/crypto/keyring.c:244 fscrypt_destroy_keyring+0x109/0x110
[  121.814243] CPU: 1 PID: 2671 Comm: umount Not tainted 6.2.0-rc2+ #23
[  121.815810] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
[  121.818588] RIP: 0010:fscrypt_destroy_keyring+0x109/0x110
[  121.819925] Code: 00 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 8b 77 38 48 c7 c7 18 40 c0 81 e8 e2 b2 51 00 41 8b 47 38 83 f8 01 74 9e <0f> 0b eb 9a 0f 1f 00 0f0
[  121.824469] RSP: 0018:ffffc900039f3e20 EFLAGS: 00010202
[  121.825908] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000001
[  121.827660] RDX: 0000000000000000 RSI: ffff88842fc9b180 RDI: 00000000ffffffff
[  121.829425] RBP: ffff888102a99000 R08: 0000000000000000 R09: c0000000ffffefff
[  121.831188] R10: ffffc90000093e00 R11: ffffc900039f3cd8 R12: 0000000000000000
[  121.833408] R13: ffff888102a9a948 R14: ffffffff823a66b0 R15: ffff8881048f6c00
[  121.835186] FS:  00007f2442206840(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
[  121.838205] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  121.840386] CR2: 00007f0a289daf70 CR3: 000000011446a000 CR4: 00000000000006a0
[  121.842190] Call Trace:
[  121.843125]  <TASK>
[  121.843945]  generic_shutdown_super+0x42/0x130
[  121.845656]  kill_anon_super+0xe/0x30
[  121.847081]  ceph_kill_sb+0x3d/0xa0
[  121.848418]  deactivate_locked_super+0x34/0x60
[  121.850166]  cleanup_mnt+0xb8/0x150
[  121.851498]  task_work_run+0x61/0x90
[  121.852863]  exit_to_user_mode_prepare+0x147/0x170
[  121.854741]  syscall_exit_to_user_mode+0x20/0x40
[  121.856489]  do_syscall_64+0x48/0x80
[  121.857887]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  121.859659] RIP: 0033:0x7f2442444a67
[  121.860922] Code: 24 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 018
[  121.866942] RSP: 002b:00007ffc56533fc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  121.869305] RAX: 0000000000000000 RBX: 00007f2442579264 RCX: 00007f2442444a67
[  121.871400] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 00005612dcd02b10
[  121.873512] RBP: 00005612dccfd960 R08: 0000000000000000 R09: 00007f2442517be0
[  121.875492] R10: 00005612dcd03ea0 R11: 0000000000000246 R12: 0000000000000000
[  121.877448] R13: 00005612dcd02b10 R14: 00005612dccfda70 R15: 00005612dccfdb90
[  121.879418]  </TASK>
[  121.880039] ---[ end trace 0000000000000000 ]---

Okay.

I found the sync_filesystem() will wait for the last the osd request,
but the sequence of deleting the osd req from osdc and calling
req->r_callback seems incorrect.

The following patch should fix it:


diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 78b622178a3d..a3b4c5cabb80 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2532,12 +2532,12 @@ static void finish_request(struct
ceph_osd_request *req)
          ceph_msg_revoke_incoming(req->r_reply);
   }

-static void __complete_request(struct ceph_osd_request *req)
+static void __complete_request(struct ceph_osd_request *req, bool callback)
   {
          dout("%s req %p tid %llu cb %ps result %d\n", __func__, req,
               req->r_tid, req->r_callback, req->r_result);

-       if (req->r_callback)
+       if (callback && req->r_callback)
                  req->r_callback(req);
          complete_all(&req->r_completion);
          ceph_osdc_put_request(req);
@@ -2548,7 +2548,7 @@ static void complete_request_workfn(struct
work_struct *work)
          struct ceph_osd_request *req =
              container_of(work, struct ceph_osd_request, r_complete_work);

-       __complete_request(req);
+       __complete_request(req, true);
   }

   /*
@@ -3873,11 +3873,13 @@ static void handle_reply(struct ceph_osd *osd,
struct ceph_msg *msg)
          WARN_ON(!(m.flags & CEPH_OSD_FLAG_ONDISK));
          req->r_version = m.user_version;
          req->r_result = m.result ?: data_len;
+       if (req->r_callback)
+               req->r_callback(req);
          finish_request(req);
          mutex_unlock(&osd->lock);
          up_read(&osdc->lock);

-       __complete_request(req);
+       __complete_request(req, false);
Hi Xiubo,

I haven't looked into the actual problem that you trying to fix here
but this patch seems wrong and very unlikely to fly.  The messenger
invokes the OSD request callback outside of osd->lock and osdc->lock
critical sections on purpose to avoid deadlocks.  This goes way back
and also consistent with how Objecter behaves in userspace.

Hi Ilya

This is just a draft patch here.

I didn't test other cases yet and only tested the issue here in cephfs 
and it succeeded.

The root cause is the sequence issue to make the sync_filesystem() 
failing to wait the last osd request, which is the last 
req->r_callback() isn't finished yet the waiter is woke up. And more 
detail please see my comment in 
https://tracker.ceph.com/issues/58126?issue_count=405&issue_position=3&next_issue_id=58082&prev_issue_id=58564#note-7.

I will figure out the best approach to fix it later.

Thanks

- Xiubo

Thanks,

                 Ilya

--
Best Regards,

Xiubo Li (李秀波)

Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
Slack: @Xiubo Li