Re: [PATCH v2] ceph: drop the messages from MDS when unmouting

Xiubo Li <xiubli@xxxxxxxxxx> · Thu, 26 Jan 2023 21:03:44 +0800

On 24/01/2023 20:32, Luís Henriques wrote:
On Tue, Jan 24, 2023 at 06:26:46PM +0800, Xiubo Li wrote:
On 23/01/2023 18:15, Luís Henriques wrote:
On Sun, Jan 22, 2023 at 09:57:46PM +0800, Xiubo Li wrote:
Hi Luis,

On 20/01/2023 18:36, Luís Henriques wrote:
On Thu, Jan 19, 2023 at 04:27:09PM +0000, Luís Henriques wrote:
On Wed, Dec 21, 2022 at 05:30:31PM +0800, xiubli@xxxxxxxxxx wrote:
From: Xiubo Li <xiubli@xxxxxxxxxx>

When unmounting it will just wait for the inflight requests to be
finished, but just before the sessions are closed the kclient still
could receive the caps/snaps/lease/quota msgs from MDS. All these
msgs need to hold some inodes, which will cause ceph_kill_sb() failing
to evict the inodes in time.

If encrypt is enabled the kernel generate a warning when removing
the encrypt keys when the skipped inodes still hold the keyring:
Finally (sorry for the delay!) I managed to look into the 6.1 rebase.  It
does look good, but I started hitting the WARNING added by patch:

     [DO NOT MERGE] ceph: make sure all the files successfully put before unmounting

This patch seems to be working but I'm not sure we really need the extra
OK, looks like I jumped the gun here: I still see the warning with your
patch.

I've done a quick hack and the patch below sees fix it.  But again, it
will impact performance.  I'll see if I can figure out something else.

Cheers,
--
Luís

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 2cd134ad02a9..bdb4efa0f9f7 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -2988,6 +2988,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
    	return ret;
    }
+static int ceph_flush(struct file *file, fl_owner_t id)
+{
+	struct inode *inode = file_inode(file);
+	int ret;
+
+	if ((file->f_mode & FMODE_WRITE) == 0)
+		return 0;
+
+	ret = filemap_write_and_wait(inode->i_mapping);
+	if (ret)
+		ret = filemap_check_wb_err(file->f_mapping, 0);
+
+	return ret;
+}
+
    const struct file_operations ceph_file_fops = {
    	.open = ceph_open,
    	.release = ceph_release,
@@ -3005,4 +3020,5 @@ const struct file_operations ceph_file_fops = {
    	.compat_ioctl = compat_ptr_ioctl,
    	.fallocate	= ceph_fallocate,
    	.copy_file_range = ceph_copy_file_range,
+	.flush = ceph_flush,
    };

This will only fix the second crash case in
https://tracker.ceph.com/issues/58126#note-6, but not the case in
https://tracker.ceph.com/issues/58126#note-5.

This issue could be triggered with "test_dummy_encryption" and with
xfstests-dev's generic/124. You can have a try.
OK, thanks.  I'll give it a try.  BTW, my local reproducer was
generic/132, not generic/124.  I'll let you know if I find something.
Hi Luis,

I added some logs and found that when doing the aio_write, it will split to
many aio requests and when the last req finished it will call the
"writepages_finish()", which will iput() the inode and release the last
refcount of the inode.

But it seems the complete_all() is called without the req->r_callback() is
totally finished:

<4>[500400.268200] writepages_finish 0000000060940222 rc 0
<4>[500400.268476] writepages_finish 0000000060940222 rc 0 <===== the last
osd req->r_callback()
<4>[500400.268515] sync_fs (blocking)  <===== unmounting begin
<4>[500400.268526] sync_fs (blocking) done
<4>[500400.268530] kill_sb after sync_filesystem 00000000a01a1cf4   <=== the
sync_filesystem() will be called, I just added it in ceph code but the VFS
will call it again in "kill_anon_super()"
<4>[500400.268539] ceph_evict_inode:682, dec inode 0000000044f12aa7
<4>[500400.268626] sync_fs (blocking)
<4>[500400.268631] sync_fs (blocking) done
<4>[500400.268634] evict_inodes inode 0000000060940222, i_count = 1, was
skipped!    <=== skipped
<4>[500400.268642] fscrypt_destroy_keyring: mk 00000000baf04977
mk_active_refs = 2
<4>[500400.268651] ceph_evict_inode:682, dec inode 0000000060940222   <====
evict the inode in the req->r_callback()

Locally my VM is not working and I couldn't run the test for now. Could you
help test the following patch ?
So, running generic/132 in my test environment with your patch applied to
the 'testing' branch I still see the WARNING (pasted the backtrace
bellow).  I'll try help to dig a bit more on this issue in the next few
days.

Cheers,
--
Luís


[  102.713299] ceph: test_dummy_encryption mode enabled
[  121.807203] evict_inodes inode 000000000d85998d, i_count = 1, was skipped!
[  121.809055] fscrypt_destroy_keyring: mk_active_refs = 2
[  121.810439] ------------[ cut here ]------------
[  121.811937] WARNING: CPU: 1 PID: 2671 at fs/crypto/keyring.c:244 fscrypt_destroy_keyring+0x109/0x110
[  121.814243] CPU: 1 PID: 2671 Comm: umount Not tainted 6.2.0-rc2+ #23
[  121.815810] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
[  121.818588] RIP: 0010:fscrypt_destroy_keyring+0x109/0x110
[  121.819925] Code: 00 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 8b 77 38 48 c7 c7 18 40 c0 81 e8 e2 b2 51 00 41 8b 47 38 83 f8 01 74 9e <0f> 0b eb 9a 0f 1f 00 0f0
[  121.824469] RSP: 0018:ffffc900039f3e20 EFLAGS: 00010202
[  121.825908] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000001
[  121.827660] RDX: 0000000000000000 RSI: ffff88842fc9b180 RDI: 00000000ffffffff
[  121.829425] RBP: ffff888102a99000 R08: 0000000000000000 R09: c0000000ffffefff
[  121.831188] R10: ffffc90000093e00 R11: ffffc900039f3cd8 R12: 0000000000000000
[  121.833408] R13: ffff888102a9a948 R14: ffffffff823a66b0 R15: ffff8881048f6c00
[  121.835186] FS:  00007f2442206840(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
[  121.838205] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  121.840386] CR2: 00007f0a289daf70 CR3: 000000011446a000 CR4: 00000000000006a0
[  121.842190] Call Trace:
[  121.843125]  <TASK>
[  121.843945]  generic_shutdown_super+0x42/0x130
[  121.845656]  kill_anon_super+0xe/0x30
[  121.847081]  ceph_kill_sb+0x3d/0xa0
[  121.848418]  deactivate_locked_super+0x34/0x60
[  121.850166]  cleanup_mnt+0xb8/0x150
[  121.851498]  task_work_run+0x61/0x90
[  121.852863]  exit_to_user_mode_prepare+0x147/0x170
[  121.854741]  syscall_exit_to_user_mode+0x20/0x40
[  121.856489]  do_syscall_64+0x48/0x80
[  121.857887]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  121.859659] RIP: 0033:0x7f2442444a67
[  121.860922] Code: 24 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 018
[  121.866942] RSP: 002b:00007ffc56533fc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  121.869305] RAX: 0000000000000000 RBX: 00007f2442579264 RCX: 00007f2442444a67
[  121.871400] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 00005612dcd02b10
[  121.873512] RBP: 00005612dccfd960 R08: 0000000000000000 R09: 00007f2442517be0
[  121.875492] R10: 00005612dcd03ea0 R11: 0000000000000246 R12: 0000000000000000
[  121.877448] R13: 00005612dcd02b10 R14: 00005612dccfda70 R15: 00005612dccfdb90
[  121.879418]  </TASK>
[  121.880039] ---[ end trace 0000000000000000 ]---

Okay.

I found the sync_filesystem() will wait for the last the osd request, 
but the sequence of deleting the osd req from osdc and calling 
req->r_callback seems incorrect.

The following patch should fix it:


diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 78b622178a3d..a3b4c5cabb80 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2532,12 +2532,12 @@ static void finish_request(struct 
ceph_osd_request *req)
        ceph_msg_revoke_incoming(req->r_reply);
 }

-static void __complete_request(struct ceph_osd_request *req)
+static void __complete_request(struct ceph_osd_request *req, bool callback)
 {
        dout("%s req %p tid %llu cb %ps result %d\n", __func__, req,
             req->r_tid, req->r_callback, req->r_result);

-       if (req->r_callback)
+       if (callback && req->r_callback)
                req->r_callback(req);
        complete_all(&req->r_completion);
        ceph_osdc_put_request(req);
@@ -2548,7 +2548,7 @@ static void complete_request_workfn(struct 
work_struct *work)
        struct ceph_osd_request *req =
            container_of(work, struct ceph_osd_request, r_complete_work);

-       __complete_request(req);
+       __complete_request(req, true);
 }

 /*
@@ -3873,11 +3873,13 @@ static void handle_reply(struct ceph_osd *osd, 
struct ceph_msg *msg)
        WARN_ON(!(m.flags & CEPH_OSD_FLAG_ONDISK));
        req->r_version = m.user_version;
        req->r_result = m.result ?: data_len;
+       if (req->r_callback)
+               req->r_callback(req);
        finish_request(req);
        mutex_unlock(&osd->lock);
        up_read(&osdc->lock);

-       __complete_request(req);
+       __complete_request(req, false);
        return;

 fail_request:

Thanks

- Xiubo


diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 78b622178a3d..5cdaba0d3003 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2539,6 +2539,7 @@ static void __complete_request(struct ceph_osd_request
*req)

         if (req->r_callback)
                 req->r_callback(req);
+       barrier();
         complete_all(&req->r_completion);
         ceph_osdc_put_request(req);
  }

Thanks

- Xiubo


Locally I am still reading the code to check why "sync_filesystem(s);"
couldn't do the same with "ceph_flush" as above.

I am still on holiday these days and I will test this after my back.
Sure, no problem.  Enjoy your break!

Cheers,
--
Luís

Thanks


'stopping' state.  Looking at the code, we've flushed all the workqueues
and done all the waits, so I *think* the sync_filesystem() call should be
enough.

The other alternative I see would be to add a ->flush() to ceph_file_fops,
where we'd do a filemap_write_and_wait().  But that would probably have a
negative performance impact -- my understand is that it basically means
we'll have sync file closes.

Cheers,
--
Luís

WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b

URL: https://tracker.ceph.com/issues/58126
Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
---

V2:
- Fix it in ceph layer.


    fs/ceph/caps.c       |  3 +++
    fs/ceph/mds_client.c |  5 ++++-
    fs/ceph/mds_client.h |  7 ++++++-
    fs/ceph/quota.c      |  3 +++
    fs/ceph/snap.c       |  3 +++
    fs/ceph/super.c      | 14 ++++++++++++++
    6 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 15d9e0f0d65a..e8a53aeb2a8c 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -4222,6 +4222,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
    	dout("handle_caps from mds%d\n", session->s_mds);
+	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
+		return;
+
    	/* decode */
    	end = msg->front.iov_base + msg->front.iov_len;
    	if (msg->front.iov_len < sizeof(*h))
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d41ab68f0130..1ad85af49b45 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4869,6 +4869,9 @@ static void handle_lease(struct ceph_mds_client *mdsc,
    	dout("handle_lease from mds%d\n", mds);
+	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
+		return;
+
    	/* decode */
    	if (msg->front.iov_len < sizeof(*h) + sizeof(u32))
    		goto bad;
@@ -5262,7 +5265,7 @@ void send_flush_mdlog(struct ceph_mds_session *s)
    void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
    {
    	dout("pre_umount\n");
-	mdsc->stopping = 1;
+	mdsc->stopping = CEPH_MDSC_STOPPING_BEGAIN;
    	ceph_mdsc_iterate_sessions(mdsc, send_flush_mdlog, true);
    	ceph_mdsc_iterate_sessions(mdsc, lock_unlock_session, false);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 81a1f9a4ac3b..56f9d8077068 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -398,6 +398,11 @@ struct cap_wait {
    	int			want;
    };
+enum {
+	CEPH_MDSC_STOPPING_BEGAIN = 1,
+	CEPH_MDSC_STOPPING_FLUSHED = 2,
+};
+
    /*
     * mds client state
     */
@@ -414,7 +419,7 @@ struct ceph_mds_client {
    	struct ceph_mds_session **sessions;    /* NULL for mds if no session */
    	atomic_t		num_sessions;
    	int                     max_sessions;  /* len of sessions array */
-	int                     stopping;      /* true if shutting down */
+	int                     stopping;      /* the stage of shutting down */
    	atomic64_t		quotarealms_count; /* # realms with quota */
    	/*
diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
index 64592adfe48f..f5819fc31d28 100644
--- a/fs/ceph/quota.c
+++ b/fs/ceph/quota.c
@@ -47,6 +47,9 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
    	struct inode *inode;
    	struct ceph_inode_info *ci;
+	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
+		return;
+
    	if (msg->front.iov_len < sizeof(*h)) {
    		pr_err("%s corrupt message mds%d len %d\n", __func__,
    		       session->s_mds, (int)msg->front.iov_len);
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index a73943e51a77..eeabdd0211d8 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -1010,6 +1010,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
    	int locked_rwsem = 0;
    	bool close_sessions = false;
+	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
+		return;
+
    	/* decode */
    	if (msg->front.iov_len < sizeof(*h))
    		goto bad;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index f10a076f47e5..012b35be41a9 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1483,6 +1483,20 @@ static void ceph_kill_sb(struct super_block *s)
    	ceph_mdsc_pre_umount(fsc->mdsc);
    	flush_fs_workqueues(fsc);
+	/*
+	 * Though the kill_anon_super() will finally trigger the
+	 * sync_filesystem() anyway, we still need to do it here and
+	 * then bump the stage of shutdown. This will drop any further
+	 * message, which makes no sense any more, from MDSs.
+	 *
+	 * Without this when evicting the inodes it may fail in the
+	 * kill_anon_super(), which will trigger a warning when
+	 * destroying the fscrypt keyring and then possibly trigger
+	 * a further crash in ceph module when iput() the inodes.
+	 */
+	sync_filesystem(s);
+	fsc->mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
+
    	kill_anon_super(s);
    	fsc->client->extra_mon_dispatch = NULL;
--
2.31.1

--
Best Regards,

Xiubo Li (李秀波)

Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
Slack: @Xiubo Li

--
Best Regards,

Xiubo Li (李秀波)

Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
Slack: @Xiubo Li

--
Best Regards,

Xiubo Li (李秀波)

Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
Slack: @Xiubo Li