Re: Xfs kernel panic during rbd mount

Shain Miley <SMiley@xxxxxxx> · Mon, 31 Aug 2020 17:00:39 +0000

Llya,
Thank you for the quick response this was very helpful toward getting resolution.

Restarting those 3 osds has allowed the rbd image to mount successfully.

I really appreciate all your help on this.

Shain

On 8/31/20, 12:41 PM, "Ilya Dryomov" <idryomov@xxxxxxxxx> wrote:

    On Mon, Aug 31, 2020 at 6:21 PM Shain Miley <SMiley@xxxxxxx> wrote:
    >
    > Hi,
    > A few weeks ago several of our rdb images became unresponsive after a few of our OSDs reached a near full state.
    >
    > Another member of the team rebooted the server that the rbd images are mounted on in an attempt to resolve the issue.
    > In the meantime I added several more nodes to the cluster in order to get additional space.
    >
    > Here are some cluster details:
    >
    > root@rbd1:/var/log# ceph -v
    > ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
    >
    > root@rbd1:/var/log# uname -va
    > Linux rbd1 4.15.0-48-generic #51~16.04.1-Ubuntu SMP Fri Apr 5 12:01:12 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
    >
    > root@rbd1:/var/log# ceph -s
    >   cluster:
    >     id:     504b5794-34bd-44e7-a8c3-0494cf800c23
    >     health: HEALTH_ERR
    >             crush map has legacy tunables (require argonaut, min is firefly)
    >             full ratio(s) out of order
    >             2091546/274437905 objects misplaced (0.762%)
    >             Reduced data availability: 114 pgs inactive
    >             1 slow requests are blocked > 32 sec. Implicated osds 152
    >             4 stuck requests are blocked > 4096 sec. Implicated osds 6,99
    >   services:
    >     mon: 3 daemons, quorum hqceph1,hqceph2,hqceph3
    >     mgr: hqceph2(active), standbys: hqceph3
    >     osd: 291 osds: 283 up, 265 in; 116 remapped pgs
    >     rgw: 1 daemon active
    >   data:
    >     pools:   17 pools, 8199 pgs
    >     objects: 91.48M objects, 292TiB
    >     usage:   880TiB used, 758TiB / 1.60PiB avail
    >     pgs:     1.390% pgs not active
    >              2091546/274437905 objects misplaced (0.762%)
    >              8040 active+clean
    >              114  activating+remapped
    >              40   active+clean+scrubbing+deep
    >              3    active+clean+scrubbing
    >              2    active+remapped+backfilling
    >   io:
    >     recovery: 41.4MiB/s, 12objects/s
    >
    > This morning I got on the server in order to map and mount the 5 or so RBD images that are shared out via samba on this server.
    >
    > After waiting about 10 minutes it was clear something was not 100% correct…here is what I found in the kern.log file:
    >
    >
    > Aug 31 11:43:16 rbd1 kernel: [2158818.570948] libceph: mon0 10.35.1.201:6789 session established
    > Aug 31 11:43:16 rbd1 kernel: [2158818.576306] libceph: client182797617 fsid 504b5794-34bd-44e7-a8c3-0494cf800c23
    > Aug 31 11:43:16 rbd1 kernel: [2158818.710199] rbd: rbd0: capacity 54975581388800 features 0x0
    > Aug 31 11:43:22 rbd1 kernel: [2158824.685353] XFS (rbd0): Mounting V4 Filesystem
    > Aug 31 11:44:19 rbd1 kernel: [2158881.536794] XFS (rbd0): Starting recovery (logdev: internal)
    > Aug 31 11:47:06 rbd1 kernel: [2159048.202835] INFO: task mount:33177 blocked for more than 120 seconds.
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203053]       Not tainted 4.15.0-48-generic #51~16.04.1-Ubuntu
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203260] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203523] mount           D    0 33177  33011 0x00000000
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203527] Call Trace:
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203538]  __schedule+0x3d6/0x8b0
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203542]  ? __switch_to_asm+0x34/0x70
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203546]  schedule+0x36/0x80
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203551]  schedule_timeout+0x1db/0x370
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203629]  ? xfs_trans_read_buf_map+0xf8/0x330 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203634]  wait_for_completion+0xb4/0x140
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203637]  ? wake_up_q+0x70/0x70
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203691]  ? xfs_trans_read_buf_map+0xf8/0x330 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203740]  ? _xfs_buf_read+0x23/0x30 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203787]  xfs_buf_submit_wait+0x7f/0x220 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203839]  ? xfs_trans_read_buf_map+0xf8/0x330 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203887]  _xfs_buf_read+0x23/0x30 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203933]  xfs_buf_read_map+0x10a/0x190 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.203985]  xfs_trans_read_buf_map+0xf8/0x330 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204022]  xfs_read_agf+0x90/0x120 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204058]  xfs_alloc_read_agf+0x49/0x1d0 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204094]  xfs_alloc_pagf_init+0x29/0x60 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204141]  xfs_initialize_perag_data+0x99/0x110 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204193]  xfs_mountfs+0x79b/0x950 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204243]  ? xfs_mru_cache_create+0x12b/0x170 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204294]  xfs_fs_fill_super+0x428/0x5e0 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204300]  mount_bdev+0x246/0x290
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204349]  ? xfs_test_remount_options.isra.16+0x60/0x60 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204398]  xfs_fs_mount+0x15/0x20 [xfs]
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204402]  mount_fs+0x3d/0x150
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204406]  vfs_kern_mount+0x67/0x110
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204410]  do_mount+0x201/0xd00
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204414]  ? __check_object_size+0xfc/0x1a0
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204418]  SyS_mount+0x95/0xe0
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204423]  do_syscall_64+0x73/0x130
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204426]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204429] RIP: 0033:0x7f77a9a4fc5a
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204431] RSP: 002b:00007ffda3e83818 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204435] RAX: ffffffffffffffda RBX: 0000000002424030 RCX: 00007f77a9a4fc5a
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204436] RDX: 0000000002424210 RSI: 0000000002424310 RDI: 0000000002424230
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204438] RBP: 0000000000000000 R08: 0000000002424280 R09: 0000000000000012
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204440] R10: 00000000c0ed0c00 R11: 0000000000000206 R12: 0000000002424230
    > Aug 31 11:47:06 rbd1 kernel: [2159048.204441] R13: 0000000002424210 R14: 0000000000000000 R15: 0000000000000003
    >
    > Any suggestions on what I can/should do next?

    Hi Shain,

    This is not a panic.  XFS's mount process is waiting for some I/O to
    complete, and I'm guessing it is one of those 5 stuck requests reported
    under HEALTH_ERR.  To confirm, you can take a look at

      /sys/kernel/debug/ceph/504b5794-34bd-44e7-a8c3-0494cf800c23.client182797617/osdc

    on "rbd1" node and OSD logs for OSDs 6, 99, and 152.  The OSD logs
    should tell you what exactly those requests are stuck on.

    Thanks,

                    Ilya

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx