Did your automation/process allow for pauses between changes so that peering could complete? My hunch is that you caused a very large peering storm (a PG is inactive while it peers), which in turn caused your VMs to panic. If the RBDs are unmapped and re-mapped, do they still struggle? Rough sketches of staging the reweights and of throttling the backfill are appended below the quoted message.

Respectfully,

Wes Dillingham
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Tue, Jul 18, 2023 at 11:52 AM <fb2cd0fc-933c-4cfe-b534-93d67045a088@xxxxxxxxxxxxxxx> wrote:

> Starting on Friday, as part of adding a new pod of 12 servers, we
> initiated a reweight on roughly 384 drives, from 0.1 to 0.25. Something
> about the resulting large backfill is causing librbd to hang, requiring
> server restarts. The volumes show buffer I/O errors when this happens.
> We are currently using hybrid OSDs with both SSDs and traditional
> spinning disks. The current status of the cluster is:
>
> ceph --version
> ceph version 14.2.22
> Cluster kernel: 5.4.49-200
>
> {
>     "mon": {
>         "ceph version 14.2.22 nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.22 nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.21 nautilus (stable)": 368,
>         "ceph version 14.2.22 (stable)": 2055
>     },
>     "mds": {},
>     "rgw": {
>         "ceph version 14.2.22 (stable)": 7
>     },
>     "overall": {
>         "ceph version 14.2.21 (stable)": 368,
>         "ceph version 14.2.22 (stable)": 2068
>     }
> }
>
> HEALTH_WARN noscrub,nodeep-scrub flag(s) set
> pgs: 6815703/11016906121 objects degraded (0.062%)
>      2814059622/11016906121 objects misplaced (25.543%)
>
> The client servers are on 3.10.0-1062.1.2.el7.x86_6
>
> We have found a couple of issues that look relevant:
> https://tracker.ceph.com/issues/19385
> https://tracker.ceph.com/issues/18807
>
> Has anyone experienced anything like this before? Does anyone have any
> recommendations as to settings that can help alleviate this while the
> backfill completes?
>
> An example of the buffer I/O errors:
>
> Jul 17 06:36:08 host8098 kernel: buffer_io_error: 22 callbacks suppressed
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 3, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-5, logical block 511984, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-6, logical block 3487657728, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-6, logical block 3487657729, async page read
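
For the next round of changes, here is a minimal sketch of how the reweights could be staged so peering settles between steps. It assumes the change was made with "ceph osd crush reweight", that the affected OSD IDs are passed to the script as arguments, and the target weight, batch size and poll interval are placeholder guesses, not tested values:

#!/usr/bin/env bash
# Sketch only: raise the CRUSH weight of the OSDs given on the command line
# in small batches, waiting for peering/activating PGs to clear before
# touching the next batch.
set -euo pipefail

TARGET_WEIGHT=0.25   # assumption: the weight you were moving the drives to
BATCH_SIZE=12        # assumption: e.g. one host's worth of OSDs at a time
OSD_IDS=("$@")       # pass the affected OSD IDs as arguments

wait_for_peering() {
    # "ceph pg stat" summarizes PG states; block while any PGs are still
    # peering or activating.
    while ceph pg stat | grep -Eq 'peering|activating'; do
        sleep 10
    done
}

for ((i = 0; i < ${#OSD_IDS[@]}; i += BATCH_SIZE)); do
    for osd in "${OSD_IDS[@]:i:BATCH_SIZE}"; do
        ceph osd crush reweight "osd.${osd}" "${TARGET_WEIGHT}"
    done
    wait_for_peering
done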
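
On the question of settings that can soften things while the backfill completes: the knobs people usually turn down are osd_max_backfills, osd_recovery_max_active and the osd_recovery_sleep options. A rough example for a Nautilus cluster follows; the values are conservative guesses rather than something tuned to your workload, and the nobackfill/norebalance flags are only there if you need a pause to unmap and re-map the RBDs:

# Throttle backfill/recovery so client I/O gets more of the OSDs' time.
# Nautilus has the centralized config database:
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep_hdd 0.2

# Or push the same values into the running daemons directly:
ceph tell 'osd.*' injectargs '--osd_max_backfills=1 --osd_recovery_max_active=1'

# If you need breathing room to unmap/re-map the RBDs, data movement can be
# paused entirely and resumed later:
ceph osd set nobackfill
ceph osd set norebalance
# ...and when you are ready to let it continue:
ceph osd unset nobackfill
ceph osd unset norebalance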