Did your automation/process allow for pauses between changes so that peering could complete? My hunch is that you caused a very large peering storm (a PG is inactive while it peers), which in turn caused your VMs to panic. If the RBDs are unmapped and re-mapped, do they still struggle? Rough sketches of staging the reweights and of throttling the backfill are appended below the quoted message.

Respectfully,

Wes Dillingham
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Tue, Jul 18, 2023 at 11:52 AM <fb2cd0fc-933c-4cfe-b534-93d67045a088@xxxxxxxxxxxxxxx> wrote:

> Starting on Friday, as part of adding a new pod of 12 servers, we
> initiated a reweight on roughly 384 drives, from 0.1 to 0.25. Something
> about the resulting large backfill is causing librbd to hang, requiring
> server restarts. The volumes show buffer I/O errors when this happens.
> We are currently using hybrid OSDs with both SSDs and traditional
> spinning disks. The current status of the cluster is:
>
> ceph --version
> ceph version 14.2.22
> Cluster kernel: 5.4.49-200
>
> {
>     "mon": {
>         "ceph version 14.2.22 nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.22 nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.21 nautilus (stable)": 368,
>         "ceph version 14.2.22 (stable)": 2055
>     },
>     "mds": {},
>     "rgw": {
>         "ceph version 14.2.22 (stable)": 7
>     },
>     "overall": {
>         "ceph version 14.2.21 (stable)": 368,
>         "ceph version 14.2.22 (stable)": 2068
>     }
> }
>
> HEALTH_WARN noscrub,nodeep-scrub flag(s) set
> pgs: 6815703/11016906121 objects degraded (0.062%)
>      2814059622/11016906121 objects misplaced (25.543%)
>
> The client servers are on 3.10.0-1062.1.2.el7.x86_6
>
> We have found a couple of issues that look relevant:
> https://tracker.ceph.com/issues/19385
> https://tracker.ceph.com/issues/18807
>
> Has anyone experienced anything like this before? Does anyone have any
> recommendations as to settings that can help alleviate this while the
> backfill completes?
>
> An example of the buffer I/O errors:
>
> Jul 17 06:36:08 host8098 kernel: buffer_io_error: 22 callbacks suppressed
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 3, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-5, logical block 511984, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-6, logical block 3487657728, async page read
> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-6, logical block 3487657729, async page read
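
For the next round of changes, here is a minimal sketch of how the reweights could be staged so peering settles between steps. It assumes the change was made with "ceph osd crush reweight", that the affected OSD IDs are passed to the script as arguments, and the target weight, batch size and poll interval are placeholder guesses, not tested values:

#!/usr/bin/env bash
# Sketch only: raise the CRUSH weight of the OSDs given on the command line
# in small batches, waiting for peering/activating PGs to clear before
# touching the next batch.
set -euo pipefail

TARGET_WEIGHT=0.25   # assumption: the weight you were moving the drives to
BATCH_SIZE=12        # assumption: e.g. one host's worth of OSDs at a time
OSD_IDS=("$@")       # pass the affected OSD IDs as arguments

wait_for_peering() {
    # "ceph pg stat" summarizes PG states; block while any PGs are still
    # peering or activating.
    while ceph pg stat | grep -Eq 'peering|activating'; do
        sleep 10
    done
}

for ((i = 0; i < ${#OSD_IDS[@]}; i += BATCH_SIZE)); do
    for osd in "${OSD_IDS[@]:i:BATCH_SIZE}"; do
        ceph osd crush reweight "osd.${osd}" "${TARGET_WEIGHT}"
    done
    wait_for_peering
done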
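
On the question of settings that can soften things while the backfill completes: the knobs people usually turn down are osd_max_backfills, osd_recovery_max_active and the osd_recovery_sleep options. A rough example for a Nautilus cluster follows; the values are conservative guesses rather than something tuned to your workload, and the nobackfill/norebalance flags are only there if you need a pause to unmap and re-map the RBDs:

# Throttle backfill/recovery so client I/O gets more of the OSDs' time.
# Nautilus has the centralized config database:
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep_hdd 0.2

# Or push the same values into the running daemons directly:
ceph tell 'osd.*' injectargs '--osd_max_backfills=1 --osd_recovery_max_active=1'

# If you need breathing room to unmap/re-map the RBDs, data movement can be
# paused entirely and resumed later:
ceph osd set nobackfill
ceph osd set norebalance
# ...and when you are ready to let it continue:
ceph osd unset nobackfill
ceph osd unset norebalance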