Re: librbd hangs during large backfill

I've seen this dynamic contribute to a hypervisor with many attachments running out of system-wide file descriptors.
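A quick way to see whether a process is approaching its file-descriptor limit (a generic sketch; the `qemu` process name and `/proc` paths are assumptions about a typical Linux hypervisor, not details from this thread):

```shell
# Compare a process's open-FD count against its soft limit. Demonstrated on
# the current shell ($$) so the snippet is self-contained; on a hypervisor
# you would loop over $(pgrep -f qemu) instead.
pid=$$
open=$(ls "/proc/$pid/fd" | wc -l)                              # FDs open now
limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")  # soft limit
echo "pid $pid: $open open FDs (soft limit $limit)"
cat /proc/sys/fs/file-max   # the system-wide ceiling mentioned above
```

If the per-process limit is the one being hit, `max_files` in /etc/libvirt/qemu.conf is the usual libvirt knob; for the system-wide ceiling, the `fs.file-max` sysctl.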

> On Jul 18, 2023, at 16:21, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
> 
> Hi,
> 
> Check your libvirt limits for QEMU open files/sockets. It seems that when
> you added the new OSDs, your librbd client hit its open-file limit.
> 
> 
> k
> Sent from my iPhone
> 
>> On 18 Jul 2023, at 19:32, Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx> wrote:
>> 
>> Did your automation / process pause between changes to allow peering to
>> complete? My hunch is that you caused a very large peering storm (a PG is
>> inactive while peering), which in turn caused your VMs to panic. If the
>> RBDs are unmapped and re-mapped, do they still struggle?
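One generic way to stage a change like this (a sketch, not necessarily the OP's procedure) is to gate data movement behind cluster flags and release them only once peering has settled:

```shell
# Hold rebalance/backfill while the reweights are applied, then release.
ceph osd set norebalance
ceph osd set nobackfill
# ...apply reweights in small batches here...
ceph status        # wait until no PGs are inactive/peering before continuing
ceph osd unset nobackfill
ceph osd unset norebalance
```

Both flags are stock Ceph; they pause data movement but still let peering itself complete.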
>> 
>> Respectfully,
>> 
>> *Wes Dillingham*
>> wes@xxxxxxxxxxxxxxxxx
>> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>> 
>> 
>>> On Tue, Jul 18, 2023 at 11:52 AM <
>>> fb2cd0fc-933c-4cfe-b534-93d67045a088@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> Starting on Friday, as part of adding a new pod of 12 servers, we
>>> initiated a reweight of roughly 384 drives, from 0.1 to 0.25. Something
>>> about the resulting large backfill is causing librbd to hang, requiring
>>> server restarts. The volumes show buffer I/O errors when this
>>> happens. We are currently using hybrid OSDs with both SSDs and
>>> traditional spinning disks. The current status of the cluster is:
>>> ceph --version
>>> ceph version 14.2.22
>>> Cluster Kernel 5.4.49-200
>>> {
>>>     "mon": {
>>>         "ceph version 14.2.22 nautilus (stable)": 3
>>>     },
>>>     "mgr": {
>>>         "ceph version 14.2.22 nautilus (stable)": 3
>>>     },
>>>     "osd": {
>>>         "ceph version 14.2.21 nautilus (stable)": 368,
>>>         "ceph version 14.2.22 nautilus (stable)": 2055
>>>     },
>>>     "mds": {},
>>>     "rgw": {
>>>         "ceph version 14.2.22 nautilus (stable)": 7
>>>     },
>>>     "overall": {
>>>         "ceph version 14.2.21 nautilus (stable)": 368,
>>>         "ceph version 14.2.22 nautilus (stable)": 2068
>>>     }
>>> }
>>> 
>>> HEALTH_WARN, noscrub,nodeep-scrub flag(s) set.
>>> pgs: 6815703/11016906121 objects degraded (0.062%) 2814059622/11016906121
>>> objects misplaced (25.543%).
>>> 
>>> The client servers are on 3.10.0-1062.1.2.el7.x86_6
>>> 
>>> We have found a couple of issues that look relevant:
>>> https://tracker.ceph.com/issues/19385
>>> https://tracker.ceph.com/issues/18807
>>> Has anyone experienced anything like this before? Does anyone have any
>>> recommendations as to settings that can help alleviate this while the
>>> backfill completes?
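The usual knobs for this are the backfill/recovery throttles, which can be injected at runtime (standard Nautilus options; the values below are illustrative starting points, not settings from this thread):

```shell
# Throttle backfill and recovery so client I/O keeps priority while the
# rebalance completes; revert once the cluster is back to HEALTH_OK.
ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
ceph tell 'osd.*' injectargs '--osd-recovery-sleep 0.1'
```

On hybrid OSDs, note that the `osd_recovery_sleep_hybrid`/`_hdd`/`_ssd` variants take precedence over the generic sleep.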
>>> An example of the buffer I/O errors:
>>> 
>>> Jul 17 06:36:08 host8098 kernel: buffer_io_error: 22 callbacks suppressed
>>> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical
>>> block 0, async page read
>>> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical
>>> block 0, async page read
>>> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical
>>> block 0, async page read
>>> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical
>>> block 0, async page read
>>> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical
>>> block 0, async page read
>>> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical
>>> block 0, async page read
>>> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical
>>> block 3, async page read
>>> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-5, logical
>>> block 511984, async page read
>>> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-6, logical
>>> block 3487657728, async page read
>>> Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-6, logical
>>> block 3487657729, async page read
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>> 



