Re: [EXTERNAL] RE: OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"

Hi Dave,

I think it's your disk sizing/utilization that makes your setup rather unique and apparently causes the issue.

First of all, you're using a custom 4K min_alloc_size, which wasn't the default before Pacific, aren't you?

2021-09-08T10:42:02.049+0000 7f705c4f2f00  1 bluestore(/var/lib/ceph/osd/ceph-2) _open_super_meta min_alloc_size 0x1000


Secondly, your main (and only) disk is relatively small - just 350 GiB:

2021-09-08T10:41:41.350+0000 7f705c4f2f00  1 bdev(0x55e9586e2000 /var/lib/ceph/osd/ceph-2/block) open size 375805444096 (0x577fc00000, 350 GiB) block_size 4096 (4 KiB) rotational discard not supported

And finally, disk space utilization is more than 90%; free space is down to ~30 GB (see the 'available' field below):

-114> 2021-09-08T10:42:18.880+0000 7f705c4f2f00 -1 bluestore(/var/lib/ceph/osd/ceph-2) allocate_bluefs_freespace failed to allocate on 0x3b3a0000 min_size 0x100000 > allocated total 0x4c30000 bluefs_shared_alloc_size 0x10000 allocated 0x0 available 0x 7090ac000
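
For reference, here's a minimal Python sketch decoding those hex fields into human-readable sizes (the values are copied verbatim from the log line above). It confirms the ~30 GB figure, and also shows that the failed request matches the second, 948 MiB gift discussed below:

GiB = 2 ** 30
fields = {
    "want": 0x3b3a0000,            # size BlueFS asked for
    "min_size": 0x100000,          # smallest acceptable extent (1 MiB)
    "allocated_total": 0x4c30000,  # all the allocator could find
    "available": 0x7090ac000,      # total free space on the device
}
for name, value in fields.items():
    print(f"{name:16s} {value:>14,d} bytes ~ {value / GiB:6.2f} GiB")

# want             993,656,832 bytes ~  0.93 GiB (the 948 MiB second gift)
# allocated_total   79,888,384 bytes ~  0.07 GiB (~76 MiB actually found)
# available     30,216,470,528 bytes ~ 28.14 GiB (~30 GB decimal)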


So below is what apparently happens:

BlueFS is asking for additional space to keep RocksDB data. In the observed case (OSD-2) it requests roughly 2GB in 1GB portions from the main device allocator.

-236> 2021-09-08T10:42:18.703+0000 7f705c4f2f00 10 bluestore(/var/lib/ceph/osd/ceph-2) allocate_bluefs_freespace gifting 1073545216 (1024 MiB)

...

-116> 2021-09-08T10:42:18.795+0000 7f705c4f2f00 10 bluestore(/var/lib/ceph/osd/ceph-2) allocate_bluefs_freespace gifting 993656832 (948 MiB)


The tricky thing is that BlueFS uses 64K allocation units while the main device uses 4K ones (your custom setting!!). Hence the main device allocator needs to find a bunch of available 64K-aligned, 64K-long chunks.

And the first 1GB request is granted the space while the second one isn't:

-114> 2021-09-08T10:42:18.880+0000 7f705c4f2f00 -1 bluestore(/var/lib/ceph/osd/ceph-2) allocate_bluefs_freespace failed to allocate on 0x3b3a0000 min_size 0x100000 > allocated total 0x4c30000 bluefs_shared_alloc_size 0x10000 allocated 0x0 available 0x 7090ac000

Given the pretty low amount of free space and the apparently high space fragmentation, it looks like the main allocator simply doesn't have enough contiguous 64K-aligned chunks to provide to BlueFS.
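
To illustrate the effect, here's a toy Python model (not actual BlueStore allocator code; the 10% free ratio and the random scatter are assumptions for demonstration, not your cluster's data):

import random

AU_4K, AU_64K = 4096, 65536
UNITS_PER_CHUNK = AU_64K // AU_4K     # 16 x 4K units per 64K chunk
DEVICE_UNITS = 1 << 20                # toy device: 1M x 4K units = 4 GiB

random.seed(42)
# ~10% of 4K units free, scattered at random - a stand-in for a nearly
# full, fragmented main device.
free = [random.random() < 0.10 for _ in range(DEVICE_UNITS)]

free_mib = sum(free) * AU_4K / 2**20

# A 64K-unit BlueFS allocation can only use 64K-aligned chunks whose
# sixteen underlying 4K units are ALL free.
usable_chunks = sum(
    all(free[i:i + UNITS_PER_CHUNK])
    for i in range(0, DEVICE_UNITS, UNITS_PER_CHUNK)
)

print(f"total free space:      {free_mib:8.1f} MiB")
print(f"usable as 64K chunks:  {usable_chunks * AU_64K / 2**20:8.1f} MiB")

Lots of free bytes, but next to nothing the BlueFS gifting path can actually take - which is exactly the 'allocated total 0x4c30000' vs 'available 0x 7090ac000' discrepancy in your log.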

Hence you should consider the following action points to work around the issue:

1) Use a 64K alloc size for the main device or a 4K one for BlueFS. Both require OSD redeployment and are actually off the mainstream path; the more proper way would be to move to Pacific instead.

2) Ensure more free space is present on the main device - not a 100% guarantee of avoiding the issue, but it reduces the probability of facing it.

3) Reduce main device space fragmentation by using the hybrid allocator from scratch - OSD redeployment is required as well. I don't remember what was wrong with this allocator for you before - some pending bugs or something? (A way to check the current fragmentation level is sketched below.)
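
If it helps to gauge how fragmented a given OSD's main device actually is before picking one of the options above, the allocator can report a fragmentation score over the admin socket. A minimal Python sketch, assuming the 'bluestore allocator score block' admin-socket command and its 'fragmentation_rating' output field (present in recent Nautilus/Octopus builds); run it on the OSD host:

import json
import subprocess

def fragmentation_score(osd_id: int) -> float:
    # Ask the OSD's admin socket for the BlueStore allocator's
    # fragmentation rating: 0.0 ~ no fragmentation, 1.0 ~ fully fragmented.
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}",
         "bluestore", "allocator", "score", "block"]
    )
    return json.loads(out)["fragmentation_rating"]

if __name__ == "__main__":
    print(f"osd.2 fragmentation rating: {fragmentation_score(2):.3f}")

A rating close to 1.0 would support the fragmentation theory above.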


Most likely the above analysis is applicable to the albans cluster as well, but I didn't check that in detail. Anyway, I'd highly recommend upgrading it to at least the latest Nautilus release - it's pretty outdated atm IMO.

As for the reported large omap objects - I don't think they're related to the issue.


Thanks,

Igor

On 9/29/2021 1:39 PM, Dave Piper wrote:

Some interesting updates on our end.

This cluster (condor) is in a multisite RGW zonegroup with another cluster (albans). Albans is still on nautilus and was healthy back when we started this thread. As a last resort, we decided to destroy condor and recreate it, putting it back in the zonegroup with albans to restore all its data. This worked, but shortly after completing the process, albans (still on nautilus) ran into the same issue we started out with on condor and originally raised this thread for.

So - we're now seeing this issue ("bluefs _allocate failed to expand slow device to fit...") on a nautilus rig, running ceph 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable). Once again 2 of the 3 OSDs are flapping. This is with both allocators set to "bitmap" in config. I've attached logs from one of these hosts (albans_sc1) in case there's any comparison to be made; the logs from the failure ultimately look the same to me.

Would that suggest this isn't specific to octopus at all? Or perhaps it's a result of having one cluster at octopus and one at nautilus within the same RGW zonegroup?

Something else I did wonder about - we've had alarms about "large omap objects" on these two clusters for several weeks now, certainly since before the OSDs started flapping. Currently albans is reporting 54 large omap objects. Condor, which has been running healthily on octopus again since we redeployed it, has 21. Could this be the underlying issue?

One final thought: we're using CephFS too, which I think is perhaps less commonly used than other Ceph features. Could that be related, and explain why we're seeing this when other ceph users aren't?

Any suggestions on next steps or other things to try would be greatly appreciated - we're out of ideas here!

Dave






