Re: [EXTERNAL] RE: OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"

On 9/30/2021 7:03 PM, Igor Fedotov wrote:
> 3) reduce main space fragmentation by using the Hybrid allocator from 
> scratch - OSD redeployment is required as well.
> 
>> We deployed these clusters at nautilus with the default allocator, which was bitmap I think? After redeploying condor on octopus, it seems to be running the hybrid allocator now.  So I think we've inadvertently already carried out this action on condor.
>
>
> You mean your condor's OSDs were redeployed on octopus and the hybrid 
> allocator has been in use ever since, don't you? If so then that's what 
> I mean. As far as I understand this didn't help and fragmentation is 
> still high enough to cause the NO-SPACE condition, right?

Correct. As you say, this didn't make a significant impact; we were still hitting the same issue.

Since then we've made decent progress. We've stopped the OSDs flapping and recovered the system by redeploying the existing 3 OSDs and adding a 4th. The additional capacity from the extra OSD, plus deleting some data once the OSDs were running again, has got the kit back into a stable state.

I think we got a bit confused along the way and managed to redeploy the OSDs without changing the alloc_size config, so we're still running with a mismatch between bluefs_shared_alloc_size and bluestore_min_alloc_size. That leaves us at risk of hitting this side effect again, and of continuing to pay whatever performance penalty the mismatch causes (we've never run with a different config, so we have no baseline to compare against). Our plan is to leave these rigs as they are for now. We don't want to bump bluestore_min_alloc_size back to 64K, as that would vastly increase our disk utilization; equally, dropping bluefs_shared_alloc_size to 4K now would require another OSD redeploy (quite a lot of effort) and potentially expose us to other, different issues since we're still on octopus.
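
In case it helps anyone hitting the same thing, this is roughly how we've been double-checking what a running OSD is actually using (osd.2 is just an example id; note the on-disk min_alloc_size is fixed at mkfs time, so the "_open_super_meta min_alloc_size" line in the OSD startup log is the authoritative value for that one):

  ceph config show osd.2 bluestore_min_alloc_size
  ceph config show osd.2 bluefs_shared_alloc_size
  # or ask the daemon directly over its admin socket:
  ceph daemon osd.2 config get bluefs_shared_alloc_size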

When we upgrade to pacific (not expected for another 6 months), we'll look to redeploy the OSDs so that bluefs_shared_alloc_size matches bluestore_min_alloc_size (4 KiB), and hopefully reap all the benefits then. In the meantime we're treating 85% as the new 100% in terms of disk utilization, to avoid ending up with flapping OSDs again.
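
For the 85% line we're mostly just keeping an eye on the %USE column below, and as a belt-and-braces measure making sure the nearfull warning fires at that level (0.85 is the default anyway, so the second command is more of a reminder than a change):

  ceph osd df tree
  ceph osd set-nearfull-ratio 0.85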

I'm hoping that's the end of our troubles. Thanks very much again for all your help Igor.

Kind regards,

Dave

-----Original Message-----
From: Igor Fedotov <ifedotov@xxxxxxx> 
Sent: 30 September 2021 17:03
To: Dave Piper <david.piper@xxxxxxxxxxxxx>; ceph-users@xxxxxxx
Subject: Re: [EXTERNAL] RE:  OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"


On 9/30/2021 6:28 PM, Dave Piper wrote:
> Thanks so much Igor, this is making a lot of sense.
>
>> First of all you're using custom 4K min_alloc_size which wasn't adapted before Pacific, aren't you?
> We've set bluestore_min_alloc_size = 4096 because we write a lot of small objects. Various sources recommended this as a solution to not over-allocating disk space. Is that the setting you meant? Other settings that look related are still at their defaults:
>
> bluefs_alloc_size                                           1048576
> bluefs_shared_alloc_size                            65536

Yep - bluestore_min_alloc_size and bluefs_shared_alloc_size are the key parameters in your case.

Having bluefs_shared_alloc_size higher than bluestore_min_alloc_size is what causes only a subset of the free chunks on the main device to be available to BlueFS: only chunks aligned to 64K are permitted, and their number is quite limited due to high fragmentation / high space utilization.
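
If you want to see how bad it is in practice, the allocator can report a fragmentation score over the admin socket - something along these lines should work (osd.2 is just an example id; if I recall correctly the score is in the 0..1 range, with higher meaning more fragmented):

  ceph daemon osd.2 bluestore allocator score block
  # and the raw free-extent list, to see how few 64K-aligned chunks are left:
  ceph daemon osd.2 bluestore allocator dump block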

>
>> ... disk space utilization is more than 90%, free space available is at ~30GB (see 'available' field below):
> You're right; disk space utilization is currently really high on this system. Somehow I'd failed to notice that; I'm not sure if it's crept up since we started debugging this, but based on your notes I suspect this has always been the cause.
>
>> The tricky thing is that BlueFS uses 64K allocation units while main device has got 4K ones (your custom setting!!). Hence main device allocator needs to find a bunch of available 64K-aligned chunks.
>> 1) use 64K alloc size for main device or 4K one for BlueFS. Both require OSD redeployment and actually out of mainstream. More proper way would be to use Pacific then.
> What do you mean by "out of mainstream"? We're pretty set on octopus 
> for now, and moving back to 64k alloc for main device will mean we 
> suddenly require a much bigger disk, due to all our small objects. So 
> I'm leaning towards setting 4k for bluestore alloc size. Is that 
> configurable in octopus? Is it "bluefs_shared_alloc_size"? How worried 
> should I be that you sound a bit nervous about this suggestion? :)

Using a non-default min_alloc_size is generally not recommended, primarily due to performance penalties; side effects (like yours) can be observed as well. It's simple - non-default parameters generally mean much worse QA coverage from the devs and less adoption/experience from users. 
Hence they're risky.

It's the Pacific release that allows 4K min_alloc_size to be used [almost/hopefully] without such penalties.


3) reduce main space fragmentation by using the Hybrid allocator from 
scratch - OSD redeployment is required as well.
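
To double-check which allocator a given OSD is actually running with, querying the daemon should do it - e.g. something like (osd.2 being just an example id):

  ceph daemon osd.2 config get bluestore_allocator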

> We deployed these clusters at nautilus with the default allocator, which was bitmap I think? After redeploying condor on octopus, it seems to be running the hybrid allocator now.  So I think we've inadvertently already carried out this action on condor.


You mean your condor's OSDs were redeployed on octopus and the hybrid 
allocator has been in use ever since, don't you? If so then that's what 
I mean. As far as I understand this didn't help and fragmentation is 
still high enough to cause the NO-SPACE condition, right?

>
> What's actually required for OSD redeployment?  Can I swap in new OSDs to replace the old ones one-by-one, or does the entire cluster (MONs etc) need to be recreated too?  We used ceph-ansible to deploy initially so I'm expecting to use the add-mon.yml playbooks to do this; I'll look into that.

One-by-one OSD replacement is enough. Monitors/MDSes are irrelevant to 
this issue.
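
Outside of ceph-ansible, the manual per-OSD sequence would be roughly the following - just a sketch from memory, so adapt the id/device to your setup, and remember to put the desired bluestore_min_alloc_size / bluefs_shared_alloc_size settings into the config *before* recreating the OSD, since min_alloc_size is baked in at mkfs time:

  ceph osd out osd.2                        # let data drain off this OSD
  ceph osd safe-to-destroy osd.2            # repeat until it reports the OSD is safe to remove
  systemctl stop ceph-osd@2
  ceph osd purge osd.2 --yes-i-really-mean-it
  ceph-volume lvm zap /dev/sdX --destroy    # /dev/sdX is a placeholder for the old device
  ceph-volume lvm create --bluestore --data /dev/sdX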


>
> Cheers,
>
> Dave
>
> -----Original Message-----
> From: Igor Fedotov <ifedotov@xxxxxxx>
> Sent: 29 September 2021 13:27
> To: Dave Piper <david.piper@xxxxxxxxxxxxx>; ceph-users@xxxxxxx
> Subject: Re: [EXTERNAL] RE:  OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"
>
> Hi Dave,
>
> I think it's your disk sizing/utilization that makes your setup rather unique and apparently causes the issue.
>
> First of all you're using custom 4K min_alloc_size which wasn't adapted before Pacific, aren't you?
>
> 2021-09-08T10:42:02.049+0000 7f705c4f2f00  1
> bluestore(/var/lib/ceph/osd/ceph-2) _open_super_meta min_alloc_size 0x1000
>
>
> Secondly, your main (and only) disk is relatively small - just 350GB:
>
> 2021-09-08T10:41:41.350+0000 7f705c4f2f00  1 bdev(0x55e9586e2000
> /var/lib/ceph/osd/ceph-2/block) open size 375805444096 (0x577fc00000,
> 350 GiB) block_size 4096 (4 KiB) rotational discard not supported
>
> And finally disk space utilization is more than 90%, free space
> available is at ~30GB (see 'available' field below):
>
> -114> 2021-09-08T10:42:18.880+0000 7f705c4f2f00 -1
> bluestore(/var/lib/ceph/osd/ceph-2) allocate_bluefs_freespace failed to
> allocate on 0x3b3a0000 min_size 0x100000 > allocated total 0x4c30000
> bluefs_shared_alloc_size 0x10000 allocated 0x0 available 0x 7090ac000
>
>
> So below is what apparently happens:
>
> BlueFS is asking for additional space to keep RocksDB data. In the
> observed case (OSD-2) it requests roughly 2GB in 1GB portions from the
> main device allocator.
>
> -236> 2021-09-08T10:42:18.703+0000 7f705c4f2f00 10
> bluestore(/var/lib/ceph/osd/ceph-2) allocate_bluefs_freespace gifting
> 1073545216 (1024 MiB)
>
> ...
>
>   -116> 2021-09-08T10:42:18.795+0000 7f705c4f2f00 10
> bluestore(/var/lib/ceph/osd/ceph-2) allocate_bluefs_freespace gifting
> 993656832 (948 MiB)
>
>
> The tricky thing is that BlueFS uses 64K allocation units while main
> device has got 4K ones (your custom setting!!). Hence main device
> allocator needs to find a bunch of available 64K-aligned chunks.
>
> And the first 1GB request is given the space while the second one isn't:
>
> -114> 2021-09-08T10:42:18.880+0000 7f705c4f2f00 -1
> bluestore(/var/lib/ceph/osd/ceph-2) allocate_bluefs_freespace failed to
> allocate on 0x3b3a0000 min_size 0x100000 > allocated total 0x4c30000
> bluefs_shared_alloc_size 0x10000 allocated 0x0 available 0x 7090ac000
>
> Given the pretty low amount of free space and apparently high space
> fragmentation, it looks like the main allocator simply doesn't have that
> many contiguous chunks to provide to BlueFS.
>
> Hence you should consider the following action points to work around the
> issue:
>
> 1) use 64K alloc size for main device or 4K one for BlueFS. Both require
> OSD redeployment and actually out of mainstream. More proper way would
> be to use Pacific then.
>
> 2) ensure more free space is present at the main device - not a 100%
> guarantee to avoid the issue, it just reduces the probability of facing it.
>
> 3) reduce main space fragmentation by using the Hybrid allocator from
> scratch - OSD redeployment is required as well. I don't remember what
> was wrong with this allocator before - some pending bugs or what?
>
>
> Most likely the above analysis is applicable to the albans cluster as
> well, but I didn't check that in detail. Anyway I'd highly recommend
> upgrading it to at least the latest Nautilus release - it's pretty
> outdated atm IMO.
>
> As for the reported large omap objects - I don't think they're related
> to the issue.
>
>
> Thanks,
>
> Igor
>
On 9/29/2021 1:39 PM, Dave Piper wrote:
>
>> Some interesting updates on our end.
>>
>> This cluster (condor) is in a multisite RGW zonegroup with another cluster (albans). Albans is still on nautilus and was healthy back when we started this thread. As a last resort, we decided to destroy condor and recreate it, putting it back in the zonegroup with albans to restore all its data. This worked, but shortly after completing the process albans (still on nautilus) ran into the same issue we started out with on condor, that we raised this thread for.
>>
>> So - we're now seeing this issue ("bluefs _allocate failed to expand slow device to fit...") on a nautilus rig, running ceph 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable). Once again 2 of the 3 OSDs are flapping. This is with both allocators set to "bitmap" in config. I've attached logs from one of these hosts (albans_sc1) in case there's any comparison to be made; the logs from the failure ultimately look the same to me.
>>
>> Would that suggest this isn't specific to octopus at all? Or perhaps it's a result of having one cluster at octopus and one at nautilus within the same RGW zonegroup?
>>
>> Something else I did wonder about - we've had alarms about "large omap objects" on these two clusters for several weeks now, certainly before the OSDs started flapping. Currently albans is reporting 54 large omap objects.  Condor, which is back running healthily again on octopus since we redeployed it, has 21. Could this be the underlying issue?
>>
>> One final thought: we're using CephFS too, which I think is perhaps less commonly used than other Ceph features. Could that be related, and explain why we're seeing this when other ceph users aren't?
>>
>> Any suggestion on next steps or other things to try would be greatly appreciated  - we're out of ideas here!
>>
>> Dave
>>
>>
>>
>>    
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



