On Fri, 2022-08-19 at 15:48 +0200, Stefan Kooman wrote:
> On 8/19/22 15:04, Frank Schilder wrote:
> > Hi Chris,
> >
> > looks like your e-mail stampede is over :) I will cherry-pick some
> > questions to answer, other things either follow or you will figure
> > it out with the docs and trial-and-error. The cluster set-up is
> > actually not that bad.
> >
> > 1) Set osd_op_queue_cut_off = high on global level. Even though it's
> > prefixed with osd_, it seems to actually be used by more daemons. After
> > setting it just on the OSDs on my mimic cluster, the MDSes crashed
> > until I set it on global level.
> >
> > 2) I think your PG nums on the pools are fine, you should aim for
> > between 100-200 PGs per OSD. On the metadata pool you can increase
> > it if you convert your OSDs one-by-one to LVM (mimic and later) and
> > deploy 2 or 4 OSDs per NVMe drive.
>
> If you are not running the latest luminous, I would advise doing so if
> you first want to make improvements / changes before upgrading.
>

Hi Stefan,

Thanks for the reply. I don't think we're on the latest, so that makes
sense, thanks.

> LVM support was gained by ceph-volume in 12.2.2 and support for
> bluestore added [1]. It was the first release we started with. So if
> you want, and I guess you are running > 12.2.2, it's possible to change
> to bluestore with LVM before upgrading to a newer release. I would
> recommend (certainly with many small files) to set the following
> property (nowadays the default in Ceph):
>
> # 4096 B instead of 16K (SSD) / 64K (HDD) to avoid large overhead for
> # small (cephFS) files
> bluestore_min_alloc_size_ssd = 4096
> bluestore_min_alloc_size_hdd = 4096
>
> Otherwise you will end up wasting a lot of space and might even run out
> of space during the upgrade. You cannot change that parameter afterwards
> (or at least it won't have any effect) ... so make sure you have that
> set before converting from filestore to bluestore.
>

Oh, great info, thanks very much! We do have lots of small files. All
the OSDs are on filestore at the moment, so I was definitely planning
on moving over to bluestore as part of trying to address these issues
(and expand the cluster, etc).

> # MEMORY ALLOCATOR
> bluestore_allocator = bitmap
> bluefs_allocator = bitmap
>
> Nowadays hybrid is the default memory allocator, but it is not
> available in luminous, and bitmap is better than stupid (the default in
> older luminous releases).
>
> I would postpone changing to bluestore though. You will need to do a
> conversion from Octopus -> Pacific, and this might take a lot of time
> for drives with a lot of OMAP. You won't need to do any of that when
> changing filestore -> bluestore on Pacific, and you benefit from the
> sharding in RocksDB and some other improvements in OMAP-related
> functionality (which CephFS will also benefit from). Then you also
> don't need to manually perform all the sharding for RocksDB. We have
> followed the L->M->N->O path, but in hindsight it would have been
> better to skip O and move directly to P. Not that O is bad, not at all
> actually; it's just that there is a lot of extra work involved and
> Pacific is mature enough by now. We are waiting for 16.2.10 to make
> the move to P. All the proper settings are there now by default.
>

OK, so basically it sounds like I should stick with filestore, upgrade
the cluster to Pacific to inherit the newer settings, and then do the
conversion to bluestore, which will avoid the manual sharding etc. Did
I understand correctly?
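Assuming I did, here is a rough sketch of what I think the config plus
the eventual per-OSD filestore -> bluestore rebuild on Pacific would
look like, purely from my reading of the docs. The OSD id (12), the
device names (/dev/sdX, /dev/nvme0n1) and the config section placement
are placeholders/assumptions on my part, so please shout if any of this
looks wrong:

# ceph.conf additions before rebuilding any OSDs
[global]
osd_op_queue_cut_off = high

[osd]
bluestore_min_alloc_size_ssd = 4096
bluestore_min_alloc_size_hdd = 4096
bluestore_allocator = bitmap
bluefs_allocator = bitmap

# rebuild one OSD at a time, reusing its id
ceph osd out 12
while ! ceph osd safe-to-destroy osd.12 ; do sleep 60 ; done
systemctl stop ceph-osd@12
ceph osd destroy 12 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 12

# for the NVMe metadata-pool drives, I'm assuming something like this
# would split each drive into two OSDs as Frank suggested
ceph-volume lvm batch --bluestore --osds-per-device 2 /dev/nvme0n1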
I expect I'll probably send a few more emails to the list before I
undertake the upgrades, to try to understand as many of the
optimisations, and avoid as many of the pitfalls, as possible.

Thanks very much!

-c

> My 2 cents.
>
> Gr. Stefan
>
> [1]. https://docs.ceph.com/en/latest/releases/luminous/#id44

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx