Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

On 8/19/22 15:04, Frank Schilder wrote:
Hi Chris,

looks like your e-mail stampede is over :) I will cherry-pick some questions to answer; other things either follow from that or you will figure them out with the docs and trial and error. The cluster set-up is actually not that bad.

1) Set osd_op_queue_cut_off = high at the global level. Even though it's prefixed with osd_, it seems to be used by other daemons as well. After setting it only on the OSDs on my mimic cluster, the MDSes crashed until I set it globally (see the snippet below).
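
A minimal sketch of how that could look in ceph.conf (the point is the [global] placement; on Mimic and later you could alternatively run "ceph config set global osd_op_queue_cut_off high"):

# ceph.conf
[global]
# osd_-prefixed, but set globally so the MDS daemons pick it up too
osd_op_queue_cut_off = high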

2) I think your PG nums on the pools are fine; you should aim for 100-200 PGs per OSD. On the metadata pool you can increase pg_num if you convert your OSDs one by one to LVM (mimic and later) and deploy 2 or 4 OSDs per NVMe drive (see the sketch below).
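
A rough sizing sketch with made-up numbers (the 100 OSDs, replica size 3, pool name and device path are placeholders, not taken from this thread):

# PGs per OSD ~= sum over pools of (pg_num * replica_size) / number_of_OSDs
# e.g. a pool with pg_num = 4096 and size = 3 on 100 OSDs adds 4096 * 3 / 100 ~= 123 PGs per OSD
ceph osd df                               # the PGS column shows the current per-OSD count
ceph osd pool get cephfs_metadata pg_num  # pool name is a placeholder

# deploying 2 OSDs per NVMe with ceph-volume (device path is a placeholder)
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1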

If you are not running the latest Luminous point release, I would advise upgrading to it first if you want to make improvements / changes before moving to a newer major release.
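
A quick way to check what each daemon is actually running (the command is available since Luminous):

# per-daemon-type breakdown of running versions
ceph versions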

ceph-volume gained LVM support in 12.2.2, along with support for bluestore [1]. That was the first release we started with. So if you want, and I guess you are running > 12.2.2, it's possible to change to bluestore with LVM before upgrading to a newer release. I would recommend (certainly with many small files) setting the following properties (nowadays the defaults in Ceph):

# 4096 B instead of 16K (SSD) / 64K (HDD) to avoid large overhead for small (cephFS) files
bluestore_min_alloc_size_ssd = 4096
bluestore_min_alloc_size_hdd = 4096

Otherwise you will end up wasting a lot of space and might even run out of space during the upgrade. You cannot change that parameter afterwards (or at least it won't have any effect), because it is baked in when the OSD is created ... so make sure you have it set before converting from filestore to bluestore.
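
A sketch of a per-OSD filestore -> bluestore conversion, roughly following the replace-one-OSD-at-a-time approach from the docs (OSD id 7 and /dev/sdX are placeholders; double-check the steps against the docs for your exact release):

# make sure the alloc-size settings above are in ceph.conf *before* re-creating the OSD
ceph osd out 7
# wait until the data has been re-replicated off the OSD, then:
systemctl stop ceph-osd@7
ceph osd destroy 7 --yes-i-really-mean-it
ceph-volume lvm zap --destroy /dev/sdX
ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 7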

# MEMORY ALLOCATOR
bluestore_allocator = bitmap
bluefs_allocator = bitmap

Nowadays hybrid is the default allocator, but it is not available in Luminous; bitmap is better than stupid (the default in older Luminous releases).
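
A quick way to confirm what a running OSD actually picked up (osd.0 is a placeholder; run this on the host where that OSD lives):

# query the daemon's admin socket
ceph daemon osd.0 config get bluestore_allocator
ceph daemon osd.0 config get bluefs_allocator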

I would postpone changing to bluestore, though. If you convert now, you will still need to do a conversion when going from Octopus -> Pacific, and that can take a lot of time for drives with a lot of OMAP. You won't need any of that if you change filestore -> bluestore once you are on Pacific, and you will benefit from the sharding in RocksDB and some other improvements to OMAP-related functionality (which CephFS will also benefit from). Then you also don't need to manually perform all the sharding for RocksDB (the sketch below shows what that manual step looks like).

We have followed the L -> M -> N -> O path, but in hindsight it would have been better to skip O and move directly to P. Not that O is bad, not at all actually; it's just that there is a lot of extra work involved and Pacific is mature enough by now. We are waiting for 16.2.10 to make the move to P. All the proper settings are there now by default.
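
For reference, this is roughly what the manual resharding on Pacific looks like for an OSD whose RocksDB was created before the sharded layout (the OSD id and the idea of reading the sharding spec from bluestore_rocksdb_cfs are assumptions; check the Pacific docs before running anything like this):

# stop the OSD, reshard its RocksDB column families, start it again
systemctl stop ceph-osd@7
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-7 \
    --sharding "$(ceph config get osd.7 bluestore_rocksdb_cfs)" reshard
systemctl start ceph-osd@7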

My 2 cents.

Gr. Stefan

[1] https://docs.ceph.com/en/latest/releases/luminous/#id44


