Dear Igor,

thanks a lot for the analysis and recommendations.

> Here is a brief analysis:
>
> 1) Your DB is pretty large - 27GB at the DB device (making it full) and
> 279GB at the main spinning one, i.e. RocksDB is experiencing huge
> spillover to the slow main device - expect a performance drop. And in
> general the DB is highly under-provisioned.

Yes, we have known about this issue for a long time. This cluster, and in
particular its SSD devices, were dimensioned in the pre-BlueStore days.
We haven't yet found a viable migration path towards something more
sensible (with ~1500 OSDs on two separate clusters and quite a bit of
user data on them).

> 2) Main device space is highly fragmented - 0.84012572151981013, where
> 1.0 is the maximum. Can't say for sure, but I presume it's pretty full
> as well.

Not too full:

$ ceph osd df | sort -n
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP   META    AVAIL   %USE  VAR  PGS STATUS
[...]
 3   hdd 7.27699  1.00000 7.3 TiB 4.6 TiB 4.3 TiB 49 MiB 319 GiB 2.7 TiB 63.46 1.10   0   down

> The above are indirect factors in the current failure, though;
> primarily I just want to make you aware of them, since they might
> cause other issues later on.

Thanks.

> The major reason preventing the OSD from starting properly is a BlueFS
> attempt to claim additional space (~52GB), see the log:
[...]
> I can suggest the following workarounds to start the OSD for now:
>
> 1) Switch the allocator to 'stupid' by setting the 'bluestore allocator'
> parameter to 'stupid' (I presume you have the default setting of
> 'bitmap' now). This will allow more contiguous allocations for the
> BlueFS space claim, and hence a shorter log write. But given the high
> fragmentation of the main disk, this might not be enough. The 'stupid'
> allocator has some issues as well (e.g. high RAM utilization over time
> in some cases), but they're rather irrelevant for OSD startup.

Thanks, we'll try that & report (the exact commands we plan to run are
in the P.S. below).

> 2) Increase the 'bluefs_max_log_runway' parameter to 8-12 MB (the
> default value is 4 MB).
>
> I suggest starting with 1) and then additionally proceeding with 2) if
> the first one doesn't help.
>
> Once the OSD is up and the cluster is healthy, please consider adding
> more DB space and/or OSDs to your cluster to fight the dangerous
> factors I started with.
>
> BTW, I'm wondering what the primary payload for your cluster is - RGW
> or something else?

The payload has changed over the lifetime of the cluster (which has been
in operation for more than four years, growing and being upgraded).
Initially it was almost exclusively RBD (for OpenStack VMs), then we
added RadosGW (still all with 3-way replication). As RadosGW/S3 became
more popular, we added an EC 8+3 pool. (We also added an NVMe-only pool,
which is used for RadosGW indexes.) Lately this EC 8+3 pool has become
very popular, and users have been storing hundreds of terabytes on it.
Unfortunately they tend to use a small object size (~1 MB per object).
That's why we have close to a billion objects in the EC pool now, and
things start to fail.

As I said, it's a problem of finding a viable migration path to a better
configuration. Unfortunately we cannot just throw away the current
installation and start from scratch...

Cheers,
--
Simon.
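
P.S. For the record, here is roughly what we intend to try, following
your suggestions. This is only a sketch: it assumes a release with the
centralized config store ('ceph config set'); the same settings could
just as well go into the [osd] section of ceph.conf on the OSD host.
osd.3 is simply the down OSD from the 'ceph osd df' output above.

$ ceph config set osd.3 bluestore_allocator stupid     # workaround 1): replace the default bitmap allocator
$ ceph config set osd.3 bluefs_max_log_runway 8388608  # workaround 2): 8 MiB log runway, only if 1) alone is not enough
$ systemctl start ceph-osd@3                           # on the OSD host

And once the OSD is back up and the cluster healthy again, revert to the
defaults so they take effect on the next restart:

$ ceph config rm osd.3 bluestore_allocator
$ ceph config rm osd.3 bluefs_max_log_runway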