Hey Stefan,
thanks for getting back to me!
On 10/02/2022 10:05, Stefan Schueffler wrote:
> Since my last mail in December, we changed our Ceph setup like this:
> We added one SSD OSD on each Ceph host (which were pure HDD before). Then we moved the problematic pool "de-dus5.rgw.buckets.index" to those dedicated SSDs (by adding a corresponding CRUSH rule).
> Since then, no further PG corruptions have occurred.
> This now has a two-sided result:
> On the one hand, we no longer observe the problematic behavior;
> on the other hand, it means that something in Ceph is buggy when using just spinning HDDs. Even if the HDDs cannot keep up with the I/O load, that should not lead to data/PG corruption…
> And, just a blind guess: we only have a few I/O requests per second on our RGW gateway - even with spinning HDDs it should not be a problem to store/update the index pool.
> I would guess it correlates with our setup having 7001 shards in the problematic bucket, combined with the "multisite" feature, which issues 7001 "status" requests per second to check and synchronize between the different RGW sites. And _this_ amount of random I/O cannot be satisfied by HDDs…
> Anyway, it should not lead to corrupted PGs.
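(For the archives: moving an index pool onto dedicated SSDs is usually just a device-class CRUSH rule plus a pool assignment, roughly like this - the rule name is a placeholder, the pool name is the one from your mail:)

  # replicated rule that only picks OSDs with device class "ssd"
  ceph osd crush rule create-replicated rgw-index-ssd default host ssd
  # point the index pool at that rule
  ceph osd pool set de-dus5.rgw.buckets.index crush_rule rgw-index-ssd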
We also have a multi-site setup, with one HDD-only cluster and one
cluster (the primary) with NVMe SSDs for the OSD journaling.
There are more inconsistencies on the HDD-only cluster, but we do
observe those on the other cluster as well.
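In case it helps with comparing notes, the affected objects can be
inspected per PG with something like this (the index pool name and the
pg id are placeholders):

  # PGs in the index pool that scrubbing flagged as inconsistent
  rados list-inconsistent-pg <index-pool>
  # per-object details for one of them
  rados list-inconsistent-obj <pgid> --format=json-pretty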
If you follow the issue at https://tracker.ceph.com/issues/53663, there
is now another user (Dieter Roels) observing this issue as well.
He suspects that RADOSGW crashes are potentially causing the
inconsistencies. We had already guessed it could be the rolling
restarts, but we cannot put our finger on it yet.
And yes, no amount of I/O contention should ever cause data corruption.
In this case I believe there might be a correlation with the multisite
feature hitting OMAP and the stored metadata much harder than regular
RADOSGW usage does.
And if there is a race condition, a missing lock/semaphore, or something
along those lines, it would certainly be affected by the latency of the
underlying storage.
Could you maybe trigger a manual deep-scrub on all your OSDs, just to
see if that turns up anything?
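Something along these lines should kick it off (the pool name is the one
from your mail; older releases may want '*' instead of 'all', and the
per-pool variant only exists on newer releases):

  # deep-scrub all OSDs
  ceph osd deep-scrub all
  # or limit it to the index pool / a single PG
  ceph osd pool deep-scrub de-dus5.rgw.buckets.index
  ceph pg deep-scrub <pgid>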
Thanks again for keeping in touch!
Regards
Christian