Re: ceph_assert(start >= coll_range_start && start < coll_range_end)

Speaking of test cluster - there are multiple objects in the test pool, right?

If so, could you please create a new pool and put just a single object with the problematic name there, then do the deep scrub. Is the issue reproducible this way?
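Something along these lines should do it (just a sketch; the pool name, pg count and data file are placeholders, and the pg id comes from the "ceph osd map" output):

# create a small replicated test pool (placeholder name and pg_num)
ceph osd pool create single-obj-test 8 8 replicated
# put only the problematic object there
rados -p single-obj-test put c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000 /tmp/obj.4m
# find the PG the object maps to
ceph osd map single-obj-test c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000
# then deep-scrub that PG (pgid taken from the output above)
ceph pg deep-scrub <pgid>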


Thanks,

Igor

On 2/10/2022 3:27 PM, Manuel Lausch wrote:
Hi Igor,

yes, I just put an object with "rados put" using the problematic name and
4 MB of random data, the same size as the object in the production cluster.
A deep-scrub afterwards produces the following error in the OSD log:

2022-02-09T11:16:42.739+0100 7f0ce58f5700 -1 log_channel(cluster) log [ERR] : 1.fff deep-scrub : stat mismatch, got 3327/3328 objects, 0/0 clones, 3327/3328 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 5012860424/5017054728 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
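
For reference, the put was done roughly like this (sketch; the data file and pool name are just examples):

# 4 MB of random data, same size as the production object
dd if=/dev/urandom of=/tmp/obj.4m bs=4M count=1
rados -p testpool put c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000 /tmp/obj.4m
# deep-scrub the PG from the error above
ceph pg deep-scrub 1.fff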


On the Nautilus cluster I have BlueStore and FileStore mixed.
The deep-scrub in this cluster logs a "missing object" in the OSD log. I
don't know whether this differs from Pacific because of the newer version
or because of the BlueStore-only vs. mixed setup.

On my Pacific test cluster all the OSDs are running with BlueStore.

A "rados get" of the object works and the content of the object is
correct.
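(Checked roughly like this; the file names are just examples:)

rados -p testpool get c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000 /tmp/obj.check
cmp /tmp/obj.4m /tmp/obj.check && echo "content matches"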


Manuel


On Thu, 10 Feb 2022 14:56:58 +0300
Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

Hi Manuel,

could you please elaborate a bit on the reproduction steps in 16.2.6:

1) Do you just put the object named this way with the rados tool into a
replicated pool, and subsequent deep scrubs report the error? Or are there
other steps involved?

2) Do you have an all-BlueStore setup for that Pacific cluster, or is there
a mixture of BlueStore and FileStore OSDs?


Thanks,

Igor


On 2/10/2022 12:06 PM, Manuel Lausch wrote:
Okay, the issue is triggered by a specific object name
->
c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000

With this name I could trigger at least the scrub issues
on Ceph Pacific 16.2.6 as well.

I opened a bug ticket for this issue:
https://tracker.ceph.com/issues/54226
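
If it helps, the PG that this name maps to can be checked directly (sketch; the pool name is a placeholder):

# show which PG and OSDs the problematic name maps to
ceph osd map <pool> c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000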




On Tue, 8 Feb 2022 14:35:58 +0100
Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:
Okay, I definitely need some help here.

The crashing OSD moved with the PG, so the PG seems to have the issue.

I moved (via upmaps) all 4 replicas to FileStore OSDs. After this the
error seemed to be solved; no OSD crashed any more.
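
For the record, the moves were done roughly like this (sketch; the OSD ids are placeholders, the PG id is the affected one from the logs):

# pin the PG's replicas to the chosen FileStore OSDs, one <from> <to> pair per replica
# (requires require-min-compat-client luminous or newer)
ceph osd pg-upmap-items 1.7fff 12 105 34 107 56 109 78 111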

A deep-scrub of the PG didn't throw any errors. So I moved the first
shard back to a BlueStore OSD. This worked flawlessly as well.

A deep-scrub after this showed one object missing: the same one that was
obviously the cause of the prior crashes.

A repair seemed to fix the object, but a further deep-scrub brings back
the same error.
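
That was the usual cycle (sketch):

# see what scrub flagged, repair, then deep-scrub again
rados list-inconsistent-obj 1.7fff --format=json-pretty
ceph pg repair 1.7fff
ceph pg deep-scrub 1.7fff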

Even putting the object again with rados put didn't help. Now I have
two "missing" objects (the head and the snapshot from overwriting).


Here are the scrub error and the repair from the OSD log:
2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:head : missing
2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 1 missing, 0 inconsistent objects
2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 1 errors

2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:head : missing
2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff repair 1 missing, 0 inconsistent objects
2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff repair 1 errors, 1 fixed


And here is the new scrub error with the two missing objects:
2022-02-08 14:19:10.990 7f600dfec700  0 log_channel(cluster) log [DBG] : 1.7fff deep-scrub starts
2022-02-08 14:25:17.749 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:974 : missing
2022-02-08 14:25:17.749 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:head : missing
2022-02-08 14:25:17.750 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 2 missing, 0 inconsistent objects
2022-02-08 14:25:17.750 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 2 errors


Can someone help me here? I don't have a clue.


Regards
Manuel

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


