Re: ceph_assert(start >= coll_range_start && start < coll_range_end)

Hi Igor,

yes, I just put an object with "rados put" using the problematic name
and 4MB of random data, the same size as the object in the production
cluster. A deep-scrub afterwards produces the following error in the
OSD log:

2022-02-09T11:16:42.739+0100 7f0ce58f5700 -1 log_channel(cluster) log [ERR] : 1.fff deep-scrub : stat mismatch, got 3327/3328 objects, 0/0 clones, 3327/3328 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 5012860424/5017054728 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
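
For reference, the reproduction is roughly the following; the pool name
and the data file are only placeholders from my test setup:

  # 4MB of random data, matching the production object size
  dd if=/dev/urandom of=random_4mb.bin bs=4M count=1
  rados -p testpool put c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000 random_4mb.bin
  ceph pg deep-scrub 1.fff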


On the Nautilus cluster I have BlueStore and FileStore mixed. A
deep-scrub in this cluster logs a "missing object" in the OSD log. I
don't know whether this differs from Pacific because of the newer
version or because of BlueStore-only vs. mixed.

On my Pacific test cluster all the OSDs are running with BlueStore.

"rados get" of the object works and the content of the object is
correct.
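
The check was along these lines (same placeholder pool name as above):

  rados -p testpool get c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000 /tmp/check.bin
  cmp /tmp/check.bin random_4mb.bin   # no output, so the contents match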


Manuel


On Thu, 10 Feb 2022 14:56:58 +0300
Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

> Hi Manuel,
> 
> could you please elaborate a bit on the reproduction steps in 16.2.6:
> 
> 1) Do you just put the object named this way with the rados tool to a 
> replicated pool, and subsequent deep scrubs report the error? Or are 
> some other steps involved?
> 
> 2) Do you have an all-BlueStore setup for that Pacific cluster, or is 
> there a mixture of BlueStore and FileStore OSDs?
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 2/10/2022 12:06 PM, Manuel Lausch wrote:
> > Okay, the issue is triggered by a specific object name:
> >
> > c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000
> >
> > With this name I could trigger at least the scrub issues on Ceph
> > Pacific 16.2.6 as well.
> >
> > I opened a bug ticket for this issue:
> > https://tracker.ceph.com/issues/54226
> >
> >
> >
> >
> > On Tue, 8 Feb 2022 14:35:58 +0100
> > Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:
> >  
> >> Okay, I definitely need some help here.
> >>
> >> The crashing OSD moved with the PG, so the PG seems to have the issue.
> >>
> >> I moved (via upmaps) all 4 replicas to FileStore OSDs. After this the
> >> error seemed to be solved; no OSD crashed anymore.
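> >>
> >> For the record, the per-replica moves were done with pg-upmap-items
> >> along these lines (the OSD ids are only examples, not the real ones):
> >>
> >>   ceph osd pg-upmap-items 1.7fff 12 45 13 46
> >>
> >> which remaps this PG's replicas from osd.12 to osd.45 and from
> >> osd.13 to osd.46.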
> >>
> >> A deep-scrub of the PG didn't throw any error, so I moved the first
> >> shard back to a BlueStore OSD. This worked flawlessly as well.
> >>
> >> A deep-scrub after this showed one object missing: the same object
> >> that was obviously the cause of the prior crashes.
> >>
> >> A repair seemed to fix the object, but a further deep-scrub brought
> >> back the same error.
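> >>
> >> That cycle is just the standard per-PG commands:
> >>
> >>   ceph pg repair 1.7fff
> >>   ceph pg deep-scrub 1.7fff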
> >>
> >> Even putting the object again with "rados put" didn't help; now I have
> >> two "missing" objects (the head and the snapshot from overwriting).
> >>
> >>
> >> Here are the scrub error and repair entries from the OSD log:
> >> 2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:head : missing
> >> 2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 1 missing, 0 inconsistent objects
> >> 2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 1 errors
> >>
> >> 2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:head : missing
> >> 2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff repair 1 missing, 0 inconsistent objects
> >> 2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff repair 1 errors, 1 fixed
> >>
> >>
> >> And here the new scrub error with the two missing objects:
> >> 2022-02-08 14:19:10.990 7f600dfec700  0 log_channel(cluster) log [DBG] : 1.7fff deep-scrub starts
> >> 2022-02-08 14:25:17.749 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:974 : missing
> >> 2022-02-08 14:25:17.749 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:head : missing
> >> 2022-02-08 14:25:17.750 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 2 missing, 0 inconsistent objects
> >> 2022-02-08 14:25:17.750 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 2 errors
> >>
> >>
> >> Can someone help me here? I don't have any clue.
> >>
> >>
> >> Regards
> >> Manuel
> >>  
> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


