Re: ceph_assert(start >= coll_range_start && start < coll_range_end)

Okay, I definitely need some help here.

The crashes moved along with the PG, so the PG itself seems to have the issue.

I moved (via upmaps) all 4 replicas to filestore OSDs. After that the
error seemed to be solved: no OSD crashed anymore.
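
For reference, the moves were done via pg-upmap-items, roughly like
this (the OSD IDs here are just placeholders):

  ceph osd pg-upmap-items 1.7fff 410 123 411 124

i.e. one <from> <to> pair per replica that should move. Removing the
entry again with "ceph osd rm-pg-upmap-items 1.7fff" lets the PG go
back to its normal CRUSH placement.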

A deep-scrub of the PG didn't throw any errors, so I moved the first
shard back to a bluestore OSD. This worked flawlessly as well.

A deep-scrub after that showed one object missing, the same object
that was obviously the cause of the prior crashes.

A repair seemed to fix the object, but a further deep-scrub brings
back the same error.
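
The scrubs and the repair were triggered with the usual commands:

  ceph pg deep-scrub 1.7fff
  ceph pg repair 1.7fff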

Even putting the object again with rados put didn't help. Now I have
two "missing" objects (the head and the snapshot from the overwrite).


Here are the scrub error and the repair output from the OSD log:
2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:head : missing
2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 1 missing, 0 inconsistent objects
2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 1 errors

2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:head : missing
2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff repair 1 missing, 0 inconsistent objects
2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff repair 1 errors, 1 fixed


And here is the new scrub error with the two missing objects:
2022-02-08 14:19:10.990 7f600dfec700  0 log_channel(cluster) log [DBG] : 1.7fff deep-scrub starts
2022-02-08 14:25:17.749 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:974 : missing
2022-02-08 14:25:17.749 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:head : missing
2022-02-08 14:25:17.750 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 2 missing, 0 inconsistent objects
2022-02-08 14:25:17.750 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 2 errors


Can someone help me here? I don't have a clue.
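
If it helps, I can also provide the detailed inconsistency report,
e.g. the output of:

  rados list-inconsistent-obj 1.7fff --format=json-pretty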


Regards
Manuel

On Mon, 7 Feb 2022 16:51:16 +0100
Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:

> Hi,
> 
> I am migrating from filestore to bluestore (the workflow is: drain
> the OSD and reformat it with bluestore).
> 
> Now I have two OSDs which crash at the same time with the following
> error. Restarting the OSDs works for some time until they crash
> again.
> 
>    -40> 2022-02-07 16:28:20.489 7f550723a700 20 bluestore(/var/lib/ceph/osd/ceph-410).collection(1.7fff_head 0x564161314600)  r 0 v.len 512
>    -39> 2022-02-07 16:28:20.489 7f550723a700 15 bluestore(/var/lib/ceph/osd/ceph-410) getattrs 1.7fff_head #1:ffffffeb:::9b6886fa3639e64c892813ba7c9da9f4411f0a5fb73c89517b5f3f68acdaa388_00400000:head#
>    -38> 2022-02-07 16:28:20.489 7f550723a700 10 bluestore(/var/lib/ceph/osd/ceph-410) getattrs 1.7fff_head #1:ffffffeb:::9b6886fa3639e64c892813ba7c9da9f4411f0a5fb73c89517b5f3f68acdaa388_00400000:head# = 0
>    -37> 2022-02-07 16:28:20.489 7f550723a700 10 bluestore(/var/lib/ceph/osd/ceph-410) stat 1.7fff_head #1:ffffffef:::bda22ca861e6999694841deb44bce5d37d7c35d0ffc9387d649d80acf818c341_0014f39d:head#
>    -36> 2022-02-07 16:28:20.489 7f550723a700 20 bluestore(/var/lib/ceph/osd/ceph-410).collection(1.7fff_head 0x564161314600) get_onode oid #1:ffffffef:::bda22ca861e6999694841deb44bce5d37d7c35d0ffc9387d649d80acf818c341_0014f39d:head# key 0x7f8000000000000001ffffffef216264'a22ca861e6999694841deb44bce5d37d7c35d0ffc9387d649d80acf818c341_0014f39d!='0xfffffffffffffffeffffffffffffffff'o'
>    -35> 2022-02-07 16:28:20.489 7f550723a700 20 bluestore(/var/lib/ceph/osd/ceph-410).collection(1.7fff_head 0x564161314600)  r 0 v.len 843
>    -34> 2022-02-07 16:28:20.489 7f550723a700 15 bluestore(/var/lib/ceph/osd/ceph-410) getattrs 1.7fff_head #1:ffffffef:::bda22ca861e6999694841deb44bce5d37d7c35d0ffc9387d649d80acf818c341_0014f39d:head#
>    -33> 2022-02-07 16:28:20.489 7f550723a700 10 bluestore(/var/lib/ceph/osd/ceph-410) getattrs 1.7fff_head #1:ffffffef:::bda22ca861e6999694841deb44bce5d37d7c35d0ffc9387d649d80acf818c341_0014f39d:head# = 0
>    -32> 2022-02-07 16:28:20.489 7f550723a700 10 bluestore(/var/lib/ceph/osd/ceph-410) stat 1.7fff_head #1:fffffffb:::98c8a3708cceb042f5ec0d5dd49416968adc95cf6019796fdf6ae1a1f7fd929d_00400000:head#
>    -31> 2022-02-07 16:28:20.489 7f550723a700 20 bluestore(/var/lib/ceph/osd/ceph-410).collection(1.7fff_head 0x564161314600) get_onode oid #1:fffffffb:::98c8a3708cceb042f5ec0d5dd49416968adc95cf6019796fdf6ae1a1f7fd929d_00400000:head# key 0x7f8000000000000001fffffffb213938'c8a3708cceb042f5ec0d5dd49416968adc95cf6019796fdf6ae1a1f7fd929d_00400000!='0xfffffffffffffffeffffffffffffffff'o'
>    -30> 2022-02-07 16:28:20.489 7f550723a700 20 bluestore(/var/lib/ceph/osd/ceph-410).collection(1.7fff_head 0x564161314600)  r 0 v.len 512
>    -29> 2022-02-07 16:28:20.489 7f550723a700 15 bluestore(/var/lib/ceph/osd/ceph-410) getattrs 1.7fff_head #1:fffffffb:::98c8a3708cceb042f5ec0d5dd49416968adc95cf6019796fdf6ae1a1f7fd929d_00400000:head#
>    -28> 2022-02-07 16:28:20.489 7f550723a700 10 bluestore(/var/lib/ceph/osd/ceph-410) getattrs 1.7fff_head #1:fffffffb:::98c8a3708cceb042f5ec0d5dd49416968adc95cf6019796fdf6ae1a1f7fd929d_00400000:head# = 0
>    -27> 2022-02-07 16:28:20.494 7f550723a700 15 bluestore(/var/lib/ceph/osd/ceph-410) collection_list 1.7fff_head start #1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:0# end #MAX# max 2147483647
>    -26> 2022-02-07 16:28:20.494 7f550723a700 20 bluestore(/var/lib/ceph/osd/ceph-410) _collection_list range #-3:fffe0000::::0#0 to #-3:ffffffff::::0#0 and #1:fffe0000::::0#0 to #1:ffffffff::::0#0 start #1:ffffffff:::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_00400000:0#
>    -25> 2022-02-07 16:28:20.506 7f550723a700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el8/BUILD/ceph-14.2.22/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_collection_list(BlueStore::Collection*, const ghobject_t&, const ghobject_t&, int, bool, std::vector<ghobject_t>*, ghobject_t*)' thread 7f550723a700 time 2022-02-07 16:28:20.495642  
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el8/BUILD/ceph-14.2.22/src/os/bluestore/BlueStore.cc: 10157: FAILED ceph_assert(start >= coll_range_start && start < coll_range_end)
> 
> 
> Please let me know if you need more log lines.
> 
> The pool is replicated with size 4.
> The two problematic OSDs are running on bluestore; the two other
> replicas are running on filestore.
> 
> 
> What is happening here and how can I fix it?
> 
> 
> Ceph Version: 14.2.22
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


