When I increased the debug level of the RGW sync client to 20, I got this:

2024-12-23T09:42:17.248+0000 7f124866b700 20 register_request mgr=0x5633b4b8d958 req_data->id=79162099, curl_handle=0x5633bb89ee60
2024-12-23T09:42:17.248+0000 7f124866b700 20 run: stack=0x5633b51952c0 is io blocked
2024-12-23T09:42:17.248+0000 7f1248e6c700 20 link_request req_data=0x5633bc240d20 req_data->id=79162099, curl_handle=0x5633bb89ee60
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b50bb100:25RGWMetaSyncShardControlCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b520f400:18RGWMetaSyncShardCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 10 RGW-SYNC:meta:shard[0]: start full sync
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b7731500:20RGWContinuousLeaseCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b520f400:18RGWMetaSyncShardCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 run: stack=0x5633b50f1540 is_blocked_by_stack()=0 is_sleeping=1 waiting_for_child()=0
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 enqueued request req=0x5633be14c500
2024-12-23T09:42:17.252+0000 7f124866b700 20 RGWWQ:
2024-12-23T09:42:17.252+0000 7f124866b700 20 req: 0x5633be14c500
2024-12-23T09:42:17.252+0000 7f124866b700 20 run: stack=0x5633bb892000 is io blocked
2024-12-23T09:42:17.252+0000 7f125167d700 20 dequeued request req=0x5633be14c500
2024-12-23T09:42:17.252+0000 7f125167d700 20 RGWWQ: empty
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b7731500:20RGWContinuousLeaseCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 run: stack=0x5633bb892000 is io blocked
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b520f400:18RGWMetaSyncShardCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 10 RGW-SYNC:meta:shard[0]: took lease
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b796ce00:21RGWRadosGetOmapKeysCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b796ce00:21RGWRadosGetOmapKeysCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 run: stack=0x5633b50f1540 is io blocked
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b796ce00:21RGWRadosGetOmapKeysCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b796ce00:21RGWRadosGetOmapKeysCR: operate() returned r=-2
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b520f400:18RGWMetaSyncShardCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 0 meta sync: ERROR: full_sync(): RGWRadosGetOmapKeysCR() returned ret=-2
2024-12-23T09:42:17.252+0000 7f124866b700 0 RGW-SYNC:meta:shard[0]: ERROR: failed to list omap keys, status=-2
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b520f400:18RGWMetaSyncShardCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 run: stack=0x5633b50f1540 is_blocked_by_stack()=0 is_sleeping=0 waiting_for_child()=1
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b7731500:20RGWContinuousLeaseCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:22RGWSimpleRadosUnlockCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:22RGWSimpleRadosUnlockCR: operate()
2024-12-23T09:42:17.252+0000 7f124866b700 20 enqueued request req=0x5633bd10ea20

And when I run this to list the OMAP values of one of the metadata objects, the output is empty:

rados -p s3-cdn-dc07.rgw.meta listomapvals -N root .bucket.meta.XXXX:XXXX.9
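
For anyone who wants to reproduce the check across the whole pool: since the .meta pool keeps its objects in namespaces, something like the following rough sketch walks every namespace/object and counts its omap keys. The pool name is the one from this cluster; I am assuming here that "rados ls --all" prints the namespace and the object name separated by a tab (adjust the parsing if your output looks different):

POOL=s3-cdn-dc07.rgw.meta
# Walk every namespace/object in the meta pool and count its omap keys.
# Assumption: "rados ls --all" prints "<namespace><TAB><object>" per line,
# with a blank namespace field meaning the default namespace.
rados -p "$POOL" ls --all | while IFS=$'\t' read -r ns obj; do
    keys=$(rados -p "$POOL" -N "$ns" listomapkeys "$obj" 2>/dev/null | wc -l)
    echo "ns='$ns' obj='$obj' omap_keys=$keys"
done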

When I list the PGs of the meta pool with ceph pg ls-by-pool s3-cdn-dc07.rgw.meta, several PGs have a non-zero BYTES value while OMAP_BYTES and OMAP_KEYS are 0 (the rows wrapped in * below):

PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP LAST_SCRUB_DURATION SCRUB_SCHEDULING
10.0 14 0 0 0 3364 1676 9 957 active+clean 6h 57938'32678 57939:9604093 [70,88,46]p70 [70,88,46]p70 2024-12-23T04:57:43.792352+0000 2024-12-21T22:45:49.820084+0000 1 periodic scrub scheduled @ 2024-12-24T16:29:19.844180+0000
*10.1 10 0 0 0 3073 0 0 49 active+clean 14h 57715'25516 57939:8975816 [0,60,52]p0 [0,60,52]p0 2024-12-22T20:40:57.494355+0000 2024-12-17T17:43:43.383320+0000 1 periodic scrub scheduled @ 2024-12-23T22:30:45.498616+0000*
10.2 15 0 0 0 3706 188 1 132 active+clean 27h 57938'26080 57938:8966743 [39,14,1]p39 [39,14,1]p39 2024-12-22T08:05:34.147330+0000 2024-12-22T08:05:34.147330+0000 1 periodic scrub scheduled @ 2024-12-23T12:00:40.771526+0000
10.3 14 0 0 0 3898 2234 11 414 active+clean 2h 57939'36531 57939:9388345 [40,78,6]p40 [40,78,6]p40 2024-12-23T09:00:29.100038+0000 2024-12-16T12:58:31.054537+0000 1 periodic scrub scheduled @ 2024-12-24T14:36:29.705903+0000
*10.4 13 0 0 0 3346 0 0 67 active+clean 25h 57715'25662 57939:8964482 [34,44,25]p34 [34,44,25]p34 2024-12-22T09:26:16.897803+0000 2024-12-16T23:09:15.271909+0000 1 periodic scrub scheduled @ 2024-12-23T20:57:45.809155+0000
10.5 13 0 0 0 3567 0 0 2904 active+clean 28h 57939'120020 57939:9016514 [66,37,91]p66 [66,37,91]p66 2024-12-22T07:03:26.677870+0000 2024-12-19T09:16:25.987456+0000 1 periodic scrub scheduled @ 2024-12-23T17:12:34.661323+0000
10.6 7 0 0 0 2159 0 0 39 active+clean 15h 57938'25500 57939:9961651 [19,24,56]p19 [19,24,56]p19 2024-12-22T19:16:11.782123+0000 2024-12-22T19:16:11.782123+0000 1 periodic scrub scheduled @ 2024-12-24T01:40:30.212204+0000
10.7 20 0 0 0 5963 0 0 116 active+clean 33m 57715'25722 57939:9028884 [50,58,84]p50 [50,58,84]p50 2024-12-23T10:35:45.237158+0000 2024-12-17T03:28:34.643774+0000 1 periodic scrub scheduled @ 2024-12-24T18:39:23.977660+0000*
10.8 20 0 0 0 6074 2155 11 226 active+clean 9h 57938'25807 57939:9347768 [22,4,42]p22 [22,4,42]p22 2024-12-23T02:07:21.060581+0000 2024-12-21T23:16:53.596335+0000 1 periodic scrub scheduled @ 2024-12-24T10:21:26.911522+0000
*10.9 10 0 0 0 3046 0 0 63 active+clean 10h 57715'25597 57939:9021141 [10,38,15]p10 [10,38,15]p10 2024-12-23T00:19:34.157798+0000 2024-12-20T11:28:20.294176+0000 1 periodic scrub scheduled @ 2024-12-24T01:15:58.674425+0000*
10.a 13 0 0 0 3400 188 1 79 active+clean 6h 57715'25740 57938:9021299 [2,31,80]p2 [2,31,80]p2 2024-12-23T04:58:25.809790+0000 2024-12-16T15:51:26.184478+0000 1 periodic scrub scheduled @ 2024-12-24T05:19:15.221400+0000
10.b 11 0 0 0 2915 176 1 68 active+clean 3h 57854'25555 57939:9067491 [53,82,20]p53 [53,82,20]p53 2024-12-23T07:52:31.500216+0000 2024-12-22T07:11:15.383987+0000 1 periodic scrub scheduled @ 2024-12-24T11:42:16.438508+0000
10.c 18 0 0 0 4430 2182 11 458 active+clean 9h 57939'36171 57939:9036209 [16,48,8]p16 [16,48,8]p16 2024-12-23T01:46:59.720703+0000 2024-12-21T19:11:29.473994+0000 1 periodic scrub scheduled @ 2024-12-24T11:00:21.726487+0000
10.d 13 0 0 0 3135 187 1 122 active+clean 8h 57901'26081 57939:9015503 [36,3,13]p36 [36,3,13]p36 2024-12-23T02:31:13.107381+0000 2024-12-16T21:47:29.385720+0000 1 periodic scrub scheduled @ 2024-12-24T10:01:35.665610+0000
*10.e 14 0 0 0 3366 0 0 71 active+clean 7h 57715'25607 57939:9843618 [83,41,17]p83 [83,41,17]p83 2024-12-23T03:39:55.648905+0000 2024-12-20T08:01:47.715401+0000 1 periodic scrub scheduled @ 2024-12-24T09:58:14.031613+0000*
10.f 14 0 0 0 4105 2508 13 521 active+clean 20h 57939'39421 57939:8983895 [33,49,68]p33 [33,49,68]p33 2024-12-22T14:39:32.975012+0000 2024-12-22T14:39:32.975012+0000 1 periodic scrub scheduled @ 2024-12-23T23:39:32.875651+0000

I tried to fix them with ceph pg deep-scrub, but nothing changed. I then started checking the primary OSD log of PG 10.1 in detail and see the following entries. Why do these appear?

2024-12-23T00:03:38.920+0000 7f3fef2d7700 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:112: ERROR: int rados::cls::fifo::{anonymous}::read_part_header(cls_method_context_t, rados::cls::fifo::part_header*): failed decoding part header
2024-12-23T00:03:38.920+0000 7f3fef2d7700 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:781: int rados::cls::fifo::{anonymous}::trim_part(cls_method_context_t, ceph::buffer::v15_2_0::list*, ceph::buffer::v15_2_0::list*): failed to read part header
2024-12-23T00:03:38.920+0000 7f3feb2cf700 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:112: ERROR: int rados::cls::fifo::{anonymous}::read_part_header(cls_method_context_t, rados::cls::fifo::part_header*): failed decoding part header
2024-12-23T00:03:38.920+0000 7f3feb2cf700 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:781: int rados::cls::fifo::{anonymous}::trim_part(cls_method_context_t, ceph::buffer::v15_2_0::list*, ceph::buffer::v15_2_0::list*): failed to read part header
2024-12-23T00:03:38.920+0000 7f3feb2cf700 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:112: ERROR: int rados::cls::fifo::{anonymous}::read_part_header(cls_method_context_t, rados::cls::fifo::part_header*): failed decoding part header
2024-12-23T00:03:38.920+0000 7f3feb2cf700 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:781: int rados::cls::fifo::{anonymous}::trim_part(cls_method_context_t, ceph::buffer::v15_2_0::list*, ceph::buffer::v15_2_0::list*): failed to read part header
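
These read_part_header/trim_part messages come from the cls_fifo object class, which RGW uses for its FIFO-backed logs, so they may belong to the log objects handled by that OSD (like the data_log.44 errors I posted earlier, quoted below) rather than to the missing metadata omap itself. To narrow that down I plan to check whether the FIFO log objects are present and readable, roughly like this; the log pool name here is only my guess based on the zone name, so substitute whatever ceph osd pool ls shows for your zone's log pool:

LOGPOOL=s3-cdn-dc07.rgw.log   # assumed pool name, verify with "ceph osd pool ls"
# List the FIFO-backed data log objects (head objects plus any part objects).
rados -p "$LOGPOOL" ls | grep '^data_log' | sort
# Inspect the shard that RGW complains about in the log below (data_log.44).
rados -p "$LOGPOOL" stat data_log.44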

On Fri, Dec 20, 2024 at 10:59 AM Eugen Block <eblock@xxxxxx> wrote:
> I don't really have a good idea, except for maybe running "metadata
> sync init" and "metadata sync run"?
> But I wanted to point out that the .meta pool uses namespaces.
>
> > - Attempted to list metadata in the pool using rados ls -p
> > s3-cdn-dc07.rgw.meta, but got an empty result.
>
> Try this instead:
>
> rados -p s3-cdn-dc07.rgw.meta ls --all
>
> Do you have a specific object where the omap data is missing? Maybe
> increase debug_rgw logs to find a trace why it's missing.
>
> Zitat von Vahideh Alinouri <vahideh.alinouri@xxxxxxxxx>:
>
> > I also see this in the output of radosgw-admin metadata sync status. I
> > think it's strange because there should be a marker to follow the sync.
> > {
> >     "key": 0,
> >     "val": {
> >         "state": 0,
> >         "marker": "",
> >         "next_step_marker": "1_1730469205.875723_877487777.1",
> >         "total_entries": 174,
> >         "pos": 0,
> >         "timestamp": "2024-11-01T13:53:25.875723Z",
> >         "realm_epoch": 0
> >     }
> >
> > On Mon, Dec 16, 2024 at 1:24 PM Vahideh Alinouri <vahideh.alinouri@xxxxxxxxx> wrote:
> >
> >> I also see this log in the RGW log:
> >>
> >> 2024-12-16T12:23:58.651+0000 7f9b2b9fe700 1 ====== starting new request req=0x7f9ad9959730 =====
> >> 2024-12-16T12:23:58.651+0000 7f9b2b9fe700 -1 req 11778501317150336521 0.000000000s :list_data_changes_log int rgw::cls::fifo::{anonymous}::list_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::optional<std::basic_string_view<char> >, uint64_t, uint64_t, std::vector<rados::cls::fifo::part_list_entry>*, bool*, bool*, std::string*, uint64_t, optional_yield):245 fifo::op::LIST_PART failed r=-34 tid=4176
> >> 2024-12-16T12:23:58.651+0000 7f9b2b9fe700 -1 req 11778501317150336521 0.000000000s :list_data_changes_log int rgw::cls::fifo::FIFO::list(const DoutPrefixProvider*, int, std::optional<std::basic_string_view<char> >, std::vector<rgw::cls::fifo::list_entry>*, bool*, optional_yield):1660 list_entries failed: r=-34 tid=4176
> >> 2024-12-16T12:23:58.651+0000 7f9b2b9fe700 -1 req 11778501317150336521 0.000000000s :list_data_changes_log virtual int RGWDataChangesFIFO::list(const DoutPrefixProvider*, int, int, std::vector<rgw_data_change_log_entry>&, std::optional<std::basic_string_view<char> >, std::string*, bool*): unable to list FIFO: data_log.44: (34) Numerical result out of range
> >>
> >> On Sun, Dec 15, 2024 at 10:45 PM Vahideh Alinouri <vahideh.alinouri@xxxxxxxxx> wrote:
> >>
> >>> Hi guys,
> >>>
> >>> My Ceph release is Quincy 17.2.5. I need to change the master zone to
> >>> decommission the old one and upgrade all zones.
> >>> I have separated the client traffic and sync traffic in RGWs, meaning
> >>> there are separate RGW daemons handling the sync process.
> >>>
> >>> I encountered an issue when trying to sync one of the zones in the
> >>> zonegroup. The data sync is proceeding fine, but I have an issue with the
> >>> metadata sync. It gets stuck behind on a shard. Here is the output from
> >>> radosgw-admin sync status:
> >>>
> >>>   metadata sync syncing
> >>>                 full sync: 1/64 shards
> >>>                 full sync: 135 entries to sync
> >>>                 incremental sync: 63/64 shards
> >>>                 metadata is behind on 1 shard
> >>>                 behind shards: [0]
> >>>
> >>> In the RGW log, I see this error:
> >>> 2024-12-15T21:30:59.641+0000 7f6dff472700 1 beast: 0x7f6d2f1cf730: 172.19.66.112 - s3-cdn-user [15/Dec/2024:21:30:59.641 +0000] "GET /admin/log/?type=data&id=56&marker=00000000000000000000%3A00000000000000204086&extra-info=true&rgwx-zonegroup=7c01d60f-88c6-4192-baf7-d725260bf05d HTTP/1.1" 200 44 - - - latency=0.000000000s
> >>> 2024-12-15T21:30:59.701+0000 7f6e44d1e700 0 meta sync: ERROR: full_sync(): RGWRadosGetOmapKeysCR() returned ret=-2
> >>> 2024-12-15T21:30:59.701+0000 7f6e44d1e700 0 RGW-SYNC:meta:shard[0]: ERROR: failed to list omap keys, status=-2
> >>> 2024-12-15T21:30:59.701+0000 7f6e44d1e700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
> >>> 2024-12-15T21:31:00.705+0000 7f6e44d1e700 0 meta sync: ERROR: full_sync(): RGWRadosGetOmapKeysCR() returned ret=-2
> >>> 2024-12-15T21:31:00.705+0000 7f6e44d1e700 0 RGW-SYNC:meta:shard[0]: ERROR: failed to list omap keys, status=-2
> >>> 2024-12-15T21:31:00.705+0000 7f6e44d1e700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
> >>>
> >>> I've tried the following steps:
> >>>
> >>> - Changed the PG number of the metadata pool to force a rebalance, but
> >>>   everything was fine.
> >>> - Ran metadata sync init and tried to run it again.
> >>> - Restarted RGW services in both the zone and the master zone.
> >>> - Created a user in the master zone to ensure metadata sync works, which
> >>>   was successful.
> >>> - Checked OSD logs but didn't see any specific errors.
> >>> - Attempted to list metadata in the pool using rados ls -p
> >>>   s3-cdn-dc07.rgw.meta, but got an empty result.
> >>> - Compared the code for listing OMAP keys between the Quincy and Squid
> >>>   versions; there were no specific changes.
> >>>
> >>> I'm looking for any advice or suggestions to resolve this issue.
> >>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx