The metadata sync issue has been resolved by changing the master zone and re-running the metadata sync.
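In case it helps anyone who hits the same thing, the fix boiled down to the standard multisite promotion steps followed by re-initializing the metadata sync on the zone that was stuck. This is only a rough sketch from memory rather than an exact transcript, and the zone name is a placeholder:

    # on the zone being promoted to master
    radosgw-admin zone modify --rgw-zone=<new-master-zone> --master --default
    radosgw-admin period update --commit

    # on the zone whose metadata sync was stuck, after restarting its sync RGW daemons
    radosgw-admin metadata sync init
    radosgw-admin metadata sync run

    # watch the metadata shards drain
    radosgw-admin sync status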
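And for anyone who lands here with the same "failed to list omap keys, status=-2" from RGWRadosGetOmapKeysCR during metadata full sync: as far as I can tell, that omap read targets the per-shard full-sync index objects that "metadata sync init" builds in the zone's log pool, so it is worth checking whether those objects exist at all. The object and pool names below are assumptions based on the naming in this cluster (something like meta.full-sync.index.<shard> in s3-cdn-dc07.rgw.log), so verify them with the listing first:

    # list every object in the log pool, across all namespaces, and look for the full-sync index
    rados -p s3-cdn-dc07.rgw.log ls --all | grep -i full-sync

    # then dump the omap keys of shard 0's index object (add -N <namespace> if it sits in one)
    rados -p s3-cdn-dc07.rgw.log listomapkeys meta.full-sync.index.0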
On Mon, Dec 23, 2024 at 2:15 PM Vahideh Alinouri <vahideh.alinouri@xxxxxxxxx> wrote:
> When I increased the debug level of the RGW sync client to 20, I get this:
>
> 2024-12-23T09:42:17.248+0000 7f124866b700 20 register_request mgr=0x5633b4b8d958 req_data->id=79162099, curl_handle=0x5633bb89ee60
> 2024-12-23T09:42:17.248+0000 7f124866b700 20 run: stack=0x5633b51952c0 is io blocked
> 2024-12-23T09:42:17.248+0000 7f1248e6c700 20 link_request req_data=0x5633bc240d20 req_data->id=79162099, curl_handle=0x5633bb89ee60
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b50bb100:25RGWMetaSyncShardControlCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b520f400:18RGWMetaSyncShardCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 10 RGW-SYNC:meta:shard[0]: start full sync
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b7731500:20RGWContinuousLeaseCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b520f400:18RGWMetaSyncShardCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 run: stack=0x5633b50f1540 is_blocked_by_stack()=0 is_sleeping=1 waiting_for_child()=0
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 enqueued request req=0x5633be14c500
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 RGWWQ:
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 req: 0x5633be14c500
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 run: stack=0x5633bb892000 is io blocked
> 2024-12-23T09:42:17.252+0000 7f125167d700 20 dequeued request req=0x5633be14c500
> 2024-12-23T09:42:17.252+0000 7f125167d700 20 RGWWQ: empty
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:20RGWSimpleRadosLockCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b7731500:20RGWContinuousLeaseCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 run: stack=0x5633bb892000 is io blocked
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b520f400:18RGWMetaSyncShardCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 10 RGW-SYNC:meta:shard[0]: took lease
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b796ce00:21RGWRadosGetOmapKeysCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b796ce00:21RGWRadosGetOmapKeysCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 run: stack=0x5633b50f1540 is io blocked
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b796ce00:21RGWRadosGetOmapKeysCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b796ce00:21RGWRadosGetOmapKeysCR: operate() returned r=-2
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b520f400:18RGWMetaSyncShardCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700  0 meta sync: ERROR: full_sync(): RGWRadosGetOmapKeysCR() returned ret=-2
> 2024-12-23T09:42:17.252+0000 7f124866b700  0 RGW-SYNC:meta:shard[0]: ERROR: failed to list omap keys, status=-2
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633b50f1540:op=0x5633b520f400:18RGWMetaSyncShardCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 run: stack=0x5633b50f1540 is_blocked_by_stack()=0 is_sleeping=0 waiting_for_child()=1
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b7731500:20RGWContinuousLeaseCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:22RGWSimpleRadosUnlockCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 rgw rados thread: cr:s=0x5633bb892000:op=0x5633b796ce00:22RGWSimpleRadosUnlockCR: operate()
> 2024-12-23T09:42:17.252+0000 7f124866b700 20 enqueued request req=0x5633bd10ea20
>
> And when I run this to list the omap values, the output is empty!
>
> rados -p s3-cdn-dc07.rgw.meta listomapvals -N root .bucket.meta.XXXX:XXXX.9
>
> When I list the PGs of the meta pool using ceph pg ls-by-pool s3-cdn-dc07.rgw.meta, there are some PGs that have a non-zero BYTES value but whose OMAP_BYTES and OMAP_KEYS are 0!
>
> PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG   STATE         SINCE  VERSION       REPORTED       UP             ACTING         SCRUB_STAMP                      DEEP_SCRUB_STAMP                 LAST_SCRUB_DURATION  SCRUB_SCHEDULING
> 10.0  14  0  0  0  3364  1676  9   957   active+clean  6h   57938'32678   57939:9604093  [70,88,46]p70  [70,88,46]p70  2024-12-23T04:57:43.792352+0000  2024-12-21T22:45:49.820084+0000  1  periodic scrub scheduled @ 2024-12-24T16:29:19.844180+0000
> 10.1  10  0  0  0  3073  0     0   49    active+clean  14h  57715'25516   57939:8975816  [0,60,52]p0    [0,60,52]p0    2024-12-22T20:40:57.494355+0000  2024-12-17T17:43:43.383320+0000  1  periodic scrub scheduled @ 2024-12-23T22:30:45.498616+0000
> 10.2  15  0  0  0  3706  188   1   132   active+clean  27h  57938'26080   57938:8966743  [39,14,1]p39   [39,14,1]p39   2024-12-22T08:05:34.147330+0000  2024-12-22T08:05:34.147330+0000  1  periodic scrub scheduled @ 2024-12-23T12:00:40.771526+0000
> 10.3  14  0  0  0  3898  2234  11  414   active+clean  2h   57939'36531   57939:9388345  [40,78,6]p40   [40,78,6]p40   2024-12-23T09:00:29.100038+0000  2024-12-16T12:58:31.054537+0000  1  periodic scrub scheduled @ 2024-12-24T14:36:29.705903+0000
> 10.4  13  0  0  0  3346  0     0   67    active+clean  25h  57715'25662   57939:8964482  [34,44,25]p34  [34,44,25]p34  2024-12-22T09:26:16.897803+0000  2024-12-16T23:09:15.271909+0000  1  periodic scrub scheduled @ 2024-12-23T20:57:45.809155+0000
> 10.5  13  0  0  0  3567  0     0   2904  active+clean  28h  57939'120020  57939:9016514  [66,37,91]p66  [66,37,91]p66  2024-12-22T07:03:26.677870+0000  2024-12-19T09:16:25.987456+0000  1  periodic scrub scheduled @ 2024-12-23T17:12:34.661323+0000
> 10.6  7   0  0  0  2159  0     0   39    active+clean  15h  57938'25500   57939:9961651  [19,24,56]p19  [19,24,56]p19  2024-12-22T19:16:11.782123+0000  2024-12-22T19:16:11.782123+0000  1  periodic scrub scheduled @ 2024-12-24T01:40:30.212204+0000
> 10.7  20  0  0  0  5963  0     0   116   active+clean  33m  57715'25722   57939:9028884  [50,58,84]p50  [50,58,84]p50  2024-12-23T10:35:45.237158+0000  2024-12-17T03:28:34.643774+0000  1  periodic scrub scheduled @ 2024-12-24T18:39:23.977660+0000
> 10.8  20  0  0  0  6074  2155  11  226   active+clean  9h   57938'25807   57939:9347768  [22,4,42]p22   [22,4,42]p22   2024-12-23T02:07:21.060581+0000  2024-12-21T23:16:53.596335+0000  1  periodic scrub scheduled @ 2024-12-24T10:21:26.911522+0000
> 10.9  10  0  0  0  3046  0     0   63    active+clean  10h  57715'25597   57939:9021141  [10,38,15]p10  [10,38,15]p10  2024-12-23T00:19:34.157798+0000  2024-12-20T11:28:20.294176+0000  1  periodic scrub scheduled @ 2024-12-24T01:15:58.674425+0000
> 10.a  13  0  0  0  3400  188   1   79    active+clean  6h   57715'25740   57938:9021299  [2,31,80]p2    [2,31,80]p2    2024-12-23T04:58:25.809790+0000  2024-12-16T15:51:26.184478+0000  1  periodic scrub scheduled @ 2024-12-24T05:19:15.221400+0000
> 10.b  11  0  0  0  2915  176   1   68    active+clean  3h   57854'25555   57939:9067491  [53,82,20]p53  [53,82,20]p53  2024-12-23T07:52:31.500216+0000  2024-12-22T07:11:15.383987+0000  1  periodic scrub scheduled @ 2024-12-24T11:42:16.438508+0000
> 10.c  18  0  0  0  4430  2182  11  458   active+clean  9h   57939'36171   57939:9036209  [16,48,8]p16   [16,48,8]p16   2024-12-23T01:46:59.720703+0000  2024-12-21T19:11:29.473994+0000  1  periodic scrub scheduled @ 2024-12-24T11:00:21.726487+0000
> 10.d  13  0  0  0  3135  187   1   122   active+clean  8h   57901'26081   57939:9015503  [36,3,13]p36   [36,3,13]p36   2024-12-23T02:31:13.107381+0000  2024-12-16T21:47:29.385720+0000  1  periodic scrub scheduled @ 2024-12-24T10:01:35.665610+0000
> 10.e  14  0  0  0  3366  0     0   71    active+clean  7h   57715'25607   57939:9843618  [83,41,17]p83  [83,41,17]p83  2024-12-23T03:39:55.648905+0000  2024-12-20T08:01:47.715401+0000  1  periodic scrub scheduled @ 2024-12-24T09:58:14.031613+0000
> 10.f  14  0  0  0  4105  2508  13  521   active+clean  20h  57939'39421   57939:8983895  [33,49,68]p33  [33,49,68]p33  2024-12-22T14:39:32.975012+0000  2024-12-22T14:39:32.975012+0000  1  periodic scrub scheduled @ 2024-12-23T23:39:32.875651+0000
>
> I tried to fix them with ceph pg deep-scrub, but there were no changes!
>
> I started to check the primary OSD log of 10.1 in detail, and I see the following logs. Why?
>
> 2024-12-23T00:03:38.920+0000 7f3fef2d7700  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:112: ERROR: int rados::cls::fifo::{anonymous}::read_part_header(cls_method_context_t, rados::cls::fifo::part_header*): failed decoding part header
> 2024-12-23T00:03:38.920+0000 7f3fef2d7700  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:781: int rados::cls::fifo::{anonymous}::trim_part(cls_method_context_t, ceph::buffer::v15_2_0::list*, ceph::buffer::v15_2_0::list*): failed to read part header
> 2024-12-23T00:03:38.920+0000 7f3feb2cf700  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:112: ERROR: int rados::cls::fifo::{anonymous}::read_part_header(cls_method_context_t, rados::cls::fifo::part_header*): failed decoding part header
> 2024-12-23T00:03:38.920+0000 7f3feb2cf700  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:781: int rados::cls::fifo::{anonymous}::trim_part(cls_method_context_t, ceph::buffer::v15_2_0::list*, ceph::buffer::v15_2_0::list*): failed to read part header
> 2024-12-23T00:03:38.920+0000 7f3feb2cf700  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:112: ERROR: int rados::cls::fifo::{anonymous}::read_part_header(cls_method_context_t, rados::cls::fifo::part_header*): failed decoding part header
> 2024-12-23T00:03:38.920+0000 7f3feb2cf700  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/cls/fifo/cls_fifo.cc:781: int rados::cls::fifo::{anonymous}::trim_part(cls_method_context_t, ceph::buffer::v15_2_0::list*, ceph::buffer::v15_2_0::list*): failed to read part header
>
> On Fri, Dec 20, 2024 at 10:59 AM Eugen Block <eblock@xxxxxx> wrote:
>
>> I don't really have a good idea, except for maybe running "metadata sync init" and "metadata sync run"?
>> But I wanted to point out that the .meta pool uses namespaces.
>>
>> > - Attempted to list metadata in the pool using rados ls -p s3-cdn-dc07.rgw.meta, but got an empty result.
>>
>> Try this instead:
>>
>> rados -p s3-cdn-dc07.rgw.meta ls --all
>>
>> Do you have a specific object where the omap data is missing? Maybe increase debug_rgw logs to find a trace why it's missing.
>>
>> Quoting Vahideh Alinouri <vahideh.alinouri@xxxxxxxxx>:
>>
>> > I also see this in the output of radosgw-admin metadata sync status. I think it's strange because there should be a marker to follow the sync.
>> >
>> >     {
>> >         "key": 0,
>> >         "val": {
>> >             "state": 0,
>> >             "marker": "",
>> >             "next_step_marker": "1_1730469205.875723_877487777.1",
>> >             "total_entries": 174,
>> >             "pos": 0,
>> >             "timestamp": "2024-11-01T13:53:25.875723Z",
>> >             "realm_epoch": 0
>> >         }
>> >
>> > On Mon, Dec 16, 2024 at 1:24 PM Vahideh Alinouri <vahideh.alinouri@xxxxxxxxx> wrote:
>> >
>> >> I also see this log in the RGW log:
>> >>
>> >> 2024-12-16T12:23:58.651+0000 7f9b2b9fe700  1 ====== starting new request req=0x7f9ad9959730 =====
>> >> 2024-12-16T12:23:58.651+0000 7f9b2b9fe700 -1 req 11778501317150336521 0.000000000s :list_data_changes_log int rgw::cls::fifo::{anonymous}::list_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::optional<std::basic_string_view<char> >, uint64_t, uint64_t, std::vector<rados::cls::fifo::part_list_entry>*, bool*, bool*, std::string*, uint64_t, optional_yield):245 fifo::op::LIST_PART failed r=-34 tid=4176
>> >> 2024-12-16T12:23:58.651+0000 7f9b2b9fe700 -1 req 11778501317150336521 0.000000000s :list_data_changes_log int rgw::cls::fifo::FIFO::list(const DoutPrefixProvider*, int, std::optional<std::basic_string_view<char> >, std::vector<rgw::cls::fifo::list_entry>*, bool*, optional_yield):1660 list_entries failed: r=-34 tid=4176
>> >> 2024-12-16T12:23:58.651+0000 7f9b2b9fe700 -1 req 11778501317150336521 0.000000000s :list_data_changes_log virtual int RGWDataChangesFIFO::list(const DoutPrefixProvider*, int, int, std::vector<rgw_data_change_log_entry>&, std::optional<std::basic_string_view<char> >, std::string*, bool*): unable to list FIFO: data_log.44: (34) Numerical result out of range
>> >>
>> >> On Sun, Dec 15, 2024 at 10:45 PM Vahideh Alinouri <vahideh.alinouri@xxxxxxxxx> wrote:
>> >>
>> >>> Hi guys,
>> >>>
>> >>> My Ceph release is Quincy 17.2.5. I need to change the master zone to decommission the old one and upgrade all zones. I have separated the client traffic and sync traffic in RGWs, meaning there are separate RGW daemons handling the sync process.
>> >>>
>> >>> I encountered an issue when trying to sync one of the zones in the zonegroup. The data sync is proceeding fine, but I have an issue with the metadata sync. It gets stuck behind on a shard.
>> >>> Here is the output from radosgw-admin sync status:
>> >>>
>> >>>   metadata sync syncing
>> >>>                 full sync: 1/64 shards
>> >>>                 full sync: 135 entries to sync
>> >>>                 incremental sync: 63/64 shards
>> >>>                 metadata is behind on 1 shard
>> >>>                 behind shards: [0]
>> >>>
>> >>> In the RGW log, I see this error:
>> >>>
>> >>> 2024-12-15T21:30:59.641+0000 7f6dff472700  1 beast: 0x7f6d2f1cf730: 172.19.66.112 - s3-cdn-user [15/Dec/2024:21:30:59.641 +0000] "GET /admin/log/?type=data&id=56&marker=00000000000000000000%3A00000000000000204086&extra-info=true&rgwx-zonegroup=7c01d60f-88c6-4192-baf7-d725260bf05d HTTP/1.1" 200 44 - - - latency=0.000000000s
>> >>> 2024-12-15T21:30:59.701+0000 7f6e44d1e700  0 meta sync: ERROR: full_sync(): RGWRadosGetOmapKeysCR() returned ret=-2
>> >>> 2024-12-15T21:30:59.701+0000 7f6e44d1e700  0 RGW-SYNC:meta:shard[0]: ERROR: failed to list omap keys, status=-2
>> >>> 2024-12-15T21:30:59.701+0000 7f6e44d1e700  0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
>> >>> 2024-12-15T21:31:00.705+0000 7f6e44d1e700  0 meta sync: ERROR: full_sync(): RGWRadosGetOmapKeysCR() returned ret=-2
>> >>> 2024-12-15T21:31:00.705+0000 7f6e44d1e700  0 RGW-SYNC:meta:shard[0]: ERROR: failed to list omap keys, status=-2
>> >>> 2024-12-15T21:31:00.705+0000 7f6e44d1e700  0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
>> >>>
>> >>> I’ve tried the following steps:
>> >>>
>> >>> - Changed the PG number of the metadata pool to force a rebalance, but everything was fine.
>> >>> - Ran metadata sync init and tried to run it again.
>> >>> - Restarted RGW services in both the zone and the master zone.
>> >>> - Created a user in the master zone to ensure metadata sync works, which was successful.
>> >>> - Checked OSD logs but didn’t see any specific errors.
>> >>> - Attempted to list metadata in the pool using rados ls -p s3-cdn-dc07.rgw.meta, but got an empty result.
>> >>> - Compared the code for listing OMAP keys between the Quincy and Squid versions; there were no specific changes.
>> >>>
>> >>> I’m looking for any advice or suggestions to resolve this issue.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx