Are there any suggestions/tips on how we can debug this type of multisite/replication issue?

From:
At: 10/04/22 19:08:56 UTC-4:00
To: ceph-users@xxxxxxx
Subject: Re: multisite replication issue with Quincy

We are able to consistently reproduce the replication issue now. The following are the environment and the steps to reproduce it. Please see the details in the open tracker: https://tracker.ceph.com/issues/57562?next_issue_id=57266&prev_issue_id=55179#note-7. Any ideas about what's going on and how to fix it are welcome.

Ceph version: 17.2.3 GA or 17.2.4 GA

Custom configs in the config db:

$ sudo ceph config dump
WHO         MASK             LEVEL     OPTION                                            VALUE            RO
mon                          advanced  mon_compact_on_start                              true
mon                          advanced  mon_max_pg_per_osd                                500
mon                          advanced  mon_mgr_beacon_grace                              90
mon                          advanced  mon_stat_smooth_intervals                         9
mon                          advanced  osd_pool_default_size                             4
mon                          advanced  osd_scrub_auto_repair                             true
osd                          dev       bluestore_cache_trim_max_skip_pinned              10000
osd                          advanced  osd_crush_initial_weight                          0.000000
osd                          advanced  osd_deep_scrub_interval                           2592000.000000
osd                          advanced  osd_deep_scrub_large_omap_object_key_threshold    500000
osd                          advanced  osd_max_backfills                                 1
osd                          advanced  osd_max_scrubs                                    10
osd         host:osd_node_1  basic     osd_memory_target                                 4294967296
osd         host:osd_node_2  basic     osd_memory_target                                 4294967296
osd         host:osd_node_3  basic     osd_memory_target                                 4294967296
osd         host:osd_node_4  basic     osd_memory_target                                 4294967296
osd                          advanced  osd_op_thread_timeout                             60
osd                          advanced  osd_scrub_auto_repair                             true
osd                          advanced  osd_scrub_auto_repair_num_errors                  5
osd                          advanced  osd_scrub_begin_hour                              0
osd                          advanced  osd_scrub_during_recovery                         false
osd                          advanced  osd_scrub_end_hour                                23
osd                          advanced  osd_scrub_max_interval                            604800.000000
osd                          advanced  osd_scrub_min_interval                            259200.000000
osd                          advanced  osd_scrub_sleep                                   0.050000
osd.0                        basic     osd_mclock_max_capacity_iops_hdd                  18632.987170
osd.1                        basic     osd_mclock_max_capacity_iops_hdd                  19001.305326
osd.10                       basic     osd_mclock_max_capacity_iops_hdd                  19538.878049
osd.11                       basic     osd_mclock_max_capacity_iops_hdd                  17584.470315
osd.2                        basic     osd_mclock_max_capacity_iops_hdd                  18656.206041
osd.3                        basic     osd_mclock_max_capacity_iops_hdd                  18430.691608
osd.4                        basic     osd_mclock_max_capacity_iops_hdd                  20036.659741
osd.5                        basic     osd_mclock_max_capacity_iops_hdd                  19520.095460
osd.6                        basic     osd_mclock_max_capacity_iops_hdd                  18263.526765
osd.7                        basic     osd_mclock_max_capacity_iops_hdd                  18016.738667
osd.8                        basic     osd_mclock_max_capacity_iops_hdd                  19053.610592
osd.9                        basic     osd_mclock_max_capacity_iops_hdd                  20066.962652
client.rgw                   advanced  objecter_inflight_op_bytes                        5368709120
client.rgw                   advanced  objecter_inflight_ops                             102400
client.rgw                   advanced  rgw_lc_max_worker                                 3
client.rgw                   advanced  rgw_lc_max_wp_worker                              3
client.rgw                   advanced  rgw_lifecycle_work_time                           00:00-23:59      *
client.rgw                   basic     rgw_max_concurrent_requests                       2048

Multisite cluster settings:
- 2 clusters; each has 3 mons, 4 OSDs, and 2 RGW hosts
- each RGW host runs 2 client-traffic RGW instances and 2 replication RGW instances

Testing tool: cosbench

Reproduce steps:
1. Create 2 VM clusters, one for the primary site and one for the secondary site.
2. Deploy the 17.2.3 GA or 17.2.4 GA version to both sites.
3. Set up the custom configs on the mons of both clusters.
4. On the primary site, create 10 RGW users for the cosbench tests, and set the max-buckets of each user to 10,000 (a sketch of this step follows the list).
5. Run a cosbench workload to create 30,000 buckets for the 10 RGW users and generate 10 minutes of write-only traffic.
6. Run a cosbench workload to create another 30,000 buckets and generate 4 hours of write-only traffic.
7. We observed "behind shards" in the sync status after the 4-hour cosbench test, and the replication didn't catch up over time.
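For step 4, here is a minimal sketch of the user setup; the uid, display name, and loop count are placeholders, not the exact values used in our test. Users are created on the primary site, which is the metadata master:

$ # create one of the cosbench test users with a 10,000 bucket limit (placeholder uid/display-name)
$ sudo radosgw-admin user create --uid=cosbench-user-1 --display-name="cosbench test user 1" --max-buckets=10000
$ # confirm the limit took effect
$ sudo radosgw-admin user info --uid=cosbench-user-1 | grep max_buckets

For users that already exist, the same limit can be applied with "radosgw-admin user modify --uid=<uid> --max-buckets=10000".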
Cluster status:

1) Primary site:

$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
  metadata sync no sync (zone is master)
      data sync source: 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 3 shards
                        behind shards: [18,48,64]

$ sudo radosgw-admin data sync status --shard-id=48 --source-zone=dev-zone-bcc-secondary
{
    "shard_id": 48,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000001:00000000000000001013",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:20.319563Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

2) Secondary site:

$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [31]

$ sudo radosgw-admin data sync status --shard-id=31 --source-zone=dev-zone-bcc-master
{
    "shard_id": 31,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000001:00000000000000000512",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:03.944817Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Some error/fail log lines we observed:

1) Primary site

2022-10-02T23:15:12.482-0400 7fbf6a819700  1 req 8882223748441190067 0.001000015s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
…
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 0.068001039s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=10345
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 0.068001039s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1102 push_part failed: r=-34 tid=10345
…
2022-10-03T03:08:00.503-0400 7fc00496e700 -1 rgw rados thread: void rgw::cls::fifo::Trimmer::handle(const DoutPrefixProvider*, rgw::cls::fifo::Completion<rgw::cls::fifo::Trimmer>::Ptr&&, int):1858 trim failed: r=-5 tid=14844
...

2) Secondary site

...
2022-10-02T23:15:50.279-0400 7f679a2ce700  1 req 16201632253829371026 0.001000002s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
...

We did a bucket sync run on a broken bucket, but nothing happened and the bucket still didn't sync.

$ sudo radosgw-admin bucket sync run --bucket=jjm-4hr-test-1k-thisisbcstestload0011007 --source-zone=dev-zone-bcc-secondary
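In case it helps, below is a rough sketch of the per-bucket checks that could be run against this state. The bucket name is just the example above, and these are generic suggestions rather than a verified recovery procedure:

$ # any recorded data sync errors (check on both sites)
$ sudo radosgw-admin sync error list
$ # per-bucket sync state against the source zone
$ sudo radosgw-admin bucket sync status --bucket=jjm-4hr-test-1k-thisisbcstestload0011007 --source-zone=dev-zone-bcc-secondary
$ # bucket index log entries that incremental sync should replay
$ sudo radosgw-admin bilog list --bucket=jjm-4hr-test-1k-thisisbcstestload0011007
$ # temporarily raise debug logging (applies to all RGW instances) while re-running the sync
$ sudo ceph config set client.rgw debug_rgw 20
$ sudo ceph config set client.rgw debug_ms 1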
From: jane.dev.zhu@xxxxxxxxx
At: 10/04/22 18:57:12 UTC-4:00
To: Jane Zhu (BLOOMBERG/ 120 PARK)
Subject: Fwd: multisite replication issue with Quincy

We have encountered replication issues in our multisite setup with Quincy v17.2.3. Our Ceph clusters are brand new; we tore down our clusters and re-deployed fresh Quincy ones before running our test.

In our environment, we have 3 RGW nodes per site; each node has 2 instances for client traffic and 1 instance dedicated to replication.

Our test was done using cosbench with the following settings:
- 10 RGW users
- 3,000 buckets per user
- write-only
- 6 different object sizes with the following distribution:
  1k: 17%
  2k: 48%
  3k: 14%
  4k: 5%
  1M: 13%
  8M: 3%
- trying to write 10 million objects per object-size bucket per user, to avoid writing to the same objects
- no multipart uploads involved

The test ran for about 2 hours, roughly from 22:50 on 9/14 to 1:00 on 9/15. After that, the replication tail continued for roughly another 4 hours, until 4:50 on 9/15, with gradually decreasing replication traffic. Then the replication stopped and nothing has been going on in the clusters since.

While we were verifying the replication status, we found many issues.

1. The sync status shows the clusters are not fully synced. However, all the replication traffic has stopped and nothing is going on in the clusters.

Secondary zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        behind shards: [40,74]

Why did the replication stop even though the clusters are still not in sync?

2. We can see that some buckets are not fully synced, and we were able to identify some missing objects in our secondary zone (a sketch of one way to check this follows this list). Here is an example bucket; this is its sync status in the secondary zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78])

    source zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78])
                full sync: 0/101 shards
                incremental sync: 100/101 shards
                bucket is behind on 1 shards
                behind shards: [78]

3. As we can see from the above output, the behind shard for the example bucket is not in the list of behind shards in the overall sync status. Why is that?

4. The data sync status for these behind shards doesn't list any "pending_buckets" or "recovering_buckets". An example:

{
    "shard_id": 74,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000003:00000000000003381964",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-09-15T00:00:08.718840Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Shouldn't the not-yet-in-sync buckets be listed here?

5. The sync status of the primary zone differs from the sync status of the secondary zone, with different groups of behind shards. The same is true for the sync status of the same bucket. Is that legitimate? Please see item 1 for the sync status of the secondary zone, and item 6 for the primary zone.

6. Why does the primary zone have behind shards at all, since the replication is from the primary to the secondary?

Primary zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  metadata sync no sync (zone is master)
      data sync source: 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 30 shards
                        behind shards: [6,7,26,28,29,37,47,52,55,56,61,67,68,69,74,79,82,91,95,99,101,104,106,111,112,121,122,123,126,127]

7. We have in-sync buckets that show the correct sync status in the secondary zone but still show behind shards in the primary. Why is that?

Secondary zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])

    source zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])
                full sync: 0/101 shards
                incremental sync: 99/101 shards
                bucket is caught up with source

Primary zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])

    source zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])
                full sync: 0/101 shards
                incremental sync: 97/101 shards
                bucket is behind on 11 shards
                behind shards: [9,11,14,16,22,31,44,45,67,85,90]
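(Related to item 2.) Below is a rough sketch of one way to compare a bucket's contents across the two zones. The bucket name is just the example from item 2, the jq filter assumes the default JSON output of "radosgw-admin bucket list", and this is only an illustration, not the exact procedure we used:

$ # quick object-count comparison: run on a node in each zone and compare
$ sudo radosgw-admin bucket stats --bucket=mixed-5wrks-dev-4k-thisisbcstestload004178 | grep num_objects
$ # full listing: run in each zone, raising --max-entries above the bucket's object count
$ sudo radosgw-admin bucket list --bucket=mixed-5wrks-dev-4k-thisisbcstestload004178 --max-entries=10000000 | jq -r '.[].name' | sort > /tmp/bucket-objects.txt
$ # copy both listings to one host, then diff to get the missing object names
$ diff /tmp/primary-objects.txt /tmp/secondary-objects.txt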
Our primary goals here are:
1. to find out why the replication stopped while the clusters are not in sync;
2. to understand what we need to do to resume the replication, and to make sure it runs to the end without too much lagging;
3. to understand whether all the sync status info is correct; it seems to us there are many conflicts, and some of it doesn't reflect the real status of the clusters at all.

I have opened an issue in the tracker: https://tracker.ceph.com/issues/57562. More info regarding our clusters has been attached to the issue, including the following:
- ceph.conf of the RGWs
- ceph config dump
- ceph versions output
- sync status of the cluster, an in-sync bucket, a not-in-sync bucket, and some behind shards
- bucket list and bucket stats of a not-in-sync bucket, and the stat of a not-in-sync object

Thanks,
Jane

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx