Re: multisite replication issue with Quincy

We are now able to reproduce the replication issue consistently.
The environment and the steps to reproduce it are below.

Please see the details in the open tracker: https://tracker.ceph.com/issues/57562?next_issue_id=57266&prev_issue_id=55179#note-7.

Any ideas of what's going on and how to fix it are welcome.

Ceph version:

17.2.3 GA or 17.2.4 GA

Custom configs in config db:

$ sudo ceph config dump
WHO         MASK              LEVEL     OPTION                                          VALUE           RO
mon                           advanced  mon_compact_on_start                            true              
mon                           advanced  mon_max_pg_per_osd                              500               
mon                           advanced  mon_mgr_beacon_grace                            90                
mon                           advanced  mon_stat_smooth_intervals                       9                 
mon                           advanced  osd_pool_default_size                           4                 
mon                           advanced  osd_scrub_auto_repair                           true              
osd                           dev       bluestore_cache_trim_max_skip_pinned            10000             
osd                           advanced  osd_crush_initial_weight                        0.000000          
osd                           advanced  osd_deep_scrub_interval                         2592000.000000    
osd                           advanced  osd_deep_scrub_large_omap_object_key_threshold  500000            
osd                           advanced  osd_max_backfills                               1                 
osd                           advanced  osd_max_scrubs                                  10                
osd         host:osd_node_1   basic     osd_memory_target                               4294967296        
osd         host:osd_node_2   basic     osd_memory_target                               4294967296        
osd         host:osd_node_3   basic     osd_memory_target                               4294967296        
osd         host:osd_node_4   basic     osd_memory_target                               4294967296        
osd                           advanced  osd_op_thread_timeout                           60                
osd                           advanced  osd_scrub_auto_repair                           true              
osd                           advanced  osd_scrub_auto_repair_num_errors                5                 
osd                           advanced  osd_scrub_begin_hour                            0                 
osd                           advanced  osd_scrub_during_recovery                       false             
osd                           advanced  osd_scrub_end_hour                              23                
osd                           advanced  osd_scrub_max_interval                          604800.000000     
osd                           advanced  osd_scrub_min_interval                          259200.000000     
osd                           advanced  osd_scrub_sleep                                 0.050000          
osd.0                         basic     osd_mclock_max_capacity_iops_hdd                18632.987170      
osd.1                         basic     osd_mclock_max_capacity_iops_hdd                19001.305326      
osd.10                        basic     osd_mclock_max_capacity_iops_hdd                19538.878049      
osd.11                        basic     osd_mclock_max_capacity_iops_hdd                17584.470315      
osd.2                         basic     osd_mclock_max_capacity_iops_hdd                18656.206041      
osd.3                         basic     osd_mclock_max_capacity_iops_hdd                18430.691608      
osd.4                         basic     osd_mclock_max_capacity_iops_hdd                20036.659741      
osd.5                         basic     osd_mclock_max_capacity_iops_hdd                19520.095460      
osd.6                         basic     osd_mclock_max_capacity_iops_hdd                18263.526765      
osd.7                         basic     osd_mclock_max_capacity_iops_hdd                18016.738667      
osd.8                         basic     osd_mclock_max_capacity_iops_hdd                19053.610592      
osd.9                         basic     osd_mclock_max_capacity_iops_hdd                20066.962652      
client.rgw                    advanced  objecter_inflight_op_bytes                      5368709120        
client.rgw                    advanced  objecter_inflight_ops                           102400            
client.rgw                    advanced  rgw_lc_max_worker                               3                 
client.rgw                    advanced  rgw_lc_max_wp_worker                            3                 
client.rgw                    advanced  rgw_lifecycle_work_time                         00:00-23:59     * 
client.rgw                    basic     rgw_max_concurrent_requests                     2048              
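
For reference, overrides like these are applied with "ceph config set" (step 3 of the reproduce steps below). A minimal sketch using options taken from the dump above; the per-host mask form for osd_memory_target is written as we understand it:

# set a mon-level, a client.rgw-level, and a per-host OSD override
$ sudo ceph config set mon mon_compact_on_start true
$ sudo ceph config set client.rgw rgw_max_concurrent_requests 2048
$ sudo ceph config set osd/host:osd_node_1 osd_memory_target 4294967296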


Multisite cluster settings:

2 clusters; each has 3 mons, 4 OSD nodes, and 2 RGW nodes.
Each RGW node runs 2 client-traffic RGW instances and 2 replication RGW instances.

Testing tool: cosbench

Steps to reproduce:

1. Create 2 VM clusters, one for the primary site and one for the secondary site.
2. Deploy the 17.2.3 GA or 17.2.4 GA version to both sites.
3. Set up the custom configs (listed above) on the mons of both clusters.
4. On the primary site, create 10 RGW users for the cosbench tests and set max-buckets to 10,000 for each user (see the sketch after these steps).
5. Run a cosbench workload to create 30,000 buckets for the 10 RGW users and generate 10 minutes of write-only traffic.
6. Run a cosbench workload to create another 30,000 buckets and generate 4 hours of write-only traffic.
7. After the 4-hour cosbench test, we observed "behind shards" in the sync status, and the replication did not catch up over time.
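
A minimal sketch of step 4, assuming the stock radosgw-admin user commands; the uids, display names, and keys below are placeholders rather than the ones used in our test:

# create the 10 cosbench users on the primary site, each capped at 10,000 buckets
for i in $(seq -w 1 10); do
  sudo radosgw-admin user create \
    --uid="cosbench-user-${i}" \
    --display-name="cosbench user ${i}" \
    --max-buckets=10000 \
    --access-key="cosbench-ak-${i}" \
    --secret-key="cosbench-sk-${i}"
done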

Cluster status:

1) Primary site:
$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
  metadata sync no sync (zone is master)
      data sync source: 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 3 shards
                        behind shards: [18,48,64]

$ sudo radosgw-admin data sync status --shard-id=48 --source-zone=dev-zone-bcc-secondary
{
    "shard_id": 48,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000001:00000000000000001013",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:20.319563Z" 
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

2) Secondary site:
$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [31]
$ sudo radosgw-admin data sync status --shard-id=31 --source-zone=dev-zone-bcc-master
{
    "shard_id": 31,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000001:00000000000000000512",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:03.944817Z" 
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Some error/failure log lines we observed:

1) Primary site
2022-10-02T23:15:12.482-0400 7fbf6a819700  1 req 8882223748441190067 0.001000015s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
…
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 0.068001039s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=10345
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 0.068001039s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1102 push_part failed: r=-34 tid=10345
…
2022-10-03T03:08:00.503-0400 7fc00496e700 -1 rgw rados thread: void rgw::cls::fifo::Trimmer::handle(const DoutPrefixProvider*, rgw::cls::fifo::Completion<rgw::cls::fifo::Trimmer>::Ptr&&, int):1858 trim failed: r=-5 tid=14844
...

2) Secondary site
...
2022-10-02T23:15:50.279-0400 7f679a2ce700  1 req 16201632253829371026 0.001000002s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
...
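
For anyone decoding the numeric codes above: r=-34 and r=-5 are negative errno values (ERANGE and EIO on Linux), while err_no=-2002 appears to be an RGW-internal error code rather than an errno. A quick way to translate the errno values (a convenience sketch, not part of the original logs):

$ python3 -c 'import errno, os; [print(n, errno.errorcode[n], os.strerror(n)) for n in (5, 34)]'
5 EIO Input/output error
34 ERANGE Numerical result out of range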

We did a bucket sync run on a broken bucket, but nothing happened and the bucket still didn't sync.
$ sudo radosgw-admin bucket sync run --bucket=jjm-4hr-test-1k-thisisbcstestload0011007 --source-zone=dev-zone-bcc-secondary
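
In case it helps others reproduce our checks, these are the kinds of follow-up commands available (a sketch; the bucket name is the one from the run above):

# per-bucket sync state as seen from the zone that is behind
$ sudo radosgw-admin bucket sync status --bucket=jjm-4hr-test-1k-thisisbcstestload0011007
# errors recorded by the sync machinery, if any
$ sudo radosgw-admin sync error list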


From: jane.dev.zhu@xxxxxxxxx  At: 10/04/22 18:57:12 UTC-4:00  To: Jane Zhu (BLOOMBERG/ 120 PARK)
Subject: Fwd: multisite replication issue with Quincy


We have encountered replication issues in our multisite setup with Quincy v17.2.3.
Our Ceph clusters are brand new; we tore down our clusters and re-deployed fresh Quincy ones before running the test.
In our environment we have 3 RGW nodes per site; each node runs 2 instances for client traffic and 1 instance dedicated to replication.
Our test was done using cosbench with the following settings:
- 10 rgw users
- 3000 buckets per user
- write only
- 6 different object sizes with the following distribution:
1k: 17%
2k: 48%
3k: 14%
4k: 5%
1M: 13%
8M: 3%
- trying to write 10 million objects per object size bucket per user to avoid writing to the same objects
- no multipart uploads involved
The test ran for about 2 hours, roughly from 22:50 on 9/14 to 1:00am on 9/15. After that, the replication tail continued for roughly another 4 hours, until about 4:50am on 9/15, with gradually decreasing replication traffic. Then the replication stopped, and nothing has been going on in the clusters since.
While we were verifying the replication status, we found many issues.
1. The sync status shows the clusters are not fully synced. However, all replication traffic has stopped and nothing is going on in the clusters.
Secondary zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        behind shards: [40,74]

Why did the replication stop even though the clusters are still not in sync?

2. We can see that some buckets are not fully synced, and we were able to identify some missing objects in our secondary zone.
Here is an example bucket. This is its sync status in the secondary zone.
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78])

    source zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78])
                full sync: 0/101 shards
                incremental sync: 100/101 shards
                bucket is behind on 1 shards
                behind shards: [78]
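
For reference, per-bucket output like the above is presumably produced with "radosgw-admin bucket sync status"; a hedged sketch of that command plus one way to peek at the bucket index log that feeds incremental bucket sync:

$ sudo radosgw-admin bucket sync status --bucket=mixed-5wrks-dev-4k-thisisbcstestload004178
$ sudo radosgw-admin bilog list --bucket=mixed-5wrks-dev-4k-thisisbcstestload004178 | head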


3. As the above sync status shows, the behind shard for the example bucket is not in the list of behind shards in the overall sync status. Why is that?
4. Data sync status for these behind shards doesn't list any "pending_buckets" or "recovering_buckets".
An example:
{
    "shard_id": 74,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000003:00000000000003381964",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-09-15T00:00:08.718840Z" 
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Shouldn't the not-yet-in-sync buckets be listed here?

5. The sync status of the primary zone differs from that of the secondary zone, with different sets of behind shards. The same is true for the sync status of the same bucket. Is that legitimate? Please see item 1 for the sync status of the secondary zone and item 6 for the primary zone.
6. Why does the primary zone have behind shards at all, given that the replication goes from the primary to the secondary?
Primary Zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  metadata sync no sync (zone is master)
      data sync source: 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 30 shards
                        behind shards: [6,7,26,28,29,37,47,52,55,56,61,67,68,69,74,79,82,91,95,99,101,104,106,111,112,121,122,123,126,127]


7. We have in-sync buckets that show the correct sync status in the secondary zone but still show behind shards in the primary. Why is that?
Secondary Zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])

    source zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])
                full sync: 0/101 shards
                incremental sync: 99/101 shards
                bucket is caught up with source


Primary zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])

    source zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])
                full sync: 0/101 shards
                incremental sync: 97/101 shards
                bucket is behind on 11 shards
                behind shards: [9,11,14,16,22,31,44,45,67,85,90]


Our primary goals here are:
1. to find out why the replication stopped while the clusters are not in sync;
2. to understand what we need to do to resume the replication, and to make sure it runs to completion without too much lag;
3. to understand whether all the sync status info is correct; it seems to us there are many conflicts, and some of it doesn't reflect the real status of the clusters at all.

I have opened an issue in the Issue Tracker: https://tracker.ceph.com/issues/57562.
Additional information about our clusters has been attached to the issue. It includes the following (a command sketch for regenerating most of it follows the list):
- ceph.conf of rgws
- ceph config dump
- ceph versions output
- sync status of cluster, an in-sync bucket, a not-in-sync bucket, and some behind shards
- bucket list and bucket stats of a not-in-sync bucket and stat of a not-in-sync object
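
For completeness, most of the attached information can be regenerated with commands along these lines (a sketch; bucket, object, shard, and zone names are placeholders):

$ sudo ceph config dump
$ sudo ceph versions
$ sudo radosgw-admin sync status
$ sudo radosgw-admin bucket sync status --bucket=<bucket>
$ sudo radosgw-admin data sync status --shard-id=<shard> --source-zone=<zone>
$ sudo radosgw-admin bucket stats --bucket=<bucket>
$ sudo radosgw-admin bucket list --bucket=<bucket>
$ sudo radosgw-admin object stat --bucket=<bucket> --object=<object>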

Thanks,
Jane

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



