Are there any suggestions/tips on how we can debug this type of multisite/replication issue?

From:
At: 10/04/22 19:08:56 UTC-4:00
To: ceph-users@xxxxxxx
Subject: Re: multisite replication issue with Quincy

We are able to consistently reproduce the replication issue now. The following are the environment and the steps to reproduce it. Please see the details in the open tracker: https://tracker.ceph.com/issues/57562?next_issue_id=57266&prev_issue_id=55179#note-7. Any ideas about what's going on and how to fix it are welcome.

Ceph version: 17.2.3 GA or 17.2.4 GA

Custom configs in the config db:

$ sudo ceph config dump
WHO         MASK             LEVEL     OPTION                                            VALUE            RO
mon                          advanced  mon_compact_on_start                              true
mon                          advanced  mon_max_pg_per_osd                                500
mon                          advanced  mon_mgr_beacon_grace                              90
mon                          advanced  mon_stat_smooth_intervals                         9
mon                          advanced  osd_pool_default_size                             4
mon                          advanced  osd_scrub_auto_repair                             true
osd                          dev       bluestore_cache_trim_max_skip_pinned              10000
osd                          advanced  osd_crush_initial_weight                          0.000000
osd                          advanced  osd_deep_scrub_interval                           2592000.000000
osd                          advanced  osd_deep_scrub_large_omap_object_key_threshold    500000
osd                          advanced  osd_max_backfills                                 1
osd                          advanced  osd_max_scrubs                                    10
osd         host:osd_node_1  basic     osd_memory_target                                 4294967296
osd         host:osd_node_2  basic     osd_memory_target                                 4294967296
osd         host:osd_node_3  basic     osd_memory_target                                 4294967296
osd         host:osd_node_4  basic     osd_memory_target                                 4294967296
osd                          advanced  osd_op_thread_timeout                             60
osd                          advanced  osd_scrub_auto_repair                             true
osd                          advanced  osd_scrub_auto_repair_num_errors                  5
osd                          advanced  osd_scrub_begin_hour                              0
osd                          advanced  osd_scrub_during_recovery                         false
osd                          advanced  osd_scrub_end_hour                                23
osd                          advanced  osd_scrub_max_interval                            604800.000000
osd                          advanced  osd_scrub_min_interval                            259200.000000
osd                          advanced  osd_scrub_sleep                                   0.050000
osd.0                        basic     osd_mclock_max_capacity_iops_hdd                  18632.987170
osd.1                        basic     osd_mclock_max_capacity_iops_hdd                  19001.305326
osd.10                       basic     osd_mclock_max_capacity_iops_hdd                  19538.878049
osd.11                       basic     osd_mclock_max_capacity_iops_hdd                  17584.470315
osd.2                        basic     osd_mclock_max_capacity_iops_hdd                  18656.206041
osd.3                        basic     osd_mclock_max_capacity_iops_hdd                  18430.691608
osd.4                        basic     osd_mclock_max_capacity_iops_hdd                  20036.659741
osd.5                        basic     osd_mclock_max_capacity_iops_hdd                  19520.095460
osd.6                        basic     osd_mclock_max_capacity_iops_hdd                  18263.526765
osd.7                        basic     osd_mclock_max_capacity_iops_hdd                  18016.738667
osd.8                        basic     osd_mclock_max_capacity_iops_hdd                  19053.610592
osd.9                        basic     osd_mclock_max_capacity_iops_hdd                  20066.962652
client.rgw                   advanced  objecter_inflight_op_bytes                        5368709120
client.rgw                   advanced  objecter_inflight_ops                             102400
client.rgw                   advanced  rgw_lc_max_worker                                 3
client.rgw                   advanced  rgw_lc_max_wp_worker                              3
client.rgw                   advanced  rgw_lifecycle_work_time                           00:00-23:59      *
client.rgw                   basic     rgw_max_concurrent_requests                       2048

Multisite cluster settings:
- 2 clusters; each has 3 mons, 4 OSDs, and 2 RGW hosts
- each RGW host runs 2 client-traffic RGW instances and 2 replication RGW instances

Testing tool: cosbench

Reproduce steps:
1. Create 2 VM clusters, one for the primary site and one for the secondary site.
2. Deploy the 17.2.3 GA or 17.2.4 GA version to both sites.
3. Set up the custom configs on the mons of both clusters.
4. On the primary site, create 10 RGW users for the cosbench tests, and set the max-buckets of each user to 10,000 (a sketch of this step follows the list).
5. Run a cosbench workload to create 30,000 buckets for the 10 RGW users and generate 10 minutes of write-only traffic.
6. Run a cosbench workload to create another 30,000 buckets and generate 4 hours of write-only traffic.
7. We observed "behind shards" in the sync status after the 4-hour cosbench test, and the replication didn't catch up over time.
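For step 4, here is a minimal sketch of the user setup; the uid, display name, and loop count are placeholders, not the exact values used in our test. Users are created on the primary site, which is the metadata master:

$ # create one of the cosbench test users with a 10,000 bucket limit (placeholder uid/display-name)
$ sudo radosgw-admin user create --uid=cosbench-user-1 --display-name="cosbench test user 1" --max-buckets=10000
$ # confirm the limit took effect
$ sudo radosgw-admin user info --uid=cosbench-user-1 | grep max_buckets

For users that already exist, the same limit can be applied with "radosgw-admin user modify --uid=<uid> --max-buckets=10000".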
Cluster status:

1) Primary site:

$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
  metadata sync no sync (zone is master)
      data sync source: 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 3 shards
                        behind shards: [18,48,64]

$ sudo radosgw-admin data sync status --shard-id=48 --source-zone=dev-zone-bcc-secondary
{
    "shard_id": 48,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000001:00000000000000001013",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:20.319563Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

2) Secondary site:

$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [31]

$ sudo radosgw-admin data sync status --shard-id=31 --source-zone=dev-zone-bcc-master
{
    "shard_id": 31,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000001:00000000000000000512",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:03.944817Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Some error/fail log lines we observed:

1) Primary site

2022-10-02T23:15:12.482-0400 7fbf6a819700  1 req 8882223748441190067 0.001000015s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
…
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 0.068001039s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=10345
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 0.068001039s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1102 push_part failed: r=-34 tid=10345
…
2022-10-03T03:08:00.503-0400 7fc00496e700 -1 rgw rados thread: void rgw::cls::fifo::Trimmer::handle(const DoutPrefixProvider*, rgw::cls::fifo::Completion<rgw::cls::fifo::Trimmer>::Ptr&&, int):1858 trim failed: r=-5 tid=14844
...

2) Secondary site

...
2022-10-02T23:15:50.279-0400 7f679a2ce700  1 req 16201632253829371026 0.001000002s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
...

We did a bucket sync run on a broken bucket, but nothing happened and the bucket still didn't sync.

$ sudo radosgw-admin bucket sync run --bucket=jjm-4hr-test-1k-thisisbcstestload0011007 --source-zone=dev-zone-bcc-secondary
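In case it helps, below is a rough sketch of the per-bucket checks that could be run against this state. The bucket name is just the example above, and these are generic suggestions rather than a verified recovery procedure:

$ # any recorded data sync errors (check on both sites)
$ sudo radosgw-admin sync error list
$ # per-bucket sync state against the source zone
$ sudo radosgw-admin bucket sync status --bucket=jjm-4hr-test-1k-thisisbcstestload0011007 --source-zone=dev-zone-bcc-secondary
$ # bucket index log entries that incremental sync should replay
$ sudo radosgw-admin bilog list --bucket=jjm-4hr-test-1k-thisisbcstestload0011007
$ # temporarily raise debug logging (applies to all RGW instances) while re-running the sync
$ sudo ceph config set client.rgw debug_rgw 20
$ sudo ceph config set client.rgw debug_ms 1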
From: jane.dev.zhu@xxxxxxxxx
At: 10/04/22 18:57:12 UTC-4:00
To: Jane Zhu (BLOOMBERG/ 120 PARK)
Subject: Fwd: multisite replication issue with Quincy

We have encountered replication issues in our multisite setup with Quincy v17.2.3. Our Ceph clusters are brand new; we tore down our clusters and re-deployed fresh Quincy ones before running our test.

In our environment, we have 3 RGW nodes per site; each node has 2 instances for client traffic and 1 instance dedicated to replication.

Our test was done using cosbench with the following settings:
- 10 RGW users
- 3,000 buckets per user
- write-only
- 6 different object sizes with the following distribution:
  1k: 17%
  2k: 48%
  3k: 14%
  4k: 5%
  1M: 13%
  8M: 3%
- trying to write 10 million objects per object-size bucket per user, to avoid writing to the same objects
- no multipart uploads involved

The test ran for about 2 hours, roughly from 22:50 on 9/14 to 1:00 on 9/15. After that, the replication tail continued for roughly another 4 hours, until 4:50 on 9/15, with gradually decreasing replication traffic. Then the replication stopped and nothing has been going on in the clusters since.

While we were verifying the replication status, we found many issues.

1. The sync status shows the clusters are not fully synced. However, all the replication traffic has stopped and nothing is going on in the clusters.

Secondary zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        behind shards: [40,74]

Why did the replication stop even though the clusters are still not in sync?

2. We can see that some buckets are not fully synced, and we were able to identify some missing objects in our secondary zone (a sketch of one way to check this follows this list). Here is an example bucket; this is its sync status in the secondary zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78])

    source zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78])
                full sync: 0/101 shards
                incremental sync: 100/101 shards
                bucket is behind on 1 shards
                behind shards: [78]

3. As we can see from the above output, the behind shard for the example bucket is not in the list of behind shards in the overall sync status. Why is that?

4. The data sync status for these behind shards doesn't list any "pending_buckets" or "recovering_buckets". An example:

{
    "shard_id": 74,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000003:00000000000003381964",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-09-15T00:00:08.718840Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Shouldn't the not-yet-in-sync buckets be listed here?

5. The sync status of the primary zone differs from the sync status of the secondary zone, with different groups of behind shards. The same is true for the sync status of the same bucket. Is that legitimate? Please see item 1 for the sync status of the secondary zone, and item 6 for the primary zone.

6. Why does the primary zone have behind shards at all, since the replication is from the primary to the secondary?

Primary zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  metadata sync no sync (zone is master)
      data sync source: 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 30 shards
                        behind shards: [6,7,26,28,29,37,47,52,55,56,61,67,68,69,74,79,82,91,95,99,101,104,106,111,112,121,122,123,126,127]

7. We have in-sync buckets that show the correct sync status in the secondary zone but still show behind shards in the primary. Why is that?

Secondary zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])

    source zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])
                full sync: 0/101 shards
                incremental sync: 99/101 shards
                bucket is caught up with source

Primary zone:
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])

    source zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])
                full sync: 0/101 shards
                incremental sync: 97/101 shards
                bucket is behind on 11 shards
                behind shards: [9,11,14,16,22,31,44,45,67,85,90]
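(Related to item 2.) Below is a rough sketch of one way to compare a bucket's contents across the two zones. The bucket name is just the example from item 2, the jq filter assumes the default JSON output of "radosgw-admin bucket list", and this is only an illustration, not the exact procedure we used:

$ # quick object-count comparison: run on a node in each zone and compare
$ sudo radosgw-admin bucket stats --bucket=mixed-5wrks-dev-4k-thisisbcstestload004178 | grep num_objects
$ # full listing: run in each zone, raising --max-entries above the bucket's object count
$ sudo radosgw-admin bucket list --bucket=mixed-5wrks-dev-4k-thisisbcstestload004178 --max-entries=10000000 | jq -r '.[].name' | sort > /tmp/bucket-objects.txt
$ # copy both listings to one host, then diff to get the missing object names
$ diff /tmp/primary-objects.txt /tmp/secondary-objects.txt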
Our primary goals here are:
1. to find out why the replication stopped while the clusters are not in sync;
2. to understand what we need to do to resume the replication, and to make sure it runs to the end without too much lagging;
3. to understand whether all the sync status info is correct; it seems to us there are many conflicts, and some of it doesn't reflect the real status of the clusters at all.

I have opened an issue in the tracker: https://tracker.ceph.com/issues/57562. More info regarding our clusters has been attached to the issue, including the following:
- ceph.conf of the RGWs
- ceph config dump
- ceph versions output
- sync status of the cluster, an in-sync bucket, a not-in-sync bucket, and some behind shards
- bucket list and bucket stats of a not-in-sync bucket, and the stat of a not-in-sync object

Thanks,
Jane

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx