Hello Cephers,
We've just configured multi-site replication between an existing Octopus
cluster and a new one in another datacenter.
On our biggest bucket (replic_cfn_prod/cfb: 1.4 M objects, 670 GB) we
see many errors like these on the new site:
2022-03-15T10:21:00.800+0100 7f9834750700 1 ====== starting new request
req=0x7f9988470620 =====
2022-03-15T10:21:00.800+0100 7f9834750700 1 ====== req done
req=0x7f9988470620 op status=0 http_status=200 latency=0s ======
2022-03-15T10:21:00.800+0100 7f9834750700 1 beast: 0x7f9988470620:
100.69.103.105 - - [2022-03-15T10:21:00.800105+0100] "HEAD / HTTP/1.0"
200 5 - - -
2022-03-15T10:21:00.912+0100 7f998b7fe700 0 ERROR: curl error:
Transferred a partial file, maybe network unstable
2022-03-15T10:21:00.912+0100 7f996b7fe700 0 store->fetch_remote_obj()
returned r=-11
2022-03-15T10:21:01.068+0100 7f98bf7fe700 0
RGW-SYNC:data:sync:shard[75]:entry[replic_cfn_prod/cfb:aefd4003-1866-4b16-b1b3-2f308075cd1c.9774642.1:49]:bucket_sync_sources[target=:[]):source_bucket=:[]):source_zone=aefd4003-1866-4b16-b1b3-2f308075cd1c]:bucket[replic_cfn_prod/cfb:aefd4003-1866-4b16-b1b3-2f308075cd1c.9774642.1:49<-replic_cfn_prod/cfb:aefd4003-1866-4b16-b1b3-2f308075cd1c.9774642.1:49]:inc_sync[replic_cfn_prod/cfb:aefd4003-1866-4b16-b1b3-2f308075cd1c.9774642.1:49]:entry[10264336-2213-11ec-a9c2-fa163e1fd8b9/615ac4886acbe-cloture-littoral-949-86.pdf]:
ERROR: failed to sync object:
replic_cfn_prod/cfb:aefd4003-1866-4b16-b1b3-2f308075cd1c.9774642.1:49/10264336-2213-11ec-a9c2-fa163e1fd8b9/615ac4886acbe-xxx-949-86.pdf
2022-03-15T10:21:01.140+0100 7f998b7fe700 0 ERROR: curl error:
Transferred a partial file, maybe network unstable
2022-03-15T10:21:01.140+0100 7f99727fc700 0 store->fetch_remote_obj()
returned r=-11
2022-03-15T10:21:01.428+0100 7f998b7fe700 0 ERROR: curl error:
Transferred a partial file, maybe network unstable
2022-03-15T10:21:01.432+0100 7f99677f6700 0 store->fetch_remote_obj()
returned r=-11
The network seems good: I do see some retries when I benchmark with
iperf3 and 10 simultaneous transfers, but manual curl requests work.
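For reference, this is roughly how I tested (the address and endpoint
below are placeholders, not our real ones):

  # 10 parallel iperf3 streams towards a node in the other datacenter
  iperf3 -c 198.51.100.20 -P 10 -t 60
  # a manual request against the remote RGW endpoint answers fine
  curl -v -o /dev/null http://rgw.remote.example:8080/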
One thing I noticed: even when radosgw / beast is bound to a specific
interface/IP (the storage front end), it uses another source IP for
outgoing communication if the default gateway is not on the storage
network.
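You can see which source address the kernel will pick for a given
destination with `ip route get` (destination, interface name and
addresses below are placeholders):

  # "src" in the output is the source IP the kernel will use; with the
  # default gateway on the management network you get something like:
  #   198.51.100.20 via 10.0.0.1 dev mgmt0 src 10.0.0.15
  ip route get 198.51.100.20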
In our configuration, we have 3 networks:
- management (for ssh, etc.; it carries the default gateway)
- storage (Ceph "public" network)
- replication (Ceph cluster network)
A ping from one node to another over the storage network has the
management IP as its source address.
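(Forcing the source address with `ping -I` behaves as expected;
addresses below are placeholders.)

  # default source selection -> management IP
  ping -c 3 192.168.50.22
  # force the storage IP as source
  ping -c 3 -I 192.168.50.21 192.168.50.22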
We added source routing on the nodes so that responses on the storage
network leave via the storage network.
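Roughly what we added, with example addresses (adjust the table number,
interface and addresses to your setup):

  # dedicated routing table for traffic sourced from the storage IP
  ip rule add from 192.168.50.21/32 table 100
  # in that table, send everything through the storage network's gateway
  ip route add default via 192.168.50.1 dev storage0 table 100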
Since radosgw is bound to our storage network IP (we can see it with
netstat / lsof), I expected outgoing packets to use that IP. But no:
multi-site replication requests are made with a source IP from the
management network.
That doesn't sound right, so I added a static route to the other zone's
storage subnet via the storage network.
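Something along these lines (the remote storage subnet, gateway and
local IP are placeholders):

  # reach the other zone's storage subnet via the storage network and
  # prefer the local storage IP as the source address
  ip route add 192.168.60.0/24 via 192.168.50.1 dev storage0 src 192.168.50.21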
At first it seemed better, but after some time I ended up with the same
errors again.
Does anyone have a clue about this curl error: "Transferred a partial
file, maybe network unstable"?
PS:
On the first cluster, we had problems with large omaps, and resharding
would not complete because of bi_list() I/O errors
(https://tracker.ceph.com/issues/51429).
We re-created the bucket and manually sharded it before activating the
multi-site sync.
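(The manual reshard was basically this; the shard count here is just an
example:)

  radosgw-admin bucket reshard --bucket=replic_cfn_prod/cfb --num-shards=101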
We still have 6 large OMAP objects. I don't know whether these are old
ones that will be cleared by the deep-scrub (currently running on the
PGs concerned).
Perhaps it can cause issues, but `radosgw-admin bi list` is now running
fine on the bucket.
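For completeness, the checks I mean (the PG id is just an example):

  # bucket index listing now completes without I/O errors
  radosgw-admin bi list --bucket=replic_cfn_prod/cfb > /dev/null
  # deep-scrub one of the PGs flagged for large omap objects
  ceph pg deep-scrub 7.1a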