Hello Cephers,
We've just configured multi-site replication between an existing Octopus
cluster and a new one in another datacenter.
On our biggest bucket (replic_cfn_prod/cfb: 1.4 M objects, 670 GB) we
see many errors like these on the new site:
2022-03-15T10:21:00.800+0100 7f9834750700 1 ====== starting new request
req=0x7f9988470620 =====
2022-03-15T10:21:00.800+0100 7f9834750700 1 ====== req done
req=0x7f9988470620 op status=0 http_status=200 latency=0s ======
2022-03-15T10:21:00.800+0100 7f9834750700 1 beast: 0x7f9988470620:
100.69.103.105 - - [2022-03-15T10:21:00.800105+0100] "HEAD / HTTP/1.0"
200 5 - - -
2022-03-15T10:21:00.912+0100 7f998b7fe700 0 ERROR: curl error:
Transferred a partial file, maybe network unstable
2022-03-15T10:21:00.912+0100 7f996b7fe700 0 store->fetch_remote_obj()
returned r=-11
2022-03-15T10:21:01.068+0100 7f98bf7fe700 0
RGW-SYNC:data:sync:shard[75]:entry[replic_cfn_prod/cfb:aefd4003-1866-4b16-b1b3-2f308075cd1c.9774642.1:49]:bucket_sync_sources[target=:[]):source_bucket=:[]):source_zone=aefd4003-1866-4b16-b1b3-2f308075cd1c]:bucket[replic_cfn_prod/cfb:aefd4003-1866-4b16-b1b3-2f308075cd1c.9774642.1:49<-replic_cfn_prod/cfb:aefd4003-1866-4b16-b1b3-2f308075cd1c.9774642.1:49]:inc_sync[replic_cfn_prod/cfb:aefd4003-1866-4b16-b1b3-2f308075cd1c.9774642.1:49]:entry[10264336-2213-11ec-a9c2-fa163e1fd8b9/615ac4886acbe-cloture-littoral-949-86.pdf]:
ERROR: failed to sync object:
replic_cfn_prod/cfb:aefd4003-1866-4b16-b1b3-2f308075cd1c.9774642.1:49/10264336-2213-11ec-a9c2-fa163e1fd8b9/615ac4886acbe-xxx-949-86.pdf
2022-03-15T10:21:01.140+0100 7f998b7fe700 0 ERROR: curl error:
Transferred a partial file, maybe network unstable
2022-03-15T10:21:01.140+0100 7f99727fc700 0 store->fetch_remote_obj()
returned r=-11
2022-03-15T10:21:01.428+0100 7f998b7fe700 0 ERROR: curl error:
Transferred a partial file, maybe network unstable
2022-03-15T10:21:01.432+0100 7f99677f6700 0 store->fetch_remote_obj()
returned r=-11
The network seems good: I do see some retries when I benchmark with
iperf3 and 10 simultaneous transfers, but manual curl requests work.
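For reference, this is roughly how I tested (the address and endpoint
below are placeholders, not our real ones):

  # 10 parallel iperf3 streams towards a node in the other datacenter
  iperf3 -c 198.51.100.20 -P 10 -t 60
  # a manual request against the remote RGW endpoint answers fine
  curl -v -o /dev/null http://rgw.remote.example:8080/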
One thing I noticed: even when radosgw / beast is bound to a specific
interface/IP (the storage front end), it uses another source IP for
outgoing communication if the default gateway is not on the storage
network.
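You can see which source address the kernel will pick for a given
destination with `ip route get` (destination, interface name and
addresses below are placeholders):

  # "src" in the output is the source IP the kernel will use; with the
  # default gateway on the management network you get something like:
  #   198.51.100.20 via 10.0.0.1 dev mgmt0 src 10.0.0.15
  ip route get 198.51.100.20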
In our configuration, we have 3 networks:
- management (for ssh, etc.; it carries the default gateway)
- storage (Ceph "public" network)
- replication (Ceph cluster network)
A ping from one node to another over the storage network has the
management IP as its source address.
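(Forcing the source address with `ping -I` behaves as expected;
addresses below are placeholders.)

  # default source selection -> management IP
  ping -c 3 192.168.50.22
  # force the storage IP as source
  ping -c 3 -I 192.168.50.21 192.168.50.22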
We added source routing on the nodes so that responses on the storage
network leave via the storage network.
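Roughly what we added, with example addresses (adjust the table number,
interface and addresses to your setup):

  # dedicated routing table for traffic sourced from the storage IP
  ip rule add from 192.168.50.21/32 table 100
  # in that table, send everything through the storage network's gateway
  ip route add default via 192.168.50.1 dev storage0 table 100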
Since radosgw is bound to our storage network IP (we can see it with
netstat / lsof), I expected outgoing packets to use that IP. But no:
multi-site replication requests are made with a source IP from the
management network.
That doesn't sound right, so I added a static route to the other zone's
storage subnet via the storage network.
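Something along these lines (the remote storage subnet, gateway and
local IP are placeholders):

  # reach the other zone's storage subnet via the storage network and
  # prefer the local storage IP as the source address
  ip route add 192.168.60.0/24 via 192.168.50.1 dev storage0 src 192.168.50.21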
At first it seemed better, but after some time I ended up with the same
errors again.
Does anyone have a clue about this curl error: "Transferred a partial
file, maybe network unstable"?
PS:
On the first cluster, we had problems with large omaps, and resharding
would not complete because of bi_list() I/O errors
(https://tracker.ceph.com/issues/51429).
We re-created the bucket and manually sharded it before activating the
multi-site sync.
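(The manual reshard was basically this; the shard count here is just an
example:)

  radosgw-admin bucket reshard --bucket=replic_cfn_prod/cfb --num-shards=101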
We still have 6 large OMAP objects. I don't know whether these are old
ones that will be cleared by the deep-scrub (currently running on the
PGs concerned).
Perhaps it can cause issues, but `radosgw-admin bi list` is now running
fine on the bucket.
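For completeness, the checks I mean (the PG id is just an example):

  # bucket index listing now completes without I/O errors
  radosgw-admin bi list --bucket=replic_cfn_prod/cfb > /dev/null
  # deep-scrub one of the PGs flagged for large omap objects
  ceph pg deep-scrub 7.1a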