Hi Julian,

Thanks for your reply. We are using tenant-enabled RGW for now. :( I will try to use Ceph 16 as the secondary cluster to do the testing. If it works, I will upgrade the master cluster to Ceph v16 too.

Have a good day.

On Tue, 1 Mar 2022 at 17:42, Poß, Julian <julian.poss@xxxxxxx> wrote:

> Hey,
>
> my cluster is only a test installation, to generally verify an RGW multisite design, so there is no production data on it.
> Therefore my “solution” was to create an RGW S3 user without a tenant. So instead of
>
> radosgw-admin user create --tenant=test --uid=test --display-name=test --access_key=123 --secret_key=123 --rgw_realm internal
>
> I created a user like this:
>
> radosgw-admin user create --uid=test2 --display-name=test2 --access_key=1234 --secret_key=1234 --rgw_realm internal
>
> That worked for me to verify the problem. Unfortunately this is most likely not going to be a solution for you, and it isn’t for me either. But knowing this from my test setup, I can take precautions for the production installation and install that with the v16 release instead.
> I’ll probably verify that this is fixed in the latest v16 release, too, before installing the production clusters.
>
> Best, Julian
>
> From: Te Mule <twl007@xxxxxxxxx>
> Sent: Tuesday, 1 March 2022 10:26
> To: Poß, Julian <julian.poss@xxxxxxx>
> Cc: Eugen Block <eblock@xxxxxx>; ceph-users@xxxxxxx
> Subject: Re: Multisite sync issue
>
> Hi Julian,
>
> could you share your solution for this? We are also trying to find a solution.
>
> Thanks
>
> On 1 Mar 2022, at 17:18, Poß, Julian <julian.poss@xxxxxxx> wrote:
>
> Thanks a ton for pointing this out.
> Just verified this with an RGW user without a tenant; it works perfectly, as you would expect.
> I guess I could have suspected that tenants have something to do with it, since I spotted issues with them in the past, too.
> Anyways, I got my “solution”. Thanks again!
>
> Best, Julian
>
> From: Mule Te (TWL007) <twl007@xxxxxxxxx>
> Sent: Friday, 25 February 2022 19:45
> To: Poß, Julian <julian.poss@xxxxxxx>
> Cc: Eugen Block <eblock@xxxxxx>; ceph-users@xxxxxxx
> Subject: Re: Multisite sync issue
>
> We have the same issue on Ceph 15.2.15.
>
> In our testing cluster, it seems like Ceph 16 solved this issue. The PR https://github.com/ceph/ceph/pull/41316 seems to fix it, but I do not know why it has not been merged back to Ceph 15.
>
> Also, here is a new issue in the Ceph tracker that describes the same issue you have: https://tracker.ceph.com/issues/53737
>
> Thanks
>
> On Feb 25, 2022, at 10:07 PM, Poß, Julian <julian.poss@xxxxxxx> wrote:
>
> As far as I can tell, it can be reproduced every time, yes.
>
> That statement was actually about two RGWs in one zone. That is also something that I tested, because I felt like Ceph should be able to handle that HA-like on its own.
> But for the main issue, there is indeed only one RGW running in each zone. And as far as I can tell, I see no issues other than what I posted in my initial mail.
>
> Best, Julian
>
> -----Original Message-----
> From: Eugen Block <eblock@xxxxxx>
> Sent: Friday, 25 February 2022 12:57
> To: ceph-users@xxxxxxx
> Subject: Re: FW: Multisite sync issue
>
> I see, then I misread your statement about multiple RGWs:
>
> > It also worries me that replication won't work with multiple RGWs in one zone, but one of them being unavailable, for instance during maintenance.
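> Just to make sure we mean the same thing: what I had in mind is the static endpoint list in the zone configuration, where all RGWs of a zone can be registered. Roughly like this (zone name and addresses are only placeholders, not taken from your setup):
>
> # register several RGWs as endpoints of one zone, then commit the period
> radosgw-admin zone modify --rgw-zone=secondary --endpoints=http://rgw1:8080,http://rgw2:8080 --rgw_realm internal
> radosgw-admin period update --commit --rgw_realm internal
>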
> Is there anything other than the RGW logs pointing to any issues? I find it strange that a restart of the RGW fixes it. Is this always reproducible?
>
> Quoting "Poß, Julian" <julian.poss@xxxxxxx>:
>
> Hi Eugen,
>
> there is currently only one RGW installed for each region+realm, so the places to look at are already pretty much limited.
> As of now, the RGWs themselves are the endpoints. So far, no loadbalancer has been put in place there.
>
> Best, Julian
>
> -----Original Message-----
> From: Eugen Block <eblock@xxxxxx>
> Sent: Friday, 25 February 2022 10:52
> To: ceph-users@xxxxxxx
> Subject: Re: FW: Multisite sync issue
>
> Hi,
>
> I would stop all RGWs except one in each cluster to limit the places and logs to look at. Do you have a loadbalancer as endpoint, or do you have a list of all RGWs as endpoints?
>
> Quoting "Poß, Julian" <julian.poss@xxxxxxx>:
>
> Hi,
>
> I set up multisite with two Ceph clusters and multiple RGWs and realms/zonegroups.
> This setup was installed using the ceph-ansible branch "stable-5.0", with focal+octopus.
> During some testing, I noticed that the replication somehow does not seem to work as expected.
>
> With s3cmd, I put a small file of 1.9 kB into a bucket on the master zone:
>
> s3cmd put /etc/hosts s3://test/
>
> Then I can see in the output of "radosgw-admin sync status --rgw_realm internal" that the cluster does indeed have something to sync, and that it switches back to "nothing to sync" after a couple of seconds.
> "radosgw-admin sync error list --rgw_realm internal" is empty, too.
> However, if I look via s3cmd on the secondary zone, I can't see the file. Even if I look at the Ceph pools directly, the data didn't get replicated.
> If I proceed by uploading the file again with the same command and without a change, basically just updating it, or by restarting the RGW daemon of the secondary zone, the affected file gets replicated.
>
> I spotted this issue with all my realms/zonegroups. But even with "debug_rgw = 20" and "debug_rgw_sync = 20" I can't spot any obvious errors in the logs.
>
> It also worries me that replication won't work with multiple RGWs in one zone but one of them being unavailable, for instance during maintenance.
> I did somehow expect Ceph to work its way through the list of available endpoints, and only fail if none are available.
> ...Or am I missing something here?
>
> Any help whatsoever is very much appreciated.
> I am pretty new to multisite and have been stuck on this for a couple of days now.
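>
> For reference, my test boils down to more or less these few commands (bucket "test" and realm "internal" are just the names from my test setup):
>
> # upload a small object to the master zone
> s3cmd put /etc/hosts s3://test/
> # check the sync state and the error list
> radosgw-admin sync status --rgw_realm internal
> radosgw-admin sync error list --rgw_realm internal
> # compare the bucket index log on both zones
> radosgw-admin bilog list --bucket test/test --rgw_realm internal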
>
> Thanks, Julian
>
> Here is some additional information, including some log snippets:
>
> # On the master site, I can see the file in the bilog right away:
> radosgw-admin bilog list --bucket test/test --rgw_realm internal
> {
>     "op_id": "3#00000000001.445.5",
>     "op_tag": "b9794e07-8f6c-4c45-a981-a73c3a4dc863.8360.106",
>     "op": "write",
>     "object": "hosts",
>     "instance": "",
>     "state": "complete",
>     "index_ver": 1,
>     "timestamp": "2022-02-24T09:14:41.957638774Z",
>     "ver": {
>         "pool": 7,
>         "epoch": 2
>     },
>     "bilog_flags": 0,
>     "versioned": false,
>     "owner": "",
>     "owner_display_name": "",
>     "zones_trace": [
>         {
>             "entry": "b9794e07-8f6c-4c45-a981-a73c3a4dc863:test/test:b9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3"
>         }
>     ]
> },
>
> # The RGW log of the secondary zone shows the sync attempt:
> 2022-02-24T09:14:52.502+0000 7f1419ff3700 0 RGW-SYNC:data:sync:shard[72]:entry[test/test:b9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3:3]: triggering sync of source bucket/shard test/test:b9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3:3
>
> # But the secondary zone doesn't actually show the new file in the bilog:
> radosgw-admin bilog list --bucket test/test --rgw_realm internal
>
> # And the shard log that, according to the logfile, had the data to sync in it doesn't even seem to exist on the secondary zone:
> radosgw-admin datalog list --shard-id 72 --rgw_realm internal
> ERROR: list_bi_log_entries(): (2) No such file or directory
>
> # RGW log at the master zone; there is one 404 in there which worries me a bit:
> 2022-02-24T09:14:52.515+0000 7ff5816e2700 1 beast: 0x7ff6387f77c0: 192.168.85.71 - - [2022-02-24T09:14:52.515949+0000] "GET /admin/log/?type=bucket-index&bucket-instance=test%2Ftest%3Ab9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3%3A3&info&rgwx-zonegroup=7d35d818-0881-483a-b1bf-47ec21f26609 HTTP/1.1" 200 94 - - -
> 2022-02-24T09:14:52.527+0000 7ff512604700 1 beast: 0x7ff6386747c0: 192.168.85.71 - - [2022-02-24T09:14:52.527950+0000] "GET /test?rgwx-bucket-instance=test%2Ftest%3Ab9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3%3A3&versions&format=json&objs-container=true&key-marker&version-id-marker&rgwx-zonegroup=7d35d818-0881-483a-b1bf-47ec21f26609 HTTP/1.1" 404 146 - - -
> 2022-02-24T09:14:52.535+0000 7ff559e93700 1 beast: 0x7ff6386747c0: 192.168.85.71 - - [2022-02-24T09:14:52.535950+0000] "GET /admin/log?bucket-instance=test%2Ftest%3Ab9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3%3A3&format=json&marker=00000000001.445.5&type=bucket-index&rgwx-zonegroup=7d35d818-0881-483a-b1bf-47ec21f26609 HTTP/1.1" 200 2 - - -
>
> # If I update the file by re-uploading it, or restart the RGW daemon of the secondary zone, the affected file gets synced:
> s3cmd put /etc/hosts s3://test/
>
> # Again, there is the sync attempt from the secondary zone RGW:
> 2022-02-24T12:04:52.452+0000 7f1419ff3700 0 RGW-SYNC:data:sync:shard[72]:entry[test/test:b9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3:3]: triggering sync of source bucket/shard test/test:b9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3:3
>
> # But now the file does show up in the bilog and the data log:
> radosgw-admin bilog list --bucket test/test --rgw_realm internal
> {
>     "op_id": "3#00000000001.456.5",
>     "op_tag": "_e1zRfGuaFH7mLumu1gapeLzHo9zYU6M",
>     "op": "write",
>     "object": "hosts",
>     "instance": "",
>     "state": "complete",
>     "index_ver": 1,
>     "timestamp": "2022-02-24T12:04:38.405141253Z",
>     "ver": {
>         "pool": 7,
>         "epoch": 2
>     },
>     "bilog_flags": 0,
>     "versioned": false,
>     "owner": "",
"owner_display_name": "", > "zones_trace": [ > { > "entry": > > "b9794e07-8f6c-4c45-a981-a73c3a4dc863:test/test:b9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3" > }, > { > "entry": > > "e00c182b-27dc-4500-ad5b-77719f615d76:test/test:b9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3" > }, > { > "entry": > > "e00c182b-27dc-4500-ad5b-77719f615d76:test/test:b9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3:3" > } > ] > } > > radosgw-admin datalog list --shard-id 72 --rgw_realm internal > { > "entity_type": "bucket", > "key": "test/test:b9794e07-8f6c-4c45-a981-a73c3a4dc863.8366.3:3", > "timestamp": "2022-02-24T12:04:52.522177Z" > } > > > > > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an > email to ceph-users-leave@xxxxxxx > > > > > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an > email to ceph-users-leave@xxxxxxx > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx > To unsubscribe send an email to ceph-users-leave@xxxxxxx > > > > _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx