Re: radosgw-admin user create takes a long time (with failed to distribute cache message)

Boris Behrens <bb@xxxxxxxxx> · Tue, 11 May 2021 11:18:35 +0200

Hi Amit,
it is the same physical interface but different VLANs. I checked all IP
addresses on all systems and everything is directly connected, without any
gateway hops.
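
A note on the timing in the quoted report further down: the two robust_notify timeouts are about 10 s apart, which would account for most of the 20 s total if each cache notify waits roughly that long before giving up. The following is only a minimal sketch for measuring that gap from saved radosgw-admin output. The function name notify_gaps is hypothetical (not part of Ceph), and it assumes the timestamp is the second whitespace-separated field and that the run does not cross midnight.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Ceph): read radosgw-admin output on stdin
# and print the seconds elapsed between consecutive robust_notify lines.
notify_gaps() {
  awk '/robust_notify/ {
         split($2, t, ":")                      # field 2 is HH:MM:SS.mmm
         now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
         if (seen) printf "%.3f\n", now - prev
         prev = now; seen = 1
       }'
}

# Log lines from the original report, rejoined where the mail client wrapped them.
notify_gaps <<'EOF'
2021-05-05 14:08:14.297 7f6942286840  1 robust_notify: If at first you don't succeed: (110) Connection timed out
2021-05-05 14:08:14.297 7f6942286840  0 ERROR: failed to distribute cache for eu-central-1.rgw.users.uid:test-bb-user
2021-05-05 14:08:24.335 7f6942286840  1 robust_notify: If at first you don't succeed: (110) Connection timed out
2021-05-05 14:08:24.335 7f6942286840  0 ERROR: failed to distribute cache for eu-central-1.rgw.users.keys:****
EOF
# prints 10.038
```

If the gap stays close to a fixed value across runs, that points at a notify timeout rather than at generally slow rados operations, which matches the observation that plain object writes are fast.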

On Tue, 11 May 2021 at 10:59, Amit Ghadge <amitg.b14@xxxxxxxxx> wrote:

> I hope you are using a single network interface for both the public and
> cluster networks?
>
> On Tue, May 11, 2021 at 2:15 PM Thomas Schneider <Thomas.Schneider-q2p@xxxxxxxxxxxxxxxxxx> wrote:
>
>> Hey all,
>>
>> we had slow RGW access when some OSDs were slow due to an OSD bug (unknown
>> to us) that made PG access either slow or impossible. (It also showed up as
>> slowness of the mgr, but nothing other than that.)
>> We restarted all OSDs that held RGW data and the problem was gone.
>> I have no good way to debug the problem since it never occurred again
>> after we restarted the OSDs.
>>
>> Kind regards,
>> Thomas
>>
>>
>> On 11 May 2021 08:47:06 CEST, Boris Behrens <bb@xxxxxxxxx> wrote:
>> >Hi Amit,
>> >
>> >I just pinged the mons from every system and they are all available.
>> >
>> >On Mon, 10 May 2021 at 21:18, Amit Ghadge <amitg.b14@xxxxxxxxx> wrote:
>> >
>> >> We have seen slowness caused by one of the mgr services being
>> >> unreachable. It may be different here, but you can check the monmap /
>> >> ceph.conf mon entries and then verify that all nodes respond to ping.
>> >>
>> >>
>> >> -AmitG
>> >>
>> >>
>> >> On Tue, 11 May 2021 at 12:12 AM, Boris Behrens <bb@xxxxxxxxx> wrote:
>> >>
>> >>> Hi guys,
>> >>>
>> >>> does anyone have an idea?
>> >>>
>> >>> On Wed, 5 May 2021 at 16:16, Boris Behrens <bb@xxxxxxxxx> wrote:
>> >>>
>> >>> > Hi,
>> >>> > for a couple of days we have been experiencing strange slowness in
>> >>> > some radosgw-admin operations.
>> >>> > What is the best way to debug this?
>> >>> >
>> >>> > For example, creating a user takes over 20 s.
>> >>> > [root@s3db1 ~]# time radosgw-admin user create --uid test-bb-user --display-name=test-bb-user
>> >>> > 2021-05-05 14:08:14.297 7f6942286840  1 robust_notify: If at first you don't succeed: (110) Connection timed out
>> >>> > 2021-05-05 14:08:14.297 7f6942286840  0 ERROR: failed to distribute cache for eu-central-1.rgw.users.uid:test-bb-user
>> >>> > 2021-05-05 14:08:24.335 7f6942286840  1 robust_notify: If at first you don't succeed: (110) Connection timed out
>> >>> > 2021-05-05 14:08:24.335 7f6942286840  0 ERROR: failed to distribute cache for eu-central-1.rgw.users.keys:****
>> >>> > {
>> >>> >     "user_id": "test-bb-user",
>> >>> >     "display_name": "test-bb-user",
>> >>> >    ....
>> >>> > }
>> >>> > real 0m20.557s
>> >>> > user 0m0.087s
>> >>> > sys 0m0.030s
>> >>> >
>> >>> > At first I thought that rados operations might be slow, but adding
>> >>> > and deleting objects in rados is as fast as usual (at least from my
>> >>> > perspective).
>> >>> > Uploading to buckets is also fine.
>> >>> >
>> >>> > We changed some things and I think it might have to do with this:
>> >>> > * We have an HAProxy that distributes via leastconn between the 3
>> >>> > radosgws (this did not change)
>> >>> > * We had three daemons with the same name "eu-central-1" running
>> >>> > (one on each of the 3 radosgws)
>> >>> > * Because this might have led to our data duplication problem, we
>> >>> > split that up, so the daemons are now named per host
>> >>> > (eu-central-1-s3db1, eu-central-1-s3db2, eu-central-1-s3db3)
>> >>> > * We also added dedicated rgw daemons for garbage collection,
>> >>> > because the current ones were not able to keep up.
>> >>> > * So basically ceph status went from "rgw: 1 daemon active
>> >>> > (eu-central-1)" to "rgw: 14 daemons active (eu-central-1-s3db1,
>> >>> > eu-central-1-s3db2, eu-central-1-s3db3, gc-s3db12, gc-s3db13...)"
>> >>> >
>> >>> >
>> >>> > Cheers
>> >>> >  Boris
>> >>> >
>> >>>
>> >>>
>> >>> --
>> >>> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend
>> im
>> >>> groÃƒ¼en Saal.
>> >>> _______________________________________________
>> >>> ceph-users mailing list -- ceph-users@xxxxxxx
>> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> >>>
>> >>
>> >
>>
>> --
>> Thomas Schneider
>> IT.SERVICES
>> Wissenschaftliche Informationsversorgung Ruhr-Universität Bochum | 44780
>> Bochum
>> Telefon: +49 234 32 23939
>> http://www.it-services.rub.de/
>>
>

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groÃƒ¼en Saal.