It actually WAS the number of watchers... narf... This is so embarrassing.
Thanks a lot for all your input.
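For reference, the watcher count Boris mentions can be inspected on the RGW
cache-notification objects. A minimal sketch, assuming the default notify.0
through notify.7 objects and a control pool named eu-central-1.rgw.control to
match the zone name in the logs quoted below (both are assumptions and may
differ on other setups):

  # count the watchers registered on each RGW cache-notification object
  for i in $(seq 0 7); do
      echo -n "notify.$i: "
      rados -p eu-central-1.rgw.control listwatchers notify.$i | wc -l
  done

Each running radosgw instance registers a watch on these objects, so the
counts should roughly track the number of live RGW daemons; stale or
unexpectedly large watcher lists go hand in hand with the "failed to
distribute cache" timeouts quoted below.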
On Tue, 11 May 2021 at 13:54, Boris Behrens <bb@xxxxxxxxx> wrote:

> I tried to debug it with --debug-ms=1.
> Maybe someone could help me wrap my head around it?
> https://pastebin.com/LD9qrm3x
>
> On Tue, 11 May 2021 at 11:17, Boris Behrens <bb@xxxxxxxxx> wrote:
>
>> Good call. I just restarted the whole cluster, but the problem still
>> persists.
>> I don't think it is a problem with rados, but with the radosgw.
>>
>> But I still struggle to pin down the issue.
>>
>> On Tue, 11 May 2021 at 10:45, Thomas Schneider
>> <Thomas.Schneider-q2p@xxxxxxxxxxxxxxxxxx> wrote:
>>
>>> Hey all,
>>>
>>> we had slow RGW access when some OSDs were slow due to an (to us)
>>> unknown OSD bug that made PG access either slow or impossible. (It
>>> showed itself through slowness of the mgr as well, but nothing other
>>> than that.)
>>> We restarted all OSDs that held RGW data and the problem was gone.
>>> I have no good way to debug the problem since it never occurred again
>>> after we restarted the OSDs.
>>>
>>> Kind regards,
>>> Thomas
>>>
>>> On 11 May 2021 at 08:47:06 CEST, Boris Behrens <bb@xxxxxxxxx> wrote:
>>> > Hi Amit,
>>> >
>>> > I just pinged the mons from every system and they are all available.
>>> >
>>> > On Mon, 10 May 2021 at 21:18, Amit Ghadge <amitg.b14@xxxxxxxxx> wrote:
>>> >
>>> >> We have seen slowness when one of the mgr services was unreachable;
>>> >> maybe it is different here. You can check the monmap / the mon
>>> >> entries in ceph.conf and then verify that all nodes ping
>>> >> successfully.
>>> >>
>>> >> -AmitG
>>> >>
>>> >> On Tue, 11 May 2021 at 12:12 AM, Boris Behrens <bb@xxxxxxxxx> wrote:
>>> >>
>>> >>> Hi guys,
>>> >>>
>>> >>> does someone have any idea?
>>> >>>
>>> >>> On Wed, 5 May 2021 at 16:16, Boris Behrens <bb@xxxxxxxxx> wrote:
>>> >>>
>>> >>> > Hi,
>>> >>> > for a couple of days we have been experiencing strange slowness
>>> >>> > on some radosgw-admin operations.
>>> >>> > What is the best way to debug this?
>>> >>> >
>>> >>> > For example, creating a user takes over 20 s.
>>> >>> > [root@s3db1 ~]# time radosgw-admin user create --uid test-bb-user
>>> >>> > --display-name=test-bb-user
>>> >>> > 2021-05-05 14:08:14.297 7f6942286840  1 robust_notify: If at first
>>> >>> > you don't succeed: (110) Connection timed out
>>> >>> > 2021-05-05 14:08:14.297 7f6942286840  0 ERROR: failed to distribute
>>> >>> > cache for eu-central-1.rgw.users.uid:test-bb-user
>>> >>> > 2021-05-05 14:08:24.335 7f6942286840  1 robust_notify: If at first
>>> >>> > you don't succeed: (110) Connection timed out
>>> >>> > 2021-05-05 14:08:24.335 7f6942286840  0 ERROR: failed to distribute
>>> >>> > cache for eu-central-1.rgw.users.keys:****
>>> >>> > {
>>> >>> >     "user_id": "test-bb-user",
>>> >>> >     "display_name": "test-bb-user",
>>> >>> >     ....
>>> >>> > }
>>> >>> > real    0m20.557s
>>> >>> > user    0m0.087s
>>> >>> > sys     0m0.030s
>>> >>> >
>>> >>> > At first I thought that rados operations might be slow, but adding
>>> >>> > and deleting objects in rados is as fast as usual (at least from
>>> >>> > my perspective).
>>> >>> > Also, uploading to buckets is fine.
>>> >>> >
>>> >>> > We changed some things and I think it might have to do with this:
>>> >>> > * We have an HAProxy that distributes via leastconn between the 3
>>> >>> >   radosgw's (this did not change)
>>> >>> > * We had a daemon with the name "eu-central-1" running three times
>>> >>> >   (on the 3 radosgw's)
>>> >>> > * Because this might have led to our data duplication problem, we
>>> >>> >   have split that up, so now the daemons are named per host
>>> >>> >   (eu-central-1-s3db1, eu-central-1-s3db2, eu-central-1-s3db3)
>>> >>> > * We also added dedicated rgw daemons for garbage collection,
>>> >>> >   because the current ones were not able to keep up.
>>> >>> > * So basically ceph status went from "rgw: 1 daemon active
>>> >>> >   (eu-central-1)" to "rgw: 14 daemons active (eu-central-1-s3db1,
>>> >>> >   eu-central-1-s3db2, eu-central-1-s3db3, gc-s3db12, gc-s3db13...)"
>>> >>> >
>>> >>> >
>>> >>> > Cheers
>>> >>> >  Boris
>>> >>> >
>>> >>>
>>> >>>
>>> >>> --
>>> >>> The self-help group "UTF-8-Probleme" will meet this time, as an
>>> >>> exception, in the large hall.
>>> >>> _______________________________________________
>>> >>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>> >>>
>>> >>
>>> >
>>>
>>> --
>>> Thomas Schneider
>>> IT.SERVICES
>>> Wissenschaftliche Informationsversorgung Ruhr-Universität Bochum | 44780 Bochum
>>> Phone: +49 234 32 23939
>>> http://www.it-services.rub.de/
>>
>>
>> --
>> The self-help group "UTF-8-Probleme" will meet this time, as an
>> exception, in the large hall.
>
>
> --
> The self-help group "UTF-8-Probleme" will meet this time, as an
> exception, in the large hall.

--
The self-help group "UTF-8-Probleme" will meet this time, as an exception,
in the large hall.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
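Regarding the --debug-ms=1 trace linked earlier in the thread: a slow
radosgw-admin call can be rerun with elevated debug levels to see where the
time is spent. A minimal sketch, not from the original thread; the uid is
just an example and the debug levels are one common choice, not a requirement:

  # capture messenger and RGW debug output for a single slow command
  time radosgw-admin user create --uid test-debug-user \
      --display-name=test-debug-user \
      --debug-ms=1 --debug-rgw=20 2> radosgw-admin-debug.log

The debug output normally lands on stderr here, so redirecting it keeps the
JSON result readable; long gaps before the robust_notify timeouts in such a
trace point at the cache-distribution step rather than at slow rados I/O.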