See http://tracker.ceph.com/issues/22351#note-11

On Wed, Jan 17, 2018 at 10:09 AM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> On Wed, Jan 17, 2018 at 5:41 AM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>> On Wed, Jan 17, 2018 at 2:20 AM, Nikos Kormpakis <nkorb@xxxxxxxxxxxx> wrote:
>>> On 01/16/2018 12:53 AM, Brad Hubbard wrote:
>>>> On Tue, Jan 16, 2018 at 1:35 AM, Alexander Peters <apeters@xxxxxxxxx> wrote:
>>>>> I created the dump output, but it looks very cryptic to me, so I can't really make much sense of it. Is there anything to look for in particular?
>>>>
>>>> Yes, basically we are looking for any line that ends in "= 34". You
>>>> might also find piping it through c++filt helps.
>>>>
>>>> Something like...
>>>>
>>>> $ c++filt </tmp/ltrace.out | grep "= 34"
>>>
>>> Hello,
>>> we're facing the exact same issue. I added some more info about
>>> our cluster and output from ltrace in [1].
>>
>> Unfortunately, the strlen lines in that output are expected.
>>
>> Is it possible for me to access the ltrace output file somehow
>> (you could email it directly or use ceph-post-file perhaps)?
>
> Ah, nm, my bad.
>
> It turns out what we need is the hexadecimal int representation of '-34'.
>
> $ c++filt </tmp/ltrace.out | grep "0xffffffde"
>
> I'll update the tracker accordingly.
>
>>
>>>
>>> Best regards,
>>> Nikos.
>>>
>>> [1] http://tracker.ceph.com/issues/22351
>>>
>>>>>
>>>>> I think I am going to read up on how to interpret ltrace output...
>>>>>
>>>>> BR
>>>>> Alex
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Brad Hubbard" <bhubbard@xxxxxxxxxx>
>>>>> To: "Alexander Peters" <alexander.peters@xxxxxxxxx>
>>>>> CC: "Ceph Users" <ceph-users@xxxxxxxxxxxxxx>
>>>>> Sent: Monday, January 15, 2018 03:09:53
>>>>> Subject: Re: radosgw fails with "ERROR: failed to initialize watch: (34) Numerical result out of range"
>>>>>
>>>>> On Mon, Jan 15, 2018 at 11:38 AM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>>>>>> On Mon, Jan 15, 2018 at 10:38 AM, Alexander Peters
>>>>>> <alexander.peters@xxxxxxxxx> wrote:
>>>>>>> Thanks for the reply - unfortunately the link you sent is behind a paywall,
>>>>>>> so at least for now I can't read it.
>>>>>>
>>>>>> That's why I provided the cause as laid out in that article (pgp_num > pg_num).
>>>>>>
>>>>>> Do you have any settings in ceph.conf related to pg_num or pgp_num?
>>>>>>
>>>>>> If not, please add your details to http://tracker.ceph.com/issues/22351
>>>>>
>>>>> RADOS can return ERANGE (34) in multiple places, so identifying where
>>>>> might be a big step towards working this out.
>>>>>
>>>>> $ ltrace -fo /tmp/ltrace.out /usr/bin/radosgw --cluster ceph --name client.radosgw.ctrl02 --setuser ceph --setgroup ceph -f -d
>>>>>
>>>>> The objective is to find which function(s) return 34.
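For anyone following along, the reason the hex pattern works: librados hands ERANGE back as a negative return value, and -34 as a 32-bit two's-complement word is 0xffffffde, which is how it shows up in the trace. A minimal sketch combining the check and the search, assuming the same /tmp/ltrace.out path as in the ltrace command above:

$ printf '0x%x\n' $(( -34 & 0xffffffff ))        # prints 0xffffffde
$ c++filt </tmp/ltrace.out | grep "0xffffffde"   # demangled lines where a call returned -ERANGE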
>>>>>
>>>>>>
>>>>>>> Output of ceph osd dump shows that pgp_num == pg_num:
>>>>>>>
>>>>>>> [root@ctrl01 ~]# ceph osd dump
>>>>>>> epoch 142
>>>>>>> fsid 0e2d841f-68fd-4629-9813-ab083e8c0f10
>>>>>>> created 2017-12-20 23:04:59.781525
>>>>>>> modified 2018-01-14 21:30:57.528682
>>>>>>> flags sortbitwise,recovery_deletes,purged_snapdirs
>>>>>>> crush_version 6
>>>>>>> full_ratio 0.95
>>>>>>> backfillfull_ratio 0.9
>>>>>>> nearfull_ratio 0.85
>>>>>>> require_min_compat_client jewel
>>>>>>> min_compat_client jewel
>>>>>>> require_osd_release luminous
>>>>>>> pool 1 'glance' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 119 flags hashpspool stripe_width 0 application rbd
>>>>>>> removed_snaps [1~3]
>>>>>>> pool 2 'cinder-2' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 120 flags hashpspool stripe_width 0 application rbd
>>>>>>> removed_snaps [1~3]
>>>>>>> pool 3 'cinder-3' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 121 flags hashpspool stripe_width 0 application rbd
>>>>>>> removed_snaps [1~3]
>>>>>>> pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 94 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
>>>>>>> max_osd 3
>>>>>>> osd.0 up in weight 1 up_from 82 up_thru 140 down_at 79 last_clean_interval [23,78) 10.16.0.11:6800/1795 10.16.0.11:6801/1795 10.16.0.11:6802/1795 10.16.0.11:6803/1795 exists,up abe33844-6d98-4ede-81a8-a8bdc92dada8
>>>>>>> osd.1 up in weight 1 up_from 73 up_thru 140 down_at 71 last_clean_interval [55,72) 10.16.0.13:6800/1756 10.16.0.13:6804/1001756 10.16.0.13:6805/1001756 10.16.0.13:6806/1001756 exists,up 0dab9372-6ffe-4a23-a8b7-4edca3745a2a
>>>>>>> osd.2 up in weight 1 up_from 140 up_thru 140 down_at 133 last_clean_interval [31,132) 10.16.0.12:6800/1749 10.16.0.12:6801/1749 10.16.0.12:6802/1749 10.16.0.12:6803/1749 exists,up 220bba17-8119-4035-9e43-5b8eaa27562f
>>>>>>>
>>>>>>>
>>>>>>> On 15.01.2018 at 01:33, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Mon, Jan 15, 2018 at 8:34 AM, Alexander Peters
>>>>>>> <alexander.peters@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Hello
>>>>>>>
>>>>>>> I am currently experiencing a strange issue with my radosgw. It fails to
>>>>>>> start, and all it says is:
>>>>>>> [root@ctrl02 ~]# /usr/bin/radosgw --cluster ceph --name client.radosgw.ctrl02 --setuser ceph --setgroup ceph -f -d
>>>>>>> 2018-01-14 21:30:57.132007 7f44ddd18e00 0 deferred set uid:gid to 167:167 (ceph:ceph)
>>>>>>> 2018-01-14 21:30:57.132161 7f44ddd18e00 0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 13928
>>>>>>> 2018-01-14 21:30:57.556672 7f44ddd18e00 -1 ERROR: failed to initialize watch: (34) Numerical result out of range
>>>>>>> 2018-01-14 21:30:57.558752 7f44ddd18e00 -1 Couldn't init storage provider (RADOS)
>>>>>>>
>>>>>>> (When started via systemctl it writes the same lines to the logfile.)
>>>>>>>
>>>>>>> The strange thing is that it is working on another env that was installed
>>>>>>> with the same set of Ansible playbooks.
>>>>>>> OS is CentOS Linux release 7.4.1708 (Core)
>>>>>>>
>>>>>>> Ceph is up and running (I am currently using it for storing volumes and
>>>>>>> images from OpenStack).
>>>>>>>
>>>>>>> Does anyone have an idea how to debug this?
>>>>>>>
>>>>>>>
>>>>>>> According to https://access.redhat.com/solutions/2778161 this can
>>>>>>> happen if your pgp_num is higher than the pg_num.
>>>>>>>
>>>>>>> Check "ceph osd dump" output for that possibility.
>>>>>>>
>>>>>>>
>>>>>>> Best Regards
>>>>>>> Alexander
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Cheers,
>>>>>>> Brad
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>> Brad
>>>>>
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> Brad
>>>>
>>>>
>>>
>>>
>>> --
>>> Nikos Kormpakis - nkorb@xxxxxxxxxxxx
>>> Network Operations Center, Greek Research & Technology Network
>>> Tel: +30 210 7475712 - http://www.grnet.gr
>>> 7, Kifisias Av., 115 23 Athens, Greece
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> --
>> Cheers,
>> Brad
>
>
> --
> Cheers,
> Brad

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
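For completeness, a quick way to test the pgp_num vs pg_num theory from the Red Hat article quoted above on a live cluster. This is only a minimal sketch using standard ceph CLI commands; the <pool> in the commented fix is a placeholder for whichever pool turns out to be inconsistent:

# Compare pg_num and pgp_num for every pool; a healthy cluster reports matching values.
for pool in $(ceph osd pool ls); do
    printf '%s: ' "$pool"
    echo "$(ceph osd pool get "$pool" pg_num), $(ceph osd pool get "$pool" pgp_num)"
done

# If a pool shows pgp_num out of step with pg_num, bring them back in line, e.g.:
# ceph osd pool set <pool> pgp_num <value-of-pg_num>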