> Yes, I have inactive PGs when the OSDs go down. Then I started the OSDs
> manually, but the RGWs fail to start.
But do the OSDs stay online if you start them manually? Do the inactive
PGs recover when you start them manually? By the way, you should check
your CRUSH rules; depending on how many OSDs fail at once, you may have
room for improvement there. And why do the OSDs fail to restart
automatically? What's in the logs?
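For example (a sketch, assuming a cephadm-deployed Quincy cluster;
osd.12 is just a placeholder id):

  ceph health detail            # names the inactive PGs and the reason
  ceph pg dump_stuck inactive   # lists PGs stuck in an inactive state
  ceph osd tree down            # shows which OSDs are currently down
  ceph osd crush rule dump      # the CRUSH rules your pools are using
  cephadm logs --name osd.12    # run on the OSD's host to see why it died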
> Only upgrading to a newer version fixes the issue, and we have faced this
> issue two times.
What versions are you using (ceph versions)?
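That is, the output of "ceph versions"; it shows whether all daemons run
the same release. After a partial upgrade a mixed output would look
roughly like this (illustrative values only):

  {
      "mon": { "ceph version 17.2.1 (...) quincy (stable)": 5 },
      "osd": { "ceph version 17.2.0 (...) quincy (stable)": 238 },
      ...
  }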
> I don't know why it is happening. But maybe it's because the RGWs are
> running on separate machines? Could this cause the issue?
I don't know how that should matter.
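One thing that stands out in the log below: both "Initialization
timeout, failed to initialize" entries come exactly 300 seconds after
the respective RGW startup (12:03:42 -> 12:08:42 and 12:08:53 ->
12:13:53), which matches the default rgw_init_timeout of 300 seconds.
So the RGWs are most likely blocking on the inactive PGs until the
timeout aborts initialization. As a stopgap you could try raising it
(this only buys time, it does not fix the PGs), e.g.:

  ceph config set client.rgw rgw_init_timeout 1200

The real fix is getting the PGs active again.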
Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
Hi Eugen,
Yes, I have inactive PGs when the OSDs go down. Then I started the OSDs
manually, but the RGWs fail to start.
Only upgrading to a newer version fixes the issue, and we have faced this
issue two times.
I don't know why it is happening. But maybe it's because the RGWs are
running on separate machines? Could this cause the issue?
On Sat, Sep 10, 2022 at 11:27 PM Eugen Block <eblock@xxxxxx> wrote:
You didn’t respond to the other questions. If you want people to be
able to help, you need to provide more information. If your OSDs fail,
do you have inactive PGs? Or do you have full OSDs, which would prevent
the RGWs from starting? I’m assuming that if you fix your OSDs the RGWs
would start working again. But then again, we still don’t know
anything about the current situation.
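To rule out the full-OSD scenario, something like this would show it (a
sketch, assuming an admin keyring is available on the node):

  ceph df                             # overall and per-pool usage
  ceph osd df                         # per-OSD utilization vs. full ratio
  ceph health detail | grep -i full   # any nearfull/full warnings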
Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> Hi Eugen,
>
> Below is the log output,
>
> 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework: beast
> 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework conf key: port, val: 80
> 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 radosgw_Main not setting numa affinity
> 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 D3N datacache enabled: 0
> 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization timeout, failed to initialize
> 2022-09-07T12:08:53.395+0000 7f69017095c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
> 2022-09-07T12:08:53.395+0000 7f69017095c0 0 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process radosgw, pid 7
> 2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework: beast
> 2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework conf key: port, val: 80
> 2022-09-07T12:08:53.395+0000 7f69017095c0 1 radosgw_Main not setting numa affinity
> 2022-09-07T12:08:53.395+0000 7f69017095c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> 2022-09-07T12:08:53.395+0000 7f69017095c0 1 D3N datacache enabled: 0
> 2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization timeout, failed to initialize
>
> I installed the cluster in Quincy.
>
>
> On Sat, Sep 10, 2022 at 4:02 PM Eugen Block <eblock@xxxxxx> wrote:
>
>> What troubleshooting have you tried? You don’t provide any log output
>> or information about the cluster setup, for example the ceph osd tree
>> or ceph status. Are the failing OSDs random or do they all belong to
>> the same pool? Any log output from the failing OSDs and the RGWs might
>> help, otherwise it’s just wild guessing. Is the cluster a new
>> installation with cephadm or an older cluster upgraded to Quincy?
>>
>> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>>
>> > Hi all,
>> >
>> > I have one critical issue in my prod cluster. When the customer's data
>> > comes in at around 600 MiB, between *8 and 20 of my 238* OSDs go down.
>> > Then I bring my OSDs up manually. After a few minutes, all my RGWs
>> > crash.
>> >
>> > We did some troubleshooting but nothing worked. Upgrading Ceph from
>> > 17.2.0 to 17.2.1 resolved it. We have faced the issue two times, and
>> > both times we upgraded Ceph.
>> >
>> > *Node schema:*
>> >
>> > *Node 1 to Node 5 --> mon, mgr and OSDs*
>> > *Node 6 to Node 15 --> only OSDs*
>> > *Node 16 to Node 20 --> only RGWs*
>> >
>> > Kindly check this issue and let me know the correct troubleshooting
>> > method.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx