Re: External RGW always down

Hi Eugen,

Kindly tell me the command to enable debug logging for RGW.

Also, after all the inactive PG problems were resolved, the RGW still won't
start. Ceph health is now OK, and the PG status shows no PGs in the working,
warning, or unknown states.
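
(For reference, a minimal sketch of raising RGW log verbosity; the
client.rgw section name is an assumption and may need to match how your
RGW daemons register, e.g. client.rgw.<daemon-name>:

    ceph config set client.rgw debug_rgw 20   # verbose RGW logging
    ceph config set client.rgw debug_ms 1     # log RADOS messenger traffic too

Reset both options to their defaults once the failing startup has been
captured.)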

On Wed, Sep 28, 2022 at 2:37 PM Eugen Block <eblock@xxxxxx> wrote:

> As I already said, it's possible that your inactive PGs prevent the
> RGWs from starting. You can turn on debug logs for the RGWs, maybe
> they reveal more.
>
> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>
> > Hi Eugen,
> >
> > The OSDs fail because of RAM/CPU overload or whatever it is. After an
> > OSD fails, it starts again; that's not the problem.
> >
> > I need to know why the RGW fails when the OSDs go down.
> >
> > The rgw log output below,
> >
> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  0 framework: beast
> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  0 framework conf key: port,
> val:
> > 80
> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 radosgw_Main not setting
> numa
> > affinity
> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 rgw_d3n:
> > rgw_d3n_l1_local_datacache_enabled=0
> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 D3N datacache enabled: 0
> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0  1 rgw main: int
> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> > const RGWCacheNotifyInfo&, optional_yi>
> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0  1 rgw main: int
> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> > const RGWCacheNotifyInfo&, optional_yi>
> > 2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization timeout,
> failed
> > to initialize
> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 deferred set uid:gid to
> > 167:167 (ceph:ceph)
> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 ceph version 17.2.0
> > (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process
> > radosgw, pid 7
> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 framework: beast
> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 framework conf key: port,
> val:
> > 80
> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 radosgw_Main not setting
> numa
> > affinity
> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 rgw_d3n:
> > rgw_d3n_l1_local_datacache_enabled=0
> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 D3N datacache enabled: 0
> > 2022-09-07T12:09:03.747+0000 7f69017095c0  1 rgw main: int
> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> > const RGWCacheNotifyInfo&, optional_yi>
> > 2022-09-07T12:09:03.747+0000 7f69017095c0  1 rgw main: int
> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> > const RGWCacheNotifyInfo&, optional_yi>
> > 2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization timeout,
> failed
> > to initialize
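
(Side note: the "Initialization timeout" above fires exactly 300 seconds
after startup, which matches the default rgw_init_timeout of 300 s; the RGW
gives up if it cannot complete its startup I/O against the cluster, e.g.
while PGs are inactive, within that window. The effective value can be
checked with:

    ceph config get client.rgw rgw_init_timeout
)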
> >
> >
> >
> > On Tue, Sep 27, 2022 at 6:38 PM Eugen Block <eblock@xxxxxx> wrote:
> >
> >> > If I set 'ceph osd set nodown', what will happen to the cluster? For
> >> > example, if the migration is running and I enable this flag, will it
> >> > cause any issue while migrating the data?
> >>
> >> Well, since we don't really know what is going on there it's hard to
> >> tell. But that flag basically prevents the MONs from marking the OSDs
> >> down (wrongly). But to verify we would need more information why the
> >> OSDs fail or who stops them. Is it a container resource limit? Not
> >> enough CPU/RAM or whatever? Do you see anything in dmesg indicating an
> >> oom killer? If OSDs go down it's logged so there should be something
> >> in the logs.
> >> You don't respond to all questions so it's really hard to assist here,
> >> to be honest.
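
(For reference, a rough sketch of the checks and the flag discussed above;
these are standard commands, shown only as an illustration:

    dmesg -T | grep -iE 'out of memory|oom|killed process'   # OOM kills on the OSD hosts?
    ceph osd set nodown          # MONs stop marking OSDs down during the migration
    ceph osd dump | grep flags   # confirm the flag is active
    ceph osd unset nodown        # clear it again afterwards
)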
> >>
> >> > Can you suggest a good recovery option for an erasure-coded pool?
> >> > Because k is the data-chunk count (11) and m the parity count (4), I
> >> > thought that with 15 hosts, 3 hosts could be down while we still
> >> > migrate the data.
> >>
> >> Your assumption is correct, you should be able to sustain the failure
> >> of three hosts without client impact, but if multiple OSDs across more
> >> hosts fail (holding PGs of the same pool(s)) you would have inactive
> >> PGs as you already reported.
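
(To spell out the arithmetic: with k=11, m=4 and a host failure domain, each
object is stored as 15 chunks, one per host; data survives the loss of up to
m=4 chunks, but PGs only stay active while at least min_size chunks are
available, which for EC pools defaults to k+1 = 12, i.e. three hosts down.
A sketch of how to check this; the profile and pool names are placeholders:

    ceph osd erasure-code-profile ls
    ceph osd erasure-code-profile get <ec-profile-name>
    ceph osd pool get <rgw-data-pool> min_size
)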
> >>
> >> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >>
> >> > Hi Eugen,
> >> >
> >> > Thanks for your reply.
> >> >
> >> > Can you suggest a good recovery option for an erasure-coded pool?
> >> > Because k is the data-chunk count (11) and m the parity count (4), I
> >> > thought that with 15 hosts, 3 hosts could be down while we still
> >> > migrate the data.
> >> >
> >> > If I set 'ceph osd set nodown', what will happen to the cluster? For
> >> > example, if the migration is running and I enable this flag, will it
> >> > cause any issue while migrating the data?
> >> >
> >> > I didn't see any mon and mgr logs.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, Sep 27, 2022 at 3:07 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >
> >> >> > No pg recovery starts automatically when the osd starts.
> >> >>
> >> >> So you mean that you still have inactive PGs although your OSDs are
> >> >> all up? In that case try to 'ceph pg repeer <PG_ID>' to activate the
> >> >> PGs, maybe your RGWs will start then.
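
(As an illustration only; the PG ID is a placeholder to be taken from the
health output, not a real PG from this cluster:

    ceph health detail       # lists the inactive/unknown PGs by ID
    ceph pg repeer <pg-id>   # e.g. ceph pg repeer 11.2f
)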
> >> >>
> >> >> > I'm using an erasure-coded pool for RGW. In that profile we have k=11,
> >> >> > m=4, a total of 15 hosts, and the CRUSH failure domain is host.
> >> >>
> >> >> That means if one host goes down you can't recover until the node is
> >> >> back, you should have at least one or two more nodes so you have at
> >> >> least some recovery options.
> >> >>
> >> >> > When I migrate data at a high rate (2 Gbps), the OSDs go down. Some
> >> >> > OSDs restart automatically, but some of them we need to start
> >> >> > manually.
> >> >>
> >> >> Did you check the MON and/or MGR logs? Do the MONs mark the OSDs down
> >> >> after 10 minutes (or was it 15?)? That sounds a bit like flapping
> >> >> OSDs, you might want to check the mailing list archives for that,
> >> >> setting 'ceph osd set nodown' might help during the migration. But
> are
> >> >> the OSDs fully saturated ('iostat -xmt /dev/sd* 1')? If updating
> helps
> >> >> just stay on that version and maybe report a tracker issue with your
> >> >> findings.
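
(A few ways to pull those logs on a cephadm-managed cluster, as a sketch;
the daemon names are placeholders and should match 'ceph orch ps' output:

    ceph log last 100                    # recent cluster log, shows who marked OSDs down
    cephadm logs --name osd.12           # journal of one failing OSD, run on its host
    cephadm logs --name mon.<hostname>   # MON log, run on the MON host
)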
> >> >>
> >> >>
> >> >> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >>
> >> >> > Hi Eugen,
> >> >> >
> >> >> > Yes the osds stay online when i start them manually.
> >> >> >
> >> >> > No pg recovery starts automatically when the osd starts.
> >> >> >
> >> >> > I'm using an erasure-coded pool for RGW. In that profile we have k=11,
> >> >> > m=4, a total of 15 hosts, and the CRUSH failure domain is host.
> >> >> >
> >> >> > I didn't find any error logs in the osds.
> >> >> >
> >> >> > The first time, I upgraded Ceph from Pacific to Quincy.
> >> >> >
> >> >> > The second time, I upgraded Ceph from Quincy 17.2.1 to 17.2.2.
> >> >> >
> >> >> > I have a doubt. We are migrating data from Scality to Ceph. When the
> >> >> > migration runs at its normal speed of 800 to 900 Mbps, it does not
> >> >> > cause the problem.
> >> >> >
> >> >> > When I migrate data at a high rate (2 Gbps), the OSDs go down. Some
> >> >> > OSDs restart automatically, but some of them we need to start
> >> >> > manually.
> >> >> >
> >> >> >
> >> >> > On Mon, Sep 26, 2022 at 11:06 PM Eugen Block <eblock@xxxxxx>
> wrote:
> >> >> >
> >> >> >> > Yes, I have inactive PGs when the OSDs go down. Then I started the
> >> >> >> > OSDs manually, but the RGW fails to start.
> >> >> >>
> >> >> >> But the OSDs stay online if you start them manually? Do the
> inactive
> >> >> >> PGs recover when you start them manually? By the way, you should
> >> check
> >> >> >> your crush rules, depending on how many OSDs fail you may have
> room
> >> >> >> for improvement there. And why do the OSDs fail with automatic
> >> >> >> restart, what's in the logs?
> >> >> >>
> >> >> >> > Upgrading to a newer version is the only thing that has resolved the
> >> >> >> > issue, and we have faced this issue two times.
> >> >> >>
> >> >> >> What versions are you using (ceph versions)?
> >> >> >>
> >> >> >> > I don't know why it is happening, but maybe it's because the RGWs are
> >> >> >> > running on separate machines. Could that cause the issue?
> >> >> >>
> >> >> >> I don't know how that should matter.
> >> >> >>
> >> >> >> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >> >>
> >> >> >> > Hi Eugen,
> >> >> >> >
> >> >> >> > Yes, I have inactive PGs when the OSDs go down. Then I started the
> >> >> >> > OSDs manually, but the RGW fails to start.
> >> >> >> >
> >> >> >> > Upgrading to a newer version is the only thing that has resolved the
> >> >> >> > issue, and we have faced this issue two times.
> >> >> >> >
> >> >> >> > I don't know why it is happening, but maybe it's because the RGWs are
> >> >> >> > running on separate machines. Could that cause the issue?
> >> >> >> >
> >> >> >> > On Sat, Sep 10, 2022 at 11:27 PM Eugen Block <eblock@xxxxxx>
> >> wrote:
> >> >> >> >
> >> >> >> >> You didn’t respond to the other questions. If you want people
> to
> >> be
> >> >> >> >> able to help you need to provide more information. If your OSDs
> >> fail
> >> >> >> >> do you have inactive PGs? Or do you have full OSDs which would
> >> >> >> >> prevent RGW from starting? I’m assuming that if you fix your OSDs
> >> >> >> >> the RGWs would start working again. But then again, we still don’t know
> >> >> >> >> anything about the current situation.
> >> >> >> >>
> >> >> >> >> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >> >> >>
> >> >> >> >> > Hi Eugen,
> >> >> >> >> >
> >> >> >> >> > Below is the log output,
> >> >> >> >> >
> >> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  0 framework: beast
> >> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  0 framework conf
> key:
> >> >> port,
> >> >> >> >> val:
> >> >> >> >> > 80
> >> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 radosgw_Main not
> >> >> setting
> >> >> >> >> numa
> >> >> >> >> > affinity
> >> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 rgw_d3n:
> >> >> >> >> > rgw_d3n_l1_local_datacache_enabled=0
> >> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 D3N datacache
> >> >> enabled: 0
> >> >> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0  1 rgw main: int
> >> >> >> >> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*,
> >> >> >> RGWSI_RADOS::Obj&,
> >> >> >> >> > const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0  1 rgw main: int
> >> >> >> >> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*,
> >> >> >> RGWSI_RADOS::Obj&,
> >> >> >> >> > const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> >> > 2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization
> >> >> timeout,
> >> >> >> >> failed
> >> >> >> >> > to initialize
> >> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 deferred set
> >> uid:gid
> >> >> to
> >> >> >> >> > 167:167 (ceph:ceph)
> >> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 ceph version
> 17.2.0
> >> >> >> >> > (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable),
> >> process
> >> >> >> >> > radosgw, pid 7
> >> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 framework: beast
> >> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 framework conf
> key:
> >> >> port,
> >> >> >> >> val:
> >> >> >> >> > 80
> >> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 radosgw_Main not
> >> >> setting
> >> >> >> >> numa
> >> >> >> >> > affinity
> >> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 rgw_d3n:
> >> >> >> >> > rgw_d3n_l1_local_datacache_enabled=0
> >> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 D3N datacache
> >> >> enabled: 0
> >> >> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0  1 rgw main: int
> >> >> >> >> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*,
> >> >> >> RGWSI_RADOS::Obj&,
> >> >> >> >> > const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0  1 rgw main: int
> >> >> >> >> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*,
> >> >> >> RGWSI_RADOS::Obj&,
> >> >> >> >> > const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> >> > 2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization
> >> >> timeout,
> >> >> >> >> failed
> >> >> >> >> > to initialize
> >> >> >> >> >
> >> >> >> >> > I installed the cluster in quincy.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > On Sat, Sep 10, 2022 at 4:02 PM Eugen Block <eblock@xxxxxx>
> >> wrote:
> >> >> >> >> >
> >> >> >> >> >> What troubleshooting have you tried? You don’t provide any
> log
> >> >> output
> >> >> >> >> >> or information about the cluster setup, for example the ceph
> >> osd
> >> >> >> tree,
> >> >> >> >> >> ceph status, are the failing OSDs random or do they all
> belong
> >> to
> >> >> the
> >> >> >> >> >> same pool? Any log output from failing OSDs and the RGWs
> might
> >> >> help,
> >> >> >> >> >> otherwise it’s just wild guessing. Is the cluster a new
> >> >> installation
> >> >> >> >> >> with cephadm or an older cluster upgraded to Quincy?
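
(For reference, the information requested above can be gathered with the
usual commands, e.g.:

    ceph -s
    ceph health detail
    ceph osd tree
    ceph osd df
    ceph versions
)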
> >> >> >> >> >>
> >> >> >> >> >> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >> >> >> >>
> >> >> >> >> >> > Hi all,
> >> >> >> >> >> >
> >> >> >> >> >> > I have one critical issue in my prod cluster. It happens when the
> >> >> >> >> >> > customer's incoming data reaches around 600 MiB.
> >> >> >> >> >> >
> >> >> >> >> >> > My OSDs go down, *8 to 20 out of 238*. Then I bring the OSDs up
> >> >> >> >> >> > manually. After a few minutes, all my RGWs crash.
> >> >> >> >> >> >
> >> >> >> >> >> > We did some troubleshooting but nothing worked. When we upgraded
> >> >> >> >> >> > Ceph from 17.2.0 to 17.2.1 it was resolved. We have faced the issue
> >> >> >> >> >> > two times, and both times we resolved it by upgrading Ceph.
> >> >> >> >> >> >
> >> >> >> >> >> > *Node schema :*
> >> >> >> >> >> >
> >> >> >> >> >> > *Node 1 to node 5 --> mon,mgr and osds*
> >> >> >> >> >> > *Node 6 to Node15 --> only osds*
> >> >> >> >> >> > *Node 16 to Node 20 --> only rgws.*
> >> >> >> >> >> >
> >> >> >> >> >> > Kindly check this issue and let me know the correct
> >> >> >> >> >> > troubleshooting method.
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >>
> >>
> >>
> >>
>
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



