Hi Eugen,

The OSDs fail because of RAM/CPU overload, or something along those lines. After an OSD fails it starts again, so that is not the problem. What I need to know is why the RGWs fail when the OSDs go down. The RGW log output is below:

2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework: beast
2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework conf key: port, val: 80
2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 radosgw_Main not setting numa affinity
2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 D3N datacache enabled: 0
2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization timeout, failed to initialize
2022-09-07T12:08:53.395+0000 7f69017095c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
2022-09-07T12:08:53.395+0000 7f69017095c0 0 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process radosgw, pid 7
2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework: beast
2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework conf key: port, val: 80
2022-09-07T12:08:53.395+0000 7f69017095c0 1 radosgw_Main not setting numa affinity
2022-09-07T12:08:53.395+0000 7f69017095c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
2022-09-07T12:08:53.395+0000 7f69017095c0 1 D3N datacache enabled: 0
2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization timeout, failed to initialize
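
For anyone following the thread, a rough way to check whether the kernel OOM killer or a container resource limit is behind the OSD restarts might look like this (a sketch only; osd.12 is a placeholder id and this assumes a cephadm/containerized deployment):

# kernel log entries from the OOM killer
dmesg -T | grep -iE 'out of memory|oom'

# daemon crashes recorded by Ceph itself
ceph crash ls

# recent log output of one of the affected OSDs
cephadm logs --name osd.12
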
On Tue, Sep 27, 2022 at 6:38 PM Eugen Block <eblock@xxxxxx> wrote:

> > If I set ceph osd set nodown what will happen to the cluster? Example, the
> > migration goes on and I enable this parameter. Will it cause any issue
> > while migrating the data?
>
> Well, since we don't really know what is going on there it's hard to
> tell. But that flag basically prevents the MONs from marking the OSDs
> down (wrongly). But to verify we would need more information why the
> OSDs fail or who stops them. Is it a container resource limit? Not
> enough CPU/RAM or whatever? Do you see anything in dmesg indicating an
> oom killer? If OSDs go down it's logged so there should be something
> in the logs.
> You don't respond to all questions so it's really hard to assist here,
> to be honest.
>
> > Can you suggest a good recovery option in an erasure coded pool? Because the k
> > means the copy value 11 and m the parity value 4, I thought that means in
> > 15 hosts 3 hosts may go down and also we migrate the data.
>
> Your assumption is correct, you should be able to sustain the failure
> of three hosts without client impact, but if multiple OSDs across more
> hosts fail (holding PGs of the same pool(s)) you would have inactive
> PGs as you already reported.
>
> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>
> > Hi Eugen,
> >
> > Thanks for your reply.
> >
> > Can you suggest a good recovery option in an erasure coded pool? Because the k
> > means the copy value 11 and m the parity value 4, I thought that means in
> > 15 hosts 3 hosts may go down and also we migrate the data.
> >
> > If I set ceph osd set nodown what will happen to the cluster? Example, the
> > migration goes on and I enable this parameter. Will it cause any issue
> > while migrating the data?
> >
> > I didn't see any mon and mgr logs.
> >
> > On Tue, Sep 27, 2022 at 3:07 PM Eugen Block <eblock@xxxxxx> wrote:
> >
> >> > No pg recovery starts automatically when the osd starts.
> >>
> >> So you mean that you still have inactive PGs although your OSDs are
> >> all up? In that case try to 'ceph pg repeer <PG_ID>' to activate the
> >> PGs, maybe your RGWs will start then.
> >>
> >> > I'm using an erasure coded pool for rgw. In that rule we have k=11 m=4,
> >> > total 15 hosts, and the crush rule is host.
> >>
> >> That means if one host goes down you can't recover until the node is
> >> back, you should have at least one or two more nodes so you have at
> >> least some recovery options.
> >>
> >> > When I migrate data at high speed, around 2 Gbps, the osds automatically
> >> > go down. Some osds are automatically restarted; some of the osds we need
> >> > to start manually.
> >>
> >> Did you check the MON and/or MGR logs? Do the MONs mark the OSDs down
> >> after 10 minutes (or was it 15?)? That sounds a bit like flapping
> >> OSDs, you might want to check the mailing list archives for that,
> >> setting 'ceph osd set nodown' might help during the migration. But are
> >> the OSDs fully saturated ('iostat -xmt /dev/sd* 1')? If updating helps
> >> just stay on that version and maybe report a tracker issue with your
> >> findings.
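
(For readers following along, the commands Eugen mentions above would look roughly like this; the PG id and the device names are placeholders, not values from this cluster:)

# list PGs stuck inactive, then re-peer one of them
ceph pg dump_stuck inactive
ceph pg repeer 7.1a

# keep the MONs from marking OSDs down during the migration, and clear the flag afterwards
ceph osd set nodown
ceph osd unset nodown

# check whether the OSD devices are saturated while the migration runs
iostat -xmt /dev/sd* 1
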
> >>
> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >>
> >> > Hi Eugen,
> >> >
> >> > Yes, the osds stay online when I start them manually.
> >> >
> >> > No pg recovery starts automatically when the osd starts.
> >> >
> >> > I'm using an erasure coded pool for rgw. In that rule we have k=11 m=4,
> >> > total 15 hosts, and the crush rule is host.
> >> >
> >> > I didn't find any error logs in the osds.
> >> >
> >> > The first time, I upgraded the ceph version from pacific to quincy.
> >> >
> >> > The second time, I upgraded the ceph version from quincy 17.2.1 to 17.2.2.
> >> >
> >> > I have a doubt: we are migrating data from Scality to Ceph. Normally we
> >> > migrate the data at 800 to 900 Mbps and it does not cause the problem.
> >> >
> >> > When I migrate data at high speed, around 2 Gbps, the osds automatically
> >> > go down. Some osds are automatically restarted; some of the osds we need
> >> > to start manually.
> >> >
> >> > On Mon, Sep 26, 2022 at 11:06 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >
> >> >> > Yes, I have inactive pgs when the osds go down. Then I started the osds
> >> >> > manually, but the rgw fails to start.
> >> >>
> >> >> But the OSDs stay online if you start them manually? Do the inactive
> >> >> PGs recover when you start them manually? By the way, you should check
> >> >> your crush rules, depending on how many OSDs fail you may have room
> >> >> for improvement there. And why do the OSDs fail with automatic
> >> >> restart, what's in the logs?
> >> >>
> >> >> > Upgrading to a newer version is the only thing that fixed the issue, and
> >> >> > we faced this issue two times.
> >> >>
> >> >> What versions are you using (ceph versions)?
> >> >>
> >> >> > I don't know why it is happening. But maybe it is because the rgws are
> >> >> > running on separate machines. Could that cause the issue?
> >> >>
> >> >> I don't know how that should
> >> >>
> >> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >>
> >> >> > Hi Eugen,
> >> >> >
> >> >> > Yes, I have inactive pgs when the osds go down. Then I started the osds
> >> >> > manually, but the rgw fails to start.
> >> >> >
> >> >> > Upgrading to a newer version is the only thing that fixed the issue, and
> >> >> > we faced this issue two times.
> >> >> >
> >> >> > I don't know why it is happening. But maybe it is because the rgws are
> >> >> > running on separate machines. Could that cause the issue?
> >> >> >
> >> >> > On Sat, Sep 10, 2022 at 11:27 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >> >
> >> >> >> You didn't respond to the other questions. If you want people to be
> >> >> >> able to help you need to provide more information. If your OSDs fail
> >> >> >> do you have inactive PGs? Or do you have full OSDs which would prevent
> >> >> >> the RGWs from starting? I'm assuming that if you fix your OSDs the RGWs
> >> >> >> would start working again. But then again, we still don't know
> >> >> >> anything about the current situation.
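
(A quick way to answer the two questions above, i.e. whether there are inactive PGs and whether any OSDs are full, would be something like this:)

# overall health, including inactive PG and full/nearfull warnings
ceph health detail

# PGs that are stuck in an inactive state
ceph pg dump_stuck inactive

# per-OSD utilisation, to spot full or nearly full OSDs
ceph osd df
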
> >> >> >>
> >> >> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >> >>
> >> >> >> > Hi Eugen,
> >> >> >> >
> >> >> >> > Below is the log output:
> >> >> >> >
> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework: beast
> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework conf key: port, val: 80
> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 radosgw_Main not setting numa affinity
> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 D3N datacache enabled: 0
> >> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> > 2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization timeout, failed to initialize
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process radosgw, pid 7
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework: beast
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework conf key: port, val: 80
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 1 radosgw_Main not setting numa affinity
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 1 D3N datacache enabled: 0
> >> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> > 2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization timeout, failed to initialize
> >> >> >> >
> >> >> >> > I installed the cluster in quincy.
> >> >> >> >
> >> >> >> > On Sat, Sep 10, 2022 at 4:02 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >> >> >
> >> >> >> >> What troubleshooting have you tried? You don't provide any log output
> >> >> >> >> or information about the cluster setup, for example the ceph osd tree,
> >> >> >> >> ceph status, are the failing OSDs random or do they all belong to the
> >> >> >> >> same pool? Any log output from failing OSDs and the RGWs might help,
> >> >> >> >> otherwise it's just wild guessing. Is the cluster a new installation
> >> >> >> >> with cephadm or an older cluster upgraded to Quincy?
> >> >> >> >>
> >> >> >> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >> >> >>
> >> >> >> >> > Hi all,
> >> >> >> >> >
> >> >> >> >> > I have one critical issue in my prod cluster. When the customer's data
> >> >> >> >> > comes from 600 MiB.
> >> >> >> >> >
> >> >> >> >> > My OSDs go down, *8 to 20 from 238*. Then I manually bring up my osds.
> >> >> >> >> > After a few minutes, all my rgws crash.
> >> >> >> >> >
> >> >> >> >> > We did some troubleshooting but nothing worked. When we upgraded ceph
> >> >> >> >> > from 17.2.0 to 17.2.1 it was resolved. We faced the issue two times,
> >> >> >> >> > but both times we upgraded the ceph.
> >> >> >> >> >
> >> >> >> >> > *Node schema:*
> >> >> >> >> >
> >> >> >> >> > *Node 1 to Node 5 --> mon, mgr and osds*
> >> >> >> >> > *Node 6 to Node 15 --> only osds*
> >> >> >> >> > *Node 16 to Node 20 --> only rgws.*
> >> >> >> >> >
> >> >> >> >> > Kindly check this issue and let me know the correct troubleshooting method.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx