Hi Eugen,

Thanks for your reply. Can you suggest a good recovery option for the
erasure-coded pool? My understanding is that k=11 is the number of data
chunks and m=4 the number of coding (parity) chunks, so I thought that
with 15 hosts up to 3 hosts could be down while we still migrate the data.

Also, if I set 'ceph osd set nodown', what will happen to the cluster?
For example, if the migration is running and I enable this flag, will it
cause any issue while the data is being migrated?

I didn't see anything in the MON and MGR logs.
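
For reference, this is roughly what I plan to run before and during the
next migration window; the profile, pool and PG names are placeholders
for ours, so please correct me if any of it is wrong:

  # Confirm the EC profile and how many chunk losses the pool tolerates
  ceph osd erasure-code-profile get <ec-profile>
  ceph osd pool get <rgw-data-pool> min_size   # if this is the default of k+1, it would be 12 here

  # List PGs that are stuck inactive and try to re-activate them
  ceph pg dump_stuck inactive
  ceph pg repeer <PG_ID>

  # Keep the MONs from marking flapping OSDs down during the migration,
  # and remove the flag again as soon as the migration has finished
  ceph osd set nodown
  ceph osd unset nodown

  # Check whether the OSD disks are saturated while the data is flowing in
  iostat -xmt /dev/sd* 1

As far as I understand, nodown only stops the MONs from marking OSDs
down; it does not fix the underlying flapping, and it would also hide an
OSD that has really died, so I would only set it temporarily.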

On Tue, Sep 27, 2022 at 3:07 PM Eugen Block <eblock@xxxxxx> wrote:

> > No pg recovery starts automatically when the osd starts.
>
> So you mean that you still have inactive PGs although your OSDs are
> all up? In that case try to 'ceph pg repeer <PG_ID>' to activate the
> PGs, maybe your RGWs will start then.
>
> > I'm using an erasure coded pool for rgw. In that rule we have k=11 m=4,
> > total 15 hosts, and the crush failure domain is host.
>
> That means if one host goes down you can't recover until the node is
> back, you should have at least one or two more nodes so you have at
> least some recovery options.
>
> > When I migrate data at a high rate (2 Gbps) the OSDs go down
> > automatically. Some OSDs come back up on their own, but some we need
> > to start manually.
>
> Did you check the MON and/or MGR logs? Do the MONs mark the OSDs down
> after 10 minutes (or was it 15)? That sounds a bit like flapping OSDs,
> you might want to check the mailing list archives for that; setting
> 'ceph osd set nodown' might help during the migration. But are the
> OSDs fully saturated ('iostat -xmt /dev/sd* 1')? If updating helps,
> just stay on that version and maybe report a tracker issue with your
> findings.
>
> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>
> > Hi Eugen,
> >
> > Yes, the OSDs stay online when I start them manually.
> >
> > No PG recovery starts automatically when the OSDs start.
> >
> > I'm using an erasure coded pool for rgw. In that rule we have k=11 m=4,
> > total 15 hosts, and the crush failure domain is host.
> >
> > I didn't find any error logs in the OSDs.
> >
> > The first time, I upgraded the Ceph version from Pacific to Quincy.
> >
> > The second time, I upgraded the Ceph version from Quincy 17.2.1 to 17.2.2.
> >
> > One doubt: we are migrating data from Scality to Ceph. Normally we
> > migrate at about 800 to 900 Mbps and that does not cause the problem.
> >
> > When I migrate data at a high rate (2 Gbps) the OSDs go down
> > automatically. Some OSDs come back up on their own, but some we need
> > to start manually.
> >
> > On Mon, Sep 26, 2022 at 11:06 PM Eugen Block <eblock@xxxxxx> wrote:
> >
> >> > Yes, I have inactive PGs when the OSDs go down. Then I started the
> >> > OSDs manually. But the rgw fails to start.
> >>
> >> But the OSDs stay online if you start them manually? Do the inactive
> >> PGs recover when you start them manually? By the way, you should check
> >> your crush rules, depending on how many OSDs fail you may have room
> >> for improvement there. And why do the OSDs fail with automatic
> >> restart, what's in the logs?
> >>
> >> > Only upgrading to a newer version fixes the issue, and we have faced
> >> > this issue two times.
> >>
> >> What versions are you using (ceph versions)?
> >>
> >> > I don't know why it is happening. But maybe it is because the rgws
> >> > are running on separate machines. Could that cause the issue?
> >>
> >> I don't know how that should
> >>
> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >>
> >> > Hi Eugen,
> >> >
> >> > Yes, I have inactive PGs when the OSDs go down. Then I started the
> >> > OSDs manually. But the rgw fails to start.
> >> >
> >> > Only upgrading to a newer version fixes the issue, and we have faced
> >> > this issue two times.
> >> >
> >> > I don't know why it is happening. But maybe it is because the rgws
> >> > are running on separate machines. Could that cause the issue?
> >> >
> >> > On Sat, Sep 10, 2022 at 11:27 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >
> >> >> You didn't respond to the other questions. If you want people to be
> >> >> able to help, you need to provide more information. If your OSDs
> >> >> fail, do you have inactive PGs? Or do you have full OSDs which would
> >> >> prevent RGW from starting? I'm assuming that if you fix your OSDs
> >> >> the RGWs would start working again. But then again, we still don't
> >> >> know anything about the current situation.
> >> >>
> >> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >>
> >> >> > Hi Eugen,
> >> >> >
> >> >> > Below is the log output:
> >> >> >
> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  0 framework: beast
> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  0 framework conf key: port, val: 80
> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 radosgw_Main not setting numa affinity
> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 D3N datacache enabled: 0
> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> > 2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization timeout, failed to initialize
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 deferred set uid:gid to 167:167 (ceph:ceph)
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process radosgw, pid 7
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 framework: beast
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 framework conf key: port, val: 80
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 radosgw_Main not setting numa affinity
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 D3N datacache enabled: 0
> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> > 2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization timeout, failed to initialize
> >> >> >
> >> >> > I installed the cluster on Quincy.
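
One thing I notice in the RGW log above: the daemon starts at 12:03:42
and logs "Initialization timeout, failed to initialize" at 12:08:42,
exactly 300 seconds later, which looks like the default rgw_init_timeout.
So I assume the rgws are not crashing on their own but give up because
they cannot reach their pools while PGs are inactive. If that is right,
checking the RGW pools for non-active PGs (assuming the default pool
names) should confirm it, e.g.:

  ceph health detail | grep -i -e inactive -e down
  ceph pg ls-by-pool .rgw.root | grep -v 'active+clean'

and the rgws should come back on their own once all PGs are active again.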
> >> >> > On Sat, Sep 10, 2022 at 4:02 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >> >
> >> >> >> What troubleshooting have you tried? You don't provide any log
> >> >> >> output or information about the cluster setup, for example the
> >> >> >> ceph osd tree or ceph status; are the failing OSDs random or do
> >> >> >> they all belong to the same pool? Any log output from failing
> >> >> >> OSDs and the RGWs might help, otherwise it's just wild guessing.
> >> >> >> Is the cluster a new installation with cephadm or an older
> >> >> >> cluster upgraded to Quincy?
> >> >> >>
> >> >> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >> >>
> >> >> >> > Hi all,
> >> >> >> >
> >> >> >> > I have one critical issue in my prod cluster when the customer's
> >> >> >> > data comes in at 600 MiB.
> >> >> >> >
> >> >> >> > My OSDs go down, *8 to 20 out of 238*. Then I bring the OSDs up
> >> >> >> > manually. After a few minutes, all my rgws crash.
> >> >> >> >
> >> >> >> > We did some troubleshooting but nothing worked. When we upgraded
> >> >> >> > Ceph from 17.2.0 to 17.2.1 it was resolved. We have faced the
> >> >> >> > issue two times, and both times we upgraded Ceph.
> >> >> >> >
> >> >> >> > *Node schema:*
> >> >> >> >
> >> >> >> > *Node 1 to Node 5   --> mon, mgr and osds*
> >> >> >> > *Node 6 to Node 15  --> only osds*
> >> >> >> > *Node 16 to Node 20 --> only rgws*
> >> >> >> >
> >> >> >> > Kindly check this issue and let me know the correct
> >> >> >> > troubleshooting method.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx