As I already said, it's possible that your inactive PGs prevent the
RGWs from starting. You can turn on debug logs for the RGWs; maybe
they reveal more.
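
For example, something like this should raise the RGW debug level
(assuming the config database is used; the 'client.rgw' scope may need
to be narrowed to your specific RGW instance names):

  ceph config set client.rgw debug_rgw 20
  ceph config set client.rgw debug_ms 1
  # and to revert afterwards:
  ceph config rm client.rgw debug_rgw
  ceph config rm client.rgw debug_ms
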
Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> Hi Eugen,
>
> The OSDs fail because of RAM/CPU overload or whatever it is. After an
> OSD fails it starts again. That's not the problem.
>
> I need to know why the RGWs fail when the OSDs go down.
>
> The RGW log output is below:
>
> 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework: beast
> 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework conf key: port, val: 80
> 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 radosgw_Main not setting numa affinity
> 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 D3N datacache enabled: 0
> 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization timeout, failed to initialize
> 2022-09-07T12:08:53.395+0000 7f69017095c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
> 2022-09-07T12:08:53.395+0000 7f69017095c0 0 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process radosgw, pid 7
> 2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework: beast
> 2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework conf key: port, val: 80
> 2022-09-07T12:08:53.395+0000 7f69017095c0 1 radosgw_Main not setting numa affinity
> 2022-09-07T12:08:53.395+0000 7f69017095c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> 2022-09-07T12:08:53.395+0000 7f69017095c0 1 D3N datacache enabled: 0
> 2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization timeout, failed to initialize
>
>
>
> On Tue, Sep 27, 2022 at 6:38 PM Eugen Block <eblock@xxxxxx> wrote:
>
>> > If I set 'ceph osd set nodown', what will happen to the cluster? For
>> > example, the migration goes on and I enable this parameter. Will it
>> > cause any issue while migrating the data?
>>
>> Well, since we don't really know what is going on there it's hard to
>> tell. But that flag basically prevents the MONs from (wrongly) marking
>> the OSDs down. But to verify we would need more information on why the
>> OSDs fail or who stops them. Is it a container resource limit? Not
>> enough CPU/RAM or whatever? Do you see anything in dmesg indicating an
>> OOM killer? If OSDs go down it's logged, so there should be something
>> in the logs.
>> You don't respond to all questions, so it's really hard to assist
>> here, to be honest.
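>>
>> A couple of quick checks on an affected OSD host might answer the
>> resource question (just a sketch, paths and unit names depend on your
>> deployment):
>>
>>   dmesg -T | grep -iE 'out of memory|oom'
>>   free -m
>>   ceph orch ps | grep osd     # cephadm: daemon status per host
>>
>> And if you want to try the flag only for the migration window:
>>
>>   ceph osd set nodown
>>   # ... run the migration ...
>>   ceph osd unset nodown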
>>
>> > Can you suggest a good recovery option for the erasure-coded pool?
>> > Because k means the data value (11) and m the parity value (4), I
>> > thought that with 15 hosts, 3 hosts may go down while we still
>> > migrate the data.
>>
>> Your assumption is correct, you should be able to sustain the failure
>> of three hosts without client impact, but if multiple OSDs across more
>> hosts fail (holding PGs of the same pool(s)) you would have inactive
>> PGs as you already reported.
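>>
>> A quick sanity check, assuming the pool still uses the default
>> min_size of k+1: k=11 plus m=4 gives size 15 and min_size 12, so with
>> failure domain "host" and 15 hosts the PGs stay active with up to
>> 15 - 12 = 3 hosts down; losing chunks on a fourth host makes them
>> inactive. You can verify the values with (pool and profile names are
>> placeholders):
>>
>>   ceph osd pool get <pool-name> min_size
>>   ceph osd erasure-code-profile get <profile-name>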
>>
>> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>>
>> > Hi Eugen,
>> >
>> > Thanks for your reply.
>> >
>> > Can you suggest a good recovery option for the erasure-coded pool?
>> > Because k means the data value (11) and m the parity value (4), I
>> > thought that with 15 hosts, 3 hosts may go down while we still
>> > migrate the data.
>> >
>> > If I set 'ceph osd set nodown', what will happen to the cluster? For
>> > example, the migration goes on and I enable this parameter. Will it
>> > cause any issue while migrating the data?
>> >
>> > I didn't see any mon and mgr logs.
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Sep 27, 2022 at 3:07 PM Eugen Block <eblock@xxxxxx> wrote:
>> >
>> >> > No pg recovery starts automatically when the osd starts.
>> >>
>> >> So you mean that you still have inactive PGs although your OSDs are
>> >> all up? In that case try to 'ceph pg repeer <PG_ID>' to activate the
>> >> PGs; maybe your RGWs will start then.
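>> >>
>> >> For example, to find and repeer them (the PG ID is just a
>> >> placeholder):
>> >>
>> >>   ceph pg dump_stuck inactive
>> >>   ceph health detail | grep -i inactive
>> >>   ceph pg repeer 7.1a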
>> >>
>> >> > I'm using an erasure-coded pool for RGW. In that rule we have
>> >> > k=11, m=4, a total of 15 hosts, and the crush failure domain is
>> >> > host.
>> >>
>> >> That means if one host goes down you can't recover until the node
>> >> is back; you should have at least one or two more nodes so you have
>> >> at least some recovery options.
>> >>
>> >> > When I migrate data at a high rate, at 2 Gbps, the OSDs
>> >> > automatically go down. Some OSDs are automatically restarted;
>> >> > some of the OSDs we need to start manually.
>> >>
>> >> Did you check the MON and/or MGR logs? Do the MONs mark the OSDs
>> >> down after 10 minutes (or was it 15?)? That sounds a bit like
>> >> flapping OSDs; you might want to check the mailing list archives for
>> >> that. Setting 'ceph osd set nodown' might help during the migration.
>> >> But are the OSDs fully saturated ('iostat -xmt /dev/sd* 1')? If
>> >> updating helps, just stay on that version and maybe report a tracker
>> >> issue with your findings.
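>> >>
>> >> To see whether the MONs are the ones marking the OSDs down, and
>> >> whether the disks are the bottleneck, something like this should
>> >> help (a sketch, adjust to taste):
>> >>
>> >>   ceph log last 1000 | grep -i down   # recent cluster log entries
>> >>   ceph osd dump | grep flags          # shows whether nodown is set
>> >>   iostat -xmt 1                       # watch %util on the OSD devices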
>> >>
>> >>
>> >> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>> >>
>> >> > Hi Eugen,
>> >> >
>> >> > Yes, the OSDs stay online when I start them manually.
>> >> >
>> >> > No pg recovery starts automatically when the osd starts.
>> >> >
>> >> > I'm using an erasure-coded pool for RGW. In that rule we have
>> >> > k=11, m=4, a total of 15 hosts, and the crush failure domain is
>> >> > host.
>> >> >
>> >> > I didn't find any error logs in the osds.
>> >> >
>> >> > The first time, I upgraded the Ceph version from Pacific to Quincy.
>> >> >
>> >> > The second time, I upgraded the Ceph version from Quincy 17.2.1 to
>> >> > 17.2.2.
>> >> >
>> >> > One thing I should mention: we are migrating data from Scality to
>> >> > Ceph. Normally we migrate the data at a speed of 800 to 900 Mbps
>> >> > and it does not cause the problem.
>> >> >
>> >> > When I migrate data at a high rate, at 2 Gbps, the OSDs
>> >> > automatically go down. Some OSDs are automatically restarted;
>> >> > some of the OSDs we need to start manually.
>> >> >
>> >> >
>> >> > On Mon, Sep 26, 2022 at 11:06 PM Eugen Block <eblock@xxxxxx> wrote:
>> >> >
>> >> >> > Yes, I have inactive PGs when the OSDs go down. Then I start
>> >> >> > the OSDs manually. But the RGWs fail to start.
>> >> >>
>> >> >> But the OSDs stay online if you start them manually? Do the
>> >> >> inactive PGs recover when you start them manually? By the way,
>> >> >> you should check your crush rules; depending on how many OSDs
>> >> >> fail you may have room for improvement there. And why do the OSDs
>> >> >> fail and restart automatically? What's in the logs?
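>> >> >>
>> >> >> For the crush rule review, something like this (the rule name is
>> >> >> a placeholder):
>> >> >>
>> >> >>   ceph osd pool ls detail               # size, min_size, crush_rule
>> >> >>   ceph osd crush rule dump <rule-name>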
>> >> >>
>> >> >> > Only upgrading to a newer version fixes the issue, and we have
>> >> >> > faced this issue two times.
>> >> >>
>> >> >> What versions are you using (ceph versions)?
>> >> >>
>> >> >> > I don't know why it is happening. But maybe it's because the
>> >> >> > RGWs are running on separate machines. Could this be causing
>> >> >> > the issue?
>> >> >>
>> >> >> I don't know how that should matter.
>> >> >>
>> >> >> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>> >> >>
>> >> >> > Hi Eugen,
>> >> >> >
>> >> >> > Yes, I have inactive PGs when the OSDs go down. Then I start
>> >> >> > the OSDs manually. But the RGWs fail to start.
>> >> >> >
>> >> >> > Only upgrading to a newer version fixes the issue, and we have
>> >> >> > faced this issue two times.
>> >> >> >
>> >> >> > I don't know why it is happening. But maybe it's because the
>> >> >> > RGWs are running on separate machines. Could this be causing
>> >> >> > the issue?
>> >> >> >
>> >> >> > On Sat, Sep 10, 2022 at 11:27 PM Eugen Block <eblock@xxxxxx> wrote:
>> >> >> >
>> >> >> >> You didn't respond to the other questions. If you want people
>> >> >> >> to be able to help, you need to provide more information. If
>> >> >> >> your OSDs fail, do you have inactive PGs? Or do you have full
>> >> >> >> OSDs which would prevent the RGWs from starting? I'm assuming
>> >> >> >> that if you fix your OSDs the RGWs would start working again.
>> >> >> >> But then again, we still don't know anything about the current
>> >> >> >> situation.
>> >> >> >>
>> >> >> >> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>> >> >> >>
>> >> >> >> > Hi Eugen,
>> >> >> >> >
>> >> >> >> > Below is the log output:
>> >> >> >> >
>> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework: beast
>> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework conf key: port, val: 80
>> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 radosgw_Main not setting numa affinity
>> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
>> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 D3N datacache enabled: 0
>> >> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
>> >> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
>> >> >> >> > 2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization timeout, failed to initialize
>> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
>> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process radosgw, pid 7
>> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework: beast
>> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework conf key: port, val: 80
>> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 1 radosgw_Main not setting numa affinity
>> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
>> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 1 D3N datacache enabled: 0
>> >> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
>> >> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
>> >> >> >> > 2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization timeout, failed to initialize
>> >> >> >> >
>> >> >> >> > I installed the cluster with Quincy.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Sat, Sep 10, 2022 at 4:02 PM Eugen Block <eblock@xxxxxx> wrote:
>> >> >> >> >
>> >> >> >> >> What troubleshooting have you tried? You don't provide any
>> >> >> >> >> log output or information about the cluster setup, for
>> >> >> >> >> example the ceph osd tree or ceph status. Are the failing
>> >> >> >> >> OSDs random or do they all belong to the same pool? Any log
>> >> >> >> >> output from failing OSDs and the RGWs might help, otherwise
>> >> >> >> >> it's just wild guessing. Is the cluster a new installation
>> >> >> >> >> with cephadm or an older cluster upgraded to Quincy?
>> >> >> >> >> Quoting Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>> >> >> >> >>
>> >> >> >> >> > Hi all,
>> >> >> >> >> >
>> >> >> >> >> > I have one critical issue in my prod cluster when the
>> >> >> >> >> > customer's data comes in at 600 MiB.
>> >> >> >> >> >
>> >> >> >> >> > My OSDs go down, *8 to 20 out of 238*. Then I manually
>> >> >> >> >> > bring up my OSDs. After a few minutes, all my RGWs crash.
>> >> >> >> >> >
>> >> >> >> >> > We did some troubleshooting but nothing worked. When we
>> >> >> >> >> > upgraded Ceph from 17.2.0 to 17.2.1 it was resolved. We
>> >> >> >> >> > have faced the issue two times, and both times we fixed it
>> >> >> >> >> > by upgrading Ceph.
>> >> >> >> >> >
>> >> >> >> >> > *Node schema:*
>> >> >> >> >> >
>> >> >> >> >> > *Node 1 to Node 5 --> mon, mgr and osds*
>> >> >> >> >> > *Node 6 to Node 15 --> only osds*
>> >> >> >> >> > *Node 16 to Node 20 --> only rgws*
>> >> >> >> >> >
>> >> >> >> >> > Kindly check this issue and let me know the correct
>> >> >> >> >> > troubleshooting method.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>> >>
>> >>
>>
>>
>>
>>