Hi Eugen,

The OSDs fail because of RAM/CPU overload, or something along those lines. After an OSD fails it starts again, so that is not the problem. What I need to know is why the RGWs fail when the OSDs go down. The RGW log output is below:

2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework: beast
2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework conf key: port, val: 80
2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 radosgw_Main not setting numa affinity
2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 D3N datacache enabled: 0
2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization timeout, failed to initialize
2022-09-07T12:08:53.395+0000 7f69017095c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
2022-09-07T12:08:53.395+0000 7f69017095c0 0 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process radosgw, pid 7
2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework: beast
2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework conf key: port, val: 80
2022-09-07T12:08:53.395+0000 7f69017095c0 1 radosgw_Main not setting numa affinity
2022-09-07T12:08:53.395+0000 7f69017095c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
2022-09-07T12:08:53.395+0000 7f69017095c0 1 D3N datacache enabled: 0
2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization timeout, failed to initialize
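
For anyone following the thread, a rough way to check whether the kernel OOM killer or a container resource limit is behind the OSD restarts might look like this (a sketch only; osd.12 is a placeholder id and this assumes a cephadm/containerized deployment):

# kernel log entries from the OOM killer
dmesg -T | grep -iE 'out of memory|oom'

# daemon crashes recorded by Ceph itself
ceph crash ls

# recent log output of one of the affected OSDs
cephadm logs --name osd.12
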
On Tue, Sep 27, 2022 at 6:38 PM Eugen Block <eblock@xxxxxx> wrote:

> > If I set ceph osd set nodown what will happen to the cluster? Example, the
> > migration goes on and I enable this parameter. Will it cause any issue
> > while migrating the data?
>
> Well, since we don't really know what is going on there it's hard to
> tell. But that flag basically prevents the MONs from marking the OSDs
> down (wrongly). But to verify we would need more information why the
> OSDs fail or who stops them. Is it a container resource limit? Not
> enough CPU/RAM or whatever? Do you see anything in dmesg indicating an
> oom killer? If OSDs go down it's logged so there should be something
> in the logs.
> You don't respond to all questions so it's really hard to assist here,
> to be honest.
>
> > Can you suggest a good recovery option in an erasure coded pool? Because the k
> > means the copy value 11 and m the parity value 4, I thought that means in
> > 15 hosts 3 hosts may go down and also we migrate the data.
>
> Your assumption is correct, you should be able to sustain the failure
> of three hosts without client impact, but if multiple OSDs across more
> hosts fail (holding PGs of the same pool(s)) you would have inactive
> PGs as you already reported.
>
> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>
> > Hi Eugen,
> >
> > Thanks for your reply.
> >
> > Can you suggest a good recovery option in an erasure coded pool? Because the k
> > means the copy value 11 and m the parity value 4, I thought that means in
> > 15 hosts 3 hosts may go down and also we migrate the data.
> >
> > If I set ceph osd set nodown what will happen to the cluster? Example, the
> > migration goes on and I enable this parameter. Will it cause any issue
> > while migrating the data?
> >
> > I didn't see any mon and mgr logs.
> >
> > On Tue, Sep 27, 2022 at 3:07 PM Eugen Block <eblock@xxxxxx> wrote:
> >
> >> > No pg recovery starts automatically when the osd starts.
> >>
> >> So you mean that you still have inactive PGs although your OSDs are
> >> all up? In that case try to 'ceph pg repeer <PG_ID>' to activate the
> >> PGs, maybe your RGWs will start then.
> >>
> >> > I'm using an erasure coded pool for rgw. In that rule we have k=11 m=4,
> >> > total 15 hosts, and the crush rule is host.
> >>
> >> That means if one host goes down you can't recover until the node is
> >> back, you should have at least one or two more nodes so you have at
> >> least some recovery options.
> >>
> >> > When I migrate data at high speed, around 2 Gbps, the osds automatically
> >> > go down. Some osds are automatically restarted; some of the osds we need
> >> > to start manually.
> >>
> >> Did you check the MON and/or MGR logs? Do the MONs mark the OSDs down
> >> after 10 minutes (or was it 15?)? That sounds a bit like flapping
> >> OSDs, you might want to check the mailing list archives for that,
> >> setting 'ceph osd set nodown' might help during the migration. But are
> >> the OSDs fully saturated ('iostat -xmt /dev/sd* 1')? If updating helps
> >> just stay on that version and maybe report a tracker issue with your
> >> findings.
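
(For readers following along, the commands Eugen mentions above would look roughly like this; the PG id and the device names are placeholders, not values from this cluster:)

# list PGs stuck inactive, then re-peer one of them
ceph pg dump_stuck inactive
ceph pg repeer 7.1a

# keep the MONs from marking OSDs down during the migration, and clear the flag afterwards
ceph osd set nodown
ceph osd unset nodown

# check whether the OSD devices are saturated while the migration runs
iostat -xmt /dev/sd* 1
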
> >>
> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >>
> >> > Hi Eugen,
> >> >
> >> > Yes, the osds stay online when I start them manually.
> >> >
> >> > No pg recovery starts automatically when the osd starts.
> >> >
> >> > I'm using an erasure coded pool for rgw. In that rule we have k=11 m=4,
> >> > total 15 hosts, and the crush rule is host.
> >> >
> >> > I didn't find any error logs in the osds.
> >> >
> >> > The first time, I upgraded the ceph version from pacific to quincy.
> >> >
> >> > The second time, I upgraded the ceph version from quincy 17.2.1 to 17.2.2.
> >> >
> >> > I have a doubt: we are migrating data from Scality to Ceph. Normally we
> >> > migrate the data at 800 to 900 Mbps and it does not cause the problem.
> >> >
> >> > When I migrate data at high speed, around 2 Gbps, the osds automatically
> >> > go down. Some osds are automatically restarted; some of the osds we need
> >> > to start manually.
> >> >
> >> > On Mon, Sep 26, 2022 at 11:06 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >
> >> >> > Yes, I have inactive pgs when the osds go down. Then I started the osds
> >> >> > manually, but the rgw fails to start.
> >> >>
> >> >> But the OSDs stay online if you start them manually? Do the inactive
> >> >> PGs recover when you start them manually? By the way, you should check
> >> >> your crush rules, depending on how many OSDs fail you may have room
> >> >> for improvement there. And why do the OSDs fail with automatic
> >> >> restart, what's in the logs?
> >> >>
> >> >> > Upgrading to a newer version is the only thing that fixed the issue, and
> >> >> > we faced this issue two times.
> >> >>
> >> >> What versions are you using (ceph versions)?
> >> >>
> >> >> > I don't know why it is happening. But maybe it is because the rgws are
> >> >> > running on separate machines. Could that cause the issue?
> >> >>
> >> >> I don't know how that should
> >> >>
> >> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >>
> >> >> > Hi Eugen,
> >> >> >
> >> >> > Yes, I have inactive pgs when the osds go down. Then I started the osds
> >> >> > manually, but the rgw fails to start.
> >> >> >
> >> >> > Upgrading to a newer version is the only thing that fixed the issue, and
> >> >> > we faced this issue two times.
> >> >> >
> >> >> > I don't know why it is happening. But maybe it is because the rgws are
> >> >> > running on separate machines. Could that cause the issue?
> >> >> >
> >> >> > On Sat, Sep 10, 2022 at 11:27 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >> >
> >> >> >> You didn't respond to the other questions. If you want people to be
> >> >> >> able to help you need to provide more information. If your OSDs fail
> >> >> >> do you have inactive PGs? Or do you have full OSDs which would prevent
> >> >> >> the RGWs from starting? I'm assuming that if you fix your OSDs the RGWs
> >> >> >> would start working again. But then again, we still don't know
> >> >> >> anything about the current situation.
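
(A quick way to answer the two questions above, i.e. whether there are inactive PGs and whether any OSDs are full, would be something like this:)

# overall health, including inactive PG and full/nearfull warnings
ceph health detail

# PGs that are stuck in an inactive state
ceph pg dump_stuck inactive

# per-OSD utilisation, to spot full or nearly full OSDs
ceph osd df
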
> >> >> >>
> >> >> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >> >>
> >> >> >> > Hi Eugen,
> >> >> >> >
> >> >> >> > Below is the log output:
> >> >> >> >
> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework: beast
> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 0 framework conf key: port, val: 80
> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 radosgw_Main not setting numa affinity
> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> >> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0 1 D3N datacache enabled: 0
> >> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> > 2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization timeout, failed to initialize
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process radosgw, pid 7
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework: beast
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 0 framework conf key: port, val: 80
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 1 radosgw_Main not setting numa affinity
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> >> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0 1 D3N datacache enabled: 0
> >> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0 1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> >> > 2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization timeout, failed to initialize
> >> >> >> >
> >> >> >> > I installed the cluster in quincy.
> >> >> >> >
> >> >> >> > On Sat, Sep 10, 2022 at 4:02 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >> >> >
> >> >> >> >> What troubleshooting have you tried? You don't provide any log output
> >> >> >> >> or information about the cluster setup, for example the ceph osd tree,
> >> >> >> >> ceph status, are the failing OSDs random or do they all belong to the
> >> >> >> >> same pool? Any log output from failing OSDs and the RGWs might help,
> >> >> >> >> otherwise it's just wild guessing. Is the cluster a new installation
> >> >> >> >> with cephadm or an older cluster upgraded to Quincy?
> >> >> >> >>
> >> >> >> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >> >> >>
> >> >> >> >> > Hi all,
> >> >> >> >> >
> >> >> >> >> > I have one critical issue in my prod cluster. When the customer's data
> >> >> >> >> > comes from 600 MiB.
> >> >> >> >> >
> >> >> >> >> > My OSDs go down, *8 to 20 from 238*. Then I manually bring up my osds.
> >> >> >> >> > After a few minutes, all my rgws crash.
> >> >> >> >> >
> >> >> >> >> > We did some troubleshooting but nothing worked. When we upgraded ceph
> >> >> >> >> > from 17.2.0 to 17.2.1 it was resolved. We faced the issue two times,
> >> >> >> >> > but both times we upgraded the ceph.
> >> >> >> >> >
> >> >> >> >> > *Node schema:*
> >> >> >> >> >
> >> >> >> >> > *Node 1 to Node 5 --> mon, mgr and osds*
> >> >> >> >> > *Node 6 to Node 15 --> only osds*
> >> >> >> >> > *Node 16 to Node 20 --> only rgws.*
> >> >> >> >> >
> >> >> >> >> > Kindly check this issue and let me know the correct troubleshooting method.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx