Hi Eugen,

Thanks for your reply. Can you suggest a good recovery option for the
erasure-coded pool? My understanding is that k=11 is the number of data
chunks and m=4 the number of coding (parity) chunks, so I thought that
with 15 hosts up to 3 hosts could be down while we still migrate the data.

Also, if I set 'ceph osd set nodown', what will happen to the cluster?
For example, if the migration is running and I enable this flag, will it
cause any issue while the data is being migrated?

I didn't see anything in the MON and MGR logs.
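
For reference, this is roughly what I plan to run before and during the
next migration window; the profile, pool and PG names are placeholders
for ours, so please correct me if any of it is wrong:

  # Confirm the EC profile and how many chunk losses the pool tolerates
  ceph osd erasure-code-profile get <ec-profile>
  ceph osd pool get <rgw-data-pool> min_size   # if this is the default of k+1, it would be 12 here

  # List PGs that are stuck inactive and try to re-activate them
  ceph pg dump_stuck inactive
  ceph pg repeer <PG_ID>

  # Keep the MONs from marking flapping OSDs down during the migration,
  # and remove the flag again as soon as the migration has finished
  ceph osd set nodown
  ceph osd unset nodown

  # Check whether the OSD disks are saturated while the data is flowing in
  iostat -xmt /dev/sd* 1

As far as I understand, nodown only stops the MONs from marking OSDs
down; it does not fix the underlying flapping, and it would also hide an
OSD that has really died, so I would only set it temporarily.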

On Tue, Sep 27, 2022 at 3:07 PM Eugen Block <eblock@xxxxxx> wrote:

> > No pg recovery starts automatically when the osd starts.
>
> So you mean that you still have inactive PGs although your OSDs are
> all up? In that case try to 'ceph pg repeer <PG_ID>' to activate the
> PGs, maybe your RGWs will start then.
>
> > I'm using an erasure coded pool for rgw. In that rule we have k=11 m=4,
> > total 15 hosts, and the crush failure domain is host.
>
> That means if one host goes down you can't recover until the node is
> back, you should have at least one or two more nodes so you have at
> least some recovery options.
>
> > When I migrate data at a high rate (2 Gbps) the OSDs go down
> > automatically. Some OSDs come back up on their own, but some we need
> > to start manually.
>
> Did you check the MON and/or MGR logs? Do the MONs mark the OSDs down
> after 10 minutes (or was it 15)? That sounds a bit like flapping OSDs,
> you might want to check the mailing list archives for that; setting
> 'ceph osd set nodown' might help during the migration. But are the
> OSDs fully saturated ('iostat -xmt /dev/sd* 1')? If updating helps,
> just stay on that version and maybe report a tracker issue with your
> findings.
>
> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
>
> > Hi Eugen,
> >
> > Yes, the OSDs stay online when I start them manually.
> >
> > No PG recovery starts automatically when the OSDs start.
> >
> > I'm using an erasure coded pool for rgw. In that rule we have k=11 m=4,
> > total 15 hosts, and the crush failure domain is host.
> >
> > I didn't find any error logs in the OSDs.
> >
> > The first time, I upgraded the Ceph version from Pacific to Quincy.
> >
> > The second time, I upgraded the Ceph version from Quincy 17.2.1 to 17.2.2.
> >
> > One doubt: we are migrating data from Scality to Ceph. Normally we
> > migrate at about 800 to 900 Mbps and that does not cause the problem.
> >
> > When I migrate data at a high rate (2 Gbps) the OSDs go down
> > automatically. Some OSDs come back up on their own, but some we need
> > to start manually.
> >
> > On Mon, Sep 26, 2022 at 11:06 PM Eugen Block <eblock@xxxxxx> wrote:
> >
> >> > Yes, I have inactive PGs when the OSDs go down. Then I started the
> >> > OSDs manually. But the rgw fails to start.
> >>
> >> But the OSDs stay online if you start them manually? Do the inactive
> >> PGs recover when you start them manually? By the way, you should check
> >> your crush rules, depending on how many OSDs fail you may have room
> >> for improvement there. And why do the OSDs fail with automatic
> >> restart, what's in the logs?
> >>
> >> > Only upgrading to a newer version fixes the issue, and we have faced
> >> > this issue two times.
> >>
> >> What versions are you using (ceph versions)?
> >>
> >> > I don't know why it is happening. But maybe it is because the rgws
> >> > are running on separate machines. Could that cause the issue?
> >>
> >> I don't know how that should
> >>
> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >>
> >> > Hi Eugen,
> >> >
> >> > Yes, I have inactive PGs when the OSDs go down. Then I started the
> >> > OSDs manually. But the rgw fails to start.
> >> >
> >> > Only upgrading to a newer version fixes the issue, and we have faced
> >> > this issue two times.
> >> >
> >> > I don't know why it is happening. But maybe it is because the rgws
> >> > are running on separate machines. Could that cause the issue?
> >> >
> >> > On Sat, Sep 10, 2022 at 11:27 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >
> >> >> You didn't respond to the other questions. If you want people to be
> >> >> able to help, you need to provide more information. If your OSDs
> >> >> fail, do you have inactive PGs? Or do you have full OSDs which would
> >> >> prevent RGW from starting? I'm assuming that if you fix your OSDs
> >> >> the RGWs would start working again. But then again, we still don't
> >> >> know anything about the current situation.
> >> >>
> >> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >>
> >> >> > Hi Eugen,
> >> >> >
> >> >> > Below is the log output:
> >> >> >
> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  0 framework: beast
> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  0 framework conf key: port, val: 80
> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 radosgw_Main not setting numa affinity
> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> >> >> > 2022-09-07T12:03:42.893+0000 7fdd23fdc5c0  1 D3N datacache enabled: 0
> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> > 2022-09-07T12:03:53.313+0000 7fdd23fdc5c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> > 2022-09-07T12:08:42.891+0000 7fdd1661c700 -1 Initialization timeout, failed to initialize
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 deferred set uid:gid to 167:167 (ceph:ceph)
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process radosgw, pid 7
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 framework: beast
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  0 framework conf key: port, val: 80
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 radosgw_Main not setting numa affinity
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> >> >> > 2022-09-07T12:08:53.395+0000 7f69017095c0  1 D3N datacache enabled: 0
> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> > 2022-09-07T12:09:03.747+0000 7f69017095c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yi>
> >> >> > 2022-09-07T12:13:53.397+0000 7f68f3d49700 -1 Initialization timeout, failed to initialize
> >> >> >
> >> >> > I installed the cluster on Quincy.
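
One thing I notice in the RGW log above: the daemon starts at 12:03:42
and logs "Initialization timeout, failed to initialize" at 12:08:42,
exactly 300 seconds later, which looks like the default rgw_init_timeout.
So I assume the rgws are not crashing on their own but give up because
they cannot reach their pools while PGs are inactive. If that is right,
checking the RGW pools for non-active PGs (assuming the default pool
names) should confirm it, e.g.:

  ceph health detail | grep -i -e inactive -e down
  ceph pg ls-by-pool .rgw.root | grep -v 'active+clean'

and the rgws should come back on their own once all PGs are active again.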
> >> >> > On Sat, Sep 10, 2022 at 4:02 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >> >
> >> >> >> What troubleshooting have you tried? You don't provide any log
> >> >> >> output or information about the cluster setup, for example the
> >> >> >> ceph osd tree or ceph status; are the failing OSDs random or do
> >> >> >> they all belong to the same pool? Any log output from failing
> >> >> >> OSDs and the RGWs might help, otherwise it's just wild guessing.
> >> >> >> Is the cluster a new installation with cephadm or an older
> >> >> >> cluster upgraded to Quincy?
> >> >> >>
> >> >> >> Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >> >> >>
> >> >> >> > Hi all,
> >> >> >> >
> >> >> >> > I have one critical issue in my prod cluster when the customer's
> >> >> >> > data comes in at 600 MiB.
> >> >> >> >
> >> >> >> > My OSDs go down, *8 to 20 out of 238*. Then I bring the OSDs up
> >> >> >> > manually. After a few minutes, all my rgws crash.
> >> >> >> >
> >> >> >> > We did some troubleshooting but nothing worked. When we upgraded
> >> >> >> > Ceph from 17.2.0 to 17.2.1 it was resolved. We have faced the
> >> >> >> > issue two times, and both times we upgraded Ceph.
> >> >> >> >
> >> >> >> > *Node schema:*
> >> >> >> >
> >> >> >> > *Node 1 to Node 5   --> mon, mgr and osds*
> >> >> >> > *Node 6 to Node 15  --> only osds*
> >> >> >> > *Node 16 to Node 20 --> only rgws*
> >> >> >> >
> >> >> >> > Kindly check this issue and let me know the correct
> >> >> >> > troubleshooting method.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx