> On 18 March 2017 at 10:39, Muthusamy Muthiah <muthiah.muthusamy@xxxxxxxxx> wrote:
>
> Hi,
>
> We had a similar issue on one of the 5-node clusters again during
> recovery (200 of 335 OSDs are to be recovered). We see a large difference
> in OSDMap epochs between an OSD which is booting and the current cluster
> epoch, as shown below:
>
>    - In the current situation the OSDs are trying to register with an old
>      OSDMap version, 7620, but the current version in the cluster is much
>      higher, 13102; as a result it takes a long time for the OSDs to catch
>      up to this version.
>

Do you see these OSDs eating 100% CPU at that moment? E.g., could it be that the
CPUs are not fast enough to process all the map updates quickly?

IIRC, map updates are not processed multi-threaded.
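Something along these lines should show how far behind an OSD is; osd.315 below is
only an example id, and the "ceph daemon" command has to be run on the node that
hosts that OSD (assuming the default admin socket location):

# current osdmap epoch according to the cluster
ceph osd stat

# the map epochs this OSD actually has ("oldest_map" / "newest_map")
ceph daemon osd.315 status

# optionally keep the cluster from marking slow OSDs down while they catch up
ceph osd set nodown
# ...and once they have caught up again:
ceph osd unset nodown

If "newest_map" is thousands of epochs behind the cluster epoch, the OSD simply has
a lot of maps to process and it mostly comes down to waiting (or giving it more CPU).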
Wido

>
> We also see messages like the following on many OSDs which are recovering:
>
> 2017-03-18 09:19:04.628206 7f2056735700 0 -- 10.139.4.69:6836/777372 >> - conn(0x7f20c1bfa800 :6836 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
>
> Suggestions would be helpful.
>
> Thanks,
>
> Muthu
>
> On 13 February 2017 at 18:14, Wido den Hollander <wido@xxxxxxxx> wrote:
>
> > On 13 February 2017 at 12:57, Muthusamy Muthiah <muthiah.muthusamy@xxxxxxxxx> wrote:
> > >
> > > Hi All,
> > >
> > > We also have the same issue on one of our platforms, which was upgraded
> > > from 11.0.2 to 11.2.0. The issue occurs on one node alone, where the CPU
> > > hits 100% and the OSDs of that node are marked down. The issue is not
> > > seen on a cluster which was installed from scratch with 11.2.0.
> >
> > How many maps is this OSD behind?
> >
> > Does it help if you set the nodown flag for a moment to let it catch up?
> >
> > Wido
> >
> > > [root@xxxxxxxxxx ~] # systemctl start ceph-osd@315.service
> > > [root@xxxxxxxxxx ~] # cd /var/log/ceph/
> > > [root@xxxxxxxxxx ceph] # tail -f *osd*315.log
> > > 2017-02-13 11:29:46.752897 7f995c79b940 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/cls/hello/cls_hello.cc:296: loading cls_hello
> > > 2017-02-13 11:29:46.753065 7f995c79b940 0 _get_class not permitted to load kvs
> > > 2017-02-13 11:29:46.757571 7f995c79b940 0 _get_class not permitted to load lua
> > > 2017-02-13 11:29:47.058720 7f995c79b940 0 osd.315 44703 crush map has features 288514119978713088, adjusting msgr requires for clients
> > > 2017-02-13 11:29:47.058728 7f995c79b940 0 osd.315 44703 crush map has features 288514394856620032 was 8705, adjusting msgr requires for mons
> > > 2017-02-13 11:29:47.058732 7f995c79b940 0 osd.315 44703 crush map has features 288531987042664448, adjusting msgr requires for osds
> > > 2017-02-13 11:29:48.343979 7f995c79b940 0 osd.315 44703 load_pgs
> > > 2017-02-13 11:29:55.913550 7f995c79b940 0 osd.315 44703 load_pgs opened 130 pgs
> > > 2017-02-13 11:29:55.913604 7f995c79b940 0 osd.315 44703 using 1 op queue with priority op cut off at 64.
> > > 2017-02-13 11:29:55.914102 7f995c79b940 -1 osd.315 44703 log_to_monitors {default=true}
> > > 2017-02-13 11:30:19.384897 7f9939bbb700 1 heartbeat_map reset_timeout 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073336 7f9955a2b700 1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073343 7f9955a2b700 1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073344 7f9955a2b700 1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073345 7f9955a2b700 1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073347 7f9955a2b700 1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073348 7f9955a2b700 1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:54.772516 7f995c79b940 0 osd.315 44703 done with init, starting boot process
> > >
> > > Thanks,
> > > Muthu
> > >
> > > On 13 February 2017 at 10:50, Andreas Gerstmayr <andreas.gerstmayr@xxxxxxxxx> wrote:
> > >
> > > > Hi,
> > > >
> > > > Due to a faulty upgrade from Jewel 10.2.0 to Kraken 11.2.0 our test
> > > > cluster has been unhealthy for about two weeks and can't recover by
> > > > itself anymore (unfortunately I skipped the upgrade to 10.2.5 because
> > > > I missed the ".z" in "All clusters must first be upgraded to Jewel
> > > > 10.2.z").
> > > >
> > > > Immediately after the upgrade I saw the following in the OSD logs:
> > > > s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing
> > > > to send and in the half accept state just closed
> > > >
> > > > There are also missed heartbeats in the OSD logs, and the OSDs which
> > > > don't send heartbeats have the following in their logs:
> > > > 2017-02-08 19:44:51.367828 7f9be8c37700 1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > > 2017-02-08 19:44:54.271010 7f9bc4e96700 1 heartbeat_map reset_timeout 'tp_osd thread tp_osd' had timed out after 15
> > > >
> > > > While investigating we found that some OSDs were lagging about
> > > > 100-20000 OSD map epochs behind. The monitor publishes new epochs
> > > > every few seconds, but the OSD daemons are pretty slow in applying
> > > > them (up to a few minutes for 100 epochs). During recovery of the 24
> > > > OSDs of a storage node the CPU is running at almost 100% (the nodes
> > > > have 16 real cores, or 32 with Hyper-Threading).
> > > >
> > > > At times we had servers where all 24 OSDs were up to date with the
> > > > latest OSD map, but somehow they lost it and were lagging behind
> > > > again. During recovery some OSDs used up to 25 GB of RAM, which led
> > > > to out-of-memory conditions and further lagging of the OSDs on the
> > > > affected server.
> > > >
> > > > We already set the nodown, noout, norebalance, nobackfill, norecover,
> > > > noscrub and nodeep-scrub flags to prevent OSD flapping and even more
> > > > new OSD epochs.
> > > >
> > > > Is there anything we can do to let the OSDs recover? It seems that
> > > > the servers don't have enough CPU resources for recovery. I already
> > > > played around with the osd map message max setting (when I increased
> > > > it to 1000 to speed up recovery, the OSDs didn't get any updates at
> > > > all?), and the osd heartbeat grace and osd thread timeout settings
> > > > (to give the overloaded server more time), but without success so
> > > > far. I've seen errors related to the AsyncMessenger in the logs, so I
> > > > reverted back to the SimpleMessenger (which was working successfully
> > > > with Jewel).
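> > > >
> > > > In ceph.conf terms, the knobs I have been playing with look roughly
> > > > like the following; the values are only placeholders rather than the
> > > > exact ones I used, and "osd op thread timeout" is my reading of which
> > > > option the tp_osd timeouts above belong to:
> > > >
> > > > [osd]
> > > > # maximum number of map epochs sent to an OSD in a single map message
> > > > osd map message max = 100
> > > > # seconds without a heartbeat before peers report an OSD as down
> > > > osd heartbeat grace = 60
> > > > # seconds before a tp_osd worker thread is flagged as timed out
> > > > osd op thread timeout = 60
> > > > # use the SimpleMessenger instead of the AsyncMessenger
> > > > ms type = simple
> > > >
> > > > (As far as I know, changing "ms type" only takes effect after a
> > > > daemon restart.)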
> > > >
> > > > Cluster details:
> > > > 6 storage nodes with 2x Intel Xeon E5-2630 v3 (8x 2.40GHz)
> > > > 256 GB RAM
> > > > Each storage node has 24 HDDs attached, one OSD per disk, journal on same disk
> > > > 3 monitors in total, co-located with the storage nodes
> > > > Separate front and back network (10 Gbit)
> > > > OS: CentOS 7.2.1511
> > > > Kernel: 4.9.8-1.el7.elrepo.x86_64 from elrepo.org
> > > >
> > > > Thanks,
> > > > Andreas

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com