Re: OSDs cannot match up with fast OSD map changes (epochs) during recovery

Hi Wido,

Yes, the slow map updates were happening and the CPU was hitting 100%.
We also tried setting the noup flag so that the cluster osdmap stayed at the same version. With the map frozen, each OSD slowly caught up to the current epoch. At one point we lost patience due to critical timelines and re-installed the cluster. However, we plan to attempt this recovery again and work out the optimal procedure.
Sage commented that Luminous has another solution that recovers the OSDs much faster than the current approach, by skipping some maps instead of applying them sequentially.
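
For reference, a rough sketch of that flag-based approach (the OSD id and the epoch shown are only examples):

    ceph osd set noup            # booting OSDs stay marked down, so the cluster osdmap stops churning
    ceph osd dump | head -1      # current cluster epoch, e.g. "epoch 13102"
    ceph daemon osd.12 status    # on the OSD host: "newest_map" shows how far this OSD has caught up
    ceph osd unset noup          # once newest_map reaches the cluster epoch, let the OSDs come up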

Thanks,
Muthu

On 20 March 2017 at 22:13, Wido den Hollander <wido@xxxxxxxx> wrote:

> On 18 March 2017 at 10:39, Muthusamy Muthiah <muthiah.muthusamy@xxxxxxxxx> wrote:
>
>
> Hi,
>
> We had a similar issue on one of our 5-node clusters, again during
> recovery (200 of 335 OSDs needed to be recovered). We see a large gap
> in the OSDmap epochs between an OSD that is booting and the current
> one; details below:
>
> - In the current situation the OSDs are trying to register with an
> old OSDMap version (7620), but the current version in the cluster is
> much higher (13102). As a result it takes a long time for each OSD to
> catch up to the current version.
>

Do you see these OSDs eating 100% CPU at that moment? E.g., could it be that the CPUs are not fast enough to process all the map updates quickly enough?

IIRC, map updates are not processed multi-threaded.
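
For what it's worth, a quick way to check both (osd.315 is just an example id; run the daemon command on the host that owns that OSD):

    top -b -n 1 | grep ceph-osd     # is a ceph-osd process pinned near 100% of a core?
    ceph daemon osd.315 status      # "newest_map" = latest epoch this OSD has processed
    ceph osd dump | head -1         # "epoch NNNN" = current cluster osdmap epoch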

Wido

>
> We also see messages like the following on many OSDs that are recovering:
>
> 2017-03-18 09:19:04.628206 7f2056735700 0 -- 10.139.4.69:6836/777372 >> - conn(0x7f20c1bfa800 :6836 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
>
> Suggestions would be helpful.
>
>
> Thanks,
>
> Muthu
>
> On 13 February 2017 at 18:14, Wido den Hollander <wido@xxxxxxxx> wrote:
>
> >
> > > On 13 February 2017 at 12:57, Muthusamy Muthiah <muthiah.muthusamy@xxxxxxxxx> wrote:
> > >
> > >
> > > Hi All,
> > >
> > > We also have the same issue on one of our platforms, which was upgraded
> > > from 11.0.2 to 11.2.0. The issue occurs on one node alone, where the CPU
> > > hits 100% and the OSDs of that node get marked down. The issue is not seen
> > > on a cluster that was installed from scratch with 11.2.0.
> > >
> >
> > How many maps is this OSD behind?
> >
> > Does it help if you set the nodown flag for a moment to let it catch up?
> >
> > Wido
> >
> > >
> > > [root@xxxxxxxxxx ~] # systemctl start ceph-osd@315.service
> > > [root@xxxxxxxxxx ~] # cd /var/log/ceph/
> > > [root@xxxxxxxxxx ceph] # tail -f *osd*315.log
> > > 2017-02-13 11:29:46.752897 7f995c79b940  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/cls/hello/cls_hello.cc:296: loading cls_hello
> > > 2017-02-13 11:29:46.753065 7f995c79b940  0 _get_class not permitted to load kvs
> > > 2017-02-13 11:29:46.757571 7f995c79b940  0 _get_class not permitted to load lua
> > > 2017-02-13 11:29:47.058720 7f995c79b940  0 osd.315 44703 crush map has features 288514119978713088, adjusting msgr requires for clients
> > > 2017-02-13 11:29:47.058728 7f995c79b940  0 osd.315 44703 crush map has features 288514394856620032 was 8705, adjusting msgr requires for mons
> > > 2017-02-13 11:29:47.058732 7f995c79b940  0 osd.315 44703 crush map has features 288531987042664448, adjusting msgr requires for osds
> > > 2017-02-13 11:29:48.343979 7f995c79b940  0 osd.315 44703 load_pgs
> > > 2017-02-13 11:29:55.913550 7f995c79b940  0 osd.315 44703 load_pgs opened 130 pgs
> > > 2017-02-13 11:29:55.913604 7f995c79b940  0 osd.315 44703 using 1 op queue with priority op cut off at 64.
> > > 2017-02-13 11:29:55.914102 7f995c79b940 -1 osd.315 44703 log_to_monitors {default=true}
> > > 2017-02-13 11:30:19.384897 7f9939bbb700  1 heartbeat_map reset_timeout 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073336 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073343 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073344 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073345 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073347 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073348 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:54.772516 7f995c79b940  0 osd.315 44703 done with init, starting boot process
> > >
> > >
> > > Thanks,
> > > Muthu
> > >
> > > On 13 February 2017 at 10:50, Andreas Gerstmayr <andreas.gerstmayr@xxxxxxxxx> wrote:
> > >
> > > > Hi,
> > > >
> > > > Due to a faulty upgrade from Jewel 10.2.0 to Kraken 11.2.0 our test
> > > > cluster has been unhealthy for about two weeks and can't recover by
> > > > itself anymore (unfortunately I skipped the upgrade to 10.2.5 because
> > > > I missed the ".z" in "All clusters must first be upgraded to Jewel
> > > > 10.2.z").
> > > >
> > > > Immediately after the upgrade I saw the following in the OSD logs:
> > > > s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing
> > > > to send and in the half  accept state just closed
> > > >
> > > > There are also missed heartbeats in the OSD logs, and the OSDs which
> > > > don't send heartbeats have the following in their logs:
> > > > 2017-02-08 19:44:51.367828 7f9be8c37700  1 heartbeat_map is_healthy
> > > > 'tp_osd thread tp_osd' had timed out after 15
> > > > 2017-02-08 19:44:54.271010 7f9bc4e96700  1 heartbeat_map reset_timeout
> > > > 'tp_osd thread tp_osd' had timed out after 15
> > > >
> > > > While investigating we found that some OSDs were lagging about
> > > > 100-20000 OSD map epochs behind. The monitor publishes new epochs
> > > > every few seconds, but the OSD daemons are quite slow in applying
> > > > them (up to a few minutes for 100 epochs). During recovery of the 24
> > > > OSDs of a storage node the CPU is running at almost 100% (the nodes
> > > > have 16 real cores, or 32 with Hyper-Threading).
> > > >
> > > > At times we had servers where all 24 OSDs were up to date with the
> > > > latest OSD map, but somehow they fell behind again. During recovery
> > > > some OSDs used up to 25 GB of RAM, which led to out-of-memory
> > > > conditions and further lagging of the OSDs on the affected server.
> > > >
> > > > We already set the nodown, noout, norebalance, nobackfill, norecover,
> > > > noscrub and nodeep-scrub flags to prevent OSD flapping and even more
> > > > new OSD epochs.
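
For reference, setting those flags looks like this (shown only as an illustration, not as commands taken from the thread):

    ceph osd set nodown
    ceph osd set noout
    ceph osd set norebalance
    ceph osd set nobackfill
    ceph osd set norecover
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # and "ceph osd unset <flag>" for each of them once recovery is done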
> > > >
> > > > Is there anything we can do to let the OSDs recover? It seems that the
> > > > servers don't have enough CPU resources for recovery. I already played
> > > > around with the osd map message max setting (when I increased it to
> > > > 1000 to speed up recovery, the OSDs didn't get any updates at all?),
> > > > and the osd heartbeat grace and osd thread timeout settings (to give
> > > > the overloaded server more time), but without success so far. I've
> > > > seen errors related to the AsyncMessenger in the logs, so I reverted
> > > > back to the SimpleMessenger (which was working successfully with
> > > > Jewel).
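
The settings mentioned above would normally go into ceph.conf on the OSD hosts; a sketch with placeholder values (the exact numbers are not from this thread, and "osd op thread timeout" is an assumption about which timeout option is meant):

    [osd]
    osd map message max   = 40       # maps carried per osdmap message from the monitor
    osd heartbeat grace   = 60       # extra slack before peers report an OSD down (default is 20)
    osd op thread timeout = 60       # likely the knob behind the "timed out after 15" lines (default 15)
    ms type               = simple   # revert from the async messenger to the simple messenger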
> > > >
> > > >
> > > > Cluster details:
> > > > 6 storage nodes with 2x Intel Xeon E5-2630 v3 8x2.40GHz
> > > > 256GB RAM
> > > > Each storage node has 24 HDDs attached, one OSD per disk, journal on
> > same
> > > > disk
> > > > 3 monitors in total, co-located with the storage nodes
> > > > separate front and back network (10 Gbit)
> > > > OS: CentOS 7.2.1511
> > > > Kernel: 4.9.8-1.el7.elrepo.x86_64 from elrepo.org
> > > >
> > > >
> > > > Thanks,
> > > > Andreas
> >

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
