Re: OSD Crash makes whole cluster unusable ?

So the problem started once remapping+backfilling started, and lasted until the cluster was healthy again?  Have you adjusted any of the recovery tunables?  Are you using SSD journals?

I had a similar experience the first time my OSDs started backfilling.  The average RadosGW operation latency went from 0.1 seconds to 10 seconds, which is longer than the default HAProxy timeout.  Fun times.

Since then, I've increased HAProxy's timeouts, de-prioritized Ceph's recovery, and added SSD journals.
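Roughly, the HAProxy change looked like this (the numbers below are illustrative, not copied from my config; tune them to whatever your worst-case latency is during recovery):

defaults
  timeout connect 10s
  timeout client  300s
  timeout server  300s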

The relevant sections of ceph.conf are:

[global]
  mon osd down out interval = 900
  mon osd min down reporters = 9
  mon osd min down reports = 12
  mon warn on legacy crush tunables = false
  osd pool default flag hashpspool = true

[osd]
  osd max backfills = 3
  osd recovery max active = 3
  osd recovery op priority = 1
  osd scrub sleep = 1.0
  osd snap trim sleep = 1.0


Before the SSD journals, I had osd_max_backfills and osd_recovery_max_active set to 1.  I watched my latency graphs, and used ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1' to tweak the values until the latency was acceptable.
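If you want to sanity-check what a running OSD actually ended up with after injectargs, you can query its admin socket on the OSD's host (osd.0 and the default socket path here are just examples):

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'osd_max_backfills|osd_recovery_max_active'

The same injectargs call can be used to raise the values back up once the recovery impact is acceptable.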

On Tue, Dec 16, 2014 at 5:37 AM, Christoph Adomeit <Christoph.Adomeit@xxxxxxxxxxx> wrote:

Hi there,

today I had an OSD crash with Ceph 0.87/giant which made my whole cluster unusable for 45 minutes.

First it began with a disk error:

sd 0:1:2:0: [sdc] CDB: Read(10): 28 00 0d fe fd e8 00 00 b0 00
sd 0:1:2:0: [sdc] CDB: Read(10): 28 00 15 d0 7b f8 00 00 08 00
XFS (sdc1): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.

Then most of the other OSDs noticed that osd.3 was down:

2014-12-16 08:45:15.873478 mon.0 10.67.1.11:6789/0 3361077 : cluster [INF] osd.3 10.67.1.11:6810/713621 failed (42 reports from 35 peers after 23.642482 >= grace 23.348982)

Five minutes later the OSD was marked out:
2014-12-16 08:50:21.095903 mon.0 10.67.1.11:6789/0 3361367 : cluster [INF] osd.3 out (down for 304.581079)

However, from 8:45 until 9:20 I had 1000 slow requests and 107 incomplete PGs. Many requests were not answered:

2014-12-16 08:46:03.029094 mon.0 10.67.1.11:6789/0 3361126 : cluster [INF] pgmap v6930583: 4224 pgs: 4117 active+clean, 107 incomplete; 7647 GB data, 19090 GB used, 67952 GB / 87042 GB avail; 2307 kB/s rd, 2293 kB/s wr, 407 op/s

Also, recovery to another OSD was not starting.

It seems the OSD thought it was still up while all the other OSDs considered it down?
I found this in the log of osd.3:
ceph-osd.3.log:2014-12-16 08:45:19.319152 7faf81296700  0 log_channel(default) log [WRN] : map e61177 wrongly marked me down
ceph-osd.3.log:  -440> 2014-12-16 08:45:19.319152 7faf81296700  0 log_channel(default) log [WRN] : map e61177 wrongly marked me down

Luckily I was able to restart osd.3 and everything was working again, but I do not understand what happened. The cluster was simply not usable for 45 minutes.

Any ideas?

Thanks
  Christoph


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
