Re: deep scrubbing causes osd down

Hi Andrei,

There is one parameter, osd_max_scrubs I think, that controls the number of simultaneous scrubs per OSD. But the default is 1 if I'm correct.

Can you check on one of your OSDs with the admin socket?
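For reference, the current value could be queried on a running OSD like this (osd.0 and the socket path are examples; adjust to your deployment):

```shell
# Ask a running OSD for its effective scrub limit (osd.0 is an example id)
ceph daemon osd.0 config get osd_max_scrubs

# Equivalent form with an explicit admin socket path
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_max_scrubs
```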

Failing that, there remains the option of scheduling the deep scrubs via a cron job after setting the nodeep-scrub flag to prevent automatic deep scrubbing.
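As a rough sketch of that cron approach (hypothetical helper names; in practice the PG list and scrub timestamps would come from `ceph pg dump`, and the nodeep-scrub flag must already be set):

```python
import subprocess

def pick_oldest(pgs, n):
    """Return the ids of the n PGs whose last deep scrub is oldest.

    pgs is a list of (pgid, last_deep_scrub_stamp) tuples; ISO-style
    timestamps compare correctly as plain strings.
    """
    return [pgid for pgid, _ in sorted(pgs, key=lambda p: p[1])[:n]]

def deep_scrub(pgids):
    """Kick off one manual deep scrub per selected PG."""
    for pgid in pgids:
        subprocess.run(["ceph", "pg", "deep-scrub", pgid], check=True)

# Example: scrub the 2 least recently deep-scrubbed PGs each cron run
sample = [("5.3d7", "2015-04-10 19:16:29"),
          ("5.276", "2015-04-08 02:11:03"),
          ("5.287", "2015-04-09 14:40:51")]
print(pick_oldest(sample, 2))  # ['5.276', '5.287']
```

Run from cron every few hours, this spreads deep scrubs out instead of letting them pile up.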

Dan van der Ster had a post on this ML about that approach.
JC

Sent while moving. Excuse unintended typos.

On Apr 12, 2015, at 05:21, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:


JC,

the restart of the osd servers seems to have stabilised the cluster. It has been a few hours since the restart and I haven't seen a single osd disconnect.

Is there a way to limit the total number of scrub and/or deep-scrub processes running at the same time? For instance, I do not want more than 1 or 2 scrubs/deep-scrubs running simultaneously across my cluster. How do I implement this?

Thanks

Andrei


From: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
To: "LOPEZ Jean-Charles" <jelopez@xxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Sunday, 12 April, 2015 9:02:05 AM
Subject: Re: deep scrubbing causes osd down

JC,

I've implemented the following changes to ceph.conf and restarted the mons and osds.

osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5


Things have become considerably worse after the changes. Shortly after applying them, the majority of osd processes started taking over 100% CPU and the cluster slowed down considerably. All my vms are reporting high IO wait (30-80%), even vms that are pretty idle and don't do much.

I have tried restarting all osds, but shortly after the restart the CPU usage goes up again. The osds are showing the following logs:

2015-04-12 08:39:28.853860 7f96f81dd700  0 log_channel(default) log [WRN] : slow request 60.277590 seconds old, received at 2015-04-12 08:38:28.576168: osd_op(client.69637439.0:290325926 rbd_data.265f967a5f7514.0000000000004a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object
2015-04-12 08:39:28.853863 7f96f81dd700  0 log_channel(default) log [WRN] : slow request 60.246943 seconds old, received at 2015-04-12 08:38:28.606815: osd_op(client.69637439.0:290325927 rbd_data.265f967a5f7514.0000000000004a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object
2015-04-12 08:39:36.855180 7f96f81dd700  0 log_channel(default) log [WRN] : 7 slow requests, 1 included below; oldest blocked for > 68.278951 secs
2015-04-12 08:39:36.855191 7f96f81dd700  0 log_channel(default) log [WRN] : slow request 30.268450 seconds old, received at 2015-04-12 08:39:06.586669: osd_op(client.64965167.0:1607510 rbd_data.1f264b2ae8944a.0000000000000228 [set-alloc-hint object_size 4194304 write_size 4194304,write 3584000~69632] 5.30418007 ack+ondisk+write+known_if_redirected e74834) currently waiting for subops from 9
2015-04-12 08:40:43.570004 7f96dd693700  0 <cls> cls/rgw/cls_rgw.cc:1458: gc_iterate_entries end_key=1_01428824443.569998000

[In total I've accumulated around 40,000 slow request entries overnight]

On top of that, I have reports of osds going down and coming back up as frequently as every 10-20 minutes. This affects all osds, not a particular set of osds.

I will restart the osd servers to see if it makes a difference; otherwise, I will need to revert to the default settings, as the cluster is currently not functional.

Andrei


From: "LOPEZ Jean-Charles" <jelopez@xxxxxxxxxx>
To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
Cc: "LOPEZ Jean-Charles" <jelopez@xxxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx
Sent: Saturday, 11 April, 2015 7:54:18 PM
Subject: Re: [ceph-users] deep scrubbing causes osd down

Hi Andrei,

1) what ceph version are you running?
2) what distro and version are you running?
3) have you checked that the disk elevator for the OSD devices is set to cfq?
4) have you considered exploring the following parameters for further tuning?
- osd_scrub_chunk_min: lower the default value of 5, e.g. = 1
- osd_scrub_chunk_max: lower the default value of 25, e.g. = 5
- osd_deep_scrub_stride: if you have lowered the parameters above, you can play with this one to best fit your physical disk behaviour
- osd_scrub_sleep: introduce a sleep between two scrub chunks, e.g. = 0.5 to start with a half-second delay
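Expressed as a ceph.conf fragment (the example values from above, not universal recommendations):

```ini
[osd]
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5
osd_scrub_sleep = 0.5
# osd_deep_scrub_stride can then be tuned to the disks' behaviour
```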


Cheers
JC


On 10 Apr 2015, at 12:01, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:

Hi guys,

I was wondering if anyone has noticed that the deep scrubbing process causes some osds to go down?

I have been keeping an eye on a few remaining stability issues in my test cluster. One of the unsolved issues is the occasional reporting of osd(s) going down and coming back up after about 20-30 seconds. This happens to various osds throughout the cluster. I have a small cluster of just 2 osd servers with 9 osds each.

The common trend I see week after week is that whenever there is long deep scrubbing activity on the cluster, it triggers one or more osds to go down for a short period. After an osd is marked down, it comes back up after about 20 seconds. Obviously a repair process then kicks in, which causes more load on the cluster. Looking at the logs, I've not seen osds being marked down when the cluster is not deep scrubbing. It _always_ happens during deep scrub activity. I am seeing reports of osds going down about 3-4 times a week.

The latest happened just recently with the following log entries:

2015-04-10 19:32:48.330430 mon.0 192.168.168.13:6789/0 3441533 : cluster [INF] pgmap v50849466: 8508 pgs: 8506 active+clean, 2 active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB / 50206 GB avail; 1005 B/s rd, 1005 B/s wr, 0 op/s
2015-04-10 19:32:52.950633 mon.0 192.168.168.13:6789/0 3441542 : cluster [INF] osd.6 192.168.168.200:6816/3738 failed (5 reports from 5 peers after 60.747890 >= grace 46.701350)
2015-04-10 19:32:53.121904 mon.0 192.168.168.13:6789/0 3441544 : cluster [INF] osdmap e74309: 18 osds: 17 up, 18 in
2015-04-10 19:32:53.231730 mon.0 192.168.168.13:6789/0 3441545 : cluster [INF] pgmap v50849467: 8508 pgs: 599 stale+active+clean, 7907 active+clean, 1 stale+active+clean+scrubbing+deep, 1 active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB / 50206 GB avail; 375 B/s rd, 0 op/s


osd.6 logs around the same time are:

2015-04-10 19:16:29.110617 7fad6d5ec700  0 log_channel(default) log [INF] : 5.3d7 deep-scrub ok
2015-04-10 19:27:47.561389 7fad6bde9700  0 log_channel(default) log [INF] : 5.276 deep-scrub ok
2015-04-10 19:31:11.611321 7fad6d5ec700  0 log_channel(default) log [INF] : 5.287 deep-scrub ok
2015-04-10 19:31:53.339881 7fad7ce0b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15
2015-04-10 19:31:53.339887 7fad7ce0b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15
2015-04-10 19:31:53.339890 7fad7ce0b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad705f2700' had timed out after 15
2015-04-10 19:31:53.340050 7fad7e60e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15
2015-04-10 19:31:53.340053 7fad7e60e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15

[.....]

2015-04-10 19:32:53.010609 7fad7e60e700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fad86132700' had timed out after 60
2015-04-10 19:32:53.010611 7fad7e60e700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fad88937700' had timed out after 60
2015-04-10 19:32:53.111470 7fad66ed2700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6837/4409 pipe(0x2b793b80 sd=179 :6817 s=2 pgs=5 cs=1 l=0 c=0x21e8b420).fault with nothing to send, going to standby
2015-04-10 19:32:53.111496 7fad6329d700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6827/4208 pipe(0x2b793600 sd=172 :6817 s=2 pgs=7 cs=1 l=0 c=0x1791ab00).fault with nothing to send, going to standby
2015-04-10 19:32:53.111463 7fad55bd0700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6822/3910 pipe(0x2cb55dc0 sd=262 :6817 s=2 pgs=8 cs=1 l=0 c=0xe7802c0).fault with nothing to send, going to standby
2015-04-10 19:32:53.121815 7fad6218c700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6807/3575 pipe(0x2cf8e080 sd=294 :6817 s=2 pgs=4 cs=1 l=0 c=0x138669a0).fault with nothing to send, going to standby
2015-04-10 19:32:53.121856 7fad67bdf700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6842/4442 pipe(0x2b792580 sd=190 :6817 s=2 pgs=9 cs=1 l=0 c=0x138922c0).fault with nothing to send, going to standby
2015-04-10 19:32:53.123545 7fad651bc700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6801/3053 pipe(0x15e538c0 sd=260 :6817 s=2 pgs=1 cs=1 l=0 c=0x16bf09a0).fault with nothing to send, going to standby
2015-04-10 19:32:53.128729 7fad53eb3700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6832/4257 pipe(0x37dcb80 sd=311 :6817 s=2 pgs=3 cs=1 l=0 c=0x1131f420).fault with nothing to send, going to standby
2015-04-10 19:32:53.132691 7fad53fb4700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6812/3674 pipe(0x37dbdc0 sd=417 :6817 s=2 pgs=3 cs=1 l=0 c=0x1c0badc0).fault with nothing to send, going to standby
2015-04-10 19:32:53.139856 7fad65cc7700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6817/3864 pipe(0x2cf8eb80 sd=365 :6817 s=2 pgs=5 cs=1 l=0 c=0xbd1f2c0).fault with nothing to send, going to standby
 

[ followed by about 500 or so lines which look similar to:]
2015-04-10 19:33:09.021252 7fad65cc7700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6842/4442 pipe(0x2b792580 sd=24 :45523 s=2 pgs=1721 cs=3371 l=0 c=0x138922c0).fault, initiating reconnect
2015-04-10 19:33:09.021280 7fad53eb3700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6807/3575 pipe(0x2cf8e080 sd=37 :51859 s=2 pgs=1755 cs=3449 l=0 c=0x138669a0).fault, initiating reconnect
2015-04-10 19:33:09.021443 7fad66ed2700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6812/3674 pipe(0x37dbdc0 sd=18 :46453 s=2 pgs=1782 cs=3489 l=0 c=0x1c0badc0).fault, initiating reconnect
2015-04-10 19:33:09.021702 7fad6218c700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6837/4409 pipe(0x2b793b80 sd=38 :53724 s=2 pgs=1757 cs=3449 l=0 c=0x21e8b420).fault, initiating reconnect


[followed by:]

2015-04-10 19:33:09.021830 7fad8e142700  0 log_channel(default) log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 94.284115 secs
2015-04-10 19:33:09.021834 7fad67bdf700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6817/3864 pipe(0x2cf8eb80 sd=29 :50115 s=2 pgs=1717 cs=3395 l=0 c=0xbd1f2c0).fault, initiating reconnect
2015-04-10 19:33:09.021840 7fad8e142700  0 log_channel(default) log [WRN] : slow request 62.902554 seconds old, received at 2015-04-10 19:32:06.119164: osd_op(client.64965167.0:1562644 rbd_data.1f264b2ae8944a.0000000000000400 [set-alloc-hint object_size 4194304 write_size 4194304,write 1155072~4096] 5.f4cd34bb ack+ondisk+write+known_if_redirected e74308) currently waiting for subops from 13
2015-04-10 19:33:09.021846 7fad8e142700  0 log_channel(default) log [WRN] : slow request 63.004012 seconds old, received at 2015-04-10 19:32:06.017706: osd_op(client.69637439.0:278581863 rbd_data.265f967a5f7514.0000000000064e2e [set-alloc-hint object_size 4194304 write_size 4194304,write 1040384~4096] 5.bf3d2652 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74308) currently waiting for subops from 11
2015-04-10 19:33:09.021851 7fad8e142700  0 log_channel(default) log [WRN] : slow request 94.284115 seconds old, received at 2015-04-10 19:31:34.737603: osd_sub_op(unknown.0.0:0 5.299 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) currently started
2015-04-10 19:33:09.021856 7fad8e142700  0 log_channel(default) log [WRN] : slow request 92.669022 seconds old, received at 2015-04-10 19:31:36.352696: osd_op(client.69645050.0:2750479 rbd_data.1b704a8f51a837.000000000000023d [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~4096] 5.44a257d7 snapc 9cc=[9cc,9c1,9b5,9a9,99d,991,984,972,953,2f8] ack+ondisk+write+known_if_redirected e74308) currently waiting for subops from 17
2015-04-10 19:33:09.021861 7fad8e142700  0 log_channel(default) log [WRN] : slow request 76.733358 seconds old, received at 2015-04-10 19:31:52.288360: osd_op(client.64987185.0:26635606 rbd_data.1b652a2ae8944a.0000000000000623 [set-alloc-hint object_size 4194304 write_size 4194304,write 1282048~4096] 5.64d42a36 ack+ondisk+write+known_if_redirected e74308) currently waiting for subops from 17



[and finally]

2015-04-10 19:33:09.024361 7fad5c83b700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6832/4257 pipe(0x37dcb80 sd=35 :55060 s=2 pgs=1719 cs=3359 l=0 c=0x1131f420).fault, initiating reconnect
2015-04-10 19:33:09.024418 7fad82616700  0 log_channel(default) log [WRN] : map e74309 wrongly marked me down
2015-04-10 19:33:09.024466 7fad53fb4700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6827/4208 pipe(0x2b793600 sd=33 :36954 s=2 pgs=1731 cs=3403 l=0 c=0x1791ab00).fault, initiating reconnect
2015-04-10 19:33:09.024476 7fad67bdf700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6817/3864 pipe(0x2cf8eb80 sd=24 :50133 s=2 pgs=1720 cs=3401 l=0 c=0xbd1f2c0).fault, initiating reconnect
2015-04-10 19:33:09.024553 7fad65cc7700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6842/4442 pipe(0x2b792580 sd=29 :45548 s=2 pgs=1725 cs=3379 l=0 c=0x138922c0).fault, initiating reconnect
2015-04-10 19:33:09.024629 7fad53eb3700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6807/3575 pipe(0x2cf8e080 sd=37 :51884 s=2 pgs=1759 cs=3457 l=0 c=0x138669a0).fault, initiating reconnect
2015-04-10 19:33:09.024783 7fad650bb700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6842/4442 pipe(0x2b792580 sd=29 :45548 s=1 pgs=1725 cs=3380 l=0 c=0x138922c0).fault
2015-04-10 19:33:09.025056 7fad6218c700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6837/4409 pipe(0x2b793b80 sd=38 :53747 s=2 pgs=1761 cs=3457 l=0 c=0x21e8b420).fault, initiating reconnect
2015-04-10 19:33:09.025062 7fad61580700  0 -- 192.168.168.200:6817/3738 >> 192.168.168.201:6817/3864 pipe(0x2cf8eb80 sd=24 :50140 s=4 pgs=1720 cs=3402 l=0 c=0xbd1f2c0).connect got RESETSESSION but no longer connecting
2015-04-10 19:33:16.999741 7fad5f964700  0 -- 192.168.168.200:6846/1003738 >> 192.168.168.200:6837/4266 pipe(0xa711580 sd=257 :6846 s=0 pgs=0 cs=0 l=0 c=0x1afbf840).accept connect_seq 0 vs existing 0 state connecting
2015-04-10 19:33:16.999883 7fad5f863700  0 -- 192.168.168.200:6846/1003738 >> 192.168.168.200:6827/3938 pipe(0x2c48a8c0 sd=260 :6846 s=0 pgs=0 cs=0 l=0 c=0x18a85580).accept connect_seq 0 vs existing 0 state connecting
2015-04-10 19:36:26.247402 7fad6bde9700  0 log_channel(default) log [INF] : 5.299 deep-scrub ok
2015-04-10 19:40:18.577423 7fad725f6700  0 <cls> cls/rgw/cls_rgw.cc:1458: gc_iterate_entries end_key=1_01428691218.577419000

2015-04-10 19:42:22.709291 7fad6edef700  0 log_channel(default) log [INF] : 5.2ae deep-scrub ok
2015-04-10 19:45:51.790224 7fad6bde9700  0 log_channel(default) log [INF] : 5.2bb deep-scrub ok
2015-04-10 19:48:17.022413 7fad5f863700  0 -- 192.168.168.200:6846/1003738 >> 192.168.168.200:6827/3938 pipe(0x2c48a8c0 sd=260 :6846 s=2 pgs=63 cs=1 l=0 c=0x3f81160).fault with nothing to send, going to standby




In the past I have been asked to verify that the networking is working okay. I've run days and days of various tests and verified that the network is not the cause of the problems. The network is running perfectly well, with no errors reported on the interfaces and no delays. I ran a multitude of TCP and UDP tests, including firing hundreds to thousands of connections per second for weeks, with 0% packet loss and no other network-related errors.

Does anyone have any idea what could be causing the osds to be marked down during the deep scrub activity?

I have enabled the following tuning options in the ceph.conf:

#change to decrease rebuild/recovery priority (should make guest vms more responsive) and decrease scrubbing priority
osd recovery max chunk = 8388608
osd recovery op priority = 1
osd max backfills = 1
osd recovery max active = 1
osd recovery threads = 1
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = idle

#reporting of down osds
mon osd min down reporters = 6
mon osd report timeout = 1800
mon osd min down reports = 7
osd heartbeat grace = 60



Many thanks for your help

Andrei




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






