Mark, Here is my setup, FYI. ===Hardware configuration=== 2 Ceph nodes, each with 32G memory, 8 1TB 7200rpm SATA HDD, 2 400GB SSD for journal, and another 4 400GB SSD for cache tier 1 Ceph client, with 128G memory 20 VMs running on the ceph client, each with 512MB memory 10Gb public network and 10Gb cluster network ===Software configuration=== OSD tree ceph osd tree # id weight type name up/down reweight -6 8 root ssd -4 4 host zqw-s1-ssd 16 1 osd.16 up 1 17 1 osd.17 up 1 18 1 osd.18 up 1 19 1 osd.19 up 1 -5 4 host zqw-s2-ssd 20 1 osd.20 up 1 21 1 osd.21 up 1 22 1 osd.22 up 1 23 1 osd.23 up 1 -1 14.56 root default -2 7.28 host zqw-s1 0 0.91 osd.0 up 1 1 0.91 osd.1 up 1 2 0.91 osd.2 up 1 3 0.91 osd.3 up 1 4 0.91 osd.4 up 1 5 0.91 osd.5 up 1 6 0.91 osd.6 up 1 7 0.91 osd.7 up 1 -3 7.28 host zqw-s2 8 0.91 osd.8 up 1 9 0.91 osd.9 up 1 10 0.91 osd.10 up 1 11 0.91 osd.11 up 1 12 0.91 osd.12 up 1 13 0.91 osd.13 up 1 14 0.91 osd.14 up 1 15 0.91 osd.15 up 1 Ceph version is 0.80.1 Replica size: 2 Base pool: 16 HDD OSDs (osd.0 to osd.15), 4 SSD are used for journal Cache tier: 8 SSD OSDs (osd.16 to osd.23), journal is on the same SSD ===Test setup=== Use QemuRBD as the storage interface Run fio on the 20 VMs to generate 4 IO patterns: 64k sequential read/write and 4k random read/write Also tried to run 10 VMs, have the same issue -----Original Message----- From: Mark Nelson [mailto:mark.nelson@xxxxxxxxxxx] Sent: Tuesday, May 27, 2014 7:56 PM To: Wang, Zhiqiang Cc: ceph-devel@xxxxxxxxxxxxxxx Subject: RE: Slow request on cache tiering Very interesting! If you hit suicide timeout on the osds, it would be useful to enable and examine the core dump to see what is going on. I was repeatably hitting this during heavy writes until modifying swappiness. I was at the default 60 though. Can you give me any more details about the test and system setup? On May 27, 2014 2:57 AM, "Wang, Zhiqiang" <zhiqiang.wang@xxxxxxxxx> wrote: Looks like the workaround doesn't work in my runs. I noticed that in my previous testing, the value of 'swappiness' is set to 0. However, I still set it to 10 and did some other runs. The heartbeat failure can be seen again. Since I always see this issue happening together with the 'slow request' issue, I suspect they are actually caused by the same problem. -----Original Message----- From: Wang, Zhiqiang Sent: Tuesday, May 27, 2014 11:06 AM To: 'Mark Nelson' Cc: ceph-devel@xxxxxxxxxxxxxxx Subject: RE: Slow request on cache tiering Thanks Mark! I agree with you that the slow request is caused by that the cache is full and needs to do flush. In my testing, a base pool with SSD journals also performs good than that of a base pool with a cache tier. For the heartbeat failure issue, I'll try your workaround to see. -----Original Message----- From: Mark Nelson [mailto:mark.nelson@xxxxxxxxxxx] Sent: Monday, May 26, 2014 9:18 PM To: Wang, Zhiqiang Cc: ceph-devel@xxxxxxxxxxxxxxx Subject: Re: Slow request on cache tiering On 05/26/2014 07:05 AM, Wang, Zhiqiang wrote: > Hi all, Hi Zhigiang! The problem you describe below is something we've seen internally at Inktank as well. I've got a couple of comments / suggestions. > > I'm experiencing a problem when running ceph with cache tiering. Can someone help to take a look? Any comments are appreciated, thx! > > ===Configuration=== > I have 2 pools, one is 'rbd', the other is 'cache'. Pool 'cache' is the cache tier of pool 'rbd'. The 'rbd' pool has 16 HDD OSDs. And the 'cache' pool has 8 SSD OSDs. > Here is the OSD tree: > # id weight type name up/down reweight > -6 8 root ssd > -4 4 host zqw-s1-ssd > 16 1 osd.16 up 1 > 17 1 osd.17 up 1 > 18 1 osd.18 up 1 > 19 1 osd.19 up 1 > -5 4 host zqw-s2-ssd > 20 1 osd.20 up 1 > 21 1 osd.21 up 1 > 22 1 osd.22 up 1 > 23 1 osd.23 up 1 > -1 14.56 root default > -2 7.28 host zqw-s1 > 0 0.91 osd.0 up 1 > 1 0.91 osd.1 up 1 > 2 0.91 osd.2 up 1 > 3 0.91 osd.3 up 1 > 4 0.91 osd.4 up 1 > 5 0.91 osd.5 up 1 > 6 0.91 osd.6 up 1 > 7 0.91 osd.7 up 1 > -3 7.28 host zqw-s2 > 8 0.91 osd.8 up 1 > 9 0.91 osd.9 up 1 > 10 0.91 osd.10 up 1 > 11 0.91 osd.11 up 1 > 12 0.91 osd.12 up 1 > 13 0.91 osd.13 up 1 > 14 0.91 osd.14 up 1 > 15 0.91 osd.15 up 1 > > ===Problem=== > After running IO for a while, I start to see the 'slow request' warnings in the ceph log: > > 2014-05-26 14:53:23.606138 osd.20 10.44.44.6:6823/61598 14 : [WRN] 6 > slow requests, 6 included below; oldest blocked for > 30.921077 secs > 2014-05-26 14:53:23.606149 osd.20 10.44.44.6:6823/61598 15 : [WRN] > slow request 30.921077 seconds old, received at 2014-05-26 14:52:52.684957: osd_op(client.42161.0:10162 rb.0.93ff.238e1f29.00000000028f [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] 22.7ec98e46 ack+ondisk+write e16745) v4 currently waiting for subops from 16 ... > 2014-05-26 14:53:23.483175 osd.21 10.44.44.6:6825/61597 37 : [WRN] 6 > slow requests, 6 included below; oldest blocked for > 30.664500 secs > 2014-05-26 14:53:23.483189 osd.21 10.44.44.6:6825/61597 38 : [WRN] > slow request 30.664500 seconds old, received at 2014-05-26 14:52:52.799480: osd_op(client.42177.0:10020 rb.0.93f2.238e1f29.000000000239 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] 22.a3cdd2df ack+ondisk+write e16745) v4 currently waiting for subops from 16 ... > 2014-05-26 14:53:23.483164 osd.23 10.44.44.6:6816/61595 36 : [WRN] 6 > slow requests, 6 included below; oldest blocked for > 30.740852 secs > 2014-05-26 14:53:23.483178 osd.23 10.44.44.6:6816/61595 37 : [WRN] > slow request 30.740852 seconds old, received at 2014-05-26 14:52:52.737293: osd_op(client.42171.0:9359 rb.0.93fd.238e1f29.00000000022b [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] 22.5e9da689 ack+ondisk+write e16745) v4 currently waiting for subops from 16 ... > 2014-05-26 14:53:23.808628 osd.22 10.44.44.6:6813/61596 24 : [WRN] 4 > slow requests, 4 included below; oldest blocked for > 30.571487 secs > 2014-05-26 14:53:23.808638 osd.22 10.44.44.6:6813/61596 25 : [WRN] > slow request 30.571487 seconds old, received at 2014-05-26 14:52:53.237048: osd_op(client.42174.0:9872 rb.0.93f7.238e1f29.00000000023f [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] 22.8d495fcc ack+ondisk+write e16745) v4 currently waiting for subops from 16 ... > 2014-05-26 14:53:23.553157 osd.16 10.44.44.5:6855/72950 62 : [WRN] 6 > slow requests, 6 included below; oldest blocked for > 30.930690 secs > 2014-05-26 14:53:23.553168 osd.16 10.44.44.5:6855/72950 63 : [WRN] > slow srequest 30.930690 seconds old, received at 2014-05-26 14:52:52.586582: osd_op(client.42175.0:9650 rb.0.93f1.2ae8944a.000000000236 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] 22.3dec4f3b ack+ondisk+write e16745) v4 currently waiting for subops from 20 ... > 2014-05-26 14:53:35.553099 mon.0 10.44.44.5:6789/0 1348 : [INF] osd.16 > 10.44.44.5:6855/72950 failed (6 reports from 4 peers after 27.877781 > >= grace 26.858363) > 2014-05-26 14:53:35.871891 mon.0 10.44.44.5:6789/0 1350 : [INF] osdmap > e16746: 24 osds: 23 up, 24 in ... > 2014-05-26 14:53:38.168716 osd.16 10.44.44.5:6855/72950 152 : [WRN] > map e16747 wrongly marked me down I suspect you are hitting this as soon as the cache is full and now have to flush while doing writes. One of the unfortunate things about the cache teiring right now is that there is a lot of overhead, both because you are now doing extra reads/writes to move data in and out of the cache tier (likely over the network!), but also because each cache tier OSD has to do it's own journalling and replication. The effect of all of this is that you really want to make sure you have a high ratio of hot/cold data. Otherwise, using the same SSDs for journal devices for the OSDs will have a greater impact on performance. FWIW, I just had a number of performance tests finish using both random and skewed read/write distributions in 3 scenarios: only a base pool, a base pool with an SSD cache tier, and a base pool with SSD journals. The SSD journal configuration did better than the cache tier, even when the data distribution with skewed (zipf distribution, theta = 1.2) > > ===Analysis=== > From the above log, it seems something went wrong on OSD 16. OSD 16 is on server 'zqw-s1'. The other 4 SSD OSDs (20, 21, 22, 23) on the other server 'zqw-s2' are waiting for operations on OSD 16. Finally the monitor thinks the OSD 16 failed and marked it down. However, OSD 16 didn't think so. > > Look at the log of OSD 20, 21, 22 and 23, around the time when the 'slow request' happened, these OSDs didn't get reply from OSD 16 during heartbeat check: > 2014-05-26 14:53:27.606927 7f163d0c8700 -1 osd.20 16745 > heartbeat_check: no reply from osd.16 since back 2014-05-26 > 14:53:07.369988 front 2014-05-26 14:53:07.369988 (cutoff 2014-05-26 > 14:53:07.606925) > > And from the log of OSD 16, around the time of the 'slow request', there are lots of logs like this: > 2014-05-26 14:53:23.772145 7fb39e35a700 1 heartbeat_map is_healthy > 'OSD::op_tp thread 0x7fb39a352700' had timed out after 15 See http://tracker.ceph.com/issues/8301 I suspect if what you are seeing is what I've seen, you are hitting major page faults in the VM layer and it's slowing the op threads down enough that you are seeing those messages. You may even hit suicide timeouts. I typically see this during writes. Try setting swappiness to 10 and see if that helps. For what it's worth, I first saw this during tiering tests but was able to invoke it in normal operation as well. > > ===Questions=== > 1. From the simple analysis above, looks like the slow request happens when doing PG replicating. The heartbeat also fails. Is it possibly because the cluster network is too busy? I'm currently using 10Gb public network and 10Gb cluster network. If it is true, why I don't see slow request caused by other OSDs at the same time? > 2. I run into this problem quite often when running ceph with cache tiering. But I don't see this problem when running without it. Is it possible that OSD 16 stuck at some cache tiering specific code? > 3. Any other reasons may cause the 'slow request' like this? > 4. I noticed from 'iostat' that when the 'slow request' happened, the cluster halted the IO for 60s. Is this true? > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo > info at http://vger.kernel.org/majordomo-info.html > ��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f