Hello, That's a pretty small cluster all things considered, so your rather intensive test setup is likely to run into any or all of the following issues: 1) The amount of data you're moving around is going cause a lot of promotions from and to the cache tier. This is expensive and slow. 2) EC coded pools are slow. You may have actually better results with a "Ceph classic" approach, 2-4 HDDs per journal SSD. Also 6TB HDDs combined with EC may look nice to you from a cost/density prospect, but more HDDs means more IOPS and thus speed. 3) scrubbing (unless configured very aggressively down) will impact your performance on top of the items above. 4) You already noted the kernel versus userland bit. 5) Having all your storage in a single JBOD chassis strikes me as ill advised, though I don't think it's an actual bottleneck at 4x12Gb/s. When you ran the fio tests I assume nothing else was going on and the dataset size would have fit easily into the cache pool, right? Look at your nodes with atop or iostat, I venture all your HDDs are at 100%. Christian On Wed, 19 Aug 2015 10:29:28 +0200 Jacek Jarosiewicz wrote: > Hi, > > Our setup is this: > > 4 x OSD nodes: > E5-1630 CPU > 32 GB RAM > Mellanox MT27520 56Gbps network cards > SATA controller LSI Logic SAS3008 > Storage nodes are connected to SuperMicro chassis: 847E1C-R1K28JBOD > Each node has 2-3 spinning OSDs (6TB drives) and 2 ssd OSDs (240GB > drives) 3 monitors running on OSD nodes > ceph hammer 0.94.2 > Ubuntu 14.04 > cache tier with ecpool (3+1) > > We've ran some tests on the cluster and results were promising - speeds, > iops etc as expected, but now we tried to use more than one client for > testing and ran into some problems: > > We've created a couple rbd images and mapped them on clients (kernel > rbd) running two rsync processes and one dd on a large number of files > (~ 250 GB of maildirs rsync'ed from one rbd image to the other and a dd > process writing one big 2TB file on another rbd image) > > And the speeds now are less than OK, plus we have a lot requests blocked > warnings. We've left the processes to run over night but this morning I > came and none of the processes finished - they are able to write data, > but at a very, very slow rate. > > Please help me diagnose this problem, everything seems to work, just > very, very slow... when we ran the tests with fio (librbd engine) > everything seemd fine.. I know that kernel implementation is slower, but > is this normal? I can't understand why are so many requests blocked. > > Some diagnostic data: > > root@cf01:/var/log/ceph# ceph -s > cluster 3469081f-9852-4b6e-b7ed-900e77c48bb5 > health HEALTH_WARN > 31 requests are blocked > 32 sec > monmap e1: 3 mons at > {cf01=10.4.10.211:6789/0,cf02=10.4.10.212:6789/0,cf03=10.4.10.213:6789/0} > election epoch 202, quorum 0,1,2 cf01,cf02,cf03 > osdmap e1319: 18 osds: 18 up, 18 in > pgmap v933010: 2112 pgs, 19 pools, 10552 GB data, 2664 kobjects > 14379 GB used, 42812 GB / 57192 GB avail > 2111 active+clean > 1 active+clean+scrubbing > client io 0 B/s rd, 12896 kB/s wr, 35 op/s > > > root@cf01:/var/log/ceph# ceph health detail > HEALTH_WARN 23 requests are blocked > 32 sec; 6 osds have slow requests > 1 ops are blocked > 131.072 sec > 22 ops are blocked > 65.536 sec > 1 ops are blocked > 65.536 sec on osd.2 > 1 ops are blocked > 65.536 sec on osd.3 > 1 ops are blocked > 65.536 sec on osd.4 > 1 ops are blocked > 131.072 sec on osd.7 > 18 ops are blocked > 65.536 sec on osd.10 > 1 ops are blocked > 65.536 sec on osd.12 > 6 osds have slow requests > > > root@cf01:/var/log/ceph# grep WRN ceph.log | tail -50 > 2015-08-19 10:23:34.505669 osd.14 10.4.10.213:6810/21207 17942 : cluster > [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.575870 > secs 2015-08-19 10:23:34.505796 osd.14 10.4.10.213:6810/21207 17943 : > cluster [WRN] slow request 30.575870 seconds old, received at 2015-08-19 > 10:23:03.929722: osd_op(client.9203.1:22822591 > rbd_data.37e02ae8944a.00000000000180ca [set-alloc-hint object_size > 4194304 write_size 4194304,write 0~462848] 5.1c2aff5f ondisk+write > e1319) currently waiting for blocked object > 2015-08-19 10:23:34.505803 osd.14 10.4.10.213:6810/21207 17944 : cluster > [WRN] slow request 30.560009 seconds old, received at 2015-08-19 > 10:23:03.945583: osd_op(client.9203.1:22822592 > rbd_data.37e02ae8944a.00000000000180ca [set-alloc-hint object_size > 4194304 write_size 4194304,write 462848~524288] 5.1c2aff5f ondisk+write > e1319) currently waiting for blocked object > 2015-08-19 10:23:35.489927 osd.1 10.4.10.211:6812/9198 7921 : cluster > [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.112783 > secs 2015-08-19 10:23:35.490326 osd.1 10.4.10.211:6812/9198 7922 : > cluster [WRN] slow request 30.112783 seconds old, received at 2015-08-19 > 10:23:05.376339: osd_op(osd.14.1299:731492 > rbd_data.37e02ae8944a.0000000000017a90 [copy-from ver 61293] 4.5aa9fb69 > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:36.799861 osd.6 10.4.10.213:6806/22569 22663 : cluster > [WRN] 2 slow requests, 1 included below; oldest blocked for > 60.414800 > secs 2015-08-19 10:23:36.800018 osd.6 10.4.10.213:6806/22569 22664 : > cluster [WRN] slow request 60.414800 seconds old, received at 2015-08-19 > 10:22:36.384453: osd_op(osd.11.1299:705465 > rbd_data.37e02ae8944a.0000000000017cd2 [copy-from ver 61156] 4.c961fc05 > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:34.652970 osd.7 10.4.10.213:6808/22888 21331 : cluster > [WRN] 16 slow requests, 1 included below; oldest blocked for > 58.978274 > secs > 2015-08-19 10:23:34.653013 osd.7 10.4.10.213:6808/22888 21332 : cluster > [WRN] slow request 30.703792 seconds old, received at 2015-08-19 > 10:23:03.948535: osd_op(osd.14.1299:731491 > rbd_data.37e02ae8944a.00000000000180ca [copy-get max 8388608] 4.1c2aff5f > ack+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected > e1319) currently started > 2015-08-19 10:23:35.653883 osd.7 10.4.10.213:6808/22888 21333 : cluster > [WRN] 17 slow requests, 3 included below; oldest blocked for > 59.979456 > secs > 2015-08-19 10:23:35.654260 osd.7 10.4.10.213:6808/22888 21334 : cluster > [WRN] slow request 30.737485 seconds old, received at 2015-08-19 > 10:23:04.916024: MOSDECSubOpWrite(4.1aes1 1319 ECSubWrite(tid=508685, > reqid=osd.15.1301:577673, at_version=1319'5556, trim_to=1318'2500, > trim_rollback_to=1319'5554)) currently started > 2015-08-19 10:23:35.654268 osd.7 10.4.10.213:6808/22888 21335 : cluster > [WRN] slow request 30.533515 seconds old, received at 2015-08-19 > 10:23:05.119993: osd_op(osd.11.1299:705493 > rbd_data.37e02ae8944a.0000000000017dcc [copy-from ver 60974] 4.4c845496 > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:35.654272 osd.7 10.4.10.213:6808/22888 21336 : cluster > [WRN] slow request 30.255396 seconds old, received at 2015-08-19 > 10:23:05.398113: MOSDECSubOpWrite(4.169s1 1319 ECSubWrite(tid=614933, > reqid=osd.14.1299:731492, at_version=1319'5541, trim_to=1318'2500, > trim_rollback_to=1319'5538)) currently started > 2015-08-19 10:23:36.654759 osd.7 10.4.10.213:6808/22888 21337 : cluster > [WRN] 16 slow requests, 2 included below; oldest blocked for > 60.980631 > secs > 2015-08-19 10:23:36.654766 osd.7 10.4.10.213:6808/22888 21338 : cluster > [WRN] slow request 60.980631 seconds old, received at 2015-08-19 > 10:22:35.674053: MOSDECSubOpWrite(4.1a9s1 1319 ECSubWrite(tid=604142, > reqid=osd.16.1272:599035, at_version=1319'5715, trim_to=1318'2700, > trim_rollback_to=1319'5711)) currently started > 2015-08-19 10:23:36.654772 osd.7 10.4.10.213:6808/22888 21339 : cluster > [WRN] slow request 60.271394 seconds old, received at 2015-08-19 > 10:22:36.383289: MOSDECSubOpWrite(4.1a9s1 1319 ECSubWrite(tid=604143, > reqid=osd.16.1272:599036, at_version=1319'5716, trim_to=1318'2700, > trim_rollback_to=1319'5711)) currently started > 2015-08-19 10:23:37.654933 osd.7 10.4.10.213:6808/22888 21340 : cluster > [WRN] 15 slow requests, 1 included below; oldest blocked for > 61.980829 > secs > 2015-08-19 10:23:37.655100 osd.7 10.4.10.213:6808/22888 21341 : cluster > [WRN] slow request 60.225464 seconds old, received at 2015-08-19 > 10:22:37.429417: MOSDECSubOpWrite(4.5s3 1319 ECSubWrite(tid=563247, > reqid=osd.11.1299:705465, at_version=1319'5545, trim_to=1318'2500, > trim_rollback_to=1319'5539)) currently started > 2015-08-19 10:23:42.656604 osd.7 10.4.10.213:6808/22888 21342 : cluster > [WRN] 10 slow requests, 1 included below; oldest blocked for > 66.272631 > secs > 2015-08-19 10:23:42.657055 osd.7 10.4.10.213:6808/22888 21343 : cluster > [WRN] slow request 30.549952 seconds old, received at 2015-08-19 > 10:23:12.105968: osd_op(osd.14.1299:731499 > rbd_data.37e02ae8944a.00000000000177ee [copy-from ver 66450] 4.a71f73bb > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:35.164040 osd.2 10.4.10.211:6808/10427 7307 : cluster > [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.272625 > secs 2015-08-19 10:23:35.164208 osd.2 10.4.10.211:6808/10427 7308 : > cluster [WRN] slow request 30.272625 seconds old, received at 2015-08-19 > 10:23:04.890780: osd_op(osd.15.1301:577673 > rbd_data.37e02ae8944a.0000000000017b2c [copy-from ver 61313] 4.217af5ae > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:43.567637 osd.3 10.4.10.212:6804/31427 7180 : cluster > [WRN] 1 slow requests, 1 included below; oldest blocked for > 60.708292 > secs 2015-08-19 10:23:43.567645 osd.3 10.4.10.212:6804/31427 7181 : > cluster [WRN] slow request 60.708292 seconds old, received at 2015-08-19 > 10:22:42.859163: osd_op(osd.14.1299:731455 > rbd_data.37e02ae8944a.0000000000017f5c [copy-from ver 61192] 4.a6121781 > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:36.762303 osd.4 10.4.10.212:6806/31912 7895 : cluster > [WRN] 4 slow requests, 1 included below; oldest blocked for > 61.142907 > secs 2015-08-19 10:23:36.762685 osd.4 10.4.10.212:6806/31912 7896 : > cluster [WRN] slow request 60.400444 seconds old, received at 2015-08-19 > 10:22:36.360861: osd_op(osd.16.1272:599036 > rbd_data.37e02ae8944a.0000000000017bf5 [copy-from ver 63134] 4.95eb47a9 > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:43.765242 osd.4 10.4.10.212:6806/31912 7897 : cluster > [WRN] 4 slow requests, 1 included below; oldest blocked for > 67.403887 > secs 2015-08-19 10:23:43.765623 osd.4 10.4.10.212:6806/31912 7898 : > cluster [WRN] slow request 30.849907 seconds old, received at 2015-08-19 > 10:23:12.914840: osd_op(osd.16.1272:599065 > rbd_data.37e02ae8944a.00000000000180ae [copy-get max 8388608] 4.faca2da9 > ack+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected > e1319) currently started > 2015-08-19 10:23:45.493885 osd.1 10.4.10.211:6812/9198 7923 : cluster > [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.014812 > secs 2015-08-19 10:23:45.494246 osd.1 10.4.10.211:6812/9198 7924 : > cluster [WRN] slow request 30.014812 seconds old, received at 2015-08-19 > 10:23:15.478121: osd_op(osd.16.1272:599072 > rbd_data.37e02ae8944a.0000000000017b1c [copy-from ver 65567] 4.bead06e1 > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:46.803793 osd.6 10.4.10.213:6806/22569 22665 : cluster > [WRN] 1 slow requests, 1 included below; oldest blocked for > 60.664609 > secs 2015-08-19 10:23:46.803971 osd.6 10.4.10.213:6806/22569 22666 : > cluster [WRN] slow request 60.664609 seconds old, received at 2015-08-19 > 10:22:46.138538: osd_op(osd.11.1299:705468 > rbd_data.37e02ae8944a.0000000000017abe [copy-from ver 61142] 4.c0253e05 > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:43.657718 osd.7 10.4.10.213:6808/22888 21344 : cluster > [WRN] 12 slow requests, 3 included below; oldest blocked for > 67.273998 > secs > 2015-08-19 10:23:43.657725 osd.7 10.4.10.213:6808/22888 21345 : cluster > [WRN] slow request 30.728099 seconds old, received at 2015-08-19 > 10:23:12.929188: osd_op(osd.11.1299:705499 > rbd_data.37e02ae8944a.0000000000017d85 [copy-from ver 60964] 4.ba4f1896 > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:43.657887 osd.7 10.4.10.213:6808/22888 21346 : cluster > [WRN] slow request 30.675844 seconds old, received at 2015-08-19 > 10:23:12.981443: MOSDECSubOpRead(4.1a9s1 1319 ECSubRead(tid=604164, > to_read={faca2da9/rbd_data.37e02ae8944a.00000000000180ae/head//4=0,2797408,0}, > attrs_to_read=)) currently no flag points reached > 2015-08-19 10:23:43.657894 osd.7 10.4.10.213:6808/22888 21347 : cluster > [WRN] slow request 60.771125 seconds old, received at 2015-08-19 > 10:22:42.886162: MOSDECSubOpWrite(4.181s2 1319 ECSubWrite(tid=566649, > reqid=osd.14.1299:731455, at_version=1319'5518, trim_to=1318'2500, > trim_rollback_to=1319'5514)) currently started > 2015-08-19 10:23:44.658111 osd.7 10.4.10.213:6808/22888 21348 : cluster > [WRN] 13 slow requests, 2 included below; oldest blocked for > 68.274757 > secs > 2015-08-19 10:23:44.658118 osd.7 10.4.10.213:6808/22888 21349 : cluster > [WRN] slow request 30.969646 seconds old, received at 2015-08-19 > 10:23:13.688399: osd_op(osd.10.1297:558590 > rbd_data.37e02ae8944a.0000000000017b0e [copy-from ver 63188] 4.646dd0af > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently started > 2015-08-19 10:23:44.658122 osd.7 10.4.10.213:6808/22888 21350 : cluster > [WRN] slow request 30.969351 seconds old, received at 2015-08-19 > 10:23:13.688695: osd_op(osd.10.1297:558591 > rbd_data.37e02ae8944a.0000000000017973 [copy-from ver 63177] 4.fd21a2af > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently started > 2015-08-19 10:23:45.658546 osd.7 10.4.10.213:6808/22888 21351 : cluster > [WRN] 13 slow requests, 1 included below; oldest blocked for > 69.275060 > secs > 2015-08-19 10:23:45.658554 osd.7 10.4.10.213:6808/22888 21352 : cluster > [WRN] slow request 30.156082 seconds old, received at 2015-08-19 > 10:23:15.502267: MOSDECSubOpWrite(4.e1s2 1319 ECSubWrite(tid=614936, > reqid=osd.16.1272:599072, at_version=1319'5510, trim_to=1318'2500, > trim_rollback_to=1319'5508)) currently started > 2015-08-19 10:23:46.658900 osd.7 10.4.10.213:6808/22888 21353 : cluster > [WRN] 13 slow requests, 2 included below; oldest blocked for > 70.275453 > secs > 2015-08-19 10:23:46.658908 osd.7 10.4.10.213:6808/22888 21354 : cluster > [WRN] slow request 30.776570 seconds old, received at 2015-08-19 > 10:23:15.882172: osd_op(osd.11.1299:705504 > rbd_data.37e02ae8944a.0000000000018028 [copy-from ver 60995] 4.cf21fa96 > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently started > 2015-08-19 10:23:46.658913 osd.7 10.4.10.213:6808/22888 21355 : cluster > [WRN] slow request 60.273271 seconds old, received at 2015-08-19 > 10:22:46.385472: MOSDECSubOpWrite(4.5s3 1319 ECSubWrite(tid=563263, > reqid=osd.11.1299:705468, at_version=1319'5546, trim_to=1318'2500, > trim_rollback_to=1319'5539)) currently started > 2015-08-19 10:23:49.659552 osd.7 10.4.10.213:6808/22888 21356 : cluster > [WRN] 12 slow requests, 1 included below; oldest blocked for > 73.276082 > secs > 2015-08-19 10:23:49.659559 osd.7 10.4.10.213:6808/22888 21357 : cluster > [WRN] slow request 30.075904 seconds old, received at 2015-08-19 > 10:23:19.583468: osd_op(osd.12.1298:558271 > rbd_data.37e02ae8944a.0000000000017f03 [copy-from ver 62273] 4.aafb718f > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:51.660604 osd.7 10.4.10.213:6808/22888 21358 : cluster > [WRN] 11 slow requests, 1 included below; oldest blocked for > 65.274645 > secs > 2015-08-19 10:23:51.661052 osd.7 10.4.10.213:6808/22888 21359 : cluster > [WRN] slow request 30.271682 seconds old, received at 2015-08-19 > 10:23:21.388434: osd_op(osd.12.1298:558278 > rbd_data.37e02ae8944a.0000000000017b74 [copy-from ver 62260] 4.60d7e38f > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > 2015-08-19 10:23:53.768995 osd.4 10.4.10.212:6806/31912 7899 : cluster > [WRN] 2 slow requests, 1 included below; oldest blocked for > 60.660749 > secs 2015-08-19 10:23:53.769507 osd.4 10.4.10.212:6806/31912 7900 : > cluster [WRN] slow request 60.660749 seconds old, received at 2015-08-19 > 10:22:53.107765: osd_op(osd.16.1272:599047 > rbd_data.37e02ae8944a.000000000001803d [copy-from ver 63185] 4.a327fba9 > ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) > currently commit_sent > > root@cf01:/var/log/ceph# ceph osd tree > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY > -15 10.00000 root sata > -7 3.00000 disktype cf01sata > 0 1.00000 osd.0 up 1.00000 1.00000 > 1 1.00000 osd.1 up 1.00000 1.00000 > 2 1.00000 osd.2 up 1.00000 1.00000 > -9 2.00000 disktype cf02sata > 3 1.00000 osd.3 up 1.00000 1.00000 > 4 1.00000 osd.4 up 1.00000 1.00000 > -11 3.00000 disktype cf03sata > 5 1.00000 osd.5 up 1.00000 1.00000 > 6 1.00000 osd.6 up 1.00000 1.00000 > 7 1.00000 osd.7 up 1.00000 1.00000 > -13 2.00000 disktype cf04sata > 8 1.00000 osd.8 up 1.00000 1.00000 > 9 1.00000 osd.9 up 1.00000 1.00000 > -14 8.00000 root ssd > -6 2.00000 disktype cf01ssd > 10 1.00000 osd.10 up 1.00000 1.00000 > 11 1.00000 osd.11 up 1.00000 1.00000 > -8 2.00000 disktype cf02ssd > 12 1.00000 osd.12 up 1.00000 1.00000 > 13 1.00000 osd.13 up 1.00000 1.00000 > -10 2.00000 disktype cf03ssd > 14 1.00000 osd.14 up 1.00000 1.00000 > 15 1.00000 osd.15 up 1.00000 1.00000 > -12 2.00000 disktype cf04ssd > 17 1.00000 osd.17 up 1.00000 1.00000 > 16 1.00000 osd.16 up 1.00000 1.00000 > -1 18.00000 root default > -2 5.00000 host cf01 > -6 2.00000 disktype cf01ssd > 10 1.00000 osd.10 up 1.00000 1.00000 > 11 1.00000 osd.11 up 1.00000 1.00000 > -7 3.00000 disktype cf01sata > 0 1.00000 osd.0 up 1.00000 1.00000 > 1 1.00000 osd.1 up 1.00000 1.00000 > 2 1.00000 osd.2 up 1.00000 1.00000 > -3 4.00000 host cf02 > -8 2.00000 disktype cf02ssd > 12 1.00000 osd.12 up 1.00000 1.00000 > 13 1.00000 osd.13 up 1.00000 1.00000 > -9 2.00000 disktype cf02sata > 3 1.00000 osd.3 up 1.00000 1.00000 > 4 1.00000 osd.4 up 1.00000 1.00000 > -4 5.00000 host cf03 > -10 2.00000 disktype cf03ssd > 14 1.00000 osd.14 up 1.00000 1.00000 > 15 1.00000 osd.15 up 1.00000 1.00000 > -11 3.00000 disktype cf03sata > 5 1.00000 osd.5 up 1.00000 1.00000 > 6 1.00000 osd.6 up 1.00000 1.00000 > 7 1.00000 osd.7 up 1.00000 1.00000 > -5 4.00000 host cf04 > -12 2.00000 disktype cf04ssd > 17 1.00000 osd.17 up 1.00000 1.00000 > 16 1.00000 osd.16 up 1.00000 1.00000 > -13 2.00000 disktype cf04sata > 8 1.00000 osd.8 up 1.00000 1.00000 > 9 1.00000 osd.9 up 1.00000 1.00000 > > > Cheers, > J > -- Christian Balzer Network/Systems Engineer chibi@xxxxxxx Global OnLine Japan/Fusion Communications http://www.gol.com/ _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com