On Aug 5, 2014, at 12:43 PM, Mark Nelson wrote: > On 08/05/2014 08:42 AM, Mariusz Gronczewski wrote: >> On Mon, 04 Aug 2014 15:32:50 -0500, Mark Nelson <mark.nelson at inktank.com> wrote: >>> On 08/04/2014 03:28 PM, Chris Kitzmiller wrote: >>>> On Aug 1, 2014, at 1:31 PM, Mariusz Gronczewski wrote: >>>>> I got weird stalling during writes, sometimes I got same write speed >>>>> for few minutes and after some time it starts stalling with 0 MB/s for >>>>> minutes >>>> >>>> I'm getting very similar behavior on my cluster. My writes start well but then just kinda stop for a while and then bump along slowly until the bench finishes. I've got a thread about it going here called "Ceph runs great then falters". >>> >>> This kind of behaviour often results when the journal can write much >>> faster than the OSD data disks. Initially the journals will be able to >>> absorb the data and things will run along well, but eventually ceph will >>> need to stall writes if things get too out of sync. You may want to >>> take a look at what's happening on the data disks during your tests to >>> see if there's anything that looks suspect. Checking the admin socket >>> for dump_historic_ops might provide some clues as well. >>> >>> Mark >> >> I did check journals already, they are on same disk as data (separate >> partition) and during stalls there is no traffic to both of them (like >> 8 iops on average with 0% io wait). > > This may indicate that 1 OSD could be backing up with possibly most if not all IOs waiting on it. The idea here is that because data placement is deterministic, if 1 OSD is slow, over time just by random chance all outstanding client operations will back up on it. Having more concurrency gives you more wiggle room but may not ultimately solve it. > > It's also possible that something else may be causing the OSDs to wait. dump_historic_ops might help. This turns out to have been my problem. Monitoring my cluster with atop (thanks, Christian Balzer) during one of these incidents found that a single HDD (out of 90) was pegged to 100% utilization. I replaced the drive and have since written over 20TB of data to my RBD device without issue. I'm not sure I fully understand what's going on when this happens but it is pretty clear that it isn't happening any more. It would be great to have some sort of warning to say that the load on a single disk is disproportionate to the rest of the cluster. >> I'm 99.99% sure they are ok as they worked few months in cassandra >> cluster before, and when I was doing some ceph rebalancing and >> adding/removing OSDs they worked constantly with no stalls >> >> If I look into logs of osds that have slow ops I get something like that: >> >> 2014-08-05 15:31:33.461481 7fbff4fd3700 0 log [WRN] : slow request 30.147566 seconds old, received at 2014-08-05 15:31:03.313858: osd_op(client.190830.0:176 benchmark_data_blade103.non.3dart.com_31565_object175 [write 0~4194304] 7.16da2754 ack+ondisk+write e864) v4 currently waiting for subops from 2,6 >> 2014-08-05 15:32:03.467775 7fbff4fd3700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 60.153871 secs >> 2014-08-05 15:32:03.467794 7fbff4fd3700 0 log [WRN] : slow request 60.153871 seconds old, received at 2014-08-05 15:31:03.313858: osd_op(client.190830.0:176 benchmark_data_blade103.non.3dart.com_31565_object175 [write 0~4194304] 7.16da2754 ack+ondisk+write e864) v4 currently waiting for subops from 2,6 >> 2014-08-05 15:33:03.481163 7fbff4fd3700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 120.167272 secs >> 2014-08-05 15:33:03.481170 7fbff4fd3700 0 log [WRN] : slow request 120.167272 seconds old, received at 2014-08-05 15:31:03.313858: osd_op(client.190830.0:176 benchmark_data_blade103.non.3dart.com_31565_object175 [write 0~4194304] 7.16da2754 ack+ondisk+write e864) v4 currently waiting for subops from 2,6 >> >> but when I look on osd.2 I only get cryptic messages like: >> >> 2014-08-05 14:39:34.708788 7ff5d36f4700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.24:6800/3540 pipe(0x54ada00 sd=157 :6800 s=2 pgs=26 cs=5 l=0 c=0x5409340).fault with nothing to send, going to standby >> 2014-08-05 14:39:35.594447 7ff5d0ecc700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.25:6800/3551 pipe(0x41fd500 sd=141 :60790 s=2 pgs=21 cs=5 l=0 c=0x51bd960).fault with nothing to send, going to standby >> 2014-08-05 14:39:37.594901 7ff5d40fe700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.25:6802/3709 pipe(0x54adc80 sd=149 :35693 s=2 pgs=24 cs=5 l=0 c=0x51bee00).fault with nothing to send, going to standby >> 2014-08-05 14:39:53.891172 7ff5d15d3700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.24:6802/3694 pipe(0x54a8500 sd=137 :60823 s=2 pgs=24 cs=5 l=0 c=0x5409080).fault with nothing to send, going to standby >> 2014-08-05 14:40:01.410307 7ff5d4905700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.23:6802/3689 pipe(0x41f9180 sd=133 :39288 s=2 pgs=28 cs=5 l=0 c=0x4956f60).fault with nothing to send, going to standby >> 2014-08-05 14:40:02.428402 7ff5d1ddb700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.23:6800/3535 pipe(0x41fee00 sd=145 :34278 s=2 pgs=23 cs=5 l=0 c=0x51bad60).fault with nothing to send, going to standby >> 2014-08-05 15:30:50.752795 7ff5d29e7700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.25:6800/3551 pipe(0x54ad780 sd=153 :6800 s=0 pgs=0 cs=0 l=0 c=0x51ba260).accept connect_seq 6 vs existing 5 state standby >> 2014-08-05 15:30:51.475488 7ff5d2cea700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.25:6802/3709 pipe(0x6002080 sd=158 :6800 s=0 pgs=0 cs=0 l=0 c=0x51b94a0).accept connect_seq 6 vs existing 5 state standby >> 2014-08-05 15:30:51.600661 7ff5d32f0700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.24:6802/3694 pipe(0x6001e00 sd=149 :6800 s=0 pgs=0 cs=0 l=0 c=0x4955280).accept connect_seq 6 vs existing 5 state standby >> 2014-08-05 15:30:51.901849 7ff5d18d6700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.23:6800/3535 pipe(0x6001180 sd=22 :6800 s=0 pgs=0 cs=0 l=0 c=0x4956720).accept connect_seq 6 vs existing 5 state standby >> 2014-08-05 15:30:54.529900 7ff5d0cca700 0 -- 10.100.245.22:6800/3540 >> 10.100.245.23:6802/3689 pipe(0x54ae400 sd=42 :6800 s=0 pgs=0 cs=0 l=0 c=0x51b91e0).accept connect_seq 6 vs existing 5 state standby >> >> >> and similiar on osd.2: >> >> 2014-08-05 14:39:13.606907 7f34d2140700 0 -- 10.100.245.24:6800/3540 >> 10.100.245.21:6802/3686 pipe(0x4c4a800 sd=183 :6800 s=2 pgs=32 cs=5 l=0 c=0x3848580).fault with nothing to send, going to standby >> 2014-08-05 14:39:34.709150 7f34d3958700 0 -- 10.100.245.24:6800/3540 >> 10.100.245.22:6800/3540 pipe(0x39f9b80 sd=166 :43213 s=2 pgs=27 cs=5 l=0 c=0x38494a0).fault with nothing to send, going to standby >> 2014-08-05 14:39:38.693139 7f34cee0e700 0 -- 10.100.245.24:6800/3540 >> 10.100.245.25:6800/3551 pipe(0x3291b80 sd=167 :51687 s=2 pgs=23 cs=5 l=0 c=0x384a680).fault with nothing to send, going to standby >> 2014-08-05 14:39:38.712965 7f34d2b4a700 0 -- 10.100.245.24:6800/3540 >> 10.100.245.21:6800/3532 pipe(0x4c4ad00 sd=195 :6800 s=2 pgs=31 cs=5 l=0 c=0x3848160).fault with nothing to send, going to standby >> 2014-08-05 14:39:42.694641 7f34d4961700 0 -- 10.100.245.24:6800/3540 >> 10.100.245.25:6802/3709 pipe(0x4c4aa80 sd=179 :6800 s=2 pgs=21 cs=5 l=0 c=0x3848b00).fault with nothing to send, going to standby >> 2014-08-05 14:39:42.712340 7f34d3b5a700 0 -- 10.100.245.24:6800/3540 >> 10.100.245.22:6802/3694 pipe(0x39f9680 sd=33 :47495 s=2 pgs=31 cs=5 l=0 c=0x384ee00).fault with nothing to send, going to standby >> 2014-08-05 14:39:44.493155 7f34d3453700 0 -- 10.100.245.24:6800/3540 >> 10.100.245.23:6800/3535 pipe(0x4c48c80 sd=202 :6800 s=2 pgs=29 cs=5 l=0 c=0x3848f20).fault with nothing to send, going to standby >> 2014-08-05 14:39:59.387037 7f34d4c64700 0 -- 10.100.245.24:6800/3540 >> 10.100.245.23:6802/3689 pipe(0x4c49b80 sd=34 :6800 s=2 pgs=25 cs=5 l=0 c=0x3849600).fault with nothing to send, going to standby >> 2014-08-05 15:30:51.911827 7f34cec0c700 0 -- 10.100.245.24:6800/3540 >> 10.100.245.22:6802/3694 pipe(0x39fc880 sd=116 :6800 s=0 pgs=0 cs=0 l=0 c=0x3b714a0).accept connect_seq 6 vs existing 5 state standby >> 2014-08-05 15:30:52.276484 7f34d2241700 0 -- 10.100.245.24:6800/3540 >> 10.100.245.22:6800/3540 pipe(0x39fc100 sd=118 :6800 s=0 pgs=0 cs=0 l=0 c=0x3b73b20).accept connect_seq 6 vs existing 5 state standby >> >> and 'bad' osds change when I rerun benchmark so it's not like there is one or 2 bad nodes >> >> >> I've checked for network or IO load on every node and they are just not doing anything, no kernel errors, and those nodes worked fine under load when they were us > > I'm guessing the issue is probably going to be more subtle than that unfortunately. At least based on prior issues, it seems like often something is causing latency in some part of the system and when that happens it can have very far-reaching effects. I've often wished for some sort of bottleneck finder for ceph. An easy way for the system to say where it is experiencing critical latencies e.g. network, journals, osd data disks, etc. This would assist troubleshooting and initial deployments immensely.