Ceph runs great then falters

On Aug 2, 2014, at 12:03 AM, Christian Balzer wrote:
> On Fri, 1 Aug 2014 14:23:28 -0400 Chris Kitzmiller wrote:
> 
>> I have 3 nodes each running a MON and 30 OSDs. 
> 
> Given the HW you list below, that might be a tall order, particularly
> CPU-wise in certain situations.

I'm not seeing any dramatic CPU usage on the OSD nodes. See: http://i.imgur.com/Q8Tyr4e.png?1
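
(That graph is aggregate CPU; for a per-daemon spot check something along these lines works, with the sort column being wherever your top prints %CPU:

	# batch-mode top, ceph-osd processes only, roughly sorted by %CPU
	top -b -n 1 | grep ceph-osd | sort -k9 -rn | head
	# or, with sysstat installed, 5-second per-PID samples for every ceph-osd
	pidstat -u -p $(pgrep -d, -x ceph-osd) 5 1
)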

> What is your OS running off, HDDs or SSDs?

In addition to the 6 journal SSDs there are another 2 SSDs (840 Pros) in a software RAID for the OS.

> The leveldbs, for the MONs in particular, are going to be very active and
> will need a lot of IOPS on top of the massive logging all these daemons
> will produce.
> If all of this isn't living on fast SSD(s) you are likely going to
> have problems.

That doesn't seem to be the issue. iostat shows utilization of the OS SSDs at less than 2%.
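
(If anyone wants to reproduce that check, it's just extended iostat on the OS devices; sda and sdb below are placeholders for whatever the 840 Pro mirror members actually are:

	# OS SSDs only, 10-second samples
	iostat -x sda sdb 10 2
)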

>> When I test my cluster
>> with either rados bench or with fio via a 10GbE client using RBD I get
>> great initial speeds (>900MBps) and I max out my 10GbE links for a while.
>> Then something goes wrong: performance falters and the cluster stops
>> responding altogether. I'll see a monitor call for a new election, then
>> my OSDs mark each other down and complain that they've been wrongly
>> marked down, and I get slow request warnings of >30 and >60 seconds.
>> This eventually resolves itself and the cluster recovers, but then it
>> recurs right away. Sometimes, via fio, I'll get an I/O error and it
>> will bail.
>> 
>> The amount of time for the cluster to start acting up varies. Sometimes
>> it is great for hours, sometimes it fails after 10 seconds. Nothing
>> significant shows up in dmesg. A snippet from ceph-osd.77.log (for
>> example) is at: http://pastebin.com/Zb92Ei7a
>> 
>> I'm not sure why I can run at full speed for a little while or what the
>> problem is when it stops working. Please help!
>> 
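
(For reference, the tests are basically rados bench writes plus fio against a mapped RBD from the 10GbE client. The invocations below are representative rather than exact, and the pool/image names are placeholders:

	# ~60s of 4MB object writes, 16 concurrent ops, against the rbd pool
	rados -p rbd bench 60 write -t 16
	# on the client: map an image (name is a placeholder) and run fio against it
	rbd map test
	fio --name=rbdtest --filename=/dev/rbd0 --rw=write --bs=4M \
	    --iodepth=16 --ioengine=libaio --direct=1 --runtime=60 --time_based
)
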
> Full speed for a moment at least is easy to explain: that would be the
> journals, which can go full blast until they have to write to the actual
> HDDs.
> 
> Monitor with atop or iostat what happens when the performance goes to
> hell: is it a particular OSD that causes this, and so forth.

When I'm watching with iostat I don't see anything stay pegged at 100%. My HDDs hit 100% here and there, but they don't stay there, and my journals don't max out. For example:

root@storage2:~# iostat -x sdc sdq sds sdac sdae sdaj 10 2 # Just journal drives
Linux 3.13.0-32-generic (storage2)      08/04/2014      _x86_64_        (24 CPU)

...

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.11    0.00   19.32    7.66    0.00   51.91

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00  530.60     0.00 179415.20   676.27     4.49    8.46    0.00    8.46   1.44  76.36
sdq               0.00     0.00    0.00  484.60     0.00 156640.00   646.47     3.62    7.45    0.00    7.45   1.47  71.44
sds               0.00     0.00    0.00  544.20     0.00 174396.00   640.93     3.92    7.21    0.00    7.21   1.41  76.76
sdac              0.00     0.00    0.00  443.10     0.00 143540.00   647.89     2.94    6.63    0.00    6.63   1.47  65.12
sdae              0.00     0.00    0.00  504.60     0.00 169387.60   671.37     4.21    8.34    0.00    8.34   1.46  73.76
sdaj              0.00     0.00    0.00  478.50     0.00 155754.40   651.01     3.21    6.71    0.00    6.71   1.48  70.76

That's normal "good" usage. When things go bad everything drops to 0%.
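
(When it's wedged like that, the cluster's own view of which OSDs are blocking can be pulled with e.g.:

	# which requests are blocked, and on which OSDs
	ceph health detail | grep -i blocked
	# recent slow ops on a suspect OSD via its admin socket
	# (osd.77 only because that's the log I pasted; the socket path is the default one)
	ceph --admin-daemon /var/run/ceph/ceph-osd.77.asok dump_historic_ops
)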

>> My nodes:
>> 	Ubuntu 14.04 - Linux storage3 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>> 	ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
>> 	2 x 6-core Xeon 2620s
>> 	64GB RAM
>> 	30 x 3TB Seagate ST3000DM001-1CH166
> 
> These are particularly nasty pieces of shit, at least depending on the
> firmware.
> Some models/firmware revisions will constantly do load cycles, caused by
> an APM setting that cannot be permanently disabled, thus not only
> exceeding the max load cycle count in a fraction of the expected lifetime
> of these disks but also impacting performance of the drive when it
> happens, up to the point of at least temporarily freezing them.
> Which would nicely explain what you're seeing.
> 
> A "smartctl -a" output from one of these would be interesting.

Available here: http://pastebin.com/URt4eV4Z
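
(For the load-cycle angle specifically, the relevant bits on one of these drives can be checked with something like the following, /dev/sdb being a placeholder:

	# SMART attribute 193 (Load_Cycle_Count) is the one that climbs with aggressive head parking
	smartctl -A /dev/sdb | egrep 'Load_Cycle_Count|Power_On_Hours'
	# query the current APM level; 254/255 means little or no head parking
	hdparm -B /dev/sdb
)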

>> 	6 x 128GB Samsung 840 Pro SSD
>> 	1 x Dual port Broadcom NetXtreme II 5771x/578xx 10GbE

I forgot to mention that this is all on a Supermicro board. Other important info from dmesg:

	DMI: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
	2 x LSISAS2116: FWVersion(15.00.00.00), ChipRevision(0x02), BiosVersion(07.33.00.00)
	1 x LSISAS2308: FWVersion(17.00.01.00), ChipRevision(0x05), BiosVersion(07.33.00.00)

This is all in a Supermicro SC847-A chassis.

