Ceph runs great then falters

On Mon, 4 Aug 2014 15:11:39 -0400 Chris Kitzmiller wrote:

> On Aug 2, 2014, at 12:03 AM, Christian Balzer wrote:
> > On Fri, 1 Aug 2014 14:23:28 -0400 Chris Kitzmiller wrote:
> > 
> >> I have 3 nodes each running a MON and 30 OSDs. 
> > 
> > Given the HW you list below, that might be a tall order, particularly
> > CPU-wise in certain situations.
> 
> I'm not seeing any dramatic CPU usage on the OSD nodes. See:
> http://i.imgur.com/Q8Tyr4e.png?1
> 
I wasn't suggesting it would be an issue during your current tests, which I
presume use large block sizes (rados bench defaults to 4MB).

You should see quite some CPU usage (at least initially, while the journals
are still absorbing everything) when testing with fio and small 4KB blocks.
But since your CPUs have HT, that gives you effectively 24 logical cores, so
you're probably good.
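
For reference, a small-block run could look something like this (just a
sketch; the image name and the /dev/rbd0 path are assumptions, adjust to
however you map the RBD on your client):

    # map a test image on the client (pool/image names are examples)
    rbd map rbd/test --id admin
    # 4KB random writes against the mapped device, bypassing the page cache
    fio --name=4k-randwrite --filename=/dev/rbd0 --rw=randwrite --bs=4k \
        --iodepth=32 --direct=1 --runtime=60 --time_based

That should push the OSD daemons a lot harder CPU-wise than 4MB rados bench
objects do.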

> > What is your OS running off, HDDs or SSDs?
> 
> In addition to the 6 journal SSDs there are another 2 SSDs (840 Pros) in
> a software RAID for the OS.
> 
Perfect.

> > The leveldbs, for the MONs in particular, are going to be very active
> > and will need a lot of IOPS on top of the massive logging all these
> > daemons will produce.
> > If all of this isn't living on fast SSD(s) you are likely going to
> > have problems.
> 
> That doesn't seem to be the issue. iostat shows utilization of the OS
> SSDs at less than 2%.
> 
With SSDs you should be golden. Still, keep an eye on them when the cluster
gets really busy (small-block fio, multiple clients, etc.).

I have tested things with a MON (the primary, leader one) on normal 500GB
SATA HDDs and nothing else. 
Everything works fine in normal situations, but running bonnie++ on those
disks (RAID1) breaks it.
The same setup with a HW caching controller or with SSDs works fine.
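
If you want to reproduce that kind of load yourself, something along these
lines should do (the path is just an example; point -d at a scratch
directory on the same disks that hold the MON store):

    # -s is the test file size in MiB, -n 0 skips the small-file phase
    bonnie++ -d /srv/mon-disk-scratch -s 16384 -n 0 -u root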

> >> When I test my cluster
> >> with either rados bench or with fio via a 10GbE client using RBD I get
> >> great initial speeds >900MBps and I max out my 10GbE links for a
> >> while. Then something goes wrong, performance falters, and the
> >> cluster stops responding altogether. I'll see a monitor call for a
> >> new election and then my OSDs mark each other down, they complain
> >> that they've been wrongly marked down, and I get slow request
> >> warnings of >30 and >60 seconds. This eventually resolves itself and
> >> the cluster recovers, but it then recurs again right away. Sometimes,
> >> via fio, I'll get an I/O error and it will bail.
> >> 
> >> The amount of time for the cluster to start acting up varies.
> >> Sometimes it is great for hours, sometimes it fails after 10 seconds.
> >> Nothing significant shows up in dmesg. A snippet from ceph-osd.77.log
> >> (for example) is at: http://pastebin.com/Zb92Ei7a
> >> 
> >> I'm not sure why I can run at full speed for a little while or what
> >> the problem is when it stops working. Please help!
> >> 
> > Full speed for a moment at least is easy to explain: that would be the
> > journals, which can go full blast until they have to flush to the
> > actual HDDs.
> > 
> > Monitor with atop or iostat what happens when the performance goes to
> > hell, is it a particular OSD that causes this and so forth.
> 
> When I'm watching with iostat I don't see anything get pegged to 100%.
> My HDDs go to 100% here and there but they don't stay there and my
> journals don't max out. e.g.:
> 
I wouldn't expect your journals to max out. ^o^

I'd suggest using atop, probably best with a bit higher sampling rate than
its normal 10 seconds. 
That way you can also review/replay things later, instead of having to
watch 3 huge (38 disks!) terminal windows in parallel. 
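
Roughly like this (interval and file name are just examples):

    # record with a 2-second interval while the benchmark runs
    atop -w /tmp/atop-storage2.raw 2
    # replay afterwards; 't' steps forward, 'T' steps backward through samples
    atop -r /tmp/atop-storage2.raw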

My suspicion is that some OSD HDD (it only needs to be one) goes to 100% and
blocks all the rest.
It doesn't need to be the same one each time, nor does it need to stay on
that OSD; if another one goes to 100% the blockage will continue.

I assume your PG_NUM and PGP_NUM are set appropriately (at least 4096, I
personally would go for 8192) for this cluster to avoid clumping and thus
hot spots?
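
A quick way to check and, if needed, bump them (the pool name "rbd" is an
assumption, substitute whatever pool you benchmark against; note pg_num can
only be increased, never decreased):

    ceph osd pool get rbd pg_num
    ceph osd pool get rbd pgp_num
    # rule of thumb: (90 OSDs * 100) / 3 replicas = 3000, round up to 4096
    ceph osd pool set rbd pg_num 4096
    ceph osd pool set rbd pgp_num 4096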

> root@storage2:~# iostat -x sdc sdq sds sdac sdae sdaj 10 2   # just the journal drives
> Linux 3.13.0-32-generic (storage2)      08/04/2014      _x86_64_        (24 CPU)
> 
> ...
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           21.11    0.00   19.32    7.66    0.00   51.91
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdc               0.00     0.00    0.00  530.60     0.00 179415.20   676.27     4.49    8.46    0.00    8.46   1.44  76.36
> sdq               0.00     0.00    0.00  484.60     0.00 156640.00   646.47     3.62    7.45    0.00    7.45   1.47  71.44
> sds               0.00     0.00    0.00  544.20     0.00 174396.00   640.93     3.92    7.21    0.00    7.21   1.41  76.76
> sdac              0.00     0.00    0.00  443.10     0.00 143540.00   647.89     2.94    6.63    0.00    6.63   1.47  65.12
> sdae              0.00     0.00    0.00  504.60     0.00 169387.60   671.37     4.21    8.34    0.00    8.34   1.46  73.76
> sdaj              0.00     0.00    0.00  478.50     0.00 155754.40   651.01     3.21    6.71    0.00    6.71   1.48  70.76
> 
> That's normal "good" usage. When things go bad everything drops to 0%.
> 
These are your journal drives as the comment suggests? Of course they will
go to 0% if they're blocked waiting for the HDDs to finish writing out
data.

Again, watch this with atop on all 3 nodes; the colorization of highly
loaded disks and other resources will help.

Also see Mark's remark in the other thread; dump_historic_ops (especially in
combination with the atop record/replay) might be quite helpful, too.
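
For example, against the OSD from your log snippet (run on the node hosting
it):

    ceph --admin-daemon /var/run/ceph/ceph-osd.77.asok dump_historic_ops

That dumps the slowest recent ops with per-event timestamps, which should
show where they spent their time.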
 
> >> My nodes:
> >> 	Ubuntu 14.04 - Linux storage3 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> >> 	ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
> >> 	2 x 6-core Xeon 2620s
> >> 	64GB RAM
> >> 	30 x 3TB Seagate ST3000DM001-1CH166
> > 
> > These are particularly nasty pieces of shit, at least depending on the
> > firmware.
> > Some models/firmware revisions will constantly do load cycles, caused by
> > an APM setting that cannot be permanently disabled, thus not only
> > exceeding the max load cycle count in a fraction of the expected
> > lifetime of these disks but also impacting performance of the drive
> > when it happens, up to the point of at least temporarily freezing them.
> > Which would nicely explain what you're seeing.
> > 
> > A "smartctl -a" output from one of these would be interesting.
> 
> Available here: http://pastebin.com/URt4eV4Z
> 
That at least is a non-fragged firmware version, if that is true for all
your 90 HDDs...
Even then, I'd be very suspicious of these drives going off to lala land
at times. ^o^

You really need to monitor all of them, all the time, during your tests to
be sure and to spot any patterns.
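
A rough way to spot-check all of them in one go (sketch only; with 38
devices per node you need both the sd? and sd?? globs):

    for d in /dev/sd? /dev/sd??; do
        echo "== $d =="
        smartctl -A $d | grep -i -e load_cycle -e start_stop
        hdparm -B $d
    done

Load_Cycle_Count climbing fast between runs would point straight at the APM
issue.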

Christian

> >> 	6 x 128GB Samsung 840 Pro SSD
> >> 	1 x Dual port Broadcom NetXtreme II 5771x/578xx 10GbE
> 
> I forgot to mention that this is all on a Supermicro board. Other
> important info from dmesg:
> 
> 	DMI: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
> 	2 x LSISAS2116: FWVersion(15.00.00.00), ChipRevision(0x02), BiosVersion(07.33.00.00)
> 	1 x LSISAS2308: FWVersion(17.00.01.00), ChipRevision(0x05), BiosVersion(07.33.00.00)
> 
> This is all on the Supermicro SC847-A
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

